The Computational Complexity of Atmospheric Data Assimilation
P. M. Lyster*
NASA Data Assimilation Office (DAO), Goddard Laboratory for Atmospheres
lys@dao.gsfc.nasa.gov
Additional affiliations:
Earth System Science Interdisciplinary Center, University of Maryland, College
Park, MD 20742
and
Department of Meteorology
July 4, 2000
The computational complexity of algorithms for Four Dimensional
Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO)
is discussed. In 4DDA, observations that are taken inhomogeneously
in space and time are assimilated into a dynamical model to generate
best-estimates of the states of the system. The DAO is developing and
using 4DDA algorithms that provide accurate, consistent, gridded
datasets, or analyses, in support of Earth System Science Research.
Two large-scale algorithms are discussed, the first, Goddard Earth Observing
System Data Assimilation System (GEOS DAS), uses an
atmospheric general circulation model (GCM) and an observation-space
based analysis system, the Physical-space Statistical Analysis System (PSAS).
GEOS DAS is very similar to global meteorological forecast data assimilation
systems, but is used at NASA for climate research. Systems of this
size typically run at between 1 and 40 gigaflop/s. A second approach,
the Kalman filter, assimilates observations sequentially with the model at
the corresponding time when they are taken. Associated with the model forecast,
the very large error covariance matrix is evolved dynamically.
For atmospheric assimilation, the gridded dynamical fields have typically
more than 106 variables, therefore the full error covariance matrix may
be in excess of a teraword, evaluated at each timestep.
This problem can easily scale to petaflop/s proportions.
In the Summary we will discuss some of the reasons that
make it difficult to develop parallel scalable versions of
the algorithms.
1 Four Dimensional Data Assimilation
Four Dimensional Data Assimilation (4DDA) is the process of combining observations with a dynamical model to generate a best estimate, or analysis, of the state of the system (Daley 1991). The numerical models that are used have both systematic and random errors. On the other hand, observations represent the closest connection with a particular realization and have unavoidable errors associated with the measuring instruments. 4DDA is used in weather forecasting to initialize model forecasts, for example, at the National Centers for Environmental Prediction (NCEP) (Parrish and Derber 1992, Parrish et al. 1997), and at the European Center for Medium-Range Weather Forecasts (ECMWF) (Courtier et al. 1998, Rabier et al. 1998, Andersson et al. 1998). 4DDA is also used to perform reanalyses of past datasets in order to obtain consistent, gridded, best estimates of the state variables of the atmosphere (e.g., height, wind, moisture ...), for example, at NASA's Data Assimilation Office (DAO) (Schubert et al. 1993, 1995), at NCEP (Kalnay et al. 1996, Kanamitsu et al. 1999, Kistler et al. 2000), and at ECMWF (Gibson et al 1997). Reanalysis data are used by scientists to research the earth system. The DAO develops and uses software for scientific research on methodologies for data assimilation, for real-time satellite mission support, and for scientific reanalysis 4DDA systems under the NASA Office of Earth Science. For climate assimilation, the variables of interest extend to non-conventional datatypes such as trace gases and new land surface data.
This paper discusses the computational complexity of two important 4DDA algorithms in use at the DAO. The first is Goddard Earth Observing System Data Assimilation System (GEOS DAS) which uses a grid-point based atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). Data over a six hour (synoptic) period are aggregated into a single input dataset and assimilated with a model forecast (Bloom et al. 1996). This process is then cycled with the model forecast and new observations to produce gridded assimilation datasets every six hours. This approach has a strong heritage from weather forecasting. In order to obtain an accurate analysis it is important to determine the appropriate balance between forecast and observations. This is achieved through the use of forecast and observation error covariance matrices Pf and R respectively. The PSAS uses modeled covariance matrices whose parameters are determined from prior statistics with appropriate simplifying assumptions such as stationarity. The second part of the paper discusses the Kalman filter which evolves the forecast error covariance matrices in a dynamically consistent manner.
The expression ``complexity'' here indicates the scaling of floating point operations for the core data assimilation systems that run on high-end computers. Where appropriate, estimates of floating point counts may be calculated and this will be pointed out in the text. The paper doesn't address the software complexity or the algorithmic and scientific validation methods. Software complexity is emerging as a key issue because of the need to build portable and extensible code, and because of the difficulties in managing large software projects - the core GEOS DAS algorithm is in excess of 100,000 lines of code and has been used and modified by about 100 staff. The reader is referred to documentation on the DAO Office-Note pages (DAO 2000), in particular to papers by Guo et al. 1998, and Larson et al. 1998. We will also not discuss in depth the end-to-end distributed heterogeneous computing system (also in excess of 100,000 lines of code) that preprocesses and postprocesses observational data and gridded data. The reader is referred to the DAO web pages and the DAO Algorithm Theoretical Basis Document (DAO 2000). Finally, centers such as NCEP and ECMWF have three and four-dimensional data assimilation systems (3DVAR and 4DVAR) in their software suites. The computational complexity of GEOS DAS is similar to that of 3DVAR, however the complexity of 4DVAR is very different and is not discussed here.
1.1 Goddard Earth Observing System Data Assimilation System (GEOS DAS)
The optimal estimate can be obtained by minimizing
the cost function
![]() |
(1) |
where
The cost function represents two penalties for
deviating from the forecast state wf and the observations wo.
In this least squares context the best estimate, or analysis, which
minimizes J is
| (2) |
| (3) |
We will complete the discussion on the complexity of the GCM in Section 2.1.
The algorithm for solving Eqn. (2) at the DAO
is the Physical-space Statistical Analysis System (PSAS)
(Cohn et al. 1998). This
solves
| (HPfHT + R) x = wo - Hwf , | (4) |
| wa - wf = PfHT x. | (5) |
The error covariance matrices
Pf and R are implemented using models for variances and
correlations whose parameters are obtained from prior
statistics and simplifying assumptions such as
stationarity (Daley 1991).
Sophisticated multivariate formulations are used
to improve the quality of the analysis.
Although this has significant impact on the software complexity
(Guo et al. 1998, Larson et al. 1998) it has a secondary impact on
the computational complexity and will not be considered here.
The resulting matrices HPfHT+R and PfHT are dense, however
correlation models with compact support (Gaspari and Cohn 1999)
are used which reduces the computational complexity.
Eqns. (4) and (5) are solved for data
that are aggregated over six hourly intervals.
Currently, the DAO is shortening this period, but
similar to the above discussion we specify a baseline GEOS DAS
to have a six hour interval between analysis times.
For atmospheric assimilation there are typically about
observations
that are made world wide in a 6 hour period.
The PSAS consists of solving one
linear system (Eqn. 4)
for the intermediate vector x using a parallel nested-preconditioned conjugate gradient solver
(Cohn et al. 1998, Golub and van Loan 1989, PSAS 1998). This has complexity
,
where
is
the number iterations of the outer loop of the solver that
are required to reduce the residual error by an order of magnitude
(the complexity of the preconditioners are neglected here) and
is the fraction of nonzero entries in the matrix (sparsity).
Eqn. (5) evaluates the analysis
increment (wa-wf), and this has complexity
.
The analysis increment is evaluated on a
level grid
and these fields are interpolated to the model GCM grid. For the baseline case this
means that the vertical coordinate systems are interpolated from 14to 70 levels. Note that because the GCM and PSAS use different resolution
grids the values of n are context dependent in the order formulae.
The baseline version of the PSAS in GEOS-2 uses multithreading (multitasking)
parallelism on Cray J series and SGI Origin 2000 computers.
A parallel message-passing implementation of the PSAS,
using the Message-Passing Interface (MPI) library, was developed by
Ding and Ferraro (1995).
This implementation has been merged with the production system and will
become operational in 2000.
1.2 The Kalman Filter
The Kalman filter assimilates observations sequentially with
the model at the corresponding time when they
are taken (Jazwinski 1970, Cohn 1977).
In this regard, it is like the PSAS with a shortened update
cycle. The Kalman filter also forecasts not only the state
but the error covariance matrix
| Pf = MPaMT+Q , | (6) |
The analysis error covariance matrix is
| Pa = (I - KH)Pf , | (7) |
2 General Memory and Operations Requirements
2.1 Goddard Earth Observing System Data Assimilation System (GEOS DAS)
GEOS DAS is a cycled algorithm where each six hours the GCM performs the equivalent of nine hours of forecast, and the PSAS performs a single analysis. The Incremental Analysis Update (IAU) algorithm (Bloom et al. 1996) couples the analysis Eqn. (6) with the GCM. Table 1 shows the complexity scaling factors for the GEOS GCM (Takacs 1998),
| Module | H-Scaling | V-Scaling | T-Scaling |
| Dynamics | XY | Z | X |
| Radiation | XY | ZZ | - |
| Convection | XY | Z | - |
| Turbulence | XY | Z | - |
Table 1: Complexity Scaling factors for multiresolution GEOS GCM.
In Table 1, X,Y, and Z correspond to horizontal and vertical scaling factors respectively. All modules scale proportional to the horizontal resolution, while there is an additional resolution/stability scaling in the dynamics due to the requirement that the timestep is reduced in proportion to the horizontal grid spacing. All modules scale linearly with vertical resolution except the radiation which has a nonlinear burden due to layer-to-layer flux calculations. The CFL stability condition requires a high-latitude filter whose logarithmic grid-point (X) dependence is ignored here. The scaling of dynamics is important because as resolution increases the proportion of wall-clock time begins to dominate. The complexity of the dynamics, for a fixed simulation time, is proportional to n4/3. Thus, if the resolution of the GCM is doubled in all three dimensions the complexity the dynamics (excluding the logarithmic contribution of the spectral filter) increases by sixteen fold. The memory requirement for the GCM scales as n - thus the memory requirement in general scales less rapidly than the complexity requirement.
The complexity of the PSAS has two components,
and
,
where the number of gridpoints in the
baseline system is
and the number of observations is
as
described in Section 1. The first term arises
for the conjugate-gradient solver (Eqn. 4) and the second term is
from the analysis (Eqn. 5).
is the number of iterations of the solver
which is chosen so that the residual is reduced by an order of
magnitude, and the parameter
accounts for sparsity. If the
innovation matrix is precalculated, as in the parallel (MPI) algorithm of Ding
and Ferraro (1995) then the memory requirement due to the innovation
matrix is
.
The baseline version of the PSAS in GEOS-2
re-calculates the matrix elements during the conjugate
gradient iteration cycle thus reducing the overall
memory requirement and allowing for scalability
to larger numbers of observations beyond the current values
(Guo et al. 1998, Larson et al. 1998).
Table 2 shows the percentage of time taken by the modules of the baseline GEOS-2 DAS. Note that the time taken for the diagnostics includes the time to write data to disk.
| GEOS-2 DAS Module | Percentage of Wall Clock time |
| 8 Processors of an SGI Origin 2000 | |
| GCM | 45. |
| PSAS | 39. |
| Interface | 2.5 |
| Diagnostics | 13.5 |
Table 2: The percentage of time taken by the modules of parallel multitasking GEOS-2 DAS. Runs performed on 8 processors of an SGI Origin 2000.
The expressions
and
for the
complexity of the PSAS can be very approximately checked by
taking the following values, p = 105, n = 106,
,
s = 0.25. Summing the expressions
(assuming approximately equal weight to each) gives
floating point operations per analysis.
This compares with a value of
floating point
multiplies and
floating point additions
for the total complexity of GEOS DAS (including the GCM,
interface calculations, and the PSAS) per analysis as calculated from
the Hardware Performance Monitors on the Cray J916.
The estimate
for the PSAS is low
but a good order of magnitude -
Table 2 indicates that
is more like the actual number
of flops per analysis for the baseline resolution PSAS.
Table 3 shows the percentage of time taken by the submodules of the GCM Module of GEOS-2. The GCM is run in ``assimilation mode'' using the Matsuno timestepping scheme. The shapiro filter is a spatial local filter. The physics modules includes radiation, convection, and turbulence. Rotate and c_to_a are grid transformations (Sawyer and Wang 1999).
| GCM Module | Percentage of Wall Clock time | |
| 8 Processors of SGI Origin 2000 | ||
| dycore | 15. | |
| shapiro filter | 5. | |
| step | 25. | |
| physics | 40. | |
| rotate | 10. | |
| c_to_a | 5. |
Table 3: The percentage of time taken by the submodules of the GCM (vc6.5)
The percentage of time taken by the submodules of
the PSAS Module of GEOS-2 is shown in Table 4.
For the baseline GEOS-2, the solver (Eqn. 4) with complexity
takes about 35% of the time
while the analysis (Eqn. 5) with complexity
takes 62% of the time.
| PSAS Module | Percentage of Wall Clock time |
| 8 Processors of an SGI Origin 2000 | |
| Solver (Eqn. 3) | 35 |
| Analysis (Eqn. 4) | 62 |
| Utilities | 3 |
Table 4: The percentage of time taken by the submodules of the PSAS Module of GEOS-2.
The Core GEOS DAS is run in a number of production modes (Stobie 1996).
These may be generally categorized as real-time and reanalysis modes.
Real-time requires model forecast and analyses to take place sufficiently
in excess of one day of assimilation per wall-clock day so that
the results may be studied and disseminated to customers such
as satellite instrument teams with real-time needs. Reanalyses
are multi-year studies designed to provide long-term datasets from
a frozen scientific software configuration. For example, the DAO has
completed a reanalysis for the years 1979 to 1995 using the GEOS-1
version of GEOS DAS (Schubert et al. 1993, 1995).
Appendix A summarizes the baseline GEOS-2 DAS system performance and throughput.
GEOS-2 is multitasking parallel runs on Cray J90/C90
and SGI Origin 2000 series computers. During 2000 the
DAO will switch to MPI/shmem parallel GEOS-3.
A lot of the work in support of this effort can be
viewed on the DAO web page
http://dao.gsfc.nasa.gov
as well as the author's personal page
http://dao.gsfc.nasa.gov/DAO_people/lys.
It is worthwhile discussing the data that are used for current
and near-term atmospheric assimilation and the gridded products that
are produced and stored. There are about 2 billion useful meteorological
observations going back 50 years. At present the volume of this data does
not present the greatest computational complexity, and operational centers
are more concerned with the accuracy of these data. Flat files are the
rule, and object-oriented databases are a thing of the future.
Considerable energy is devoted to finding and validating old
observations, i.e., ``data rehabilitation''.
This should change in the next 5 years. For example,
satellite sea-surface wind observations have been shown to be useful in
increasing forecast accuracy of weather analyses. The DAO will
also assimilate increasing amount of non-meteorological data, such
as trace gases, using data from Earth Observing System (EOS).
Currently about 105 observations are produced daily under the
World Weather Watch and transmitted to weather centers and the DAO via the
Global Telecommunications System.
More than 50% of these are produced by satellites, mostly
in the form of temperature retrievals.
At baseline resolution for the DAO General Circulation Model
(2o
2.5o
70 levels)
a day of assimilation produces in excess of 1 gigabyte of data.
Hence data assimilation at real time (one day of assimilation per
wall-clock day) does not stretch the local disk capacity or bandwidth
of most modern computer systems. However, extended runs at higher throughput
than real time creates problems for storage and data processing.
The most severe is for reanalysis projects where
multi-year datasets are analyzed by a fixed Data Assimilation System
and the products are made available to the scientific community.
The standard benchmark is a rate of 30 days of assimilation per
day of wall-clock time (i.e., a fifteen year reanalysis on order
half a year). At 30 days per wall-clock day the DAO Data Assimilation
System produces about 10 terabytes of data per year.
2.2 Kalman Filter
The complexity of the Kalman filter is
,
where
is the number of model timesteps in a fixed interval, and
where
takes account of the size of the
template for finite-difference advection scheme as implemented
by Lyster et al. 1997. Note that this scales quadratically
in the size n of the state space, compared with the advection model
itself which is approximately linear. The memory is
.
The floating point scaling
of the
PSAS refers to p observations
that are aggregated over 6 hourly (synoptic) time intervals. If this
interval is shortened to the timestep of the model, currently
minute, then the cost of the algorithm per timestep
scales more favorably. In this limit, for a fixed time interval,
the computational cost of the PSAS will scale as
.
A two-dimensional (latitude-longitude) Kalman filter for
the assimilation of stratospheric chemical constituents
has been developed by Lyster et al. (1997), and is being used
for scientific study of stratospheric constituent gases
(Ménard et al. 2000a,b).
The model is advective transport (Lin and Rood 1996)
using prescribed winds with
2o
2.5o resolution
(
),
model timestep 15 minutes, and
observations
per timestep. The Kalman filter achieves 150 days of
assimilation per wall-clock day, or 4.1 sustained gigaflop/s,
on 128 processors of the SGI/Cray T3E-600 at NASA Goddard Space
Flight Center. This
was used for the assimilation of retrieved methane from the
CLAES instrument aboard NASA's Upper Atmosphere Research Satellite.
A full Kalman filter based on a tangent-linear three-dimensional GCM
would require considerably more resources. The memory to store the
covariance matrices would be approximately
words at the baseline resolution of
2o
2.5o
70 levels, and the algorithm would
scale to approximately n
250 megaflop/s = 0.25 petaflop/s
(the value 250 megaflop/s is taken from the baseline GEOS-2 DAS
in Appendix A).
3 Summary
Two algorithms for four dimensional data assimilation and their complexities (in terms of the scaling of the floating point operations and the memory requirements) are discussed. The operational Goddard Earth Observing System Data Assimilation System (GEOS DAS) that is used at Goddard Space Flight Center Data Assimilation Office was described. This has diminished complexity compared with the full Kalman filter, but is still a substantial supercomputing problem. The Kalman filter is the most difficult problem and considerable research needs to be done on both the scientific and computing aspects. This represents a problem so large that the model may be considered as a small module, potentially not in need of parallelizing (Lyster et al. 1997). Current research is developing approximate Kalman filters (DAO 2000) whose complexities are more feasible given the computing which will be available for operational data assimilation in the coming years.
For the baseline resolution PSAS the solver (Eqn. 4) with complexity
floating point operations
takes about 35% of the time
while the analysis (Eqn. 5) with complexity
takes 62% of the time.
The value p
105 is the aggregated number
of observations over a 6 hourly interval.
The complexity of the Kalman filter scales approximately as
where p is the number of
observations at each model timestep. A full three-dimensional
filter based on a General Circulation Model will
require petaflop/s (1015 floating point calculations per second)
computing.
There is increasing awareness that Earth Science algorithms such as GEOS DAS are difficult to optimize for parallel scalability and performance. This is particularly topical as programs such as the DOE's Accelerated Strategic Computing Initiative (ASCI) Program are building computers with thousands of closely coupled processors. The DAO customarily runs the parallel (multitasking) operational GEOS-2 software on platforms using between 16 and 64 processors. Certainly much optimization has been done, such as has been promoted by the High Performance Computing and Communications (HPCC) Program. In order to build scalable code, algorithms that carefully manage memory and tasks using, for example, the Message-Passing Interface (MPI) library are important. The new version of GEOS DAS uses MPI and shmem libraries. Several general statements can be made about scalability, some of this has been learned through the HPCC effort at the DAO:
The baseline operational
algorithm GEOS-2 is multitasking parallel, run on Cray J90/C90
and SGI Origin 2000 series computers. During 2000 the
DAO will switch to MPI/shmem parallel GEOS-3.
A lot of the work in support of this effort can be
viewed on the DAO web page
http://dao.gsfc.nasa.gov
as well as the author's personal page
http://dao.gsfc.nasa.gov/DAO_people/lys.
4 Acknowledgments
The author acknowledges discussions with Anthony Colyandro, Chris Ding, Gregory Gaspari, Jing Guo, Dan Kokron, Jay Larson, William Sawyer, Harper Pryor, Richard Rood, Lawrence Takacs, Ricardo Todling, and Philip Webster. This work was funded under the NASA High Performance Computing and Communications Initiative (HPCC) Earth and Space Science (ESS) Program Cooperative Agreement Number NCCS5-150, the NASA EOS Interdisciplinary Science Program, and the NASA Research and Applications Program.
Appendix A: GEOS DAS System Performance and Throughput Requirements
| Baseline GEOS-2: 2o |
|
| 5 days/wallclock day throughput | |
| This corresponds to the baseline operational code | |
| Multitasking GEOS-2 DAS run on 8pe Origin 2000 | |
| Main Memory (GB) | 2.2 (per image) |
| Disk (GB) | 8.0 |
| Mass Storage (GB) | 2300.0 (this is output per year) |
| Volume of Data (GB) | 6.2 (produced per day per image) |
| Gigaflop/s sustained | 0.25 (per image) |
| Duration of Run | 5 days/WCday (continuous operation, single image) |
Appendix B: Surface-To-Volume Limitations on the Scalability of Domain-Decomposed Gridpoint Models
The following calculation is an illustration of the so called surface-to-volume effect for describing the limits to scalability of domain-decomposed gridpoint models. We consider the scalability of a canonical transport algorithm. This is only an illustration in the sense that it does not account for a number of the typical complications that often occur in General Circulation Models, viz:
Define the following symbols:
Ng = Total number of grid points in the domain
d = Dimension of the physical problem
D = Dimension of the parallel decomposition
Np = Total number of processors
M = single processor speed in megaflop/sec
B = interprocessor communication bandwidth in megabytes/sec
F = Number of flops/gridpoint/timestep for the relevant transport
algorithm
G = The number of ``layers'' of guard cells in each dimension of
the parallel decomposition (e.g., G=2 for fourth order finite
difference, not sure what G is for Lin-Rood)
P = The precision of the calculation (i.e., P = 4, or 8)
Typically D=1,2, or 3, and d=2, or 3, while
.
For instance,
the physical problem is three dimensional (i.e, d = 3),
however, we sometimes discuss
scalability in terms of the number of gridpoints
in the latitude-longitude domain (i.e, horizontal transport), which
is the d = 2 case. We will not use the exact number of
gridpoints as in the current DAO models, but
by way of illustration use simple geometric expressions, e.g.,
a two dimensional problem (d = 2) with a checkerboard decomposition
(D = 2), then the number of gridpoints in each dimension in each processor is
.
The communication time per timestep per processor is:
| Tcomm = 2DGPB-1Ng(d-1)/d/Np(D-1)/D . | (8) |
| Tcpu = (F/M)(Ng/Np) . | (9) |
![]() |
(10) |
| (11) |
The terms in
may be characterized as follows:
For typical parameters of problems such as the baseline code we are
currently studying:
,
,
,
G = 2, P = 8, d = D = 2,
Ng = 12960 (e.g.,
these are appropriate for
resolution horizontal transport),
and
Np = 1024 processors, we get
,
i.e., the scalability is limited to
processors.
6 References
http://sunset.usc.edu/COCOMOII/suite.html
http://dao.gsfc.nasa.gov/subpages/atbd.html
http://dao.gsfc.nasa.gov/subpages/office-notes.html
http://dao.gsfc.nasa.gov/subpages/office-notes.html
http://http://dao.gsfc.nasa.gov/DAO_people/lys
http://dao.gsfc.nasa.gov/subpages/office-notes.html
http://dao.gsfc.nasa.gov/subpages/tech-reports.html
http://dao.gsfc.nasa.gov/subpages/office-notes.html
http://dao.gsfc.nasa.gov/subpages/office-notes.html
http://dao.gsfc.nasa.gov/subpages/tech-reports.html