The Computational Complexity of Atmospheric Data Assimilation




P. M. Lyster*
NASA Data Assimilation Office (DAO), Goddard Laboratory for Atmospheres
lys@dao.gsfc.nasa.gov



Additional affiliations:
$\ast$Earth System Science Interdisciplinary Center, University of Maryland, College Park, MD 20742
and
Department of Meteorology

July 4, 2000

Abstract

The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations that are taken inhomogeneously in space and time are assimilated into a dynamical model to generate best-estimates of the states of the system. The DAO is developing and using 4DDA algorithms that provide accurate, consistent, gridded datasets, or analyses, in support of Earth System Science Research. Two large-scale algorithms are discussed, the first, Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological forecast data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 40 gigaflop/s. A second approach, the Kalman filter, assimilates observations sequentially with the model at the corresponding time when they are taken. Associated with the model forecast, the very large error covariance matrix is evolved dynamically. For atmospheric assimilation, the gridded dynamical fields have typically more than 106 variables, therefore the full error covariance matrix may be in excess of a teraword, evaluated at each timestep. This problem can easily scale to petaflop/s proportions. In the Summary we will discuss some of the reasons that make it difficult to develop parallel scalable versions of the algorithms.

1 Four Dimensional Data Assimilation

Four Dimensional Data Assimilation (4DDA) is the process of combining observations with a dynamical model to generate a best estimate, or analysis, of the state of the system (Daley 1991). The numerical models that are used have both systematic and random errors. On the other hand, observations represent the closest connection with a particular realization and have unavoidable errors associated with the measuring instruments. 4DDA is used in weather forecasting to initialize model forecasts, for example, at the National Centers for Environmental Prediction (NCEP) (Parrish and Derber 1992, Parrish et al. 1997), and at the European Center for Medium-Range Weather Forecasts (ECMWF) (Courtier et al. 1998, Rabier et al. 1998, Andersson et al. 1998). 4DDA is also used to perform reanalyses of past datasets in order to obtain consistent, gridded, best estimates of the state variables of the atmosphere (e.g., height, wind, moisture ...), for example, at NASA's Data Assimilation Office (DAO) (Schubert et al. 1993, 1995), at NCEP (Kalnay et al. 1996, Kanamitsu et al. 1999, Kistler et al. 2000), and at ECMWF (Gibson et al 1997). Reanalysis data are used by scientists to research the earth system. The DAO develops and uses software for scientific research on methodologies for data assimilation, for real-time satellite mission support, and for scientific reanalysis 4DDA systems under the NASA Office of Earth Science. For climate assimilation, the variables of interest extend to non-conventional datatypes such as trace gases and new land surface data.

This paper discusses the computational complexity of two important 4DDA algorithms in use at the DAO. The first is Goddard Earth Observing System Data Assimilation System (GEOS DAS) which uses a grid-point based atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). Data over a six hour (synoptic) period are aggregated into a single input dataset and assimilated with a model forecast (Bloom et al. 1996). This process is then cycled with the model forecast and new observations to produce gridded assimilation datasets every six hours. This approach has a strong heritage from weather forecasting. In order to obtain an accurate analysis it is important to determine the appropriate balance between forecast and observations. This is achieved through the use of forecast and observation error covariance matrices Pf and R respectively. The PSAS uses modeled covariance matrices whose parameters are determined from prior statistics with appropriate simplifying assumptions such as stationarity. The second part of the paper discusses the Kalman filter which evolves the forecast error covariance matrices in a dynamically consistent manner.

The expression ``complexity'' here indicates the scaling of floating point operations for the core data assimilation systems that run on high-end computers. Where appropriate, estimates of floating point counts may be calculated and this will be pointed out in the text. The paper doesn't address the software complexity or the algorithmic and scientific validation methods. Software complexity is emerging as a key issue because of the need to build portable and extensible code, and because of the difficulties in managing large software projects - the core GEOS DAS algorithm is in excess of 100,000 lines of code and has been used and modified by about 100 staff. The reader is referred to documentation on the DAO Office-Note pages (DAO 2000), in particular to papers by Guo et al. 1998, and Larson et al. 1998. We will also not discuss in depth the end-to-end distributed heterogeneous computing system (also in excess of 100,000 lines of code) that preprocesses and postprocesses observational data and gridded data. The reader is referred to the DAO web pages and the DAO Algorithm Theoretical Basis Document (DAO 2000). Finally, centers such as NCEP and ECMWF have three and four-dimensional data assimilation systems (3DVAR and 4DVAR) in their software suites. The computational complexity of GEOS DAS is similar to that of 3DVAR, however the complexity of 4DVAR is very different and is not discussed here.

1.1 Goddard Earth Observing System Data Assimilation System (GEOS DAS)

The optimal estimate can be obtained by minimizing the cost function

$\displaystyle J(w) = \frac{1}{2}[{(w^f-w)}^T{(P^f)}^{-1}(w^f-w) +
{(w^o-Hw)}^TR^{-1}(w^o-Hw)]$     (1)

where

The cost function represents two penalties for deviating from the forecast state wf and the observations wo. In this least squares context the best estimate, or analysis, which minimizes J is

\begin{displaymath}w^a\ =\ w^f + K(w^o - Hw^f)\ ,
\end{displaymath} (2)

where the Kalman gain is

\begin{displaymath}K\ =\ P^fH^T{(HP^fH^T+R)}^{-1}\ .
\end{displaymath} (3)

Although the operational algorithms at weather centers and laboratories (Daley 1991) have more constraints and attributes than the simple form of Eqns. (1), (2), and (3), these equations are sufficient for the present discussion on computational complexity. The algorithm at NASA's DAO is the Goddard Earth Observing System Data Assimilation System (GEOS DAS). The forecast fields wf are generated by an atmospheric general circulation model (GCM). For the current operational (GEOS-3) GEOS DAS the GCM (Takacs et al. 1994) comprises a spatial fourth-order-accurate finite-difference dynamical core, and physics modules for moist processes, short and long wave radiation, turbulence, and land-surface processes. A high-latitude spectral filter is used to suppress the CFL timestepping instability. In addition, a global Shapiro filter and polar rotation algorithm provide further smoothing and numerical stability. The state variables for the GCM are horizontal winds, potential temperature, and surface pressure. An extensive land-surface model, with associated prognostic variables has also been implemented in the GCM. This paper will use a baseline model resolution of 2o longitude 2.5o latitude and 70 vertical levels. This corresponds to three-dimensional fields with horizontal resolution 91 gridpoints in latitude and 144 gridpoints in longitude. Currently, the GCM is run at $1^o \times 1^o \times 48$ resolution. The actual resolution is not critical to this paper which discusses scaling properties starting from the baseline resolution of the GEOS DAS. The number of state variables in the baseline resolution is approximately $n \approx 3 \times 91
\times 144 \times 70$ $+ 91 \times 144$ $\approx 2.6 \times 10^6$, corresponding to the 3 upper-air field arrays and 1 surface field array, although in practice up to 14 upper-air field arrays are carried by the algorithm. Note that this is not the same model as the finite-volume fvGCM that is being developed for the new GEOS DAS at the DAO. Some general quantities, such as asymptotic scalability, may be similar but specific values of quantities like the model timestep or wall-clock time of runs are different. These differences will not be discussed in this paper.

We will complete the discussion on the complexity of the GCM in Section 2.1. The algorithm for solving Eqn. (2) at the DAO is the Physical-space Statistical Analysis System (PSAS) (Cohn et al. 1998). This solves

(HPfHT + R) x = wo - Hwf , (4)

and

wa - wf = PfHT x. (5)

The error covariance matrices Pf and R are implemented using models for variances and correlations whose parameters are obtained from prior statistics and simplifying assumptions such as stationarity (Daley 1991). Sophisticated multivariate formulations are used to improve the quality of the analysis. Although this has significant impact on the software complexity (Guo et al. 1998, Larson et al. 1998) it has a secondary impact on the computational complexity and will not be considered here. The resulting matrices HPfHT+R and PfHT are dense, however correlation models with compact support (Gaspari and Cohn 1999) are used which reduces the computational complexity. Eqns. (4) and (5) are solved for data that are aggregated over six hourly intervals. Currently, the DAO is shortening this period, but similar to the above discussion we specify a baseline GEOS DAS to have a six hour interval between analysis times. For atmospheric assimilation there are typically about $p\sim10^5$ observations that are made world wide in a 6 hour period. The PSAS consists of solving one $p \times p$ linear system (Eqn. 4) for the intermediate vector x using a parallel nested-preconditioned conjugate gradient solver (Cohn et al. 1998, Golub and van Loan 1989, PSAS 1998). This has complexity ${\mathcal{O}(\mathcal{N}_isp^2)}$, where $\mathcal{N}_i \approx 10$ is the number iterations of the outer loop of the solver that are required to reduce the residual error by an order of magnitude (the complexity of the preconditioners are neglected here) and $s \approx 0.25$ is the fraction of nonzero entries in the matrix (sparsity). Eqn. (5) evaluates the analysis increment (wa-wf), and this has complexity ${\mathcal{O}(nsp)}$. The analysis increment is evaluated on a $2.5^{o} \times 2^{o} \times 14$ level grid and these fields are interpolated to the model GCM grid. For the baseline case this means that the vertical coordinate systems are interpolated from 14to 70 levels. Note that because the GCM and PSAS use different resolution grids the values of n are context dependent in the order formulae. The baseline version of the PSAS in GEOS-2 uses multithreading (multitasking) parallelism on Cray J series and SGI Origin 2000 computers. A parallel message-passing implementation of the PSAS, using the Message-Passing Interface (MPI) library, was developed by Ding and Ferraro (1995). This implementation has been merged with the production system and will become operational in 2000.

1.2 The Kalman Filter

The Kalman filter assimilates observations sequentially with the model at the corresponding time when they are taken (Jazwinski 1970, Cohn 1977). In this regard, it is like the PSAS with a shortened update cycle. The Kalman filter also forecasts not only the state but the error covariance matrix

Pf = MPaMT+Q , (6)

where M is the linearized model operator and Q is the model error covariance matrix. For parameters associated with large-scale three-dimensional atmospheric assimilation, Pf,a require in excess of a teraword of storage and Eqn. (6) scales to petaflop/s. The full problem has never been address directly, however the algorithm of Lyster et al. (1997) represents a restriction to advection in two horizontal dimensions - the code was developed to perform stratospheric constituent assimilation. The complexity of Eqn. (6) is ${\mathcal{O}(\mathcal{N}_thn^2)}$, where $\mathcal{N}_t$ is the number of timesteps in an interval of interest, $h \approx 10 - 100$ takes account of the size of the finite difference template, and for the two-dimensional problem $n \approx 10^4$.

The analysis error covariance matrix is

Pa = (I - KH)Pf , (7)

where K is the Kalman gain and it appears in Eqn. (2) as a matrix of weights for the innovation vector wo-Hwf. Evaluating the Kalman gain using a direct solver takes ${\mathcal{O}(p^3)}$ operations while the complexity of a conjugate gradient solver was discussed in Section 1.1. The complexity of Eqn. (7) is ${\mathcal{O}(\mathcal{N}_tpn^2)}$, where p is the number of observations in a model timestep. For the PSAS, observations are aggregated over a six hourly interval. Hence for the equivalent problem the value of p for the Kalman filter is smaller than for the PSAS by the number of timesteps in 6 hours. At baseline resolution for the GCM ( $2^o \times 2.5^o$$\times 70$ layers) the timestep is 3 minutes, so p is 120 times smaller than for the PSAS. Only small experiments (e.g., p < 103) could afford to evaluate K directly. A Kalman filter or an approximate Kalman filter for a large-scale multivariate meteorological system would have to use a PSAS-like solver. The memory burden associated with the error covariance matrices Pf, Pa, is n2, and HPfHT+R is p2.

2 General Memory and Operations Requirements

2.1 Goddard Earth Observing System Data Assimilation System (GEOS DAS)

GEOS DAS is a cycled algorithm where each six hours the GCM performs the equivalent of nine hours of forecast, and the PSAS performs a single analysis. The Incremental Analysis Update (IAU) algorithm (Bloom et al. 1996) couples the analysis Eqn. (6) with the GCM. Table 1 shows the complexity scaling factors for the GEOS GCM (Takacs 1998),

Module H-Scaling V-Scaling T-Scaling
Dynamics XY Z X
Radiation XY ZZ -
Convection XY Z -
Turbulence XY Z -

Table 1: Complexity Scaling factors for multiresolution GEOS GCM.

In Table 1, X,Y, and Z correspond to horizontal and vertical scaling factors respectively. All modules scale proportional to the horizontal resolution, while there is an additional resolution/stability scaling in the dynamics due to the requirement that the timestep is reduced in proportion to the horizontal grid spacing. All modules scale linearly with vertical resolution except the radiation which has a nonlinear burden due to layer-to-layer flux calculations. The CFL stability condition requires a high-latitude filter whose logarithmic grid-point (X) dependence is ignored here. The scaling of dynamics is important because as resolution increases the proportion of wall-clock time begins to dominate. The complexity of the dynamics, for a fixed simulation time, is proportional to n4/3. Thus, if the resolution of the GCM is doubled in all three dimensions the complexity the dynamics (excluding the logarithmic contribution of the spectral filter) increases by sixteen fold. The memory requirement for the GCM scales as n - thus the memory requirement in general scales less rapidly than the complexity requirement.

The complexity of the PSAS has two components, ${\mathcal{O}(\mathcal{N}_isp^2)}$ and ${\mathcal{O}(nsp)}$, where the number of gridpoints in the baseline system is $n \approx 10^6$ and the number of observations is $p \approx 10^5$ as described in Section 1. The first term arises for the conjugate-gradient solver (Eqn. 4) and the second term is from the analysis (Eqn. 5). $\mathcal{N}_i \approx 10$ is the number of iterations of the solver which is chosen so that the residual is reduced by an order of magnitude, and the parameter $s \approx 0.25$ accounts for sparsity. If the innovation matrix is precalculated, as in the parallel (MPI) algorithm of Ding and Ferraro (1995) then the memory requirement due to the innovation matrix is $\mathcal{O}(sp^2)$. The baseline version of the PSAS in GEOS-2 re-calculates the matrix elements during the conjugate gradient iteration cycle thus reducing the overall memory requirement and allowing for scalability to larger numbers of observations beyond the current values (Guo et al. 1998, Larson et al. 1998).

Table 2 shows the percentage of time taken by the modules of the baseline GEOS-2 DAS. Note that the time taken for the diagnostics includes the time to write data to disk.

GEOS-2 DAS Module Percentage of Wall Clock time
  8 Processors of an SGI Origin 2000
GCM 45.
PSAS 39.
Interface 2.5
Diagnostics 13.5

Table 2: The percentage of time taken by the modules of parallel multitasking GEOS-2 DAS. Runs performed on 8 processors of an SGI Origin 2000.

The expressions ${\mathcal{O}(\mathcal{N}_isp^2)}$ and ${\mathcal{O}(nsp)}$ for the complexity of the PSAS can be very approximately checked by taking the following values, p = 105, n = 106, $\mathcal{N}_i = 10$, s = 0.25. Summing the expressions (assuming approximately equal weight to each) gives $1.25 \times 10^{11}$ floating point operations per analysis. This compares with a value of $5 \times 10^{11}$ floating point multiplies and $4.5 \times 10^{11}$ floating point additions for the total complexity of GEOS DAS (including the GCM, interface calculations, and the PSAS) per analysis as calculated from the Hardware Performance Monitors on the Cray J916. The estimate $1.25 \times 10^{11}$ for the PSAS is low but a good order of magnitude - Table 2 indicates that $39/100 \times 9.5 \times 10^{11}$ $\approx 3.7 \times 10^{11}$ is more like the actual number of flops per analysis for the baseline resolution PSAS.

Table 3 shows the percentage of time taken by the submodules of the GCM Module of GEOS-2. The GCM is run in ``assimilation mode'' using the Matsuno timestepping scheme. The shapiro filter is a spatial local filter. The physics modules includes radiation, convection, and turbulence. Rotate and c_to_a are grid transformations (Sawyer and Wang 1999).

GCM Module Percentage of Wall Clock time  
  8 Processors of SGI Origin 2000  
dycore 15.  
shapiro filter 5.  
step 25.  
physics 40.  
rotate 10.  
c_to_a 5.  

Table 3: The percentage of time taken by the submodules of the GCM (vc6.5)

The percentage of time taken by the submodules of the PSAS Module of GEOS-2 is shown in Table 4. For the baseline GEOS-2, the solver (Eqn. 4) with complexity ${\mathcal{O}(\mathcal{N}_isp^2)}$ takes about 35% of the time while the analysis (Eqn. 5) with complexity ${\mathcal{O}(nsp)}$ takes 62% of the time.

PSAS Module Percentage of Wall Clock time
  8 Processors of an SGI Origin 2000
Solver (Eqn. 3) 35
Analysis (Eqn. 4) 62
Utilities 3

Table 4: The percentage of time taken by the submodules of the PSAS Module of GEOS-2.

The Core GEOS DAS is run in a number of production modes (Stobie 1996). These may be generally categorized as real-time and reanalysis modes. Real-time requires model forecast and analyses to take place sufficiently in excess of one day of assimilation per wall-clock day so that the results may be studied and disseminated to customers such as satellite instrument teams with real-time needs. Reanalyses are multi-year studies designed to provide long-term datasets from a frozen scientific software configuration. For example, the DAO has completed a reanalysis for the years 1979 to 1995 using the GEOS-1 version of GEOS DAS (Schubert et al. 1993, 1995). Appendix A summarizes the baseline GEOS-2 DAS system performance and throughput. GEOS-2 is multitasking parallel runs on Cray J90/C90 and SGI Origin 2000 series computers. During 2000 the DAO will switch to MPI/shmem parallel GEOS-3. A lot of the work in support of this effort can be viewed on the DAO web page http://dao.gsfc.nasa.gov as well as the author's personal page http://dao.gsfc.nasa.gov/DAO_people/lys.

It is worthwhile discussing the data that are used for current and near-term atmospheric assimilation and the gridded products that are produced and stored. There are about 2 billion useful meteorological observations going back 50 years. At present the volume of this data does not present the greatest computational complexity, and operational centers are more concerned with the accuracy of these data. Flat files are the rule, and object-oriented databases are a thing of the future. Considerable energy is devoted to finding and validating old observations, i.e., ``data rehabilitation''. This should change in the next 5 years. For example, satellite sea-surface wind observations have been shown to be useful in increasing forecast accuracy of weather analyses. The DAO will also assimilate increasing amount of non-meteorological data, such as trace gases, using data from Earth Observing System (EOS). Currently about 105 observations are produced daily under the World Weather Watch and transmitted to weather centers and the DAO via the Global Telecommunications System. More than 50% of these are produced by satellites, mostly in the form of temperature retrievals. At baseline resolution for the DAO General Circulation Model (2o $\times$ 2.5o $\times$ 70 levels) a day of assimilation produces in excess of 1 gigabyte of data. Hence data assimilation at real time (one day of assimilation per wall-clock day) does not stretch the local disk capacity or bandwidth of most modern computer systems. However, extended runs at higher throughput than real time creates problems for storage and data processing. The most severe is for reanalysis projects where multi-year datasets are analyzed by a fixed Data Assimilation System and the products are made available to the scientific community. The standard benchmark is a rate of 30 days of assimilation per day of wall-clock time (i.e., a fifteen year reanalysis on order half a year). At 30 days per wall-clock day the DAO Data Assimilation System produces about 10 terabytes of data per year.

2.2 Kalman Filter

The complexity of the Kalman filter is $\mathcal{O}(\mathcal{N}_t(hn^2+pn^2))$, where $\mathcal{N}_t$is the number of model timesteps in a fixed interval, and where $h \approx 10 - 100$ takes account of the size of the template for finite-difference advection scheme as implemented by Lyster et al. 1997. Note that this scales quadratically in the size n of the state space, compared with the advection model itself which is approximately linear. The memory is $\mathcal{O}(n^2)$. The floating point scaling ${\mathcal{O}(\mathcal{N}_isp^2)}$ of the PSAS refers to p observations that are aggregated over 6 hourly (synoptic) time intervals. If this interval is shortened to the timestep of the model, currently $\mathcal{O}(1)$ minute, then the cost of the algorithm per timestep scales more favorably. In this limit, for a fixed time interval, the computational cost of the PSAS will scale as ${\mathcal{O}(nsp)}$.

A two-dimensional (latitude-longitude) Kalman filter for the assimilation of stratospheric chemical constituents has been developed by Lyster et al. (1997), and is being used for scientific study of stratospheric constituent gases (Ménard et al. 2000a,b). The model is advective transport (Lin and Rood 1996) using prescribed winds with 2o $\times$ 2.5o resolution ( $n = 91 \times 144 = 13104$), model timestep 15 minutes, and $p \approx 15$ observations per timestep. The Kalman filter achieves 150 days of assimilation per wall-clock day, or 4.1 sustained gigaflop/s, on 128 processors of the SGI/Cray T3E-600 at NASA Goddard Space Flight Center. This was used for the assimilation of retrieved methane from the CLAES instrument aboard NASA's Upper Atmosphere Research Satellite. A full Kalman filter based on a tangent-linear three-dimensional GCM would require considerably more resources. The memory to store the covariance matrices would be approximately $n^2 \approx 6.8 \times 10^{12}$ words at the baseline resolution of 2o $\times$ 2.5o $\times$ 70 levels, and the algorithm would scale to approximately n $\times$ 250 megaflop/s = 0.25 petaflop/s (the value 250 megaflop/s is taken from the baseline GEOS-2 DAS in Appendix A).

3 Summary

Two algorithms for four dimensional data assimilation and their complexities (in terms of the scaling of the floating point operations and the memory requirements) are discussed. The operational Goddard Earth Observing System Data Assimilation System (GEOS DAS) that is used at Goddard Space Flight Center Data Assimilation Office was described. This has diminished complexity compared with the full Kalman filter, but is still a substantial supercomputing problem. The Kalman filter is the most difficult problem and considerable research needs to be done on both the scientific and computing aspects. This represents a problem so large that the model may be considered as a small module, potentially not in need of parallelizing (Lyster et al. 1997). Current research is developing approximate Kalman filters (DAO 2000) whose complexities are more feasible given the computing which will be available for operational data assimilation in the coming years.

For the baseline resolution PSAS the solver (Eqn. 4) with complexity ${\mathcal{O}(\mathcal{N}_isp^2)}$ floating point operations takes about 35% of the time while the analysis (Eqn. 5) with complexity ${\mathcal{O}(nsp)}$ takes 62% of the time. The value p $\approx$ 105 is the aggregated number of observations over a 6 hourly interval. The complexity of the Kalman filter scales approximately as $\mathcal{O}(pn^2)$ where p is the number of observations at each model timestep. A full three-dimensional filter based on a General Circulation Model will require petaflop/s (1015 floating point calculations per second) computing.

There is increasing awareness that Earth Science algorithms such as GEOS DAS are difficult to optimize for parallel scalability and performance. This is particularly topical as programs such as the DOE's Accelerated Strategic Computing Initiative (ASCI) Program are building computers with thousands of closely coupled processors. The DAO customarily runs the parallel (multitasking) operational GEOS-2 software on platforms using between 16 and 64 processors. Certainly much optimization has been done, such as has been promoted by the High Performance Computing and Communications (HPCC) Program. In order to build scalable code, algorithms that carefully manage memory and tasks using, for example, the Message-Passing Interface (MPI) library are important. The new version of GEOS DAS uses MPI and shmem libraries. Several general statements can be made about scalability, some of this has been learned through the HPCC effort at the DAO:

The baseline operational algorithm GEOS-2 is multitasking parallel, run on Cray J90/C90 and SGI Origin 2000 series computers. During 2000 the DAO will switch to MPI/shmem parallel GEOS-3. A lot of the work in support of this effort can be viewed on the DAO web page http://dao.gsfc.nasa.gov as well as the author's personal page http://dao.gsfc.nasa.gov/DAO_people/lys.

4 Acknowledgments

The author acknowledges discussions with Anthony Colyandro, Chris Ding, Gregory Gaspari, Jing Guo, Dan Kokron, Jay Larson, William Sawyer, Harper Pryor, Richard Rood, Lawrence Takacs, Ricardo Todling, and Philip Webster. This work was funded under the NASA High Performance Computing and Communications Initiative (HPCC) Earth and Space Science (ESS) Program Cooperative Agreement Number NCCS5-150, the NASA EOS Interdisciplinary Science Program, and the NASA Research and Applications Program.

Appendix A: GEOS DAS System Performance and Throughput Requirements

Baseline GEOS-2: 2o $\times$ 2.5o $\times$ 70 level GCM resolution; 400,000 obs/day
5 days/wallclock day throughput
This corresponds to the baseline operational code
Multitasking GEOS-2 DAS run on 8pe Origin 2000
Main Memory (GB) 2.2 (per image)
Disk (GB) 8.0
Mass Storage (GB) 2300.0 (this is output per year)
Volume of Data (GB) 6.2 (produced per day per image)
Gigaflop/s sustained 0.25 (per image)
Duration of Run 5 days/WCday (continuous operation, single image)

 

Appendix B: Surface-To-Volume Limitations on the Scalability of Domain-Decomposed Gridpoint Models



The following calculation is an illustration of the so called surface-to-volume effect for describing the limits to scalability of domain-decomposed gridpoint models. We consider the scalability of a canonical transport algorithm. This is only an illustration in the sense that it does not account for a number of the typical complications that often occur in General Circulation Models, viz:

In general, the embarrassingly parallel parts of the GCM (e.g, some physics modules) may tend to improve the overall scaling with respect to the present illustration, load imbalance will tend to make the scaling worse, and there are some other algorithms (e.g., parallel rotation) which need a wholly separate analysis. Nevertheless, with only a few approximations, this illustration helps to understand the first-order scalability of the DYCORE in the GEOS-3 GCM. The approximations are:



Define the following symbols:

Ng = Total number of grid points in the domain
d = Dimension of the physical problem
D = Dimension of the parallel decomposition
Np = Total number of processors
M = single processor speed in megaflop/sec
B = interprocessor communication bandwidth in megabytes/sec
F = Number of flops/gridpoint/timestep for the relevant transport algorithm
G = The number of ``layers'' of guard cells in each dimension of the parallel decomposition (e.g., G=2 for fourth order finite difference, not sure what G is for Lin-Rood)
P = The precision of the calculation (i.e., P = 4, or 8)



Typically D=1,2, or 3, and d=2, or 3, while $d \geq D$. For instance, the physical problem is three dimensional (i.e, d = 3), however, we sometimes discuss scalability in terms of the number of gridpoints in the latitude-longitude domain (i.e, horizontal transport), which is the d = 2 case. We will not use the exact number of gridpoints as in the current DAO models, but by way of illustration use simple geometric expressions, e.g., a two dimensional problem (d = 2) with a checkerboard decomposition (D = 2), then the number of gridpoints in each dimension in each processor is $2DN_g^{(d-1)/d}/N_p^{(D-1)/D} \equiv 2DN_g^{\frac{1}{2}}/N_p^{\frac{1}{2}}$.

The communication time per timestep per processor is:

Tcomm = 2DGPB-1Ng(d-1)/d/Np(D-1)/D . (8)

The CPU time per timestep per processor is:

Tcpu = (F/M)(Ng/Np) . (9)

Hence the ratio of communication to CPU time is:

\begin{displaymath}\tau := T_{comm}/T_{cpu} = \frac{M}{B} \frac{2DGP}{F} \frac{N_p^{1/D}}{N_g^{1/d}} ,
\end{displaymath} (10)

and the corresponding parallel speedup is:

\begin{displaymath}SU = N_p/(1 + \tau) .
\end{displaymath} (11)



The terms in $\tau$ may be characterized as follows:



For typical parameters of problems such as the baseline code we are currently studying: $F \approx 50$, $M \approx 50$, $B \approx 10$, G = 2, P = 8, d = D = 2, Ng = 12960 (e.g., these are appropriate for $2^o \times 2.5^o$ resolution horizontal transport), and Np = 1024 processors, we get $\tau \approx 3$, i.e., the scalability is limited to $SU \approx 256$ processors.

6 References

Andersson, E., J. Haseler, P. Undén, P. Courtier, G. Kelly, D. Vasiljevic, C. Brankovic, C. Cardinali, C. Gaffard, A. Hollingsworth, C. Jakob, P. Janssen, E. Klinker, A. Lanzinger, M. Miller, F. Rabier, A. Simmons, B. Strauss, J-N. Thepaut, and P. Viterbo, 1998: The ECMWF implementation of three dimensional variational assimilation (3D-Var). Part III: Experimental Results, Q. J. R. Meteorol. Soc., 124, 1831-1860.

Baker, W.E., S. Bloom, J. Woollen, M. Nestler, E. Brin, T. Schlatter, and T. Branstator, 1987: Experiments with a three-dimensional statistical objective analysis scheme using FGGE data. Mon. Wea. Rev., 115, 272-296.

Bloom, S.C., L.L. Takacs, A.M. da Silva, and D. Ledvina, 1996: Data Assimilation Using Incremental Analysis Updates, Mon. Wea. Rev., 124, 1256-1271.

COCOMO, 2000: The Constructive Cost Model tool. For example
http://sunset.usc.edu/COCOMOII/suite.html

Cohn, S.E., 1993: Dynamics of short-term univariate forecast error covariances, Mon. Wea. Rev., 121, 3123-3149.

Cohn, S.E., A. da Silva, J. Guo, M. Sienkiewicz, and D. Lamich, 1998: Assessing the effects of data selection with the DAO Physical-space Statistical Analysis System. Mon. Wea. Rev., 126, 2913-2926.

Courtier, P., E. Andersson, W. Heckley, J. Pailleux, D. Vasiljevic, M. Hamrud, A. Hollingsworth, F. Rabier, and M. Ficher, 1998: The ECMWF Implementation of three dimensional variational assimilation (3D-Var). Part I: Formulation, Q. J. R. Meteorol. Soc., 124, 1783-1808.

Daley, R., 1991: Atmospheric Data Analysis. Cambridge University Press. New York, 457pp. ISBN 0-521-38215-7.

DAO 2000: Algorithm Theoretical Basis Document Version 2.01, Data Assimilation Office, NASA's Goddard Space Flight Center. Available at:
http://dao.gsfc.nasa.gov/subpages/atbd.html

Ding, H., and R. Ferraro: A General Purpose Parallel Sparse Matrix Solver Package, Proceedings of the 9th International Parallel Processing Symposium, p. 70, April 1995 (other references relating to the application of parallel conjugate gradient solvers to the PSAS can be obtained from the author at lys@dao.gsfc.nasa.gov or C. H. Q. Ding at cding@nersc.gov).

Gaspari, G. and S.E. Cohn, 1999: Construction of correlation functions in two and three dimensions, Quart. J. Roy. Met. Soc., 125, 723-757.

Gibson, J.K., P. Kållberg, S. Uppala, A. Nomura, A. Hernandez, E. Serrano, 1997: ERA Description, ECMWF Re-Analysis Project Report Series, 1.

Golub, G.H. and C.F. van Loan, 1989: Matrix Computations, 2nd Edition, The John Hopkins University Press, 642pp.

Guo, J., J.W. Larson, P.M. Lyster, and G. Gaspari, 1998: Documentation of the Physical-space Statistical Analysis System (PSAS) Part II: The Factored-Operator Error Covariance Model Formulation, NASA Data Assimilation Office, Office Note No. 98-04. Available at
http://dao.gsfc.nasa.gov/subpages/office-notes.html

Jazwinski, A.H., 1970: Stochastic Processes and Filtering Theory. Academic Press, 276pp.

Kalnay, E., et al., 1996: The NMC/NCAR 40-Year reanalysis project, Bull. Amer. Meteor. Soc.

Kanamitsu, M., W. Ebisuzaki, J. Woollen, J. Potter, and M. Fiorino, 1999: An overview of the NCEP/DOE reanalysis 2, in Proceedings of the 2nd International Conference on Reanalysis, Wokefield Park, Reading, UK.

Kistler, R.E., and E. Kalnay et al., 2000: The NCEP/NCAR 50-year reanalysis, to appear in Bull. Am. Meteor. Soc.

Larson, J.W., J. Guo, P.M. Lyster, 1998: Documentation of the Physical-space Statistical Analysis System (PSAS) Part III: Software Implementation, NASA Data Assimilation Office, Office Note No. 98-05. Available at
http://dao.gsfc.nasa.gov/subpages/office-notes.html

Lin, S.-J., and R.B. Rood, 1996: Multidimensional Flux-form semi-Lagrangian transport schemes, Mon. Wea. Rev., 124, 2046-2070.

Lyster, P.M., S.E. Cohn, R. Ménard, L.-P. Chang, S.-J. Lin, and R. Olsen, 1997: Parallel Implementation of a Kalman Filter for Constituent Data Assimilation, Mon. Wea. Rev., 125, 1674-1686.

Lyster, P.M., J. W. Larson, W. Sawyer, C. H. Q. Ding, L.-P. Chang, R. Ménard, R. Rood, S. E. Cohn, 2000: Final Report on the NASA HPCC PI Project: Four Dimensional Data Assimilation. Available at
http://http://dao.gsfc.nasa.gov/DAO_people/lys

Ménard, R., L.-P. Chang, P.M. Lyster, and S.E. Cohn, 2000a: Stratospheric Assimilation of Chemical Tracer Observations Using a Kalman filter. Part I: Formulation, to appear in Mon. Wea. Rev.

Ménard, R., and L.-P. Chang, 2000b: Stratospheric Assimilation of Chemical Tracer Observations Using a Kalman filter. Part II: ${\chi}^2$-Validated Results and Analysis of Variance and Correlation Dynamics, to appear in Mon. Wea. Rev.

Parrish, D.F. and J.C. Derber, 1992: The National Meteorological Center's spectral statistical-interpolation analysis system, Mon. Wea. Rev., 120, 1747-1763.

Parrish, D.F., J.C. Derber, R.J. Purser, W.-S. Wu, and Z.-X. Pu, 1997: The NCEP global analysis system: recent improvements and future plans, J. Met. Soc. Japan, 75, No. 1B, 359-365.

PSAS 1998: the technical documents for the PSAS are da Silva and Guo (1996), Guo et al. (1998), and Larson et al. (1998).

Rabier, F., A. Mc Nally, E. Andersson, P. Courtier, P. Undén, J. Eyre, A. Hollingsworth, F. Bouttier, 1998: The ECMWF implementation of three dimensional variational assimilation (3D-Var). Part II: Structure Functions, Q. J. R. Meteorol. Soc., 124, 1809-1830.

Sawyer, W., A. Wang, 1999: Benchmark and Unit Test Results of the Message-Passing GEOS General Circulation Model, NASA Data Assimilation Office, Office Note No. 99-04. Available at
http://dao.gsfc.nasa.gov/subpages/office-notes.html

Schubert, S.D., J. Pfaendtner, and R. Rood, 1993: An Assimilated Data Set for Earth Science Applications, Bull. Am. Met. Soc., 74, 2331-2342.

Schubert, S.D., C.-K. Park, C-Y. Wu, W. Higgins, Y. Kondratyeva, A. Molod, L.L. Takacs, M. Seablom, and R.B. Rood, 1995: A Multiyear Assimilation with GEOS-1 System: Overview and Results. NASA Tech. Memo. 104606, Vol. 7, 201 pp., Available at http://dao.gsfc.nasa.gov/subpages/tech-reports.html

A. da Silva and J. Guo, 1996: Documentation of the Physical-Space Statistical Analysis System (PSAS) Part I: The Conjugate Gradient Solver Version, PSAS-1.00, NASA Data Assimilation Office Note 96-02. Available at
http://dao.gsfc.nasa.gov/subpages/office-notes.html

Stobie, J.G., 1996: Data Assimilation Computing and Mass Storage Requirements for 1998, NASA Data Assimilation Office, Office Note No. 96-16. Available at
http://dao.gsfc.nasa.gov/subpages/office-notes.html

Takacs, L.L., A. Molod, and T. Wang, 1994: Documentation of the Goddard Earth Observing System (GEOS) General Circulation Model - Version 1. NASA Technical Memorandum 104606, Volume 1, NASA Goddard Space flight Center, Greenbelt, MD 20771. Available at
http://dao.gsfc.nasa.gov/subpages/tech-reports.html

Takacs, L.L., 1997: Impact of Resolution on GEOS GCM Model Development, DAO internal report.



About this document ...
Peter Lyster
2000-07-19