Published in Proceedings of Supercomputing 97

Parallel Computing at the NASA Data Assimilation Office (DAO)

P.M. Lyster1, J. Abeles2, L.-P. Chang3, S.E. Cohn, K. Ekers3, J. Guo3, M. Harber3, D. Lamich3, J.W. Larson1, R. Lucchesi3, R. Ménard4, R. Rood, S. Schubert, W. Sawyer1, M. Sienkiewicz3, A. da Silva, J. Stobie3, S. Swift2, L.L. Takacs3, R. Todling,3, J. Zero3
NASA/Goddard Space Flight Center, Data Assimilation Office
Greenbelt, MD 20706, USA
Email to lys@dao.gsfc.nasa.gov
http://dao.gsfc.nasa.gov/DAO_people/lys
Additional Affiliations
1. Earth Systems Science Interdisciplinary Center (ESSIC), University of Maryland College Park
2. SGI/Cray.
3. General Sciences Corporation, a subsidiary of Science Applications International Corporation
4. Joint Center for Earth Systems technology (JCET), University of Maryland Baltimore County.

C.H.Q. Ding
Lawrence Berkeley National Laboratory,Berkeley, CA 94720

R. Ferraro
Jet Propulsion Laboratory, Pasadena, CA 91109

Abstract:

This presentation discusses the NASA data assimilation project at the Data Assimilation Office at the NASA/Goddard Space Flight Center. The goal is to produce accurate gridded datasets of atmospheric fields by assimilating a range of observations along with physically consistent model forecasts. This work produces datasets that are used by the climate research community. The data come from conventional sources that are used for weather forecasts (e.g., radiosondes, earth-surface measurements, and satellite temperature retrievals), as well as new sources such as satellites that will be launched under the Mission To Planet Earth Enterprise. An end-to-end Goddard Earth Observing System (GEOS) Data Assimilation System (DAS) currently supports stratospheric flight missions and reanalysis projects for NASA. The current Core of this system (Model, and Analysis) is a multitasking algorithm that runs on Cray J90 and C90 computers at Goddard and NASA Ames Research Center. Future Core computing will be carried out at Ames, with a new production system scheduled to be ready for the EOS AM-1 satellite launch in June of 1998. The DAO has acquired SGI Origin 2000 computers, with an aggregate of 160 processors in place at Ames, and more planned for the future. The DAO is currently updating the control scripts and programs, and implementing a modular Fortran 90 Core system. During 1998 the Core system will be migrated to distributed-memory software using the Message Passing Interface. Part of this work is being carried out under the NASA High Performance Computing and Communications Earth and Space Sciences program. The algorithmic and performance issues involved in Core system are the main subject of this presentation.

Note that this version of the paper has a longer list of authors, representing a more complete list of those who contributed to the final oral presentation in November 1997 at Supercomputing97.

Keywords:
Four Dimensional Data Assimilation, Climate Modeling, Forecast, Analysis


Index

1. Introduction
2. The GEOS General Circulation Model
3. The Physical-space Statistical Analysis System (PSAS)
  3.1 The Prototype Parallel PSAS
  3.2 Summary of Timings for the Prototype Parallel PSAS
4. Parallel I/O
5. Timings and Optimization for Core GEOS DAS
6. Advanced Data Assimilation: The Kalman Filter
7. Conclusions and Further Work
Appendix: Web References
References

1. Introduction

Atmospheric data assimilation produces accurate gridded datasets that are used by meteorologists for weather analysis and forecasts; it is also being used as a tool for climate research. The National Aeronautics and Space Administration (NASA) Data Assimilation Office (DAO) at the Goddard Space Flight Center conducts reanalysis of archived earth-science datasets as well as real-time mission support analysis and forecasts for the climate research community. Data assimilation produces accurate gridded datasets of atmospheric fields by assimilating a range of observations along with physically consistent model forecasts. The DAO's Goddard Earth Observing System (GEOS) Data Assimilation System (DAS) is described in the Algorithm Theoretical Basis Document[DAO, 1996]. The DAO is preparing to move GEOS DAS to distributed-memory parallel computing platforms. This will be used for DAO's normal activities, and will also be an important for of the Mission to Planet Earth (MTPE) Enterprise in the coming years.

The end-to-end GEOS DAS is shown schematically in Figure 1. The present DAS, whose Core system runs on multitasking Cray C90 and J90 computers, is fed from the Global Telecommunication System (GTS). Meteorological data, including satellite-retrieved temperature profiles, are given to the Goddard Distributed Active Archive Center (DAAC). These data proceed through a data reduction and preprocessing stage before being ingested by the Core system. In the future, new datatypes will be provided via the EOS Data and Information System (EOSDIS) Core System (ECS). Postprocessing stages include: Quality Assurance of Data Sets (QuADS) that inspects the gridded data sets produced by the Core system for I/O error and general physical consistency; Adaptive Tuning of Error Statistics System (ATESS) that produces observation and forecast error statistics for the GEOS DAS; and DAO On-Line Monitoring System (DOLMS) that monitors observation and gridded data streams and makes them available for graphical presentation. There are a number of different modes of operation for the DAS. Briefly, mission support involves real-time, or First-Look, data assimilation and sometimes the production of up to 10-day model forecasts. For current data sets made available from the Goddard DAAC, this mode of operation ingests about 50 megabytes of data per day into the Core system. In the coming year, satellite-retrieved profiles of atmospheric parameters will be produced as part of the DAS preprocessing system, which will increase the data ingest rate to about 1 gigabyte per day. The output analysis (gridded) datasets are about 1 gigabyte per day in real-time mode, while the production of model-forecast fields can increase this quantity by over an order of magnitude. Periodically the DAO conducts reanalysis projects that involve multi-year analysis whose data sets are then studied and distributed to the climate research community. In this mode of operation, DAO plans for a production rate of 30 days of assimilation per wall-clock day.

The end-to-end GEOS DAS is a complex operation that is being updated using modern Process Engineering methods. There are about 100 people directly involved in DAO research and operations. This need for Process Engineering has become apparent in a complex environment where issues such as software Configuration Management (CM), version control and maintenance, multiple hardware platforms, and the coordination of distributed personnel are important.

GEOS DAS uses a grid-point based atmospheric general circulation model GCM[Takacs et al., 1994]. Other physical processes are computed using parameterized models for turbulence, short- and long-wave radiation, moisture, and land surface processes. Analysis algorithms are used to combine the inherently unstructured data, with (hopefully) known errors, into the structured models whose errors are also quantified. This can be regarded as an problem in the field of stochastic modeling and estimation. One of the difficulties of atmospheric data assimilation arises from the large size of the state space. Current GCMs have more than 107 gridpoints and there are approximately 105 meteorological observations (including satellite-retrieved temperature profiles) reported world wide in a typical six-hourly (or synoptic) assimilation period. The current analysis at the DAO is an on-line Quality Control (QC) algorithm and the Physical-space Statistical Analysis System (PSAS) [Cohn et al.,1997]. A key part of PSAS performs an iterative conjugate-gradient solve of a moderately dense (26% full) 105 × 105 matrix. The heritage code for the Core system is mostly FORTRAN 77.

As part of DAO's preparation to provide support to MTPE in 1998, and as part of a Grand Challenge project that is being funded by the NASA High Performance Computing and Communications (HPCC) Earth and Space Sciences (ESS) program, a Core system that is based on the Message Passing Interface (MPI) parallel library is being developed using a Fortran 90 modular approach -- the Goddard Earth Modeling System (GEMS)[DAO, 1997]. Key components of Core GEOS DAS, including the GCM (in collaboration with Max Suarez at GSFC) and an early version PSAS, have been parallelized [Ding and Ferraro, 1995] [Ding and Ferraro, 1996] (in this paper, we shall refer to this parallel version of the early PSAS as the prototype parallel PSAS). These form the basis for the design of their respective parts of the parallel system. The Core system is being optimized and run on the SGI Origin 2000 systems at Ames Research Center and the Cray T3E testbed computer at GSFC. This paper will focus on the implementation and optimization of the Core distributed-memory (MPI) algorithms, including issues of parallel I/O. Other work at the DAO involves using parallel computing to study advanced methods of data assimilation such as the Kalman filter. This work is being carried out in the context of the assimilation of trace (constituent) gases such as ozone and methane; details of ongoing work will be discussed briefly.

  

figure41

Figure 1: The GEOS End-to-end Data Assimilation System.

2. The GEOS General Circulation Model

The GCM uses a finite-difference algorithm (the dynamical core) to integrate the primitive equations. The model uses a geographic spherical coordinate grid with horizontal spacing of tex2html_wrap_inline364 in latitude by tex2html_wrap_inline366 in longitude, with a 70-level vertical sigma coordinate. Physical processes are parameterized using: a land-surface scheme, moisture, turbulence, and longwave and shortwave radiative transfer schemes [DAO, 1996][Takacs et al., 1994].

The physical parameterizations involve dependencies only in the vertical coordinate, so the GCM can be parallelized most easily with a horizontal domain decomposition (checkerboard)[Dannevik, 1993] [Mirin, et al., 1992] similar to that commonly used in general circulation [Drake et al., 1994] and regional mesoscale circulation models [Foster and Michalakes, 1993]. Prototyping work on a parallel implementation of the GEOS GCM dynamical core has been done successfully with the collaboration of Max Suarez. The remaining work is to incorporate the operational versions of the physics parameterizations into the parallelized dynamical core. The design and status of this work are discussed extensively in Lyster et al., 1997a.

3. The Physical-space Statistical Analysis System (PSAS)

The process of analysis generates a best estimate of the state of the atmosphere. Typically, analyses of meteorological variables are performed for six hourly (synoptic) sets of observations that are collected from the Global Telecommunication System (GTS). The GCM is used to provide a 6 hour forecast. These forecasts (at 0Z, 6Z, 12Z, and 18Z) may be regarded as estimates of the state of the atmosphere that are augmented by observations to form an analysis. Let tex2html_wrap_inline368 be a vector of p observations that passed the quality control tests, and tex2html_wrap_inline372 a vector of n forecast variables produced by the GCM. The objective is to produce an analyzed state vector tex2html_wrap_inline376 (i.e., a vector of physical quantities that describe directly the physical state of the atmosphere). This is accomplished by solving a pair of equations that comprise the core formulation of PSAS [Cohn et al.,1997][da Silva and Guo, 1996]:

The innovation equation


equation186

and the analysis equation


equation188

where tex2html_wrap_inline378 is the specified forecast error covariance matrix, tex2html_wrap_inline380 is the specified observation error covariance matrix, tex2html_wrap_inline382 represents a generalized interpolation from the analysis grid to the observations, and tex2html_wrap_inline384 is an intermediate vector in the observation space. Both the observation and forecast variables are state variables; in the future this will not always be so [Joiner and da Silva, 1996].

The analysis is on a tex2html_wrap_inline364 latitude by tex2html_wrap_inline366 longitude spherical grid. The state variables analyzed include: sea-level pressure tex2html_wrap_inline390, sea-level horizontal winds tex2html_wrap_inline392 and tex2html_wrap_inline394, and the horizontal wind u and v on 18 standard pressure levels (prototype PSAS uses 14 levels), the geopotential height h, and the specific humidity q on standard levels greater than 200 hPa.

PSAS solves a global coupled large matrix problem. The innovation matrix tex2html_wrap_inline406 is dense, although entries associated with locations that are separated by several correlation lengths are negligible (zero for compactly supported correlation functions) [Gaspari and Cohn, 1996] [DAO, 1996]. In the present multi-tasked serial version of PSAS, block structure is introduced in M by dividing the observations into 80 equal-area regions via an icosahedral partitioning scheme [Pfaendtner, 1996]. In order to introduce some sparseness in tex2html_wrap_inline410 and thus save computational effort, the correlations beyond a preset cutoff distance are not included. For a typical 6 hour synoptic (i.e., analysis) period the number of observations accumulated is p105. Hence, with a 6,000 kilometer cutoff between centroids of observation regions, the innovation matrix is of size 105 × 105, and is approximately 26% full. This uses more than 5 gigabytes of storage for a 32-bit representation of the data.

The innovation equation is solved using a preconditioned conjugate gradient (CG) algorithm [Golub and Van Loan, 1989] [da Silva and Guo, 1996]. This is an tex2html_wrap_inline418 operation, where tex2html_wrap_inline420 is the number of iterations of the CG algorithm. Typically, 8 to 12 iterations (x') are needed to produce a CG solution whose residual has been reduced by an order of magnitude. Experiments in which the residuals are reduced by more than two orders of magnitude resulted in differences in the analysis that are smaller than expected analysis errors [DAO, 1996]. The number of operations in the solution of the analysis equation (2) is tex2html_wrap_inline426.

3.1 The Prototype Parallel PSAS

The large memory that is available on parallel computers, such as the Intel Paragon or the Cray T3E, affords the possibility of calculating and storing the entire forecast error covariance matrix once per analysis cycle. This has the potential for considerable speedup over the serial approach, in which the matrix was reevaluated for each iteration of the CG algorithm. This capacity, along with the need to distribute floating-point operations, provides motivation for a parallel implementation of PSAS.

The algorithm of the parallel version of PSAS comprises five steps: 1) partition observations using inertial recursive bisection scheme (this is used for both the parallel online QC and for the parallel PSAS algorithm); 2) block matrix distribution of the innovation matrix M; 3) solve the innovation equation using the CG method; 4) partition of tex2html_wrap_inline430; and finally 5) solve the analysis equation.

The recursive inertial bisection algorithm divides p observations among tex2html_wrap_inline434 regions using bisection on the surface of a sphere. The data are read in as innovations, tex2html_wrap_inline436, and distributed in random order (in equal numbers) on tex2html_wrap_inline438 processors. In the first iteration, the observations are divided along the orthogonal cut of their combined principal axis moment of inertia. Successive iterations are performed in a tree structure that repeats the decomposition on half the remaining processors with approximately half the data. At each stage, the calculation of the principal axis is performed in parallel. As regions replicate, data are moved between different processors using split MPI communicators. Only the geographical locations of the observations, along with the processor on which the data associated with the observation itself resides is communicated during the partitioning process. Only after the partitioning is complete, are the observations and their associated attributes relocated to the processors corresponding to their respective observation regions. This two-step process minimizes communications costs.

The inertial division guarantees some degree of compactness of the resulting decomposition, and results in approximately equal numbers of observations in each region. This algorithm requires that the number of processors is a power of 2, and less than or equal to the number of regions; tex2html_wrap_inline442, otherwise there would be a load imbalance in the solution of the analysis equation (2). Typically Np=256, or 512, and Nr=512. It is possible to modify the power-of-two restriction on numbers of regions and processors by moving the location of the orthogonal cut to some point(s) along the principal axis to give a division of observations other than halving, and maintain approximately equal numbers of observations per region.

In the matrix block distribution, the parallel decomposition of observations is used to generate a list of matrix blocks. This is used to determine on which processor a matrix block will reside. Figure 2 shows a schematic of the block structure of the innovation matrix; this is color coded according to the processor on which each bock resides. A simulated annealing algorithm is used to ensure that both the work in generating matrix blocks and the subsequent matrix-vector multiply are load balanced.

  

figure93

Figure 2: Schematic of the parallel block structure decomposition (color coded) of the innovation matrix.

For the prototype parallel PSAS, there are typically 512 observation regions containing approximately p105 observations. The 6,000 kilometer correlation cutoff condition gives rise to a forecast error covariance matrix that is approximately 26% full. Therefore there are about 5122 × 0.26 × 0.5 ~   35,000 blocks (the factor of 0.5 accounts for the symmetry of the matrix).

Recall that each matrix block in the innovation matrix is associated with one observation region in the case of diagonal blocks, or two regions in the case of off-diagonal blocks. Each block is stored with the associated vector segment of x' that must be used to perform the matrix-vector multiply. Diagonal segments are stored on the processor that stores the observation for that region. For the off-diagonal blocks, the number of vector segments on each processor is minimized by storing the block on one of the two processors that owns the observations corresponding to one of the dimensions of the block. An iterative load balancing algorithm is then used to equalize the memory and floating-point costs on each processor. Application of this regimen results in a load imbalance of approximately tex2html_wrap_inline464.

The CG algorithm uses block preconditioning, and is described in Golub and Van Loan [Golub and Van Loan, 1989] and da Silva and Guo [da Silva and Guo, 1996]. In the parallel partitioner, the innovations are separated into regions that are distributed among the processors. Taken together, these vector segments provide initial condition for the vector iterate x'. Depending on how the off-diagonal blocks are distributed, some of these vector segments must be replicated on multiple processors. At each step in the iteration, the intermediate solution vector x' must be assembled using global communications. Since only about a third of the processors contain data needed to calculate a particular set of vector segments of x', it is more efficient to do explicit message passing to achieve this end rather than use a global summation such as MPI_REDUCE.

The analysis increment (Eq. 2) is tex2html_wrap_inline472, where tex2html_wrap_inline430 represents the forecast error covariance between the observation grid and the analysis grid. The domain decomposition must account for both the unstructured distribution of the observations and the structured analysis grid. The solution vector x from the CG algorithm is decomposed in the same manner as the observations. The decomposition for the analysis grid is based on the condition that gridpoints within the 6,000 km cutoff of the location of an observation are affected by the value of the corresponding coordinate of the vector x.

The first step in the process is an equal-area rectangular partition of the analysis grid into analysis regions. Next, analysis gridpoints are thinned longitudinally, resulting in an equal number of analysis gridpoints in each analysis region. Therefore, tex2html_wrap_inline430 may be generated as a number of equal-sized blocks such that the centroids of their associated regions are within the 6,000 km cutoff distance of the centroids of each region of x. A fixed number of analysis regions per processor (usually 1 or 2), along with a uniform distribution of thinned analysis points renders the partial matrix-vector multiply approximately load-balanced.

A detailed design and the current status of parallel PSAS can be found in Lyster et al., 1997b.

3.2 Summary of Timings for the Prototype Parallel PSAS

Timings were performed for the prototype parallel PSAS for 80,000 observations (analysis grid tex2html_wrap_inline494 levels) on 512 processors of the Intel Paragon at the California Institute of Technology. On this platform, the solver achieves 18.3 gigaflop/s (77 megaflop/s per processor), which is 36% of peak performance for 512 processors. This corresponds to 157 seconds to produce a 6 hour analysis. The calculation of the innovation matrix entries consumes 23.8 seconds, but is performed only once per analysis. The CG solver uses BLAS level 2 library calls and messages to send vector segments of x'. This is iterated tex2html_wrap_inline514 times, taking 36.4 seconds. The number of iterations is more than the value of Ni cited earlier because the version of PSAS that was parallelized had a more stringent convergence criterion than subsequent versions of PSAS. The condition cited in this section is more stringent than was later learned to be necessary. The dominant cost of solving the analysis equation (Eq. 2) is the time spent generating the matrix elements. Since one dimension of tex2html_wrap_inline430 is n which is larger than p this takes longer than the generation of the innovation matrix. The communication (1.5 seconds) involved in reassembling the analysis grid vector is similar to the assembly of vector segments in the innovation equation. In sum, tex2html_wrap_inline528 of the CPU time is spent in communication.

The prototype parallel PSAS has also been implemented on a 512 processor Cray T3D at the NASA Goddard Space Flight Center, and the scaling behavior of the code on this platform is presented in Figure 3. This work was performed as part of the 10 gigaflop/s milestone requirement for the 4DDA Grand Challenge HPCC ESS project [Lyster et al., 1997c]. The dots at 128 and 256 processors depict performance for a run with 51,990 observations, while the diamonds at 256 and 512 nodes are for a run using 79,938 observations. The two different numbers of observations were used because there was not sufficient combined memory on 128 nodes to accommodate the larger number of observations.

  

figure114

Figure 3: Scaling behavior for the prototype parallel PSAS on a 512 node Cray T3D.

4. Parallel I/O

The input and output of data to the GEOS Data Assimilation System (GEOS DAS) is the responsibility of the GEOS Parallel I/O Subsystem (GPIOS). In previous versions of the GEOS system, sequential I/O was used on shared-memory computers. Implementation in a parallel environment with a software system that uses a distributed memory programming model requires more attention to performance details since I/O could generate significant network traffic and cause a bottleneck that interferes with scaling of the application.

The performance requirements for GPIOS are derived from the overarching GEOS throughput requirements along with requirements imposed by the target computing environment and programming model. There are turnaround requirements imposed by the First-Look analysis, an assimilation that will execute close to real-time in support the EOS Instrument teams. The First-Look Analysis will be run once a day starting in June 1998. It includes a 10 day forecast and will produce over 5 gigabytes of data, but this is not the mode of operation that will most heavily stress the I/O system. In reanalysis mode, the GEOS DAS must perform at a rate of 30 days of assimilation per one day of wall-clock time. This will require daily input of more than 1.5 gigabytes and will produce 25 to 35 gigabytes of gridded output data in HDF-EOS format. It is clear that I/O will be a bottleneck for this system unless some type of parallel I/O methods are used.

Traditionally, I/O in the MPI-based GCM was done by the root compute processor. The decomposed data were gathered from the array of processors to the single root processor, where they were stored in global arrays. As the global fields were gathered on the root processor, they were written to disk. The inverse of this process was used for reading the input data. Since the volume of the output data is an order of magnitude greater that the volume of input data, the DAO will concentrate on implementing a parallel I/O scheme for output first. This scheme will use the traditional single-processor output method described above and modify it in two ways. First, there will be more than one processor designated for doing output. Second, the output processors will be in an MPI group distinct from the other compute processors and will be dedicated to I/O and related functions. They will act as a cache for output data, allowing computation to continue once the MPI transfer of data from compute processors to I/O processors is complete; this is shown schematically in Figure 4. In the figure, the top group of processors is dedicated to the primary computation. At designated times as the assimilation moves forward in time, output is performed. Distributed data on the compute processors are passed to some number of I/O processors via an MPI communicator. Once the distributed data have been gathered into global data fields on the I/O processors, the compute processors are released to continue computation. Meanwhile, the I/O processors make data and format transformations and physically write the data to disk files. One advantage of such an approach is that it allows data transformations in preparation for output to occur locally on I/O processors without slowing the forward progression of the assimilation system. Some of these transformations would require extensive global communication if they were done in a distributed environment making it convenient to have the data gathered into global arrays on an I/O processor.

  figure120
Figure 4: Schematic of proposed parallel I/O system.

As more is learned about the support and performance of advanced parallel I/O tools for our hardware platform, DAO will consider the use of those libraries provided their inclusion does not severely affect the portability of code. A leading candidate is MPI-IO, which is included in the recently completed MPI-2 standard.

5. Timings and Optimization for Core GEOS DAS

GEOS DAS must meet performance rates of 50 and 100 gigaflop/s as part of the NASA HPCC project. Already, the prototype parallel PSAS has achieved 11.0 gigaflop/s for realistic problems on 512 processors of the NASA Goddard Cray T3D [Lyster et al., 1997c]. Parallel GEOS GCM has achieved 3.7 gigaflop/s (dynamical core) and 8.0 gigaflop/s (short wave radiation package) on 512 processors of the T3D; other model modules (turbulence, long wave, land surface) are expected to lie somewhere in this range of performance. The increased performance required to achieve the HPCC milestones will be gained through: the use of a Cray T3E whose processor speed is significantly faster than the T3D; higher number of processors (more than 512) on the T3E; improved load balance, communications optimization, and single-processor performance. Communications will be optimized by reducing the net volume where possible and concatenating messages in order to amortize latency. Single-processor performance is particularly important since the previous versions of GEOS GCM were optimized for vector computers. Presently we are working on cache optimization (use of padding, blocking,...), and the use of BLAS and other low-level routines. We will focus on optimizing the time-consuming radiation modules.

The DAO must also meet performance requirements as part of its commitment to Mission To Planet Earth. The hardest requirement is that 30 days of earth data must be assimilated in one wall-clock day. Since there are four analysis cycles (6 hour forecast and analysis) per day, this means that we must perform an analysis cycle on average every 12 minutes. At 18.3 gigaflop/s on the Paragon, the prototype PSAS takes 157.5 seconds. The QC usually takes about 10 percent of the analysis, which would add an additional 15.6 seconds. The model timestep is 3 minutes, while the physics modules are advanced on longer timescales (9 minutes for moist processes, 30 minutes for turbulence, and 3 hours for longwave and shortwave calculations). At present, GEOS GCM takes about 40% of the time of the analysis (PSAS and QC) so, in order to achieve at least 30 days per day turnaround, each timestep must take no more than 1.9 seconds on average. Timings for the present serial (vectorized) version of the GCM indicate that if we can achieve a weighted average 10 gigaflop/s then the 6 hour forecast will take about 106 seconds. This time plus the 173 seconds (at 18.3 gigaflop/s) for the analysis fall well within the requirement of 12 minutes per analysis cycle. It is clear real-time (First-Look) analysis performance will be achieved fairly easily on the Origin 2000 systems that are being installed at Ames. In fact the present multitasking operational system achieves about 4 days of assimilation per wall-clock day on Cray J90s. The challenge will be to achieve the higher performance required of the DAO's reanalysis projects and HPCC milestones.

6. Advanced Data Assimilation: The Kalman Filter

With the increasing importance of climate research, there is the need for accurate assimilation of atmospheric trace gases (constituents) such as ozone or methane. Typically, constituent assimilation uses prescribed winds that are provided from a standard DAS, such as GEOS DAS or from weather centers such as ECMWF and NCEP. Many institutions regularly study trace gases this way, and the DAO is planning on providing gridded ozone datasets as a standard product as part of its support for EOS missions in 1998. Computationally, constituent assimilation is not as demanding as a full DAS. Because of this, the DAO has been able to use the constituent problem as a testbed for research on the Kalman filter (KF). The Kalman filter is one of the advanced methodologies for DAS (others include 4DVAR and Ensemble Filtering) that are being studied worldwide. The main advantage of the KF is the ability to more accurately calculate the forecast error covariance matrix Pf. PSAS (Eq. 1) requires Pf, but the current methodology approximates this quantity using statistical methods. The Kalman filter accurately evaluates Pf based on a dynamical approach [Lyster et al., 1997d]. The equation for the evolution of Pf is:


equation190

where tex2html_wrap_inline590 is the analysis error covariance matrix that is evaluated based on an optimizing principle (such as minimum variance), M represents the linearized model (e.g., transport model) operator, and Q is the model error covariance matrix. Both the forecast and analysis covariance matrices are of size n tex2html_wrap_inline350 n, where n is the size of the state vector tex2html_wrap_inline372; note that n106 for three-dimensional models, and n104 for two-dimensional (horizontal) models. Even for sparse models this problem has memory and complexity that scales as tex2html_wrap_inline606. Full three-dimensional implementations await tera- and petaflop/s capacity. The DAO has developed a two-dimensional Kalman filter[Lyster et al., 1997] to study trace chemicals in the stratosphere, where the model is relatively simple (passive advection with prescribed, analyzed winds), and motion is known to be constrained to isentropic surfaces. The key component of the algorithm is the distributed-memory (MPI) parallel implementation of Eq. 3. Figure 5 shows how the covariance matrix P is first decomposed along rows (i.e., contiguous groups of columns are stored on each processor). The evaluation of tex2html_wrap_inline610 then proceeds in a series of steps that involves a global transpose of the large tex2html_wrap_inline612 matrix MP. Although this involves considerable global communications, an overall performance of more than 1 gigaflop/s was obtained on the 512 processor Intel Paragon computer at the California Institute of Technology, and, more importantly, practical wall-clock times are being achieved for realistic experiments. The parallel Kalman filter is being used to study retrieved constituents from the Upper Atmosphere Research Satellite (UARS)[Ménard et al., 1997].

  

figure136

Figure 5: Schematic of the Parallel Decomposition of a Covariance Matrix.

7. Conclusions and Future Work

Apart from code performance mentioned in the previous section, there are still a number of issues to be resolved. The prototype parallel PSAS is now written in C with some FORTRAN 77 subroutines retained. The version of PSAS under development is written in Fortran 90 using a modular structure with complex data types. Some portions of FORTRAN 77 code will be converted to Fortran 90, and some will be used, unchanged, as low-level subroutines. These changes will result in code that is more modular, as well as safe and stable. The Fortran 90 interface to MPI is still under development [Hennecke, 1996]; this is more of an issue of portability rather than feasibility.

The formulation of PSAS is maturing beyond its statement in equations (1) and (2) to include direct assimilation of non-state variables[Joiner and da Silva, 1996]. This presents the major complication in the process of building the next parallel version of PSAS. In particular, algorithms and data structures must be developed to handle observations that are non-state variables, and their corresponding variables in state space. These changes will render the load balance problems conquered in the prototype parallel algorithm more complex, and it will also be necessary to extend some of the concepts regarding load-balancing in section 3.1 to accommodate these enhancements.

The prototype parallel PSAS is designed to work on multiprocessor configurations in which the number of nodes is a power of two. This restriction can be alleviated through modification of the parallel partition scheme described in Section 3.1. True flexibility would require a more general algorithm capable of partitionings other than bisections; that is, the ability to divide any region into a prime number of roughly equal pieces (in terms of the number of observations in each piece). Then dividing a set of observations into tex2html_wrap_inline620 regions would involve individual partionings corresponding to the prime factorization of tex2html_wrap_inline620.

Once the above challenges are met, a fully functional parallel version of PSAS capable of assimilating non-state data will result. When this is combined with the parallel version of the GEOS GCM, and the parallel QC system under development, a flexible data assimilation system based on a Core MPI algorithm will be possible.


Acknowledgments

The original serial version of PSAS was developed by Jim Pfaendtner. The original design of the parallel PSAS was developed at the Jet Propulsion Laboratory with help from Don Gennery, and Clara Chan. Edith Huang of JPL provided valuable help with parallel programming issues. We would also like to acknowledge useful discussions with Andrea Molod, Sharon Nebuda, Carlos Pabon-Ortiz, Max Suarez, and Dan Schaffer at GSFC. The work of Robert Ferraro (Jet Propulsion Laboratory), Chris H. Q. Ding (JPL and the Lawrence Berkeley National Laboratory), and Peter Lyster, Jay Larson, and Will Sawyer at the Data Assimilation Office was funded by the High Performance Computing and Communications Initiative (HPCC) Earth and Space Science (ESS) program.


Appendix: Web references

The Algorithm Theoretical Basis Document (ATBD) describes the basic algorithms of the Goddard Earth Observing System (GEOS) Data Assimilation System (DAS)

Download a paper [DAO, 1997] that describes the design of the GEOS-3 Data Assimilation System Architectural Design

Download a paper [Lyster et al., 1997a] that describes the design of the Parallel GEOS General Circulation Model (GCM)

Download a paper [Lyster et al., 1997b] that describes the design of the Parallel GEOS Physical-space Statistical Analysis System (PSAS)

Further information about the work on parallel I/O can be found at Rob Lucchesi's GPIOS site

Further information about the HPCC ESS Grand Challenge PI work at the DAO can be found at Peter Lyster's Home Page

Download a paper [Lyster et al., 1997d] on the Parallel Kalman Filter, or go to L.P. Chang's Home Page to view some animations of Kalman filter assimilation runs.




References

Cohn et al.,1997
Cohn, S. E., A. da Silva, J. Guo, M. Sienkiewicz, and D. Lamich (1997). Assessing the Effects of Data Selection with the DAO Physical-space Statistical Analysis System Technical Report DAO Office Note No. 96-08, NASA Goddard Space Flight Center, Greenbelt, Maryland. Submitted to Mon. Wea. Rev.

Dannevik, 1993
Dannevik, W., (1993). Massively Parallel Computing and Large Eddy Simulation in Geophysical Fluid Dynamics, Galperin, B. and S. Orszag (editors), Large Eddy Simulation of Complex Engineering and Geophysical Flows, Cambridge University Press, New York, NY, 600 pp.

DAO, 1996
DAO staff (1996). Algorithm Theoretical Basis Document, version 1.01. Technical report, NASA Goddard Space Flight Center, Greenbelt, Maryland.

DAO, 1997
DAO staff (1997). GEOS-3 Data Assimilation System Architectural Design. DAO Office Note, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Ding and Ferraro, 1995
Ding, C. H. Q.  and R. D. Ferraro (1995). An 18 Gflops parallel data assimilation PSAS package. In Proceedings of the Intel Supercomputer Users Group Conference, page 70.

Ding and Ferraro, 1996
Ding, C. H. Q.  and R. D. Ferraro (1996). Climate data assimilation on a massively parallel computer. In Proceedings of Supercomputing, 96.

Drake et al., 1994
Drake, J., I. Foster, J. Hack,J. Michalakes, B. D. Semeraro, B. Toonen, D. L. Williamson, and P. Worley (1994). PCCM2: A GCM adapted for scalable parallel computers. In Proceedings of the AMS Annual Meeting.

Foster and Michalakes, 1993
Foster, I. and J. Michalakes (1993). MPMM: A massively parallel mesoscale model. In Parallel Computing in Atmospheric Science.

Gaspari and Cohn, 1996
Gaspari, G. and S. E. Cohn (1996). Construction of correlation functions in two and three dimensions. Technical Report DAO Office Note No. 96-03, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Golub and Van Loan, 1989
Golub, G. H. and C. F. Van Loan (1989). Matrix Computations. The Johns Hopkins University Press, Baltimore, second edition.

Hennecke, 1996
Hennecke, M. (1996). A FORTRAN 90 interface to MPI version 1.1. Technical Report Internal Report 63/96, RZ Universitat Karlsruhe, Karlsruhe, Germany.

Joiner and da Silva, 1996
Joiner, J. and A. M. da Silva (1996). Efficient Methods to Assimilate Satellite Retrievals Based on Information Content Technical Report DAO Office Note No. 96-06, NASA Goddard Space Flight Center, Greenbelt, Maryland.

von Laszewski, 1996
von Laszewski, G. (1996). The Parallel Data Assimilation System and its Implications on a Metacomputing Environment. PhD thesis, Syracuse University, Syracuse, New York.

Lyster et al., 1997a
Lyster, P. M., W. Sawyer, and L. Takacs (1997). Design of the Goddard Earth Observing System (GEOS) Parallel General Circulation Model (GCM). Technical Report DAO Office Note No. 97-13, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Lyster et al., 1997b
Lyster, P. M., J. W. Larson, C. H. Q. Ding, J. Guo, W. Sawyer, A. da Silva, I. Stajner (1997). Design of the Goddard Earth Observing System (GEOS) Parallel Physical-space Statistical Analysis System (PSAS). Technical Report DAO Office Note No. 97-05, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Lyster et al., 1997c
Lyster, P. M. (1997). Four Dimensional Data Assimilation: HPCC 10 Gigaflop/s Milestone Submission. Available as Peter Lyster Home page.

Lyster et al., 1997d
Lyster, P. M., S. E. Cohn, R. Ménard, L.-P. Chang, S.-J. Lin, and R. Olsen (1997). Parallel Implementation of a Kalman Filter for Constituent Data Assimilation. Mon. Wea. Rev., 125, 1674-1686. Available also as Technical Report DAO Office Note No. 97-02, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Ménard et al., 1997
Ménard, R., L.-P. Chang, P. M. Lyster, and S. E. Cohn (1997). Stratospheric assimilation of chemical tracer observations using a Kalman filter. Submitted to J. Geophys. Res. Available also as Technical Report DAO Office Note No. 97-14, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Mirin, et al., 1992
Mirin, A., W. Dannevik, P. Eltgroth, M. Wehner, J. Brown, and B. Chan (1992). Performance of a Portable, Parallel Atmospheric General Circulation Model. R. Sincovec, et ali (editors) Proceedings, Sixth SIAM Conference on Parallel Processing for Scientific Computing Vol. 1, SIAM, Philadelphia, 500 pp.

Pfaendtner, 1996
Pfaendtner, J. W. (1996). Notes on the icosohedral domain decomposition in PSAS. Technical Report DAO Office Note 96-04, NASA Goddard Space Flight Center, Greenbelt, Maryland.

da Silva and Guo, 1996
da Silva, A. and J. Guo (1996). Documentation of the physical-space statistical analysis system (PSAS). part I : The conjugate gradient solver, version PSAS-1.00. Technical Report DAO Office Note No. 96-02, NASA Goddard Space Flight Center, Greenbelt, Maryland.

Takacs et al., 1994
Takacs, L. L., A. Molod, and T. Wang (1994). Documentation of the Goddard earth observing system (GEOS) general circulation model-version 1. Technical Report NASA Technical Memorandum 104606, Vol. 1, NASA Goddard Space Flight Center, Greenbelt, Maryland.


Author Biography

Peter Lyster

Peter Lyster is an associate research scientist in the Department of Meteorology at the University of Maryland. He is also affiliated with the NASA/Goddard Space Flight Center (GSFC) Data Assimilation Office (DAO) through the Earth System Science Interdisciplinary Center (ESSIC). He works on earth science applications with emphasis on parallel computing. He is the PI on the HPCC Earth and Space Sciences (ESS) Grand Challenge project: Four Dimensional Data Assimilation. He also works on the theory of data assimilation and initialization. Further details can be found on his home page, or contact him via email at lys@dao.gsfc.nasa.gov




Peter Lyster
Fri Aug 15 21:24:51 EDT 1997
  • About this document ...