Four Dimensional Data Assimilation

Richard Rood, Principal Investigator, NASA/Goddard Space Flight Center, Data Assimilation Office, rood@dao.gsfc.nasa.gov, 301-286-8203

Peter Lyster, Co-Investigator, lys@dao.gsfc.nasa.gov, 301-805-6960

1. Executive Summary

The primary achievement of the Goddard Data Assimilation Office (DAO) HPCC Grand Challenge project -- Four Dimensional Data Assimilation (4DDA) -- has been the application of high-speed and large-memory parallel computers to a key problem of Earth science: that of melding the complexity of real-world observations into modern mathematical models. In the process of performing this work, computer and Earth scientists on the project have had to collaborate to make best use of available hardware, software, and algorithmic development. The scientific output, or 'product', consists of gridded, best-estimate, consistent datasets that can be used by scientists who are studying the Earth's climate. Of particular importance are 'reanalyses' of past decadal datasets up to the present, and the need to produce them with rapid turnaround.

The main algorithms that have been studied are the Physical space Statistical Analysis System (PSAS), which will be used as the production system for the coming decade, and the developmental Kalman filter assimilation system. At modest resolutions (2 degrees of horizontal resolution by 30 vertical levels, and around 100,000 observations in each six-hourly synoptic period) both algorithms can use gigabytes of main memory and gigaFLOPS of processing speed. They are also data-driven, and as such the production algorithms stress the whole train of technology from I/O, to memory, communications, and processor speed.

The technical approach and achievements are summarized in Section 2. An executive summary of the achievements and recommendations is presented here (detailed numbers are presented in Section 2):

1.a Summary of Achievements

At the beginning of the project there was no in-house parallel computing expertise at the DAO. Since the beginning of 1993 the DAO has formed a team of interdisciplinary Earth and computer scientists from NASA/Goddard Space Flight Center (GSFC), NASA/Jet Propulsion Laboratory (JPL), University of Maryland College Park (UMCP), and the Northeast Parallel Architectures Center (NPAC) at Syracuse University.

Work at the Jet Propulsion Laboratory (JPL) has concentrated on the future production analysis algorithm: Physical space Statistical Analysis System (PSAS). Hong Ding, Robert Ferraro, and Don Gennery have developed a data decomposition algorithm based on message-passing software on the Intel Paragon computer with 512 processors. This algorithm has achieved 18.3 gigaFLOPS for the key conjugate gradient component, and with this it is expected that a 30-fold increase in efficiency can be obtained over the equivalent code run on a single-processor CRAY C90. This is important because the expected increase in input data volume with the Mission to Planet Earth (MTPE) program will stress the ability to maintain present-day metrics such as the number of days assimilated per wall-clock day.

At the University of Maryland and Goddard, Peter Lyster, Richard Rood, Steven Cohn, Richard Menard, and L.-P. Chang have worked on the developmental Kalman filter algorithm; they have also used Intel machines. Starting from scratch, a two-dimensional Kalman filter was developed. It is at present being used in scientific studies and has achieved a speed of 1.3 gigaFLOPS. This is the first full implementation of a Kalman filter, and the availability of large memory, efficient interprocessor communications, and on-processor speed has been instrumental in enabling this work.

Jose Zero at Goddard has ported the sequential PSAS code to a DEC Alpha 2100 in order to study aspects of the evolving (in particular, RISC-based) computing environment. The DEC Alpha yielded a rate of 50 percent of the speed of PSAS on a single processor of a CRAY C90.

Work at Syracuse has established the framework for parallelizing operational statistical analysis algorithms. In particular, the nearly completed PhD thesis of student Gregor von Laszewski has resulted in a consistent approach for handling the load-balancing problems that result from spatially inhomogeneous data in the Optimal Interpolation (OI) scheme. This approach applies to message-passing software implementations such as those based on the Message Passing Interface (MPI). Also at Syracuse, Miloje Makivic has implemented the quality control (QC) part of the analysis algorithm in data-parallel software on the Thinking Machines CM-5. For both efforts, speeds of around 0.5 gigaFLOPS were reported on modest-sized computers (a 40-processor IBM SP-2 and a 256-processor CM-5).

1.b Discussion of Phase I Experiences

In general the parallel computers that were made available were not configured to handle data-driven applications such as operational 4DDA. This meant that it was impossible to assess an end-to-end operational system from I/O, to memory, communications, and processor speed. In Phase I, key segments of the DAO's Data Assimilation System were analyzed, and it is expected that the work of system integration will be performed in the coming years. The parallel computers that were made available in Phase I were themselves new technology and as such did not always perform to expectations in terms of the operating systems, on-processor FLOPS, interprocessor communications, and I/O. It is also clear, in the light of the multifaceted needs that have been expressed here, that FLOPS are only one of a number of possible metrics that need to be addressed. For applications of data assimilation, the following are at least as important: maintaining a fixed throughput in terms of days assimilated per wall-clock day in the face of increasing volume of observational data and increasing model resolution; improvement in wall-clock time for developmental algorithms so that engineering and scientific analysis of their results can be completed expeditiously; the efficiency and expense of developing and debugging parallel software.
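
The throughput metric named above, days assimilated per wall-clock day, is simple arithmetic, and a small sketch may make the budget concrete. The function names below are illustrative and not part of any DAO code; the inputs are the 1 hour of wall-clock time per assimilated day reported for the Kalman filter in Section 2.c.2 and the 30 days per day production metric quoted in Section 2.c.1.

```python
def days_per_wallclock_day(wallclock_hours_per_assimilated_day):
    """Assimilated days completed in 24 hours of wall-clock time."""
    if wallclock_hours_per_assimilated_day <= 0:
        raise ValueError("wall-clock cost must be positive")
    return 24.0 / wallclock_hours_per_assimilated_day


def max_hours_per_assimilated_day(target_days_per_wallclock_day):
    """Largest wall-clock cost per assimilated day that still meets the target."""
    if target_days_per_wallclock_day <= 0:
        raise ValueError("target must be positive")
    return 24.0 / target_days_per_wallclock_day


# Kalman filter figure from Section 2.c.2: 1 hour per assimilated day.
kf_throughput = days_per_wallclock_day(1.0)

# Wall-clock budget (in minutes) implied by a 30-days-per-day metric.
budget_minutes = 60.0 * max_hours_per_assimilated_day(30.0)

print(kf_throughput, budget_minutes)
```

Under these inputs, one hour per assimilated day corresponds to 24 days per wall-clock day, and a 30 days per day target leaves 48 minutes of wall-clock time for each assimilated day, covering forecast, quality control, analysis, and I/O together.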

2. The Scientific and Computational Work of the Grand Challenge: Four Dimensional Data Assimilation

Four Dimensional Data Assimilation (4DDA) involves using models of the Earth System (atmosphere, land surface, and ocean, i.e., climate models) and estimation-theoretic methods for melding real-world observations into the model. 4DDA attempts to provide the best estimate of the evolving state of the Earth System by extracting the maximum amount of information from the available observations.

2.a The Goals

The primary objective is to use the computing load of 4DDA to explore the high-performance computing limits of: processor speed, main-memory volume, memory-access speed (including inter-processor as well as on-processor communication bandwidths), and I/O. In the coming years the computing requirements of the DAO are estimated to be:

Year    Sustained gigaFLOPS    Storage (terabytes)
1995      0.5                  13
1998     50.0                  43
2000    150.0                  70

Table 1

These numbers were obtained from present-day performance figures and conservative projections based on the realistic trends in Mission to Planet Earth (MTPE) and the Earth Observing System Data and Information System (EOSDIS) in a budget-constrained environment. Figure 1 shows a schematic of the current Data Assimilation System (DAS).

Assimilation System Schematic

Figure 1: A schematic of the DAO data assimilation system. Input and output data types are representative of the current system. Note the larger, more complete output datasets.

In pursuing the primary goal, the scientific objective is to meet the programmatic demands on the Goddard DAO: to provide accurate assimilated Earth science datasets to the scientific community; to incorporate the larger data volumes and data types that are becoming available in MTPE; and to perform research into advanced techniques of 4DDA. Of particular importance are 'reanalyses' of decadal datasets up to the present, and the need to produce them with rapid turnaround. In 1998 the DAO will deliver an operational data assimilation system to EOSDIS, with continuing support beyond that.

4DDA involves three distinct processes: first, the model performs a 6-hour forecast; then, the observational data are subject to quality control (QC); finally, a statistical analysis is performed to meld the observations with the forecast. The quality control and statistical melding of the data are often referred to as 'the analysis'. The Phase I DAO Grand Challenge project has worked in close collaboration with the parallel model development teams of Suarez and Mechoso. Hence, the present work deals mostly with the (potentially more computationally demanding) requirements of the analysis.
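
The three-step cycle described above (forecast, quality control, statistical analysis) can be sketched as a toy loop. This is a minimal illustration under loud assumptions, not the DAO system: the "model" is simple persistence, the quality control is a fixed gross-error tolerance against the forecast, and the "analysis" nudges the forecast toward each accepted observation with a constant weight instead of using error statistics. All function names and parameter values are hypothetical.

```python
def forecast(state, persistence=1.0):
    """Placeholder for the 6-hour model forecast (assumed: scaled persistence)."""
    return [persistence * value for value in state]


def quality_control(observations, background, tolerance):
    """Keep (index, value) observations that lie within `tolerance` of the forecast."""
    return [(i, y) for i, y in observations if abs(y - background[i]) <= tolerance]


def analysis(background, observations, weight):
    """Meld accepted observations into the forecast with a fixed weight in [0, 1].

    A real statistical analysis derives the weights from forecast and
    observation error covariances; a constant is used here only to keep
    the sketch short.
    """
    result = list(background)
    for i, y in observations:
        result[i] += weight * (y - result[i])
    return result


def assimilation_cycle(state, observations, tolerance=5.0, weight=0.5):
    """One cycle: forecast, then quality control, then analysis."""
    background = forecast(state)
    accepted = quality_control(observations, background, tolerance)
    return analysis(background, accepted, weight)


# Three grid points (e.g. temperatures in kelvin) and two observations,
# the second of which is a gross outlier that quality control rejects.
state = [280.0, 285.0, 290.0]
observations = [(0, 282.0), (2, 350.0)]
print(assimilation_cycle(state, observations))
```

In this toy run the first observation is accepted and pulls grid point 0 halfway toward it, while the outlier at grid point 2 is rejected and leaves the forecast value unchanged.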

2.b Technical Approach

The work on 4DDA (especially the PSAS and Kalman filter algorithms) is computationally intensive in terms of memory speed, volume, and data throughput. Hence the applications are suited to test most aspects of high-performance computing in the coming years. The Data Assimilation Office, under Principal Investigator Richard Rood, has formed a team of interdisciplinary Earth and computer scientists from NASA/Goddard Space Flight Center (GSFC), NASA/Jet Propulsion Laboratory (JPL), University of Maryland College Park (UMCP), and the Northeast Parallel Architectures Center (NPAC) at Syracuse University. In support of the HPCC and DAO's goals, scientists have studied identifiable segments of the Data Assimilation System (DAS).

Apart from I/O, the principal segments of the DAS are the models, the data quality control modules, and the statistical analysis-solve routines. For the models the DAO has deferred to the above-mentioned Grand Challenge teams, who are using plug-compatible versions of the codes. The DAO Grand Challenge team has studied the parallel efficiency and implementation of semi-Lagrangian models. Jose Zero has coordinated the integration of the models with the analysis algorithms, as well as the work on the semi-Lagrangian models. At JPL, Hong Ding, Robert Ferraro, and Don Gennery have worked on the parallel message-passing PSAS operational analysis code, using the Intel computers at the California Institute of Technology. At the University of Maryland and GSFC, Peter Lyster, Steven Cohn, Richard Menard, and L.-P. Chang have worked on the developmental Kalman filter algorithm; they have also used the Intel machines. Work on the quality control modules and the production Optimal Interpolation (OI) analysis-solve routines has been performed by Gregor von Laszewski, a PhD student at Syracuse, who uses message-passing software on IBM SP-2 and DEC Alpha clusters. Miloje Makivic has studied the quality control modules using data-parallel software on the CM-5.

2.c Accomplishments

As of 1992 none of the DAO codes had been ported to massively parallel processors, nor was there any in-house parallel computing expertise. The distributed management of the multi-institutional project was conducted with considerable success.

A summary of significant achievements is:

2.c.1 The advanced Physical space Statistical Analysis System (PSAS) represents a challenge to both the floating-point speed and the physical memory of the future analysis system at the DAO. With the current world-wide observational network that provides about 100,000 observations every six hours, the storage of a correlation matrix requires around 20 gigabytes (double precision). In preliminary work on PSAS at JPL, Hong Ding, Don Gennery, and Robert Ferraro have achieved a speed of 18.3 gigaFLOPS for the key analysis-solve routines (matrix generation and solve) on 512 processors of the Intel Paragon. This equates to at least a 30-fold speedup over the same code on a single processor on the CRAY C90. This is expected to enable the metric of 30 days of assimilated datasets per wall-clock day to be maintained in the coming years.
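
For readers unfamiliar with the conjugate gradient method named as the key solve component, the following is a minimal serial sketch in plain Python. It is not the JPL code: the production algorithm is a message-passing implementation over a decomposed matrix, and the function names, tolerance, and the small test matrix here are illustrative assumptions. The method requires a symmetric positive-definite matrix.

```python
def matvec(a, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def conjugate_gradient(a, b, tol=1e-10, max_iter=None):
    """Solve a x = b for symmetric positive-definite `a`, starting from x = 0."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    x = [0.0] * n
    r = list(b)            # residual b - a x, with x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 <= tol:
            break          # residual norm small enough
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]
        rs_old = rs_new
    return x


# Small symmetric positive-definite example (values chosen for illustration).
a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(a, b)
print(solution)
```

Each iteration is dominated by one matrix-vector product plus a few inner products, which is why the method suits a data decomposition across processors: the product can be split by blocks of rows, leaving only the inner products and the exchange of vector segments to be communicated.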

2.c.2 The Kalman filter (KF) represents a rigorous approach to 4DDA that minimizes the ad hoc approximations. As such, the algorithm dynamically evolves not only the system state (wind, temperature, etc.) but also the error correlations between these quantities. Therefore, this algorithm also makes use of considerable floating-point speed and main memory. A two-dimensional Kalman filter was developed by Peter Lyster, Steven Cohn, Richard Menard, and L.-P. Chang (Lyster et al. 1995) to study the transport of trace chemical constituents in the middle atmosphere. For reference, a single correlation matrix for an application that uses 2 degrees of horizontal (latitude-longitude) resolution takes one gigabyte of memory (single precision). This is the first known implementation of a 'brute-force' Kalman filter. This Kalman filter algorithm represents a key advance in science and technology that could not have been performed without the development of large-memory, high-speed parallel computers. The peak speed at horizontal resolution of 2 degrees on 512 processors of the Paragon is 1.3 gigaFLOPS at single precision (1 hour of wall-clock time per day of assimilation).
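
The one-gigabyte figure can be checked with a short back-of-envelope calculation. The sketch below assumes a single two-dimensional field, a full (not symmetric-packed) matrix, 4 bytes per single-precision value, and grid dimensions that are assumptions of this example: 91 latitudes by 180 longitudes for a 2 x 2 degree grid, and 91 by 144 for the 2 x 2.5 degree grid mentioned in the Figure 2 caption. The function name is illustrative.

```python
def dense_matrix_bytes(n_lat, n_lon, bytes_per_value):
    """Bytes needed to store a full n x n matrix, where n = n_lat * n_lon grid points."""
    n = n_lat * n_lon
    return n * n * bytes_per_value


GIB = 2 ** 30  # bytes in one binary gigabyte

# Assumed 2 x 2 degree grid (poles included), single precision.
print(dense_matrix_bytes(91, 180, 4) / GIB)

# Assumed 2 x 2.5 degree grid, single precision.
print(dense_matrix_bytes(91, 144, 4) / GIB)
```

Both assumed grids give a matrix of order one gigabyte. Because the storage grows with the square of the number of grid points, halving the grid spacing in both horizontal directions multiplies the memory requirement by roughly sixteen.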

Scaling Graph

Figure 2: Scaling for the Kalman filter on the Intel Paragon computer in terms of the gigaFLOPS versus the number of processors for both medium (4 x 5 degrees) and high (2 x 2.5 degrees) horizontal resolution (this uses the optimal form of the Kalman filter).

2.c.3 As part of the interdisciplinary collaboration with Syracuse University's Northeast Parallel Architectures center (NPAC), Gregor von Laszewski is completing a PhD thesis in computer science with Earth science applications. He has studied aspects of the optimization and load-balancing for the operational Optimal Interpolation (OI) analysis system. His algorithm has achieved 400 megaFLOPS for the analysis of 80,000 observations on 40 processors of an IBM SP-2. This work, and a subsequent publication, are expected to form the basis for future parallel analysis algorithms, including ocean data assimilation codes.
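
To illustrate why spatially inhomogeneous observations create a load-balancing problem, the sketch below distributes regions with very different observation counts over a fixed number of processors. It uses a generic greedy heuristic (heaviest region first, always to the least-loaded processor) and is not the method developed in the thesis. The region names and counts are invented for illustration, and the observation count is only a proxy for cost, since the real cost of an analysis solve does not grow linearly with the number of observations.

```python
import heapq


def balance_regions(obs_counts, n_processors):
    """Greedy assignment of regions to processors by observation count.

    Returns (assignment, loads): the regions given to each processor and
    the total observation count each processor ends up with.
    """
    if n_processors < 1:
        raise ValueError("need at least one processor")
    # Min-heap of (current load, processor id): the least-loaded is popped first.
    heap = [(0, p) for p in range(n_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_processors)}
    by_weight = sorted(obs_counts.items(), key=lambda item: item[1], reverse=True)
    for region, count in by_weight:
        load, p = heapq.heappop(heap)
        assignment[p].append(region)
        heapq.heappush(heap, (load + count, p))
    loads = {p: sum(obs_counts[r] for r in regions)
             for p, regions in assignment.items()}
    return assignment, loads


# Hypothetical observation counts: dense over land, sparse over oceans.
obs_counts = {"europe": 900, "n_america": 700, "asia": 400,
              "pacific": 100, "s_ocean": 50}
assignment, loads = balance_regions(obs_counts, 2)
print(assignment)
print(loads)
```

A naive split into equal-area regions would leave the processor holding the data-dense regions with most of the work; weighting the assignment by data volume brings the two loads in this example to within about five percent of each other.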

2.c.4 Working on the quality control part of the analysis algorithm Miloje Makivic at Syracuse has achieved 430 megaFLOPS (3.1 gigaFLOPS peak) using data-parallel code on the Thinking Machines CM-5. This exceeded the metric that was set in 1992.

Speedup Graph

Figure 3: Speedup curve for the quality control module of the parallelized Optimal Interpolation analysis system, using 80,000 observations of height and wind. This was implemented using Message Passing Interface (MPI) software on the IBM SP-2.

2.c.5 In May 1994 a workshop on the parallelization of semi-Lagrangian General Circulation Models was held at GSFC under the sponsorship of the DAO, Argonne National Laboratory, and the National Center for Atmospheric Research (NCAR). Working at GSFC, Jose Zero has shown that flow-dependent interprocessor communications cause a high overhead that does not scale well to large numbers of processors with current technology. Memory management and index computations also led to poor on-processor FLOPS performance.

2.c.6 Miloje Makivic parallelized the operational van Leer transport algorithm that is in use at the DAO. Speeds were obtained of 2.5 gigaFLOPS sustained and 6.8 gigaFLOPS peak on a 256-node CM-5, despite high communication latencies and a horizontal resolution of 2 degrees.

2.c.7 Jose Zero at GSFC has ported the sequential PSAS code to a DEC Alpha 2100 in order to study aspects of the evolving (in particular, RISC-based) computing environment. The DEC Alpha yielded a rate of 50 percent of the speed of PSAS on a single processor of a CRAY C90.

2.c.8 A number of modules have been submitted to the HPCC Software Exchange, including data decomposition for data-parallel transport codes and a global matrix solve suitable for the Kalman filter. The latter was a key algorithm that led to the understanding that global transpose operations are not necessarily detrimental to algorithmic performance on parallel computers, provided they are performed efficiently. This is borne out in the successful implementation of the Kalman filter and has been commented on by Peter Lyster and other researchers for more general applications at a recent workshop on Numerical Weather Forecasting at the European Centre for Medium-Range Weather Forecasts (ECMWF) (Hoffman G-R. and N. Kreitz, 1995).

N2O Analysis Visualization

Figure 4: The Kalman filter is presently being used for studies of nitrous oxide transport in the middle stratosphere using data from the Upper Atmosphere Research Satellite (UARS). The figure shows the nitrous oxide mixing ratio plotted at the 850 Kelvin isentropic surface on September 10, 1992, obtained by using the KF to assimilate UARS data over the 4 previous days. The corresponding relative error of the mixing ratio, an important by-product of the KF, is also contoured in the figure. Consistent and reliable estimation of errors is one of the major scientific motivations behind pursuing the KF and is an essential attribute of datasets that will be used to understand global change.

2.d Status/Plans

The major focus in the coming years will be the development of the entire end-to-end system into a state-of-the-art computing environment in 1998 (refer to Table 1 for the expected computing requirements). The DAO plans to implement a large part of its production codes on RISC systems such as Silicon Graphics, IBM SP-2, and CRAY T3D. Work at Syracuse will focus on data-parallel software aspects of the analysis, while at JPL the message-passing conjugate gradient algorithm will be advanced. The DAO will soon convert to 1 degree horizontal resolution by 60 vertical levels, while keeping the 30 days per day metric intact. Work on the advanced Kalman filter will involve optimizing and implementing the code on faster machines with larger memory. In doing so the DAO will collaborate with vendors on the most efficient algorithms and advance the science of Kalman filtering.

3. References

Hoffman G-R. and N. Kreitz, Eds., Coming of Age: Proceedings of Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific, 303-318, (1995).

Lyster P., S.E. Cohn, R. Menard, L.-P. Chang, S.-J. Lin, and R. Olsen, 1995: An Implementation of a Two-Dimensional Kalman Filter for Atmospheric Chemical Constituent Assimilation on Massively Parallel Computers. Submitted to Mon. Wea. Rev., 26 pp.