P.M. Lyster1, J. Abeles2, L.-P. Chang3, S.E. Cohn, K. Ekers3, J. Guo3, M. Harber3, D. Lamich3, J.W. Larson1, R. Lucchesi3, R. Ménard4, R. Rood, S. Schubert, W. Sawyer1, M. Sienkiewicz3, A. da Silva, J. Stobie3, S. Swift2, L.L. Takacs3, R. Todling,3, J. Zero3
Email to lys@dao.gsfc.nasa.gov
http://dao.gsfc.nasa.gov/DAO_people/lys
Additional Affiliations C.H.Q. Ding
R. Ferraro
This presentation discusses the NASA data assimilation project
at the Data Assimilation Office at
the NASA/Goddard Space Flight Center. The goal is to produce
accurate gridded datasets of atmospheric fields by assimilating
a range of observations along with physically consistent model
forecasts. This work produces
datasets that are used by the climate research community.
The data come from conventional sources that are used for
weather forecasts (e.g., radiosondes, earth-surface measurements, and
satellite temperature retrievals), as well as new sources
such as satellites that will be launched under the Mission To
Planet Earth Enterprise. An end-to-end Goddard Earth
Observing System (GEOS) Data Assimilation System (DAS) currently
supports stratospheric flight missions and reanalysis projects
for NASA. The current Core of this system (Model, and Analysis) is
a multitasking algorithm that runs on Cray J90 and C90 computers
at Goddard and NASA Ames Research Center. Future Core computing will be
carried out at Ames, with a new production system scheduled to
be ready for the EOS AM-1 satellite launch in June of 1998.
The DAO has acquired SGI Origin 2000 computers, with an
aggregate of 160 processors in place at Ames, and more planned for
the future. The DAO is currently updating the control scripts and programs,
and implementing a modular Fortran 90 Core system.
During 1998 the Core system will be migrated to distributed-memory software
using the Message Passing Interface.
Part of this work is being carried out under the NASA High Performance
Computing and Communications Earth and Space Sciences
program.
The algorithmic and performance
issues involved in Core system are the main subject of this presentation.
Note that this version of the paper has a longer list of authors, representing a more complete list of those who contributed to the final oral presentation in November 1997 at Supercomputing97.
1. Introduction
2. The GEOS General Circulation Model
3. The Physical-space Statistical
Analysis System (PSAS)
3.1 The Prototype Parallel
PSAS
3.2 Summary of Timings
for the Prototype Parallel PSAS
4. Parallel I/O
5. Timings and Optimization for Core GEOS DAS
6. Advanced Data Assimilation: The Kalman Filter
7. Conclusions and Further Work
Appendix: Web References
References
Atmospheric data assimilation produces accurate gridded datasets that are
used by meteorologists for weather analysis and forecasts; it is also
being used as a tool for climate research.
The National Aeronautics and Space Administration (NASA) Data Assimilation
Office (DAO) at the Goddard Space Flight Center conducts
reanalysis of archived earth-science datasets as well as real-time mission support
analysis and forecasts for the climate research community.
Data assimilation produces accurate gridded datasets of
atmospheric fields by assimilating a range of observations along with
physically consistent model forecasts. The DAO's Goddard Earth Observing System
(GEOS) Data Assimilation System (DAS) is described in the Algorithm
Theoretical Basis Document[DAO, 1996].
The DAO is preparing to move GEOS DAS
to distributed-memory parallel computing platforms.
This will be used for DAO's normal activities, and will also
be an important for of the Mission to Planet Earth (MTPE)
Enterprise in the coming years.
The end-to-end GEOS DAS is shown schematically in Figure 1.
The present DAS,
whose Core system runs on multitasking Cray C90 and J90 computers, is
fed from the Global Telecommunication System (GTS). Meteorological
data, including satellite-retrieved temperature profiles, are given
to the Goddard Distributed Active Archive Center (DAAC). These
data proceed through a data reduction and preprocessing stage before being
ingested by the Core system. In the future, new datatypes will be
provided via the EOS Data and Information System (EOSDIS) Core System (ECS).
Postprocessing stages include: Quality Assurance of Data Sets (QuADS)
that inspects the gridded data sets produced by the Core system for
I/O error and general physical consistency;
Adaptive Tuning of Error Statistics System (ATESS) that produces observation
and forecast error statistics for the GEOS DAS; and DAO On-Line Monitoring
System (DOLMS) that monitors observation and gridded data streams and
makes them available for graphical presentation. There are a number of
different modes of operation for the DAS. Briefly, mission support
involves real-time, or First-Look, data assimilation and sometimes
the production of up to 10-day model forecasts. For current data sets
made available from the Goddard DAAC, this mode of operation ingests about
50 megabytes of data per day into the Core system. In the coming year,
satellite-retrieved profiles of atmospheric parameters will be produced as
part of the DAS preprocessing system, which will increase the data
ingest rate to about 1 gigabyte per day. The output analysis
(gridded) datasets are about 1 gigabyte per day in real-time mode, while
the production of model-forecast fields can increase this quantity by over
an order of magnitude. Periodically the DAO conducts reanalysis projects that
involve multi-year analysis whose data sets are then studied and distributed to
the climate research community. In this mode of operation, DAO plans for
a production rate of 30 days of assimilation per wall-clock day.
The end-to-end GEOS DAS is a complex operation that is being updated using
modern Process Engineering methods.
There are about 100 people directly involved in DAO
research and operations. This need for Process Engineering has
become apparent in a complex environment where issues such as
software Configuration Management (CM), version control and
maintenance, multiple hardware platforms, and the coordination of
distributed personnel are important.
GEOS DAS uses a grid-point based atmospheric general circulation model
GCM[Takacs et al., 1994].
Other physical processes are computed using
parameterized models for turbulence, short- and long-wave radiation,
moisture, and land surface
processes.
Analysis algorithms are used to combine the inherently unstructured
data, with (hopefully) known errors, into the structured
models whose errors are also quantified. This can be
regarded as an problem in the field of stochastic modeling
and estimation. One of the difficulties of
atmospheric data assimilation arises from the large size of
the state space. Current GCMs have more than 107
gridpoints and there are approximately 105 meteorological observations
(including satellite-retrieved temperature profiles)
reported world wide in a typical six-hourly (or synoptic)
assimilation period.
The current analysis at the DAO is an on-line Quality Control (QC) algorithm
and the Physical-space Statistical Analysis System (PSAS)
[Cohn et al.,1997].
A key part of PSAS performs an iterative
conjugate-gradient solve of a moderately dense (26% full)
105 × 105 matrix.
The heritage code for the Core system is mostly FORTRAN 77.
As part of DAO's preparation to provide support to MTPE in 1998, and as part
of a Grand Challenge project that
is being funded by the NASA High Performance Computing and Communications
(HPCC) Earth and Space Sciences (ESS) program,
a Core system that is based on the Message Passing Interface (MPI)
parallel library is being developed
using a Fortran 90 modular approach -- the Goddard Earth
Modeling System (GEMS)[DAO, 1997].
Key components of Core GEOS DAS, including
the GCM (in collaboration with Max Suarez at GSFC) and an early version
PSAS, have been parallelized
[Ding and Ferraro, 1995]
[Ding and Ferraro, 1996]
(in this paper, we shall refer to this parallel
version of the early PSAS as the prototype parallel PSAS). These form
the basis for the design of their respective parts of the parallel system.
The Core system is being optimized and run on
the SGI Origin 2000 systems at Ames Research Center and the Cray T3E
testbed computer at GSFC. This paper will
focus on the implementation and optimization of the Core distributed-memory (MPI)
algorithms, including issues of parallel I/O.
Other work at the DAO involves using parallel computing to
study advanced methods of data assimilation such
as the Kalman filter.
This work is being carried out in the context of the assimilation of
trace (constituent) gases such as ozone and methane;
details of ongoing work will be discussed briefly.
The GCM uses a finite-difference algorithm (the dynamical core)
to integrate the primitive equations.
The model uses a geographic spherical coordinate
grid with horizontal spacing of
The physical parameterizations involve dependencies only in the vertical
coordinate, so the GCM can be parallelized most easily
with a horizontal domain
decomposition (checkerboard)[Dannevik, 1993] [Mirin, et al., 1992]
similar to that commonly used in general
circulation [Drake et al., 1994] and regional mesoscale circulation models [Foster and Michalakes, 1993].
Prototyping work on a parallel implementation of the GEOS GCM dynamical core
has been done successfully with the collaboration of Max Suarez.
The remaining work
is to incorporate the operational versions of the physics parameterizations
into the parallelized dynamical core. The design and status of this
work are discussed extensively in
Lyster et al., 1997a.
The process of analysis generates
a best estimate of the state of the atmosphere. Typically, analyses of
meteorological variables are performed for six hourly (synoptic)
sets of observations that are collected from the Global Telecommunication
System (GTS). The GCM is used to provide a 6 hour forecast. These forecasts
(at 0Z, 6Z, 12Z, and 18Z) may be regarded as estimates of the state
of the atmosphere that are augmented by observations to form an analysis.
Let
The innovation equation
and the analysis equation
where
The analysis is on a
PSAS solves a global coupled large matrix problem. The innovation matrix
The innovation equation is solved using a preconditioned conjugate
gradient (CG) algorithm [Golub and Van Loan, 1989] [da Silva and Guo, 1996].
This is an
The large memory
that is available on parallel computers,
such as the Intel Paragon or the Cray T3E, affords
the possibility of calculating and storing the entire
forecast error covariance matrix once per analysis
cycle. This has the potential for considerable speedup over
the serial approach, in which the matrix was reevaluated
for each iteration of the CG algorithm. This
capacity, along with the need to distribute floating-point operations,
provides motivation for a parallel implementation
of PSAS.
The algorithm of the parallel version of PSAS comprises five
steps: 1) partition observations using inertial recursive
bisection scheme (this is used for both the parallel online QC and for the
parallel PSAS algorithm); 2) block matrix distribution of the innovation
matrix M; 3) solve the innovation equation using the CG method;
4) partition of
The recursive inertial bisection algorithm divides p observations
among
The inertial division guarantees some degree of compactness
of the resulting decomposition, and results in approximately equal
numbers of observations in each region. This algorithm
requires that the number of processors is a power of 2, and
less than or equal to the number of regions;
In the matrix block distribution, the parallel decomposition of observations
is used
to generate a list of matrix blocks. This is used to determine on which
processor a matrix block will reside. Figure 2
shows a schematic
of the block structure of the innovation matrix; this is color
coded according to the processor on which each bock resides.
A simulated annealing algorithm is used to ensure that both the
work in generating matrix blocks and the subsequent matrix-vector multiply
are load balanced.
For the prototype parallel PSAS, there are typically 512
observation regions containing approximately
p ~ 105
observations. The 6,000 kilometer correlation cutoff
condition gives rise to a forecast error covariance matrix
that is approximately 26% full. Therefore there are
about 5122 × 0.26 × 0.5
~ 35,000
blocks (the factor of 0.5 accounts for the symmetry of
the matrix).
Recall that each matrix block in the innovation matrix is associated
with one observation region in the case of diagonal blocks, or two
regions in the case of off-diagonal blocks. Each block is stored
with the associated vector segment of x' that must
be used to perform the matrix-vector multiply. Diagonal
segments are stored on the processor that stores the observation
for that region. For the off-diagonal blocks,
the number of vector segments on each processor
is minimized by storing the block on one of the two
processors that owns the observations corresponding to
one of the dimensions of the block. An iterative load
balancing algorithm is then used to equalize the memory and floating-point
costs on each processor. Application of this regimen results
in a load imbalance of approximately
The CG algorithm uses block preconditioning,
and is described in Golub and Van Loan [Golub and Van Loan, 1989] and
da Silva and Guo [da Silva and Guo, 1996]. In the parallel
partitioner, the innovations are separated into regions
that are distributed among the processors. Taken together, these
vector segments provide
initial condition for the vector iterate x'.
Depending on how the off-diagonal blocks are distributed,
some of these vector segments must be replicated
on multiple processors. At each step in the iteration, the
intermediate solution vector x' must be assembled using
global communications. Since only about a third of the
processors contain data needed to calculate a particular set of
vector segments of x', it is more efficient to do explicit message
passing to achieve this end rather than use a global summation
such as MPI_REDUCE.
The analysis increment (Eq. 2) is
The first step in the process is an equal-area rectangular
partition of the analysis grid into analysis regions. Next,
analysis gridpoints are thinned longitudinally,
resulting in an equal number of analysis gridpoints in each analysis
region. Therefore,
A detailed design and the current status of parallel PSAS can be found in
Lyster et al., 1997b.
Timings were performed for the prototype parallel PSAS for
80,000 observations (analysis grid
The prototype parallel PSAS has also been implemented on a 512 processor Cray
T3D at the NASA Goddard Space Flight Center, and the scaling behavior of the
code on this platform is presented in Figure 3.
This work was performed as part of the 10 gigaflop/s milestone requirement
for the 4DDA Grand Challenge HPCC ESS project
[Lyster et al., 1997c].
The dots at 128 and 256 processors depict performance for a run with 51,990
observations, while the diamonds at 256 and 512 nodes are for a run
using 79,938 observations. The two different numbers of observations were
used because there was not sufficient combined memory on 128 nodes
to accommodate the larger number of observations.
The input and output of data to the GEOS Data Assimilation System
(GEOS DAS) is the responsibility of the GEOS Parallel I/O Subsystem
(GPIOS). In previous versions of the GEOS system, sequential I/O was
used on shared-memory computers. Implementation in a parallel environment
with a software system that uses a distributed memory programming model
requires more attention to performance details since I/O could generate
significant network traffic and cause a bottleneck that interferes with
scaling of the application.
The performance requirements for GPIOS are derived from the overarching
GEOS throughput requirements along with requirements imposed by the
target computing environment and programming model. There are turnaround
requirements imposed by the First-Look analysis, an
assimilation that will execute close to real-time in support the
EOS Instrument
teams. The First-Look Analysis will be run once a day starting in June 1998.
It includes a 10 day forecast and will produce over 5 gigabytes of data,
but this is not the mode of operation that will most heavily stress
the I/O system. In
reanalysis mode, the GEOS DAS must perform at a rate of 30 days of
assimilation per one day of wall-clock time. This will require daily input
of more than 1.5 gigabytes and will produce 25 to 35 gigabytes of gridded
output data in HDF-EOS format. It is clear that I/O will be a bottleneck for
this system unless some type of parallel I/O methods are used.
Traditionally, I/O in the MPI-based GCM was done by the root compute processor.
The decomposed data were gathered from the array of processors to the single
root processor, where they were stored in global arrays. As the global fields
were gathered on the root processor, they were written to disk. The inverse
of this process was used for reading the input data. Since the volume of the
output data is an order of magnitude greater that the volume of input data, the
DAO will concentrate on implementing a parallel I/O scheme for output first.
This scheme will use the traditional single-processor output method described
above and modify it in two ways. First, there will be more than one processor
designated for doing output. Second, the output processors will be in an MPI
group distinct from the other compute processors and will be dedicated to I/O
and related functions. They will act as a cache for output data, allowing
computation to continue once the MPI transfer of data from compute processors
to I/O processors is complete; this is shown schematically in
Figure 4. In the figure, the top group of processors is dedicated to the
primary computation. At designated times as the assimilation moves forward
in time, output
is performed. Distributed data on the compute processors are passed
to some number of I/O processors via an MPI communicator. Once the
distributed data have been gathered into global data fields on the I/O
processors, the compute processors are released to continue computation.
Meanwhile, the I/O processors make data and format transformations and
physically write the data to disk files.
One advantage of such an approach
is that it allows data transformations in preparation for output to occur
locally on I/O processors without slowing the forward progression of the
assimilation system. Some of these transformations would require extensive
global communication if they were done in a distributed environment making it
convenient to have the data gathered into global arrays on an I/O processor.
As more is learned about the support and performance of advanced parallel
I/O tools for our hardware platform, DAO will consider the use of those
libraries provided their inclusion does not severely
affect the portability of code.
A leading candidate is MPI-IO, which is included in the recently completed
MPI-2 standard.
GEOS DAS must meet performance
rates of 50 and 100 gigaflop/s as part of the NASA HPCC project.
Already, the prototype parallel PSAS has achieved
11.0 gigaflop/s for realistic problems on 512 processors
of the NASA Goddard Cray T3D
[Lyster et al., 1997c].
Parallel GEOS GCM has
achieved 3.7 gigaflop/s (dynamical core) and 8.0 gigaflop/s
(short wave radiation package) on 512 processors of the T3D;
other model modules (turbulence, long wave, land surface) are
expected to lie somewhere in this range of performance.
The increased performance required to achieve the
HPCC milestones will be gained through: the use of
a Cray T3E whose processor speed is significantly
faster than the T3D;
higher number of processors (more than 512) on the T3E;
improved load balance, communications optimization, and
single-processor performance. Communications will be
optimized by reducing the net volume where possible
and concatenating messages in order to amortize latency.
Single-processor performance is particularly important
since the previous versions of GEOS GCM were optimized
for vector computers. Presently we are working on
cache optimization (use of padding, blocking,...),
and the use of BLAS and other low-level
routines. We will focus on optimizing the time-consuming
radiation modules.
The DAO must also meet performance requirements
as part of its commitment to Mission To Planet Earth.
The hardest requirement is that 30 days of earth data must be
assimilated in one wall-clock day. Since there are
four analysis cycles (6 hour forecast and analysis)
per day, this means that we must perform an analysis cycle on average
every 12 minutes. At 18.3 gigaflop/s on the Paragon, the prototype
PSAS takes 157.5 seconds. The QC usually takes
about 10 percent of the analysis, which would add an
additional 15.6 seconds. The model timestep is 3 minutes,
while the physics modules are advanced on longer timescales
(9 minutes for moist processes, 30 minutes for turbulence,
and 3 hours for longwave and shortwave calculations).
At present, GEOS GCM takes about 40% of the time
of the analysis (PSAS and QC) so, in order to achieve
at least 30 days per day turnaround, each timestep must
take no more than 1.9 seconds on average. Timings
for the present serial (vectorized) version of the GCM
indicate that if we can achieve a weighted average 10 gigaflop/s
then the 6 hour forecast will take about 106 seconds. This
time plus the 173 seconds (at 18.3 gigaflop/s) for
the analysis fall well within the requirement of
12 minutes per analysis cycle. It is clear
real-time (First-Look) analysis performance will be achieved
fairly easily on the Origin 2000 systems that are being installed
at Ames. In fact the present multitasking operational system achieves
about 4 days of assimilation per wall-clock day on Cray J90s.
The challenge will be to achieve the higher performance required
of the DAO's reanalysis projects and HPCC milestones.
With the increasing importance of climate research,
there is the need for accurate assimilation of atmospheric
trace gases (constituents) such as ozone or methane.
Typically, constituent assimilation uses prescribed winds that
are provided from a standard DAS, such as GEOS DAS or from
weather centers such as ECMWF and NCEP. Many
institutions regularly study trace gases
this way, and the DAO is planning on providing gridded
ozone datasets as a standard product as part of its support for EOS
missions in 1998. Computationally,
constituent assimilation is not as demanding as a full DAS.
Because of this, the DAO has been able to use the constituent problem as a
testbed for research on the Kalman filter (KF). The Kalman filter is
one of the advanced methodologies for DAS (others include 4DVAR
and Ensemble Filtering) that are being studied worldwide. The main
advantage of the KF is the ability to more accurately calculate the
forecast error covariance matrix Pf. PSAS
(Eq. 1) requires Pf, but the current methodology
approximates this quantity
using statistical methods. The Kalman filter
accurately evaluates Pf based on a dynamical
approach [Lyster et al., 1997d].
The equation for the evolution of Pf is:
where
Apart from code performance mentioned in the previous section, there
are still a number of issues to be resolved.
The prototype parallel PSAS is now
written in C with some FORTRAN 77 subroutines retained.
The version of PSAS under development is written in Fortran
90 using a modular structure with complex data types.
Some portions of
FORTRAN 77 code will be converted to Fortran 90,
and some will be used, unchanged, as low-level subroutines.
These changes will result in code that is more modular, as well
as safe and stable. The Fortran 90 interface to MPI is still under
development [Hennecke, 1996]; this is more of
an issue of portability rather than feasibility.
The formulation of PSAS is maturing beyond its statement in
equations (1) and (2) to include direct assimilation of non-state
variables[Joiner and da Silva, 1996].
This presents the major complication in the process of building
the next parallel version of PSAS. In particular, algorithms and data
structures must be developed to handle observations that are
non-state variables, and their corresponding variables in state space.
These changes will render the load balance problems conquered in the prototype
parallel algorithm more complex, and it will
also be necessary to extend some of the concepts regarding
load-balancing in section 3.1 to accommodate these enhancements.
The prototype parallel PSAS is designed to work on multiprocessor
configurations in which the number of nodes is a power of two. This
restriction can be alleviated through modification of the parallel partition
scheme described in Section 3.1. True flexibility would require a more
general algorithm capable of partitionings other than bisections;
that is, the
ability to divide any region into a prime number of roughly equal pieces
(in terms of the number of observations in each piece). Then dividing a
set of observations into
Once the above challenges are met, a fully functional parallel version
of PSAS capable of assimilating non-state data will result. When this
is combined with the parallel version of the GEOS GCM, and the parallel
QC system under development, a flexible data assimilation system
based on a Core MPI algorithm will be possible.
The original serial version of PSAS was developed by Jim
Pfaendtner. The original design of the parallel PSAS was developed
at the Jet Propulsion Laboratory with help from Don
Gennery, and Clara Chan.
Edith Huang of JPL provided valuable help with parallel programming issues.
We would also like to acknowledge useful
discussions with Andrea Molod, Sharon Nebuda, Carlos Pabon-Ortiz,
Max Suarez, and Dan Schaffer at GSFC. The work of Robert
Ferraro (Jet Propulsion Laboratory),
Chris H. Q. Ding (JPL and the Lawrence Berkeley National Laboratory), and
Peter Lyster, Jay Larson, and Will Sawyer at
the Data Assimilation Office was funded by the High
Performance Computing and Communications Initiative (HPCC)
Earth and Space Science (ESS) program.
The
Download a paper [DAO, 1997] that describes the design of the
Download a paper [Lyster et al., 1997a] that describes the design of the
Download a paper [Lyster et al., 1997b] that describes the design of the
Further information about the work on parallel I/O can be found at
Further information about the HPCC ESS Grand Challenge PI work at the DAO can be found at
Download a paper [Lyster et al., 1997d] on the
1. Introduction

Figure 1: The GEOS End-to-end Data Assimilation System.
2. The GEOS General Circulation Model
in latitude by
in longitude,
with a 70-level vertical sigma coordinate.
Physical processes are parameterized using: a land-surface scheme, moisture,
turbulence, and longwave and shortwave radiative transfer
schemes [DAO, 1996][Takacs et al., 1994].
3. The Physical-space Statistical Analysis System (PSAS)
be a vector of p observations that passed the quality control
tests, and
a vector of n forecast variables produced by the GCM.
The objective is to produce an analyzed
state vector
(i.e., a vector of physical
quantities that describe directly the physical state of the atmosphere).
This is accomplished by solving
a pair of equations that comprise the core formulation of
PSAS [Cohn et al.,1997][da Silva and Guo, 1996]:
![]()
![]()
is the specified
forecast error covariance matrix,
is the specified observation error covariance matrix,
represents a generalized
interpolation from the analysis grid to the observations, and
is an intermediate vector in the observation space. Both the
observation and forecast variables are state variables; in the future
this will not always be so [Joiner and da Silva, 1996].
latitude by
longitude spherical grid.
The state variables analyzed include: sea-level pressure
, sea-level horizontal winds
and
, and
the horizontal wind u and v on 18 standard pressure levels
(prototype PSAS uses 14 levels), the geopotential height h, and
the specific humidity q on standard levels greater than 200 hPa.
is
dense, although entries associated with locations that are separated by
several correlation lengths are negligible (zero for
compactly supported correlation functions) [Gaspari and Cohn, 1996]
[DAO, 1996].
In the present multi-tasked serial version of PSAS, block structure is
introduced in M by dividing the observations into 80
equal-area regions via an icosahedral partitioning scheme
[Pfaendtner, 1996]. In order to
introduce some sparseness in
and thus save computational effort,
the correlations beyond a preset cutoff distance are not included.
For a typical 6 hour synoptic (i.e., analysis) period the number
of observations accumulated is p ~ 105.
Hence, with a 6,000 kilometer cutoff between centroids of observation
regions, the innovation matrix is of size
105 × 105, and is
approximately 26% full. This uses
more than 5 gigabytes of storage for a 32-bit representation of the data.
operation, where
is the number of iterations of the CG algorithm. Typically, 8 to 12
iterations (x')
are needed to produce a CG solution whose residual has been reduced
by an order of magnitude.
Experiments in which the residuals
are reduced by more than two orders of magnitude resulted in differences
in the analysis that are
smaller than expected analysis errors [DAO, 1996].
The number of operations in the solution of the analysis equation (2) is
.
3.1 The Prototype Parallel PSAS
; and finally 5) solve the analysis equation.
regions
using bisection on the surface of a sphere.
The data are read in as innovations,
,
and distributed in random order (in equal
numbers) on
processors.
In the first iteration, the observations
are divided along the
orthogonal cut of their combined principal axis moment
of inertia. Successive iterations are
performed in a tree structure that repeats
the decomposition on half the remaining processors
with approximately half the data. At each stage, the
calculation of the principal axis is performed
in parallel. As regions replicate, data are moved between
different processors using split MPI communicators.
Only the geographical locations of the
observations, along with the processor on which the data associated
with the observation itself
resides is communicated during the partitioning process.
Only after the partitioning is complete, are the observations
and their associated attributes
relocated to the processors corresponding to their
respective observation regions. This two-step process minimizes
communications costs.
, otherwise there would
be a load imbalance in the solution of the analysis equation (2).
Typically Np=256, or
512, and Nr=512.
It is possible to modify the
power-of-two
restriction on numbers of regions and processors by moving the
location of the orthogonal cut to some point(s) along the principal axis
to give a division of observations other than halving, and maintain
approximately equal numbers of observations per region.

Figure 2: Schematic of the parallel block structure
decomposition (color coded) of the innovation matrix.
.
, where
represents
the forecast error covariance between the observation grid
and the analysis grid. The domain decomposition
must account for both the unstructured distribution
of the observations and the structured analysis grid.
The solution vector x from the CG algorithm
is decomposed in the same manner as the observations.
The decomposition for the analysis grid is based on the
condition that gridpoints within the 6,000 km cutoff
of the location of an observation are affected by
the value of the corresponding coordinate of the vector x.
may be generated as a number
of equal-sized blocks such that the centroids of their associated regions
are within the 6,000 km cutoff distance of the centroids of each region of
x. A fixed number of analysis regions per processor (usually
1 or 2), along with a uniform distribution of thinned analysis points
renders the partial matrix-vector multiply approximately load-balanced.
3.2 Summary of Timings for the Prototype Parallel
PSAS
levels) on 512 processors of the Intel Paragon at the California
Institute of Technology. On this platform, the solver achieves 18.3
gigaflop/s (77 megaflop/s per processor), which is 36% of peak
performance for
512 processors. This corresponds to 157 seconds to produce a 6 hour
analysis. The calculation of the innovation matrix entries consumes
23.8 seconds, but is performed only once per analysis. The CG solver uses
BLAS level 2
library calls and messages to send vector segments of x'. This
is iterated
times, taking 36.4 seconds. The number of iterations is more
than the value of Ni cited earlier because the version
of PSAS that was parallelized had a more stringent convergence
criterion than subsequent versions of PSAS. The condition cited
in this section is more stringent than was later learned
to be necessary.
The dominant cost of solving the analysis equation (Eq. 2) is
the time spent generating the matrix elements. Since
one dimension of
is n which is larger than p
this takes longer than the generation of the innovation
matrix. The communication (1.5 seconds) involved in reassembling
the analysis grid vector is similar to the assembly of
vector segments in the innovation equation. In sum,
of the CPU time
is spent in communication.

Figure 3: Scaling behavior for the prototype parallel PSAS on a 512 node Cray T3D.
4. Parallel I/O

Figure 4: Schematic of proposed parallel I/O system.
5. Timings and Optimization for Core GEOS DAS
6. Advanced Data Assimilation: The Kalman Filter
![]()
is the analysis error covariance matrix that is
evaluated based on an optimizing principle (such as
minimum variance), M represents the linearized model (e.g.,
transport model) operator,
and Q is the model error covariance matrix.
Both the forecast and analysis covariance
matrices are of size n
n, where n is the
size of the state vector
; note that
n ~ 106 for three-dimensional models, and
n ~ 104 for two-dimensional (horizontal) models.
Even for sparse models this problem has memory and complexity
that scales as
. Full three-dimensional
implementations await tera- and petaflop/s capacity.
The DAO has developed a two-dimensional Kalman filter[Lyster et al., 1997]
to study trace chemicals in the stratosphere, where the
model is relatively simple (passive advection with
prescribed, analyzed winds), and motion is known to
be constrained to isentropic surfaces.
The key component of the algorithm is the distributed-memory (MPI) parallel
implementation of Eq. 3. Figure 5 shows how
the covariance matrix P is first
decomposed along rows (i.e., contiguous groups of columns are stored
on each processor). The evaluation of
then proceeds in a series of steps that involves a global transpose of
the large
matrix MP. Although this involves considerable
global communications, an overall performance of more than 1 gigaflop/s
was obtained on the 512 processor Intel Paragon computer at the California
Institute of Technology, and, more importantly, practical wall-clock times are
being achieved for realistic experiments. The parallel Kalman filter
is being used to study retrieved constituents from the
Upper Atmosphere Research Satellite (UARS)[Ménard et al., 1997].

Figure 5: Schematic of the Parallel Decomposition of a Covariance Matrix.
7. Conclusions and Future Work
regions would involve individual partionings
corresponding to the prime factorization of
.
Acknowledgments
Appendix: Web references
Algorithm
Theoretical Basis Document (ATBD) describes the basic
algorithms of the Goddard Earth Observing System (GEOS) Data
Assimilation System (DAS)
GEOS-3 Data Assimilation System Architectural Design
Parallel
GEOS General Circulation Model (GCM)
Parallel
GEOS Physical-space Statistical Analysis System (PSAS)
Rob Lucchesi's GPIOS site
Peter Lyster's Home Page
Parallel
Kalman Filter,
or go to L.P. Chang's Home Page to view some animations of Kalman filter assimilation runs.
Author Biography
home page,
or contact him via email at lys@dao.gsfc.nasa.gov
Peter Lyster
Fri Aug 15 21:24:51 EDT 1997