
To parallelize simulation codes while retaining portability across a wide range of parallel computer architectures.
In the world of computational science, the trend is towards simulations of higher and higher spatial and temporal resolution. In order to run these codes efficiently, it is necessary to parallelize them well. However, it is also desirable to write code that can be easily ported between various parallel computer architecures.
BSP (Bulk Synchronous Parallel) is a parallel programming model that serves both the needs of performance and portability. A BSP program is split into segments of code called "supersteps". Communications are always initiated within supersteps, but they are not guaranteed to be finished until the end of that superstep. Thus, if you bring data from a remote process into your address space during a particular superstep, you cannot access that data until after the superstep is over, because you cannot be guaranteed that your data retrieval has been completed.
The Oxford BSP Library is the implementation of BSP used for this project. It has the advantages of:
The architecture used for much of this project was the Beowulf parallel workstation. The Beowulf used for this project consisted of 14 Pentium 100's running Linux 2.0 and connected with a 100 Mbit/s Ethernet cable.
My accomplishments were threefold:
In order to make intelligent implementation decisions when programming a parallel machine, the programmer must know certain things about the hardware and software platform. When using BSP, it is important to know:
Towards this end, I did a study of the characteristics of Oxford BSP on a Beowulf. Results are shown below. Figure 1 is a graph of barrier latency, the cost incurred by each superstep in a BSP program. With 14 processors, the barrier cost is 3300 microseconds, the time it takes for each processor in the Beowulf to perform 30,000 floating-point operations.

Figure 1: Barrier latency
Figures 2 and 3 show bandwidth measurements. The peak aggregate bandwidth achieved on the Beowulf was 80 Mbit/s with 14 processors. It was found that the granularity of messages did not much affect the bandwidth, i.e., it was not significantly slower to send many small messages instead of one big one. In general, the aggregate bandwidth could be predicted with the formula:
B = (80 Mbit/s) / ((1 + (200 Mbit)/C) * (1 + (3.2/P)^2)
where B is the bandwidth (in Mbit/s), C is the amount of communication within a superstep performed by one processor (in Mbits), and P is the number of processors active. The graphs in Figures 2 and 3 assume that C is very large.

Figure 2: Single-processor bandwidth

Figure 3: Aggregate bandwidth
Methods for sorting data are needed in parallel programs just as much as in serial programs -- more so, in fact. Parallel sorts are important to simulation codes because they are a versatile and efficient way of achieving data locality, which is necessary for efficient use of memory and thus for efficient programs. In order to make the writing of physics codes in BSP easier, I implemented a parallel sort.
The algorithm of the sort was Parallel Sorting by Regular Sampling (PSRS). PSRS is in a class of sorts called bin sorts, which are insensitive to communications latency.
The sort performed quite well. Figure 4 is a speedup curve for the sort in two different programming models (BSP and MPI) on the Beowulf. The sort does not scale past 10 processors because of bandwidth limitations on the Beowulf.

Figure 4: Speedup curves for a PSRS sort on a Beowulf
Figure 5 is a speedup curve for the sort on the Convex Exemplar (SPP-1000). Without the bandwidth restrictions, the sort scales quite well. It is currently the fastest sort on both of these machines.

Figure 5: Speedup curves for a PSRS sort on a Convex Exemplar
The sort was used in a finite-element code that Dr. Clark Mobarry and I worked on. Currently the code simulates gas dynamics, but it could easily be modified to work in other areas, such as magnetohydrodynamics. As a small demonstration of what it can do, we ran a simulation of a shockwave hitting a wall at an angle. The results of the simulation are available as an 800K MPEG file. The sort was used to implement a procedure called Morton ordering, which is a kind of multidimensional sorting. The data segmentation and distribution techniques developed for the sort were also applied in the area of the code that kept the variable coherent across all of the processors.
The sort was also used by Dr. Kevin Olson when he ported a particle code to BSP. He is using this particle code to simulate colliding galaxies.
Student Investigator:
Jason Crawford
Homeschooler
crawford@maxwell.gsfc.nasa.gov
Mentors:
Dr. Clark Mobarry
Goddard Space Flight Center
Clark.Mobarry@gsfc.nasa.gov
301-286-2081
Dr. Kevin Olson
George Mason University
olson@jeans.gsfc.nasa.gov
301-286-8707
Table of Contents | Section Contents -- Applications | Subsection Contents -- Inhouse Computational Scientists