To examine performance of Beowulf-class computers for engineering and design applications.
During FY97, JPL built a 16-node Beowulf system (Hyglac) and examined its performance while running five design and analysis engineering codes. This year, we have built upon that work. First, we examined four of the codes on 64 Beowulf nodes, comparing their overall performance with that on the CRAY T3D at JPL and the CRAY T3E at Goddard Space Flight Center (GSFC). We also examined the codes on faster Beowulf CPUs (300 MHz Pentium II processors, as opposed to the 200 MHz Pentium Pro processors in Hyglac). Second, because one of our obstacles last year seemed to be slow mathematical libraries for Linux, we examined a new set of libraries that was released during the last year.
We presented and published the work completed in the previous year:
D. S. Katz, T. Cwik, B. H. Kwan, J. Z. Lou, P. L. Springer, T. L. Sterling, and P. Wang, "An Assessment of a Beowulf System for a Wide Class of Analysis and Design Software," presented at 4th NASA National Symposium on Large-Scale Analysis and Design on High-Performance Computers and Workstations, Williamsburg, Virginia, October 1997, published in Advances in Engineering Software, v. 26(3-6), pp. 451-461, July 1998.
We additionally completed the study of four application codes on 64-node Beowulf systems:
| Code | Description | 64-node performance ratio (T3E run-time / Beowulf run-time using 200 MHz Pentium Pro) | Scaling (16-node performance ratio / 64-node performance ratio) | Single-node percentage speed-up (200 MHz Pentium Pro to 300 MHz Pentium II) |
|---|---|---|---|---|
| Physical Optics Application | Used for design of reflector antennas and telescopes operating at microwave frequencies. | 3 | 1.09 | 40% |
| Electromagnetic Finite-Difference Time-Domain Application | Used to solve Maxwell's Equations in the time domain, for analysis of antenna patterns, radar cross section calculations, and examination of fields within small electronic components and interconnects. | 3 | 1.05 | 20% |
| Electromagnetic Finite Element Solver | Used for similar analysis as the finite difference software. Works in the frequency domain, using a very different algorithm than the FDTD software. | 8 | 1.25 | 40% |
| Incompressible Fluid Flow Application | Used for solving the Navier-Stokes equations for incompressible flow simulations, using a second-order projection method with a multigrid solver. | 7 | 1.95 | 40% |
Larger performance ratios indicate worse code performance on Beowulf compared with the CRAY T3E. Larger scaling ratios indicate worse scaling on Beowulf compared with the CRAY T3E.
We also examined parallel matrix-matrix multiplication, using a new set of optimized mathematical libraries for Linux. The following table shows results for matrix-matrix multiplication using ScaLAPACK's PBLAS parallel routine PDGEMM. The PBLAS routines in turn call BLAS routines for Linux that were developed for the ASCI Red machine.
| Processor Configuration | N = 1000 | N = 2000 | N = 4000 | N = 8000 |
|---|---|---|---|---|
| 2 x 2 | 276 MFlop/s | 436 MFlop/s | 396 MFlop/s | |
| 4 x 4 | 285 MFlop/s | 742 MFlop/s | 1256 MFlop/s | 1329 MFlop/s |

N is the order of the matrices.
For these results, PVM was used as the message-passing layer for the communication library (BLACS). The PvmDirectRoute option was set for better message-passing times; without this option, the performance figures are cut in half.
Performance drops in the 2 x 2 configuration at N = 4000, probably because the block size had to be decreased to 100 owing to memory limitations. On all other runs, the block size was N/2 for the 2 x 2 case and N/4 for the 4 x 4 case, with one exception: for N = 8000, the block size also had to be decreased to 100.
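To make the role of the block size more concrete, the following is a minimal Python sketch of a two-dimensional block-cyclic distribution of the kind ScaLAPACK uses for PDGEMM operands. It is an illustration only, not the code used in these runs: the function names are hypothetical, the first block is assumed to sit on process 0, and the storage estimate counts only three double-precision matrices per process and ignores any PBLAS workspace.

```python
def owner(global_index: int, block_size: int, nprocs: int) -> int:
    """Process coordinate (0-based) owning a global row or column index."""
    return (global_index // block_size) % nprocs


def local_extent(n: int, block_size: int, coord: int, nprocs: int) -> int:
    """Rows (or columns) of an n-long dimension stored on process `coord`.

    Mirrors the logic of ScaLAPACK's NUMROC, assuming the first block
    is placed on process 0.
    """
    full_blocks = n // block_size
    extent = (full_blocks // nprocs) * block_size
    extra_blocks = full_blocks % nprocs
    if coord < extra_blocks:
        extent += block_size          # this process gets one more full block
    elif coord == extra_blocks:
        extent += n % block_size      # this process gets the partial block
    return extent


def local_bytes(n: int, block_size: int, grid: int,
                coord=(0, 0), matrices: int = 3) -> int:
    """Bytes of matrix storage on one process of a grid x grid layout.

    Assumes 8-byte reals and `matrices` square operands of order n
    (three for C = A * B); workspace is not modeled.
    """
    rows = local_extent(n, block_size, coord[0], grid)
    cols = local_extent(n, block_size, coord[1], grid)
    return matrices * rows * cols * 8


if __name__ == "__main__":
    # The two cases in which the block size was reduced to 100.
    for n, grid in ((4000, 2), (8000, 4)):
        print(n, grid, local_bytes(n, 100, grid))
```

Under these assumptions, the two cases that needed the reduced block size (N = 4000 on 2 x 2 and N = 8000 on 4 x 4) both store 2000 x 2000 local pieces, or 96,000,000 bytes for three matrices per process. The block size changes which process owns which rows and columns, but not the size of the local pieces.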
Clearly, the four codes can all be run on 64-node Beowulf machines, but for the last two codes, newer Beowulf CPUs are almost required for reasonable performance. Moving to DEC Alpha CPUs would help all of the codes even more. The last two codes, though, are again near their limits in terms of number of nodes.
While running these benchmarks, we noticed a profound lack of good middleware for Beowulf machines. By middleware we mean the code that the application writer does not write but that the user requires, ranging from simply loading the code on all the nodes and starting the job to scheduling the nodes. This lack was not a large problem on Hyglac, but for larger, multi-user machines it is an obstacle to getting useful work done.
In the matrix-matrix multiplication results, we are obtaining performance that is near 50 percent of the peak possible performance. This is a marked improvement over the performance previously achieved with the reference version of BLAS. It was made possible by finely tuned BLAS code we obtained from the ASCI project, and it illustrates that underlying sequential code libraries can be just as important as parallel code in achieving full performance from parallel machines.
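The fraction-of-peak arithmetic can be sketched as follows. The measured rates are those reported in the PDGEMM table; the per-processor peak of 200 MFlop/s is an assumption made here for illustration (a 200 MHz processor completing one floating-point operation per cycle), not a figure stated in this report, and the function names are hypothetical.

```python
# ASSUMPTION: per-processor peak in MFlop/s (200 MHz, one flop per cycle).
ASSUMED_PEAK_PER_PROCESSOR = 200.0

# Measured PDGEMM rates in MFlop/s, keyed by process grid and matrix order N.
MEASURED = {
    (2, 2): {1000: 276, 2000: 436, 4000: 396},
    (4, 4): {1000: 285, 2000: 742, 4000: 1256, 8000: 1329},
}


def fraction_of_peak(rate: float, nprocs: int,
                     peak_per_processor: float = ASSUMED_PEAK_PER_PROCESSOR) -> float:
    """Aggregate rate divided by the aggregate (assumed) peak rate."""
    return rate / (nprocs * peak_per_processor)


def summarize(measured=MEASURED):
    """Return (grid, N, MFlop/s per processor, fraction of assumed peak)."""
    rows = []
    for (p, q), by_order in measured.items():
        nprocs = p * q
        for n, rate in by_order.items():
            rows.append(((p, q), n, rate / nprocs, fraction_of_peak(rate, nprocs)))
    return rows


if __name__ == "__main__":
    for grid, n, per_proc, frac in summarize():
        print(f"{grid[0]} x {grid[1]}  N={n:5d}  "
              f"{per_proc:6.1f} MFlop/s per processor  {100 * frac:5.1f}% of assumed peak")
```

With the assumed 200 MFlop/s peak, the best 2 x 2 result (436 MFlop/s) comes to about 55 percent of peak and the best 4 x 4 result (1329 MFlop/s) to about 42 percent. A different peak figure would scale these percentages accordingly.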
We will encourage other JPL projects to use Hyglac and other Beowulf machines, in order to gain better understanding of this type of system. We will also encourage and work with any groups who are involved in writing middleware for Beowulf-class computers.
Daniel S. Katz
NASA Jet Propulsion Laboratory
Daniel.S.Katz@jpl.nasa.gov
818-354-7359
Other contributors to this report were Tom Cwik, John Lou, and Paul Springer.