Basic Research Banner

High Performance Input/Output System for High Performance Computing and Four-Dimensional Data Assimilation

Final Report (1993 - 1996)

Project Overview

The Northeast Parallel Architectures Center of Syracuse University is applying basic computer science research in high performance input/output systems for parallel computers to the NASA grand challenge applications of four-dimensional data assimilation. Our approach is to apply leading parallel computing research to a number of existing techniques for assimilation, and extract parameters indicating where and how input/output limits computational performance. Using detailed knowledge of the application problems, we are:

Research Activities

PASSION: Parallel And Scalable Software for Input-Output

In order to develop a successful assimilation system for Earth Science, there will be a continual need to process and reprocess data sets with ever-improving and more complete assimilation system. There will also be a requirement to diagnose the quality of the data sets. Data assimilation provides the most compute-intensive as well as I/O-intensive undertaking in NASA Earth Science research, and therefore, high-performance I/O capability will be essential to new generation data assimilation systems. The objective of designing the PASSION software is to develop software support for parallel I/O that permits scalable I/O operations to match the growing computational power of the new parallel supercomputer.

We have concentrated our recent research on the correlation between data distribution and parallel I/O performance. In several studies, we have shown that the performance of parallel file systems can vary significantly as a function of the selected data distributions cannot be supported. Further, in these studies we have described how the parallel language extensions, though simplifying the programming, do not address the performance problems found in parallel file systems.

As a result of our studies, we have developed a parallel I/O runtime package that uses an alternative scheme for conducting parallel I/O - the Two-Phase Access Strategy - which guarantees higher and more consistent performance over a wider spectrum of data distributions. We have designed and implemented runtime primitives that make use of the two-phase access strategy to conduct parallel I/O, and facilitate the programming of parallel I/O operations. Our performance results show that I/O access rates are improved by up to several orders of magnitude. Further, the variation in performance over various data distributions is restricted to within a factor of two of the best access rate.

A number of high level programming language have recently introduced intrinsics that support parallel I/O through a runtime library. By using these primitives, I/O operation instructions within applications become portable across various parallel file systems. Also, the primitives are convenient to use; the instructions for carrying out parallel I/O operations do not involve much more than a declaration of the data decomposition mapping and the use of open, close, read and write routines.

Yet, these languages supported I/O primitives suffer from a serious drawback. Because they use a direct access mechanism to perform the I/O, the user data distribution mapping remains tightly linked to the file mapping to disks. Thus, they are susceptible to the same performance fluctuations and limitations (e.g., unsupported data distributions) that are observed of the parallel file systems.

Our system provides the portability and convenience of language supported I/O primitives. In addition, because it makes use of the two-phase access strategy to carry out I/O, it effectively decouples user mapping from the file mappings of the parallel file system, and provides consistently high performance independent of the data decompositions used. Further, since these primitives are linked at the compile-time as a runtime library, they can be used with any MIMD (node programming + message passing library) program, or from a data parallel program such as one written in HPF. The runtime primitives library provides a set of simple I/O routines.

Advantages of the Runtime I/O System:

  1. The runtime system can be easily ported on various machines which provide parallel file systems, This makes the runtime primitives highly portable and easy to use.
  2. By using these primitives, the more complex data distributions (Block-Block or Block-Cyclic) are made available to the user. The only additional information required are the global, local array information and the processor grid information.
  3. Primitives allow the user to control the data mapping over the disks. This is a significantly advantage since the user can vary the number of disks to optimize the data access time.
  4. Under certain conditions, the primitives allow the programmer can change the data distribution on the processors dynamically.
  5. The data access time is significantly improved and is made more consistent since the primitives use two-phase access strategy.

TCE - Thread-based Communication Environment

TCE employs light-weight multithreading. It assumes each I/O event is independent (a thread) and that scalable I/O can be accomplished by managing a parallel/distributed thread queue. The integration between PASSION and TCE can offer a more robust and unified view toward using meta-computing for scalable heterogeneous I/O for four-dimensional data assimilation. TCE provides the following functionality:

The following advantages can be offered by using Thread-based Communication Environment of execution for handling I/O operations:

  1. For a large class of applications it is much easier to express a program control flow by concurrent actions than by the finite automaton model.
  2. On multiprocessors systems, multiple threads cab run in parallel thus speeding up the applications execution.
  3. An application can take an advantage of overlapping between I/O and CPU operations (CPU and I/O channels can operate concurrently because of different hardware circuitry).
  4. Multithreading can make a large percentage of latency incurred during memory and I/O operations.

Load Balancing Technique:

Load balancing is an important issue in parallel/distributed computing to well utilize accessible computing resources. A distributed computing architecture can be transformed to a connected graph, such as a vertex represents a computational node or I/O node, and an edge represents a communication link between them. Each computational node, I/O node and communication edge consists of computational power, I/O performance and high-speed communication network, respectively. The optimization can be performed across the entire system and the disk allocation optimization can be performed on each I/O node. This optimization technique can improve the performance of parallelization of four-dimensional data assimilation.

Load balancing of the distributed heterogeneous system is vital scalable I/O. Our use of meta-computing for large-scale four-dimensional data assimilation problems makes this more critical. The load balancing problem can be viewed as a graph-partitioning problem. Graph-partition problems belong to the class of NP-complete problems; hence exact solutions are computational intractable for large problems. However, good suboptimal solutions are sufficient for effective parallelization of most of applications. SPRINT - Scalable Partitioning, Refinement and INcremental partitioning Techniques - collects three important partitioning methods based on physical information (e.g., recursive inertial bisection, recursive orthogonal bisection and index-based partitioning). Index-based partitioning methods not only can be used to partitioning graphs but also can be used to improved the disk allocation.

Exploration of data locality is a necessary feature for effectively parallelize the 4DDA applications on a parallel/distributed environment. For example, the carious levels of memory hierarchy present in architectures include registers, caches, disk accesses, network communication, etc. The parallel machine can be modeled to have a two levels of hierarchy, local access and non-local access. In the 4DDA problems data clustering techniques can improve the efficiency for the scalable distributed I/O. Generally, physical proximate data can be clustered together and stored at continuous space on the disk. This can typically reduce the cost of data-accessing latency when the data size is large.

Research Accomplishments Embedded in Software

PASSION - Parallel And Scalable Software for Input-Output

PASSION provides software support for I/O intensive loosely synchronous problems. It has a layered approach and provides support at the compiler, runtime support and file system levels. The various components of PASSION are briefly described in the following paragraphs:

  1. Parallel Access to Files: PASSION provides support for parallel access to files for read/write operations, supports distribution of data on parallel file systems as well as distribution and redistribution of data among the processors of a distributed memory machine. It uses the Two-Phase Access Strategy for this purpose.
  2. Out-of-Core Computations: Out-of-Core Computations are those in which the primary data structures are too large to fit in main memory and hence reside on disks. PASSION provides runtime, compiler and language support for these computations.
  3. File System Optimization: One of the key components of PASSION is to use the access pattern knowledge extracted from the program and provided by the compiler and runtime system to the file system, so that certain optimizations can be performed. These optimizations include efficient management and allocation of access pattern.
  4. VIP-FS Portable Parallel File System: VIP-FS provides a portable parallel file system in a distributed computing environment. The file system is deemed a virtual file system because it is implemented using multiple individual standard file systems integrated by a message passing system. VIP-FS is portable across many architectures as well as many message passing systems and is designed to work in a heterogeneous environment.
  5. Integrating Task and Data Parallelism Using Parallel I/O: This component deals with providing a "parallel pipes" mechanism for communication among data parallel tasks. Since data distributions in two data parallel tasks may be different and may not be known in advance, this component provides a way to perform communication in parallel while hiding the individual distributions, and redistributing the data if it is necessary.

TCE - Thread-based Communication Environment

The multithreaded support is based on a user-level thread package, thus TCE can be easily ported to different processors. The different architecture supported are:

  1. clusters of heterogeneous workstations - through ports and channels,
  2. shared memory parallel machines - through multithreading, and
  3. distributed memory parallel machines - through ports and channels.

The set of interface functions in TCE has been minimized; by using multiple threads we can get grid of all non-blocking and asynchronous calls. Respectively, much cleaner semantics and better optimized code have been obtained. A TCE process is a regular UNIX, DOS or Windows NT process. Although it is spawned according to the user request, its creation and scheduling is carried exclusively by the Operating System. A TCE thread is a light-weight thread of execution. Its creation and destruction as well as scheduling is controlled by the TCE layer. The underlying Operating System is neither involved in those actions, nor it is aware that a process is multithreaded.

SPRINT: Scalable Partitioning, Refinement and INcremental partitioning Techniques

SPRINT consists of three major components which are static graph partitioner, incremental graph partitioner and communication scheduler. In static graph-partitioning methods we choose to use the physical-proximate algorithms to conduct load balancing over the heterogeneous network because of the effectiveness, the efficiency and the quality. In incremental partitioning the index-based is chosen to solve the dynamic change load on the fly. SPRINT provides efficient methods to solve load change over the time and generates as good quality as the static partitioner with much less cost.

SPRINT also provides communication level to redistribute the data or load among all the participated computing nodes. It is built on top of P4 communication environment since P4 provides effective and simple interface. A communication scheduler involves in the SPRINT implementation for achieve better performance. We conducted the experiments on CM-5, Intel Paragon and workstation clusters (e.g., DEC-Alpha, SUN4, RS/6000, etc.).

Conclusion

PASSION provides software support for high-performance parallel I/O on distributed memory parallel computers. It provides support for compiling out-of-core data parallel programs, parallel input-output of data and parallel access to files, communication of out-of-core data, redistribution of data stored on disks, many optimizations including Data Prefetching from disks, Data Sieving, Data Reuse, etc., as well as support at the file system level. PASSION also provides an initial framework for runtime support for out-of-core irregular problems.

All the runtime procedures, optimizations and file system support described in the attached technical reports have been implemented. A subset of the compiler has been implemented and a full implementation is in progress. PASSION is currently available on the Intel Paragon, Touchstone Delta and iPSC/860 using Intel's Concurrent File System, and IBM SP-1 and SP-2 using the Vesta Parallel File System.

TCE is to provide an efficient, thread-based communication library capable of supporting distributed and parallel processing on variety on platforms; to ensure interoperability between different types of architectures with different CPUs and Operating System; to make the environment as simple as possible without compromising the performance or functionality; to assist the programmer in choosing computational nodes and style of interactions between his processes.

SPRINT is designed to solve the load balancing problem on a parallel/distributed system. It not only provides graph partitioning techniques to solve the highly irregular application but also a parallel computing environment for developing parallel applications on message-passing-based machines. SPRINT can improve the locality of data in an application to gain better performance.

Reference

  1. "Compiler and Runtime Support for Out-of-Core HPF Programs", Rajeev Thakur, Rajesh Bordawekar and Alok Choudhary, Proc. of Int. Conf. on Supercomputing (ICS 94), July 1994, pp. 382-391.
  2. "PASSION Runtime Library for Parallel I/O", Rajeev Thakur, Rajesh Bordawekar, Alok Choudhary, Ravi Ponnusamy and Tarvinder Singh, Proc. Of the Scalable Parallel Libraries Conference, Oct. 1994.
  3. "PASSION: Parallel And Scalable Software for Input-Output", Alok Choudhary, Rajesh Bordawekar, Michale Harry, Rakesh Krishnaiyer, Ravi Ponnusamy, Tarvinder Singh and Rajeev Thakur, NPAC Technical Report SCCS-636, Sep. 1994.
  4. "Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines", Rajesh Bordawekar, Alok Choudhary and Rajeev Thakur, NPAC Technical Report SCCS-622, Sep. 1994.
  5. "The Design of VIP-FS: A Virtual Parallel File System for High Performance Parallel and Distributed Computing", Juan Miguel del Rossario, Michale Harry and Alok Choudhary, NPAC Technical Report SCCS-628, May 1994.
  6. "ADOPT: A Dynamic Scheme for Optimal Prefetching in Parallel File System", Tarvinder Singh and Alok Choudhary, NPAC Technical Report SCCS-627, 1994.
  7. "Integrating Task and Data Parallelism Using Parallel I/O Techniques", Bhaven Avalani, Alok Choudhary, Ian Foster and Rakesh Krishnaiyer, Proc. Of the Int. Workshop on Parallel Processing, Bangalore, India, Dec. 1994.
  8. "Thread-based Communication Environment", Janusz Niemiec, Wojtek Furmanski and Geoffrey C. Fox, NPAC Internal Report, 1995.
  9. "An Architecture-Independent Locality-Improving Transformations of Computational Graphs Embedded in k-dimensions", Chao-Wei Ou, Manoj Gunwani and Sanjay Ranka, ICS 95, July 1995, pp. 289-298.
  10. "SPRINT: Scalable Partitioning, Refinement and INcremental partitioning Techniques", Chao-Wei Ou and Sanjay Ranka, NPAC Technical Report, December 1995.

Points of Contact

Geoffrey C. Fox, Alok Choudhary, Wojtek Furmanski, Donald M. Leskiw, Kim Mills, Chao-Wei Ou
Northeast Parallel Architectures Center
Syracuse University
gcf@nova.npac.syr.edu

Richard Rood
NASA/Goddard Space Flight Center


Table of Contents | Section Contents -- Basic Research | Subsection Contents -- CESDIS University Research Program in Parallel Computing