
Project Overview
The Northeast Parallel Architectures Center of Syracuse University is applying basic computer science research in high performance input/output systems for parallel computers to the NASA grand challenge applications of four-dimensional data assimilation. Our approach is to apply leading parallel computing research to a number of existing techniques for assimilation, and extract parameters indicating where and how input/output limits computational performance. Using detailed knowledge of the application problems, we are:
Research Activities
PASSION: Parallel And Scalable Software for Input-Output
In order to develop a successful assimilation system for Earth Science, there will be a continual need to process and reprocess data sets with ever-improving and more complete assimilation system. There will also be a requirement to diagnose the quality of the data sets. Data assimilation provides the most compute-intensive as well as I/O-intensive undertaking in NASA Earth Science research, and therefore, high-performance I/O capability will be essential to new generation data assimilation systems. The objective of designing the PASSION software is to develop software support for parallel I/O that permits scalable I/O operations to match the growing computational power of the new parallel supercomputer.
We have concentrated our recent research on the correlation between data distribution and parallel I/O performance. In several studies, we have shown that the performance of parallel file systems can vary significantly as a function of the selected data distributions cannot be supported. Further, in these studies we have described how the parallel language extensions, though simplifying the programming, do not address the performance problems found in parallel file systems.
As a result of our studies, we have developed a parallel I/O runtime package that uses an alternative scheme for conducting parallel I/O - the Two-Phase Access Strategy - which guarantees higher and more consistent performance over a wider spectrum of data distributions. We have designed and implemented runtime primitives that make use of the two-phase access strategy to conduct parallel I/O, and facilitate the programming of parallel I/O operations. Our performance results show that I/O access rates are improved by up to several orders of magnitude. Further, the variation in performance over various data distributions is restricted to within a factor of two of the best access rate.
A number of high level programming language have recently introduced intrinsics that support parallel I/O through a runtime library. By using these primitives, I/O operation instructions within applications become portable across various parallel file systems. Also, the primitives are convenient to use; the instructions for carrying out parallel I/O operations do not involve much more than a declaration of the data decomposition mapping and the use of open, close, read and write routines.
Yet, these languages supported I/O primitives suffer from a serious drawback. Because they use a direct access mechanism to perform the I/O, the user data distribution mapping remains tightly linked to the file mapping to disks. Thus, they are susceptible to the same performance fluctuations and limitations (e.g., unsupported data distributions) that are observed of the parallel file systems.
Our system provides the portability and convenience of language supported I/O primitives. In addition, because it makes use of the two-phase access strategy to carry out I/O, it effectively decouples user mapping from the file mappings of the parallel file system, and provides consistently high performance independent of the data decompositions used. Further, since these primitives are linked at the compile-time as a runtime library, they can be used with any MIMD (node programming + message passing library) program, or from a data parallel program such as one written in HPF. The runtime primitives library provides a set of simple I/O routines.
Advantages of the Runtime I/O System:
TCE - Thread-based Communication Environment
TCE employs light-weight multithreading. It assumes each I/O event is independent (a thread) and that scalable I/O can be accomplished by managing a parallel/distributed thread queue. The integration between PASSION and TCE can offer a more robust and unified view toward using meta-computing for scalable heterogeneous I/O for four-dimensional data assimilation. TCE provides the following functionality:
The following advantages can be offered by using Thread-based Communication Environment of execution for handling I/O operations:
Load Balancing Technique:
Load balancing is an important issue in parallel/distributed computing to well utilize accessible computing resources. A distributed computing architecture can be transformed to a connected graph, such as a vertex represents a computational node or I/O node, and an edge represents a communication link between them. Each computational node, I/O node and communication edge consists of computational power, I/O performance and high-speed communication network, respectively. The optimization can be performed across the entire system and the disk allocation optimization can be performed on each I/O node. This optimization technique can improve the performance of parallelization of four-dimensional data assimilation.
Load balancing of the distributed heterogeneous system is vital scalable I/O. Our use of meta-computing for large-scale four-dimensional data assimilation problems makes this more critical. The load balancing problem can be viewed as a graph-partitioning problem. Graph-partition problems belong to the class of NP-complete problems; hence exact solutions are computational intractable for large problems. However, good suboptimal solutions are sufficient for effective parallelization of most of applications. SPRINT - Scalable Partitioning, Refinement and INcremental partitioning Techniques - collects three important partitioning methods based on physical information (e.g., recursive inertial bisection, recursive orthogonal bisection and index-based partitioning). Index-based partitioning methods not only can be used to partitioning graphs but also can be used to improved the disk allocation.
Exploration of data locality is a necessary feature for effectively parallelize the 4DDA applications on a parallel/distributed environment. For example, the carious levels of memory hierarchy present in architectures include registers, caches, disk accesses, network communication, etc. The parallel machine can be modeled to have a two levels of hierarchy, local access and non-local access. In the 4DDA problems data clustering techniques can improve the efficiency for the scalable distributed I/O. Generally, physical proximate data can be clustered together and stored at continuous space on the disk. This can typically reduce the cost of data-accessing latency when the data size is large.
Research Accomplishments Embedded in Software
PASSION - Parallel And Scalable Software for Input-Output
PASSION provides software support for I/O intensive loosely synchronous problems. It has a layered approach and provides support at the compiler, runtime support and file system levels. The various components of PASSION are briefly described in the following paragraphs:
TCE - Thread-based Communication Environment
The multithreaded support is based on a user-level thread package, thus TCE can be easily ported to different processors. The different architecture supported are:
The set of interface functions in TCE has been minimized; by using multiple threads we can get grid of all non-blocking and asynchronous calls. Respectively, much cleaner semantics and better optimized code have been obtained. A TCE process is a regular UNIX, DOS or Windows NT process. Although it is spawned according to the user request, its creation and scheduling is carried exclusively by the Operating System. A TCE thread is a light-weight thread of execution. Its creation and destruction as well as scheduling is controlled by the TCE layer. The underlying Operating System is neither involved in those actions, nor it is aware that a process is multithreaded.
SPRINT: Scalable Partitioning, Refinement and INcremental partitioning Techniques
SPRINT consists of three major components which are static graph partitioner, incremental graph partitioner and communication scheduler. In static graph-partitioning methods we choose to use the physical-proximate algorithms to conduct load balancing over the heterogeneous network because of the effectiveness, the efficiency and the quality. In incremental partitioning the index-based is chosen to solve the dynamic change load on the fly. SPRINT provides efficient methods to solve load change over the time and generates as good quality as the static partitioner with much less cost.
SPRINT also provides communication level to redistribute the data or load among all the participated computing nodes. It is built on top of P4 communication environment since P4 provides effective and simple interface. A communication scheduler involves in the SPRINT implementation for achieve better performance. We conducted the experiments on CM-5, Intel Paragon and workstation clusters (e.g., DEC-Alpha, SUN4, RS/6000, etc.).
Conclusion
PASSION provides software support for high-performance parallel I/O on distributed memory parallel computers. It provides support for compiling out-of-core data parallel programs, parallel input-output of data and parallel access to files, communication of out-of-core data, redistribution of data stored on disks, many optimizations including Data Prefetching from disks, Data Sieving, Data Reuse, etc., as well as support at the file system level. PASSION also provides an initial framework for runtime support for out-of-core irregular problems.
All the runtime procedures, optimizations and file system support described in the attached technical reports have been implemented. A subset of the compiler has been implemented and a full implementation is in progress. PASSION is currently available on the Intel Paragon, Touchstone Delta and iPSC/860 using Intel's Concurrent File System, and IBM SP-1 and SP-2 using the Vesta Parallel File System.
TCE is to provide an efficient, thread-based communication library capable of supporting distributed and parallel processing on variety on platforms; to ensure interoperability between different types of architectures with different CPUs and Operating System; to make the environment as simple as possible without compromising the performance or functionality; to assist the programmer in choosing computational nodes and style of interactions between his processes.
SPRINT is designed to solve the load balancing problem on a parallel/distributed system. It not only provides graph partitioning techniques to solve the highly irregular application but also a parallel computing environment for developing parallel applications on message-passing-based machines. SPRINT can improve the locality of data in an application to gain better performance.
Reference
Points of Contact
Geoffrey C. Fox, Alok Choudhary, Wojtek Furmanski, Donald M.
Leskiw, Kim Mills, Chao-Wei Ou
Northeast Parallel Architectures Center
Syracuse University
gcf@nova.npac.syr.edu
Richard Rood
NASA/Goddard Space Flight Center
Table of Contents | Section Contents -- Basic Research | Subsection Contents -- CESDIS University Research Program in Parallel Computing