
Large-scale data systems need distributed indexing to manage and locate distributed data. This project is concerned with developing, implementing, and measuring and modeling the performance of distributed search structures. We focus on developing the dB-tree, a distributed B-tree.
During the first year of funding, we developed some correctness theory results, developed an implementation, developed a simulator, and collected some simulation results.
During the second, we completed correctness theory results, completed an implementation and collected results from it, completed the simulator and executed several studies, and developed an analytical performance model.
During the third year, we have concentrated on developing an extensible simulator for distributed index structures. The source code of the simulator, written in C++, is available at http://www.cis.ufl.edu/~ted/.
The project has also enabled us to perform research on related research issues of interest to NASA. Recent work on high-performance distributed synchronization has developed distributed list algorithms. These algorithms can potentially be applied to distributed index maintenance. An initial investigation of this technique has resulted in a distributed reader/writer lock, that supports a higher throughput that previous mutual exclusion algorithms.
NASA's EOSDIS project will ingest and archive petabytes of satellite data per year and distribute the data to a worldwide research community. To aid in planning and maintaining such a large archive we are developing performance models of mass storage archives. Part of this work is to analyze reference patterns to existing archives, and another part is to develop analytical queuing model of tertiary storage.
We discuss these results further in the following sections.
We have developed a distributed simulator, which can be used to study the performance of distributed search structures. The design of the simulator is object-oriented and it could be adapted to develop a simulator for any distributed system. In this section, we describe the architecture of the simulator and discuss the implementation of the dB-tree with lazy updates.
There are five components in the model, three of which represent the distributed system itself, while the remaining two capture the applications utilizing the system services. The components, each of which is realized by a separate object in the simulator, are the following:
|
Source |
generates the system workload |
|
Scheduler |
assigns the generated workload to different processors |
|
Message Router |
mimics the inter-connection network between the processors |
|
Processor |
models each node of the distributed system |
|
Sink |
receives the completed requests |
The interaction between these modules primarily consists of service requests and completion replies. In the remainder of this section, we describe each of these components in detail.
Source: The source component represents the applications utilizing the system services. In our current implementation, it consists of an object which generates dB-tree insert and search requests. The workload generator is modeled as an open system, and requests are generated in a Poisson stream at a mean rate specified by the ArrivalRate parameter. The proportion of insert and search requests can be controlled by InsertProb and SearchProb, which are probabilities that define the proportion of insert and search requests.
Scheduler: All requests generated by the Source are delivered to the Scheduler. The Scheduler assigns the request to a particular processor, where the request is initiated. The execution of the request may generate subsequent relayed actions which are propagated to other processors through the Message router. In our current implementation, the Scheduler randomly assigns the requests to the processors. If we maintain the load profile of each processor, we can use some load balancing techniques to distribute the generated requests.
Message Router: The Message Router models the inter-connection network between the processors. Every request consists of the target processor and the action to be performed on the target processor. The request is delivered to the target processor after a time delay which is a function of the source and target processors of the request. This is done to model the delay associated with the communication channel. The time delay on a particular channel follows an {\it erlang} distribution with a mean and standard deviation, which are determined by the source and target processors of the channel.
Processor: The Processor models a node of the distributed system. Each processor is realized by an object of the Processor class and the number of processors is a parameter of the simulator. A processor consists of queue manager, node manager, resource manager and disk.
The queue manager maintains a queue of pending actions to be performed on the part of the search structure maintained by the processor (the message queue). The node manager repeatedly takes an action from the queue manager and performs the action on a node. The action execution typically generates a subsequent action on another node. If the next node to process is stored locally, then a new entry is put into the message queue. Otherwise, the node manager sends a request to the appropriate remote processor through the message router.
The resource manager handles the requests from the node manager to read(write) a node from(to) the disk. In our current implementation, we assume that all internal nodes are present in memory and the leaf nodes reside on the disk. This can be easily extended and we can incorporate a buffer manager between the node manager and the resource manager.
The disk models the physical resources at a processor.
Sink: The Sink module receives the completed requests from the node manager. A request is assumed to be complete if it does not generate subsequent actions. Each request is assigned a request-id and hence all the actions generated by a request can be identified. The sink module gathers statistics on these requests and measures the performance of the system from the perspective of the various applications using the system.
Distributed synchronization is needed to arbitrate access to a shared resource in a message passing system. Reader/writer synchronization can improve efficiency and throughput if a large fraction of accesses to the shared resource are queries. In this paper, we present a highly efficient distributed algorithm that provides FCFS concurrent-reader exclusive-writer synchronization with an amortized O(log n) messages per critical section entry and O(log n) bits of storage per processor. We evaluate the new algorithm with a simulation study, comparing it to fast and low-overhead distributed mutual exclusion algorithms.
We find that when the request load contains a large fraction of read locks, our algorithm provides higher throughput and a lower acquire time latency than is possible with the distributed mutual exclusion algorithms, with a small increase in the number of messages passed per critical section entry.
The low space and message passing overhead, and high efficiency make the algorithm scalable and practical for implementation. The algorithm we present can easily be extended to give preference to readers or writers.
A paper describing this research was submitted to the Frontiers '96 conference. See also UF CISE technical report 96-018, available from http://www.cis.ufl.edu/cis/tech-reports/.
The successful implementation of mass storage archives require careful attention to performance optimizations, to ensure that the system can handle the offered load. However, performance optimizations require an understanding of user access patterns. Since on-line archives and digital libraries are so new, little information is available.
The National Space Science Data Center (NSSDC) of NASA Goddard Space Flight Center has run an on-line mass storage archive of space data, the National Data Archive and Distribution Service (NDADS), since November 1991. A large worldwide space research community makes use of NSSDC, requesting more than 20,000 files per month. Since the initiation of their service, NSSDC has maintained log files which record all accesses the archive.
In the report, tr96-020 (available from http://www.cis.ufl.edu/cis/tech-reports/ we present an analysis of the NDADS log files, spanning a four-year period (1992 - 1995).
We analyze the log files and discuss several issues, including caching, reference patterns, changes in user interest, user characterization, clustering, and system loading. A preliminary version of this report appeared in the 1995 NASA Goddard conference on Mass Storage Systems and Technologies. The full report has been submitted to the International Journal on Digital Libraries.
The Goddard Space Flight Center (GSFC) Distributed Active Archive Center (DAAC) has been operational for more than two years. Its mission is to support existing and pre-EOS Earth science datasets, facilitate the scientific research, and test Earth Observing System Data and Information System (EOSDIS) concepts. Over 550,000 files and documents have been archived and more than 6 Terabytes have been distributed. Information about user access patterns and their impact on system loading is needed to optimize current operations and to plan for future archives (i.e., EOS-AM1). To facilitate the management of the daily activities, the GSFC DAAC has developed a data base system to track all correspondence, requests, ingestion and distribution. In addition, several log files which record transactions on UniTree. This study identifies some of the users' requests pattern submitted at GSFC DAAC during 1995. The analysis is limited to a subset of all the orders which have their files under the control of the UniTree hierarchical storage management.
Some of the results show that a large percentage of the volume of data requested came from two data types, were for high level (L3 and L4) products, were distributed mostly on 4 mm and 8 mm tapes, and that most requests came from North America, although there is significant world-wide use. We found a very wide range in the size of individual requests, and that most of the volume that is distributed is ordered by a few users.
We evaluate some file caching algorithms and find that LRU/2-bin has the best performance, but that STbin also works well.
This report will appear in the 1996 NASA Goddard Conference on Mass Storage Systems and Technologies.
Large scale scientific projects generate and use huge amounts of data. For example, the NASA EOSDIS project is expected to archive one petabyte per year of raw satellite data. This data is made automatically available for processing into higher level data products and for dissemination to the scientific community. Such large volumes of data can only be stored in robotic storage libraries (RSLs) for near-line access. A characteristic of RSLs is the use of a robot arm that transfers media between a storage rack and the read/write drives, thus multiplying the capacity of the system.
The performance of the RSLs can be a critical limiting factor of the performance of the archive system. However, the many interacting components of a RSL make a performance analysis difficult. In addition, different RSL components can have widely varying performance characteristics. This paper describes our work to develop performance models of a RSL. We first develop a performance model of a RSL in isolation. Next, we show how the RSL model can be incorporated into a queuing network model. We use the models to make some example performance studies of archive systems.
The models described in this paper, developed for the NASA EOSDIS project, are implemented in C with a well-defined interface. The source code and accompanying documentation are available through WWW at http://www.cis.ufl.edu/~ted/.
Papers describing these models will appear in the NASA Goddard conference on Mass Storage Systems and Technologies, and in Performance '96. See also UF CISE technical report 96-019, available from http://www.cis.ufl.edu/cis/tech-reports/.
Theodore Johnson
University of Florida
Department of Computer and Information Sciences
ted@cis.ufl.edu
Table of Contents | Section Contents -- Basic Research | Subsection Contents -- CESDIS University Research Program in Parallel Computing