Scientific Data Management Group Making Significant Progress, Also Making News

June 16, 2000

NERSC's Scientific Data Management Group notched a couple of wins this month, contributing to an important milestone in the Earth Sciences Grid project and getting a full-page article in the June 12 issue of Computerworld magazine.

The Computerworld article described the group's Storage Access Coordination System (STACS), which streamlines the task of searching for and retrieving requested subsets of data files from massive tape libraries. STACS was developed as part of DOE's High Energy and Nuclear Physics Grand Challenge to make it easier for physicists to select data from experiments with the STAR detector at Brookhaven National Laboratory. STACS has three components.

The first is a specialized index, which allows users to specify a request based on the properties of data they are looking for. This is especially useful in physics, where there can be tens of thousands of files containing records of millions of "events," the term used for particle collisions.

The second component, called a Query Monitor, coordinates such requests from multiple users. Because retrieving files can be very time-consuming (and adding hardware to speed up the process is expensive), consolidating file requests from multiple users can make the information available sooner to more people.
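The consolidation idea can be illustrated with a short Python sketch. It is not the Query Monitor's actual code: the function names, the file names and the "most requested first" ordering are assumptions made for illustration. The point it shows is that when several users ask for overlapping sets of files, each file needs to come off tape only once.

```python
from collections import defaultdict


def consolidate_requests(requests):
    """Map each requested file to the set of users waiting for it.

    `requests` is a dict of user -> list of file names (a hypothetical
    input shape chosen for this sketch).
    """
    waiting = defaultdict(set)
    for user, files in requests.items():
        for name in files:
            waiting[name].add(user)
    return dict(waiting)


def retrieval_order(waiting):
    """Order files so the most widely requested come first (ties by name).

    This ordering policy is an assumption, not a description of STACS.
    """
    return sorted(waiting, key=lambda name: (-len(waiting[name]), name))


# Hypothetical users and file names.
requests = {
    "alice": ["run7.f01", "run7.f02", "run7.f03"],
    "bob": ["run7.f02", "run7.f03"],
    "carol": ["run7.f03", "run7.f04"],
}

waiting = consolidate_requests(requests)
order = retrieval_order(waiting)
total_requested = sum(len(files) for files in requests.values())

print(order)
print(f"{total_requested} file requests served by {len(order)} retrievals")
```

In this toy case, seven individual file requests collapse into four retrievals, and the file wanted by all three users is fetched first.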

The third component, called the HRM (for HPSS Resource Manager), manages, queues and monitors file transfer requests to High Performance Storage System (HPSS) installations, such as the ones at NERSC, San Diego Supercomputer Center (SDSC) and other labs.

STACS is now being integrated into the data analysis program at Brookhaven, a "very gratifying" result for a system that started out as an exploratory project, said Group Lead Arie Shoshani.

Those three components are now helping the group make headway on two projects funded earlier under DOE's Next Generation Internet effort. Whereas STACS was developed for use on a storage system at a single site, the NGI and DOE's Science Grid envision applying such capabilities to systems distributed among multiple sites. Earlier this year, the group began pushing the Grid technologies, such as the Globus Grid software, to see what they could do.

The Program for Climate Model Diagnosis and Intercomparison (PCMDI) at LLNL has developed a package of applications for getting information from files and then manipulating that data for climate modeling research. What was needed, Shoshani said, was a way to find out where the desired files were located among participating sites, and, if the files had been replicated from the original site and stored at another, which copy could be retrieved the fastest.

Group member Alex Sim spent about two months developing a tool, called the Request Manager, to do just that. Earlier this month, the group proved the viability of the Request Manager and reached "a very important milestone in the Earth Science Grid project," Shoshani said. A demo was set up, and is running successfully, in which files distributed across six sites are accessed in a coordinated fashion using Globus software components. One of the sites is an HPSS system running at SDSC. The others are disk caches available at LBNL, the National Center for Atmospheric Research in Colorado, the Information Sciences Institute (ISI) in Southern California, Argonne National Lab and LLNL.

Three parts of Globus software were used: the security-enhanced File Transfer Protocol modules built on the Grid Security Infrastructure (GSIFTP); a replica catalog, implemented using LDAP; and a Network Weather Service (NWS), developed by the University of Tennessee, which regularly assesses traffic conditions on specified networks to find the least congested routes.

The Scientific Data Management Group's part in this project was key to its success. In the demo, the Request Manager accepts a request for a set of files from the PCMDI front end (at LLNL), checks the Globus replica catalog to find replicas of each file, selects the best location from which to get the file using the NWS information, and uses the secure Globus GSIFTP to move the file to the destination.
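The selection step in that workflow can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the Request Manager itself: the replica catalog and the NWS measurements are stood in for by plain dictionaries, the site labels, file names and bandwidth figures are invented, and the actual GSIFTP transfer is left out.

```python
def select_replica(file_name, replica_catalog, bandwidth_mbps):
    """Return the site holding `file_name` with the best estimated bandwidth.

    `replica_catalog` maps file name -> list of sites holding a copy, and
    `bandwidth_mbps` maps site -> estimated bandwidth to the destination.
    Both are hypothetical stand-ins for the replica catalog and NWS data.
    """
    sites = replica_catalog.get(file_name, [])
    if not sites:
        raise LookupError(f"no replica registered for {file_name}")
    # A site with no measurement is treated as 0 Mbit/s (an assumption).
    return max(sites, key=lambda site: bandwidth_mbps.get(site, 0.0))


def plan_transfers(file_names, replica_catalog, bandwidth_mbps):
    """Pair every requested file with the source site chosen for it."""
    return [
        (name, select_replica(name, replica_catalog, bandwidth_mbps))
        for name in file_names
    ]


# Illustrative data only; none of these values come from the demo.
replica_catalog = {
    "tas_1990.nc": ["llnl-cache", "lbnl-cache"],
    "pr_1990.nc": ["sdsc-hpss", "ncar-cache", "anl-cache"],
}
bandwidth_mbps = {
    "llnl-cache": 40.0,
    "lbnl-cache": 85.0,
    "sdsc-hpss": 12.0,
    "ncar-cache": 30.0,
}

plan = plan_transfers(["tas_1990.nc", "pr_1990.nc"], replica_catalog, bandwidth_mbps)
for name, site in plan:
    print(f"{name}: fetch from {site}")
```

A real implementation would also have to cope with stale measurements and failed transfers, which this sketch ignores.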

"This milestone required a lot of coordination and good will on the part of participants, especially involving Globus people, PCMDI people, our group for the software development, and various individuals where the data was replicated," Shoshani said. "It is pretty amazing that all this complex technology actually works!"

In addition, the group will soon add the HPSS Resource Manager (HRM) they developed as another site that will access files from NERSC's HPSS. This will demonstrate the ability to pre-stage files to a local disk before moving them using the Globus GSIFTP. It will allow researchers to find the files they need and pre-stage them for transfer at a later time to take advantage of Quality of Service (QoS) network scheduling.

In the group's second Grid project, the Particle Physics Data Grid, they are also making progress in getting the Storage Resource Broker (SRB) grid server, developed and used at SDSC, to communicate directly with HRM at LBNL. As a result of this collaboration, new functionality was added to SRB to request pre-staging, check the status of a transfer request, and abort file transfer requests.

Another tool in the group's toolbox is the Query Estimator, or QE, currently being improved and enhanced by group member John Wu. It uses the specialized index developed for STACS. When querying large storage systems for data with specific characteristics, it's hard to know what will turn up. It could be a few files, or it could be thousands. A query involving a large number of files could take days or weeks to complete, especially if many users access the system. When such a query starts, the unsuspecting user, not realizing the size of the query, often kills it after a few hours or days, which is a waste of time and resources. The QE provides users with an idea of the number of files and the number of events that will turn up, and how long it will take to get them. "What QE does is help you avoid misdirected queries you're going to kill anyway," Shoshani said.
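The kind of estimate the QE returns can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the QE's implementation: the per-file counts of matching events are supplied directly here (in the real system they would come from the index), the class and function names are hypothetical, and the time estimate assumes a single constant transfer rate with no tape mount delays or competing users.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSummary:
    name: str
    size_mb: float
    matching_events: int  # events in this file that satisfy the query


def estimate_query(summaries, transfer_rate_mb_per_s):
    """Estimate files, events and retrieval time for a query.

    Only files containing at least one matching event need to be fetched.
    The constant transfer rate is an assumption made for this sketch.
    """
    hits = [s for s in summaries if s.matching_events > 0]
    total_mb = sum(s.size_mb for s in hits)
    return {
        "files": len(hits),
        "events": sum(s.matching_events for s in hits),
        "seconds": total_mb / transfer_rate_mb_per_s,
    }


# Invented file summaries for illustration.
summaries = [
    FileSummary("a.dat", 500.0, 120),
    FileSummary("b.dat", 750.0, 0),
    FileSummary("c.dat", 250.0, 30),
]

estimate = estimate_query(summaries, transfer_rate_mb_per_s=10.0)
print(estimate)
```

With an estimate like this in hand before the query starts, a user can narrow the selection instead of launching a retrieval that would later be killed.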

To learn more about the Scientific Data Management R&D Group, go to <http://sdm.lbl.gov/>

About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.