Summary of ERSUG Meeting
June 13 - 14, 1995, Livermore, California
The Energy Research Supercomputer Users Group (ERSUG) June 1995 meeting was hosted by the National Energy Research Supercomputer Center (NERSC) at the Lawrence Livermore National Laboratory (LLNL). Some of the talks are summarized below.
The meeting opened with remarks from Tom Kitchens and Bill McCurdy. Tom Kitchens began by saying that the Office of Scientific Computing (for which he works) has become the Mathematical, Information, and Computational Sciences (MICS) Division in a recent DOE reorganization. MICS, headed by John Cavallini, is aligned under the Computation and Technology Research program within the Office of Energy Research (ER). Among the DOE developments affecting NERSC are a modest budget decrease in fiscal year (FY) 1996 and a more significant decrease in FY1997 (from 15% to as much as 25%).
Bill McCurdy, NERSC Director, explained that another major development was the solicitation of proposals from LLNL and Lawrence Berkeley National Laboratory (LBNL) to operate NERSC in FY1997; however, discussion of the issue was scheduled for later in the meeting, when the Supercomputing Access Committee (SAC) members in Washington would be able to participate by video teleconference.
Strategic Planning Issues
Mike McCoy, NERSC Deputy Director, gave the leadoff presentation on strategic planning at NERSC. To summarize, the two major forces that drive the strategic planning process are technological evolution and user requirements. In the former category are rapid advances in microprocessor technology, coalescence around UNIX and other protocols for use in scientific computing, new wide-area networking (WAN) technology, and new developments in tertiary storage systems. In the latter category are the needs for computing capability, computing capacity, and usable computational resources.
McCoy elaborated on the technical advances and NERSC's role as a computer center providing capability in the form of supercomputers, capacity in the form of a Supercomputing Auxiliary Service (SAS) for code development and assimilation, and user services, all in a loosely integrated production environment. NERSC's response to the user requirements and technology evolution is that the production environment of the future is a complex that spans multiple major sites or nodes of which NERSC is one node. The whole complex provides a complete scale of capabilities for computational science as well as the distributed computing support to unify all machines, specifically including small machines, for capacity computing. NERSC proposes to prototype a single production environment across a few major sites, including NERSC. This could represent a functional test bed in a comprehensive plan to coordinate numerous user sites into one distributed computing complex.
With NERSC viewed as a node in a distributed computing complex, future developments fall into two categories: those that provide convenient access to capacity and those that provide the specialized core capability.
For capability scientific computing, NERSC proposes to implement a Unified Production Environment (UPE) consisting of:
- Integrated system administration management
- Common banker across all systems
- Prototype network of workstations
- Distributed batch and interactive computing
- Licensed software administration
- Single NERSC login and authentication
- ER-wide distributed file system
- Archival storage for a distributed file system
- Major increase in capability
NERSC is preparing for the arrival of a massively parallel processing (MPP) machine in FY1996. In preparation, NERSC's Massively Parallel Computing Group (MPCG) is already assisting users in parallelizing existing applications. The group holds workshops and provides educational material. NERSC has access to a fraction of a CRAY T3D machine at LLNL, which it can make available to NERSC users. In general, NERSC (along with the Center for Computational Science and Engineering, or CCSE) will be offering users assistance from experts in parallel computing who also have backgrounds in physical sciences, experts in the development of portable parallel code, and experts in scientific visualization and data management.
Planning for the Next Generation of High-Performance Systems
Keith Fitzgerald, File Storage Systems Group Leader (Acting), gave a report on storage. All CFS (Common File System) files are now online in the Storage-Tek silos. Growth is being artificially constrained with storage quotas. The National Storage Laboratory (NSL) base system will be used in the first step toward a replacement for CFS. The NSL base is now running on an IBM RS/6000 R24 and is available for testing by friendly users. The only user interface to the system is FTP (File Transfer Protocol). There are no HiPPI (high-performance parallel interface) channel devices connected yet, although there are plans for RAID (Redundant Array of Inexpensive Disks) disks and other devices.
Tammy Welcome, Group Leader for MPCG, described the effort her group is making to help users parallelize applications that use significant amounts of CRAY Y-MP C90 time. Her group has worked on five codes that use 20% of C90 capacity. Her group also runs the MPP Access Program that provides users direct access to massively parallel processors so that they can do their own code development. There will be an FY1996 access program for which proposals must be submitted by August 18. The group held a two-and-a-half-week MPP workshop that started immediately after the ERSUG meeting. The goal of the workshop was to educate NERSC users about the general techniques of parallel computing and to provide hands-on experience with users' own scientific applications.
Betsy Foote, SMP Project Leader, informed ERSUG about a study in symmetric multiprocessing (SMP) using a Silicon Graphics (SGI) Power Challenge machine to learn how such machines can complement MPPs and partially replace the current generation of vector supercomputers. This is an experiment involving others at LLNL as well as NERSC. The SGI machine in this experiment has 12 CPUs based on R8000 chips. Machine utilization reached 100% within a week of its arrival. There currently are about 50 active users, with about 15 active concurrently during prime time. NERSC has learned so far that it is relatively easy to move a uniprocessor code onto this machine. NERSC is still investigating the ease of moving moderately parallel codes and needs dedicated system time to do timing studies of performance. The current scheduling algorithm reduces a process's priority as it accumulates CPU time, and this has resulted in a few users getting a disproportionate amount of CPU time. A mechanism to allow the fair sharing of this resource is needed.
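To illustrate the scheduling concern raised in this talk, the following is a minimal sketch and not the SGI scheduler or anything NERSC deployed. It assumes one plausible cause that the talk did not state: when priority depends only on each process's own accumulated CPU time, a user who runs many processes collects more total CPU. The function `simulate`, the user names, and the process names are all hypothetical.

```python
from collections import defaultdict


def simulate(procs, ticks, fair_share):
    """Hand out `ticks` units of CPU on one processor.

    procs: list of (process_id, user) pairs.
    fair_share=False: favor the process with the least CPU so far
                      (per-process priority decay).
    fair_share=True:  favor the process whose *owner* has the least
                      CPU so far (a simple fair-share rule).
    Returns a dict mapping user -> CPU units received.
    """
    proc_cpu = {pid: 0 for pid, _ in procs}
    user_cpu = defaultdict(int)

    for _ in range(ticks):
        def key(proc):
            pid, user = proc
            if fair_share:
                # Owner's total usage first, then the process's own usage.
                return (user_cpu[user], proc_cpu[pid], pid)
            return (proc_cpu[pid], pid)

        pid, user = min(procs, key=key)
        proc_cpu[pid] += 1
        user_cpu[user] += 1

    return dict(user_cpu)


# Hypothetical workload: user "a" runs four processes, user "b" runs one.
procs = [("a1", "a"), ("a2", "a"), ("a3", "a"), ("a4", "a"), ("b1", "b")]

print(simulate(procs, 100, fair_share=False))  # {'a': 80, 'b': 20}
print(simulate(procs, 100, fair_share=True))   # {'a': 50, 'b': 50}
```

Under per-process decay the five processes split the CPU evenly, so user "a" receives four times as much as user "b". Charging usage to the owner instead equalizes the two users regardless of how many processes each one starts.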
John Allen, System Administration Team Leader, discussed a new experiment NERSC was undertaking to use a network of workstations to facilitate distributed supercomputing. Since NERSC needed to replace desktop equipment anyway, this experiment takes advantage of the new equipment to see what can be done with a loose cluster of workstations. NERSC will use a switched Ethernet/FDDI (fiber distributed data interface) network with 32 SGI workstations, 32 Sun Sparc 5s, and an existing Sun SparcCenter 2000. One approach is to operate it as a dynamic cluster using commercial software such as LSF (Load Sharing Facility), AFS (Andrew File System), PVM (Parallel Virtual Machine), and MPI (Message Passing Interface). The alternative is the Berkeley N.O.W. (Network of Workstations) software, which supports such important features as migrating a job to another workstation if it becomes blocked where it was running. NERSC will initially set up the new equipment in two networked clusters, each including an SMP "front end" system, and, over time, transition to the Berkeley N.O.W. environment or other suitable distributed computing environments.
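The job-migration idea mentioned above can be sketched as follows. This is a toy model under stated assumptions, not the behavior of LSF or the Berkeley N.O.W. software: the functions `least_loaded` and `rebalance`, the host names, and the use of a simple job count as the load measure are all hypothetical.

```python
def least_loaded(loads, exclude=()):
    """Return the host with the fewest jobs, ignoring hosts in `exclude`.

    Ties are broken by host name so the result is deterministic.
    """
    candidates = {h: n for h, n in loads.items() if h not in exclude}
    if not candidates:
        raise ValueError("no workstation available")
    return min(candidates, key=lambda h: (candidates[h], h))


def rebalance(placement, loads, blocked):
    """Move every job on a blocked host to the least-loaded unblocked host.

    placement: dict job -> host
    loads:     dict host -> number of jobs currently placed there
    blocked:   set of hosts where jobs can no longer make progress
    Returns (new_placement, new_loads); the inputs are not modified.
    """
    new_placement = dict(placement)
    new_loads = dict(loads)
    for job, host in sorted(placement.items()):
        if host in blocked:
            target = least_loaded(new_loads, exclude=blocked)
            new_loads[host] -= 1
            new_loads[target] += 1
            new_placement[job] = target
    return new_placement, new_loads


# Hypothetical cluster: ws1 becomes blocked while running two jobs.
placement = {"j1": "ws1", "j2": "ws1", "j3": "ws2"}
loads = {"ws1": 2, "ws2": 1, "ws3": 0}

print(rebalance(placement, loads, blocked={"ws1"}))
```

A real system would also have to checkpoint and transfer the job's state; this sketch covers only the placement decision.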
Chris Anderson, Graphics Projects Leader, outlined the status and direction for graphics at NERSC. The graphics project consists of two full-time and two half-time employees. There are now many graphics and visualization tools on the SAS machines on which all NERSC users are entitled to have accounts. The project members are interested in collaborating with users and are particularly interested in helping users visualize the results of simulations of the SPP, SMP, and MPP efforts. As users write or rewrite codes for parallel machines, they are invited to discuss with the NERSC graphics programmers new visualization techniques that could be incorporated during the code development.
Moe Jette, Large-Scale Systems Group Leader, gave a presentation on the progress of the Production Environment. He reported that part of the strategic plan for system administration has been done. In other areas, NERSC is now dealing with the issue of foreign nationals' access to NERSC computers and software; the Central User Bank (CUB) system has been ported to the T3D; the Portable Batch System (PBS) is being beta tested (Pacific Northwest Laboratory agreed to assist in the beta testing of PBS); NERSC is working to bring in Kerberos security; and UNICOS 8.0 has been successfully installed on the C90.
Distributed Computing and Networks
Jim Leighton, Networking and Distributed Computing Manager, presented condensed tables of network traffic. The most curious finding was that the average packet size has increased in the last year. The speculation is that users are using Gopher, Mosaic, or Netscape more and are causing larger blocks of data to be transferred. The procurement of a fast packet upgrade resulting in the installation of asynchronous transfer mode (ATM) service for ESnet appears to be completed after several vendor protests. Eight sites at T3 speeds are now linked with ATM service. ESnet has successfully completed the transition off the NSF (National Science Foundation) backbone to the new architecture. Several sites that are not directly connected to the T3 ATM system but have reached the limits of their current T1-based network systems will get multiple T1 lines to handle their bandwidth requirements.
Sandy Merola of LBNL first gave a summary of current ESnet Steering Committee (ESSC) activities, then an overview of the committee structure. One activity has been to champion ESnet by arguing to maintain its current level of funding. Another has been to devise a funding structure encompassing uniprogram and multiprogram network sites that conforms to all funding guidelines (i.e., avoids misappropriating or reappropriating funds). Another activity is to investigate how to sustain international network service at an affordable price now that NSFnet is gone. A new activity in which ERSUG may wish to participate is an ESnet strategic planning session.
ESSC is chartered by MICS to document, review, and prioritize network requirements; review the ESnet budget with respect to the prioritized network requirements; identify network requirements that require further research; establish performance objectives; propose innovative techniques for enhancing ESnet capabilities; and advise the ESnet management personnel. ESSC members are appointed by division heads to represent program offices. ESSC takes and records formal votes.
A subgroup chartered by ESSC is the Distributed Computing Coordinating Committee (DCCC). It has a long list of responsibilities, but for the purpose of the ERSUG meeting, the most interesting one is to advance the planning, development, and implementation of the needed Distributed Informatics, Computing, and Collaborative Environment (DICCE).
Barry Howard, Distributed Computing Group Leader, gave an update on distributed computing and DCCC/DICCE. This was an elaboration of the types of services that DICCE, through the distributed computing complex, is meant to include. It should promote the sharing of both information and computing resources and facilitate collaboration by geographically distributed researchers. DICCE has an architecture consisting of layers of services built upon ESnet. Among the layers are the Distributed Computing Environment, authentication service, key distribution service, distributed file systems, secure e-mail, and seamless e-mail enclosures.
Science Talks by Host Institution
John Bell from CCSE at NERSC gave an overview of his group's work on local adaptive mesh refinement for computational fluid dynamics.
Bruce Cohen of LLNL described ongoing work in the Numerical Tokamak Project.