The Energy Research Supercomputer Users Group (ERSUG) June 1995 meeting was hosted by the National Energy Research Supercomputer Center (NERSC) at the Lawrence Livermore National Laboratory (LLNL). Some of the talks are summarized below.
Bill McCurdy, NERSC Director, explained that a major development was the solicitation of proposals from LLNL and Lawrence Berkeley National Laboratory (LBNL) to operate NERSC in FY1997. Discussion of the issue was scheduled for later in the meeting, when the Supercomputing Access Committee (SAC) members in Washington would be able to participate by video teleconference.
McCoy elaborated on the technical advances and on NERSC's role as a computer center that provides capability in the form of supercomputers, capacity in the form of a Supercomputing Auxiliary Service (SAS) for code development and assimilation, and user services, all in a loosely integrated production environment. NERSC's response to user requirements and technology evolution is that the production environment of the future will be a complex that spans multiple major sites, or nodes, of which NERSC is one. The whole complex provides a complete range of capabilities for computational science, as well as the distributed computing support needed to unify all machines, specifically including small machines, for capacity computing. NERSC proposes to prototype a single production environment across a few major sites, including NERSC. This prototype could serve as a functional test bed in a comprehensive plan to coordinate numerous user sites into one distributed computing complex.
With NERSC viewed as a node in a distributed computing complex, future developments fall into two categories: those that provide convenient access to capacity and those that provide the specialized core capability.
For capability scientific computing, NERSC proposes to implement a Unified Production Environment (UPE) consisting of:
NERSC is preparing for the arrival of a massively parallel processing (MPP) machine in FY1996. In preparation, NERSC's Massively Parallel Computing Group (MPCG) is already assisting users in parallelizing existing applications. The group holds workshops and provides educational material. NERSC has access to a fraction of a CRAY T3D machine at LLNL, which it can make available to NERSC users. In general, NERSC (along with the Center for Computational Science and Engineering, or CCSE) will be offering users assistance from experts in parallel computing who also have backgrounds in physical sciences, experts in the development of portable parallel code, and experts in scientific visualization and data management.
Tammy Welcome, Group Leader for MPCG, described the effort her group is making to help users parallelize applications that use significant amounts of CRAY Y-MP C90 time. Her group has worked on five codes that use 20% of C90 capacity. Her group also runs the MPP Access Program that provides users direct access to massively parallel processors so that they can do their own code development. There will be an FY1996 access program for which proposals must be submitted by August 18. The group held a two-and-a-half-week MPP workshop that started immediately after the ERSUG meeting. The goal of the workshop was to educate NERSC users about the general techniques of parallel computing and to provide hands-on experience with users' own scientific applications.
Betsy Foote, SMP Project Leader, informed ERSUG about a study in symmetric multiprocessing (SMP) that uses a Silicon Graphics (SGI) Power Challenge machine to learn how such machines can complement MPPs and partially replace the current generation of vector supercomputers. The experiment involves others at LLNL as well as NERSC. The SGI machine in this experiment has 12 CPUs based on R8000 chips. Machine utilization reached 100% within a week of its arrival. There are currently about 50 active users, with about 15 active concurrently during prime time. NERSC has learned so far that it is relatively easy to move a uniprocessor code onto this machine. NERSC is still investigating the ease of moving moderately parallel codes and needs dedicated system time to do timing studies of performance. The current scheduling algorithm reduces a process's priority as it accumulates CPU time, and it has resulted in a few users getting a disproportionate amount of CPU time. A mechanism to allow the fair sharing of this resource is needed.
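The scheduling problem described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the SGI scheduler or any mechanism NERSC adopted: it assumes that the imbalance arises because priority is tracked per process, so a user who runs many processes collects many shares, and it assumes one possible remedy, which is to also charge CPU time to the owning user. The names simulate and DECAY, the decay factor, and the users "alice" and "bob" are all hypothetical.

```python
DECAY = 0.9  # hypothetical: fraction of past CPU usage remembered each tick


def simulate(processes, ticks, fair_share=False):
    """Toy single-CPU scheduler; returns CPU ticks received per user.

    processes: list of (user, process_id) tuples, all always runnable.
    fair_share=False: pick the process with the least decayed CPU usage
                      (priority drops as a process accumulates CPU time).
    fair_share=True:  pick from the user with the least decayed CPU usage
                      first, then break ties by per-process usage.
    """
    proc_usage = {p: 0.0 for p in processes}
    user_usage = {user: 0.0 for user, _ in processes}
    received = {user: 0 for user, _ in processes}

    def per_process_key(p):
        return proc_usage[p]

    def per_user_key(p):
        return (user_usage[p[0]], proc_usage[p])

    key = per_user_key if fair_share else per_process_key

    for _ in range(ticks):
        chosen = min(processes, key=key)   # lowest usage = highest priority
        proc_usage[chosen] += 1.0          # charge the process
        user_usage[chosen[0]] += 1.0       # charge the owning user
        received[chosen[0]] += 1
        # Age all recorded usage so that old CPU time gradually stops counting.
        for p in proc_usage:
            proc_usage[p] *= DECAY
        for u in user_usage:
            user_usage[u] *= DECAY
    return received


if __name__ == "__main__":
    # One user with a single process, another with four (hypothetical workload).
    procs = [("alice", 0)] + [("bob", i) for i in range(4)]
    print("per-process priority:", simulate(procs, 100))
    print("per-user fair share: ", simulate(procs, 100, fair_share=True))
```

In this toy workload, per-process priority lets the user with four processes collect most of the CPU ticks, while the per-user variant splits the ticks roughly evenly between the two users. Real fair-share schedulers are considerably more elaborate (allocations, groups, multiple CPUs); the sketch only shows why charging usage to users rather than to processes changes the outcome.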
John Allen, System Administration Team Leader, discussed a new experiment that NERSC is undertaking to use a network of workstations to facilitate distributed supercomputing. Since NERSC needed to replace desktop equipment anyway, the experiment takes advantage of the new equipment to see what can be done with a loose cluster of workstations. NERSC will use a switched Ethernet/FDDI (fiber distributed data interface) network with 32 SGI workstations, 32 Sun Sparc 5s, and an existing Sun SparcCenter 2000. One approach is to operate it as a dynamic cluster using commercial and freely available software such as LSF (Load Sharing Facility), AFS (Andrew File System), PVM (Parallel Virtual Machine), and MPI (Message Passing Interface). The alternative is the Berkeley N.O.W. (Network of Workstations) software, which supports such important features as migrating a job to another workstation if it becomes blocked where it was running. NERSC will initially set up the new equipment in two networked clusters, each including an SMP "front end" system, and will transition over time to the Berkeley N.O.W. environment or other suitable distributed computing environments.
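The idea of migrating a job when it becomes blocked on its workstation can be sketched in a few lines. This is a toy model only, assuming that "blocked" means the workstation is no longer available to cluster jobs and that a job is moved to the available workstation with the fewest jobs. It does not use or represent the Berkeley N.O.W. or LSF interfaces; make_hosts, place, and migrate_blocked are hypothetical names.

```python
def make_hosts(names):
    """Build a hypothetical cluster table: name -> host record."""
    return {n: {"name": n, "available": True, "jobs": []} for n in names}


def place(job, hosts):
    """Put a job on the available host with the fewest jobs.

    Returns the chosen host name, or None if no host is available.
    """
    candidates = [h for h in hosts.values() if h["available"]]
    if not candidates:
        return None
    target = min(candidates, key=lambda h: len(h["jobs"]))
    target["jobs"].append(job)
    return target["name"]


def migrate_blocked(hosts):
    """Move jobs off hosts that are no longer available.

    Returns a list of (job, old_host, new_host) for every job that moved.
    Jobs with nowhere to go stay where they are.
    """
    moved = []
    for host in hosts.values():
        if host["available"] or not host["jobs"]:
            continue
        pending, host["jobs"] = host["jobs"], []
        for job in pending:
            new_name = place(job, hosts)
            if new_name is None:
                host["jobs"].append(job)   # no free workstation; keep waiting
            else:
                moved.append((job, host["name"], new_name))
    return moved
```

For example, after placing three jobs on three workstations and marking the first workstation unavailable, a call to migrate_blocked moves that workstation's job to one of the other two. A real system must also checkpoint and restore the job's state, which this sketch leaves out entirely.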
Chris Anderson, Graphics Projects Leader, outlined the status and direction of graphics at NERSC. The graphics project consists of two full-time and two half-time employees. There are now many graphics and visualization tools on the SAS machines, on which all NERSC users are entitled to have accounts. The project members are interested in collaborating with users and are particularly interested in helping users visualize the results of simulations from the SPP, SMP, and MPP efforts. As users write or rewrite codes for parallel machines, they are invited to discuss with the NERSC graphics programmers new visualization techniques that could be incorporated during code development.
Moe Jette, Large-Scale Systems Group Leader, gave a presentation on the progress of the Production Environment. He reported that part of the strategic plan for system administration has been done. In other areas, NERSC is now dealing with the issue of foreign nationals' access to NERSC computers and software; the Central User Bank (CUB) system has been ported to the T3D; the Portable Batch System (PBS) is being beta tested (Pacific Northwest Laboratory agreed to assist in the beta testing of PBS); NERSC is working to bring in Kerberos security; and UNICOS 8.0 has been successfully installed on the C90.
Sandy Merola of LBNL first gave a summary of current ESnet Steering Committee (ESSC) activities, then an overview of the committee structure. One activity has been to champion ESnet by arguing to maintain its current level of funding. Another has been to devise a funding structure that encompasses uniprogram and multiprogram network sites and conforms to all funding guidelines (i.e., one that avoids misappropriating or reappropriating funds). A third activity is to investigate how to sustain international network service at an affordable price now that NSFnet is gone. A new activity in which ERSUG may wish to participate is an ESnet strategic planning session.
ESSC is chartered by MICS to document, review, and prioritize network requirements; review the ESnet budget with respect to the prioritized network requirements; identify network requirements that require further research; establish performance objectives; propose innovative techniques for enhancing ESnet capabilities; and advise the ESnet management personnel. ESSC members are appointed by division heads to represent program offices. ESSC takes and records formal votes.
A subgroup chartered by ESSC is the Distributed Computing Coordinating Committee (DCCC). It has a long list of responsibilities, but for the purpose of the ERSUG meeting, the most interesting one is to advance the planning, development, and implementation of the needed Distributed Informatics, Computing, and Collaborative Environment (DICCE).
Barry Howard, Distributed Computing Group Leader, gave an update on distributed computing and DCCC/DICCE. He elaborated on the types of services that DICCE, through the distributed computing complex, is meant to include. It should promote the sharing of both information and computing resources and facilitate collaboration among geographically distributed researchers. DICCE has an architecture consisting of layers of services built upon ESnet. Among the layers are the Distributed Computing Environment, authentication service, key distribution service, distributed file systems, secure e-mail, and seamless e-mail enclosures.
Bruce Cohen of LLNL described ongoing work in the Numerical Tokamak Project.