Annual Report
2000
YEAR IN REVIEW

Systems and Services  
High-resolution global climate models require high-capability parallel systems such as NERSC’s IBM SP and Cray T3E. See page 53 for details.

A hallmark of NERSC since its founding in 1974 has been the expertise and competence of the employees staffing the facility and the high quality of services they deliver to our users. Year after year, the NERSC staff delivers critical computing resources, applications, and information to enable users to make the most of their allocations on our high performance systems. Each year the staff introduces innovative enhancements and improvements to the systems and services as well.

While NERSC employees continued to anticipate and meet users’ needs in 2000, they also undertook a number of special projects aimed at maintaining NERSC’s pre-eminence among scientific computing facilities — most notably, the installation and testing of the Phase 1 IBM SP system, and the move of our other systems to Berkeley Lab’s new Oakland Scientific Facility, where there is room for the Phase 2 SP as well as future expansion. In this section we highlight some examples of both ongoing activities and special projects. For a more comprehensive view, “How Are We Doing: A Self-Assessment of the Quality of Services and Systems at NERSC” is available online in .pdf format.


Off to a Flying Start in Y2K

In 1999, the mass media portrayed the year 2000 as a cyberspecter threatening to undermine nearly everything remotely connected to a computer. Thanks to careful and thorough preparations at NERSC, the calendar change came and went uneventfully, with no loss of service and no impact on users. Unlike other facilities, which shut down over the holiday and brought in extra staff to deal with potential problems, NERSC maintained normal operations with only one extra person on duty for security on New Year’s Eve.

 
   
 
While most of NERSC’s Y2K preparations were completed in 1999, R. K. Owen, Brent Draney, John McCarthy, and Greg Butler put on the finishing touches. Greg developed a new methodology for testing the IBM SP system and had it in place within 10 days of the system’s installation in April. R. K. and John prepared a backup system for user account information in NERSC’s Central User Bank (CUB), and Brent upgraded CUB with a number of Y2K patches. As NERSC security analyst, Brent was on the job on New Year’s Eve; he quickly fixed a handful of minor problems that cropped up with the date change and also averted an attempted hacker attack.

This business-as-usual stance was possible because most of NERSC’s Y2K testing was actually completed in late 1998 and early 1999. In fact, NERSC was one of the first organizations within DOE to demonstrate Y2K compliance, and was the only site to fully test IBM supercomputers, HPSS, or a Cray T3E for Y2K compliance. Hardware, operating systems, layered software, and scientific applications were all subjected to rigorous testing.

The Y2K compliance team, led by Jim Craw, included Harsh Anand, Greg Butler, Tina Butler, Jonathon Carter, Thomas Davis, Tina Declerck, Jed Donnelley, Keith Fitzgerald, Susan Green, Frank Hale, William Harris, John Hules, Wayne Hurlbert, Nancy Johnston, Cheri Lawrence, Ken Okikawa, Bill Saphir, Jackie Scoggins, Craig Tull, David Turner, Mike Welcome, and Tammy Welcome.

And thanks to NERSC’s ongoing security program, there have been no serious security incidents to date.


NERSC Delivers New Resources with Traditional Efficiency

With Phase 1 of the IBM SP system installed (see page 10), NERSC was able to double MPP allocations from 4.7 million processor hours in FY 1999 to 9.3 million in FY 2000. FY 2001 MPP allocations are up another 30%, even without factoring in the Phase 2 SP system, which will have 2,528 processors and a peak speed of 3.8 teraflops. PVP allocations increased about 50% from FY 1999 to FY 2000, while storage resource units (SRUs) increased about 40%.
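The growth figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the numbers quoted in the text, and it assumes that the 30% increase for FY 2001 applies to the FY 2000 total of 9.3 million processor hours, which the report does not state explicitly. The variable and function names are made up for this example.

```python
# Figures quoted in the text, in millions of MPP processor hours.
fy1999_mpp = 4.7
fy2000_mpp = 9.3


def growth_factor(before, after):
    """Return after/before as a multiplicative growth factor."""
    return after / before


mpp_growth = growth_factor(fy1999_mpp, fy2000_mpp)

# Assumption: "up another 30%" is measured against the FY 2000 figure.
fy2001_mpp_estimate = fy2000_mpp * 1.30

# Peak speed per processor implied by the Phase 2 figures in the text.
phase2_processors = 2528
phase2_peak_teraflops = 3.8
gflops_per_processor = phase2_peak_teraflops * 1000 / phase2_processors

print(f"FY 1999 to FY 2000 MPP growth: {mpp_growth:.2f}x")
print(f"Estimated FY 2001 MPP allocation: {fy2001_mpp_estimate:.1f} million hours")
print(f"Implied Phase 2 peak per processor: {gflops_per_processor:.2f} gigaflops")
```

Under these assumptions the FY 2000 allocation is about 1.98 times the FY 1999 figure, consistent with the "double" in the text, and the Phase 2 figures imply roughly 1.5 gigaflops of peak speed per processor.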

Not content merely to increase our resources, we operated them with our traditional efficiency and convenience to users. The Cray T3E utilization rate was over 90% for most of the year (Figure 4), and IBM SP utilization topped 80% continuously from the time the system was placed in production through the end of the year — an impressive achievement with a new system (Figure 5). In addition, to help users of the Cray SV1 computers test and debug their programs, we opened one of the machines for interactive use.

 
Figure 4. Cray T3E utilization for FY 2000.
 
Figure 5. IBM SP utilization from April to November 2000.


Move to Oakland Goes Like Clockwork

After almost two years of planning, months of construction, and an intense week of moving and installation, NERSC’s Cray supercomputers, HPSS, PDSF cluster, and auxiliary systems were up and running in Berkeley Lab’s new Oakland Scientific Facility on November 6 (see photos below and on the following pages). The moved systems were down for less than a week. NERSC’s IBM RS/6000 SP Phase 1 computer remained in service during the move and will stay in Berkeley until the Phase 2 system is installed in Oakland. Staff from NERSC’s Computer Operations and Network Support Group have relocated to the Oakland facility, and other Berkeley Lab employees will move into the facility early in 2001.

 
Construction of Berkeley Lab’s new Oakland Scientific Facility provided the floor space and electrical capacity for future expansion of NERSC’s computing systems.
 
 
 

The move was the culmination of a process that began in January 1999, when NERSC, looking for space to accommodate larger computer systems, launched an extensive site selection process. A request for proposals drew eight qualified responses, from which the Oakland site emerged as the favorite. After a contract was signed in August 1999, the former bank building was stripped down to its support structure, seismically reinforced, and completely rebuilt to meet or exceed all current codes and standards. Additional improvements in the electrical supply capability and a high-volume ventilation system were among the many special accommodations required for high performance computers. The site includes room for future expansion.

Howard Walter  

Howard Walter, head of the Future Infrastructure, Networking and Security Group, managed the project for NERSC throughout the process of specification development, bid solicitation and evaluation, contract negotiation, design, permitting, construction, and relocation. A ceremony to dedicate the facility is being planned for early spring 2001.


HPSS Increases Capacity and Performance

NERSC’s High Performance Storage System grew significantly in both capacity and performance during the past year thanks to the work of Nancy Meyer and the Mass Storage Group. From January to December, archival storage capacity increased from 660 terabytes (TB) to 880 TB. The amount of data actually stored at the end of the year was 145 TB. The online disk data cache grew from 1.8 TB at the beginning of the calendar year to 6 TB at the end, and the default disk speed was increased from 6 megabits per second (Mb/sec) to 32 Mb/sec. The storage environment moved from individual MicroChannel machines to IBM SP nodes, doubling the number of processors available to the storage machines and doubling the speed of each machine.

 
Through hard work and ingenuity, Steve Lowe, Wayne Hurlbert, and James Lee overcame hardware problems and a tight schedule to complete the transfer of archival storage data to an upgraded system.
 


50 TB Data Transfer Overcomes Hurdles

Moving terabytes of data, even under the best conditions, can be a time-consuming chore. When the hardware doesn’t cooperate and time pressure is factored in, it’s even tougher. In March 2000, Wayne Hurlbert and James Lee of the Mass Storage Group and Steve Lowe of the Computer Operations and Network Support Group were given the task of moving 50 terabytes (TB) of archival data from NERSC’s IBM storage libraries to the StorageTek silos. The transfer was driven in part by the need to trade in the IBM systems by a set date and to add 220 TB of capacity to the storage system. As the data transfer began, hardware failed, forcing the team to transfer the data using just 6 of 14 tape drives. Then the internal high-speed network slowed down, further impeding the transfer. The team worked night and day to fix the hardware problems and came up with a workable plan to move the data. Despite the technical difficulties, they managed to move the 50 TB in four weeks at a sustained rate of 21 megabytes per second (MB/sec), completing the job on time.
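As a rough check on the sustained rate, the calculation below divides the data volume by the elapsed time. It is a sketch under stated assumptions: four weeks is taken as 28 days of continuous transfer, and decimal units are used (1 TB = 10^12 bytes, 1 megabyte = 10^6 bytes). The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_sec(terabytes, days):
    """Average transfer rate in megabytes per second (decimal units)."""
    total_bytes = terabytes * 10**12
    elapsed_seconds = days * SECONDS_PER_DAY
    return total_bytes / elapsed_seconds / 10**6


rate_megabytes = sustained_rate_mb_per_sec(50, 28)
rate_megabits = rate_megabytes * 8  # 8 bits per byte

print(f"Sustained rate: {rate_megabytes:.1f} megabytes/sec")
print(f"Equivalent:     {rate_megabits:.0f} megabits/sec")
```

Under these assumptions the average comes to a little under 21 megabytes per second, or about 165 megabits per second, so the quoted figure of 21 matches the transfer when it is read in megabytes per second.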


PDSF Expands and Upgrades

Ongoing expansion and upgrades of the PDSF (Parallel Distributed Systems Facility) continued this year. The PDSF is used by several high energy and nuclear physics experiments for data analysis and simulations. Eighty-nine dual-CPU compute nodes were added for a total of 151 nodes or 277 processors, and the disk vault was expanded by 4.0 TB to a total of 7.5 TB. Tom Davis wrote a channel bonding addition to the Linux kernel that made network connectivity to the disk vault more reliable by using the two physical Ethernet connections as a single logical connection. With the move to Oakland, five disk servers were given Gigabit Ethernet fiber connections, so the PDSF now has a gigabit connection to HPSS.
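The idea behind channel bonding, presenting two physical connections as one logical connection, can be illustrated with a toy model. The sketch below is not the Linux kernel code described above and does not reflect its actual policies. It assumes a simple round-robin scheme that skips any link that is down, and all class and variable names are hypothetical.

```python
class PhysicalLink:
    """A stand-in for one physical Ethernet connection."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.sent = []  # frames sent over this link

    def send(self, frame):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.sent.append(frame)


class BondedLink:
    """Present several physical links as one logical link (toy model)."""

    def __init__(self, links):
        self.links = list(links)
        self._next = 0  # round-robin position

    def send(self, frame):
        # Try each link at most once, starting at the round-robin position.
        for offset in range(len(self.links)):
            index = (self._next + offset) % len(self.links)
            link = self.links[index]
            if link.up:
                link.send(frame)
                self._next = (index + 1) % len(self.links)
                return link.name
        raise ConnectionError("all physical links are down")


eth0, eth1 = PhysicalLink("eth0"), PhysicalLink("eth1")
bond = BondedLink([eth0, eth1])

# With both links up, frames alternate between the two connections.
used = [bond.send(f"frame-{i}") for i in range(4)]

# If one link fails, the logical connection keeps working over the other.
eth1.up = False
used_after_failure = [bond.send(f"frame-{i}") for i in range(4, 6)]

print(used)
print(used_after_failure)
```

In this model the caller only ever talks to the bonded link, so the loss of one physical connection reduces capacity without interrupting service, which is the reliability benefit the text describes.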


New Cluster Computing Team Established

NERSC Deputy Director Bill Kramer announced in June that NERSC was establishing a team of staff from multiple groups to coordinate all NERSC Division cluster computing activities (research, development, advanced prototypes, pre-production, production, and user support). This team will ensure the most effective use of division resources related to cluster computing.

The NERSC Cluster Computing Team is led by Tammy Welcome and is primarily composed of staff (ranging from part time to almost full time) from the Advanced Systems, Future Technologies, Computational Systems, and User Services groups. The goals of the team are:

  • Continue providing high-quality support and service for production clusters such as the PDSF.
  • Take maximum advantage of the NERSC production environment to improve cluster systems while maintaining appropriate levels of service.
  • Investigate the feasibility and effectiveness of cluster computing as a full-production, highly parallel computing platform.
  • Exercise leadership in the cluster arena within DOE and the national HPC community, and collaborate with other groups.


Operations Team Takes on New Technologies

 
   
 
The initiative of Russell Huie and Elizabeth Bautista helped fellow members of the Computer Operations and Network Support Group to continue providing outstanding service to NERSC and ESnet users.

Keeping up with technological changes is a challenge in itself, but finding a way to make the information readily available can be even tougher. When Elizabeth Bautista of NERSC’s Computer Operations and Network Support Group took on the task of updating documentation for the center’s Operations staff, she not only had to work her way through pages of outdated printed information, but she also had to scale a steep learning curve to find the best way to reorganize and update the information on the Web, so that Operations staff members could more easily get the information they needed. Elizabeth not only acquired the needed expertise and completed the job on time, but she also motivated other group members to improve their troubleshooting skills as part of the process.

Russell Huie also put himself in the learning fast lane by stepping forward as the main point of contact when ESnet announced major changes in its Video Conferencing System (VCS). After taking on the responsibility, Rusty gathered all the information needed to operate, troubleshoot, and continue providing the same high level of videoconferencing services to the ESnet user community using the new Digital Collaboration Services (DCS). Once he was up to speed on the new system, Rusty provided training to other members of the Operations staff and also ensured that thorough DCS documentation was available to users.


NIM Provides Better Account Management

At the beginning of FY 2001, after a year-long development project, NERSC replaced the CUB account management system with the NERSC Information Management system (NIM), a Web-based application for managing accounts. The primary reason for creating NIM was that CUB was designed for Cray vector computers and could no longer be extended to support current high performance systems such as the IBM SP. CUB also could not be ported to other architectures such as cluster or HPSS systems. Moving to a Web-based application allows NIM to be used on any platform. However, the move to a Web interface necessitated new authentication and security components, which are incorporated in NIM.

NIM makes it easier to transfer resources, either from reserve accounts to repositories or from one repository to another. Another advantage is that NIM is based on open-source PHP (a server-side, cross-platform, HTML-embedded scripting language) and uses an Oracle database, making it easier to support than CUB, which was produced in-house. When Version 2 is completed in the summer of 2001, NIM will provide users with richer data covering all NERSC platforms. NIM was created by members of the User Services, Computational Systems, Mass Storage, and Computer Operations and Network Support groups, with Howard Walter as project manager.


Consultation and Training Promote User Productivity

NERSC’s commitment to making scientific computing more productive extends to each researcher who uses our resources. The User Services Group is the user community’s primary point of contact with NERSC. This group is responsible for problem management and consulting; help with user code optimization and debugging; documentation; online, remote, and classroom training; and third-party applications and library support. In addition, the Computer Operations and Network Support Group is available seven days a week, 24 hours a day to help troubleshoot problems.

During the past year, User Services presented 32 training sessions, which included a three-day onsite and videoconference workshop on using the IBM SP system; two lectures on high performance linear algebra using the IBM Power 3 architecture; two days of training sessions for new and intermediate users in conjunction with the NERSC Users Group meeting at Oak Ridge National Laboratory; and six teleconference sessions with online materials. A wide range of documentation and training materials are also available, with over 3,800 new training files added in just the past year.


Survey Responses Lead to Improved Services, Information

Each year for the past three years, NERSC’s User Services Group has asked users to complete a survey rating the center’s systems and services and asking for suggestions on how NERSC can better serve its users. With each survey, the number of users participating has increased. The level of user satisfaction has also risen each year, in part because the NERSC staff implements new procedures based on comments from users. For example, based in large part on the input received in the Fiscal Year 1999 survey, NERSC made the following improvements:

  • To accommodate users running large jobs on the Cray T3E, NERSC created a long-running queue (up to a maximum of 12 hours) for jobs using up to 256 PEs.
  • To help users of NERSC’s Cray SV1 computers test and debug their programs, the center opened one of the machines for interactive use.
  • To keep users better informed of NERSC announcements and changes, User Services created new email lists, continuing changes made as a result of the 1998 survey.
  • To make it easier for users to find information on the Web about running batch and interactive jobs, a new Web page with near real-time information was created.
  • To help users decipher what’s being said, a NERSC Glossary and Acronym List was created.
  • To help users quickly find out whether any machine at NERSC is up or down, an automated window presenting machine status was added to the bottom of the HPCF home page.

Results of the 2000 survey are available online.
