A hallmark of NERSC since its founding in 1974 has been the expertise and competence of the employees staffing the facility and the high quality of services they deliver to our users. Year after year, the NERSC staff delivers critical computing resources, applications, and information to enable users to make the most of their allocations on our high performance systems. Each year the staff introduces innovative enhancements and improvements to the systems and services as well. While NERSC employees continued to anticipate and meet users' needs in 2000, they also undertook a number of special projects aimed at maintaining NERSC's pre-eminence among scientific computing facilities, most notably the installation and testing of the Phase 1 IBM SP system and the move of our other systems to Berkeley Lab's new Oakland Scientific Facility, where there is room for the Phase 2 SP as well as future expansion. In this section we highlight some examples of both ongoing activities and special projects. For a more comprehensive view, How Are We Doing: A Self-Assessment of the Quality of Services and Systems at NERSC is available online in .pdf format.
In 1999, the mass media portrayed the year 2000 as a cyberspecter threatening to undermine nearly everything remotely connected to a computer. Thanks to careful and thorough preparations at NERSC, the calendar change came and went uneventfully, with no loss of service and no impact on users. Unlike other facilities that shut down over the holiday and brought in extra staff to deal with potential problems, NERSC maintained normal operations with only one extra person on duty for security on New Year's Eve.
This business-as-usual stance was possible because most of NERSC's Y2K testing was actually completed in late 1998 and early 1999. In fact, NERSC was one of the first organizations within DOE to demonstrate Y2K compliance, and was the only site to fully test IBM supercomputers, HPSS, or a Cray T3E for Y2K compliance. Hardware, operating systems, layered software, and scientific applications were all subjected to rigorous testing. The Y2K compliance team, led by Jim Craw, included Harsh Anand, Greg Butler, Tina Butler, Jonathon Carter, Thomas Davis, Tina Declerck, Jed Donnelley, Keith Fitzgerald, Susan Green, Frank Hale, William Harris, John Hules, Wayne Hurlbert, Nancy Johnston, Cheri Lawrence, Ken Okikawa, Bill Saphir, Jackie Scoggins, Craig Tull, David Turner, Mike Welcome, and Tammy Welcome. And thanks to NERSC's ongoing security program, there have been no serious security incidents to date.
With Phase 1 of the IBM SP system installed (see page 10), NERSC was able to double MPP allocations from 4.7 million processor hours in FY 1999 to 9.3 million in FY 2000. FY 2001 MPP allocations are up another 30%, even without factoring in the Phase 2 SP system, which will have 2,528 processors and a peak speed of 3.8 teraflops. PVP allocations increased about 50% from FY 1999 to FY 2000, while storage resource units (SRUs) increased about 40%. Not
content merely to increase our resources, we operated them with our traditional
efficiency and convenience to users. The Cray T3E utilization rate was
over 90% for most of the year (Figure 4), and IBM SP utilization topped
80% continuously from the time the system was placed in production through
the end of the year, an impressive achievement for a new system
(Figure 5). In addition, to help users of the Cray SV1 computers test
and debug their programs, we opened one of the machines for interactive
use.
After almost two years of planning, months of construction, and an intense week
of moving and installation, NERSC's Cray supercomputers, HPSS, PDSF
cluster, and auxiliary systems were up and running in Berkeley Lab's
new Oakland Scientific Facility on November 6 (see photos below and on
the following pages). The moved systems were down for less than a week.
NERSC's IBM RS/6000 SP Phase 1 computer remained in service during
the move and will stay in Berkeley until the Phase 2 system is installed
in Oakland. Staff from NERSC's Computer Operations and Network Support
Group have relocated to the Oakland facility, and other Berkeley Lab employees
will move into the facility early in 2001.
The move was the culmination of a process that began in January 1999, when NERSC, looking for space to accommodate larger computer systems, launched an extensive site selection process. A request for proposals drew eight qualified responses, from which the Oakland site emerged as the favorite. After a contract was signed in August 1999, the former bank building was stripped down to its support structure, seismically reinforced, and completely rebuilt to meet or exceed all current codes and standards. Additional improvements in the electrical supply capability and a high-volume ventilation system were among the many special accommodations required for high performance computers. The site includes room for future expansion.
Howard Walter, head of the Future Infrastructure, Networking and Security Group, managed the project for NERSC throughout the process of specification development, bid solicitation and evaluation, contract negotiation, design, permitting, construction, and relocation. A ceremony to dedicate the facility is being planned for early spring 2001.
NERSC's High Performance Storage System grew significantly in both capacity and
performance during the past year thanks to the work of Nancy Meyer and
the Mass Storage Group. From January to December, archival storage capacity increased
from 660 terabytes (TB) to 880 TB. The amount of data being stored at
the end of the year was 145 TB. The online disk data cache grew from 1.8
TB at the beginning of the calendar year to 6 TB at the end. And the default
disk speed was increased from 6 megabits per second (Mb/sec) to 32 Mb/sec.
The storage environment moved from individual MicroChannel machines to
IBM SP nodes, doubling the number of processors available to the storage
machines and doubling the speed of each machine.
Moving terabytes of data, even under the best conditions, can be a time-consuming chore. When the hardware doesn't cooperate and time pressure is factored in, it's even tougher. In March 2000, Wayne Hurlbert and James Lee of the Mass Storage Group and Steve Lowe of the Computer Operations and Network Support Group were given the task of moving 50 TB of archival data from NERSC's IBM storage libraries to the StorageTek silos. The transfer was driven in part by the need to trade in the IBM systems by a set date and to add 220 TB of capacity to the storage system. As the data transfer began, hardware failed, forcing the team to transfer the data using just 6 of 14 tape drives. Then the internal high-speed network slowed down, further impeding timely transfer of data. The team worked night and day to fix the hardware problems and came up with a workable plan to move the data. Despite the technical difficulties, they managed to move the 50 TB in four weeks at a sustained rate of 21 Mb/sec, completing the job on time.
Expansion and upgrades of the PDSF (Parallel Distributed Systems Facility) continued this year. The PDSF is used by several high energy and nuclear physics experiments for data analysis and simulations. Eighty-nine dual-CPU compute nodes were added for a total of 151 nodes or 277 processors, and the disk vault was expanded by 4.0 TB to a total of 7.5 TB. Tom Davis wrote a channel bonding addition to the Linux kernel that made network connectivity to the disk vault more reliable by using the two physical Ethernet connections as a single logical connection. With the move to Oakland, five disk servers were given Gigabit Ethernet fiber connections, so the PDSF now has a gigabit connection to HPSS.
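As a rough check on the March 2000 data move described above, the sustained rate can be recomputed from the two figures given: 50 TB moved in four weeks. The short Python sketch below does that arithmetic. It assumes decimal units (1 TB = 10^12 bytes) and treats "four weeks" as exactly 28 days of continuous transfer; neither assumption is stated in the text, and the function names are illustrative only.

```python
# Back-of-the-envelope check of the 50 TB transfer.
# Assumptions (not stated in the text): decimal units (1 TB = 10**12 bytes)
# and "four weeks" taken as exactly 28 days of continuous transfer.
SECONDS_PER_DAY = 86_400


def sustained_rate_megabytes_per_s(terabytes: float, days: float) -> float:
    """Average transfer rate in megabytes per second (10**6 bytes/s)."""
    total_bytes = terabytes * 10**12
    seconds = days * SECONDS_PER_DAY
    return total_bytes / seconds / 10**6


def days_needed(terabytes: float, megabits_per_s: float) -> float:
    """Days required to move the data at a rate given in megabits per second."""
    bytes_per_s = megabits_per_s * 10**6 / 8
    return terabytes * 10**12 / bytes_per_s / SECONDS_PER_DAY


rate = sustained_rate_megabytes_per_s(50, 28)
print(f"{rate:.1f} megabytes/s over 28 days")
print(f"{days_needed(50, 21):.0f} days at 21 megabits/s")
```

Under these assumptions the average works out to a little under 21 megabytes per second, whereas 21 megabits per second would need roughly 220 days for the same volume. That suggests the quoted rate is best read as megabytes rather than megabits per second, although the report itself does not say so explicitly.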
NERSC Deputy Director Bill Kramer announced in June that NERSC was establishing a team of staff from multiple groups to coordinate all NERSC Division cluster computing activities (research, development, advanced prototypes, pre-production, production, and user support). This team will assure the most effective implementations of division resources related to cluster computing. The NERSC Cluster Computing Team is led by Tammy Welcome and is primarily composed of staff (ranging from part time to almost full time) from the Advanced Systems, Future Technologies, Computational Systems, and User Services groups. The goals of the team are:
Keeping up with technological changes is a challenge in itself, but finding a way to make the information readily available can be even tougher. When Elizabeth Bautista of NERSC's Computer Operations and Network Support Group took on the task of updating documentation for the center's Operations staff, she not only had to work her way through pages of outdated printed information, she also had to scale a steep learning curve to find the best way to reorganize and update the information on the Web, so Operations staff members could more easily get the information they needed. Elizabeth not only acquired the needed expertise and completed the job on time, but she also motivated other group members to improve their troubleshooting skills as part of the process. Russell Huie also put himself in the learning fast lane by stepping forward as the main point of contact when ESnet announced major changes in its Video Conferencing System (VCS). After taking on the responsibility, Rusty gathered all the information needed to operate and troubleshoot the new Digital Collaboration Services (DCS) and to continue providing the same high level of videoconferencing services to the ESnet user community. Once he was up to speed on the new system, Rusty provided training to other members of the Operations staff and also ensured that thorough DCS documentation was available to users.
At the beginning of FY 2001, after a year-long development project, NERSC replaced the CUB account management system with the NERSC Information Management system (NIM), a Web-based application for managing accounts. The primary reason for creating NIM was that CUB was designed for Cray vector computers and could no longer be extended to support current high performance systems such as the IBM SP. CUB also could not be ported to other architectures such as cluster or HPSS systems. Moving to a Web-based application allows NIM to be used on any platform. However, the move to a Web interface necessitated new authentication and security components, which are incorporated in NIM. NIM makes it easier to transfer resources, either from reserve accounts to repositories or from one repository to another. Another advantage is that NIM is based on open-source PHP (a server-side, cross-platform, HTML-embedded scripting language) and uses an Oracle database, making it easier to support than the CUB system, which was produced in-house. When Version 2 is completed in the summer of 2001, NIM will provide users with richer data covering all NERSC platforms. NIM was created by members of the User Services, Computational Systems, Mass Storage, and Computer Operations and Network Support groups, with Howard Walter as project manager.
NERSC's commitment to making scientific computing more productive extends to each researcher who uses our resources. The User Services Group is the user community's primary point of contact with NERSC. This group is responsible for problem management and consulting; help with user code optimization and debugging; documentation; online, remote, and classroom training; and third-party applications and library support. In addition, the Computer Operations and Network Support Group is available seven days a week, 24 hours a day to help troubleshoot problems. During the past year, User Services presented 32 training sessions, which included a three-day onsite and videoconference workshop on using the IBM SP system; two lectures on high performance linear algebra using the IBM Power 3 architecture; two days of training sessions for new and intermediate users in conjunction with the NERSC Users Group meeting at Oak Ridge National Laboratory; and six teleconference sessions with online materials. A wide range of documentation and training materials is also available, with over 3,800 new training files added in just the past year.
Each year for the past three years, NERSC's User Services Group has asked users to complete a survey rating the center's systems and services and asking for suggestions on how NERSC can better serve its users. With each survey, the number of users participating has increased. The level of user satisfaction has also risen each year, in part because the NERSC staff implements new procedures based on comments from users. For example, based in large part on the input received in the Fiscal Year 1999 survey, NERSC made the following improvements: