National Energy Research Scientific Computing Center 2004 Annual Report
Developing metrics for petascale facilities
In spring of 2006, Dr. Raymond Orbach, the Department of Energy Under Secretary for Science, asked the Advanced Scientific Computing Research Advisory Committee (ASCAC) “to weigh and review the approach to performance measurement and assessment at [ALCF, NERSC, and NLCF], the appropriateness and comprehensiveness of the measures, and the [computational science component] of the science accomplishments and their effects on the Office of Science’s science programs.” To respond to the charge, the Advisory Committee formed a subcommittee co-chaired by Gordon Bell of Microsoft and James Hack of the National Center for Atmospheric Research.
NERSC has long used goals and metrics to ensure that what we do meets the needs of DOE and its scientists. Hence, it was natural for NERSC to take the lead, working with representatives from the other sites to formulate a joint plan for metrics. Together with the other sites, NERSC then reviewed all the information and suggestions.
The committee report, accepted in February 2007, identified two classes of metrics—control metrics and observed metrics. Control metrics have specific goals which must be met, and observed metrics are used for monitoring and assessing activities. The subcommittee felt that there should be free and open access to the many observed metrics computing centers collect and utilize, but “it would be counter-productive to introduce a large number of spurious ‘control’ metrics beyond the few we recommend below.”
The committee report pointed out, “It should be noted that NERSC pioneered the concept of ‘project specific services’ which it continues to provide as part of SciDAC and INCITE projects.” Another panel recommendation is that all centers “use a ‘standard’ survey based on the NERSC user survey that has been used for several years in measuring and improving service.”
The final committee report is available at http://www.sc.doe.gov/ascr/ASCAC/ASCAC_Petascale-Metrics-Report.pdf.
Software roadmap to plug and play petaflop/s
In the next five years, the DOE expects to field systems that reach a petaflop of computing power. In the near term (two years), DOE will have several “near-petaflops” systems that are 10% to 25% of a petaflop-scale system. A common feature of these precursors to petaflop systems (such as the Cray XT3 or the IBM BlueGene/L) is that they rely on an unprecedented degree of concurrency, which puts stress on every aspect of HPC system design. Such complex systems will likely break current “best practices” for fault resilience, I/O scaling, and debugging, and even raise fundamental questions about programming languages and application models. It is important that potential problems are anticipated far enough in advance that they can be addressed in time to prepare the way for petaflop-scale systems.
DOE asked the NERSC and Computational Research divisions at Lawrence Berkeley National Laboratory to address these issues by considering the following four questions:
- What software is on a critical path to make the systems work?
- What are the strengths/weaknesses of the vendors and of existing vendor solutions?
- What are the local strengths at the labs?
- Who are other key players who will play a role and can help?
Berkeley Lab responded to these questions in the report “Software Roadmap to Plug and Play Petaflop/s.”1 In addition to answering the four questions, this report provides supplemental information regarding NERSC’s effort to use non-invasive workload profiling to identify application requirements for future systems; describes a set of codes that provide good representation of the application requirements of the broader DOE scientific community; and provides a comprehensive production software requirements checklist that was derived from the experience of the NERSC-3, NERSC-4, and NERSC-5 procurement teams. It presents a detailed view of the software requirements for a fully functional petaflop-scale system environment, and assesses how emerging near-petaflop systems conform or fail to conform to these requirements.
The Berkeley view of the landscape of parallel computing research
The path to petascale computing will be paved with new system architectures featuring hundreds of thousands of manycore processors. Such systems will require scientists to completely rethink programming models. Among those computer scientists already looking to the petascale horizon are Science Driven Systems Architecture (SDSA) Team members John Shalf and Kathy Yelick, who are two of the co-authors of a white paper called “The Landscape of Parallel Computing Research: A View from Berkeley.” Based on two years of discussions among a multidisciplinary group of researchers, this paper addresses the challenge of finding ways to make it easy to write programs that run efficiently on manycore systems.
The creation of manycore architectures—hundreds to thousands of cores per processor—demands that a new parallel computing ecosystem be developed, one that is very different from the environment that supports the current sequential and multicore processing systems. Since real-world applications are naturally parallel and hardware is naturally parallel, what is needed is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems. The paper provides strategic suggestions on how to accomplish this (see http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf).
Another SDSA research collaboration is the RAMP Project (Research Accelerator for Multiple Processors), which focuses on how to build low cost, highly scalable hardware/software prototypes, given the increasing difficulty and expense of building hardware. RAMP is exploring emulation of parallel systems via field programmable gate arrays (FPGAs). Although FPGAs are slower than other types of hardware, they are much faster than simulators, and thus can be used to evaluate novel ideas in parallel architecture, languages, libraries, and so on.
Performance and potential of cell processor analyzed
Though it was designed as the heart of the new Sony PlayStation3 game console, the STI Cell processor also created quite a stir in the computational science community, where the processor’s potential as a building block for high performance computers has been widely discussed and speculated upon.
To assess Cell’s potential, LBNL computer scientists measured the processor’s performance in running several scientific application kernels, then compared this performance with other processor architectures. The group presented their findings in a paper at the ACM International Conference on Computing Frontiers, held May 2-6, 2006, in Ischia, Italy. An article about the paper in the HPCwire newsletter was the most-read item in the history of the newsletter, according to editor Michael Feldman, and after a mention on Slashdot, the paper was viewed online by about 30,000 readers.
The paper, “The Potential of the Cell Processor for Scientific Computing,” was written by Samuel Williams, Leonid Oliker, Parry Husbands, Shoaib Kamil and Katherine Yelick of Berkeley Lab’s Future Technologies Group and by John Shalf, head of NERSC’s Science-Driven System Architecture Team.
“Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency,” the authors wrote in their paper. “We also conclude that Cell’s heterogeneous multicore implementation is inherently better suited to the HPC environment than homogeneous commodity multicore processors.”
Cell, designed by a partnership of Sony, Toshiba, and IBM, is a high performance implementation of software-controlled memory hierarchy in conjunction with the considerable floating point resources that are required for demanding numerical algorithms. Cell takes a radical departure from conventional multiprocessor or multicore architectures. Instead of using identical cooperating commodity processors, it uses a conventional high performance PowerPC core that controls eight simple SIMD (single instruction, multiple data) cores, called synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller.
Despite its radical departure from mainstream general-purpose processor design, Cell is particularly compelling because it will be produced at such high volumes that it could be cost-competitive with commodity CPUs. At the same time, the slowing pace of commodity microprocessor clock rates and increasing chip power demands have become a concern to computational scientists, encouraging the community to consider alternatives like STI Cell. The authors examined the potential of using the forthcoming STI Cell processor as a building block for future high-end parallel systems by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, and 1D and 2D fast Fourier transforms.
According to the authors, the current implementation of Cell is most often noted for its extremely high single-precision (32-bit) floating point performance, but the majority of scientific applications require double precision (64-bit). Although Cell’s peak double-precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2 GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double-precision performance.
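The 14.6 Gflop/s figure can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only: the SIMD widths and issue intervals are assumptions not stated in this article (a 4-wide single-precision fused multiply-add every cycle, and a 2-wide double-precision fused multiply-add once every 7 cycles per SPE), and the function name is hypothetical.

```python
# Back-of-the-envelope peak rates for one Cell chip.
# Assumptions (not from the article): each SPE issues fused multiply-adds
# (2 flops per SIMD lane); single precision is 4 lanes wide and issues every
# cycle; double precision is 2 lanes wide and issues once every 7 cycles.

CLOCK_GHZ = 3.2      # clock rate quoted in the article
NUM_SPES = 8         # SPE count quoted in the article
FLOPS_PER_FMA = 2    # one multiply plus one add


def peak_gflops(simd_lanes, cycles_per_issue):
    """Peak Gflop/s summed over all SPEs for the given SIMD width and issue interval."""
    per_spe = simd_lanes * FLOPS_PER_FMA * CLOCK_GHZ / cycles_per_issue
    return per_spe * NUM_SPES


single = peak_gflops(simd_lanes=4, cycles_per_issue=1)
double = peak_gflops(simd_lanes=2, cycles_per_issue=7)

print(f"single-precision peak: {single:.1f} Gflop/s")
print(f"double-precision peak: {double:.1f} Gflop/s")
print(f"single/double ratio:   {single / double:.0f}x")
```

Under these assumptions the double-precision peak comes out at about 14.6 Gflop/s, roughly one fourteenth of the single-precision peak.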
The authors developed a performance model for Cell and used it to show direct comparisons of Cell with the AMD Opteron, Intel Itanium2 and Cray X1 architectures. The performance model was then used to guide implementation development that was run on IBM’s Full System Simulator in order to provide even more accurate performance estimates.
The authors argue that Cell’s three-level memory architecture, which decouples main memory accesses from computation and is explicitly managed by the software, provides several advantages over mainstream cache-based architectures. First, performance is more predictable, because the load time from an SPE’s local store is constant. Second, long block transfers from off-chip DRAM can achieve a much higher percentage of memory bandwidth than individual cache-line loads. Finally, for predictable memory access patterns, communication and computation can be effectively overlapped by careful scheduling in software.
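The third advantage, overlapping communication with computation, can be illustrated with a simple timing model. This is a sketch under stated assumptions rather than the authors' performance model: every block takes the same transfer time and the same compute time, transfers and computation can run fully in parallel (as with double buffering), and the function names are hypothetical.

```python
# Simple timing model for processing n_blocks of data, each needing one
# transfer into local memory followed by one compute step.
# Assumptions: uniform per-block times and perfect overlap when enabled.


def serial_time(n_blocks, t_transfer, t_compute):
    """No overlap: every block waits for its transfer, then computes."""
    return n_blocks * (t_transfer + t_compute)


def overlapped_time(n_blocks, t_transfer, t_compute):
    """Double buffering: while block i is computed, block i+1 is transferred.

    The first transfer and the last compute cannot be hidden; in between,
    each step costs whichever of the two activities is slower.
    """
    if n_blocks == 0:
        return 0
    return t_transfer + (n_blocks - 1) * max(t_transfer, t_compute) + t_compute


print(serial_time(4, 2, 3))      # 4 * (2 + 3)
print(overlapped_time(4, 2, 3))  # 2 + 3 * 3 + 3
```

In this model, when transfer and compute times are balanced and the number of blocks is large, overlapping approaches a factor-of-two reduction in run time; when one activity dominates, the other is almost entirely hidden.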
While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell’s peak double-precision performance is fourteen times slower than its peak single-precision performance. If Cell were to include at least one fully utilizable pipelined double-precision floating point unit, as proposed in their Cell+ implementation, these speedups would easily double.
The full paper can be read at http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf.
Integrated Performance Monitoring tool adopted by other HPC centers
Although supercomputing centers around the country operate different architectures and support separate research communities, they face a common challenge in making the most effective use of resources to maximize productivity. One method for doing this is to analyze the performance of various applications to identify the bottlenecks. Once identified, these performance speedbumps can often be smoothed out to get the application to run faster and improve utilization. This is especially important as applications and architectures scale to thousands or tens of thousands of processors.
In 2005 David Skinner, then a member of NERSC’s User Services Group (he now leads the Open Software and Programming Group), introduced Integrated Performance Monitoring, or IPM. IPM is a portable profiling infrastructure that provides a performance summary of the computation and communication in a parallel program. IPM has extremely low overhead, is scalable to thousands of processors, and was designed with a focus on ease of use, requiring no source code modification (see http://www.nersc.gov/nusers/resources/software/tools/ipm.php).
Skinner cites the lightweight overhead and fixed memory footprint of IPM as important innovations. Unlike performance monitoring based on traces, which consume more resources the longer the code runs, IPM enforces strict boundaries on the resources devoted to profiling. By using a fixed memory hash table, IPM achieves a compromise between providing a detailed profile and avoiding impact on the profiled code.
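The idea of a bounded profile can be sketched in a few lines. The example below is not IPM's implementation; it is a minimal illustration that assumes events are keyed by a name and a power-of-two size bucket, and that anything arriving after the table is full is folded into a single catch-all entry. The class name, method names, and event labels are hypothetical, and a Python dict only bounds the number of entries rather than giving a truly fixed memory footprint.

```python
class FixedProfile:
    """Profile with a bounded number of entries: key -> [count, total seconds]."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.table = {}
        self.overflow = [0, 0.0]  # catch-all used once the table is full

    def record(self, name, nbytes, seconds):
        # Bucket message sizes by power of two so similar calls share a key.
        key = (name, nbytes.bit_length())
        entry = self.table.get(key)
        if entry is None:
            if len(self.table) >= self.capacity:
                entry = self.overflow       # table full: lose detail, not memory
            else:
                entry = self.table[key] = [0, 0.0]
        entry[0] += 1
        entry[1] += seconds


profile = FixedProfile(capacity=2)
for _ in range(3):
    profile.record("MPI_Send", 1024, 0.1)
profile.record("MPI_Recv", 8, 0.2)
profile.record("MPI_Bcast", 64, 0.3)  # third distinct key goes to overflow

print(len(profile.table), profile.overflow)
```

A trace, by contrast, would append one record per event, so its size grows with run time; here repeated calls only update counters in place, which is the trade-off between detail and impact described above.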
IPM was also designed to be portable and runs on the IBM SP, Linux clusters, Altix, Cray X1, NEC SX6, and the Earth Simulator. Portability is key to enabling cross-platform performance studies. Portability, combined with IPM’s availability under an open source software license, has also led to other centers adopting and adding to the IPM software. Current users include the San Diego Supercomputer Center (SDSC), the Center for Computation and Technology (CCT) at Louisiana State University, and the Army Research Laboratory.
After hearing a presentation on IPM by Skinner in 2005, SDSC staff began porting IPM to their machines. By spring 2006 it was in production.
“We decided to use it because there is nothing else out there that is as easy to use and provides the information in an easy to understand form for our users,” said Nick Wright of SDSC’s Performance Modeling and Characterization Lab. “It has helped our center assist users with their performance-related issues.”
IPM is used quite extensively at SDSC to understand performance issues on both the IBM Power4+ system (Datastar) and the three-rack BlueGene system. In addition to regular usage by SDSC users, “IPM is used extensively by user services consultants and performance optimization specialists to diagnose and treat performance issues,” Wright said. In fact, research using IPM contributed to two technical papers written by SDSC staff and submitted to the SC07 conference.
At Louisiana State University’s CCT, the staff will soon start running IPM on a number of systems. Staff at the center first learned about the tool from NERSC’s John Shalf, who has longstanding ties to the LSU staff, and learned more during a recent visit by Skinner. According to Dan Katz, assistant director for Cyberinfrastructure Development at CCT, a number of CCT users are already familiar with IPM as they have run applications at both NERSC and SDSC.
LSU operates two sets of HPC resources. At LSU itself, mostly for local users and collaborators, CCT is in the process of installing a new ~1500-core Linux system, in addition to some smaller IBM Power5 systems and a few other small systems. For the Louisiana Optical Network Initiative (LONI), a statewide network of computing and data services, CCT is installing six Linux systems with a total of ~9000 cores, and six IBM Power5 systems across the state.
“In both cases, we plan to run IPM, initially on user request, and longer term automatically all the time,” Katz said. “We will do this for two reasons: First, to help users understand the performance of their applications, and therefore, to be able to improve performance. Second, it will help us understand how our systems are being used, which helps us understand what systems we should be investigating for future installation.”
Reducing the red tape in research partnerships
While research partnerships with other government agencies or academic institutions are the most common types of partnerships for NERSC, collaboration with commercial computer and software vendors has also produced valuable results over the years. For example, collaborations with IBM have made possible the use of GPFS software with systems from multiple vendors for the NERSC Global Filesystem, the evaluation of the Cell processor for scientific computing, and the development of the eight-processor node in the IBM Power line of processors.
Protecting intellectual property is always a priority in collaborative research, so partnerships with vendors typically involve negotiation of agreements on issues such as non-disclosure, source code licensing, or trial/beta testing. With an organization as large and dispersed as IBM, frequent collaboration can mean repeated negotiations on similar issues with different units of the organization, a time-consuming task.
NERSC staff thought there had to be a better way of setting up research agreements without starting from scratch every time. A meeting between Bill Kramer and Bill Zeitler, senior vice president and group executive of IBM’s Systems and Technology Group, led to the formation of a task force to address the issue.
The Berkeley Lab team included Kramer, NERSC procurement specialist Lynn Rippe, and Cheryl Fragiadakis and Seth Rosen from the Lab’s Technology Transfer Department, with input from Kathy Yelick, Lenny Oliker, and Jim Craw. Working with IBM’s Doug Duberstein, they developed a set of master agreements between NERSC and IBM that could be used as starting points. IBM also set up a central repository where these agreements could be accessed throughout the organization.
For example, the master non-disclosure agreement contains the terms and conditions that the two organizations have negotiated over the years, along with a list of the types of disclosed information that have already been agreed on. When a new project involves some other type of information, the two parties just have to add a supplement that specifies that information, without having to renegotiate all the other terms and conditions. Or, to test a new trial software package, NERSC and IBM simply sign a supplement adding that software package to the master testing and licensing agreements.
These master agreements are expected to save a substantial amount of time for both organizations. NERSC is making the language in these agreements available to other laboratories so that they can simplify their negotiations too.
1 William T.C. Kramer, Jonathan Carter, David Skinner, Lenny Oliker, Parry Husbands, Paul Hargrove, John Shalf, Osni Marques, Esmond Ng, Tony Drummond, and Kathy Yelick, “Software Roadmap to Plug and Play Petaflop/s,” Lawrence Berkeley National Laboratory report LBNL-59999 (July 2006), http://www.nersc.gov/news/reports/LBNL-59999.pdf.