
The NERSC Center

Comprehensive Scientific Support

Support for Large-Scale Projects

Many of the important breakthroughs in computational science come from large, multidisciplinary, multi-institutional collaborations working with advanced codes and large datasets, such as the SciDAC and INCITE collaborations. These teams are in the best position to take advantage of terascale computers and petascale storage, and NERSC provides its highest level of support to these researchers. This support includes special service coordination for queues, throughput, and increased limits, as well as specialized consulting support, which may include algorithmic code restructuring to increase performance, I/O optimization, and visualization support: whatever it takes to make the computation scientifically productive.

The principal investigators of the three 2004 INCITE projects were unanimous in their praise of the support provided by NERSC staff. William A. Lester, Jr., PI of the “Quantum Monte Carlo Study of Photoprotection via Carotenoids in Photosynthetic Centers” project, said:

“The INCITE Award led us to focus on algorithm improvements that significantly facilitated the present calculations. One of these improved quantum Monte Carlo scaling … others included improved trial function construction and a more efficient random walk procedure. We have benefited enormously from the support of NERSC staff and management … and the Visualization Group has provided us with modes of presenting our work beyond our wildest imagination.”

Tomasz Plewa, leader of the project “Thermonuclear Supernovae: Stellar Explosions in Three Dimensions,” commented:

“We have found NERSC staff extremely helpful in setting up the computational environment, conducting calculations, and also improving our software. One example is help we have received when implementing an automatic procedure for code checkpointing based on the access to the queue remaining time. We have also been able to resolve problems with large I/O by switching to a 64-bit environment. Making this environment available on short notice was essential for the project.”

And P. K. Yeung, PI of the “Fluid Turbulence and Mixing at High Reynolds Number” INCITE project, offered these comments:

“The consultant services are wonderful. Our designated consultant has worked hard to help us, including responding to emails every day of the week. We have benefited from consultants’ comments on code performance, innovative ideas for improvement, and diagnostic assistance when user or system problems occur. The visualization staff have also been very good….

“We really appreciate the priority privilege that has been granted to us in job scheduling. This has allowed most of our jobs to start relatively quickly compared to what we experience at other sites…. Our INCITE Award has had tremendous impact on our work, and by opening up many exciting opportunities for collaborations and long-term partnerships, can be expected to bring many long-lasting advances in both turbulence and computational science in general.”

Integrated Performance Monitoring Simplifies Code Assessment

As the HPC center of choice for the DOE research community, NERSC consistently receives requests for more computing resources than are available. Because computing time is so valuable, making the most of every allocated processor-hour is a paramount concern. Evaluating the performance of application codes in the diverse NERSC workload is an important and challenging endeavor. As NERSC moves toward running more large-scale jobs, finding ways to improve performance of large-scale codes takes on even greater importance.

For this reason, identifying bottlenecks to scalable performance of parallel codes has been an area of intense focus for NERSC staff. To identify and remove these scaling bottlenecks, David Skinner of NERSC’s User Services Group has developed Integrated Performance Monitoring, or IPM. IPM is a portable profiling infrastructure that provides a performance summary of the computation and communication in a parallel program. IPM has extremely low overhead, is scalable to thousands of processors, and was designed with a focus on ease of use, requiring no source code modification. These characteristics are the right recipe for measuring application performance in a production environment like NERSC’s, which consists of hundreds of projects and parallelism ranging from 1 to 6,000 processors.

Skinner points to the lightweight overhead and fixed memory footprint of IPM as one of its biggest innovations. Unlike performance monitoring based on traces, which consume more resources the longer the code runs, IPM enforces strict boundaries on the resources devoted to profiling. By using a fixed memory hash table, IPM achieves a compromise between providing a detailed profile and avoiding impact on the profiled code.
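The article does not show IPM's internals, so the following is only a minimal Python sketch of the general idea of profiling into a fixed-memory hash table. It assumes events are keyed by call name and message size; the class name `FixedProfileTable`, the default capacity, and the overflow counter are hypothetical illustrations, not IPM's actual data structures.

```python
class FixedProfileTable:
    """Fixed-capacity, open-addressing table of (call name, byte count) events.

    Hypothetical illustration only: memory is allocated once, so the
    footprint does not grow with the length of the profiled run.
    """

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.keys = [None] * capacity    # one slot per distinct event signature
        self.count = [0] * capacity      # number of times the event occurred
        self.total = [0.0] * capacity    # accumulated seconds for the event
        self.overflow = 0                # events dropped because the table was full

    def record(self, name, nbytes, seconds):
        """Accumulate one event; return False if it had to be dropped."""
        key = (name, nbytes)
        start = hash(key) % self.capacity
        for probe in range(self.capacity):           # linear probing
            slot = (start + probe) % self.capacity
            if self.keys[slot] is None:
                self.keys[slot] = key                # claim an empty slot
            if self.keys[slot] == key:
                self.count[slot] += 1
                self.total[slot] += seconds
                return True
        self.overflow += 1                           # table full: count the loss, never grow
        return False

    def summary(self):
        """Return (key, count, total seconds) rows, most expensive first."""
        rows = [(k, c, t)
                for k, c, t in zip(self.keys, self.count, self.total)
                if k is not None]
        return sorted(rows, key=lambda row: row[2], reverse=True)
```

In this sketch a repeated event only updates counters in its existing slot, so a long run costs no more memory than a short one. The price is the compromise described above: once every slot is taken, new event signatures are counted in `overflow` rather than stored, which is the opposite of a trace that keeps every event.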

IPM was also designed to be portable. It runs on the IBM SP, Linux clusters, Altix, Cray X1, NEC SX6, and the Earth Simulator. Portability is key to enabling cross-platform performance studies. Portability, combined with IPM’s availability under an open source software license, will hopefully lead to other centers adopting and adding to the IPM software.

Skinner characterizes IPM as a “profiling layer” rather than a performance tool. “The idea is that IPM can provide a high-level performance summary which feeds both user and center efforts to improve performance,” Skinner said. “IPM finds ‘hot spots’ and bottlenecks in parallel codes. It also identifies the overall characteristics of codes and determines which compute resources are being used by a code. It really provides a performance inventory. Armed with that information, users can improve their codes and NERSC can better provide compute resources aligned to meet users’ computational needs.”

IPM automates a number of monitoring tasks that Skinner and other HPC consultants used to perform manually. By running a code with IPM, NERSC staff can quickly generate a comprehensive performance picture of a code, with the information presented both graphically (Figure 6) and numerically.

   
 
Figure 6. IPM can graphically present a wide range of data, including communication balance by task, sorted by (a) MPI rank or (b) MPI time.


The monitors that IPM currently integrates include a wide range of MPI communication statistics; HPM (Hardware Performance Monitor) counters for things like flop rates, application memory usage, and process topology; and system statistics such as switch traffic.

The integration in IPM is multi-faceted, including binding the above information sources together through a common interface, and also integrating the records from all the parallel tasks into a single report. On some platforms IPM can be integrated into the execution environment of a parallel computer. In this way, an IPM profile is available either automatically or with minor effort. The final level of integration is the collection of individual performance profiles into a database that synthesizes the performance reports via a Web interface. This Web interface can be used by all those concerned with parallel code performance: users, HPC consultants, and HPC center managers. As different codes are characterized, the results are posted to protected Web pages. Users can access only the pages for the codes they are running.
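As a rough illustration of what integrating the records from all the parallel tasks into a single report can involve, the sketch below merges per-task timing dictionaries into one summary with totals, extremes, and the slowest rank for each event. The function name `merge_task_profiles` and the input layout (one dictionary per MPI rank) are assumptions made for this example; the article does not describe IPM's actual report format.

```python
def merge_task_profiles(profiles):
    """Merge per-task profiles into one report.

    `profiles` is a list indexed by task rank; each entry maps an event
    name to the seconds that task spent in it (hypothetical layout).
    """
    ntasks = len(profiles)
    names = sorted({name for profile in profiles for name in profile})
    report = {}
    for name in names:
        # A task that never recorded the event contributes zero seconds.
        times = [profile.get(name, 0.0) for profile in profiles]
        slowest = max(times)
        report[name] = {
            "total": sum(times),
            "min": min(times),
            "max": slowest,
            "avg": sum(times) / ntasks,
            "slowest_rank": times.index(slowest),
        }
    return report


if __name__ == "__main__":
    per_task = [
        {"MPI_Allreduce": 1.0, "MPI_Send": 0.5},   # rank 0
        {"MPI_Allreduce": 3.0},                    # rank 1
    ]
    for event, stats in merge_task_profiles(per_task).items():
        print(event, stats)
```

A large gap between `min` and `max` for a communication call is one simple indicator of load imbalance across tasks, which is the kind of pattern a per-task communication balance plot is meant to expose.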

One of the first uses for IPM was to help the initial three INCITE projects make the most effective use of their large allocations. Subsequently it has been expanded to other projects. Even a small improvement — say 5 percent — in a code that runs on a thousand processors for millions of processor-hours is a significant gain for the center. “Our primary goal is to help projects get the most out of their allocated time,” Skinner said.
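To make the scale of such a gain concrete, here is a small worked calculation. The figures are illustrative assumptions rather than numbers from NERSC: an allocation of one million processor-hours, a 5 percent reduction in run time, and jobs that use 1,000 processors. The function name is likewise hypothetical.

```python
def processor_hours_saved(allocation_hours, percent_reduction):
    """Processor-hours freed when run time drops by `percent_reduction` percent."""
    return allocation_hours * percent_reduction / 100


if __name__ == "__main__":
    allocation = 1_000_000      # assumed allocation, in processor-hours
    processors = 1_000          # assumed job size
    saved = processor_hours_saved(allocation, 5)
    print("processor-hours saved:", saved)
    print("wall-clock hours on", processors, "processors:", saved / processors)
```

Under these assumptions the saving is 50,000 processor-hours, or 50 wall-clock hours of a 1,000-processor job, which can then be spent on additional science runs.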

But the same information is also interesting to the center itself. Obtaining a center-wide picture of how computational resources are used is important to knowing that the right resources are being presented and in the right way. It also guides choices about what future NERSC computational resources should look like. For example, IPM shows which parts of MPI are widely used by NERSC customers and to what extent. “It’s good to know which parts of MPI our customers are using,” Skinner said. “As an HPC center this tells us volumes about not only what we can do to make codes work better with existing resources as well as what future CPUs and interconnects should look like.”

“We are looking for other programmers to contribute to IPM,” Skinner added. “IPM complements existing platform-specific performance tools by providing an easy-to-use profiling layer that can motivate and guide the use of more detailed, in-depth performance analysis.”

More information about IPM is available at http://www.nersc.gov/projects/ipm/.

New Capabilities for Remote Visualization of Large Data Sets

In 2002 NERSC hosted a workshop to solicit input from the user community on future visualization and analysis requirements. The workshop findings report² cited several areas of user concern and need. One is the need for more capable visualization and analysis tools to address the challenges posed by the ever larger and more complex data being collected from experiments and generated by simulations. Another is the need for centralized deployment and management of general-purpose visualization resources. Yet another is the need for remote visualization software that can be used by multiple members of distributed teams.

During 2004, the NERSC Center took steps towards addressing these concerns. First, the Center consolidated management of licensed visualization software to a set of redundant license servers, thus streamlining and simplifying license management. Consolidating licenses from different machines into a larger license pool has increased the number of licenses available to users for certain applications, such as IDL. The result is increased availability of licensed visualization software at a reduced cost of managing licenses.

Figure 7. Screen shot of the EnSight Gold display client rendering multiple views of GTC fusion simulation results. (Image: Berkeley Lab Visualization Group and Scott Klasky, Princeton Plasma Physics Laboratory)

An additional benefit of the central license servers is that remote users can now check out a license for a commercial visualization application and run it on their remote desktop. In some cases, the commercial applications are quite expensive, so the user community benefits by being able to take advantage of resources paid for and maintained by NERSC. More information is available at the “Remote License Services at NERSC” Web page.³

To address the need for more capable visualization tools that can be used in a remote and distributed setting, NERSC has purchased and deployed EnSight Gold, an interactive visualization application from CEI. EnSight runs in a pipelined fashion, with visualization processing taking place at NERSC, and rendering taking place on the user’s local desktop, which makes it particularly useful when data files are too large to download for local analysis. The “back end” that performs the visualization processing at NERSC can be run in parallel, scaling to accommodate large data files. EnSight also supports collaborative interaction — multiple persons at different locations may run the EnSight display client simultaneously, all viewing and interacting with the same data set running at NERSC (Figure 7).

The Berkeley Lab Visualization Group is evaluating several additional new visualization applications that may be useful for remotely visualizing large scientific data sets in the NERSC environment.

User Surveys Prompt Improvements

For the sixth consecutive year, NERSC conducted a survey and invited all users to provide input on how well the Center is meeting their HPC requirements and expectations. Not only do the survey responses indicate how well the Center is performing and where it can improve, but the feedback is also used to implement changes to improve systems and services. In 2004, 209 users responded to the survey.

Areas with the highest user satisfaction include the HPSS mass storage system (both its reliability and its availability), account support services, and HPC consulting. In addition to overall consulting support, particular areas rated highly were the timely initial response to consulting questions and the follow-up to those questions. The largest increases in satisfaction over last year’s survey came from training classes attended in person, visualization services, the HPSS and Seaborg Web pages, and software bug resolution.

Areas with the lowest user satisfaction include Seaborg’s batch turnaround time and queue structure, as well as services used by only small numbers of users. These services include the math and visualization servers, Grid services, and training classes presented over the Access Grid. The areas rated significantly lower this year include Seaborg and available computing hardware.

In the section “What does NERSC do well?” 118 respondents gave top marks to access to powerful computing resources, excellent support services and staff, reliable hardware and a well-managed center, an easy-to-use environment for users, general satisfaction, HPSS, and the documentation on the NERSC Web site.

When it came to “What should NERSC do differently?” 94 responses focused mainly on issues involving Seaborg. The top request was to improve Seaborg’s turnaround time, followed by changing the job scheduling policies for Seaborg, especially midrange jobs. There were also requests to provide more computing resources and to improve the allocations process.

Such comments often lead to real change. As a result of the 2003 survey, NERSC made the following improvements:

Complete results of the 2004 user survey can be found at http://www.nersc.gov/news/survey/2004/.

_______________

2 B. Hamann, E. W. Bethel, H. D. Simon, and J. Meza, NERSC “Visualization Greenbook”: Future Visualization Needs of the DOE Computational Science Community Hosted at NERSC, Lawrence Berkeley National Laboratory report LBNL-51699 (2002), http://vis.lbl.gov/Publications/2002/VisGreenFindings-LBNL-51699.pdf.

3 http://www.nersc.gov/nusers/services/licenses/overview.php