
Performance Evaluation Research Center

The Performance Evaluation Research Center (PERC) is a SciDAC Integrated Software Infrastructure Center (ISIC) with two goals: to develop a science for understanding the performance of scientific applications on high-end computer systems, and to develop engineering strategies for improving performance on these systems.

Figure 2   Estimation results for NERSC’s IBM SP (Seaborg) with variable sample size: (a) observed free-processor fraction, (b) estimated free-processor fraction. Among the total 1,920 measurements, the estimation was within the required 8% accuracy 1,747 times, corresponding to an estimation success rate of 91%. The average sample size obtained was 46 processors, or 1.5% of the machine size.
One of PERC’s first milestones was the installation at NERSC of the Hockney development platform, one frame of an SP system donated by IBM and dedicated to the PERC effort. Named in honor of the late British computer scientist Roger Hockney, this system will be used for software testing and development. Other PERC accomplishments included analyzing the key codes EVH1, AORSA3D, PCTM, CCSM2.0, MILC, and pVarDen, and in many cases identifying changes that improve their performance; developing a highly effective tool for modeling the performance of large-scale codes; developing a data-dependent memory tracing tool; and developing a Fortran interface for a performance assertion prototype.

A joint PERC/Los Alamos Computer Science Institute/NSF project addressed the growing problem of monitoring systems with thousands of components. In the resulting publication, Mendes and Reed proposed a new technique for monitoring large systems based on statistical sampling. Instead of checking every system component individually, they select a statistically valid subset of components, inspect this subset in detail, and derive estimates for the whole system based on the properties found in the subset. Their experiments demonstrated the effectiveness of these techniques for estimating the fraction of available processors in parallel machines (Figure 2), the fraction of network sites reachable from a certain point, and the mean latency expected from that point to the rest of the network. These results show that one can reliably estimate the state of a large system at a small fraction of the cost required by traditional monitoring schemes. This cost reduction, in turn, can enable measurements that would be impractical by regular means, and can also enable the use of more powerful algorithms for system management and for optimized resource utilization.
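The sampling idea described above can be illustrated with a minimal Python sketch. It is not the procedure from the Mendes and Reed paper: it assumes simple random sampling and the textbook sample-size formula for a proportion with a finite population correction, and the names `required_sample_size`, `estimate_free_fraction`, and `is_free`, the 95% confidence level (`z=1.96`), and the synthetic 3,000-processor machine are all hypothetical choices made for illustration.

```python
import math
import random


def required_sample_size(population, margin, z=1.96, p_guess=0.5):
    """Textbook sample size for estimating a proportion within +/- margin.

    Assumption: simple random sampling with a finite population
    correction; p_guess=0.5 is the most conservative prior guess.
    """
    n0 = (z ** 2) * p_guess * (1.0 - p_guess) / (margin ** 2)
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return min(population, math.ceil(n))


def estimate_free_fraction(is_free, population, sample_size, rng):
    """Probe a random subset of processor ids and return the free fraction.

    is_free is a hypothetical callback that inspects one processor.
    """
    sampled = rng.sample(range(population), sample_size)
    free = sum(1 for node in sampled if is_free(node))
    return free / sample_size


if __name__ == "__main__":
    rng = random.Random(0)
    population = 3000  # synthetic machine size, not Seaborg's actual size
    # Synthetic ground truth: each processor is free with probability 0.3.
    state = [rng.random() < 0.3 for _ in range(population)]

    n = required_sample_size(population, margin=0.08)
    estimate = estimate_free_fraction(lambda i: state[i], population, n, rng)
    print("sample size:", n)
    print("estimated free fraction:", round(estimate, 3))
    print("true free fraction:", round(sum(state) / population, 3))
```

Under these assumptions only a small share of the processors is probed per measurement. The fixed, conservative sample size used here differs from the variable sample size reported in Figure 2, so the numbers it produces should not be compared with that figure.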


INVESTIGATORS
D. H. Bailey and E. Strohmaier, Lawrence Berkeley National Laboratory; J. Dongarra, University of Tennessee; D. A. Reed, University of Illinois; D. Quinlan, B. de Supinski, and J. Vetter, Lawrence Livermore National Laboratory; P. Worley and T. Dunigan, Oak Ridge National Laboratory; P. Hovland and B. Norris, Argonne National Laboratory; J. Hollingsworth, University of Maryland; A. Snavely, San Diego Supercomputer Center.

PUBLICATION
C. L. Mendes and D. A. Reed, “Monitoring large systems via statistical sampling,” Proc. of LACSI Symposium 2002, Santa Fe, NM, October 2002.

URLs
http://perc.nersc.gov/

 
NERSC Annual Report 2002