| Performance Evaluation Research Center
The Performance Evaluation Research Center (PERC) is an
Integrated Software Infrastructure Center (ISIC) whose goals
are to develop a science for understanding the performance of
scientific applications on high-end computer systems, and to
develop engineering strategies for improving performance on
these systems.
 |
| Figure 2. Estimation results for NERSC’s
IBM SP (Seaborg) with variable sample size: (a) observed
free-processor fraction, (b) estimated free-processor
fraction. Of the 1,920 measurements, the estimate
was within the required 8% accuracy 1,747 times, corresponding
to an estimation success rate of 91%. The average sample
size obtained was 46 processors, or 1.5% of the machine
size. |
|
One of PERC’s first milestones was the installation of
the Hockney development platform at NERSC, one frame of an
SP system donated by IBM and dedicated to the PERC effort.
Named in honor of the late British computer scientist Roger
Hockney, this system will be used for software testing and
development. Other PERC accomplishments included analyzing
the key codes EVH1, AORSA3D, PCTM, CCSM2.0, MILC, and pVarDen,
identifying changes to improve performance in many cases;
developing a highly effective tool for modeling performance
of large-scale codes; developing a data-dependent memory tracing
tool; and developing a Fortran interface for a performance
assertion prototype.
A joint PERC/Los Alamos Computer Science Institute/NSF project
addressed the growing problem of monitoring systems with thousands
of components. In the resulting publication, Mendes and Reed
proposed a new technique for monitoring large systems based
on statistical sampling. Instead of checking every system
component individually, they select a statistically valid
subset of components, inspect this subset in detail, and derive
estimates for the whole system based on the properties found
in the subset. Their experiments demonstrated the effectiveness
of these techniques for estimating the fraction of available
processors in parallel machines (Figure 2), the fraction of
network sites reachable from a certain point, and the mean
latency expected from that point to the rest of the network.
These results show that one can reliably estimate the state
of a large system at a small fraction of the cost required
by traditional monitoring schemes. This cost reduction, in
turn, can enable measurements that would be impractical by
regular means, and can also enable the use of more powerful
algorithms for system management and for optimized resource
utilization.
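The sampling idea described above can be illustrated with a minimal sketch. It is not the procedure from the Mendes and Reed paper: it assumes the textbook normal-approximation formula for the sample size needed to estimate a proportion, with a finite population correction, and a fixed worst-case guess for the proportion rather than a variable sample size. The function names, the `is_free` callback, and the 3,000-processor machine are hypothetical and serve only as an illustration.

```python
import math
import random


def required_sample_size(population, margin, z=1.96, p_guess=0.5):
    """Sample size for estimating a proportion within +/- margin.

    Textbook normal-approximation formula with a finite population
    correction (an assumption, not taken from the paper).
    p_guess=0.5 is the most conservative choice.
    """
    n0 = (z ** 2) * p_guess * (1.0 - p_guess) / margin ** 2
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return min(population, math.ceil(n))


def estimate_free_fraction(is_free, population, margin, rng, z=1.96):
    """Probe a random subset of processors and extrapolate to the machine.

    is_free: hypothetical callback, is_free(i) -> bool for processor i.
    Returns (estimate, half_width, sample_size).
    """
    n = required_sample_size(population, margin, z)
    sample = rng.sample(range(population), n)  # sampling without replacement
    free = sum(1 for i in sample if is_free(i))
    p_hat = free / n
    # Standard error shrinks when the sample is a large share of the machine.
    fpc = (population - n) / (population - 1) if population > 1 else 0.0
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n * fpc)
    return p_hat, half_width, n


if __name__ == "__main__":
    rng = random.Random(0)
    total = 3000  # hypothetical machine size
    state = [rng.random() < 0.30 for _ in range(total)]  # simulated free/busy flags
    true_fraction = sum(state) / total
    est, hw, n = estimate_free_fraction(lambda i: state[i], total, 0.08, rng)
    print(f"probed {n} of {total}: estimate {est:.3f} +/- {hw:.3f}, true {true_fraction:.3f}")
```

With these hypothetical inputs the worst-case formula asks for 143 probes out of 3,000 processors. A scheme that adapts the sample size to the observed variability, as the variable sample sizes in Figure 2 suggest, can get by with fewer.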
INVESTIGATORS
D. H. Bailey and E. Strohmaier, Lawrence Berkeley National
Laboratory; J. Dongarra, University of Tennessee; D. A. Reed,
University of Illinois; D. Quinlan, B. de Supinski, and J.
Vetter, Lawrence Livermore National Laboratory; P. Worley
and T. Dunigan, Oak Ridge National Laboratory; P. Hovland
and B. Norris, Argonne National Laboratory; J. Hollingsworth,
University of Maryland; A. Snavely, San Diego Supercomputer
Center.
PUBLICATION
C. L. Mendes and D. A. Reed, “Monitoring large systems
via statistical sampling,” Proc. of LACSI Symposium
2002, Santa Fe, NM, October 2002.
URLs
http://perc.nersc.gov/
|