Effective System Performance (ESP) Benchmark
It is now generally recognized in the high performance computing community that peak performance does not adequately predict the usefulness of a system for a given set of applications. One of the first benchmarks designed to measure system performance in a real-world operational environment was NERSC's Effective System Performance (ESP) test. NERSC introduced ESP in 1999 with the hope that this test would be of use to system managers and would help to spur the community (both researchers and vendors) to improve system efficiency.
The discussion below uses examples from the Cray T3E system that NERSC was operating in 1999.
- Improved MPP System Efficiency Equals Million-Dollar Savings
- Theoretical Peak Performance Does Not Equal Real-World Results
- Effective System Performance (ESP): A New Metric
- Overview of the ESP Testing Process
Improved MPP System Efficiency Equals Million-Dollar Savings
Increasing the effectiveness of NERSC's Cray T3E from 80% to 90% would be equivalent to adding more than $2 million worth of hardware, for the following reasons:
- 644 processing elements (PEs) running at 90% are equivalent to 725 PEs running at 80%.
- 81 PEs are needed to make up the difference.
- A PE costs ~$50,000 list, $25,000 discounted.
- 81 × $25,000 = $2,025,000.
In fact, over 18 months from October 1997 to July 1999, NERSC increased T3E utilization from ~55% to ~90% -- a value of $10.25 million (Fig. 1). This is almost equivalent to the improvement in processor/price performance from Moore's Law.
Fig. 1: Over 18 months, NERSC increased Cray T3E utilization from ~55% to ~90% -- a value of $10.25 million.
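The arithmetic behind both dollar figures can be reproduced with a short sketch. The function names are illustrative, and the sketch assumes that fractional PEs are rounded up to whole units and that the discounted price of $25,000 per PE applies throughout.

```python
import math

DISCOUNTED_PE_COST = 25_000  # dollars per PE, discounted price quoted in the text


def extra_pes(pes: int, new_pct: int, old_pct: int) -> int:
    """Additional PEs needed at old_pct utilization to match
    the useful work of `pes` PEs running at new_pct utilization."""
    equivalent = math.ceil(pes * new_pct / old_pct)  # round up to whole PEs
    return equivalent - pes


def hardware_value(pes: int, new_pct: int, old_pct: int,
                   cost_per_pe: int = DISCOUNTED_PE_COST) -> int:
    """Dollar value of the utilization gain, priced as extra hardware."""
    return extra_pes(pes, new_pct, old_pct) * cost_per_pe


# 80% -> 90% on 644 PEs
print(extra_pes(644, 90, 80), hardware_value(644, 90, 80))
# ~55% -> ~90% on 644 PEs
print(extra_pes(644, 90, 55), hardware_value(644, 90, 55))
```

Under these assumptions the first case gives 81 PEs and $2,025,000, and the second gives 410 PEs and $10,250,000, matching the figures quoted above.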
Theoretical Peak Performance Does Not Equal Real-World Results
Peak operations per second says nothing about how specific scientific codes will perform. The percentage of peak performance achieved on NERSC's Cray T3E varies widely (Table 1 and Fig. 2).
| Benchmark | System Performance | Single-Processor Performance | % of Peak |
|---|---|---|---|
| Theoretical peak | 580.0 Gflop/s | 900 Mflop/s | 100.0% |
| Linpack | 444.2 Gflop/s | 690 Mflop/s | 76.6% |
| LSMS code (Locally Self-consistent Multiple Scattering), 1998 Gordon Bell Prize-winning application | 256.0 Gflop/s | 398 Mflop/s | 44.1% |
| Average of seven major NERSC applications | 67.0 Gflop/s | ~104 Mflop/s | 11.6% |
| NAS Parallel Benchmarks | 29.6 Gflop/s | ~46 Mflop/s | 5.1% |
Fig. 2: Percent of peak performance achieved varies widely on NERSC's Cray T3E, depending on the benchmark.
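The percent-of-peak column follows directly from the system performance column. The sketch below recomputes it, along with an approximate per-processor rate. It assumes the 644-PE count from the utilization example also applies to these measurements, which the table itself does not state.

```python
PEAK_GFLOPS = 580.0  # theoretical system peak from the table
N_PES = 644          # assumption: PE count taken from the utilization example

# System performance in Gflop/s, as listed in the table
RESULTS = {
    "Linpack": 444.2,
    "LSMS": 256.0,
    "Average of seven major NERSC applications": 67.0,
    "NAS Parallel Benchmarks": 29.6,
}


def percent_of_peak(gflops: float, peak: float = PEAK_GFLOPS) -> float:
    """Fraction of theoretical peak achieved, as a percentage."""
    return 100.0 * gflops / peak


def per_pe_mflops(gflops: float, n_pes: int = N_PES) -> float:
    """Approximate single-processor rate in Mflop/s."""
    return 1000.0 * gflops / n_pes


for name, gflops in RESULTS.items():
    print(f"{name}: {percent_of_peak(gflops):.1f}% of peak, "
          f"~{per_pe_mflops(gflops):.0f} Mflop/s per PE")
```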
Effective System Performance (ESP): A New Metric
ESP is designed to evaluate systems for overall effectiveness, independent of processor performance. The ESP test suite simulates "a day in the life of an MPP" by measuring total system utilization. Results take into account both hardware (PE, memory, disk) and system software performance. Developed by NERSC as part of a major system procurement process, ESP is designed to predict the effectiveness of a system before purchase, as well as to evaluate system changes before implementation.
The goals of the ESP benchmark include:
- determine how well an existing system supports a particular scientific workload
- assess systems for that workload before purchase
- provide quantitative effectiveness information regarding system enhancements
- compare different systems on a single workload or discipline
- compare system-level performance on workloads derived from different disciplines
- compare different systems for different workloads.
Overview of the ESP Testing Process
The ESP test employs a suite of scientific applications. The standard NERSC ESP suite is a set of 82 individual jobs, two of which are full-configuration jobs, that is, calculations that use the entire system. Except for the two full-configuration jobs, the jobs are submitted to the system's job scheduling software in an order determined by a pseudo-random number generator.
At time zero, jobs are submitted until the total number of PEs requested by the jobs in the queue exceeds twice the number of PEs available in the system (Fig. 3). Several minutes after the start of the test, another batch of pseudo-randomly selected jobs is submitted, until the total number of PEs requested by the new jobs in the queue exceeds the number of PEs available. Again several minutes later, the remainder of the jobs in the suite (except for the full-configuration jobs) are submitted.
Fig. 3: The ESP test measures total system performance by simulating a realistic MPP workload.
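The three-stage submission rule can be sketched as follows. This is an illustration only, not the ESP software: the job sizes, the 512-PE system size, and the random seed are hypothetical, and the timing between batches is left out.

```python
import random


def submission_batches(job_pes, total_pes):
    """Split an ordered list of per-job PE requests into three batches.

    Batch 1: submit until requested PEs exceed 2 * total_pes.
    Batch 2: submit until the new requests exceed total_pes.
    Batch 3: everything that remains.
    """
    remaining = list(job_pes)
    batches = []
    for threshold in (2 * total_pes, total_pes):
        batch, requested = [], 0
        # Keep submitting until the running total exceeds the threshold
        while remaining and requested <= threshold:
            pes = remaining.pop(0)
            batch.append(pes)
            requested += pes
        batches.append(batch)
    batches.append(remaining)
    return batches


# Hypothetical mix of 80 non-full-configuration jobs on a 512-PE system
sizes = [32] * 30 + [64] * 25 + [128] * 15 + [256] * 10
random.Random(1999).shuffle(sizes)  # stand-in for the suite's pseudo-random order

for number, batch in enumerate(submission_batches(sizes, 512), start=1):
    print(f"batch {number}: {len(batch)} jobs, {sum(batch)} PEs requested")
```

The two full-configuration jobs are excluded here because they are submitted separately at fixed times.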
Twenty-four minutes into the test, the first of the two full-configuration jobs is submitted. This job must run immediately upon submission, ahead of every queued job that has not yet started. This may be achieved by one of the following methods:
- Disabling the input job queue, waiting until all running jobs have completed, and then running the full-configuration code.
- Terminating the running jobs, allowing the full-configuration job to run, and then restarting the terminated jobs later.
- Employing some form of concurrent, priority scheduling system that permits the full-configuration job to be run on nodes currently being used by other jobs.
- Saving the currently running jobs, using a system-initiated checkpoint-restart facility, allowing the full-configuration job to be run, and then resuming the other jobs later.
Immediately after the full-configuration job is completed, the system is entirely shut down and then rebooted. After the system has been rebooted, the job suite is restarted.
Three hours after the test has started, a second full-configuration job is submitted. All codes must run to completion, with verified correct results.
The system effectiveness ratio, E, is then computed as

$$E = \frac{\sum_{i=1}^{n} p_i \, t_i}{P \, (S + T)}$$

where
- $p_i$ = number of processors utilized by job i
- $t_i$ = wall clock run time in seconds required by job i on a dedicated system
- n = number of individual jobs run during the test
- P = total number of processors in the system
- S = time required for shutdown and reboot
- T = total elapsed wall clock time for test (not counting shutdown and reboot).
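A minimal sketch of the computation is given below. It assumes that E is the processor-seconds of useful work, summed over all jobs, divided by the processor-seconds the system offered over the whole test including shutdown and reboot, that is, P × (S + T). The job list and timings are hypothetical.

```python
def effectiveness(jobs, total_pes, elapsed_s, reboot_s):
    """System effectiveness ratio E.

    jobs      -- iterable of (p_i, t_i): processors used and dedicated-system
                 wall clock run time in seconds for each job
    total_pes -- P, total processors in the system
    elapsed_s -- T, elapsed wall clock time of the test, excluding reboot
    reboot_s  -- S, time required for shutdown and reboot
    """
    useful_work = sum(p * t for p, t in jobs)        # processor-seconds of work
    available = total_pes * (reboot_s + elapsed_s)   # processor-seconds offered
    return useful_work / available


# Hypothetical example: two jobs on a 512-PE system
jobs = [(512, 1000.0), (256, 2000.0)]
print(f"E = {effectiveness(jobs, total_pes=512, elapsed_s=2500.0, reboot_s=500.0):.3f}")
```

A perfectly packed schedule with no reboot cost would give E = 1. Scheduling gaps and time spent in shutdown and reboot both lower the ratio.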
While the ESP test as described above is targeted to the NERSC T3E, it can easily be modified to make it more appropriate for other systems.
- Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz and David H. Bailey, "ESP: A System Utilization Benchmark."
- Adrian Wong, Leonid Oliker, William Kramer, Teresa Kaltz and David Bailey, "System Utilization Benchmark on the Cray and IBM SP," for The 6th Workshop on Job Scheduling Strategies for Parallel Processing, April 19, 2000.
- Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz and David H. Bailey, "Evaluating System Effectiveness in High Performance Computing Systems," LBNL Technical Report #44542, November 11, 1999.