NERSC Reaches Another Checkpoint/Restart Milestone
June 1, 2005
On the weekend of June 11 and 12, IBM personnel used NERSC’s Seaborg supercomputer for dedicated testing of IBM’s latest HPC Software Stack, a set of tools for high performance computing. To maximize system utilization for NERSC users, instead of “draining” the system (letting running jobs continue to completion) before starting this dedicated testing, NERSC staff checkpointed all running jobs at the start of the testing period. “Checkpointing” means stopping a program in progress and saving the current state of the program and its data — in effect, “bookmarking” where the program left off so it can start up later in exactly the same place.
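The "bookmarking" idea can be illustrated with a small application-level sketch. This is only an analogy under stated assumptions: the checkpoint/restart software used on Seaborg operates at the system level and its interface is not described here, whereas the sketch below has a toy program save its own state. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the file layout, and the `stop_after` parameter are all hypothetical, chosen for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over any older checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from a checkpoint if present.

    stop_after (hypothetical) simulates an interruption: the job saves a
    checkpoint and exits after executing that many steps in this invocation.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            save_checkpoint(path, state)
            return None  # interrupted; work so far is preserved on disk
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
    return state["acc"]
```

Calling `run("job.ckpt", 10, stop_after=4)` stops partway and writes the checkpoint file; a later call to `run("job.ckpt", 10)` picks up at the saved step rather than starting over, and returns the same result an uninterrupted run would have produced.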
This is believed to be the first full-scale use of the checkpoint/restart software with an actual production workload on an IBM SP, as well as the first checkpoint/restart on a system with more than 2,000 processors. It is the culmination of a collaborative effort between NERSC and IBM that began in 1999. Of the 44 jobs for which a checkpoint was attempted, 29 (about 66%) checkpointed successfully. Of the 15 jobs that did not, only 7 were deleted from the queuing system; the other 8 were requeued to run again at a later time. The test enabled NERSC and IBM staff to identify some previously undetected problems with the checkpoint/restart software, which they are now working to fix.
In 1997 NERSC made history by being the first computing center to achieve successful checkpoint/restart on a massively parallel system, the Cray T3E. For the original news story, see <http://www.nersc.gov/news/newsroom/checkpoint10-21-97.php>.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.