NERSC: Powering Scientific Discovery Since 1974

Berkeley Lab Checkpoint Restart Improves Productivity

March 30, 2009


FRANKLIN is NERSC's Cray XT4 massively parallel processing system with 38,128 Opteron compute cores and a peak performance of 356 TFlops.

A combustion researcher may run a huge simulation of a laboratory-scale flame experiment on a supercomputer to better understand the turbulence-chemistry interactions that affect fuel efficiency. But if the system crashes, then all the data from the run is lost and the user has no choice but to start over.

The new version of the Berkeley Lab Checkpoint/Restart (BLCR) software, released in January 2009, could mean that scientists running extensive calculations will be able to recover from such a crash – if they are running on a Linux system. This open-source software periodically saves the state of applications that use the Message Passing Interface (MPI), the most widely used mechanism for communication among processors working concurrently on a single problem. Checkpoints are taken automatically every few hours so that, in case of a hardware malfunction, work can resume from the last checkpoint instead of from the beginning.
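BLCR checkpoints unmodified applications transparently at the system level, so no application code like the following is required. Purely as an illustration of the checkpoint/restart idea, here is a toy application-level sketch in Python; the file name and step counts are invented for the example:

```python
import os
import pickle

CKPT = "state.ckpt"  # hypothetical checkpoint file for this toy example

def run(total_steps=1000, ckpt_every=100):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            step, acc = pickle.load(f)
    else:
        step, acc = 0, 0

    while step < total_steps:
        acc += step          # stand-in for real computation
        step += 1
        if step % ckpt_every == 0:
            # Write to a temp file and rename so a crash mid-write
            # cannot corrupt the saved state.
            tmp = CKPT + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, acc), f)
            os.replace(tmp, CKPT)
    return acc

print(run())  # prints 499500, the sum of 0..999
```

If the process is killed partway through, rerunning it picks up from the most recent checkpoint rather than from step zero, which is the productivity gain BLCR provides system-wide.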

Developed by systems engineers in Lawrence Berkeley National Laboratory's (Berkeley Lab) Computational Research Division (CRD), BLCR was initially released to the public in November 2003 as open source software. Since then, many developers from both academia and industry have integrated BLCR into their software packages, including the MVAPICH2, OpenMPI and Cray implementations of MPI, and the Cluster Resources batch system. The original funding for BLCR development came from the SciDAC Scalable Systems Software ISIC; it is now funded through the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project, a computer science base program.

“BLCR benefits all system stakeholders – users, operators and owners – by recovering the productivity that is lost when failure occurs,” says Paul Hargrove of CRD’s Future Technologies Group, one of BLCR’s developers.

According to Hargrove, other checkpoint/restart software currently exists for Linux clusters, but BLCR differs in that it works with MPI. Climate modeling is one example of a complicated problem that can benefit from BLCR. To accurately predict and model climate conditions, scientists must take into account how the atmosphere interacts with land, ice and ocean surfaces. On a supercomputer, multiple processors tackle parts of each problem and communicate their results through MPI. All the results are then combined to get the big-picture model or prediction. Whereas most checkpoint/restart software can only save the state before or after MPI communications are completed, BLCR automatically saves the application no matter what state it is in, even if communication is in progress. This feature allows for flexibility in scheduling and is extremely useful when a machine fails unexpectedly.

One beneficial application for BLCR is in “urgent computing,” which requires rededicating computing resources on short notice to solve problems of great social importance, like predicting the path of tornadoes, hurricanes and tsunamis. When an urgent computing request comes up, the system can now stop whatever it is doing, tackle the time-sensitive problem, and then resume work from the saved checkpoint.
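The preemption pattern described above can be made concrete with a toy Python sketch. The Job class, names, and step counts below are invented for illustration; BLCR itself checkpoints real running processes transparently rather than requiring jobs to cooperate like this:

```python
# Toy scheduler illustrating checkpoint-based preemption.
class Job:
    def __init__(self, name, steps):
        self.name, self.steps, self.done = name, steps, 0

    def run(self, budget):
        # Advance until finished or the step budget is exhausted.
        # self.done persists between calls, playing the role of a checkpoint.
        while self.done < self.steps and budget > 0:
            self.done += 1
            budget -= 1

def schedule(job, urgent):
    progress = []
    job.run(budget=3)                      # long job starts...
    progress.append((job.name, job.done))  # ...and is checkpointed mid-run
    urgent.run(budget=urgent.steps)        # urgent request preempts it
    progress.append((urgent.name, urgent.done))
    job.run(budget=job.steps)              # resume from saved state, not step 0
    progress.append((job.name, job.done))
    return progress

print(schedule(Job("climate", 10), Job("tornado", 2)))
# [('climate', 3), ('tornado', 2), ('climate', 10)]
```

The key point is the final step: the preempted job resumes from where it was checkpointed, so the work done before the urgent request is not lost.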

“In the past, an interrupted workflow, whether due to a component failure or to taking on an urgent request, would mean starting the interrupted jobs over from the beginning. Sometimes starting over would require days of redundant processing. But now with BLCR, researchers can recover from an unexpected interruption in a few hours,” says Hargrove.

He notes that another common loss of utilization on a production system is “queue draining” before scheduled maintenance. Because no applications can be running at the time maintenance begins, it is typical for the software that schedules jobs to be put in a mode where the system will run only those applications that will be completed before maintenance occurs. Since there are not usually enough short-running jobs queued, this results in a system with lower-than-normal utilization for the day leading up to a scheduled down time.

With BLCR, system administrators no longer have to drain the queues before maintenance. Now they can checkpoint before the system goes down for maintenance and resume the jobs when the system is up again, hence improving the machine’s productivity. The same approach allows system administrators to implement separate job queues to run the largest jobs only during certain hours of the day to improve the system’s average turnaround time.

Berkeley Lab has a long history of developing checkpoint/restart for parallel systems. In 1997, the Cray T3E-900 at NERSC was the first massively parallel system to implement checkpoint/restart in production mode. Checkpointing was used on that system to move running jobs around within the system to pack them more tightly, thereby improving system utilization. Inspired by the Cray T3E checkpoint/restart, Hargrove and his colleagues developed BLCR because no other checkpoint/restart software on the market met the needs of high performance computing applications on Linux systems, which now account for 88 percent of the largest systems, according to the November 2008 TOP500 Supercomputer Sites list. In addition to BLCR interest at NERSC, Hargrove notes that other Department of Energy facilities and National Science Foundation TeraGrid centers have expressed interest in the technology.

The new layer that expands BLCR’s checkpoint footprint by allowing it to run simultaneously on thousands of compute nodes was developed by the Cray Center of Excellence (COE), which was established when the contract for NERSC-5, or Franklin, was awarded to Cray in 2006. The COE’s main goal is to develop innovative software for production-level supercomputing. This is achieved by allowing Cray employees to tap into the vast production expertise of NERSC staff by working from the Berkeley Lab’s Oakland Scientific Facility for two years. The production tools and software developed by the COE will utilize Cray’s release and update process, thus allowing Cray XT sites worldwide to benefit from the COE collaboration. Brian Welty, Terry Mallberg and their Cray colleagues ported and tuned BLCR for deployment on Cray systems as part of the COE.

In addition to Hargrove, other BLCR developers include Eric Roman, also of CRD’s Future Technologies Group, and Jason Duell, formerly of CRD.

About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.