NERSC Helps Climate Researchers Get Results Faster to Meet Deadline
August 1, 2004
When experts on the Earth’s environment join forces with experts on the high performance computing environment, the future of our global climate comes into focus faster — at least in the results of model simulations.
That’s what happened this summer when researchers from the National Center for Atmospheric Research (NCAR) asked NERSC consultants to help them improve the throughput of their simulations so that they could present the results at an upcoming meeting of the Intergovernmental Panel on Climate Change (IPCC). The research team, led by Warren Washington, is carrying out a substantial portion of the climate change scenarios for the U.S. DOE/NSF contribution to the IPCC Fourth Assessment Report. This multi-institutional effort involves scientists and software engineers at NCAR, Los Alamos National Laboratory, Oak Ridge National Laboratory, and Lawrence Berkeley National Laboratory.
The NERSC portion of this project is a foundational segment of a coordinated international effort aimed at understanding the global and regional impacts of human-induced climate change. The climate change scenarios are being simulated with the Community Climate System Model (CCSM), which consists of four models — atmosphere, ocean, sea ice, and land — coupled through the “flux coupler.” Climate change experiments are performed by running five CCSM jobs as an ensemble, in which each job is run using different initial conditions.
The NCAR team had been running the CCSM ensemble on NERSC’s Seaborg system as five separate jobs, but with the IPCC meeting deadline looming, they asked NERSC’s consultants whether there was an efficient way to combine them into a single job. A single job running on 512 or more processors could take advantage of NERSC’s incentives for highly parallel jobs: preferential scheduling and a 50% discount on allocation charges. But merging five separate jobs into one is easier said than done. The first obstacle involved IBM’s LoadLeveler job scheduling software, which is used on Seaborg. LoadLeveler allows only one parallel job per batch job, and a batch job can have only one set of environment variables — data associated with a running program that tells the computer about the user’s system configuration and preferences, and how to execute the parallel program. But each job in the CCSM ensemble has its own large set of environment variables, so merging the five jobs into one would mean redefining all of those variables in various script files as well as in the CCSM source code.
Another obstacle was combining the instructions for task geometries and communication from all five jobs into one big set of instructions. Task geometries determine how tasks are assigned or distributed among the nodes and processors, and communication instructions set the rules for how the computational results are communicated between nodes. Combining these instructions would be a major effort involving modification of the source code. And any change in the job run configuration meant more modifications to the code.
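As a rough illustration of what would have had to be merged (this is not the NCAR team’s actual script, and the node and task numbering is made up), a LoadLeveler batch header can pin MPI tasks to nodes with the `task_geometry` keyword:

```shell
# Hypothetical LoadLeveler batch header (illustrative only).
# task_geometry groups MPI task IDs by node: here tasks 0-3
# share one node and tasks 4-7 share another.
#@ job_type      = parallel
#@ task_geometry = {(0,1,2,3)(4,5,6,7)}
#@ queue
```

Folding five jobs into one along these lines would mean merging five such geometries, and the environment setups that go with them, into a single description — the source-code-level effort the team hoped to avoid.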
Fortunately, NERSC consultant Jonathan Carter had successfully tackled similar challenges of running multiple parallel jobs as a single job. He solved the problem by using Internet Protocol (IP) as the combined job’s communication protocol instead of the default User Space (US) communication subsystem, which is designed to take advantage of the IBM SP’s high performance switch. Although using IP instead of US results in a slight performance degradation, IP does offer several advantages:
- IP allows the user to run multiple parallel jobs.
- IP does not require renaming of environment variables, since each component job can be executed by a subshell.
- The task geometries for all five jobs can be handled by building hostlist files, which list the processors on which the tasks should run.
- No source code modification is required. Any change in job run configuration would require only modification of the scripts to build new hostlist files corresponding to the job task geometries.
NERSC consultant Harsh Anand, who specializes in providing support for climate applications, demonstrated the feasibility of this approach for CCSM by setting up a job script that combined five CCSM jobs into a single job. She then worked with the NCAR researchers on a longer, tougher test: they ran a combined job for a month, then compared the results with the output obtained from individual job runs. The two sets of results were identical.
Finally, Harsh Anand helped the NCAR team implement the new and modified scripts in their production runs, and CCSM is now running on 720 processors. Although there is a slight loss of efficiency (~10–15%) for the IP implementation of CCSM, the NCAR team thinks that tradeoff is made worthwhile by the faster overall turnaround time, the availability of more processor hours, and the reduced human workload associated with fewer job runs — not to mention the fact that they did not have to rewrite their code while trying to meet a deadline.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. DOE Office of Science. Learn more about computing sciences at Berkeley Lab.