The NERSC CAM README file
CAM: CCSM Community Atmosphere Model
The Community Atmosphere Model (CAM) is the atmospheric component of the Community Climate System Model (CCSM) developed at NCAR and elsewhere for the weather and climate research communities. Although generally used in production as part of a coupled system in CCSM, CAM can be run as a "stand-alone" (uncoupled) model, as it is here. The NERSC benchmark runs CAM version 3.1 at D resolution (about 0.5 degree) using a Finite Volume (FV) dynamical core.
Atmospheric models consist of two principal components, the "dynamics," and the "physics." The dynamics, for which the solver is referred to as the "dynamical core," are the large-scale part of a model, the atmospheric equations of motion affecting wind, pressure and temperature that are resolved on the underlying grid. The physics is characterized by subgrid-scale processes such as radiation, moisture convection, friction and boundary layer interactions that are taken into consideration implicitly (via parameterizations). [JAB]
In CAM3 the dynamics are solved using an explicit time integration and finite-volume discretization that is local and entirely in physical space. Hydrostatic equilibrium is assumed and a Lagrangian vertical coordinate is used; together these effectively reduce the dimensionality from three to two.
The version of CAM used in this benchmark consists of about 103,000 lines of Fortran90 in about 350 files and about 7,000 lines of C (as measured by SLOCCount). All Fortran files have either a .F or a .F90 suffix and are preprocessed via the preprocessing option of the Fortran90 compiler (not via cpp explicitly). The C code contains a variety of system-dependent functions, such as timers. Portions of the Earth System Modeling Framework (ESMF) are also included, in a separate set of system-dependent sub-directories.
The version supplied uses MPI although it can also be built for a single-processor run and for SHMEM, if available. The code contains a great many OpenMP loops as well, but in the NERSC-5 procurement benchmarking the code was to be run in MPI-only mode.
The Network Common Data Form (NETCDF) library is required to build CAM. The source and build files can be obtained from http://www.unidata.ucar.edu/software/netcdf/. Generally, you'll find that NETCDF needs to be built with the same compiler you're going to use to build CAM.
Several routines use quad-precision data declarations to preserve bit-for-bit accuracy on parallel systems. Two of these routines, gauaw_mod.F90 and phcs.F90, use quad precision unconditionally, although the actual precision is determined via a Fortran90 "KIND" declaration and is therefore compiler or machine dependent (it appears in the code as selected_real_kind(12) or selected_real_kind(17), chosen by a preprocessor definition). These two files are in the directories $camroot/atm/cam/src/control/ and $camroot/atm/cam/src/advection/slt/, respectively. An additional file, geopk.F90, in directory $camroot/atm/cam/src/dynamics/fv/, uses "REAL*16" data declarations based on a preprocessor option DSIZE.
Explanation of preprocessor flags:
|-DSPMD||Set to TRUE for SPMD parallelism (message passing only, no intra-SMP parallelism).|
|-DSIZE||If set to 16 uses quad precision in some routines.|
|-SMP||Set to TRUE for intra-SMP parallelism.|
Relationship to NERSC Workload
CAM is used for both short-term weather prediction and as part of a large climate modeling effort to accurately detect and attribute climate change, predict future climate, and engineer mitigation strategies. An example of a "breakthrough" computation involving CAM might be its use in a fully coupled ocean/land/atmosphere/ice climate simulation with 0.125-degree (approximately 12 kilometer) resolution involving an ensemble of eight to ten independent runs. Doubling horizontal resolution in CAM increases computational cost eightfold. The computational cost of CAM in the CCSM, holding resolution constant, has increased 4x since 1996. More computational complexity is coming in the form of, for example, super-parameterizations of moist processes and ultimately, elimination of parameterization and use of grid-based methods. [LOFT] [WEH] [SIMON]
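The eightfold figure is consistent with a simple scaling argument: halving the grid spacing doubles the number of points in each of the two horizontal directions, and an explicit scheme typically needs a proportionally shorter time step to stay stable. The sketch below works through that arithmetic. It assumes a fixed number of vertical levels and a cost proportional to (grid points) x (time steps); the function name is hypothetical and the model is a rough estimate, not a measurement of CAM.

```python
def relative_cost(old_degrees, new_degrees):
    """Estimated cost ratio when refining horizontal resolution.

    Assumption: cost scales as refinement**3 (two horizontal dimensions
    plus a proportionally shorter time step), vertical levels unchanged.
    """
    refinement = old_degrees / new_degrees
    return refinement ** 3


# One doubling of resolution (0.5 degree -> 0.25 degree).
print(relative_cost(0.5, 0.25))
# Two doublings (0.5 degree -> 0.125 degree, the "breakthrough" case).
print(relative_cost(0.5, 0.125))
```

Under these assumptions, going from the 0.5-degree benchmark resolution to 0.125 degree is two doublings, so the estimate is 8 x 8 = 64 times the cost per simulated day.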
The solution procedure for the FV dynamical core involves two parts: (1) dynamics and tracer transport within each control volume (referred to in the code as cd_core and trac2d, respectively) and (2) remapping of prognostic data from the Lagrangian frame back to the original vertical coordinate (in te_map). The remapping occurs on a time scale several times longer than that of the transport.
Earlier versions of CAM used a one-dimensional domain decomposition in which each subdomain contained all longitude lines and only a subset of latitude lines. (If you have trouble remembering latitude from longitude, as some people do, consider that the one-dimensional domain decomposition produced halos rather than fruit sections.) Additional parallelism was included using OpenMP decomposition for loops that run both over vertical levels and over a subset of latitudes. Although this method applied well to both the slab-based dynamics and the Lagrangian remap, it yielded a parallel speedup bounded by a constant as the number of processors increased.
The version of CAM provided here uses a formalism effectively containing two different two-dimensional domain decompositions. In the first, cd_core is decomposed in both latitude and vertical level. Because this decomposition is inappropriate for the remap phase and for the vertical integration step that calculates pressure, both of which have vertical dependences, a longitude-latitude decomposition is used for these phases of the computation. Optimized transposes move data from the program structures required by one phase to those required by the other. The transpose from yz to xy decomposition takes place within dynpkg, and one key to this scheme is having the number of processors in the latitudinal direction be the same for both of the two-dimensional decompositions. [MIR]
In the code the latitude-vertical processor grid is referenced by the variables npr_y and npr_z and these are specified in the NAMELIST input file. The latitude-longitude processor grid is referenced by the variables nprxy_x and nprxy_y. The code sets nprxy_x=npr_z and nprxy_y=npr_y.
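The relationship between the two processor grids can be sketched as follows. This is an illustration only: the function names are hypothetical, the check that npr_y times npr_z equals the number of MPI tasks is an assumption about how the NAMELIST values are expected to be chosen, and the values 8 and 7 are illustrative.

```python
def xy_grid_from_yz(npr_y, npr_z):
    """Return (nprxy_x, nprxy_y) using nprxy_x = npr_z and nprxy_y = npr_y."""
    return npr_z, npr_y


def check_decomposition(ntasks, npr_y, npr_z):
    """Validate a latitude-vertical processor grid against a task count.

    Assumption: the processor grid must use every MPI task exactly once.
    """
    if npr_y * npr_z != ntasks:
        raise ValueError(
            f"npr_y*npr_z = {npr_y * npr_z} does not match {ntasks} tasks"
        )
    nprxy_x, nprxy_y = xy_grid_from_yz(npr_y, npr_z)
    return {"npr_yz": (npr_y, npr_z), "nprxy": (nprxy_x, nprxy_y)}


# Illustrative values: 56 tasks arranged as 8 (latitude) x 7 (vertical).
print(check_decomposition(56, 8, 7))
```

Both grids contain the same number of tasks, and the latitude count (npr_y, nprxy_y) is identical in each, which matches the requirement stated above for the transposes.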
The incorporation of "chunks" (column based data structures) into the physics routines allows enhanced vectorization, additional parallelism and easier dynamic load balancing. Multiple chunks are assigned to each MPI process and OpenMP threads loop over each local chunk. The optimal chunk size depends on the machine architecture, e.g., 16-32 for IBM SP. An example of a chunked domain decomposition is shown in the figure below, where the colors represent chunks owned by different MPI processes. [HE]
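The idea of chunking can be sketched in a few lines. This is not CAM's chunking code: the function names are hypothetical, the column count is illustrative, and the round-robin assignment is an assumption standing in for whatever load-balancing scheme the model actually applies.

```python
def make_chunks(ncols, chunk_size):
    """Split column indices 0..ncols-1 into consecutive chunks."""
    return [
        range(start, min(start + chunk_size, ncols))
        for start in range(0, ncols, chunk_size)
    ]


def assign_round_robin(chunks, ntasks):
    """Assign chunks to tasks in turn (an assumed, simplistic policy)."""
    owners = {task: [] for task in range(ntasks)}
    for index, chunk in enumerate(chunks):
        owners[index % ntasks].append(chunk)
    return owners


chunks = make_chunks(ncols=100, chunk_size=16)
owners = assign_round_robin(chunks, ntasks=4)
for task, owned in owners.items():
    # Each task would loop (possibly with threads) over its local chunks.
    print(task, [len(chunk) for chunk in owned])
```

Because each task holds several small chunks rather than one large block, chunks can be moved between tasks to even out the work, and the inner loop over the columns of a chunk has a fixed, short length that suits vector or cache-based processors.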
Regarding the optimized transposes that CAM uses, the benchmark runs appear to use the non-transpose geopotential (geopk) communication method, selected via the variables "geopk16byte" and "geopktrans" (the latter is set in the NAMELIST input file). This means that local partial sums in z are computed and then communicated among processors to combine them, instead of performing transposes between yz and xy space. We also should explain what is meant by modcomm transpose method = 1 and modcomm geopk method = 1.
CAM uses the parallel decomposition and communication layer Parallel Library for Grid Manipulations (PILGRIM), source and build files for which are located in $camroot/models/utils/pilgrim. PILGRIM is itself built upon the "mod_comm" library of basic communication primitives, also supplied with the distribution. Mod_comm contains support for irregular communication using a derived data type that defines a set of chunks to be sent to (or received from) another processor. The irregular communication routines operate on arrays of block descriptors whose length is equal to the number of PEs involved in the communication. This means the irregular communication primitives are merely non-blocking all-to-all primitives.
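The partial-sum approach can be illustrated with a serial sketch. It is not the geopk code: the function name is hypothetical, the blocks stand in for the z-subdomains owned by different processes, and the quantity being summed and the direction of integration in the real model are not represented.

```python
from itertools import accumulate


def blocked_cumsum(column, nblocks):
    """Cumulative sum of a column split into nblocks contiguous z-blocks."""
    n = len(column)
    bounds = [n * b // nblocks for b in range(nblocks + 1)]
    blocks = [column[bounds[b]:bounds[b + 1]] for b in range(nblocks)]

    # Step 1: each "process" accumulates over its own levels only.
    local = [list(accumulate(block)) for block in blocks]
    totals = [sums[-1] if sums else 0 for sums in local]

    # Step 2: only the block totals would be communicated; every block
    # needs the sum of the totals of all blocks that precede it.
    offsets = [0] + list(accumulate(totals))[:-1]

    # Step 3: each "process" adds its offset to its local partial sums.
    return [value + offsets[b] for b in range(nblocks) for value in local[b]]


levels = list(range(1, 29))  # 28 illustrative values, one per level
print(blocked_cumsum(levels, 7) == list(accumulate(levels)))
```

Only one number per block and column is exchanged, rather than the full field as in a transpose. With floating-point data the blocked summation order can change rounding relative to a single top-to-bottom sum, which may be why a 16-byte option exists for this routine.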
The PILGRIM module mod_comm contains numerous optimizations for various platforms and underlying communication primitives, all enabled via CPP preprocessor options. The complete list of them can be found in the file mod_comm.F90. Mod_comm can use MPI-1 or MPI-2 and can combine the latter with OpenMP multithreading.
Choices available when using MPI for the transpose/geopk method within CAM: temporary contiguous buffers or MPI derived data types.
Choices available when using MPI2 with CAM:
- 0 for temporary contiguous buffers
- 1 for direct mpi_put's of contiguous segments into temporary contiguous window, with threading over the segments (default).
- 2 for mpi derived types into temporary contiguous window, with threading over the target.
- 3 for derived types at source and target, with threading over the target.
The choice of pure MPI, pure OpenMP, or mixed MPI/OpenMP is made at build time and there are certain defaults chosen for certain machines. For example, the default parallelization on the IBM is hybrid MPI/OpenMP, and the default parallelization on both PCs and SGI systems is pure OpenMP.
CAM also uses what it refers to as "global arrays," which, according to the code, are buffers into which data are packed for transfer to other PEs. Global arrays are 1-dimensional and are accessed as needed inside the Ga_Put/Ga_Get routines using offset variables. (The Ga_Put/Ga_Get routines are all 4d, with OpenMP on the 3rd (k) dimension.) All of this comes from mod_comm.F90.
For NERSC-related procurements please visit the procurement site.
Due to licensing restrictions CAM cannot be downloaded from the NERSC site. To obtain the code visit the CAM Download Page.
You can download the NERSC-5 CAM input data files here.
The build procedure uses a configure script that is called from within a shell script. Some changes are required to the shell script (and some other files) but probably the configure script should not be changed.
Two runs are required for CAM and they use the same executable but different numbers of processors. Only a single build is required. However, two scripts are provided, one for running on 56 processors and the other for running on 240 processors.
The following steps are required to build the code.
- Untar the tar file and change to the "cam31" directory. This directory will be defined as $camroot in later scripts. Execute a script that has been provided in this directory as follows:
source set-cam-path
This will set the environment variables $camroot and $caminput. Note: you must use csh or tcsh to build CAM.
- Go to the $camroot/run directory. At this point you can either build and run in a single step or build first and run second. To build first and then run, edit the run-seaborg-fvxx script (where xx is either "56" or "240") and uncomment the "exit 0" statement on line 121 by removing the "#" symbol.
- The code as provided makes certain assumptions about the build environment. For example, for Linux systems the Portland Group compilers (pgf90 and pgcc) are assumed. If the assumptions are not valid for your machine, or if you want to add a new machine, two files must be edited. The first is Makefile in the $camroot/models/atm/cam/bld directory. This file contains the templates for several systems (Linux/PGI, SunOS, OSF, SuperUX, EarthSimulator, UNICOS/mp, AIX, and IRIX64). The second file that needs to be edited is the base_variables file in one of the $camroot/models/utils/esmf/build/X directories; here, X represents one of the machines for which templates already exist.
- Returning to the $camroot/run directory, there are also several paths in the run-seaborg-fvxx script that may need to be edited, including two related to NETCDF (include and lib) on lines 61 and 62, the setting of camroot on line 71, and two related to MPI on lines 65 and 66. Note that these may be commented out in the script as provided!
- You may also have to edit the first line of the script to change the location of csh. This will be the case if you try to execute the script and get "Command not found."
- Some of these substitutions may also have to be made in the other scripts (e.g., "build-namelistq56.csh").
- Execute the build script. If successful, the executable, called "cam", will be in the directory $camroot/benchmark/bld. Another very useful file is $camroot/benchmark/bld/MAKE.out. The build script will also execute a second script to create the NAMELIST input file with correct paths in it for your environment.
File suffixes include .F .F90 .c .h .inc. Other important files include mkDepends, mkSrcfiles, and Filepath.
An approximate description of the directory structure is given in the following table. Some of the directories well down in the tree are not described.
|Directory or File||Description|
|$camroot||root directory of the CAM distribution|
|$camroot/run||Main build/run scripts are here. This is where you go to build the code. Contains a symbolic link to ../benchmark|
|$camroot/benchmark||Created by the run script.|
|$camroot/benchmark/bld||Created by the run script. Contains compiler output files (*.o and *.mod). If you need to "Make clean" then come here and do rm -rf *.|
|$camroot/benchmark/job56_out||Created by the run script. Contains all output files from the run.|
|$camroot/camcheck||A directory containing code and scripts to verify results.|
|$camroot/models/atm/cam/bld||Main Makefile template and configure scripts are here. Also some submission scripts.|
|$camroot/models/atm/cam/bld/Makefile||This is the template Makefile for all builds, preconfigured for certain systems. You may have to edit it.|
|$camroot/models/atm/cam/src||Source code is in six subdirectories here. MAIN and MPI_Init are in $camroot/models/atm/cam/src/control/cam.F90|
|$camroot/models/atm/cam/tools||nothing of interest to the benchmark here|
|$camroot/benchmark_save||Reference output files from runs on Seaborg are in two subdirectories, qq56_out and qq240_out.|
|$camroot/models/atm/cam/src/control/cfort.h||Contains certain important definitions, such as whether to use underscores in linking C and Fortran. Certain assumptions are made here; e.g., Linux needs underscore (which may not be true).|
The code is run by executing one of the shell scripts provided (called run-seaborg-fv56, run-seaborg-fv240, and possibly run-pbs-fv56 and run-pbs-fv240). Again, these scripts will both build and run the code; to run, make sure that the "exit 0" statement is commented out, since uncommenting it stops the script after the build. Additionally, the scripts may require changes for various batch systems. The basic execution line is something like:
mpirun -np # $camroot/benchmark/bld/cam < namelistq56
or
poe $camroot/benchmark/bld/cam < $jobdir/namelistq56 > &! atm.log
Two pre-prepared run scripts are in $camroot/run but there are also some others in $camroot/models/atm/cam/bld.
Computational nodes employed in the benchmark must be fully packed; that is, the number of CAM computational tasks on a node must be equal to the number of physical processors on the node.
Extensive instrumentation to time various parts of the code is already included. Hooks for PAPI are also included. Each processor writes a timing file called "timing.xx" that contains a complete profile.
The code is timed via gettimeofday.
The code writes out about 2GB of data files. This is part of the code that is timed but typically a very small portion of the total time.
The time you should report is total wallclock from the file "timing.0".
Almost all arrays in CAM are allocated dynamically although some are held in Fortran COMMON.
Memory Required (per MPI Process) By The Sample Problems*:
* These values obtained from the NERSC LoadLeveler accounting system.
The minimum memory configuration required to run the problems in each configuration must be reported (OS + buffers + code + data + ...).
This benchmark runs CAM3 at D resolution (about 0.5 degree), which implies a grid of 576 (longitude) x 361 (latitude) x 28 (levels). The two required benchmark runs perform the same 5-day simulation of the global atmosphere on different numbers of processors: on 56 MPI tasks using an 8x7 processor domain decomposition and on 240 MPI tasks using a 60x4 processor domain decomposition. For the NERSC-5 procurement only SPMD was used, no OpenMP.
Eight large data files are required to do the runs. These must be obtained separately from the NERSC-5 procurement web site. Because these files comprise about 6GB of data, they are considered too large to download reliably as a single file. Therefore, the tar file containing them has been split into 20 partial files, each about 268 MB. Recover the original CAM input files using the following steps:
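The grid dimensions can be related to the nominal resolution with a little arithmetic. The sketch below assumes that the 361 latitude rows include both poles, which is what makes the latitude spacing come out at 0.5 degree; the function name is hypothetical.

```python
def grid_summary(nlon, nlat, nlev):
    """Spacing and point counts for a regular longitude-latitude grid.

    Assumption: latitude rows run from pole to pole inclusive, so there
    are nlat - 1 intervals across 180 degrees.
    """
    return {
        "dlon_deg": 360.0 / nlon,
        "dlat_deg": 180.0 / (nlat - 1),
        "columns": nlon * nlat,
        "points": nlon * nlat * nlev,
    }


print(grid_summary(576, 361, 28))
```

Under that assumption the spacing is 0.625 degree in longitude and 0.5 degree in latitude, with 207,936 columns and about 5.8 million grid points in total.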
- Download all 20 partial files.
- Verify the MD5 checksum for each file. The code MD5 is in the $camroot/TOOLS directory. It needs to be compiled for your system. Run it by typing
- After all files have been retrieved and verified extract their contents using the command line
cat caminp.0* | tar xvf -
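If the MD5 program in $camroot/TOOLS cannot be built, the checksum step could be done with a short script instead. The following is a sketch using Python's standard hashlib module, not the tool supplied with the benchmark; the function names are hypothetical, and the expected checksums must come from the download site (none are reproduced here).

```python
import hashlib
import os


def md5_of_file(path, block_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def failed_parts(expected):
    """Return the paths that are missing or whose MD5 does not match.

    `expected` maps each partial-file path to its published checksum.
    """
    bad = []
    for path, checksum in sorted(expected.items()):
        if not os.path.isfile(path) or md5_of_file(path) != checksum.lower():
            bad.append(path)
    return bad
```

An empty list from failed_parts means every partial file is present and matches its checksum, after which the cat and tar step above can be run.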
The data files are based on the D resolution simulations from Phil Duffy's group at Lawrence Livermore National Laboratory.
The CAM $camroot/camcheck directory contains a program called "camcheck" and a script called "checkall." Both of these need to be used to verify the output from a run. Please note the following:
- You must compile the source file (camcheck.c). A Makefile is provided.
- You must link with the NETCDF library to produce the camcheck executable and you must use the same compiler that was used to build NETCDF. You may need to link in the Unix math library (-lm).
The script checkall will call the program camcheck to check the values of five two-dimensional fields extracted from one of the NETCDF output files ".cam2.h1.1978-09-01-00000.nc". You must have the environment variable $camroot set for this.
Run the checkall script in the cam3 output directory (which is either $camroot/benchmark/job56_out or $camroot/benchmark/job240_out) and redirect the output to a file. Then compare the output file with an output file provided in one of two $camroot/benchmark_save directories (qq56_out and qq240_out). The outputs in these two directories were obtained on the NERSC Seaborg IBM SP system. You should see at least 4 significant digits matching. An example of this procedure is shown below. It is again assumed that the environment variable $camroot has been set.
% cd $camroot/benchmark/job56_out
% $camroot/camcheck/checkall > & checkall.out &
% diff checkall.out $camroot/benchmark_save/qq56_out
Successful verification requires self checking.
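Because diff reports any textual difference, it can help to compare the numbers themselves to four significant digits. The sketch below is one way to do that. It is not part of camcheck; the function names are hypothetical, the format of the checkall output is not assumed beyond containing numbers in ordinary decimal or exponent notation, and "matching to N digits" is interpreted here as a relative difference of at most 10**(1-N).

```python
import math
import re

# Matches integers, decimals, and exponent notation such as 9.85E+04.
NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def same_to_digits(a, b, digits=4):
    """True if a and b agree to roughly `digits` significant digits."""
    return math.isclose(a, b, rel_tol=10.0 ** (1 - digits), abs_tol=0.0)


def outputs_match(ours, reference, digits=4):
    """Compare every number found in two blocks of text, in order."""
    a = [float(token) for token in NUMBER.findall(ours)]
    b = [float(token) for token in NUMBER.findall(reference)]
    if len(a) != len(b):
        return False
    return all(same_to_digits(x, y, digits) for x, y in zip(a, b))


# Made-up lines for illustration; real field names and values will differ.
print(outputs_match("PS 9.85432E+04", "PS 9.85441E+04"))
print(outputs_match("PS 9.85432E+04", "PS 9.95432E+04"))
```

To use it on real output, read checkall.out and the corresponding reference file into strings and pass them to outputs_match.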
If you run correctly, in your $rundir, you should see these files:
-- 5days/
-- atm.log.35232
-- $case.cam2.h1.1978-09-01-00000.nc ~ 99 MB
-- $case.cam2.h2.1978-09-01-00000.nc ~ 332 MB
-- $case.cam2.h2.1978-09-06-00000.nc ~ 16 MB
-- $case.cam2.h3.1978-09-01-00000.nc ~ 133 MB
-- $case.cam2.h3.1978-09-06-00000.nc ~ 6 MB
-- $case.cam2.r.1978-09-06-00000 ~ 705 MB
-- $case.cam2.rh0.1978-09-06-00000 ~ 852 MB
-- $case.clm2.r.1978-09-06-00000 ~ 310 MB
[CAM] The CAM web page is at http://www.ccsm.ucar.edu/models/atm-cam/
[MIR] Arthur A. Mirin and William B. Sawyer, "A Scalable Implementation of a Finite-Volume Dynamical Core in the Community Atmosphere Model," International Journal of High Performance Computing Applications, Vol. 19, No. 3, 203-212 (2005)
[LOFT] Richard Loft (NCAR), "Supercomputing Challenges at the National Center for Atmospheric Research," http://www.scd.ucar.edu/dir/CAS2K3/CAS2K3%20Presentations/Mon/loft.ppt
[SIMON] Horst Simon, et al., "Science Driven System Architecture: A New Process for Leadership Class Computing," Journal of the Earth Simulator, Vol. 2, January 2005, http://www-library.lbl.gov/docs/LBNL/565/45/PDF/LBNL-56545.pdf