
Cluster Compatibility Mode

Overview

Hopper compute nodes run a stripped-down Linux operating system called Compute Node Linux (CNL). Some standard Linux services, such as ssh, rsh, nscd, and ldap, are not supported on the compute nodes, so some user applications have not been able to run on Hopper. Cluster Compatibility Mode (CCM) is the Cray software solution to this problem: it provides the standard Linux services needed to run most cluster-based independent software vendor (ISV) applications on the Cray XE6. Under CCM, everything an application "sees" is a standard Linux cluster. NERSC has made CCM available on Hopper to accommodate workflows that need standard Linux services. Codes such as G09, WIEN2k, and NAMD replica-exchange simulations, which previously could not run on Hopper, can now run there under CCM. CCM also enables serial workloads to run on Hopper.

Users access CCM through a dedicated Hopper queue, ccm_queue. When you submit to this queue, CCM dynamically allocates and configures the compute nodes at job start and releases the CCM environment after the job completes.

For interactive CCM workloads (e.g. debugging, data analysis) use the ccm_int queue, which has the same queue priority as the debug and interactive queues. For more information about queue configurations and scheduling policies, see Queues and Scheduling Policies.

Programming

Cray CCM is an execution environment in which ISV applications can run "out of the box". Applications compiled on any other x86_64 platform can run directly under Hopper CCM if you provide the required runtime environment. Users can also compile codes for CCM on Hopper. All compilers available in Hopper's native programming environment (Extreme Scalability Mode, ESM hereafter) are also available under CCM: the PGI, GNU, Intel, Pathscale, and Cray compilers. Please note that parallel tools and libraries optimized for Hopper's native programming environment do not work under CCM. For example, the Cray custom MPICH2 libraries (xt-mpich2 modules) do not work under CCM, and therefore neither do any parallel libraries that depend on them. In addition, the compiler wrappers ftn, cc, and CC should not be used under CCM.

As an alternative, we provide the OpenMPI library for CCM. MPI runs over TCP/IP and uses the OFED interconnect protocol over the Gemini High Speed Network (HSN). MPI codes must be compiled for CCM by linking to the OpenMPI libraries. Users compile codes either with the native compiler commands, e.g., pgif90, pgcc, pgCC, or with the parallel compiler wrappers provided with OpenMPI. In contrast to Hopper ESM, executables built for CCM are linked dynamically by default, as you would expect on a generic Linux cluster. Since the OpenMPI compiler wrappers do not handle libraries other than MPI and the compiler's own libraries, you need to supply the include paths to the header files, the library paths, and the libraries themselves on the compile/link lines.

To compile codes to run under CCM, load the openmpi_ccm module. Note that the "_ccm" suffix in the module name is a reminder that the module should be used under CCM only.

module load openmpi_ccm
mpif90 test1.f90
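
If you are unsure which compiler and options an OpenMPI wrapper will invoke, the wrappers accept a --showme option that prints the underlying compile/link command without running it (the exact output depends on which compiler and openmpi_ccm modules are loaded; shown here only as a quick check):

mpif90 --showme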

We have provided the most commonly used libraries for CCM. The ScaLAPACK libraries have been compiled for CCM (module name scalapack_ccm). In addition, all serial and threaded libraries built for Hopper ESM should work under CCM (use them with caution, as we have not confirmed all of them). We have verified that the ACML and FFTW (serial routines) builds for ESM work under CCM. Users can also use the libraries available on Carver and link their codes directly against the Carver libraries; we have made an mkl_ccm module on Hopper that points to the Carver MKL build.
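
Because the OpenMPI wrappers do not add these library paths for you, they must appear explicitly on the link line. A minimal sketch of linking against the CCM ScaLAPACK build (the library path shown is a placeholder; check "module show scalapack_ccm" for the actual location and any variables the module defines):

module load openmpi_ccm scalapack_ccm
# The library directory below is a placeholder; substitute the path reported by
# "module show scalapack_ccm" (or any variable that module sets).
mpif90 -o myapp.x myapp.f90 -L/path/to/scalapack_ccm/lib -lscalapack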

Running Jobs

To access CCM on Hopper, submit jobs to the ccm_queue. CCM jobs use "MOM" nodes for job launch. Instead of using the ALPS job launcher aprun, CCM jobs are launched from a MOM node onto the compute nodes with the ccmrun command. The ccmrun command places a single instance of the execution command on the head node of the allocated compute nodes (hereafter we call the nodes allocated and configured for CCM jobs "CCM nodes"); the head node is then responsible for launching the executables on the rest of the CCM nodes (the remote CCM nodes hereafter) through whatever mechanism the execution command uses, e.g., mpirun (via ssh). Another command, ccmlogin, allows interactive access to the head node of the CCM nodes. Both ccmrun and ccmlogin wrap the aprun command. Please refer to the man pages for ccmrun and ccmlogin (you first need to load the ccm module to access these man pages, and you can do this only on MOM nodes, because the ccm module is available there only).
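
For interactive work, the same pieces apply. A minimal sketch of an interactive CCM session through the ccm_int queue (the resource values are illustrative):

qsub -I -V -q ccm_int -l mppwidth=24,walltime=30:00
# The interactive shell starts on a MOM node:
module load ccm
export CRAY_ROOTFS=DSL
ccmrun hostname     # runs the command on the head CCM node
ccmlogin            # opens a shell on the head CCM node (see the known issues below)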

Please note that Cray does not support the native Torque launching mechanism, so we had to build OpenMPI without batch-system awareness (configured with --tm=disable). As a result, the OpenMPI job launcher mpirun does not pass the environment on to the remote nodes. There are a few ways to pass the environment to the remote CCM nodes:

  1. Add the environment variables to your shell startup file (.bashrc.ext or .cshrc.ext), and load the modules that define the needed runtime environment variables in the same file as well.
  2. Use the --prefix and -x command line options of mpirun.
  3. Define the environment variables in the batch job script and save them to the file ~/.ssh/environment, i.e., after defining all the environment variables and loading the modules, do
env > ~/.ssh/environment

We recommend the first method, which has proven the most reliable; a minimal sketch of such a startup-file block is shown below.
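
A minimal sketch of a Hopper block in ~/.bashrc.ext (the NERSC_HOST test and the extra variable are illustrative; load whatever modules and define whatever variables your application needs at run time):

if [ "$NERSC_HOST" = "hopper" ]; then
  module load openmpi_ccm
  export MY_APP_HOME=/path/to/app    # placeholder for any runtime variable your code needs
fi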

Sample job scripts to run CCM jobs on Hopper

In the following job scripts, we assume you have loaded the openmpi_ccm module in your shell startup file, i.e., you have the following line in the Hopper if-block of your ~/.bashrc.ext or ~/.cshrc.ext file:

module load openmpi_ccm

Otherwise you need to invoke the mpirun command with the --prefix command line option or with its full path, e.g., use "mpirun --prefix /usr/common/usg/openmpi/default/pgi" or "/usr/common/usg/openmpi/default/pgi/bin/mpirun" in place of mpirun in the job scripts below.

To run G09, NAMD replica simulations, and WIEN2k, please refer to the website for each application.

A sample job script for running an MPI job:

#!/bin/bash -l
#PBS -N test_ccm
#PBS -q ccm_queue
#PBS -l mppwidth=48,walltime=30:00
#PBS -j oe

cd $PBS_O_WORKDIR
module load ccm
export CRAY_ROOTFS=DSL
mpicc xthi.c
ccmrun mpirun -np 48 -hostfile $PBS_NODEFILE ./a.out

A sample job script to run an MPI+OpenMP job under CCM:

#!/bin/bash -l
#PBS -N test_ccm
#PBS -q ccm_queue
#PBS -l mppwidth=48,walltime=30:00
#PBS -j oe

cd $PBS_O_WORKDIR
module load ccm
export CRAY_ROOTFS=DSL
mpicc -mp xthi.c
export OMP_NUM_THREADS=6

ccmrun mpirun -np 8 -cpus-per-proc 6 -bind-to-core -hostfile $PBS_NODEFILE -x OMP_NUM_THREADS ./a.out

A sample job script to run multiple serial jobs under CCM:

#!/bin/bash -l
#PBS -q ccm_queue
#PBS -l mppwidth=24
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR
module load ccm
export CRAY_ROOTFS=DSL

ccmrun multiple_serial_jobs.sh

Where the script, multiple_serial_jobs.sh, looks like this:

% cat multiple_serial_jobs.sh
./a1.out &
./a2.out &
...
./a24.out &
wait
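
Equivalently, the background launches can be generated with a short loop. A minimal sketch, assuming the executables are named a1.out through a24.out and sit in the working directory:

#!/bin/bash
for i in $(seq 1 24); do
  ./a${i}.out &       # start each serial task in the background
done
wait                  # return only after all background tasks have finished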

Known issues with CCM

  • On Hopper, and also on Edison, g09 jobs run slower across multiple nodes than on a single node (we cannot reproduce the g09 performance shown in Figure 1 below). We have opened a bug against CCM regarding this issue, and Cray is investigating. Until this issue is fixed, g09 users are recommended to run on a single node only.
  • ccmlogin has not worked on Hopper since the February upgrade; a bug has been filed with Cray, and Cray is investigating.

Performance comparison between Hopper CCM and Carver

A.   Gaussian 2009

The Gaussian code (G09) is a computational chemistry code that is widely used at NERSC. G09 consists of many component executables called Links, and the code is parallelized in master/slave mode. G09 Links are parallelized with OpenMP threads within a node and with ssh across nodes through the Linda communication library. Since Hopper compute nodes do not support ssh, G09 could not run on Hopper in the past. While most of the Links run on multiple nodes, some of them do not, so the code does not scale well with the number of processor cores.

Figure 1 shows the performance comparison of G09 under CCM on Grace (the Hopper development machine) and on Carver. It shows the sum of the run times of the three main component Links in a UHF calculation at two different core counts. G09 runs around two times slower under Grace CCM than on Carver at these core counts, at which G09 jobs are most likely to run. The processor speed of Grace (Hopper) is about 30% slower than Carver's (2.1 GHz vs. Carver's 2.7 GHz), so the slowdown is more than the slower processor speed can account for. The lack of process/memory affinity control over ssh, combined with the larger number of NUMA domains per node (4 vs. Carver's 2), is likely one of the main causes of this performance gap.

Figure 2 shows the per-node performance of G09. When only one node is used, G09 is ~24% faster under Grace CCM than on Carver. However, as the number of nodes increases, G09 runs slower under Grace CCM than on Carver, even with three times as many cores.

B.   NAMD Replica Simulations

NAMD is a classical molecular dynamics code that is widely used at NERSC. While the main code works on Hopper, its replica-exchange jobs could not run on Hopper in the past. A replica-exchange simulation runs many similar job instances (replicas) independently at the same time and occasionally communicates between replicas through socket operations. Since socket operations are not supported on Hopper compute nodes, this job type could not run on Hopper before. CCM now enables these simulations on Hopper as well.

Figure 3 shows the performance comparison of NAMD replica-exchange simulations on Hopper CCM and Carver, using a test case provided by a NERSC user. Twelve replicas were calculated simultaneously, using 8 and 24 cores per replica, respectively. The NAMD replica jobs run around 14% slower on Hopper CCM than on Carver when 8 cores are used per replica, but around 10% faster than on Carver when 24 cores are used per replica (the latter could be due to the slower file system on Carver, as the code writes many small files during the runs).

Figure 4. NAMD 2.8 parallel scaling comparison between Hopper CCM and Carver (Hopper ESM results are also included for reference). The standard ApoA1 benchmark (92K atoms, PME) was used.

Figure 5. The same data as in Figure 4, but in the standard NAMD benchmark format; the flatter the line, the better the parallel scaling. NAMD scales well up to 144 cores under CCM but shows a significant scaling drop at 288 cores, while Carver and Hopper ESM continue to scale. Although CCM does not scale as well as Hopper ESM or Carver, it allows each replica to use up to 144 cores (for ~92K atoms) at good parallel efficiency. Given the large capacity of Hopper, many more NAMD replicas can run simultaneously, with more cores per replica, which should bring greater productivity for users.

C.   WIEN2k

WIEN2k is an ab initio electronic structure code based on Density Functional Theory. It consists of many component executables connected by shell scripts. It has two layers of parallelism: fine-grid and k-point parallelization. The fine-grid parallelization is implemented with MPI, and the k-point parallelization is implemented with ssh and file I/O. Since ssh is not supported on Hopper compute nodes, WIEN2k was not able to run on Hopper in the past. Additionally, WIEN2k often runs at a small scale per k-point but with many k-points simultaneously, which requires multiple processes to share a single node -- something the Cray ALPS job launcher does not support. With CCM, WIEN2k can now run on Hopper as it does on any generic Linux cluster.

Figure 6 shows the run-time comparison between CCM on Hopper and Carver for a user-provided case in which each k-point used 12 cores. At 252 cores, Hopper CCM is slower than Carver by ~30%, while at 84 cores it is slower by around 90%. Carver WIEN2k users who move their jobs to Hopper can nevertheless benefit from Hopper's larger capacity and faster queue turnaround.

Selected Application Benchmark Performance

To obtain some baseline performance numbers under CCM, selected application benchmarks were run with chosen input files under both CCM and the normal Hopper mode, Extreme Scalability Mode (ESM).

Benchmark | Science Area              | Algorithm                              | Compiler Used | Concurrency Tested | Libraries
MILC      | Lattice Gauge             | Conjugate Gradient, sparse matrix, FFT | GNU           | 64, 256            |
ImpactT   | Accelerator Physics (HEP) | PIC, FFT                               | PGI           | 64, 256            |
Paratec   | Material Science (BES)    | DFT, FFT, BLAS3                        | PGI           | 64, 256            | ScaLAPACK, FFTW
GTC       | Fusion (FES)              | PIC, Finite Difference                 | PGI           | 64, 256            |

Most applications were run with pure MPI. Cray MPICH2 over the Gemini network is used for ESM, and OpenMPI over TCP/IP is used for CCM. Limited results are shown, running with up to 256 cores on Hopper. (The ImpactT and Paratec 256-core data were obtained from the Hopper test system.)

MILC performance comparison: with 64 cores, the CCM run time is 1.84 times that of ESM; with 256 cores, it is 1.93 times that of ESM. The speedup from 64 to 256 cores is 1.88 with ESM and 1.77 with CCM.

ImpactT performance comparison: with 64 cores, the CCM run time is 1.26 times that of ESM; with 256 cores, it is 1.30 times that of ESM. The speedup from 64 to 256 cores is 3.85 with ESM and 3.73 with CCM.

Paratec performance comparison: with 64 cores, the CCM run time is 1.19 times that of ESM; with 256 cores, it is 1.55 times that of ESM. The speedup from 64 to 256 cores is 3.23 with ESM and 2.47 with CCM.

GTC weak-scaling performance comparison: with 64 cores, the CCM run time is 1.11 times that of ESM; with 256 cores, it is 1.19 times that of ESM. The speedup from 64 to 256 cores is 3.61 with ESM and 3.37 with CCM.

Across these four applications, the CCM run time ranges from 1.11 times that of ESM (GTC, 64 cores) to 1.93 times that of ESM (MILC, 256 cores). The more MPI communication an application performs (as with MILC), the larger the CCM slowdown relative to ESM.

The ESM speedup from 64 to 256 cores ranges from 1.88 to 3.85, and the CCM speedup ranges from 1.77 to 3.73. ESM shows better speedup than CCM, but CCM is not far behind.

Mixed MPI/OpenMP results are also shown for GTC using 192 cores (with a different input file). Both ESM and CCM have a sweet spot at 3 OpenMP threads per MPI task. The CCM results are almost identical to the ESM results because minimal MPI communication is involved. The actual run time with CCM ranges from 1.01 times (24 threads) to 1.11 times (1 thread) that of ESM.
