
Babbage

Description

Babbage is a NERSC internal cluster containing the Intel Xeon Phi coprocessor, the first product based on the Intel Many Integrated Core (MIC) architecture. NERSC is evaluating this cluster as part of its "Application Readiness" effort, aimed at learning how to assist users migrating to energy-efficient, highly parallel architectures. Babbage is not available to general NERSC users.

Configuration

There is one login node named "bint01" and 45 compute nodes named "bcxxxx", each compute node containing two MIC cards and two Intel Xeon "host" processors.
Each Xeon processor (Sandy Bridge EP) contains 8 cores, with 2 hardware threads per core (hyperthreading to be enabled), and there is 128 GB of memory per node.

The MIC card hostnames are "bcxxxx-mic0" and "bcxxxx-mic1", and the cards can be accessed directly from other nodes or other MIC cards. Each MIC card contains 60 cores, with 4 hardware threads per core, and 8 GB of memory per card.

Access

Use ssh to babbage.nersc.gov.

Programming Environment

The default shell is currently bash.  You can change your default shell through NIM.

The default loaded modules are as follows:

% module list
Currently Loaded Modulefiles:
1) modules 2) nsg/1.2.0 3) torque/4.2.1 4) moab/7.1.3 5) intel/14.0.0 6) impi/4.1.1

Filesystems

All NERSC global filesystems are mounted (except /projectb).

Programming

Running in native mode, where all code runs directly on the MIC card, is recommended. To do this, compile using the "-mmic" flag.

Compilers:

The Intel compiler is available and loaded by default on Babbage. Once you have logged in to the login node:

    • Use ifort, icc, or icpc to compile serial Fortran, C, or C++ codes.
    • Use mpiifort, mpiicc, or mpiicpc to compile parallel Fortran, C, or C++ MPI codes. 
    • Use the “-openmp” flag for OpenMP codes. 
    • Repeat: use the "-mmic" flag to compile for running natively on the MIC cards (see the example after this list).
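
For example, a hybrid MPI/OpenMP Fortran code could be built to run natively on a MIC card as sketched below (the source file "mycode.f90" and executable name "myexe.mic" are placeholder names); omitting "-mmic" would instead produce an executable for the Xeon host:

% mpiifort -openmp -mmic -o myexe.mic mycode.f90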

Other useful compiler options (see the combined example after this list):

    • -align array64byte: Fortran only. Aligns all static array data on 64-byte memory address boundaries so that array data can be loaded from memory into cache optimally.
    • -fcode-asm -S or -fsource-asm -S: two ways of generating source-annotated assembler.
    • -vec_report6: generates vectorization reports indicating which loops were or were not successfully vectorized, plus additional information about any proven or assumed dependences.
    • -openmp-report1: generates OpenMP parallelization reports indicating loops, regions, and sections successfully parallelized.
    • -guide-vec=2: generates messages suggesting ways to improve vectorization.
    • -free: specifies that Fortran source files are in free format.
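
As a sketch, several of these options can be combined on a single compile line (again with placeholder file names):

% mpiifort -mmic -openmp -openmp-report1 -vec_report6 -align array64byte -o myexe.mic mycode.f90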

Running Jobs

The MPI library on Babbage is Intel MPI (loaded by default), and there are several new environment variables and command-line options that are useful or necessary.

To run interactively:

bint01% qsub -I -V -l nodes=3 -l walltime=00:30:00

Below is a sample batch script to run a hybrid MPI/OpenMP job across the 4 MIC cards belonging to the 2 compute nodes allocated to the batch job. There are a total of 4 MPI tasks, 1 MPI task per MIC card, and 60 OpenMP threads per MPI task. Use the "qsub" command to submit the batch script to the batch system.

#PBS -q regular
#PBS -l nodes=2
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=60
export KMP_AFFINITY=balanced
get_micfile
# Sometimes the full path to the executable is needed; otherwise you may see a "no such file or directory" error.
mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./myexe.mic

Here "get_micfile" is a NERSC utility script that mimics the $PBS_NODEFILE but prints out the following to a file "mic.$PBS_JOBID" in your submit directory (assume the two nodes your batch job is allocated to are bc1013 and bc1012):

bc1013-mic0
bc1013-mic1
bc1012-mic0
bc1012-mic1

You can also add "mics=2" in the node request line as below, then the $PBS_MICFILE will contain the nodefile list for the two MIC cards on each node.  MSee the sample batch script below:

#PBS -q regular
#PBS -l nodes=2:mics=2
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=60
export KMP_AFFINITY=balanced
# Sometimes the full path to the executable is needed; otherwise you may see a "no such file or directory" error.
mpirun.mic -n 4 -hostfile $PBS_MICFILE -ppn 1 ./myexe.mic

If you would like to run on just one or any other subset of the MIC cards allocated to your batch job, you can edit a custom hostfile and use "-hostfile my_custom_hostfile" in the above mpirun command instead. You can also spell out the MIC cards as below:

mpirun.mic -n 2 -hosts bc1013-mic0,bc1012-mic1 -ppn 1 ./myexe.mic

Environment variables can also be passed on the mpirun line, as in the example below:

mpirun.mic -n 16 -hosts bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic

Notice "-ppn 1" in the mpirun command means to run with 1 MPI task per MIC card in the hostfile specified.  If running with multiple MPI tasks per MIC card, please adjust the value for "-ppn" accordingly.


The "-genvall" argument to the mpirun command is recommended to export all environment variables to all processes. 

Thread Affinity Options

KMP_AFFINITY

The following values can be used for the KMP_AFFINITY environment variable to set OpenMP thread affinity.

    • none: the default setting on the compute hosts.
    • compact: binds threads as close to each other as possible. (This is the setting on the MIC cards.)
    • scatter: binds threads as far apart from each other as possible.
    • balanced: this option exists only on the MIC cards. It first scatters threads across the cores so that each core has at least one thread, then assigns thread numbers so that threads sharing the hardware threads of the same core have adjacent numbers.
    • explicit: use explicit bindings.

Below is an example of setting KMP_AFFINITY to various options to place 6 OpenMP threads on one MIC card. (For simplicity of illustration, assume each MIC card has only 3 cores instead of 60.)

1) KMP_AFFINITY=compact

         Core 1            Core 2            Core 3
         HT1 HT2 HT3 HT4   HT1 HT2 HT3 HT4   HT1 HT2 HT3 HT4
Thread    0   1   2   3     4   5

2) KMP_AFFINITY=scatter

         Core 1            Core 2            Core 3
         HT1 HT2 HT3 HT4   HT1 HT2 HT3 HT4   HT1 HT2 HT3 HT4
Thread    0   3             1   4             2   5

3) KMP_AFFINITY=balanced

         Core 1            Core 2            Core 3
         HT1 HT2 HT3 HT4   HT1 HT2 HT3 HT4   HT1 HT2 HT3 HT4
Thread    0   1             2   3             4   5
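
To verify where the threads actually land, the "verbose" modifier of KMP_AFFINITY can be prepended to any of these settings; a sketch based on the first batch script above (binding details are printed at program startup):

export KMP_AFFINITY=verbose,balanced
export OMP_NUM_THREADS=60
mpirun.mic -genvall -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./myexe.mic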

KMP_PLACE_THREADS

This is a new environment variable available for the MIC cards only. Used in addition to KMP_AFFINITY, it can set exact but still generic thread placement.

The format of the setting is <n>Cx<m>T,<o>O, which means to use <n> cores and <m> threads per core, with an offset of <o> cores.
For example, "setenv KMP_PLACE_THREADS 40Cx3T,1O" means to use 40 cores with 3 threads (HT2,3,4) per core, starting at an offset of 1 core.

Please notice that the OS runs on logical proc 0, which lives on physical core 59 on Babbage. The OS procs on core 59 are 0, 237, 238, and 239. Please avoid using proc 0, i.e., use at most 236 threads (max_threads=236) on Babbage.
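
As a sketch of one way to respect the 236-thread limit (bash syntax rather than the csh "setenv" above): requesting 59 cores with 4 threads each yields 236 threads; the verbose KMP_AFFINITY output can be used to confirm that the OS core is left alone.

export KMP_PLACE_THREADS=59Cx4T
export OMP_NUM_THREADS=236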

Details of OpenMP Thread Affinity Control can be found here.

Process and Thread Affinity Options

The environment variable I_MPI_PIN_DOMAIN can be used to set MPI process affinity; thread affinity then works within the process affinity. The value of I_MPI_PIN_DOMAIN can be set in 3 different formats (see the sketch after this list).

    • Multi-core shape: I_MPI_PIN_DOMAIN=<mc-shape>, where <mc-shape> can be core, socket, node, cache1, cache2, cache3, cache
    • Explicit shape: I_MPI_PIN_DOMAIN=<size>:<layout>, where <size> can be omp, auto, or an explicit value; <layout> can be platform, compact, or scatter.
    • Explicit domain mask: I_MPI_PIN_DOMAIN=<masklist>, where <masklist> is a list of mask values. 
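
A minimal sketch, assuming 4 MPI tasks per MIC card across the 4 cards from the earlier batch scripts, with each task getting an equal share of the card ("auto" is the explicit-shape size that divides the logical processors evenly among the tasks on a card):

mpirun.mic -n 16 -hostfile micfile.$PBS_JOBID -ppn 4 -env I_MPI_PIN_DOMAIN auto -env OMP_NUM_THREADS 60 -env KMP_AFFINITY compact ./myexe.mic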

More details on setting I_MPI_PIN_DOMAIN can be found in the Intel MPI Library documentation.

Notice that the core numbering is different on the MIC cards than on the host nodes. On the host nodes, core numbering starts at 0, while on the MIC cards it starts at 1, since core 0 is reserved for the operating system. The multi-core shape and explicit shape schemes listed above automatically account for this.

Running Jobs on the Host 

You may want to run on the Xeon processors on the host for performance comparison with running on the MIC cards.  

    • To compile, do not use the "-mmic" flag.  
    • Use "get_hostfile" instead of "get_micfile" in the example batch script, and use "-hostfile hostfile.$PBS_JOBID" instead of "-hostfile micfile.$PBS_JOBID" in the mpirun command.  You can also edit a custom hostfile, with lines such as "bc1013-ib" instead of "bc1013-mic0".
    • Use "mpirun" instead of "mpirun.mic" in the job launch line.
    • There are 2 Xeon processors per host compute node, 8 cores per processor, and 2 hardware threads per core, so the maximum number of MPI tasks times OpenMP threads per compute node is 32 (with HT enabled) or 16 (with HT disabled, the current situation) when running on the host, compared to a maximum of 240 on each MIC card.
    • The affinity option "balanced" does not exist on the host node. Use other options instead.
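
A minimal sketch of the host-side launch, assuming the 2-node batch job from above with 2 MPI tasks per node and 8 OpenMP threads per task (16 threads per node, matching the no-HT limit); "myexe" is a placeholder executable built without "-mmic":

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
export KMP_AFFINITY=scatter
get_hostfile
mpirun -n 4 -hostfile hostfile.$PBS_JOBID -ppn 2 ./myexe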

Batch Queue Structures

Submit Queue   Execution Queue   Nodes   Max Wallclock   Relative Priority   Run Limit   Queued Limit
debug          debug             1-4     30 mins         1                   2           2
regular        reg_small         1-11    12 hrs          2                   4           4
regular        reg_med           12-23   6 hrs           2                   2           2
regular        reg_big           24-45   2 hrs           2                   1           1

Batch jobs are managed via Torque/Moab. The usual Torque commands such as "qstat" and "qdel" can be used to monitor and manage batch jobs. The Moab commands such as "showq" and "showbf" are also quite useful.
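
For example (a sketch of common monitoring commands):

% qstat -u $USER
% showq
% showbf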

Programming and Running Tips

    • The default stack size is often too small. If your application gets a segmentation fault shortly after it starts, consider increasing the stack size limit on the MIC cards. When using just one MIC card, you can ssh to it directly, increase the stack size via "ulimit -s unlimited" on the card, and run again. For jobs using more than one MIC card, the above solution does not work. The trick is to wrap the executable in a small script (made executable with "chmod +x"), for example:
% cat ulimit.sh
#!/bin/sh
# Raise the stack size limit, then launch the real executable.
ulimit -s unlimited
./myexe.mic

Run fails without the script:
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./myexe.mic
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Run succeeds with the script:
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./ulimit.sh
... <with correct results> ...
      • Consider using the following compiler options for performance tuning:

        MIC COMPILER OPTIONS
        • -opt-assume-safe-padding: the compiler assumes that variables and dynamically allocated memory are padded past the end of the object, i.e., that code can access up to 64 bytes beyond what is specified in your program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.
        • -opt-streaming-stores always: enables generation of streaming stores for optimization. It helps especially when the application is memory bound: the original content of an entire cache line is not read from memory when its whole content is about to be overwritten.
        • -opt-streaming-cache-evict=0: turns off all cache line evicts.
        • -mP2OPT_hlo_pref_use_outer_strategy=F: -mP2OPT is an internal developer switch.
        PRECISION COMPILER OPTIONS (USE WITH EXTREME CAUTION; MAKE SURE RESULTS ARE STILL CORRECT)
        • -fimf_precision=low: lets you specify a level of accuracy (precision) that the compiler should use when determining which math library functions to use. Low is equivalent to accuracy-bits = 11 for single-precision functions and accuracy-bits = 26 for double-precision functions.
        • -fimf-domain-exclusion=15: indicates the input argument domains on which math functions must provide correct results. 15 specifies extremes, nans, infinities, and denormals.
        • -fp-model fast=1: the compiler uses more aggressive optimizations on floating-point calculations.
        • -no-prec-div: the compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation. This gives slightly less precise results than full IEEE division.
        • -no-prec-sqrt: the compiler uses a faster but less precise implementation of square root calculation.

Using Intel Trace Analyzer and Collector

Intel Trace Analyzer and Collector (ITAC) is a tool that helps users understand MPI application behavior, quickly find bottlenecks, and achieve high performance in parallel applications.

To use ITAC on Babbage, simply load the module:

% module load itac

Then compile with the flag "-trace" (to link the VT library and trace the entrance of each MPI call) or "-tcollect" (for full tracing). At run time, add the "-trace" flag to the mpirun.mic command. A "*.stf" file will be generated, which can be opened in a GUI with the "traceanalyzer" command. ITAC can be used both on the host and on the MIC cards.

Below is a sample session running on a MIC card:

-bash-4.1$ module load itac
-bash-4.1$ qsub -I -V -lnodes=1
qsub: waiting for job 4891.bint01.nersc.gov to start
qsub: job 4891.bint01.nersc.gov ready
-bash-4.1$ cd MIC/test_codes/
-bash-4.1$ module list
Currently Loaded Modulefiles:
  1) modules        2) nsg/1.2.0      3) torque/4.2.1   4) moab/7.1.3     5) intel/14.0.0   6) impi/4.1.1     7) itac/8.1.3
-bash-4.1$ uname -a         
Linux bc1005 2.6.32-279.22.1.el6.nersc3_r33_0.x86_64 #1 SMP Sat May 18 20:01:09 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux

-bash-4.1$ mpiifort -mmic -trace -o mpi-hello.mic mpi-hello.f
-bash-4.1$ mpirun.mic -host bc1005-mic0 -np 4 -ppn 4 -trace  ./mpi-hello.mic
myCPU is 2 of 4
myCPU is 1 of 4
myCPU is 3 of 4
myCPU is 0 of 4
[0] Intel(R) Trace Collector INFO: Writing tracefile mpi-hello.mic.stf in /global/u1/y/yunhe/MIC/test_codes
-bash-4.1$ traceanalyzer mpi-hello.mic.stf

Using Intel Advisor

Intel Advisor is a tool that helps users locate code areas where adding parallelism may provide significant benefit, for C/C++ and Fortran serial or MPI applications. To use Intel Advisor on Babbage:

% module load advisor

Then use the "advixe-gui" GUI interface or "advixe-cl" command line option to launch.   
Basic usage: advixe-cl <-action> [-option] [[--] application [application options]]    
See "man advixe-cl" for more information.

Using Intel Inspector

Intel Inspector is a tool to help detect memory and threading problems. To use Intel Inspector on Babbage:

% module load inspector

Then use the "inspxe-gui" GUI interface or "inspxe-cl" command line option to launch. 
Basic usage: inspxe-cl <-action> [-action-option] [-global-option] [[--] target [target options]]    
See "man inspxe-cl" for more information.

Using Intel MKL

Intel Math Kernel Libraries (MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, FFT, and vector math.

To compile for the host nodes:

% mpiicc -mkl -o test_mkl test_mkl.c

To compile for the MIC cards:

% mpiicc -mmic -mkl -o test_mkl.mic test_mkl.c

Using Intel TBB

Intel Threading Building Blocks (TBB) is a C++ template library developed by Intel Corporation for writing software programs that take advantage of multi-core processors.  TBB is a part of the Intel Composer package, and available via the default loaded "intel" module.

To compile for the host nodes (use "mpiicpc" for hybrid MPI/TBB codes):

% icpc -ltbb -o test_tbb test_tbb.cpp

To compile for the MIC cards (use "mpiicpc" for hybrid MPI/TBB codes):

% icpc -mmic -ltbb -lpthread -o test_tbb.mic test_tbb.cpp

Intel Tools Training Slides, Oct 23, 2013

Further Information at Intel Developer Zone for MIC