Babbage is a NERSC internal cluster containing the Intel Phi coprocessor, which is the first product based on Intel Many Integrated Core (MIC) architecture. NERSC is evaluating this cluster as part of its "Application Readiness" effort aimed at learning how to assist users migrating to energy-efficient, highly parallel architectures. Babbage is not available to general NERSC users.
There is one login node named “bint01”, and 45 compute nodes named “bcxxxx”, with two MIC cards and two Intel Xeon "host" processors within each compute node.
Each Xeon processor (Sandy Bridge EP) contains 8 cores, each with 2 hardware threads (Hyper-Threading, when enabled), and each compute node has 128 GB of memory.
The MIC card hostnames are “bcxxxx-mic0”, “bcxxxx-mic1” and can be accessed directly from other nodes or other MIC-cards. Each MIC card contains 60 cores, with 4 hardware threads each core, and 8 GB of memory per card.
Use ssh to babbage.nersc.gov.
The default shell is currently bash. You can change your default shell through NIM.
The default loaded modules are as follows:
% module list
Currently Loaded Modulefiles:
1) modules 2) nsg/1.2.0 3) torque/4.2.1 4) moab/7.1.3 5) intel/14.0.0 6) impi/4.1.1
All NERSC global filesystems are mounted (except /projectb)
Running in native mode, where all code runs directly on the MIC card, is recommended. To do this, compile using the flag "-mmic."
The Intel compiler is available and loaded by default on Babbage. Once you have logged in to the login node:
- Use ifort, icc, or icpc to compile serial Fortran, C, or C++ codes.
- Use mpiifort, mpiicc, or mpiicpc to compile parallel Fortran, C, or C++ MPI codes.
- Use the “-openmp” flag for OpenMP codes.
- Repeat: use the "-mmic" flag to compile for running natively on the MIC cards.
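For example, a hybrid MPI/OpenMP Fortran code could be built for native MIC execution as follows (the source and executable names are illustrative):

```shell
# native MIC build of a hypothetical hybrid MPI/OpenMP code
mpiifort -mmic -openmp -O2 -o myexe.mic myexe.f90

# the same code built for the Xeon host (simply omit -mmic)
mpiifort -openmp -O2 -o myexe myexe.f90
```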
Other useful compiler options:
- -fcode-asm -S or -fsource-asm -S: two ways of generating source-annotated assembler
- -vec-report6: generate vectorization reports indicating which loops were or were not successfully vectorized, plus additional information about any proven or assumed dependences.
- -openmp-report1: generate the OpenMP parallelization reports indicating loops, regions, and sections successfully parallelized
- -guide-vec=2: causes the compiler to generate messages suggesting ways to improve vectorization
- -free: Specifies source files are in free format.
The MPI library on Babbage is Intel MPI (loaded by default), and there are several new environment variables and command-line options that are useful or necessary.
To run interactively:
bint01% qsub -I -V -l nodes=3 -l walltime=00:30:00
Below is a sample batch script to run a hybrid MPI/OpenMP job across the 4 MIC cards belonging to the 2 compute nodes allocated to the batch job. There are a total of 4 MPI tasks, 1 MPI task per MIC card, and 60 OpenMP threads per MPI task. Use the "qsub" command to submit the batch script to the batch system.
#PBS -q regular
#PBS -l nodes=2
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
cd $PBS_O_WORKDIR
get_micfile
export I_MPI_MPIRUN_CLEANUP=1   # planned to be added into the impi module file
export OMP_NUM_THREADS=60
mpirun.mic -genvall -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./myexe.mic
(Sometimes the full path to the executable is needed; otherwise you may see a "no such file or directory" error.)
Here "get_micfile" is a NERSC utility script that mimics $PBS_NODEFILE: it writes the MIC card hostnames to a file "micfile.$PBS_JOBID" in your submit directory. Assuming the two nodes your batch job is allocated are bc1013 and bc1012, the file contains:
bc1013-mic0
bc1013-mic1
bc1012-mic0
bc1012-mic1
If you would like to run on just one, or any subset, of the MIC cards allocated to your batch job, you can create a custom hostfile and use "-hostfile my_custom_hostfile" in the above mpirun command instead. You can also spell out the MIC cards as below:
mpirun.mic -n 2 -hosts bc1013-mic0,bc1012-mic1 -ppn 1 ./myexe.mic
Environment variables can also be passed in a sample mpirun line as below:
mpirun.mic -n 16 -hosts bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic
The "-genvall" argument to the mpirun command is recommended to export all environment variables to all processes.
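Putting these pieces together, a run line that exports the OpenMP settings to all ranks might look like this (the executable name is illustrative):

```shell
export OMP_NUM_THREADS=60
export KMP_AFFINITY=balanced
mpirun.mic -genvall -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./myexe.mic
```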
Thread Affinity Options
The following choices can be used for env KMP_AFFINITY to set OpenMP thread affinity.
- none: this is the default setting on the compute hosts.
- compact: binds threads as close to each other as possible (This is the setting on the MIC cards.)
- scatter: binds threads as far apart from each other as possible
- balanced: this option exists only on the MIC cards. It first scatters threads across the cores so that each core has at least one thread, then places threads with consecutive numbers on the hardware threads of the same core.
- explicit: use explicit bindings
There is also a new environment variable "KMP_PLACE_THREADS" in newer versions of the Intel compiler. Details of OpenMP thread affinity control can be found in the Intel compiler documentation.
Below is an example of setting KMP_AFFINITY to various options to place 6 OpenMP threads on one MIC card (for illustration simplicity, assume each MIC card has only 3 cores instead of 60):
|Option|Core 1|Core 2|Core 3|
|compact|threads 0,1,2,3|threads 4,5| |
|scatter|threads 0,3|threads 1,4|threads 2,5|
|balanced|threads 0,1|threads 2,3|threads 4,5|
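One way to confirm where threads actually land is the "verbose" modifier of KMP_AFFINITY, which prints each thread's binding at program startup (host and executable names are illustrative):

```shell
mpirun.mic -n 1 -hosts bc1013-mic0 -env OMP_NUM_THREADS 6 \
    -env KMP_AFFINITY verbose,balanced ./myexe.mic
```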
Process and Thread Affinity Options
An environment variable I_MPI_PIN_DOMAIN can be used to set MPI process affinity. Thread affinity then works within process affinity. The value of I_MPI_PIN_DOMAIN can be set in 3 different formats.
- Multi-core shape: I_MPI_PIN_DOMAIN=<mc-shape>, where <mc-shape> can be core, socket, node, cache1, cache2, cache3, cache
- Explicit shape: I_MPI_PIN_DOMAIN=<size>:<layout>, where <size> can be omp, auto, or an explicit value; <layout> can be platform, compact, or scatter.
- Explicit domain mask: I_MPI_PIN_DOMAIN=<masklist>, where <masklist> is a list of mask values.
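For example, to give each MPI rank a pinning domain sized to its OpenMP thread count and lay the domains out compactly (a sketch; the executable name is illustrative):

```shell
export OMP_NUM_THREADS=30
export I_MPI_PIN_DOMAIN=omp:compact
mpirun.mic -genvall -n 8 -hostfile micfile.$PBS_JOBID -ppn 2 ./myexe.mic
```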
More details on setting I_MPI_PIN_DOMAIN can be found in the Intel MPI Library reference documentation.
Notice that the core numbering is different on the MIC cards than on the host nodes. On the host nodes, core numbering starts at 0, while on the MIC cards it starts at 1, since core 0 is reserved for the operating system. The multi-core shape and explicit shape schemes listed above automatically account for this.
Running Jobs on the Host
You may want to run on the Xeon processors on the host for performance comparison with running on the MIC cards.
- To compile, do not use the "-mmic" flag.
- Use "get_hostfile" instead of "get_micfile" in the example batch script, and use "-hostfile hostfile.$PBS_JOBID" instead of "-hostfile micfile.$PBS_JOBID" in the mpirun command. You can also edit a custom hostfile, with lines such as "bc1013-ib" instead of "bc1013-mic0".
- Use "mpirun" instead of "mpirun.mic" in the job launch line.
- There are 2 Xeon processors per host compute node, 8 cores per processor, and 2 hardware threads per core, so the maximum of MPI tasks times OpenMP threads per compute node is 32 (with HT enabled) or 16 (with HT disabled, the current situation) when running on the host, compared with a maximum of 240 on each MIC card.
- The affinity option "balanced" does not exist on the host node. Use other options instead.
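As a sketch, the host-side analogue of the earlier MIC example (illustrative names; 4 MPI tasks across 2 nodes, 8 OpenMP threads each) is:

```shell
# compile for the host: note there is no -mmic
mpiifort -openmp -O2 -o myexe myexe.f90

# inside the batch job
export OMP_NUM_THREADS=8
mpirun -genvall -n 4 -hostfile hostfile.$PBS_JOBID -ppn 2 ./myexe
```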
Batch Queue Structures
|Submit Queue|Execution Queue|Nodes|Max Wallclock|Relative Priority|Run Limit|Queued Limit|
Batch jobs are managed via Torque/Moab. The usual Torque commands such as "qstat" and "qdel" can be used to monitor batch jobs. The Moab commands such as "showq" and "showbf" are also quite useful.
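Typical monitoring commands, run from the login node (the job id is illustrative):

```shell
qstat -u $USER   # your jobs, Torque view
qdel 4891        # delete a job by id
showq            # all jobs, Moab view
showbf           # resources currently available for backfill
```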
Programming and Running Tips
- The default stack size is often too small. If your application gets a segmentation fault shortly after it starts, consider increasing the stack size limit on the MIC cards. When using just one MIC card, you can ssh to it directly, increase the stack size via "ulimit -s unlimited" on the card, and run again. For jobs using more than one MIC card, the above solution does not work. The trick is to wrap the executable in a script, for example:
% cat ulimit.sh
#!/bin/sh
ulimit -s unlimited
./myexe.mic
Run fails without the script:
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./myexe.mic
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Run succeeds with the script:
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./ulimit.sh
... <with correct results> ...
- Consider using the following compiler options for performance tuning:
MIC Compiler Options
- -opt-assume-safe-padding: the compiler assumes that variables and dynamically allocated memory are padded past the end of the object, meaning code may access up to 64 bytes beyond what is specified in your program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.
- -opt-streaming-stores always: enables generation of streaming stores, which helps especially when the application is memory bound: the original content of an entire cache line is not read from memory when its whole content is overwritten.
- -opt-streaming-cache-evict=0: turns off all cache line evicts.
- -mP2OPT_hlo_pref_use_outer_strategy=F: -mP2OPT is an internal developer switch.
Precision Compiler Options (use with extreme caution; make sure results are still correct)
- -fimf_precision=low: lets you specify a level of accuracy (precision) that the compiler should use when determining which math library functions to use. Low is equivalent to accuracy-bits = 11 for single-precision functions and accuracy-bits = 26 for double-precision functions.
- -fimf-domain-exclusion=15: indicates the input argument domains on which math functions must provide correct results; 15 excludes extremes, NaNs, infinities, and denormals.
- -fp-model fast=1: the compiler uses more aggressive optimizations on floating-point calculations.
- -no-prec-div: the compiler may change floating-point division into multiplication by the reciprocal of the denominator (e.g., A/B computed as A * (1/B)) to improve speed; results are slightly less precise than full IEEE division.
- -no-prec-sqrt: the compiler uses a faster but less precise implementation of square root.
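As a hedged illustration, a few of the general performance options above could be combined on one compile line (names are illustrative); when also trying the precision flags, always re-validate the results:

```shell
mpiifort -mmic -openmp -O3 \
    -opt-streaming-stores always \
    -opt-streaming-cache-evict=0 \
    -o myexe.mic myexe.f90
```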
Using Intel Trace Analyzer and Collector
Intel Trace Analyzer and Collector (ITAC) is a tool to help understand the MPI application behavior, quickly find bottlenecks and achieve high performance on parallel applications.
To use ITAC on Babbage, simply load the module:
% module load itac
Then compile with the flag "-trace" (links the VT library to trace the entrance of each MPI call) or "-tcollect" (full tracing). At run time, add the "-trace" flag to the mpirun.mic command line. A "*.stf" file will be generated, which can be opened in a GUI with the "traceanalyzer" command. ITAC can be used both on the host and on the MIC cards.
Below is a sample session running on a MIC card:
-bash-4.1$ module load itac
-bash-4.1$ qsub -I -V -lnodes=1
qsub: waiting for job 4891.bint01.nersc.gov to start
qsub: job 4891.bint01.nersc.gov ready
-bash-4.1$ cd MIC/test_codes/
-bash-4.1$ module list
Currently Loaded Modulefiles:
1) modules 2) nsg/1.2.0 3) torque/4.2.1 4) moab/7.1.3 5) intel/14.0.0 6) impi/4.1.1 7) itac/8.1.3
-bash-4.1$ uname -a
Linux bc1005 2.6.32-279.22.1.el6.nersc3_r33_0.x86_64 #1 SMP Sat May 18 20:01:09 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux
-bash-4.1$ mpiifort -mmic -trace -o mpi-hello.mic mpi-hello.f
-bash-4.1$ mpirun.mic -host bc1005-mic0 -np 4 -ppn 4 -trace ./mpi-hello.mic
myCPU is 2 of 4
myCPU is 1 of 4
myCPU is 3 of 4
myCPU is 0 of 4
 Intel(R) Trace Collector INFO: Writing tracefile mpi-hello.mic.stf in /global/u1/y/yunhe/MIC/test_codes
-bash-4.1$ traceanalyzer mpi-hello.mic.stf
Using Intel Advisor
Intel Advisor is a tool that helps users locate code regions where adding parallelism may provide significant benefit, for C/C++ and Fortran serial or MPI applications. To use Intel Advisor on Babbage:
% module load advisor
Then use the "advixe-gui" GUI interface or the "advixe-cl" command line interface to launch it.
Basic usage: advixe-cl <-action> [-option] [[--] application [application options]]
See "man advixe-cl" for more information.
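A minimal command-line session might look like this (the project directory and executable names are illustrative; "survey" is one of the standard advixe-cl analysis types):

```shell
advixe-cl -collect survey -project-dir ./advi_results -- ./myexe
advixe-cl -report survey -project-dir ./advi_results
```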
Using Intel Inspector
Intel Inspector is a tool to help detect memory and threading problems. To use Intel Inspector on Babbage:
% module load inspector
Then use the "inspxe-gui" GUI interface or the "inspxe-cl" command line interface to launch it.
Basic usage: inspxe-cl <-action> [-action-option] [-global-option] [[--] target [target options]]
See "man inspxe-cl" for more information.
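A minimal command-line session might look like this (the result directories and executable name are illustrative; "mi1" and "ti2" are standard memory- and threading-analysis types):

```shell
inspxe-cl -collect mi1 -result-dir ./insp_mem -- ./myexe   # memory errors
inspxe-cl -collect ti2 -result-dir ./insp_thr -- ./myexe   # threading errors
inspxe-cl -report problems -result-dir ./insp_mem
```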
Using Intel MKL
Intel Math Kernel Libraries (MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, FFT, and vector math.
To compile for the host nodes:
% mpiicc -mkl -o test_mkl test_mkl.c
To compile for the MIC cards:
% mpiicc -mmic -mkl -o test_mkl.mic test_mkl.c
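The MIC build is then launched like any other native binary (the host name is illustrative; MKL_NUM_THREADS caps MKL's internal threading):

```shell
export MKL_NUM_THREADS=60
mpirun.mic -genvall -n 1 -hosts bc1013-mic0 ./test_mkl.mic
```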
Using Intel TBB
Intel Threading Building Blocks (TBB) is a C++ template library developed by Intel for writing software programs that take advantage of multi-core processors. TBB is part of the Intel Composer package and is available via the default loaded "intel" module.
To compile for the host nodes (use "mpiicpc" for hybrid MPI/TBB codes):
% icpc -ltbb -o test_tbb test_tbb.cpp
To compile for the MIC cards (use "mpiicpc" for hybrid MPI/TBB codes):
% icpc -mmic -ltbb -lpthread -o test_tbb.mic test_tbb.cpp
Intel Tools Training Slides, Oct 23, 2013
Further Information at Intel Developer Zone for MIC