NERSC's next supercomputer ("Cori") will be a Cray XC system featuring the next-generation Intel Knights Landing (KNL) Many Integrated Core (MIC) architecture. Babbage is a NERSC testbed containing the current-generation Intel Knights Corner (KNC) coprocessor. NERSC has been using this cluster as part of its "Application Readiness" effort, which aims to learn how to assist users migrating to energy-efficient, highly parallel architectures.
NERSC users can use Babbage (and Edison) to help prepare applications for Cori. Users are encouraged to use Babbage in "native" mode, in which both the operating system and user applications run on the Phi. Cori will run only in this "self-hosted" mode. Keep in mind that MPI performance across nodes (and Phi cards) on Babbage is not optimal. We recommend using Babbage to improve the single-node performance of your code, especially vectorization and thread scalability. More background information about manycore processors, programming model changes, and what this means for Cori can be found on the Cori web page.
NERSC users can apply for a Babbage account via the Request for Babbage Account Form.
Babbage is a testbed system and is not considered a NERSC production system. Therefore, policies regarding user access and staff support differ substantially from other NERSC systems. Although every effort will be made to provide quality service, the system can become unavailable on short notice. Problem fixes and software support may also not be as thorough or as prompt as on production systems at NERSC.
There is one login node and 45 compute nodes named “bcxxxx”; each compute node has two MIC cards and two Intel Xeon "host" processors. Each Xeon processor (Sandy Bridge EP) contains 8 cores, each capable of 2 hardware threads (hyperthreading is enabled), and each node has 128 GB of memory.
The MIC cards are connected to the host via a PCIe bus. The MIC hostnames are “bcxxxx-mic0” and “bcxxxx-mic1”, and the cards can be accessed directly from other nodes or other MIC cards. Each MIC card contains 60 cores, with 4 hardware threads per core and 8 GB of memory per card. Cores are interconnected by a high-speed bidirectional ring. Each core has a local 512 KB L2 cache with high-speed access to all other L2 caches (fully cache coherent).
Each core supports 64-bit x86 instructions with a 512-bit wide SIMD vector ISA and can execute 8 double-precision (or 16 single-precision or integer) operations per cycle. With Fused Multiply-Add (FMA), it can execute 16 double-precision or 32 single-precision FLOPs per cycle. Peak performance for each MIC card is 1 TFlop/sec (double precision).
Note: To avoid interfering with other users' jobs, please do not ssh directly to the compute nodes or MIC cards. You should gain access to compute nodes exclusively via a batch script or an interactive batch session.
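The per-cycle figures above follow from simple arithmetic, which the short Python sketch below reproduces. It uses only numbers quoted in the text; the clock frequency is not stated on this page, so the value computed at the end is merely what the rounded 1 TFlop/sec figure would imply, not a specification.

```python
# Peak-rate arithmetic for one KNC card, using only figures quoted in the text.
SIMD_BITS = 512          # width of the vector unit
CORES_PER_CARD = 60      # cores per MIC card
PEAK_DP_FLOPS = 1.0e12   # "1 TFlop/sec (double precision)", a rounded figure


def lanes(element_bits):
    """Number of elements of the given width that fit in one SIMD register."""
    return SIMD_BITS // element_bits


dp_ops_per_cycle = lanes(64)   # double-precision elements per vector
sp_ops_per_cycle = lanes(32)   # single-precision (or 32-bit integer) elements

# A fused multiply-add counts as two floating-point operations per lane.
dp_flops_per_cycle = 2 * dp_ops_per_cycle
sp_flops_per_cycle = 2 * sp_ops_per_cycle

# Clock frequency implied by the rounded peak figure (an inference, not a spec).
implied_clock_ghz = PEAK_DP_FLOPS / (CORES_PER_CARD * dp_flops_per_cycle) / 1e9

print(dp_ops_per_cycle, sp_ops_per_cycle, dp_flops_per_cycle, sp_flops_per_cycle)
print(round(implied_clock_ghz, 3), "GHz implied")
```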
% ssh babbage.nersc.gov
Once you are logged in you may notice that the name of the login node is "bint01."
The default login shell is currently csh. You can change your default login shell through the NERSC NIM web-based interface.
The default loaded modules are as follows:
% module list
Currently Loaded Modulefiles:
1) modules 2) nsg/1.2.0 3) slurm/default 4) intel/15.0.update1 5) impi/5.0.update1 6) usg-default-modules/1.1
Note that the set of default modules will likely change as the system matures.
All NERSC global file systems are mounted (except /projectb).
Running in native mode, where all code runs directly on the MIC card, is recommended. To do this, compile using the flag "-mmic."
Intel compilers are available and loaded by default on Babbage. Once you have logged in to the login node:
- Use ifort, icc, or icpc to compile serial Fortran, C, or C++ codes.
- Use mpiifort, mpiicc, or mpiicpc to compile MPI parallel Fortran, C, or C++ codes.
- Use the “-openmp” flag for OpenMP codes.
- Repeat: use the "-mmic" flag to compile for running natively on the MIC cards.
- The default compiler optimization is -O2.
- Vectorization (-vec) is on by default with -O2 or higher. Use "-novec -nosimd" to turn off vectorization (only recommended to compare with the default "-vec" performance result to confirm vectorization effect).
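As a rough illustration of how the rules above combine, the Python sketch below assembles a compile command line from them. The function name `compile_command` and its parameters are hypothetical helpers invented for this example; only the compiler names and the "-mmic" and "-openmp" flags come from the list above.

```python
# Hypothetical helper: builds a compile command from the rules listed above.
SERIAL_COMPILERS = {"fortran": "ifort", "c": "icc", "c++": "icpc"}
MPI_COMPILERS = {"fortran": "mpiifort", "c": "mpiicc", "c++": "mpiicpc"}


def compile_command(language, source, output, mpi=False, openmp=False, native_mic=True):
    """Return the command string for compiling `source` into `output`."""
    table = MPI_COMPILERS if mpi else SERIAL_COMPILERS
    cmd = [table[language]]
    if native_mic:
        cmd.append("-mmic")      # build for native execution on the MIC cards
    if openmp:
        cmd.append("-openmp")    # enable OpenMP
    cmd += ["-o", output, source]
    return " ".join(cmd)


print(compile_command("fortran", "jacobi_mpiomp.f90", "jacobi_mpiomp.mic",
                      mpi=True, openmp=True))
```

Passing `native_mic=False` drops "-mmic" and yields a command for the host processors instead.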
Other useful compiler options:
- -align array64byte
Fortran only. Align all static array data to 64-byte memory address boundaries to ensure the array data can be loaded from memory to cache optimally.
- -fcode-asm or -fsource-asm
Two ways of generating source-annotated assembler
Combines the four now-deprecated options (-opt-report, -vec-report, -openmp-report, -par-report) of previous compiler versions.
Specifies that Fortran source files are in free format.
Babbage batch jobs are scheduled by the SLURM batch scheduler. Please refer to Running Jobs under SLURM on Babbage.
The MPI library on Babbage is Intel MPI (loaded by default) and there are several new environment variables and command-line options that are useful or necessary.
To run interactively:
bint01% salloc -N 2 -t 00:30:00 -p debug
Below is a sample batch script to run a hybrid MPI/OpenMP job across the 4 MIC cards belonging to the 2 compute nodes allocated to the batch job. There are a total of 4 MPI tasks, 1 MPI task per MIC card (-ppn 1), and 30 OpenMP threads per MPI task. Use the "sbatch" command to submit a batch script.
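#!/bin/bash
#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 02:00:00
#SBATCH -J my_job
#SBATCH -e my_job.%j.err
#SBATCH -o my_job.%j.out
# Create micfile.$SLURM_JOB_ID, listing the MIC cards of the allocated nodes
get_micfile
# 4 MPI tasks, 1 per MIC card, 30 OpenMP threads per task
mpirun.mic -n 4 -hostfile micfile.$SLURM_JOB_ID -ppn 1 -env OMP_NUM_THREADS 30 ./myexe.mic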
Sometimes the full path to your executable is needed; otherwise you may see a "no such file or directory" error. In the above script, "get_micfile" is a NERSC utility that mimics $SLURM_NODELIST but creates the file "micfile.$SLURM_JOB_ID" in your submit directory. Assuming the two compute nodes allocated to your batch job are bc1013 and bc1012, this file would list the MIC cards of those two nodes: bc1013-mic0, bc1013-mic1, bc1012-mic0, and bc1012-mic1.
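For a custom hostfile, the MIC hostnames can be derived from the compute node names with the "bcxxxx-mic0" / "bcxxxx-mic1" convention. The Python sketch below is a hypothetical stand-in written for illustration, not the NERSC get_micfile utility; the function name and the ordering of the entries are assumptions.

```python
# Hypothetical stand-in for illustration only; this is not get_micfile itself.
def mic_hostnames(nodes, cards_per_node=2):
    """Expand compute node names into MIC card hostnames (<node>-mic<k>)."""
    return [f"{node}-mic{card}" for node in nodes for card in range(cards_per_node)]


def write_hostfile(path, nodes):
    """Write one MIC hostname per line to `path`."""
    with open(path, "w") as handle:
        for name in mic_hostnames(nodes):
            handle.write(name + "\n")


for name in mic_hostnames(["bc1013", "bc1012"]):
    print(name)
```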
You can also create a custom hostfile and use "-hostfile my_custom_hostfile" in the mpirun.mic command. You can also spell out the MIC cards, as below:
mpirun.mic -n 2 -hosts bc1013-mic0,bc1012-mic1 -ppn 1 ./myexe.mic
Environment variables can also be passed as shown below:
mpirun.mic -n 16 -hosts bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic
Thread Affinity Options
The following choices can be used for the environment variable KMP_AFFINITY to set OpenMP thread affinity.
- none
This is the default setting on the compute hosts.
- compact
Binds threads as close to each other as possible; this is the setting on the MIC cards.
- scatter
Binds threads as far apart from each other as possible.
- balanced
This option applies only to MIC cards. It first scatters threads across the cores, so that each core has at least one thread, and it keeps the thread numbers that share the hardware threads of the same core close to each other.
- explicit
Uses explicit bindings.
Below is an example of setting KMP_AFFINITY to various options to allocate 6 OpenMP threads on one MIC card. To keep the illustration simple, assume each MIC card has only 3 cores instead of 60.
|Node||Core 1||Core 2||Core 3|
|Node||Core 1||Core 2||Core 3|
|Node||Core 1||Core 2||Core 3|
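The 6-thread, 3-core illustration can also be worked through with a small simulation. The Python sketch below is a simplified model of the compact, scatter, and balanced policies as they are described above; it is not the Intel OpenMP runtime, and the function name `place_threads` is hypothetical.

```python
# Simplified model of three KMP_AFFINITY policies; not the actual runtime logic.
def place_threads(n_threads, n_cores, policy, ht_per_core=4):
    """Return one list of OpenMP thread ids per core."""
    if n_threads > n_cores * ht_per_core:
        raise ValueError("more threads than hardware threads")
    cores = [[] for _ in range(n_cores)]
    if policy == "compact":
        # Fill every hardware thread of a core before moving to the next core.
        for t in range(n_threads):
            cores[t // ht_per_core].append(t)
    elif policy == "scatter":
        # Deal threads round-robin across cores.
        for t in range(n_threads):
            cores[t % n_cores].append(t)
    elif policy == "balanced":
        # Spread evenly, but keep consecutive thread ids on the same core.
        base, extra = divmod(n_threads, n_cores)
        start = 0
        for c in range(n_cores):
            count = base + (1 if c < extra else 0)
            cores[c] = list(range(start, start + count))
            start += count
    else:
        raise ValueError("unknown policy: " + policy)
    return cores


for policy in ("compact", "scatter", "balanced"):
    print(policy, place_threads(6, 3, policy))
```

In this model, compact leaves the third core idle, scatter gives each core two threads with non-adjacent ids, and balanced gives each core two threads with adjacent ids.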
KMP_PLACE_THREADS is a new environment variable available only for the MIC cards. It does not replace KMP_AFFINITY, but works with it to set exact but still generic thread placement.
The format of the setting is <n>Cx<m>T,<o>O, which means to use <n> cores times <m> threads per core, with an offset of <o> cores. Example:
setenv KMP_PLACE_THREADS 40Cx3T,1O
This means to use 40 physical cores and 3 threads (HT2,3,4) per core.
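To check how many threads a given KMP_PLACE_THREADS value provides, the string can be decoded as in the Python sketch below. It handles only the <n>Cx<m>T,<o>O form shown above (with the offset treated as optional, which is an assumption); the function name is hypothetical, and the real runtime may accept other spellings.

```python
import re

# Matches the <n>Cx<m>T,<o>O form described above; the offset part is optional here.
_PLACE_RE = re.compile(r"(\d+)[Cc]x(\d+)[Tt](?:,(\d+)[Oo])?")


def parse_place_threads(value):
    """Decode a KMP_PLACE_THREADS string into its numeric parts."""
    match = _PLACE_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError("unrecognized KMP_PLACE_THREADS value: " + value)
    cores = int(match.group(1))
    threads_per_core = int(match.group(2))
    core_offset = int(match.group(3)) if match.group(3) else 0
    return {
        "cores": cores,
        "threads_per_core": threads_per_core,
        "core_offset": core_offset,
        "total_threads": cores * threads_per_core,
    }


print(parse_place_threads("40Cx3T,1O"))
```

For the example setting, this gives 40 x 3 = 120 threads in total.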
NOTE: the operating system (OS) always runs on logical processor core 0, which lives on physical core 59 on Babbage. OS procs on core 59 are threads 0, 237, 238, and 239. Please avoid using proc 0; i.e., use max_threads=236 on Babbage.
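A quick way to respect this limit is to check the product of MPI tasks per card and OpenMP threads per task before submitting. The Python sketch below does that arithmetic with the numbers given on this page; the function name `fits_on_card` is a hypothetical helper.

```python
# Thread budget per MIC card, from the figures on this page.
CORES_PER_CARD = 60
HW_THREADS_PER_CORE = 4
LOGICAL_CPUS = CORES_PER_CARD * HW_THREADS_PER_CORE   # 240
OS_RESERVED = 4                                       # procs 0, 237, 238, 239
USABLE_THREADS = LOGICAL_CPUS - OS_RESERVED           # 236


def fits_on_card(tasks_per_card, omp_threads_per_task):
    """True if tasks x threads stays within the usable hardware threads."""
    return tasks_per_card * omp_threads_per_task <= USABLE_THREADS


print(fits_on_card(1, 30))    # the sample batch script: 1 task x 30 threads
print(fits_on_card(16, 2))    # the -env example: 16 tasks x 2 threads
print(fits_on_card(4, 60))    # 240 threads would spill onto the OS core
```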
Details of OpenMP Thread Affinity Control can be found here.
Process and Thread Affinity Options
The environment variable I_MPI_PIN_DOMAIN can be used to set MPI process affinity. Thread affinity then works within process affinity. The value of I_MPI_PIN_DOMAIN can be set in 3 different formats.
- Multi-core shape
I_MPI_PIN_DOMAIN=<mc-shape>, where <mc-shape> can be core, socket, node, cache1, cache2, cache3, cache
- Explicit shape
I_MPI_PIN_DOMAIN=<size>:<layout>, where <size> can be omp, auto, or an explicit value; Here the value of <size> ranges from 1 to 240, which is the max number of logical cores. auto is the default. <layout> can be platform, compact, or scatter. scatter is the default.
- Explicit domain mask
I_MPI_PIN_DOMAIN=<masklist>, where <masklist> is a list of mask values.
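The three formats can be told apart mechanically, as the Python sketch below illustrates. It is only a classifier written for this page, not part of Intel MPI. The bracketed, comma-separated hexadecimal syntax used here for the mask list is an assumption about how <masklist> is written; check the Intel MPI reference before relying on it.

```python
MC_SHAPES = {"core", "socket", "node", "cache1", "cache2", "cache3", "cache"}
LAYOUTS = {"platform", "compact", "scatter"}


def classify_pin_domain(value, max_logical=240):
    """Return which of the three I_MPI_PIN_DOMAIN formats `value` uses."""
    v = value.strip()
    if v in MC_SHAPES:
        return "multi-core shape"
    # Assumed mask-list syntax: [mask1,mask2,...] with hexadecimal masks.
    if v.startswith("[") and v.endswith("]"):
        for mask in v[1:-1].split(","):
            int(mask, 16)  # raises ValueError if a mask is not hexadecimal
        return "explicit domain mask"
    size, _, layout = v.partition(":")
    if layout and layout not in LAYOUTS:
        raise ValueError("unknown layout: " + layout)
    if size in {"omp", "auto"}:
        return "explicit shape"
    if size.isdigit() and 1 <= int(size) <= max_logical:
        return "explicit shape"
    raise ValueError("unrecognized I_MPI_PIN_DOMAIN value: " + value)


for example in ("core", "omp:compact", "4", "[0F,F0]"):
    print(example, "->", classify_pin_domain(example))
```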
More details on setting I_MPI_PIN_DOMAIN can be found in two Intel documents.
Notice that core numbering is different on the MIC cards than on the host nodes. On the host nodes, core numbering starts with 0, while on the MIC cards, core numbering starts from 1, since core 0 is reserved for the operating system. The multi-core shape and explicit shape schemes listed above automatically account for this.
Nested OpenMP is supported on Babbage. Please see more information on example code and thread affinity control settings here.
Running Jobs on the Host
To run on the Xeon processors on the host do the following:
- To compile, do not use the "-mmic" flag.
- Use "get_hostfile" instead of "get_micfile" in your batch script, and use "-hostfile hostfile.$SLURM_JOB_ID" in the mpirun command. You can also create a custom hostfile with lines such as "bc1013-ib."
- Use "mpirun" instead of "mpirun.mic" in the job launch line.
- When running on the Babbage host processors, the maximum MPI tasks times OpenMP threads per compute node is currently 16, because HT is not enabled. This compares with the maximum number of 240 on each MIC card.
- The affinity option "balanced" does not exist on the host node. Use other options instead.
Batch Queue Structures
|Submit Queue||Execution Queue||Nodes||Max Wallclock||Relative Priority||Run Limit||Queued Limit|
Batch jobs are managed via SLURM. The usual SLURM commands such as "squeue", "scancel" can be used to monitor batch jobs.
Programming and Running Tips
When building software and libraries on MIC cards using autoconf/configure scripts, sometimes a test program needs to be run. Since the build is a cross-compile from the login node (or host compute nodes), the binary generated from the test program is for running on the MIC cards, so this test program will fail during the configure process (binaries are not compatible between the host nodes and MIC cards).
For such a configure run to succeed and generate a Makefile that can be used to build the intended software or libraries for the MIC cards, we suggest two workarounds:
The first option to try is the "--host=x86_64-unknown-linux-gnu" option for configure, so that many test programs can be skipped. If this fails, another trick is to define "-DMIC" in the compiler settings such as CC, CXX, FC, etc. used by "configure": export CC="icc -DMIC", … . Then replace all "-DMIC" in the generated Makefile with "-mmic", and compile and build.
files=$(find ./* -name Makefile)
perl -p -i -e 's/-DMIC/-mmic/g' $files
find . -name Makefile | xargs sed -i 's/-DMIC/-mmic/g'
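The same substitution can be scripted in Python if perl or sed is inconvenient. The sketch below walks a build tree and rewrites every file named Makefile; the function name `retarget_makefiles` is hypothetical, and the behavior is meant to mirror the one-liners above rather than add anything to them.

```python
import os


def retarget_makefiles(root):
    """Replace -DMIC with -mmic in every Makefile under `root`.

    Returns the sorted list of files that were changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "Makefile" not in filenames:
            continue
        path = os.path.join(dirpath, "Makefile")
        with open(path) as handle:
            text = handle.read()
        updated = text.replace("-DMIC", "-mmic")
        if updated != text:
            with open(path, "w") as handle:
                handle.write(updated)
            changed.append(path)
    return sorted(changed)
```

Calling `retarget_makefiles(".")` from the top of the build directory after configure finishes corresponds to the commands above.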
- Intel MPI dynamically selects the most appropriate network fabric for communication. Intra-node communication uses "shm"; inter-node communication uses tcp, or dapl or ofa over InfiniBand. Set the environment variable "I_MPI_FABRICS" to "<intranode fabric>:<internode fabric>" at run time to specify the network fabric. The default fabric is "shm:dapl". The available I_MPI_FABRICS choices on Babbage are "shm:dapl", "shm:ofa", and "shm:tcp". Try different fabrics with your application to choose the one that helps performance the most. The MPI fabrics in use are displayed if the environment variable "I_MPI_DEBUG" is set to 2 or higher.
- Consider using the following compiler options for performance tuning:
|-opt-assume-safe-padding||When -opt-assume-safe-padding is specified, the compiler assumes that variables and dynamically allocated memory are padded past the end of the object. This means that code can access up to 64 bytes beyond what is specified in your program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.|
|-opt-streaming-stores always||Enables generation of streaming stores for optimization. This helps especially when the application is memory bound: the original content of a cache line is not read from memory when the whole line is about to be overwritten.|
|-opt-streaming-cache-evict=0||Turn off all cache line evicts.|
|-mP2OPT_hlo_pref_use_outer_strategy=F||-mP2OPT is an internal developer switch.|
|-fimf-precision=low||Lets you specify a level of accuracy (precision) that the compiler should use when determining which math library functions to use. Low is equivalent to accuracy-bits = 11 for single-precision functions; accuracy-bits = 26 for double-precision functions.|
|-fimf-domain-exclusion=15||Indicates the input arguments domain on which math functions must provide correct results. 15 specifies extremes, nans, infinities, denormals.|
|-fp-model fast=1||The compiler uses more aggressive optimizations on floating-point calculations.|
|-no-prec-div||The compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation. It gives slightly less precise results than full IEEE division.|
|-no-prec-sqrt||The compiler does a faster but less precise implementation of square root calculation.|
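Returning to the I_MPI_FABRICS setting in the first tip above: the value is a pair of fabric names separated by a colon, the first for communication within a node and the second for communication between nodes. The Python sketch below splits and validates such a value against the choices listed for Babbage; the function name `split_fabrics` is a hypothetical helper.

```python
# Fabric choices listed as available on Babbage.
AVAILABLE_FABRICS = ("shm:dapl", "shm:ofa", "shm:tcp")


def split_fabrics(value):
    """Split an I_MPI_FABRICS value into intranode and internode fabrics."""
    if value not in AVAILABLE_FABRICS:
        raise ValueError("not an available choice on Babbage: " + value)
    intranode, internode = value.split(":")
    return {"intranode": intranode, "internode": internode}


for choice in AVAILABLE_FABRICS:
    print(choice, "->", split_fabrics(choice))
```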
You should explore the single-node performance of your code on Babbage to prepare your application for the N8 architecture (Intel KNL). Fully utilizing vectorization and thread scalability on the Babbage KNC cards is especially important.
- Vectorization: Vectorization introduction and information for Edison and Babbage.
- Using OpenMP on Edison. Programming, Running, and example batch scripts for using MPI/OpenMP on Edison.
- OpenMP Resources: a collection of online OpenMP resources and tutorials.
- NERSC Application Readiness Case Studies: Some examples of challenges and strategies used to optimize scientific applications and kernel codes performance on Babbage.
Using Intel Trace Analyzer and Collector
Intel Trace Analyzer and Collector (ITAC) is a tool that helps you understand MPI application behavior, quickly find bottlenecks, and achieve high performance in parallel applications.
To use ITAC on Babbage, simply load the module:
% module load itac
Then compile with the flag "-trace" (with VT library to trace entrance of each MPI call), or "-tcollect" (with full tracing). At run time, add the "-trace" flag to the mpirun.mic option. A "*.stf" file will be generated, which can be used via the "traceanalyzer" command to open a GUI. ITAC can be used on both the host and on the MIC cards.
Below is a sample session running on a MIC card:
-bash-4.1$ salloc -N 1
salloc: Granted job allocation 1177
-bash-4.1$ module load itac
-bash-4.1$ cd MIC/test_codes
-bash-4.1$ uname -a
Linux bc1005 2.6.32-279.22.1.el6.nersc3_r33_0.x86_64 #1 SMP Sat May 18 20:01:09 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux
-bash-4.1$ mpiifort -mmic -trace -o mpi-hello.mic mpi-hello.f
-bash-4.1$ mpirun.mic -host bc1005-mic0 -np 4 -ppn 4 -trace ./mpi-hello.mic
myCPU is 2 of 4
myCPU is 1 of 4
myCPU is 3 of 4
myCPU is 0 of 4
 Intel(R) Trace Collector INFO: Writing tracefile mpi-hello.mic.stf in /global/u1/y/yunhe/MIC/test_codes
-bash-4.1$ traceanalyzer mpi-hello.mic.stf
Using Intel VTune
Intel VTune Amplifier is a performance analysis and optimization tool. To use VTune on Babbage, simply load the module:
% module load vtune
Below is a sample session running on a MIC card with VTune command line:
-bash-4.1$ salloc -N 1 -t 30:00
salloc: Granted job allocation 1178
-bash-4.1$ echo $DISPLAY
-bash-4.1$ cd $SLURM_SUBMIT_DIR
-bash-4.1$ mpiifort -mmic -openmp -O3 -o jacobi_mpiomp.mic jacobi_mpiomp.f90
-bash-4.1$ uname -a (#to find out which node you are on, say it is bc0903)
-bash-4.1$ export OMP_NUM_THREADS=20
-bash-4.1$ module load vtune
-bash-4.1$ amplxe-cl -collect advanced-hotspots -target-system=mic-host-launch -app-working-dir /global/homes/y/yunhe/MIC/test_codes -- mpirun.mic -n 8 -host bc0903-mic0 /global/homes/y/yunhe/MIC/test_codes/jacobi_mpiomp.mic
You can also do "-collect knc-general-exploration" or "-collect knc-bandwidth", or add additional flags in the above command line by using the show "command line" button from GUI. See "amplxe-cl" man page for more command line options.
The VTune GUI can be started as:
from within a batch script as above. The GUI can be used to open performance data collected from a command-line session, or to start a new analysis directly. See our VTune web page for more information.
Using Intel Advisor
The goal of the Advisor is to help you find sections of your application to which parallelism can be added to give you the best performance gains and scalability while maintaining correct results.
To use Intel Advisor on Babbage:
% module load advisor
Then use the "advixe-gui" GUI interface or "advixe-cl" command line option to launch.
Basic usage: advixe-cl <-action> [-option] [[--] application [application options]]
See "man advixe-cl" for more information.
Using Intel Inspector
Intel Inspector is a tool to help detect memory and threading problems. To use Intel Inspector on Babbage:
% module load inspector
Then use the "inspxe-gui" GUI interface or "inspxe-cl" command line option to launch.
Basic usage: inspxe-cl <-action> [-action-option] [-global-option] [[--] target [target options]]
See "man inspxe-cl" for more information.
Using Intel MKL
Intel Math Kernel Library (MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, FFT, and vector math. The MKL path and the environment variable $MKLROOT are defined as part of the default loaded "intel" module.
To compile for the host nodes:
% mpiicc -mkl -o test_mkl test_mkl.c
To compile for the MIC cards:
% mpiicc -mmic -mkl -o test_mkl.mic test_mkl.c
Using Intel TBB
Intel Threading Building Blocks (TBB) is a C++ template library for writing software programs that take advantage of multi-core processors. TBB is a part of the Intel Composer package and is available via the default loaded "intel" module.
To compile for the host nodes (use "mpiicpc" for the hybrid MPI/TBB codes):
% icpc -ltbb -o test_tbb test_tbb.cpp
To compile for the MIC cards (use "mpiicpc" for hybrid MPI/TBB codes):
% icpc -mmic -ltbb -lpthread -o test_tbb.mic test_tbb.cpp
Using Intel SDE
Intel SDE (Software Development Emulator) is a tool for emulating processors such as KNL, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, etc. To use Intel SDE on Babbage:
% module load sde
Then use "sde64 -help" or "sde64 -help-long" for usage help.
Basic usage: sde64 [args] -- application [application-args]
Using Allinea DDT and MAP
Using OpenCL
To use OpenCL, just load the "opencl" module, then compile and link with $OPENCL.
% module load opencl
% mpiicpc my_opencl_code.cpp $OPENCL (or % icpc my_opencl_code.cpp $OPENCL)
Some OpenCL exercises for SC14 can be found here.
Mini Tutorial on Babbage
A mini tutorial about Babbage introduction and how Babbage can be used to prepare for Cori was presented at the NUG monthly meeting in July 2014. Slides can be found here.
Intel Tools Training Slides, Oct 23, 2013
Further Information at Intel Developer Zone for MIC