
Babbage

Introduction

NERSC's next supercomputer ("Cori") will be a Cray XC system featuring the next-generation Intel Knights Landing (KNL) Many Integrated Core (MIC) architecture. Babbage is a NERSC testbed containing the current-generation Intel Knights Corner (KNC) coprocessor. NERSC has been using this cluster as part of its "Application Readiness" effort aimed at learning how to assist users migrating to energy-efficient, highly parallel architectures.

NERSC users can use Babbage (and Edison) to help prepare applications for Cori. Users are encouraged to use Babbage in "native" mode, in which both the operating system and user applications run on the Phi; Cori will run only in this "self-hosted" mode. Keep in mind that MPI performance across nodes (and Phi cards) on Babbage is not optimal. We recommend using Babbage to improve single-node performance of your code, especially vectorization and thread scalability. More background information about manycore processors, programming model changes, and what this means for Cori can be found on the Cori web page.

NERSC users can apply for a Babbage account via the Request for Babbage Account Form.

Support

Babbage is a testbed system and is not considered a NERSC production system. Therefore, policies regarding user access and staff support differ substantially from other NERSC systems. Although every effort will be made to provide quality service, the system can become unavailable on short notice, and problem fixes and software support may not be as prompt or as comprehensive as on NERSC production systems.

Configuration

There is one login node and 45 compute nodes named "bcxxxx". Each compute node contains two Intel Xeon "host" processors and two MIC cards. Each Xeon processor (Sandy Bridge EP) has 8 cores with 1 hardware thread per core (Hyper-Threading is not currently enabled), and each compute node has 128 GB of memory.

The MIC card hostnames are "bcxxxx-mic0" and "bcxxxx-mic1", and they can be accessed directly from other nodes or other MIC cards. Each MIC card contains 60 cores, with 4 hardware threads per core and 8 GB of memory per card. However, read the information below, because not all cores/threads should be used for running your program.

Note: Please do not ssh to the compute nodes or MIC cards directly, to avoid interfering with other users' jobs running on them. You should gain access to compute nodes exclusively via a batch script or an interactive batch session.

Access

% ssh babbage.nersc.gov

Once you are logged in you may notice that the name of the login node is "bint01."

Programming Environment

The default login shell is currently csh.  You can change your default login shell through the NERSC NIM web-based interface.

The default loaded modules are as follows:

% module list
Currently Loaded Modulefiles:
1) modules 2) nsg/1.2.0 3) torque/4.2.1 4) moab/7.1.3 5) intel/14.0.0 6) impi/4.1.1

Note that the set of default modules will likely change as the system matures.

Filesystems

All NERSC global file systems are mounted (except /projectb).

Programming

Running in native mode, where all code runs directly on the MIC card, is recommended. To do this, compile with the "-mmic" flag.

Compilers:

Intel compilers are available and loaded by default on Babbage.  Once you have logged in to the login node:

    • Use ifort, icc, or icpc to compile serial Fortran, C, or C++ codes.
    • Use mpiifort, mpiicc, or mpiicpc to compile MPI parallel Fortran, C, or C++ codes. 
    • Use the “-openmp” flag for OpenMP codes. 
    • Repeat: use the "-mmic" flag to compile for running natively on the MIC cards.
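For example, the same source code can be built for the host and for native execution on the MIC cards (the source and executable names here are placeholders):

% icc -mmic -o mycode.mic mycode.c                  # serial C code, runs natively on a MIC card
% mpiifort -mmic -openmp -o myexe.mic mycode.f90    # hybrid MPI/OpenMP Fortran code for the MIC cards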

Other useful compiler options:

    • -align array64byte
      Fortran only. Align all static array data to 64-byte memory address boundaries to ensure the array data can be loaded from memory to cache optimally.
    • -fcode-asm -S or -fsource-asm -S
      Two ways of generating source-annotated assembler
    • -vec_report6
      Generate vectorization reports indicating loops that are successfully or not successfully vectorized, plus additional information about any proven or assumed vector dependencies
    • -openmp-report1
      Generate OpenMP parallelization reports indicating loops, regions, and sections successfully parallelized
    • -guide-vec=2
      Generate messages suggesting ways to improve vectorization
    • -free
      Specify Fortran source files are in free format.
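As an illustration, several of these options can be combined in a single native-mode build (file names are placeholders):

% mpiifort -mmic -openmp -O3 -align array64byte -vec_report6 -o myexe.mic mycode.f90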

Running Jobs

The MPI library on Babbage is Intel MPI (loaded by default) and there are several new environment variables and command-line options that are useful or necessary.

To run interactively:

bint01% qsub -I -V -l nodes=2 -l walltime=00:30:00

Below is a sample batch script to run a hybrid MPI/OpenMP job across the 4 MIC cards belonging to the 2 compute nodes allocated to the batch job. There are a total of 4 MPI tasks, 1 MPI task per MIC card (-ppn 1), and 30 OpenMP threads per MPI task. Use the "qsub" command to submit the batch script.

#PBS -q regular
#PBS -l nodes=2
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced
get_micfile
mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./myexe.mic

Sometimes the full path to your executable is needed; otherwise you may see a "no such file or directory" error. In the above script, "get_micfile" is a NERSC utility that mimics $PBS_NODEFILE but creates the file "micfile.$PBS_JOBID" in your submit directory. Assuming the two compute nodes allocated to your batch job are bc1013 and bc1012, this file would contain:

bc1013-mic0
bc1013-mic1
bc1012-mic0
bc1012-mic1

Alternatively, you can add "mics=2" to the node request line and use $PBS_MICFILE as the nodefile listing the two MIC cards on each node, as in the sample batch script below:

#PBS -q regular
#PBS -l nodes=2:mics=2
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced
mpirun.mic -n 4 -hostfile $PBS_MICFILE -ppn 1 ./myexe.mic

You can also create a custom hostfile and use "-hostfile my_custom_hostfile" in the mpirun.mic command, or spell out the MIC cards explicitly, as below:

mpirun.mic -n 2 -hosts bc1013-mic0,bc1012-mic1 -ppn 1 ./myexe.mic

Environment variables can also be passed on the command line, as shown below:

mpirun.mic -n 16 -hosts bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic

The "-genvall" argument to the mpirun.mic command is recommended to export all environment variables to all processes. 

Thread Affinity Options

KMP_AFFINITY

The following choices can be used for the KMP_AFFINITY environment variable to set OpenMP thread affinity.

    • none
      This is the default setting on the compute hosts.
    • compact
      Binds threads as close to each other as possible; this is the default setting on the MIC cards.
    • scatter
      Binds threads as far apart from each other as possible.
    • balanced
      This option applies only to MIC cards. It first scatters threads across cores so that each core has at least one thread, then assigns consecutive thread numbers to the hardware threads of the same core.
    • explicit
      Use explicit bindings.

Below is an example of setting KMP_AFFINITY to various options to place 6 OpenMP threads on one MIC card. For simplicity of illustration, assume each MIC card has only 3 cores instead of 60.

1) KMP_AFFINITY=compact

            Core 1               Core 2               Core 3
            HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4
Thread      0    1    2    3     4    5

2) KMP_AFFINITY=scatter

            Core 1               Core 2               Core 3
            HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4
Thread      0    3                1    4                2    5

3) KMP_AFFINITY=balanced

            Core 1               Core 2               Core 3
            HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4
Thread      0    1                2    3                4    5

KMP_PLACE_THREADS

This is a new environment variable available only for the MIC cards.  It does not replace KMP_AFFINITY, but works with it to set exact but still generic thread placement.

The format of the setting is <n>Cx<m>T,<o>O, which means to use <n> cores and <m> threads per core, with an offset of <o> cores.  Example:

setenv KMP_PLACE_THREADS 40Cx3T,1O

which means to use 40 physical cores with 3 threads per core, offset by one core (i.e., starting from the second core).

NOTE: the operating system (OS) always runs on logical proc 0, which lives on physical core 59 on Babbage. The OS procs on core 59 are logical procs 0, 237, 238, and 239. Please avoid using proc 0; i.e., use at most 236 threads (max_threads=236) on Babbage.
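One way to respect this restriction (an illustrative sketch using the bash/export syntax of the batch scripts above, not an official recipe) is to confine the OpenMP threads to the first 59 physical cores, which correspond to logical procs 1-236:

export KMP_PLACE_THREADS=59Cx4T
export OMP_NUM_THREADS=236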

Details of OpenMP Thread Affinity Control can be found here.

Process and Thread Affinity Options

The environment variable I_MPI_PIN_DOMAIN can be used to set MPI process affinity; thread affinity then works within each process domain. The value of I_MPI_PIN_DOMAIN can be set in 3 different formats.

    • Multi-core shape
      I_MPI_PIN_DOMAIN=<mc-shape>, where <mc-shape> can be core, socket, node, cache1, cache2, cache3, cache
    • Explicit shape
      I_MPI_PIN_DOMAIN=<size>:<layout>, where <size> can be omp, auto, or an explicit value from 1 to 240 (the maximum number of logical cores); auto is the default. <layout> can be platform, compact, or scatter; scatter is the default.
    • Explicit domain mask
      I_MPI_PIN_DOMAIN=<masklist>, where <masklist> is a list of mask values. 
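A minimal sketch combining process and thread affinity, reusing the micfile from the earlier example (the task and thread counts and executable name are illustrative):

export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced
export I_MPI_PIN_DOMAIN=omp
mpirun.mic -genvall -n 16 -hostfile micfile.$PBS_JOBID -ppn 4 ./myexe.mic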

More details on setting I_MPI_PIN_DOMAIN can be found in the Intel MPI reference documentation.

Notice that core numbering is different on the MIC cards than on the host nodes. On the host nodes, core numbering starts from 0, while on the MIC cards it starts from 1, since core 0 is reserved for the operating system. The multi-core shape and explicit shape schemes listed above account for this automatically.

Running Jobs on the Host 

To run on the Xeon processors on the host, do the following (a sample host batch script follows this list):

    • To compile, do not use the "-mmic" flag.  
    • Use "get_hostfile" instead of "get_micfile" in your batch script, and use "-hostfile hostfile.$PBS_JOBID"  in the mpirun command.  You can also create a custom hostfile with lines such as "bc1013-ib." 
    • Use "mpirun" instead of "mpirun.mic" in the job launch line.
    • When running on the Babbage host processors, the maximum number of MPI tasks times OpenMP threads per compute node is currently 16, because Hyper-Threading is not enabled; this compares with a maximum of 240 on each MIC card.
    • The affinity option "balanced" does not exist on the host node. Use other options instead.
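Putting these together, a minimal host-side batch script might look like the following sketch (the executable name and the task/thread counts are illustrative):

#PBS -q regular
#PBS -l nodes=2
#PBS -l walltime=02:00:00
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
get_hostfile
mpirun -n 4 -hostfile hostfile.$PBS_JOBID -ppn 2 ./myexe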

Batch Queue Structures

Submit Queue   Execution Queue   Nodes   Max Wallclock   Relative Priority   Run Limit   Queued Limit
debug          debug             1-4     30 mins         1                   2           2
regular        reg_small         1-11    12 hrs          2                   4           4
regular        reg_med           12-23   6 hrs           2                   2           2
regular        reg_big           24-45   2 hrs           2                   1           1

Batch jobs are managed via Torque/Moab. The usual Torque commands such as "qstat" and "qdel" can be used to monitor and manage batch jobs. Moab commands such as "showq" and "showbf" are also quite useful.
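For example, to submit a job and then monitor it (the script name is a placeholder):

% qsub myjob.pbs
% qstat -u $USER
% showq
% showbf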

Programming and Running Tips

  • When building software and libraries for the MIC cards using autoconf/configure scripts, sometimes a test program needs to be run. Since the build is a cross-compile from the login node (or the host compute nodes), the binaries generated by such test programs are built for the MIC cards and will fail to run during the configure process (binaries are not compatible between the host nodes and the MIC cards).

    For such a configure step to succeed and produce a Makefile that can be used to build the intended software or libraries for the MIC cards, we suggest two workarounds:

    The first option is to pass "--host=x86_64-unknown-linux-gnu" to configure so that many test programs are skipped. If this fails, another trick is to add "-DMIC" to the compiler variables such as CC, CXX, and FC used by configure (for example, export CC="icc -DMIC"), then replace all occurrences of "-DMIC" in the generated Makefiles with "-mmic" before compiling and building:

             files=$(find ./* -name Makefile)
             perl -p -i -e 's/-DMIC/-mmic/g' $files

    or: 

             find . -name Makefile | xargs sed -i 's/-DMIC/-mmic/g'

  • Intel MPI dynamically selects the most appropriate network fabric for communication. Intra-node communication uses "shm" (shared memory), while inter-node communication uses tcp, or dapl and ofa based on InfiniBand. Set the environment variable I_MPI_FABRICS to "<intra-node fabric>:<inter-node fabric>" at run time to specify the network fabrics. The default is "shm:dapl". The available I_MPI_FABRICS choices on Babbage are "shm:dapl", "shm:ofa", and "shm:tcp". Try different fabrics with your application to choose the one that helps performance the most (see the example after the compiler options below). The MPI fabrics used are displayed if the environment variable I_MPI_DEBUG is set to 2 or higher.
  • Consider using the following compiler options for performance tuning:
    MIC compiler options:

    • -opt-assume-safe-padding
      When this option is specified, the compiler assumes that variables and dynamically allocated memory are padded past the end of the object, so code may access up to 64 bytes beyond what is specified in your program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.
    • -opt-streaming-stores always
      Enables generation of streaming stores for optimization. This helps especially when the application is memory bound: the original contents of a cache line are not read from memory when its whole contents are about to be overwritten.
    • -opt-streaming-cache-evict=0
      Turns off all cache line evicts.
    • -mP2OPT_hlo_pref_use_outer_strategy=F
      -mP2OPT is an internal developer switch.

    Precision compiler options (use with extreme caution and make sure results are still correct):

    • -fimf_precision=low
      Lets you specify a level of accuracy (precision) that the compiler should use when determining which math library functions to use. Low is equivalent to accuracy-bits = 11 for single-precision functions and accuracy-bits = 26 for double-precision functions.
    • -fimf-domain-exclusion=15
      Indicates the input argument domains on which math functions must provide correct results. The value 15 specifies extremes, NaNs, infinities, and denormals.
    • -fp-model fast=1
      The compiler uses more aggressive optimizations on floating-point calculations.
    • -no-prec-div
      The compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation. This gives slightly less precise results than full IEEE division.
    • -no-prec-sqrt
      The compiler uses a faster but less precise implementation of the square root calculation.
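As mentioned above, a quick way to experiment with a different MPI fabric is shown in this sketch (the executable name is a placeholder):

export I_MPI_FABRICS=shm:tcp
export I_MPI_DEBUG=2
mpirun.mic -genvall -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./myexe.mic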

Performance

Users should explore single-node performance of their codes on Babbage in order to prepare applications for the Cori (NERSC-8) architecture (Intel KNL). Fully utilizing vectorization and thread scalability on the Babbage KNC cards is especially important.

Using Intel Trace Analyzer and Collector

Intel Trace Analyzer and Collector (ITAC) is a tool that helps you understand MPI application behavior, quickly find bottlenecks, and achieve high performance in parallel applications.

To use ITAC on Babbage, simply load the module:

% module load itac

Then compile with the flag "-trace" (linking the VT library to trace the entrance of each MPI call) or "-tcollect" (for full tracing). At run time, add the "-trace" flag to the mpirun.mic command. A "*.stf" file will be generated, which can be opened in a GUI via the "traceanalyzer" command. ITAC can be used both on the host and on the MIC cards.

Below is a sample session running on a MIC card:

-bash-4.1$ module load itac
-bash-4.1$ qsub -I -V -lnodes=1
qsub: waiting for job 4891.bint01.nersc.gov to start
qsub: job 4891.bint01.nersc.gov ready
-bash-4.1$ cd MIC/test_codes
-bash-4.1$ uname -a         
Linux bc1005 2.6.32-279.22.1.el6.nersc3_r33_0.x86_64 #1 SMP Sat May 18 20:01:09 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux

-bash-4.1$ mpiifort -mmic -trace -o mpi-hello.mic mpi-hello.f
-bash-4.1$ mpirun.mic -host bc1005-mic0 -np 4 -ppn 4 -trace  ./mpi-hello.mic
myCPU is 2 of 4
myCPU is 1 of 4
myCPU is 3 of 4
myCPU is 0 of 4
[0] Intel(R) Trace Collector INFO: Writing tracefile mpi-hello.mic.stf in /global/u1/y/yunhe/MIC/test_codes
-bash-4.1$ traceanalyzer mpi-hello.mic.stf

Using Intel VTune

Intel VTune Amplifier is a performance analysis and optimization tool.  To use VTune on Babbage, simply load the module:

% module load vtune

Below is a sample session running on a MIC card with VTune command line:

-bash-4.1$ qsub -I -V -l nodes=1 -lwalltime=30:00
qsub: waiting for job 23132.bint01.nersc.gov to start
qsub: job 23132.bint01.nersc.gov ready

-bash-4.1$ echo $DISPLAY
bint01:12.0
-bash-4.1$ cd $PBS_O_WORKDIR
-bash-4.1$ mpiifort -mmic -openmp -O3 -o jacobi_mpiomp.mic jacobi_mpiomp.f90
-bash-4.1$ get_micfile
-bash-4.1$ export OMP_NUM_THREADS=4
-bash-4.1$ module load vtune

-bash-4.1$ amplxe-cl -collect knc-hotspots -app-working-dir /global/homes/y/yunhe/MIC/test_codes -- mpirun.mic -n 8 -ppn 2 -hostfile micfile.$PBS_JOBID /global/homes/y/yunhe/MIC/test_codes/jacobi_mpiomp.mic
 

You can also use "-collect knc-general-exploration" or "-collect knc-bandwidth", or add additional flags to the above command line; the "Command Line" button in the GUI can show the equivalent command line. See the "amplxe-cl" man page for more command line options.

The VTune GUI can be started as:

-bash-4.1$ amplxe-gui

from within an interactive batch session as above. The GUI can be used to open performance data collected from a command line session, or to start a new analysis directly. See our VTune web page for more information.

Using Intel Advisor

The goal of Intel Advisor is to help you find sections of your application where parallelism can be added to give the best performance gains and scalability while maintaining correct results.

To use Intel Advisor on Babbage:

% module load advisor

Then use the "advixe-gui" graphical interface or the "advixe-cl" command line interface to launch it.
Basic usage: advixe-cl <-action> [-option] [[--] application [application options]]    
See "man advixe-cl" for more information.

Using Intel Inspector

Intel Inspector is a tool to help detect memory and threading problems. To use Intel Inspector on Babbage:

% module load inspector

Then use the "inspxe-gui" graphical interface or the "inspxe-cl" command line interface to launch it.
Basic usage: inspxe-cl <-action> [-action-option] [-global-option] [[--] target [target options]]    
See "man inspxe-cl" for more information.

Using Intel MKL

The Intel Math Kernel Library (MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, FFTs, and vector math. The MKL path and the environment variable $MKLROOT are defined as part of the default loaded "intel" module.

To compile for the host nodes:

% mpiicc -mkl -o test_mkl test_mkl.c

To compile for the MIC cards:

% mpiicc -mmic -mkl -o test_mkl.mic test_mkl.c
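To run the resulting native binary on a MIC card, launch it with mpirun.mic as in the earlier examples; the number of MKL threads can be controlled with MKL_NUM_THREADS (the thread count and host name below are illustrative):

% export MKL_NUM_THREADS=60
% mpirun.mic -genvall -n 1 -hosts bc1013-mic0 ./test_mkl.mic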

Using Intel TBB

Intel Threading Building Blocks (TBB) is a C++ template library for writing software programs that take advantage of multi-core processors.  TBB is a part of the Intel Composer package and is available via the default loaded "intel" module.

To compile for the host nodes (use "mpiicpc" for the hybrid MPI/TBB codes):

% icpc -ltbb -o test_tbb test_tbb.cpp

To compile for the MIC cards (use "mpiicpc" for hybrid MPI/TBB codes):

% icpc -mmic -ltbb -lpthread -o test_tbb.mic test_tbb.cpp

Using Intel SDE

Intel SDE (Software Development Emulator) is a tool for emulating instruction sets of processors such as KNL, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. To use Intel SDE on Babbage:

% module load sde

Then use "sde64 -help" or "sde64 -help-long" for usage help.
Basic usage: sde64 [args] -- application [application-args]
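For example, to emulate KNL instructions on the host (assuming your SDE version supports the "-knl" chip option; the executable name is a placeholder):

% sde64 -knl -- ./myexe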

Mini Tutorial on Babbage

A mini tutorial introducing Babbage and describing how it can be used to prepare for Cori was presented at the NUG monthly meeting in July 2014. Slides can be found here.

Intel Tools Training Slides, Oct 23, 2013

Further Information at Intel Developer Zone for MIC