
Example Batch Scripts for KNL

Introduction 

This page provides some example batch scripts for running on Cori's KNL nodes.

The KNL architecture is complex but affords a great deal of flexibility for experienced users. More details of the KNL processor's memory modes and NUMA domains can be found here. We recommend that users try NERSC's default memory mode (quad,cache) and get things working before experimenting with other modes.
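If you are unsure which memory mode the nodes in your allocation are booted into, one quick check from within the job is to list the NUMA domains. This is a rough sketch; the exact numactl output depends on the configured mode:

% srun -n 1 numactl -H   # quad,cache shows a single NUMA domain; quad,flat shows an extra ~16 GB MCDRAM domain (node 1)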

General Recommendations and FAQ for Running Jobs

Running on KNL nodes is very similar to running on Haswell nodes. The primary differences are that KNL nodes have 68 physical cores per node, each with 4 logical cores (hardware threads), and that multiple memory modes are available. Please make sure to read these recommendations and the Running Jobs FAQ first, especially regarding the use of the srun "-c" and "--cpu_bind=cores" options.
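As a quick sketch of the "-c" arithmetic described there (the ranks-per-node and node-count values below are illustrative, not taken from this page), the number of logical CPUs to give each MPI task is floor(68/ranks_per_node)*4 when running at most 68 tasks per node:

ranks_per_node=16                        # illustrative value
c=$(( (68 / ranks_per_node) * 4 ))       # bash integer division gives the floor
echo "srun -n $(( 2 * ranks_per_node )) -c $c --cpu_bind=cores ./mycode.exe"   # assuming 2 nodes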

Job Script Generator

An interactive Job Script Generator is available at MyNERSC to provide guidance on optimal process and thread binding for running hybrid MPI/OpenMP programs on Edison, Cori Haswell, and Cori KNL. Select your machine, number of nodes, and other options, and a skeleton script will be generated for you. Please note that you will still need to add some additional information, such as the walltime and the name of your executable, to complete the script.

Checking Process and Thread Affinity

Pre-built binaries of a small test code (pure MPI or hybrid MPI/OpenMP) can be used to check process and thread affinity. Please see details here.
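If you just want a quick look at the binding chosen for your own executable, srun can also print it directly. A minimal sketch, with task and CPU counts that are illustrative only:

% srun -n 16 -c 16 --cpu_bind=verbose,cores ./mycode.exe   # the "verbose" keyword makes srun report the CPU mask assigned to each task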

Example Pure MPI Batch Scripts

The simplest case is running the same number of MPI tasks as there are physical cores on each node. The example below uses 2 KNL nodes in quad cache mode with 68 MPI ranks per node, for a total of 136 MPI tasks.

#!/bin/bash -l
#SBATCH -N 2
#SBATCH -p debug
#SBATCH -C knl,quad,cache
#SBATCH -t 30:00
#SBATCH -L SCRATCH

export OMP_NUM_THREADS=1 # only needed for hybrid MPI/OpenMP codes built with "-qopenmp" flag

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

srun -n 136 ./mycode.exe

The next example uses 2 KNL nodes in quad cache mode, with 32 MPI ranks per node, and sets aside 4 cores per node for core specialization ("-S 4"). This is not fully packed MPI (the number of MPI ranks per node is not a divisor of 68), so the following two options are needed on the srun command line: "-c" and "--cpu_bind=cores". The value following the "-c" option sets the number of logical cores to allocate per MPI task. For this example it should be set to floor(68/32)*4=8, which allocates 8 logical cores per MPI task. Since there are 4 logical cores per physical core, this binds each MPI task to 2 physical cores.

#!/bin/bash -l
#SBATCH -N 2
#SBATCH -p debug
#SBATCH -C knl,quad,cache
#SBATCH -S 4
#SBATCH -t 30:00
#SBATCH -L project

export OMP_NUM_THREADS=1 # only needed for hybrid MPI/OpenMP codes built with "-qopenmp" flag

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

srun -n 64 -c 8 --cpu_bind=cores ./mycode.exe

Example MPI/OpenMP Batch Scripts

The example below uses 2 KNL quad flat nodes, 32 MPI ranks per node, and 2 OpenMP threads per MPI task. Four cores per node are set aside for core specialization. MCDRAM (NUMA domain 1) is set as the preferred memory. The node is not fully packed (the number of MPI ranks per node is not a divisor of 68), so both "-c" and "--cpu_bind=cores" are needed. The "-c" value is set to floor(68/32)*4=8, meaning 8 logical cores (CPUs) are allocated per MPI task, so each MPI task binds to 2 physical cores.

#!/bin/bash -l
#SBATCH -N 2
#SBATCH -p regular
#SBATCH -C knl,quad,flat
#SBATCH -S 4 # use core specialization
#SBATCH -t 3:00:00
#SBATCH -L SCRATCH,project

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=true #"spread" is also good for Intel and CCE compilers
export OMP_PLACES=threads

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

srun -n 64 -c 8 --cpu_bind=cores numactl -p 1 ./mycode.exe

The example below uses 2 KNL snc2 flat nodes, 32 MPI ranks per node, and 2 OpenMP threads per MPI task. Four cores per node are set aside for core specialization. MCDRAM (NUMA domains 2 and 3) is set as the preferred memory. The node is not fully packed (the number of MPI ranks per node is not a divisor of 68), so both "-c" and "--cpu_bind=cores" are needed. The "-c" value is set to floor(68/32)*4=8, meaning 8 logical cores (CPUs) are allocated per MPI task, so each MPI task binds to 2 physical cores.

#!/bin/bash -l
#SBATCH -N 2
#SBATCH -p regular
#SBATCH -C knl,snc2,flat
#SBATCH -S 4  # use core specialization      
#SBATCH -t 3:00:00
#SBATCH -L SCRATCH

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=true #"spread" is also good for Intel and CCE compilers
export OMP_PLACES=threads

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

srun -n 64 -c 8 --cpu_bind=cores numactl -p 2,3 ./mycode.exe

The example below uses 1 KNL quad cache node, 128 MPI ranks per node, and 2 OpenMP threads per MPI task. Four cores are set aside for core specialization. Since there are more MPI tasks than physical cores per node, the "-c" value is set as floor(272/128)=2 (a KNL node has 68*4=272 logical CPUs), meaning 2 logical cores (CPUs) are allocated per MPI task, so two MPI tasks share each physical core. In this case "--cpu_bind=threads" is needed instead of "--cpu_bind=cores".

#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p debug
#SBATCH -C knl,quad,cache
#SBATCH -S 4  # use core specialization      
#SBATCH -t 30:00
#SBATCH -L SCRATCH,project

export OMP_NUM_THREADS=2

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

srun -n 128 -c 2 --cpu_bind=threads ./mycode.exe

Memory Binding Options for MCDRAM

For cache mode, there is no special need to specify use of MCDRAM. For flat mode, bind memory to the corresponding NUMA domains to use MCDRAM. "numactl -m ..." (forced) or "numactl -p ..." (preferred) can be used for memory binding. SLURM also has a "--mem_bind" option (forced binding only; preferred binding coming soon). Please see the examples below:

# Quad example:
% srun -n 16 -c 16 --cpu_bind=cores numactl -p 1 ./app    # -p 1 is preferred; use -m 1 to enforce
% srun -n 16 -c 16 --cpu_bind=cores --mem_bind=map_mem:1 ./app
# SNC2 example:
% srun -n 16 -c 16 --cpu_bind=cores numactl -p 2,3 ./app  # -p 2,3 is preferred; use -m 2,3 to enforce
% srun -n 16 -c 16 --cpu_bind=cores --mem_bind=map_mem:2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3 ./app

Running Large Jobs (over 1500 MPI tasks) Considerations

Large jobs may take longer to start up, especially on KNL nodes. The srun option --bcast=<destination_path> is recommended for large jobs requesting over 1500 MPI tasks. By default SLURM loads the executable onto the allocated compute nodes from the current working directory, which may take a long time when the file system where the executable resides is slow. With --bcast=/tmp/mycode.exe, the executable will be copied to /tmp/mycode.exe on each compute node before the job starts. Since /tmp is part of the memory on the compute nodes, this can speed up the job startup time.

Please use commands like the following instead of a plain srun command:

% sbcast --compress=lz4 ./mycode.exe /tmp/mycode.exe     # --compress (-C) compresses the executable before broadcasting it
% srun <srun options> numactl <numactl options> /tmp/mycode.exe

# Or, when numactl is not needed, --bcast can be given directly to srun:
% srun --bcast=/tmp/mycode.exe --compress=lz4 <srun options> ./mycode.exe
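
For reference, here is a minimal sketch of how the sbcast approach fits into a complete batch script; the node count, rank count, and time limit are placeholders, not values from this page:

#!/bin/bash -l
#SBATCH -N 100                # placeholder node count
#SBATCH -p regular
#SBATCH -C knl,quad,cache
#SBATCH -t 2:00:00
#SBATCH -L SCRATCH

# Copy the executable to node-local /tmp before launching it from there
sbcast --compress=lz4 ./mycode.exe /tmp/mycode.exe
srun -n 6800 -c 4 --cpu_bind=cores /tmp/mycode.exe   # 68 ranks per node on 100 nodes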

Use Core Specialization (to isolate system overhead to specific cores)

Core specialization is a feature designed to isolate system overhead (system interrupts, etc.) to designated cores on a compute node. It is generally helpful for running on KNL, especially if the application does not plan to use all 68 physical cores on a compute node. Setting aside 2 or 4 cores for core specialization is recommended.

The Slurm flag for core specialization is "-S" or "--core-spec". It only works in a batch script submitted with sbatch. It cannot be requested as a flag with salloc for interactive batch jobs, since salloc is already a wrapper script for srun.
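
A minimal sketch of a batch script using core specialization (the sizes are illustrative): with 4 cores set aside, 64 physical cores remain per node, so 64 MPI ranks per node with "-c 4" fill the remaining cores exactly.

#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -C knl,quad,cache
#SBATCH -S 4                  # reserve 4 cores per node for system tasks
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH

srun -n 64 -c 4 --cpu_bind=cores ./mycode.exe   # one MPI rank per remaining physical core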

 

Using the Burst Buffer

Users can access the Burst Buffer from both KNL and Haswell nodes. The following example script requests a 100GB allocation on the Burst Buffer of type "scratch", which lasts for the duration of the job. It also stages data in and out of the allocation: data specified in the stage_in command is moved onto your Burst Buffer allocation before the start of the compute job, and data specified in stage_out is moved out after the job finishes. Your Burst Buffer allocation can be accessed using the variable $DW_JOB_STRIPED from within the compute job. Full details of the many subtleties involved in accessing the Burst Buffer, along with more example scripts and use cases, are given here.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C knl
#SBATCH -t 00:05:00
#DW jobdw capacity=100GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file
#DW stage_out source=$DW_JOB_STRIPED/directoryname destination=/global/cscratch1/sd/username/path/to/directoryname type=directory
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE

More Advanced Options

Example batch scripts with advanced options that apply to both Haswell and KNL nodes can be found here. These include running xfer jobs, using the realtime partition, requesting specific nodes, running MPMD jobs, running CCM jobs, running binaries built with Intel MPI or OpenMPI, and running multiple parallel jobs that share nodes.