
Running Scripts

Run serial Python scripts on a login node, or on a compute node in an interactive session (started via salloc) or batch job (submitted via sbatch), as you normally would in any Unix-like environment. On login nodes, please be mindful of resource consumption, since those nodes are shared by many users at the same time.

Parallel Python scripts, such as those using MPI via the mpi4py module, must be launched with srun in an interactive (salloc) session or batch job (sbatch):

srun -n 64 python ./hello-world.py

Matplotlib on Compute Nodes

Using Matplotlib to plot interactively on the login nodes is easy, especially if you use NX. But if you are running a Python script on compute nodes that imports Matplotlib, even if it doesn't make any plot files, it is important to specify an appropriate "backend." There are a few ways to do this; one is simply to tell Matplotlib to use a particular backend in your script, as below:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

The "Agg" backend is guaranteed to be available, but there are other choices. If a backend is not specified in some way, then Matplotlib will seek out an X11 connection on the compute nodes in your job, and the result is that your job may simply hang until the wall-clock limit is reached. More technical details are available in the Matplotlib FAQ, "What is a Backend?" and the matplotlib.use API documentation.
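For example, a short script along the following lines (make-plot.py is a hypothetical name) renders a plot straight to a PNG file on a compute node, with no X11 connection needed:

% cat make-plot.py
#!/usr/bin/env python
import matplotlib
matplotlib.use("Agg")            # must come before pyplot is imported
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.savefig("sine.png")          # writes a file; no display window is opened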

Parallelism in Python

Many scientists have come to appreciate Python's power for developing scientific computing applications. Creating such applications that scale in modern high-performance computing environments can be a challenge. There are a number of approaches to parallel processing in Python. Here we describe approaches that we know work for users at NERSC. For advice on scaling up Python applications, see this page.

MPI for Python (mpi4py)

mpi4py provides bindings of the MPI standard for the Python programming language. Documentation on mpi4py is available here, and a useful collection of example scripts can be found here. An example of using mpi4py on a Cori compute node (using Anaconda Python 2.7) is shown below.

% cat mympi.py
#!/usr/bin/env python
from mpi4py import MPI
me = MPI.COMM_WORLD.Get_rank()
nproc = MPI.COMM_WORLD.Get_size()
print me, nproc

% cat runit
#!/bin/bash
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -n 32
#SBATCH -t 00:05:00
#SBATCH -p debug

module load python
srun -n 32 python ./mympi.py

% sbatch runit
Submitted batch job 929783
% cat slurm-929783.out
...
9 32
14 32
3 32
...
0 32
...

Python's Multiprocessing Module

Python's standard library provides a multiprocessing package that supports spawning of processes. This can be used to achieve some level of parallelism within a single compute node. It cannot be used to achieve parallelism across compute nodes. For that, users are referred to the discussion of mpi4py above.

If you are using the multiprocessing module, it is advised that you tell srun to reserve all the threads available on the node for your task with the "-c" argument. For example, on Cori use:

srun -n 1 -c 64 python script-using-multiprocessing.py 
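Here script-using-multiprocessing.py is just a placeholder name; as a minimal sketch, such a script might use a process pool to spread independent tasks over the threads made available by "-c":

% cat script-using-multiprocessing.py
#!/usr/bin/env python
import multiprocessing

def work(x):
    # stand-in for an independent, CPU-bound task
    return x * x

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=64)  # match the -c value passed to srun
    results = pool.map(work, range(1024))      # tasks are distributed across the workers
    pool.close()
    pool.join()
    print(results[:8])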

NOTE: Python multiprocessing achieves process-level parallelism through fork(). By default you can only expect multiprocessing to do a "pretty good" job of load-balancing tasks. For more fine-grained control of parallelism within a node, consider parallelism via Cython or writing C/C++/Fortran extensions that take advantage of OpenMP or threads.

FURTHER NOTE: Staff at various other centers go so far as to strongly recommend against using multiprocessing at all in an HPC context: the affinity of forked processes is problematic, Python multiprocessing's shared-memory model interacts poorly with many MPI implementations, threaded libraries, and other libraries that use shared memory, and debuggers and performance tools have trouble following forked processes. We suppose that it can have limited application in specific cases, provided users are informed of the issues.

Multiprocessing Interaction with OpenMP

If your multiprocessing code makes calls to a threaded library like numpy with threaded MKL support, then you need to consider oversubscription of threads. While process affinity can be controlled to some degree in certain contexts (e.g. Python distributions that implement os.sched_{get,set}affinity), it is generally easier to reduce the number of threads used by each process; in fact, it is most advisable to use a single thread per process. In particular, for OpenMP:

export OMP_NUM_THREADS=1

Furthermore, if you use Python multiprocessing on KNL, you are advised to specify

export KMP_AFFINITY=disabled

as explained here.
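Putting these settings together, a batch script for a multiprocessing job on a Cori KNL node might look like the following sketch (the script name is a placeholder, and "-c 272" assumes you want all hardware threads on the node):

#!/bin/bash
#SBATCH -N 1
#SBATCH -C knl
#SBATCH -t 00:05:00
#SBATCH -p debug

export OMP_NUM_THREADS=1       # one thread per forked worker; no oversubscription
export KMP_AFFINITY=disabled   # keep the Intel OpenMP runtime from pinning forked workers

module load python
srun -n 1 -c 272 python script-using-multiprocessing.py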

Issues Combining Multiprocessing and MPI

Users have been able to combine Python multiprocessing and mpi4py to achieve hybrid parallelism on NERSC systems, but not without issues. If you decide to try to combine mpi4py and Python multiprocessing, be advised that on the NERSC Cray systems (Cray MPICH) one must set the following environment variable:

export MPICH_GNI_FORK_MODE=FULLCOPY

See the "mpi_intro" man-page for details. Again we advise that combining Python multiprocessing and mpi4py qualifies as a "hack" that may work for developers in the short term. Users are strongly encouraged to consider alternatives.
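For readers who still want to experiment, the following heavily simplified sketch (all names are hypothetical) shows the shape of the hybrid pattern, with each MPI rank forking its own pool of workers:

% cat hybrid.py
#!/usr/bin/env python
# Run with MPICH_GNI_FORK_MODE=FULLCOPY set, e.g.: srun -n 2 python ./hybrid.py
from mpi4py import MPI
import multiprocessing

def work(x):
    # stand-in for an independent, CPU-bound task
    return x * x

if __name__ == "__main__":
    me = MPI.COMM_WORLD.Get_rank()
    pool = multiprocessing.Pool(processes=8)  # workers are forked after MPI_Init
    results = pool.map(work, range(100))
    pool.close()
    pool.join()
    print("%d %d" % (me, sum(results)))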