
Additional Programming Models


While we often provide quick-start documentation for compiling and running applications using MPI and OpenMP on NERSC systems, the same does not always exist for other supported parallel programming models such as UPC or Chapel.

At the same time, we know that these alternative programming models may play a valuable role in enabling application readiness for next-generation architectures. To help users get started with applications written in these models, this page gives a brief synopsis of the steps involved for each of them on Cori.

Although the list of programming models covered here is not exhaustive, we expect it to grow with time based on our interactions with NERSC users.

Note that although all of the examples below are executed under interactive salloc sessions, the compilation procedure and application launch command would be the same for a batch submission.
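As a rough illustration of that equivalence, the sketch below prints a minimal batch script that carries the same options as an `salloc -N 2 -t 10:00 -p debug` session. The helper name `make_batch_script` and the launch command `srun -n 4 ./app.x` are hypothetical placeholders, not part of any NERSC tooling; substitute the launch line from the relevant section.

```shell
#!/bin/sh
# Print a minimal Slurm batch script to stdout.
# Arguments: <nodes> <time limit> <partition> <launch command...>
# (make_batch_script is a hypothetical helper used only for illustration.)
make_batch_script() {
    nodes=$1; time_limit=$2; partition=$3
    shift 3
    printf '#!/bin/bash\n'
    printf '#SBATCH -N %s\n' "$nodes"
    printf '#SBATCH -t %s\n' "$time_limit"
    printf '#SBATCH -p %s\n' "$partition"
    # The remaining arguments form the launch line, unchanged from the
    # interactive case.
    printf '\n%s\n' "$*"
}

# Batch counterpart of "salloc -N 2 -t 10:00 -p debug" plus a launch command.
make_batch_script 2 10:00 debug srun -n 4 ./app.x
```

Redirecting the output to a file and passing that file to sbatch would submit it; any module loads or environment variables needed at run time would go before the launch line.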

Unified Parallel C

Unified Parallel C (UPC) is supported on Cori through two different implementations: Berkeley UPC and Cray UPC.

Berkeley UPC

Berkeley UPC (BUPC) provides a portable UPC programming environment consisting of a source translation front-end (which in turn relies on a user-supplied C compiler underneath) and a runtime library based on GASNet. The latter is able to take advantage of advanced communications functionality of the Cray Aries interconnect on Cori, such as remote direct memory access (RDMA).

BUPC is available via the bupc module on Cori, which provides both the upcc compiler wrapper and the upcrun launcher wrapper (which correctly initializes the environment and calls srun). Further, BUPC can use any of the three programming environments on Cori (Intel, GNU, and Cray) as the underlying C compiler.

There are a number of environment variables that affect the execution environment of your UPC application compiled with BUPC, all of which can be found in the BUPC documentation. One of the most important is UPC_SHARED_HEAP_SIZE, which controls the size of the shared symmetric heap used to service shared memory allocations. If you encounter errors at application launch related to memory allocation, you will likely want to start by adjusting this variable.
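A minimal sketch of adjusting that variable follows. It assumes that UPC_SHARED_HEAP_SIZE accepts a size with a unit suffix such as MB and that the value applies per UPC thread; please confirm both against the BUPC documentation. The helper name `heap_per_thread_mb` and the numbers used are purely illustrative.

```shell
#!/bin/sh
# Split a per-node memory budget evenly across the UPC threads on that node.
# Arguments: <budget in MB per node> <UPC threads per node>
# (heap_per_thread_mb is a hypothetical helper, not part of BUPC.)
heap_per_thread_mb() {
    echo $(( $1 / $2 ))
}

# Hypothetical numbers: a 4096 MB budget shared by 16 threads per node.
# Assumption: the variable takes a size with an MB suffix.
UPC_SHARED_HEAP_SIZE="$(heap_per_thread_mb 4096 16)MB"
export UPC_SHARED_HEAP_SIZE
echo "UPC_SHARED_HEAP_SIZE=$UPC_SHARED_HEAP_SIZE"
```

With the variable exported in the shell that calls upcrun, the next launch picks up the new heap size.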

Compiling and running a simple application with BUPC on Cori is fairly straightforward. First, to compile:

$ module load bupc
## Loaded module 'bupc/2.22.0-5.2.82-intel-2016.0.109' based on current PrgEnv
## and compiler. If you change PrgEnv or compiler modules, then you should
## run 'module switch bupc bupc' to get the correct bupc module.
$ cat cpi.c
/* The ubiquitous cpi program.
   Compute pi using a simple quadrature rule
   in parallel
   Usage: cpi [intervals_per_thread] */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <upc_relaxed.h>

/* Intervals used by each thread when none are given on the command line */
#define INTERVALS_PER_THREAD_DEFAULT 100

/* Add up all the inputs on all the threads.
   When the collective spec becomes finalised this
   will be replaced */
shared double reduce_data[THREADS];
shared double reduce_result;
double myreduce(double myinput)
{
#if defined(__xlC__)
  // Work-around Bug 3228
  *(volatile double *)(&myinput);
#endif
  /* Publish this thread's contribution, then wait for all threads */
  reduce_data[MYTHREAD] = myinput;
  upc_barrier;
  if(MYTHREAD == 0) {
    double result = 0;
    int i;
    for(i=0;i < THREADS;i++) {
      result += reduce_data[i];
    }
    reduce_result = result;
  }
  /* Wait until thread 0 has stored the total before reading it */
  upc_barrier;
  return(reduce_result);
}

/* The function to be integrated */
double f(double x)
{
  double dfour=4;
  double done=1;
  return(dfour/(done + (x*x)));
}

/* Implementation of a simple quadrature rule */
double integrate(double left,double right,int intervals)
{
  int i;
  double sum = 0;
  double h = (right-left)/intervals;
  double hh = h/2;
  /* Use the midpoint rule */
  double midpt = left + hh;
  for(i=0;i < intervals;i++) {
    sum += f(midpt + i*h);
  }
  return(h*sum);
}

int main(int argc,char **argv)
{
  double mystart, myend;
  double myresult;
  double piapprox;
  int intervals_per_thread = INTERVALS_PER_THREAD_DEFAULT;
  double realpi=3.141592653589793238462643;
  /* Get the part of the range that I'm responsible for */
  mystart = (1.0*MYTHREAD)/THREADS;
  myend = (1.0*(MYTHREAD+1))/THREADS;
  if(argc > 1) {
    intervals_per_thread = atoi(argv[1]);
  }
  piapprox = myreduce(integrate(mystart,myend,intervals_per_thread));
  if(MYTHREAD == 0) {
    printf("Approx: %20.17f Error: %23.17f\n",piapprox,fabs(piapprox - realpi));
  }
  return(0);
}
$ upcc cpi.c -o cpi.x

And then run, in this case in an interactive salloc session:

$ salloc -N 2 -t 10:00 -p debug
$ upcrun -n 4 ./cpi.x
UPCR: UPC thread 0 of 4 on nid01901 (pshm node 0 of 2, process 0 of 4, pid=36911)
UPCR: UPC thread 1 of 4 on nid01901 (pshm node 0 of 2, process 1 of 4, pid=36912)
UPCR: UPC thread 2 of 4 on nid01902 (pshm node 1 of 2, process 2 of 4, pid=35611)
UPCR: UPC thread 3 of 4 on nid01902 (pshm node 1 of 2, process 3 of 4, pid=35612)
Approx: 3.14159317442312691 Error: 0.00000052083333379
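If you want to sanity-check the printed approximation without access to a UPC toolchain, the following serial sketch applies the same midpoint rule to 4/(1+x*x) on [0,1] using awk. It assumes 400 intervals in total (4 threads with 100 intervals each), which is an assumption about the run above rather than something stated in the listing.

```shell
#!/bin/sh
# Serial midpoint-rule estimate of pi, for comparison with the UPC output.
# Argument: total number of intervals across all threads.
# (serial_cpi is a hypothetical helper name.)
serial_cpi() {
    awk -v n="$1" 'BEGIN {
        h = 1.0 / n
        s = 0
        for (i = 0; i < n; i++) {
            x = (i + 0.5) * h          # midpoint of interval i
            s += 4 / (1 + x * x)
        }
        printf "%.10f\n", s * h
    }'
}

# Assumed configuration: 4 threads x 100 intervals per thread.
serial_cpi 400
```

The leading digits should agree with the parallel result; differences in the last few digits are expected because the parallel version sums the partial results in a different order.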

Cray UPC

UPC is directly supported under Cray's compiler environment through their PGAS runtime library (providing similar performance-enabling RDMA functionality to GASNet). To enable UPC support in your C code, simply switch to the Cray compiler environment and supply the '-h upc' option when calling cc. 

Because of its dependence on Cray's PGAS runtime, you may find the additional documentation available on the intro_pgas man page valuable. Specifically, two key environment variables introduced there are: 

  • XT_SYMMETRIC_HEAP_SIZE: Limits the size of the symmetric heap used to service shared memory allocations, analogous to BUPC's UPC_SHARED_HEAP_SIZE.
  • PGAS_MEMINFO_DISPLAY: Can be set to '1' in order to enable diagnostic output at launch regarding memory utilization.

There is one further potential issue to be aware of: virtual memory limits in interactive salloc sessions. If you encounter errors on application launch similar to:

PE 0: ERROR: failed to attach XPMEM segment (at or around line 23 in __pgas_runtime_error_checking() from file ...)

then you may need to lift your virtual memory limit by running:

ulimit -v unlimited
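The sketch below wraps that command so it only acts when a limit is actually in place and reports if the limit cannot be raised (for example, because a lower hard limit is set). The function name `lift_vmem_limit` is a hypothetical helper, not a NERSC or Cray utility.

```shell
#!/bin/sh
# Lift the virtual memory limit of the current shell, but only if one is set.
# (lift_vmem_limit is a hypothetical helper used for illustration.)
lift_vmem_limit() {
    current=$(ulimit -v)
    if [ "$current" != "unlimited" ]; then
        # Raising the soft limit fails if the hard limit is lower.
        if ! ulimit -v unlimited 2>/dev/null; then
            echo "could not lift virtual memory limit (currently ${current} kB)"
            return 1
        fi
    fi
    echo "virtual memory limit: $(ulimit -v)"
}

# Typical use inside the salloc session, before calling srun:
#   lift_vmem_limit && srun -n 4 ./cpi.x
```

Because ulimit affects only the current shell and its children, the function has to be called in the same shell that subsequently runs srun.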

With all of this in mind, compiling and running a simple UPC application, analogous to the above example for BUPC but now using the Cray compilers, would look like:

$ module swap PrgEnv-intel PrgEnv-cray
$ cc -h upc cpi.c -o cpi.x
$ salloc -N 2 -t 10:00 -p debug
$ ulimit -v unlimited # may not be necessary
$ srun -n 4 ./cpi.x
Approx:  3.14159317442312691 Error:     0.00000052083333379 

Coarray Fortran

Coarray Fortran (CAF) is also supported on Cori through two different implementations: Cray CAF and Intel CAF.

Cray CAF

Like UPC, Coarray Fortran is directly supported under Cray's compiler environment through their PGAS runtime library. To enable CAF support in your Fortran code, simply switch to the Cray compiler environment and supply the '-h caf' option when calling ftn.

Because of the shared dependence on libpgas, both PGAS_MEMINFO_DISPLAY and XT_SYMMETRIC_HEAP_SIZE remain relevant, as does the guidance supplied in the intro_pgas man page. Further, for the time being, you may again need to lift your virtual memory limit manually when running Cray PGAS applications in an interactive salloc session.

For example, compiling and executing (again, in an salloc session) a simple CAF application under the Cray compilers might look like:

$ module swap PrgEnv-intel PrgEnv-cray
$ cat caf_hello.f90
program Hello_World
  implicit none
  integer :: i
  character(len=20) :: name[*]
  if (this_image() == 1) then
    write(*,'(a)',advance='no') 'Enter your name: '
    read(*,'(a)') name
    do i = 2, num_images()
      name[i] = name
    end do
  end if
  sync all
  write(*,'(3a,i0)') 'Hello ',trim(name),' from image ', this_image()
end program Hello_world
$ ftn -h caf caf_hello.f90 -o caf_hello.x
$ salloc -N 2 -t 10:00 -p debug
$ ulimit -v unlimited # may not be necessary
$ srun -n 4 ./caf_hello.x
Enter your name: Cori
Hello Cori from image 1
Hello Cori from image 2

Intel CAF

Coarray Fortran is also supported under the Intel Fortran compilers, although it does not use a native PGAS support library under the hood. Instead, it uses a portable runtime library based on Intel MPI. Further, Intel-compiled CAF applications hide their MPI-based runtime behind an integrated launcher (i.e. the application can be run directly without explicitly calling mpirun or the like).

To enable CAF support, pass the -coarray argument to the Intel Fortran compiler. You will typically want to use -coarray=distributed in order to support distributed-memory CAF (i.e. across nodes). See the ifort man page for more details.

In addition, although the integrated launcher avoids exposing the user to the implicit dependence on MPI, it can be difficult to configure. Thus, we instead advise you to use the native SLURM launcher (srun) on Cori. This can be done by specifying the '-switch no_launch' argument to ifort. Further, in order to properly enable integration between SLURM and Intel MPI, you also need to set the I_MPI_PMI_LIBRARY environment variable to point to the SLURM PMI library (we also recommend restricting I_MPI_FABRICS to shm and tcp only).
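The two environment settings described above might be collected as in the following sketch. The helper name `setup_intel_caf_env` and the path `/path/to/libpmi.so` are placeholders, and the `shm:tcp` syntax for I_MPI_FABRICS is assumed from Intel MPI releases that still offer a tcp fabric; check the documentation for the installed version.

```shell
#!/bin/sh
# Export the Intel MPI settings needed to launch an Intel CAF binary via srun.
# Argument: full path of the SLURM PMI shared library on your system.
# (setup_intel_caf_env is a hypothetical helper used for illustration.)
setup_intel_caf_env() {
    export I_MPI_PMI_LIBRARY="$1"
    # Assumption: colon-separated fabric list, restricted to shm and tcp.
    export I_MPI_FABRICS=shm:tcp
}

# Placeholder path; replace it with the real location of the PMI library.
setup_intel_caf_env /path/to/libpmi.so
echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY I_MPI_FABRICS=$I_MPI_FABRICS"
```

Both variables need to be set in the shell that calls srun so that they are propagated to the application.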

Taken together, running the same simple CAF application as demonstrated above, but now under Intel Fortran, might look something like:

$ # make sure PrgEnv-intel is loaded
$ module load impi
$ ifort -coarray=distributed -switch no_launch caf_hello.f90 -o caf_hello.x
$ salloc -N 2 -t 10:00 -p debug
$ export I_MPI_PMI_LIBRARY=/opt/slurm/default/lib/
$ srun -n 4 ./caf_hello.x
Enter your name: Cori
Hello Cori from image 1
Hello Cori from image 2


Chapel

Chapel is a parallel programming language that provides both task- and data-parallel programming constructs, as well as a PGAS-like model for accessing shared objects (e.g. arrays). Because it is chiefly developed by Cray, Chapel is directly supported on Cori through the Cray-supplied chapel module.

By default, the Chapel compiler (chpl) creates two binaries:

  • One wrapper / launcher binary, which is given either the default binary name of a.out or that which you supply via the -o compiler option; and
  • The actual binary that results from building your Chapel application, which has the same name as the launcher binary but with _real appended to it.

While the separate wrapper can provide a convenient way to launch your application on certain systems, we recommend launching the "real" application (e.g. a.out_real) directly via SLURM's srun on Cori. Specifically, we recommend explicitly configuring the following when calling srun:

  • the overall number of nodes via --nodes=<num nodes>
  • the overall number of tasks equal to the number of nodes via --ntasks=<num nodes> --ntasks-per-node=1
  • the number of cpus per task equal to the available hardware threads per node via --cpus-per-task=64

In our experience, this combination gives Chapel's own runtime library the most latitude on Cori in terms of thread / task placement and scheduling. Note that in this case, you would also want to launch your executable with the number of locales (Chapel's concept of a locality domain) equal to the number of nodes via the -nl option. More details on locales and related concepts can be found in the official Chapel documentation.
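As a sketch of how these options fit together, the helper below assembles the srun command line from a node count and the name of the "real" binary. The function name `chapel_srun_cmd` is hypothetical, and the default of 64 CPUs per task simply mirrors the value recommended above.

```shell
#!/bin/sh
# Print the recommended srun command for a Chapel "_real" binary.
# Arguments: <nodes> <path to _real binary> [cpus per task, default 64]
# (chapel_srun_cmd is a hypothetical helper used for illustration.)
chapel_srun_cmd() {
    nodes=$1
    real_binary=$2
    cpus=${3:-64}
    # One task per node, and one Chapel locale per node via -nl.
    echo "srun --nodes=$nodes --ntasks=$nodes --ntasks-per-node=1 --cpus-per-task=$cpus $real_binary -nl $nodes"
}

chapel_srun_cmd 2 ./a.out_real
# prints: srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --cpus-per-task=64 ./a.out_real -nl 2
```

Tying the task count and the -nl value to the same node count keeps the number of locales consistent with the allocation.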

With this in mind, compiling a simple Chapel application on Cori may look something like: 

$ module load chapel
$ cat hello-coforall.chpl
module Hello {
  config const audience = "world";
  proc main() {
    // Start one task per locale and run each task on its own locale
    coforall loc in Locales do
      on loc do
        writeln("Hello, ", audience, " from node ", loc.id, " of ", numLocales);
  }
}
$ chpl hello-coforall.chpl -o hello-coforall
$ ls hello-coforall*
hello-coforall*  hello-coforall.chpl  hello-coforall_real*

and then executing it:

$ salloc -N 2 -t 10:00 -p debug
$ srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --cpus-per-task=64 ./hello-coforall_real -v -nl 2
QTHREADS: Using 32 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8388608 byte stack size.
executing on node 1 of 2 node(s): nid00565
QTHREADS: Using 32 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8388608 byte stack size.
executing on node 0 of 2 node(s): nid00564
Hello, world from node 0 of 2
Hello, world from node 1 of 2