
Compile and Run Alternative Programming Models

Overview

While we often provide quick-start documentation for compiling and running applications using MPI and OpenMP on NERSC systems, the same does not always exist for other supported parallel programming models such as UPC or Chapel.

At the same time, we know that these alternative programming models may play a valuable role in enabling application readiness for next-generation architectures. Therefore, to help users more easily begin running applications written in these models, we provide here brief synopses of the steps involved in getting started with them on Cori.

Although the list of programming models covered here is not exhaustive, we expect it to grow with time based on our interactions with NERSC users.

Note that although all of the examples below are executed under interactive salloc sessions, the compilation procedure and application launch command would be the same for a batch submission.
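
For instance, a minimal sketch of an equivalent batch script (the resource requests and launch line simply mirror the interactive examples below, and the executable name is illustrative) might look like:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 10:00
#SBATCH -p debug
#SBATCH -C haswell

srun -n 4 ./cpi.x    # or, e.g., upcrun -n 4 ./cpi.x for a Berkeley UPC build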

Unified Parallel C

Unified Parallel C (UPC) is supported on Cori through two different implementations: Berkeley UPC and Cray UPC.

Berkeley UPC

Berkeley UPC (BUPC) provides a portable UPC programming environment consisting of a source translation front-end (which in turn relies on a user-supplied C compiler underneath) and a runtime library based on GASNet. The latter is able to take advantage of advanced communications functionality of the Cray Aries interconnect on Cori, such as remote direct memory access (RDMA).

BUPC is available via the bupc module on Cori, which provides both the upcc compiler wrapper and the upcrun launcher wrapper (which correctly initializes the environment and calls srun). Further, BUPC can use any of the three supported programming environments on Cori (Intel, GNU, and Cray) as the underlying C compiler.

There are a number of environment variables that affect the execution environment of your UPC application compiled with BUPC, all of which can be found in the BUPC documentation. One of the most important is UPC_SHARED_HEAP_SIZE, which controls the size of the shared symmetric heap used to service shared memory allocations. If you encounter errors at application launch related to memory allocation, you will likely want to start by adjusting this variable.
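
For instance, a minimal sketch of enlarging the shared heap before launch (the 512MB value is purely illustrative; consult the BUPC documentation for the accepted formats) might look like:

$ export UPC_SHARED_HEAP_SIZE=512MB
$ upcrun -n 4 ./cpi.x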

Compiling and running a simple application with BUPC on Cori is fairly straightforward. First, to compile:

$ module load bupc
## Loaded module 'bupc/2.26.0-6.0.4-intel-18.0.0.128' based on current PrgEnv
## and compiler. If you change PrgEnv or compiler modules, then you should
## run 'module switch bupc bupc' to get the correct bupc module.
$ cat cpi.c
/* The ubiquitous cpi program.
   Compute pi using a simple quadrature rule
   in parallel.
   Usage: cpi [intervals_per_thread] */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <upc_relaxed.h>

#define INTERVALS_PER_THREAD_DEFAULT 100

/* Add up all the inputs on all the threads.
   When the collective spec becomes finalised this
   will be replaced */
shared double reduce_data[THREADS];
shared double reduce_result;

double myreduce(double myinput)
{
#if defined(__xlC__)
    // Work-around Bug 3228
    *(volatile double *)(&myinput);
#endif
    reduce_data[MYTHREAD] = myinput;
    upc_barrier;
    if (MYTHREAD == 0) {
        double result = 0;
        int i;
        for (i = 0; i < THREADS; i++) {
            result += reduce_data[i];
        }
        reduce_result = result;
    }
    upc_barrier;
    return (reduce_result);
}

/* The function to be integrated */
double f(double x)
{
    double dfour = 4;
    double done = 1;
    return (dfour / (done + (x * x)));
}

/* Implementation of a simple quadrature rule */
double integrate(double left, double right, int intervals)
{
    int i;
    double sum = 0;
    double h = (right - left) / intervals;
    double hh = h / 2;
    /* Use the midpoint rule */
    double midpt = left + hh;
    for (i = 0; i < intervals; i++) {
        sum += f(midpt + i * h);
    }
    return (h * sum);
}

int main(int argc, char **argv)
{
    double mystart, myend;
    double myresult;
    double piapprox;
    int intervals_per_thread = INTERVALS_PER_THREAD_DEFAULT;
    double realpi = 3.141592653589793238462643;
    /* Get the part of the range that I'm responsible for */
    mystart = (1.0 * MYTHREAD) / THREADS;
    myend = (1.0 * (MYTHREAD + 1)) / THREADS;
    if (argc > 1) {
        intervals_per_thread = atoi(argv[1]);
    }
    piapprox = myreduce(integrate(mystart, myend, intervals_per_thread));
    if (MYTHREAD == 0) {
        printf("Approx: %20.17f Error: %23.17f\n", piapprox, fabs(piapprox - realpi));
    }
    return (0);
}
$ upcc cpi.c -o cpi.x

And then run, in this case in an interactive salloc session:

$ salloc -N 2 -t 10:00 -p debug -C haswell
[...]
$ upcrun -n 4 ./cpi.x
UPCR: UPC thread 0 of 4 on nid01901 (pshm node 0 of 2, process 0 of 4, pid=36911)
UPCR: UPC thread 1 of 4 on nid01901 (pshm node 0 of 2, process 1 of 4, pid=36912)
UPCR: UPC thread 2 of 4 on nid01902 (pshm node 1 of 2, process 2 of 4, pid=35611)
UPCR: UPC thread 3 of 4 on nid01902 (pshm node 1 of 2, process 3 of 4, pid=35612)
Approx: 3.14159317442312691 Error: 0.00000052083333379

Cray UPC

UPC is directly supported under Cray's compiler environment through their PGAS runtime library (which provides RDMA-enabled communication functionality similar to that of GASNet). To enable UPC support in your C code, simply switch to the Cray compiler environment and supply the '-h upc' option when calling cc.

Because of its dependence on Cray's PGAS runtime, you may find the additional documentation available on the intro_pgas man page valuable. Specifically, two key environment variables introduced there are: 

  • XT_SYMMETRIC_HEAP_SIZE: Limits the size of the symmetric heap used to service shared memory allocations, analogous to BUPC's UPC_SHARED_HEAP_SIZE.
  • PGAS_MEMINFO_DISPLAY: Can be set to '1' in order to enable diagnostic output at launch regarding memory utilization.
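
For example, a sketch of setting these before launching a Cray PGAS application (the heap size shown is purely illustrative) might look like:

$ export XT_SYMMETRIC_HEAP_SIZE=512M
$ export PGAS_MEMINFO_DISPLAY=1
$ srun -n 4 ./cpi.x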

Finally, there is one additional potential issue to be aware of: virtual memory limits in interactive salloc sessions. If you encounter errors on application launch similar to:

PE 0: ERROR: failed to attach XPMEM segment (at or around line 23 in __pgas_runtime_error_checking() from file ...)

then you may need to release your virtual memory limits by running:

ulimit -v unlimited

With all of this in mind, compiling and running a simple UPC application, analogous to the above example for BUPC but now using the Cray compilers, would look like:

$ module swap PrgEnv-intel PrgEnv-cray
$ cc -h upc cpi.c -o cpi.x
$ salloc -N 2 -t 10:00 -p debug -C haswell
[...]
$ ulimit -v unlimited # may not be necessary
$ srun -n 4 ./cpi.x
Approx:  3.14159317442312691 Error:     0.00000052083333379 

Coarray Fortran

Coarray Fortran (CAF) is likewise supported on Cori through two different implementations: Cray CAF and Intel CAF.

Cray CAF

Like UPC, Coarray Fortran is directly supported under Cray's compiler environment through their PGAS runtime library. To enable CAF support in your Fortran code, simply switch to the Cray compiler environment and supply the '-h caf' option when calling ftn.

Because of the shared dependence on libpgas, both PGAS_MEMINFO_DISPLAY and XT_SYMMETRIC_HEAP_SIZE remain relevant, as does the guidance supplied in the intro_pgas man page. Further, for the time being, you again may need to manually release your virtual memory limits when running Cray PGAS applications in an interactive salloc session.

For example, compiling and executing (again, in an salloc session) a simple CAF application under the Cray compilers might look like:

$ module swap PrgEnv-intel PrgEnv-cray
$ cat caf_hello.f90
program Hello_World
  implicit none
  integer :: i
  character(len=20) :: name[*]
  if (this_image() == 1) then
    write(*,'(a)',advance='no') 'Enter your name: '
    read(*,'(a)') name
    do i = 2, num_images()
      name[i] = name
    end do
  end if
  sync all
  write(*,'(3a,i0)') 'Hello ',trim(name),' from image ', this_image()
end program Hello_World
$ ftn -h caf caf_hello.f90 -o caf_hello.x
$ salloc -N 2 -t 10:00 -p debug -C haswell
[...]
$ ulimit -v unlimited # may not be necessary
$ srun -n 4 ./caf_hello.x
Enter your name: Cori
Hello Cori from image 1
Hello Cori from image 2
[...]

Intel CAF

Coarray Fortran is also supported by the Intel Fortran compiler, although it does not use a native PGAS support library under the hood. Instead, it uses a portable runtime library based on Intel MPI. Further, Intel-compiled CAF applications provide an additional layer of abstraction over their use of an MPI-based runtime in the form of an integrated launcher (i.e., the application can be run directly without explicitly calling mpirun or the like).

To enable CAF support, use the -coarray argument to the Intel Fortran compiler. Specifically, you will typically want to use -coarray=distributed in order to support distributed-memory CAF (i.e., across nodes). See the ifort man page for more details.

In addition, although the integrated launcher avoids exposing the user to the implicit dependence on MPI, it can be difficult to configure. We therefore advise you to instead use the native SLURM launcher (srun) on Cori, which can be done by specifying the '-switch no_launch' argument to ifort. Further, in order to properly enable integration between SLURM and Intel MPI, you will also need to set the I_MPI_PMI_LIBRARY environment variable to point to the SLURM PMI library (we also recommend restricting I_MPI_FABRICS to shm and tcp only).

Taken together, running the same simple CAF application as demonstrated above, but now under Intel Fortran, might look something like:

$ # make sure PrgEnv-intel is loaded
$ module load impi
$ ifort -coarray=distributed -switch no_launch caf_hello.f90 -o caf_hello.x
$ salloc -N 2 -t 10:00 -p debug -C haswell
[...]
$ export I_MPI_PMI_LIBRARY=/opt/slurm/default/lib/libpmi.so
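$ export I_MPI_FABRICS=shm:tcp  # optional: restrict fabrics as recommended above (illustrative value)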
$ srun -n 4 ./caf_hello.x
Enter your name: Cori
Hello Cori from image 1
Hello Cori from image 2
[...]