
Programming

Using the CUDA Compiler

To compile a CUDA code, first load the following module:

module load cuda

Then use the CUDA compiler:

nvcc kernel.cu
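
For reference, a minimal kernel.cu that the command above could compile might look like the following. This is purely an illustrative sketch; the kernel, array size, and launch configuration are not part of any NERSC-provided code.

/* kernel.cu -- minimal illustrative example */

#include <stdio.h>
#include <cuda_runtime.h>

/* Scale every element of x by a constant factor */
__global__ void scale(float *x, float factor)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    x[i] *= factor;
}

int main(void)
{
    const int n = 256;
    float host[256];
    float *dev;

    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<n / 64, 64>>>(dev, 2.0f);      /* 4 blocks of 64 threads, one per element */

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[10] = %f\n", host[10]);   /* expect 20.0 */
    return 0;
}

Building it with nvcc kernel.cu -o kernel and running the resulting executable on a node with a GPU should print the scaled value.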

MPI + CUDA

Mixing MPI (C) and CUDA (C++) code requires some care during linking because of differences between the C and C++ calling conventions and runtimes. A helpful overview of the issues can be found at How to Mix C and C++.

One option is to compile and link all source files with a C++ compiler, which will enforce additional restrictions on the C code. Alternatively, if you wish to compile your MPI/C code with a C compiler and call CUDA kernels from within an MPI task, you can declare the appropriate CUDA-compiled functions with extern "C" linkage, as in the following example.

The two source files shown below, multiply.cu and main.c, can be compiled with their respective compilers and linked into a single executable using:

mpicc -c main.c -o main.o
nvcc -c multiply.cu -o multiply.o

mpicc main.o multiply.o -L/usr/local/cuda/lib64 -lcudart

The CUDA/C++ compiler nvcc is used only to compile the CUDA source file, and the MPI C compiler mpicc is used to compile the C code and to perform the linking.

/* multiply.cu */

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__ (const float *a, float *b)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(const float *a, float *b)
{
    /* ... load CPU data into GPU buffers a_gpu and b_gpu */

    __multiply__ <<< ...block configuration... >>> (a_gpu, b_gpu);

    /* safecall() stands in for a user-defined error-checking wrapper */
    safecall(cudaThreadSynchronize());
    safecall(cudaGetLastError());

    /* ... transfer data from GPU to CPU */
}

Note the use of extern "C" around the function launch_multiply, which instructs the C++ compiler (nvcc in this case) to give the function C linkage so that it can be called from C code. The following C code shows how the function could be called from an MPI task.

/* main.c */

#include <mpi.h>

void launch_multiply(const float *a, float *b);

int main (int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);

    /* ... prepare arrays a and b */

    launch_multiply (a, b);

    MPI_Finalize();
    return 0;
}
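
For convenience, the build steps above can be collected into a small script. This is just a sketch; the script name, executable name, and CUDA library path are illustrative and may need to be adjusted for your environment.

#!/bin/bash
# build.sh -- sketch of the two-compiler build described above
module load cuda

nvcc  -c multiply.cu -o multiply.o     # device code: CUDA compiler
mpicc -c main.c      -o main.o         # host/MPI code: MPI C compiler

# link with the MPI wrapper, adding the CUDA runtime library
mpicc main.o multiply.o -L/usr/local/cuda/lib64 -lcudart -o mpi_multiply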

MPI + CUDA + Fortran

Mixing MPI, CUDA, and Fortran also requires care. One recommended approach is to separate the code that uses CUDA from the code that uses MPI: compile the CUDA code with nvcc, compile all other code with the MPI compiler wrappers, and link with the MPI wrapper, passing the flags needed to find and link the CUDA libraries. An example you can try is here (tar file): a "Life" program courtesy of the Portland Group. The 'make' command builds four variants of the code: a sequential, non-GPU version; a sequential GPU version; and two MPI+GPU variants. The pgi-gpu modulefile (as well as openmpi) must be loaded for this to work.
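
In outline, that build pattern looks like the following. The file names, the mpif90 wrapper, and the CUDA library path are assumptions for illustration only; the Life example's own Makefile handles these details.

module load pgi-gpu openmpi

nvcc -c kernel.cu -o kernel.o          # CUDA code: compile with nvcc
mpif90 -c main.f90 -o main.o           # Fortran/MPI code: compile with the MPI wrapper
mpif90 main.o kernel.o -L/usr/local/cuda/lib64 -lcudart    # link with the MPI wrapper, adding the CUDA libraries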

Using the TotalView Debugger

TotalView is a parallel debugger which supports CUDA debugging.  To use it on Dirac:

qsub -I -V -q dirac_int -l nodes=...     # use appropriate resource parameters

module load cuda

module load totalview

nvcc -g -G prefix.cu -o prefix # -g for host and -G for device

totalview ./prefix

For information on how to use TotalView, please check the TotalView web page.