National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

MPI Errors

Debugging MPI codes takes special care. Not only is the programming paradigm less familiar than traditional serial programming, but synchronization problems can be particularly tricky to identify. The best advice for debugging MPI is to keep a good reference at hand, because man pages can be incomplete (e.g., the Fortran bindings are missing from the man pages here at NERSC). In addition, always run small test cases before you use a routine in a large code. You cannot be sure that you understand how an MPI routine behaves until you have run test cases on the specific machine where you will use it.

General Strategy

It is difficult to identify a general strategy for debugging MPI codes, because programs differ so much from one another. It is always a good idea, however, to check the man page for every binding you use to make sure the syntax is correct.

The best tool for debugging MPI codes is TotalView. TotalView allows you to view variables on any processor and set breakpoints in multiple places. It is often best to set a breakpoint after a chunk of code that contains complicated communication, rather than stepping through that code, because line-by-line stepping in TotalView can affect communication behavior.

Things to look for when using TotalView include the arguments being passed in an MPI call. Is every processor passing the arguments you think it is passing? TotalView is also handy for checking message tags when the code hangs. Another problem that is easy to spot in TotalView is one processor having stopped execution, through a stop statement or for some other reason. A last thing to check is that a collective communication call such as MPI_BCAST is being made on every processor in the communicator.

Common Errors

Here are some common MPI errors:

Fortran versus C Bindings

Fortran bindings require an extra final argument, ierr, that is often forgotten by the programmer. In C the error code is the return value of the function instead. For example, the bindings to call a barrier are:

Fortran
MPI_BARRIER(comm,ierr)
C
int MPI_Barrier(MPI_Comm comm)

It is easy to forget the ierr argument in the Fortran case.

Status in Receives

A few MPI bindings require a status argument, including the basic receive. For example, here is the Fortran binding for a basic receive:

Fortran
MPI_RECV(buf,count,datatype,source,tag,comm,status,ierr)

The status argument should be dimensioned as an integer array of length MPI_STATUS_SIZE:

integer status(MPI_STATUS_SIZE)

where MPI_STATUS_SIZE is defined in the Fortran include file mpif.h. Do not hard-code the length of this array yourself. In C, the status is instead declared as a variable of type MPI_Status, which is defined in mpi.h.

Reduction Operations

When using a reduction operation such as MPI_REDUCE, where a variable from every processor is combined in some way into a variable on one processor, make sure that the send and receive buffers are different variables. For example, the Fortran binding for a reduce, which can sum a scalar from each processor and store the result in a variable on processor zero, is:

MPI_REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)

If you wanted to use this to sum the energy calculated on each processor to find the total energy of the system, then you might be tempted to call both variables energy. Hence you would type:

MPI_REDUCE(energy,energy,1,MPI_REAL,MPI_SUM,0,MPI_COMM_WORLD,ierr)

This will fail: MPI does not allow the send buffer and the receive buffer to be the same variable.

Non-Blocking I/O

An important aspect of non-blocking I/O that is in the man page but not in the manual is that you cannot complete the request with MPI_WAIT. You need to call MPIO_WAIT instead; otherwise the code will hang.

Limit on MPI Tags

MPI tag numbers greater than 32,768 (2^15) are not allowed. This problem can arise when tags are generated automatically.
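One way to catch this early is to validate generated tags before they reach an MPI call. The sketch below is plain C with no MPI dependency. The names auto_tag and tag_in_range and the (step, rank) tagging scheme are hypothetical, and the limit is passed in as a parameter rather than assumed. On a real system the bound can be read from the MPI_TAG_UB attribute of MPI_COMM_WORLD; the MPI standard guarantees only that it is at least 32,767.

```c
/* Hypothetical automatic tag scheme: one distinct tag per (step, rank)
   pair. The value grows without bound as `step` increases. */
long auto_tag(long step, int rank, int nranks)
{
    return step * (long)nranks + rank;
}

/* Returns 1 if `tag` is usable as an MPI tag under `tag_limit`,
   0 otherwise. `tag_limit` is an assumption supplied by the caller,
   for example the value of the MPI_TAG_UB attribute. */
int tag_in_range(long tag, long tag_limit)
{
    return tag >= 0 && tag <= tag_limit;
}
```

With 64 ranks, for example, auto_tag(10, 3, 64) gives 643, which is fine, while auto_tag(1000, 3, 64) gives 64003, which exceeds the limit quoted above. Failing loudly at that point is easier to debug than an error deep inside a send.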


Page last modified: Fri, 07 Nov 2003 20:14:44 GMT
Page URL: http://www.nersc.gov/nusers/help/tutorials/debug/mpi.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
