Introduction to MPI
This tutorial is intended to be used as a basic introduction to parallel programming on the NERSC parallel machines. It introduces a number of concepts and the MPI message passing library.
This tutorial assumes that you are familiar with FORTRAN and basic UNIX commands.
Access to the examples can be gained with the following commands on the SP:
% module load training % cd $EXAMPLES/mpi/intro
- Basic Concepts
- Introduces basic concepts of using MPI and passing simple messages.
- Initialization
- Initializing the library.
- Data Types
- Mapping MPI datatypes to machine data types.
- Basic Send
- Moving data to another task.
- Basic Receive
- Moving data from another task.
- Broadcast
- Sending data from one processor to all others.
- Reduction
- How to get data from all other processors.
- Example
- A small example using broadcast and reduction routines.
Related information
Basic Concepts
Franklin, Bassi, and Jacquard are mixed distributed memory and shared memory machines. Each node consists of multiple CPUs (8 on Bassi and 2 on Franklin and Jacquard) that have direct access to a single node's pool of memory. Each node has its own memory and does not have direct access to the memory on other nodes.
Fortran, C, and C++ provide no mechanism for directly accessing or sharing memory that resides on another node. If a program needs data that is stored on another node, the programmer must incorporate subroutines calls from a library of message passing routines to get that data. The most common portable message passing library is called the Message Passing Interface or MPI. MPI is a standard that allows you to send data from one node to another. MPI also allows you to communicate among MPI tasks, whether they are executing on the same node or not.
While a single node does have multiple CPUs, most codes still use MPI calls to transfer data between all processors, even those on the same node. In many cases compilers and operating systems will recognize when MPI is communicating on-node and will substitute faster shared-memory constructs where appropriate.
In the following, well use the term task and process rather interchangeably to refer to the independent, parallel execution of code. You can have a number of MPI tasks less than or equal to the number of CPUs on the node. (If you are writing a hybrid code, e.g. using OpenMP and MPI, each MPI task can have a number of associated threads.)
Initialization
To use the MPI library you must include header files which contain definitions and declarations that are needed by the MPI library routines. In Fortran 90 you can use the mpi module. The following line must appear at the top of any source code file that will make an MPI call.
Fortran
INCLUDE 'mpif.h'
USE mpi
C/C++:
#include "mpi.h"
MPI_Init
The first MPI call must be MPI_INIT, which initializes the message passing routines, for example:
Fortran:
INTEGER:: ierr CALL MPI_INIT(ierr)
where ierr is an integer which holds an error code when the call returns.
NOTE: The value of ierr is really of little use since, by default, MPI aborts the program when it encounters an error. However, ierr must be included with MPI calls anyway.
C
int MPI_Init( int *argc, char ***argv)
where argc and argv are the arguments
passed to main. MPI does not use these arguments in
any way, however, and in MPI-2 implementations, NULL may
be passed instead.
MPI_Finalize
When you are finished with the message passing routines, you must close out the MPI routines. The command for doing this finalization is
Fortran
CALL MPI_FINALIZE(ierr)
C
int MPI_Finalize(void)
Inquiry routines
Two other MPI calls are usually made soon after initialization. They are
MPI_COMM_SIZE()-
MPI_COMM_SIZEis used to find the number of tasks in a specified MPI communicator. In MPI, you can divide your total number of tasks into groups, called communicators. In this tutorial we will only refer to the communicator that includes all MPI processes:MPI_COMM_WORLD.The format for
MPI_COMM_SIZEis as follows:Fortran
INTEGER:: totTasks, ierr CALL MPI_COMM_SIZE( MPI_COMM_WORLD, totTasks, ierr )
C
int MPI_Comm_size( MPI_COMM_WORLD, int *totTasks )
where the total number of tasks in
MPI_COMM_WORLDis returned in the variable totTasks and ierr returns an error code. MPI_COMM_RANK()MPI_COMM_RANKis used to find the rank (the name or identifier) of the tasks running the code. Each task in a communicator is assigned an identifying number from 0 to totTasks-1. The format for the call is as follows:Fortran
INTEGER:: task_id, ierr CALL MPI_COMM_RANK( MPI_COMM_WORLD, task_id, ierr )
C
int MPI_Comm_rank( MPI_COMM_WORLD, int *task_id )
where task_id is an integer that identifies the task.
Example
This example, hello.f90, available in $EXAMPLES/mpi/intro/hello demonstrates these calls.
If we compile (you can type "make" to compile if you are using the example code) and run this code on franklin, we get
franklin% ftn -o hello hello.f90 franklin% qsub -I -lmppwidth=4 franklin% cd $PBS_O_WORKDIR franklin% aprun -n 4 ./hello task_no is 3 of total 4 tasks task_no is 1 of total 4 tasks task_no is 0 of total 4 tasks task_no is 2 of total 4 tasks
Data types
One argument that usually must be given to MPI routines is the type of the data being passed. User-defined datatypes (an advanced MPI topic) allows MPI to automatically scatter and gather data to and from non-contiguous buffers. MPI defines a number of constants that correspond to language datatypes in Fortran and C.
When an MPI routine is called, the Fortran data type of the data being passed must match the corresponding MPI integer constant. The following table the MPI 1.2 datatype definitions.
Type Length --------------- ------ MPI_PACKED 1 MPI_BYTE 1 MPI_CHAR 1 MPI_UNSIGNED_CHAR 1 MPI_SIGNED_CHAR 1 MPI_WCHAR 2 MPI_SHORT 2 MPI_UNSIGNED_SHORT 2 MPI_INT 4 MPI_UNSIGNED 4 MPI_LONG 4 MPI_UNSIGNED_LONG 4 MPI_FLOAT 4 MPI_DOUBLE 8 MPI_LONG_DOUBLE 16 MPI_CHARACTER 1 MPI_LOGICAL 4 MPI_INTEGER 4 MPI_REAL 4 MPI_DOUBLE_PRECISION 8 MPI_COMPLEX 2*4 MPI_DOUBLE_COMPLEX 2*8 Optional Type Length ------------------ ------ MPI_INTEGER1 1 MPI_INTEGER2 2 MPI_INTEGER4 4 MPI_INTEGER8 8 MPI_LONG_LONG 8 MPI_UNSIGNED_LONG_LONG 8 MPI_REAL4 4 MPI_REAL8 8 MPI_REAL16 16
Basic send
We're ready to do some message passing. We'll start with the most basic
send routine, MPI_SEND. The function call looks like this:
Fortran
FORTRAN_TYPE:: buff INTEGER:: count, dest, ierr CALL MPI_SEND(buff, count, MPI_TYPE, dest, tag, comm, ierr)
C
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
This single command allows the passing of any kind of variable, even a large array, to any group of tasks. We'll break it down into variables:
- buff : The variable you want to send.
- count : The number of variables you're passing. If you're
passing only a single value, this should be
1. If you're passing an array, it's the overall size of the array. For example, if you wanted to send a 4 by 5 array, count would be 4*5=20 since you're actually passing 20 values. - MPI_TYPE : The kind of variable you're passing so the MPI routine knows what to expect.
- dest : The ID number of the task you're sending the message to.
- tag : a message tag. This is a way for the receiver to verify that it's getting the message it expects. The message tag is an integer number that you can assign any value.
- comm : This is the group ID of tasks that your message is going to. In large, complex programs tasks may be divided into groups to speed connections and transfers. In small programs, this will more than likely be MPI_COMM_WORLD.
- ierr : an integer error code.
Basic Receive
Once you've sent a message, you must receive it on another
task.
The
MPI_RECV call is similar to the send call.
Fortran
FORTRAN_TYPE:: rbuf INTEGER:: count, source, tag, status(MPI_STATUS_SIZE), ierr CALL MPI_RECV(rbuf, count, MPI_TYPE, source, tag, comm, status, ierr)
C
int MPI_Recv( void *rbuf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
The arguments that are different from those in MPI_SEND are
- rbuf : This is the name of the variable where you'll be storing the received data.
- source : This replaces the destination in the send command. This is the return ID of the sender.
- status : You can check this variable to see if the receive was completed.
Example: simple send and receive
This example, pingpong.f90, available in $EXAMPLES/mpi/intro/simple, demonstrates these calls.
If we compile and run
franklin% ftn -o pingpong pingpong.f90
the output looks like this:
franklin% qsun -I -lmppwidth=2 franklin% cd $PBS_O_WORKDIR franklin% aprun -n 2 ./pingpong 0: TASK # 0 sent 0 0: TASK # 0 received 1 1: TASK # 1 sent 1 1: TASK # 1 received 0
Broadcast
Often you will have data on one processor that it needs to
share with all other processors. You accomplish this
by using
a broadcast, which sends data to a group of processes.
The MPI routine MPI_BCAST() transfers
data from one task to a group of others. The
format is:
Fortran
FORTRAN_TYPE:: buff INTEGER:: count, root, ierr CALL MPI_BCAST(buff, count, MPI_TYPE, root, comm, ierr)
C
int MPI_Bcast( void *buff, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
where the variables are the same as before, except
- FORTRAN_TYPE : is to be replaced with the Fortran data type
of the variable being passed,
e.g.
REALandINTEGER. - MPI_TYPE : is to be replaced with the MPI type
of the data being passed,
e.g.
MPI_REALandMPI_INTEGER. - buff : the variable name of the data being broadcast. When the call returns, all tasks will have the same value placed in buff
- root : This is the
task which has the value to be shared with all the others. In other words,
if buff=10 on task 3, setting root=3 will cause
the call to
MPI_BCASTto set buff=10 on all the tasks.
Example
This sample program, bcast.f90 is available in $EXAMPLES/mpi/intro/bcast/. The program has task 0 calculate a value for the variable rsq and then broadcast that value to all the other tasks. Each task then prints out the value, thus verifying the broadcast.
If we compile and run, we get these results
franklin% ftn -o bcast bcast.f90 franklin% qsub -I -lmppwidth=8 frnaklin% cd $PBS_O_WORKDIR franklin% aprun -n 8 ./bcast TASK # 0 has rsq= 27.562500 TASK # 1 has rsq= 27.562500 TASK # 2 has rsq= 27.562500 TASK # 3 has rsq= 27.562500 TASK # 4 has rsq= 27.562500 TASK # 5 has rsq= 27.562500 TASK # 6 has rsq= 27.562500 TASK # 7 has rsq= 27.562500
Each task ends up with the same value of the variable rsq.
Reduction
Often you will have calculated some quantity or quantities on many processors and you will need to collect or reduce that data onto one or all processors. MPI provides routines to perform these reductions.
One Collector
If each task in your job has calculated a private value for some
variable, var, with the same name on all tasks,
the MPI_REDUCE routine will reduce all
those values, according to some operation, and store the
result in a variable on one task. The syntax is
Fortran
FORTRAN_TYPE:: sendbuf, recbuf INTEGER:: count, root, ierr CALL MPI_REDUCE(sendbuf, recbuf, count, & MPI_TYPE, MPI_OP, root, comm, ierr)
C
int MPI_Reduce( void *sendbuf, void *recbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
where
- sendbuf : the variable to be collected from all tasks.
- recbuf : the variable where the result of operating MPI_OP on all those values will be placed
- MPI_OP : one of a number of functions to be performed on the values being collected.
- root : the only task that will collect the result of the reduce command in recbuf.
Possible values for MPI_OP are:
| MPI Name | Operation |
|---|---|
| MPI_MAX | Maximum |
| MPI_MIN | Minimum |
| MPI_PROD | Product |
| MPI_SUM | Sum |
| MPI_LAND | Logical AND |
| MPI_LOR | Logical OR |
| MPI_LXOR | Logical EXCLUSIVE OR |
| MPI_BAND | Bitwise AND |
| MPI_BOR | Bitwise OR |
| MPI_BXOR | Bitwise EXCLUSIVE OR |
| MPI_MAXLOC | Maximum value and location |
| MPI_MINLOC | Minimum value and location |
Everyone Collects
You may wish to have the reduction available to all tasks.
The routine MPI_ALLREDUCE performs that function.
It's almost the same as the basic reduce command:
Fortran
FORTRAN_TYPE:: sendbuf, recbuf INTEGER:: count, ierr CALL MPI_ALLREDUCE(sendbuf, recbuf, count, MPI_TYPE, MPI_OP, comm, ierr)
C
int MPI_Allreduce( void *sendbuf, void *recbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
The only difference between this call and MPI_REDUCE
is the lack of a root task in the call statement.
Example
We'll put together things we've learned into an example program. This program uses collective operations to broadcast a value to all tasks and gather results from all tasks to task 0.
This example, flip.f90, available in $EXAMPLES/MPI_intro/flip, demonstrates these calls.
If we compile and run on the SP, we get
seaborg% mpxlf90 -o flip -qsuffix=f=f90 flip.f90 1501-510 Compilation successful for file flip.f90 seaborg% poe ./flip -procs 8 -nodes 1 subfilter: default repo mpccc will be charged Flipping coin 1000000 times on each task. Processor 1 got 499614 heads. Processor 0 got 499493 heads. Processor 3 got 500017 heads. Processor 2 got 500595 heads. Processor 5 got 499306 heads. Processor 4 got 499778 heads. Processor 6 got 500008 heads. Processor 7 got 500830 heads. Heads came up 49.99551392 percent of the time.
![]() |
Page last modified: Fri, 21 May 2004 22:27:08 GMT Page URL: http://www.nersc.gov/nusers/help/tutorials/mpi/intro/print.php Web contact: webmaster@nersc.gov Computing questions: consult@nersc.gov Privacy and Security Notice |
![]() |

