NERSC logo National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
Restore navigation column
NERSC Tutorials

Introduction to MPI

This tutorial is intended to be used as a basic introduction to parallel programming on the NERSC parallel machines. It introduces a number of concepts and the MPI message passing library.

This tutorial assumes that you are familiar with FORTRAN or C, basic UNIX commands, and a UNIX text editor.

Access to the examples can be gained with the following commands on the SP:

% module load training
% cd $EXAMPLES/mpi/intro
Basic Concepts
Introduces basic concepts of using MPI and passing simple messages.
Initialization
Initializing the library.
Data Types
Mapping MPI datatypes to machine data types.
Basic Send
Moving data to another task.
Basic Receive
Moving data from another task.
Broadcast
Sending data from one processor to all others.
Reduction
How to get data from all other processors.
Example
A small example using broadcast and reduction routines.

Related information

Basic Concepts

Franklin, Bassi, and Jacquard are mixed distributed memory and shared memory machines. Each node consists of multiple main processing cores (8 on Bassi, 4 on Franklin and 2 on Jacquard) that have direct access to a single node's pool of memory. Each node has its own memory and does not have direct access to the memory on other nodes.

Fortran, C, and C++ provide no mechanism for directly accessing or sharing memory that resides on different nodes. If a program needs data that is stored on another node, the programmer must incorporate subroutines calls from a library of message passing routines to get that data. The most common portable message passing library is called the Message Passing Interface or MPI. MPI is a standard that allows you to send data from one node to another. MPI also allows you to communicate among MPI tasks, whether they are executing on the same node or not.

While a single node does have multiple cores, most codes still use MPI calls to transfer data between all processors, even those on the same node. In many cases compilers and operating systems will recognize when MPI is communicating on-node and will substitute faster shared-memory constructs where appropriate.

In the following, well use the term task and process rather interchangeably to refer to the independent, parallel execution of code. You can have a number of MPI tasks less than or equal to the number of cores on the node. (If you are writing a hybrid code, e.g. using OpenMP and MPI, each MPI task can have a number of associated threads.)

[NEXT: Initialization]

Initialization

To use the MPI library you must include header files which contain definitions and declarations that are needed by the MPI library routines. In Fortran 90 you can use the mpi module. The following line must appear at the top of any source code file that will make an MPI call.

Fortran

 INCLUDE 'mpif.h'
Fortran 90 codes should use instead:
 USE mpi 

C/C++:

 #include "mpi.h"

MPI_Init

The first MPI call must be MPI_INIT, which initializes the message passing routines, for example:

Fortran:

INTEGER:: ierr

CALL MPI_INIT(ierr) 

where ierr is an integer which holds an error code when the call returns.

NOTE: The value of ierr is really of little use since, by default, MPI aborts the program when it encounters an error. However, ierr must be included with MPI calls anyway.

C

int MPI_Init( int *argc, char ***argv)

where argc and argv are the arguments passed to main. MPI does not use these arguments in any way, however, and in MPI-2 implementations, NULL may be passed instead.

MPI_Finalize

When you are finished passing messages, you must close out the MPI routines. Often this is the last thing done in a program. The syntax for this finalization is

Fortran

       CALL MPI_FINALIZE(ierr)

C

int MPI_Finalize(void)

Inquiry routines

Two other MPI calls are usually made soon after initialization. They are

MPI_COMM_SIZE()

MPI_COMM_SIZE is used to find the number of tasks in a specified MPI communicator. In MPI, you can divide your total number of tasks into groups, called communicators. In this tutorial we will only refer to the communicator that includes all MPI processes: MPI_COMM_WORLD.

The format for MPI_COMM_SIZE is as follows:

Fortran

INTEGER:: totTasks, ierr

CALL MPI_COMM_SIZE( MPI_COMM_WORLD, totTasks, ierr )

C

int MPI_Comm_size( MPI_COMM_WORLD, int *totTasks )

where the total number of tasks in MPI_COMM_WORLD is returned in the variable totTasks and ierr returns an error code.

MPI_COMM_RANK()

MPI_COMM_RANK is used to find the rank (the name or identifier) of the tasks running the code. Each task in a communicator is assigned an identifying number from 0 to totTasks-1. The format for the call is as follows:

Fortran

INTEGER:: task_id, ierr

CALL MPI_COMM_RANK( MPI_COMM_WORLD, task_id, ierr )

C

int MPI_Comm_rank( MPI_COMM_WORLD, int *task_id )

where task_id is an integer that identifies the task.

Example

This example, hello.f90, hello.c, available in $EXAMPLES/mpi/intro/hello demonstrates these calls.

If we compile (you can type "make" to compile if you are using the example code) and run this code on franklin, we get

franklin% ftn -o hello hello.f90
franklin% qsub -I -lmppwidth=4
franklin% cd $PBS_O_WORKDIR
franklin% aprun -n 4 ./hello 
 task_no is  3  of total  4  tasks
 task_no is  1  of total  4  tasks
 task_no is  0  of total  4  tasks
 task_no is  2  of total  4  tasks

Data types

One argument that usually must be given to MPI routines is the type of the data being passed. User-defined datatypes (an advanced MPI topic) allows MPI to automatically scatter and gather data to and from non-contiguous buffers. MPI defines a number of constants that correspond to language datatypes in Fortran and C.

When an MPI routine is called, the Fortran data type of the data being passed must match the corresponding MPI integer constant. The following table the MPI 1.2 datatype definitions.

Type 			Length
--------------- 	------
MPI_PACKED 		1
MPI_BYTE 		1
MPI_CHAR 		1
MPI_UNSIGNED_CHAR 	1
MPI_SIGNED_CHAR 	1
MPI_WCHAR 		2
MPI_SHORT 		2
MPI_UNSIGNED_SHORT 	2
MPI_INT 		4
MPI_UNSIGNED 		4
MPI_LONG 		4
MPI_UNSIGNED_LONG 	4
MPI_FLOAT 		4
MPI_DOUBLE 		8
MPI_LONG_DOUBLE 	16
MPI_CHARACTER 		1
MPI_LOGICAL 		4
MPI_INTEGER 		4
MPI_REAL 		4
MPI_DOUBLE_PRECISION 	8
MPI_COMPLEX 		2*4
MPI_DOUBLE_COMPLEX 	2*8

Optional Type 		Length
------------------ 	------
MPI_INTEGER1 		1
MPI_INTEGER2 		2
MPI_INTEGER4 		4
MPI_INTEGER8 		8
MPI_LONG_LONG 		8
MPI_UNSIGNED_LONG_LONG 	8
MPI_REAL4 		4
MPI_REAL8 		8
MPI_REAL16 		16

Basic send

We're ready to do some message passing. We'll start with the most basic send routine, MPI_SEND. The function call looks like this:

Fortran

FORTRAN_TYPE::  buff
INTEGER:: count, dest, ierr

CALL MPI_SEND(buff, count, MPI_TYPE, dest, tag, comm, ierr)

C

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
	int tag, MPI_Comm comm)

This single command allows the passing of any kind of variable, even a large array, to any group of tasks. We'll break it down into variables:

Basic Receive

Once you've sent a message, you must receive it on another task. The MPI_RECV call is similar to the send call.

Fortran

FORTRAN_TYPE::  rbuf
INTEGER:: count, source, tag, status(MPI_STATUS_SIZE), ierr

CALL MPI_RECV(rbuf, count, MPI_TYPE, source, tag, comm, status, ierr)

C

int MPI_Recv( void *rbuf, int count, MPI_Datatype datatype, int source,
	int tag, MPI_Comm comm, MPI_Status *status )

The arguments that are different from those in MPI_SEND are

Example: simple send and receive

This example, pingpong.f90, available in $EXAMPLES/mpi/intro/simple, demonstrates these calls.

If we compile and run

franklin% ftn -o pingpong pingpong.f90

the output looks like this:

franklin% qsun -I -lmppwidth=2
franklin% cd $PBS_O_WORKDIR
franklin% aprun -n 2 ./pingpong 
   0: TASK # 0  sent  0
   0: TASK # 0  received  1
   1: TASK # 1  sent  1
   1: TASK # 1  received  0

Broadcast

Often you will have data on one processor that it needs to share with all other processors. You accomplish this by using a broadcast, which sends data to a group of processes. The MPI routine MPI_BCAST() transfers data from one task to a group of others. The format is:

Fortran

FORTRAN_TYPE:: buff
INTEGER:: count, root, ierr

CALL MPI_BCAST(buff, count, MPI_TYPE, root, comm, ierr)

C

int MPI_Bcast( void *buff, int count, MPI_Datatype datatype, int root,
	MPI_Comm comm)

where the variables are the same as before, except

Example

This sample program, bcast.f90 is available in $EXAMPLES/mpi/intro/bcast/. The program has task 0 calculate a value for the variable rsq and then broadcast that value to all the other tasks. Each task then prints out the value, thus verifying the broadcast.

If we compile and run, we get these results

franklin% ftn -o bcast  bcast.f90
franklin% qsub -I -lmppwidth=8
frnaklin% cd $PBS_O_WORKDIR
franklin% aprun -n 8 ./bcast 
 TASK # 0  has rsq=  27.562500
 TASK # 1  has rsq=  27.562500
 TASK # 2  has rsq=  27.562500
 TASK # 3  has rsq=  27.562500
 TASK # 4  has rsq=  27.562500
 TASK # 5  has rsq=  27.562500
 TASK # 6  has rsq=  27.562500
 TASK # 7  has rsq=  27.562500

Each task ends up with the same value of the variable rsq.

Reduction

Often you will have calculated some quantity or quantities on many processors and you will need to collect or reduce that data onto one or all processors. MPI provides routines to perform these reductions.

One Collector

If each task in your job has calculated a private value for some variable, var, with the same name on all tasks, the MPI_REDUCE routine will reduce all those values, according to some operation, and store the result in a variable on one task. The syntax is

Fortran

FORTRAN_TYPE:: sendbuf, recbuf
INTEGER:: count, root, ierr

CALL MPI_REDUCE(sendbuf, recbuf, count, &
		MPI_TYPE, MPI_OP,  root, comm, ierr)

C

int MPI_Reduce( void *sendbuf, void *recbuf, int count, 
	MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

where

Possible values for MPI_OP are:

MPI Name Operation
MPI_MAX Maximum
MPI_MIN Minimum
MPI_PROD Product
MPI_SUM Sum
MPI_LAND Logical AND
MPI_LOR Logical OR
MPI_LXOR Logical EXCLUSIVE OR
MPI_BAND Bitwise AND
MPI_BOR Bitwise OR
MPI_BXOR Bitwise EXCLUSIVE OR
MPI_MAXLOC Maximum value and location
MPI_MINLOC Minimum value and location

Everyone Collects

You may wish to have the reduction available to all tasks. The routine MPI_ALLREDUCE performs that function. It's almost the same as the basic reduce command:

Fortran

FORTRAN_TYPE:: sendbuf, recbuf
INTEGER:: count, ierr

CALL MPI_ALLREDUCE(sendbuf, recbuf, count, MPI_TYPE, MPI_OP, comm, ierr)

C

int MPI_Allreduce( void *sendbuf, void *recbuf,  int count,
	MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

The only difference between this call and MPI_REDUCE is the lack of a root task in the call statement.

Example

We'll put together things we've learned into an example program. This program uses collective operations to broadcast a value to all tasks and gather results from all tasks to task 0.

This example, flip.f90, available in $EXAMPLES/MPI_intro/flip, demonstrates these calls.

If we compile and run on the SP, we get

seaborg% mpxlf90 -o flip -qsuffix=f=f90 flip.f90
1501-510  Compilation successful for file flip.f90
seaborg% poe ./flip -procs 8 -nodes 1  
subfilter: default repo mpccc will be charged
 Flipping coin  1000000  times on each task.
 Processor  1  got  499614  heads.
 Processor  0  got  499493  heads.
 Processor  3  got  500017  heads.
 Processor  2  got  500595  heads.
 Processor  5  got  499306  heads.
 Processor  4  got  499778  heads.
 Processor  6  got  500008  heads.
 Processor  7  got  500830  heads.
 Heads came up  49.99551392  percent of the time.

LBNL Home
Page last modified: Mon, 11 Jan 2010 21:29:11 GMT
Page URL: http://www.nersc.gov/nusers/help/tutorials/mpi/intro/print.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov

Privacy and Security Notice
DOE Office of Science