This test executes a new proposed standard benchmark method for MPI startup that is intended to provide a realistic assessment of both launch and wireup requirements. Accordingly, it exercises both the launch system of the environment and the interconnect subsystem in a specified pattern.
Specifically, the test consists of the following steps:
- Record a time stamp for when the test started - this is passed to rank=0 upon launch.
- Launch a 100MB executable on a specified number of processes on each node.
- Each process executes MPI_Init.
- Each process on an even-numbered node (designated the "originator" for purposes of this description) sends a one-byte message to the process with the same local rank on the odd-numbered node above it - i.e., a process on node N sends to a process on node N+1, where N is even.
- The receiving process answers the message with a one-byte message sent back to the original sender. In other words, the test identifies pairs of nodes, and then the processes with the same local rank on each pair of nodes exchange a one-byte message.
- Each originator records the time that the reply is received, and then enters a call to MPI_Gather. This allows all the time stamps to be collected by the rank=0 process.
- Once all the time stamps have been collected, the rank=0 process searches them to find the latest time. This marks the ending time of the benchmark. The start time is then subtracted from this to generate the final time to execute the benchmark.
Thus, the benchmark seeks to measure not just the time required to spawn processes on remote nodes, but also the time required by the interconnect to form inter-process connections capable of communicating.
How to Build
To execute the benchmark, you will need to compile both the ziatest.c and ziaprobe.c programs. A very simple Makefile is provided.
How to Run
The ziatest.c program simply obtains the initial time stamp and then executes the "mpirun" (or equivalent) command to initiate the actual benchmark. As distributed, the command embedded in the code takes advantage of OpenMPI's "-npernode" option. To use the benchmark with a different MPI, you may need to add a second argument to the ziatest command line: specify the run command followed by the necessary option(s), ending with the option (no value) that sets the number of MPI ranks per node (the equivalent of -npernode). The required behavior is to launch a constant number of processes on each node.
There is no requirement on the number of nodes, nor must the number be even. In the case of an odd number of nodes, the test will automatically "wrap" by requiring the last node to communicate with node=0. Note that this can incur a performance penalty, as the processes on node=0 must respond to messages twice. Thus, the test does tend to favor even numbers of nodes.
With the code compiled, simply execute:
./ziatest N
    -or-
./ziatest N "LAUNCH_CMD OPTIONS"
where N is the number of processes to be launched on each node. If you are using OpenMPI and want to restrict the nodes, or are not using a common resource manager (e.g., SLURM or MOAB) to provide an allocation, then create a hostfile and add OMPI_MCA_hostfile=your-hostfile-name to your environment.
If you are using a different version of MPI, the optional second argument can be used to specify the command and other arguments.
If you are using OpenMPI, the output will appear in the following format:
$ ./ziatest 4
Time test was completed in 624.57 millisecs
Slowest rank: 8
Here is an example from a Cray CLE system using the optional second argument:
$ ./ziatest 4 "aprun -n 16 -N"
Time test was completed in 0:01 min:sec
Slowest rank: 16
As you can see, the time required to execute the test is displayed, along with the rank that reported the slowest time. The time is reported in min:sec if the test took longer than 1 minute to execute. The slowest-rank information is provided in the hope that it may prove of some diagnostic value.
If you question the output value, try running ziatest through the UNIX time command:
time ./ziatest 4
The result should be close to identical.
Run a logarithmic scaling study using base-10 multiples of double the number of cores per node. For example, if there are 26 cores per node, run:
52, 520, 5200, 52000, ... MAXIMUM_NUMBER_OF_MPI_RANKS
The output is self-explanatory.
Copyright (c) 2008 Los Alamos National Security, LLC. All rights reserved.