The information in this section pertains to both the MPI/MPL signal-handling library and the MPI threads library.
PE jobs can only run in a system partition that has all of its nodes at the same PSSP level. In a particular system partition, PE requires PSSP V3.2 or higher.
As the end user, you are encouraged to think of the Parallel Operating Environment (POE) (also referred to as the poe command) as an ordinary (serial) command. It accepts redirected I/O, can be run under the nice and time commands, interprets command flags, and can be invoked in shell scripts.
An n-task parallel job running in POE consists of: the n user tasks, a number of instances of the PE partition manager daemon (pmd) that is equal to the number of nodes, and the POE home node task in which the poe command runs. The pmd is the parent task of the user's task. There is one pmd for each node. A pmd is started by the POE home node on each machine on which a user task runs, and serves as the point of contact between the home node and the user's tasks.
The POE home node routes standard input, standard output, and standard error streams between the home node and the user's tasks via the pmd daemon, using TCP/IP sockets for this purpose. The sockets are created when the POE home node starts the pmd daemon for each task of a parallel job. The POE home node and pmd also use the sockets to exchange control messages to provide task synchronization, exit status and signaling. These capabilities do not depend upon the message passing library and are available to control any parallel program run by the poe command.
The exit status is any value from 0 through 255. This value, which is returned from POE on the home node, reflects the composite exit status of your parallel application as follows:
The POE job-step function is intended for the execution of a sequence of separate yet inter-related dependent programs. Therefore, it provides you with a job control mechanism that allows both job-step progression and job-step termination. The job control mechanism is the program's exit code.
POE continues the job-step sequence if the task exit code is 0 or in the range of 2 through 127.
POE terminates the parallel job, and does not execute any remaining user programs in the job-step list if the task exit code is 1 or greater than 127.
Any failure detected by the POE infrastructure (such as failure to open pipes to the child task or an exec failure to start the user's executable) terminates the parallel job, and none of the remaining user programs in the job-step queue are executed.
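The exit-code rule for job-step progression can be sketched as a small shell function. The name jobstep_action is hypothetical and is not part of POE; it only restates the ranges given above (0 or 2 through 127 continue, 1 or anything greater than 127 terminates).

```shell
# jobstep_action is a hypothetical helper, not part of POE. Given a task
# exit code, it prints what POE does with the rest of the job-step list.
jobstep_action() {
    code=$1
    if [ "$code" -eq 0 ] || { [ "$code" -ge 2 ] && [ "$code" -le 127 ]; }; then
        echo continue       # exit code 0, or 2 through 127
    else
        echo terminate      # exit code 1, or anything greater than 127
    fi
}

jobstep_action 0      # prints: continue
jobstep_action 1      # prints: terminate
jobstep_action 127    # prints: continue
jobstep_action 143    # prints: terminate
```

A program that wants to stop the job-step sequence without reporting a signal-style code can therefore exit with 1.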
POE links in the following routines when your executable is compiled with any of the POE compilation scripts, such as: mpcc, mpcc_r, or mpxlf.
POE installs signal handlers for most signals that cause program termination in order to notify the other tasks of termination. POE then causes the program to exit normally with a code of (128 + signal).
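A script that inspects the exit code can recover the signal number by subtracting 128. The following is a minimal sketch; signal_from_exit is a hypothetical helper name, not a POE command, and it assumes the code was produced by the (128 + signal) rule described above.

```shell
# signal_from_exit is a hypothetical helper, not part of POE. It prints the
# signal number encoded in an exit code of the form 128 + signal, and prints
# nothing for codes of 128 or less.
signal_from_exit() {
    code=$1
    if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
        echo $((code - 128))
    fi
}

signal_from_exit 143    # prints: 15 (the number used for SIGTERM)
```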
The signal-handling library and the threads library handle synchronous signals in the same way. This section includes information about installing your own signal handler for these signals.
For synchronous signals, you can install your own signal handlers by using the sigaction() system call. If you use sigaction(), you can use either the sa_handler member or the sa_sigaction member in the sigaction structure to define the signal-handling function. If you use the sa_sigaction member, the SA_SIGINFO flag must be set.
For the following signals, POE installs signal handlers that use the sa_sigaction format:
POE catches these signals, performs some cleanup, installs the default signal handler (or lightweight corefile generation), and re-raises the signal, which will terminate the task.
Users can install their own signal handlers, but they should save the address of the POE signal handler. If the user program decides to terminate, it should call the POE signal handler as follows:
saved.sa_flags = SA_SIGINFO;
(*saved.sa_sigaction)(signo, NULL, NULL);
If the user program decides not to terminate, it should just return to the interrupted code.
Do not use hard-coded file descriptor numbers beyond those specified by STDIN, STDOUT, and STDERR.
POE opens several files and uses file descriptors as message passing handles. These are allocated before the user gets control, so the first file descriptor allocated to a user is unpredictable.
POE provides for orderly termination of a parallel job, so that all tasks terminate at the same time. This is accomplished in the atexit routine registered at program initialization. For normal exits (codes 0, 2-127), the atexit routine sends a control message to the POE home node, and waits for a positive response. For abnormal exits and those which don't go through the atexit routine, the pmd daemon catches the exit code and sends a control message to the POE home node.
For normal exits, when POE gets a control message for every task, it responds to each node, allowing that node to exit normally with its individual exit code. The pmd daemon monitors the exit code and passes it back to the POE home node for presentation to the user.
For abnormal exits and those detected by pmd, POE sends a message to each pmd asking that it send a SIGTERM signal to its tasks, thereby terminating the task. When the task finally exits, pmd sends its exit code back to the POE home node and exits itself.
User-initiated termination of the POE home node via SIGINT (Ctrl-c) and/or SIGQUIT (Ctrl-\) causes a message to be sent to pmd asking that the appropriate signal be sent to the parallel task. Again, pmd waits for the tasks to exit, then terminates itself.
To prevent uncontrolled root access to the entire parallel job computation resource, POE checks to see that the user is not root as part of its authentication.
Use of the following AIX function may be limited, as no formal testing has been done:
You can have POE run a shell script which is loaded and run on the remote nodes as if it were a binary file.
If the POE home node task is not started under the Korn shell, mounted file system names may not be mapped correctly to the names defined for the automount daemon or AIX equivalent running on the IBM RS/6000 SP. See the IBM Parallel Environment for AIX: Operation and Use, Volume 1 for a discussion of alternative name mapping techniques.
The program executed by POE on the parallel nodes does not run under a shell on those nodes. Redirection and piping of STDIO applies to the POE home node (POE binary), and not the user's code. If shell processing of a command line is desired on the remote nodes, invoke a shell script on the remote nodes to provide the desired preprocessing before the user's application is executed.
The partition manager daemon (pmd) uses pipes to direct STDIN, STDOUT and STDERR to the user's program. Therefore, do not rewind these files.
Quotation marks, either single or double, used as argument delimiters are stripped away by the shell and are never "seen" by poe. Therefore, the quotation marks must be escaped to allow the quoted string to be passed correctly to the remote task(s) as one argument. For example, if you want to pass the following string to the user program (including the embedded blank)
a b
you need to enter the following:
poe user_program \"a b\"
user_program is passed the following argument as one token:
a b
Without the backslashes, the string would have been treated as two arguments (a and b).
POE behaves like rsh when arguments are passed to POE. Therefore, this command:
poe user_program "a b"
is equivalent to:
rsh some_machine user_program "a b"
In order to pass the string argument as one token, the quotation marks have to be "escaped".
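The effect of the second round of parsing can be reproduced locally without POE or rsh. In the sketch below, remote_argc is a hypothetical stand-in for the remote side: it joins its arguments into one string and lets a second shell parse that string again, which is the rsh-like behavior described above. It is a simulation only and makes no claim about how POE is implemented.

```shell
# remote_argc is a hypothetical stand-in for the remote side. Like rsh, it
# joins the arguments it receives into one string, has a second shell parse
# that string again, and prints how many arguments result.
remote_argc() {
    sh -c "set -- $*; echo \$#"
}

remote_argc "a b"      # quotes stripped locally, re-parsed as a b: prints 2
remote_argc \"a b\"    # escaped quotes survive to the second shell: prints 1
```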
Programs generating large volumes of STDOUT or STDERR may overload the home node. As described previously, standard output and standard error files generated by a user's program are piped to pmd, then forwarded to the POE binary via a TCP/IP socket. It is possible to generate so much data that the IP message buffers on the home node are exhausted, causing the POE binary, and possibly the entire node, to hang. Note that the option -stdoutmode (environment variable MP_STDOUTMODE) controls which output stream is displayed by the POE binary, but does not limit the standard output traffic received from the remote nodes, even if set to display the output of just one node.
The POE environment variable MP_SNDBUF can be used to override the default network settings for the size of the TCP buffers used.
If you have large volumes of standard I/O, work with your network administrator to establish appropriate TCP/IP tuning parameters. You may also want to examine if using named pipes is appropriate for your application.
When your program runs on the remote nodes, it has no controlling terminal. STDIN, STDOUT, and STDERR are always piped.
Programs that depend on piping standard input or standard output as part of a processing sequence may wish to bypass the home node POE binary. Running the poe command (or starting a program compiled with one of the POE compile scripts) causes the POE binary to be loaded on the machine on which you typed the command (the POE home node). The POE binary, in turn, starts a daemon named pmd on each parallel node assigned to run the job, and then requests pmd to run your executable (via fork and exec). The POE binary reads STDIN and passes it to each of the parallel tasks via a TCP/IP socket connection to the pmd daemon, which pipes it to the user. Similarly, STDOUT and STDERR from the user are piped to pmd and sent on the socket back to the home node, where they are written to the POE binary's STDOUT and STDERR descriptors. If you know that the task reading STDIN or writing STDOUT must be on the same node (processor) as the POE binary (the POE home node), named pipes can be used to bypass POE's reading and forwarding of STDIN and STDOUT.
If STDIN is piped or redirected to the POE binary (via ordinary pipes), and your application is linked with the signal-handling message passing library (via mpcc, mpxlf, or mpCC), set the environment variable MP_HOLD_STDIN to yes. This lets POE initialize the signal-handling library before handling the STDIN file.
If your application is linked with the threads library, see Standard I/O requires special attention for more information.
The following two scripts show how STDIN and STDOUT can be piped directly between pre- and post-processing steps, bypassing the POE home node task. This example assumes that parallel task 0 is known or forced to be on the same node as the POE home node.
The script compute_home runs on the home node; the script compute_parallel runs on the parallel nodes (those running tasks 0 through n-1).
compute_home:
#! /bin/ksh
# Example script compute_home runs three tasks:
# data_generator creates/gets data and writes to stdout
# data_processor is a parallel program that reads data
# from stdin, processes it in parallel, and writes
# the results to stdout.
# data_consumer reads data from stdin and summarizes it
#
mkfifo poe_in_$$
mkfifo poe_out_$$
export MP_STDOUTMODE=0
export MP_STDINMODE=0
data_generator >poe_in_$$ |
poe compute_parallel poe_in_$$ poe_out_$$ data_processor |
data_consumer <poe_out_$$
rc=$?
rm poe_in_$$
rm poe_out_$$
exit $rc
compute_parallel:
#! /bin/ksh
# Example script compute_parallel is a shell script that
# takes the following arguments:
# 1) name of input named pipe (stdin)
# 2) name of output named pipe (stdout)
# 3) name of program to be run (and arguments)
#
poe_in=$1
poe_out=$2
shift 2
$* <$poe_in >$poe_out
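The named-pipe plumbing used by these two scripts can be tried on a single machine without POE. The sketch below is an illustration under stated assumptions: the generator, processor, and consumer are hypothetical inline stand-ins for data_generator, data_processor, and data_consumer, and the pipes are created in a temporary directory (via mktemp -d) rather than in the current directory.

```shell
# Create the two named pipes in a scratch directory.
workdir=$(mktemp -d)
fifo_in="$workdir/poe_in"
fifo_out="$workdir/poe_out"
mkfifo "$fifo_in" "$fifo_out"

# Stand-in generator: writes three numbers to the input pipe.
printf '1\n2\n3\n' >"$fifo_in" &

# Stand-in processor: reads the input pipe, doubles each number, and
# writes the result to the output pipe (the role compute_parallel plays).
while read -r n; do echo $((n * 2)); done <"$fifo_in" >"$fifo_out" &

# Stand-in consumer: reads the output pipe and sums the values.
total=0
while read -r n; do total=$((total + n)); done <"$fifo_out"

wait                # let both background stages finish
rm -r "$workdir"    # remove the pipes, as compute_home does
echo "$total"       # prints: 12
```

Each stage blocks when it opens its pipe until the stage at the other end opens it too, so all three must be started concurrently, which is why the first two run in the background.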
Environment variables starting with MP_ are intended for use by POE, and should be set only as instructed in the documentation. POE also uses a handful of MP_ environment variables for internal purposes, which should not be interfered with.
If the value of MP_INFOLEVEL is greater than or equal to 1, POE will display any MP_ environment variables that it does not recognize, but it will continue normally.
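Before running poe, it can be useful to review which MP_ variables are currently exported, so that stray or misspelled names are noticed early. A minimal sketch follows; list_mp_vars is a hypothetical helper name, not a POE command.

```shell
# list_mp_vars is a hypothetical helper, not part of POE. It prints every
# exported environment variable whose name starts with MP_, one per line,
# and succeeds even when none are set.
list_mp_vars() {
    env | grep '^MP_' || true
}

MP_INFOLEVEL=1
export MP_INFOLEVEL
list_mp_vars        # output includes the line MP_INFOLEVEL=1
```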
POE assumes that NLSPATH contains the appropriate POE message catalogs, even if LANG is set to C or is not set. Duplicate message catalogs are provided for languages En_US, en_US, and C.
The FORTRAN, C, and C++ bindings for MPI are contained in the same library and can be freely intermixed.
The AIX compilers support the flag -qarch. This option allows you to target code generation to a particular processor architecture. While this option can provide performance enhancements on specific platforms, it inhibits portability, particularly between the Power and PowerPC(R) machines. The MPI library is not targeted to a specific architecture and is the same on PowerPC and Power nodes.
The MPI-IO and MPI one-sided subroutines, 64-bit application support, and other miscellaneous MPI-2 functions, are only available with the threads library.
In 32-bit addressing mode, AIX makes available up to 11 additional
address segments for end user programs. The MPI libraries use some of
these as listed in Table 1. The remaining segments are available to the user for either
extended heap (-bmaxdata option) or shared memory
(shmget). Very large jobs, which include all jobs with more
than 1000 tasks, will need to use the -bmaxdata option to ensure a
large enough heap for the MPI data buffers.
Table 1. Number of memory segments used by the MPI and LAPI libraries
| Component | RS/6000 SP node with switch | IBM |
|---|---|---|
| MPI IP | 1* | 0 |
| MPI user space | 2 | not available |
| Shared memory message passing (MPI) | 1** | 1** |
| LAPI shared memory | 2*** | 2*** |
| LAPI user space | 2 | not available |

The numbers are additive.
The RS/6000 SP Switch clock is a globally-synchronized counter that
can be used as a source for the MPI_WTIME function, provided that all tasks
are run on nodes of the same SP system. The environment variable
MP_CLOCK_SOURCE provides additional control. Table 2 shows how the clock source is determined. MPI
guarantees that the MPI attribute MPI_WTIME_IS_GLOBAL has the same value at
every task.
Table 2. How the clock source is determined
| MP_CLOCK_SOURCE | Library version | Are all nodes SP? | Source used | MPI_WTIME_IS_GLOBAL |
|---|---|---|---|---|
| AIX | ip | yes | AIX | false |
| AIX | ip | no | AIX | false |
| AIX | us | yes | AIX | false |
| AIX | us | no | AIX | false |
| SWITCH | ip | yes* | switch | false |
| SWITCH | ip | no | AIX | false |
| SWITCH | us | yes | switch | true |
| SWITCH | us | no | Error | |
| not set | ip | yes | switch | false |
| not set | ip | no | AIX | false |
| not set | us | yes | switch | true |
| not set | us | no | Error | |
If you plan to run your parallel applications with a large number of tasks (more than 256), the following tips may improve stability and performance:
ulimit -n 10000
Running a POE job that uses MALLOCDEBUG with an align:n option of other than 8 may result in undefined behavior. To allow the parallel program being run by POE (myprog, for example) to run with an align:n option of other than 8, create the following script (called myprog.sh, for example):
MALLOCTYPE=debug MALLOCDEBUG=align:0 myprog myprog_options
and then run:
poe myprog.sh poe_options
instead of:
poe myprog poe_options myprog_options