MVAPICH 0.9.4 Performance Tuning Guide
Version 1.0

MVAPICH Team
Network-Based Computing Laboratory
The Ohio State University
---------------------------------------------------------------------
---------------------------------------------------------------------

Table of Contents:

- Overview
- Classes of parameters
    - Compiling time parameters
    - Run time parameters
- Parameter descriptions and instructions for tuning
    - VBUF_TOTAL_SIZE
    - VIADEV_DEVICE
    - VIADEV_RDMA_LIMIT
    - VIADEV_SQ_SIZE
    - VIADEV_CQ_SIZE
    - VIADEV_NUM_RDMA_BUFFER
    - VIADEV_MAX_RDMA_SIZE
    - VIADEV_DEFAULT_MTU
    - VIADEV_MAX_FAST_EAGER_SIZE
    - VIADEV_DEFAULT_MAX_SG_LIST
    - VIADEV_RENDEZVOUS_THRESHOLD
    - NDREG_ENTRIES
    - VIADEV_VBUF_POOL_SIZE
    - VIADEV_VBUF_SECONDARY_POOL_SIZE
    - Flow control related parameters
        - VIADEV_INITIAL_PREPOST_DEPTH
        - VIADEV_PREPOST_DEPTH
        - VIADEV_CREDIT_NOTIFY_THRESHOLD
        - VIADEV_DYNAMIC_CREDIT_THRESHOLD
    - SMP communication related parameters
        - SMPI_MAX_NUMLOCALNODES
        - SMPI_LENGTH_QUEUE
        - SMP_EAGERSIZE
    - Multicast based broadcast related parameters
        - MCST_THRESHOLD
        - VIADEV_UD_PREPOST_DEPTH
        - VIADEV_UD_PREPOST_THRESHOLD
        - SENDER_WINDOW
        - BCAST_TIME_OUT
    - Parameters for Multi-Rail Configuration
    - Other InfiniBand initialization parameters

Overview
---------------------------------------------------------------------
---------------------------------------------------------------------

This document describes how to tune parameters in the MVAPICH release
0.9.4. For each of these parameters, the default value and a brief
description of how to tune it are given.

Tuning parameters fall roughly into two classes: compiling time
parameters and run time parameters. Compiling time parameters can only
be changed at compile time, so both the MVAPICH stack and the
applications need to be recompiled. Run time parameters have default
values, but they can be changed at run time through user-defined
environment variables.
In this case, no recompilation of applications or MVAPICH is
necessary.

Classes of parameters
---------------------------------------------------------------------
---------------------------------------------------------------------

Compiling time parameters:

Parameters in this class are macros. They must be set by changing the
source code file. Both MVAPICH and the applications must be recompiled
for the change to take effect.

Run time parameters:

Each parameter in this class has a corresponding environment variable.
The default value of the parameter can be changed by setting the
corresponding environment variable. This can be done separately in the
shell or on the command line where the MPI program is launched. Here
is an example of setting a parameter (DISABLE_RDMA_ALLTOALL) on the
command line:

$ mpirun_rsh -np 4 n0 n1 n2 n3 DISABLE_RDMA_ALLTOALL=1 ./cpi

Parameter descriptions
---------------------------------------------------------------------
---------------------------------------------------------------------

VBUF_TOTAL_SIZE:
Class: Compiling time
Location: (vbuf.h)
Default: Architecture dependent. (12*1024) for IA-32.

This macro defines the size of each "vbuf". Vbufs store the
descriptors and packets used in the underlying communication (send,
receive and RDMA). In the current implementation, each eager data
packet must fit into one vbuf, so this value also puts an upper limit
on the size of an eager data packet. (Please note that the eager data
payload is even smaller, because the packet header and the descriptor
also occupy space in the vbuf.) However, a large value may lead to
more wasted memory.

Different presets for this value are available for different sizes of
clusters. Please use -D_SMALL_CLUSTER, -D_MEDIUM_CLUSTER and
-D_LARGE_CLUSTER for cluster sizes of 0-64, 64-256, and 256 and
beyond, respectively.

-----------------------------------------------------------------------

VIADEV_DEVICE:
Class: Run time

Name of the InfiniBand device.
By default, the first device in the system is used.

-----------------------------------------------------------------------

VIADEV_RDMA_LIMIT:
Class: Run time

Upper limit on the number of outstanding RDMA operations at the
InfiniBand level. Effective only when macro VIADEV_HAVE_RDMA_LIMIT is
defined. This value is 2 by default. However, it should be set
according to the capability of the HCAs.

-----------------------------------------------------------------------

VIADEV_SQ_SIZE:
Class: Run time

Upper limit on the number of Send Queue entries at the InfiniBand
level. Note that the number of Receive Queue entries is calculated
automatically in MVAPICH. This value should be large enough to hold
all outstanding send/RDMA requests. By default, it is set to 200.

-----------------------------------------------------------------------

VIADEV_CQ_SIZE:
Class: Run time

Upper limit on the number of Completion Queue entries at the
InfiniBand level. This must be large enough to hold all the
outstanding signaled communication operations. The default value is
40000.

-----------------------------------------------------------------------

VIADEV_NUM_RDMA_BUFFER:
Class: Run time

The number of RDMA buffers used for the RDMA fast path. This "fast
path" is used to reduce the latency and overhead of small data and
control messages. This value is effective only when macro
RDMA_FAST_PATH is defined. The default value is architecture
dependent.

Different presets for this value are available for different sizes of
clusters. Please use -D_SMALL_CLUSTER, -D_MEDIUM_CLUSTER and
-D_LARGE_CLUSTER for cluster sizes of 0-64, 64-256, and 256 and
beyond, respectively.

-----------------------------------------------------------------------

VIADEV_MAX_RDMA_SIZE:
Class: Run time

The upper limit on message size when IB RDMA is used in MVAPICH.
Messages such as Rendezvous data are divided into smaller chunks if
their sizes exceed this limit. The default value is 1048576.
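
As a rough illustration of the chunking described for
VIADEV_MAX_RDMA_SIZE, the number of RDMA operations needed for one
message can be estimated as the message size divided by the limit,
rounded up. This is only a sketch of the arithmetic, not MVAPICH code:
the helper name rdma_chunks is hypothetical, and it assumes every
chunk except the last is exactly VIADEV_MAX_RDMA_SIZE bytes.

```shell
# Illustrative arithmetic only; this is not part of MVAPICH.
VIADEV_MAX_RDMA_SIZE=1048576   # default from this guide, in bytes

# rdma_chunks (hypothetical helper): ceil(message_bytes / limit)
rdma_chunks() {
    echo $(( ($1 + VIADEV_MAX_RDMA_SIZE - 1) / VIADEV_MAX_RDMA_SIZE ))
}

rdma_chunks 1048576    # message equal to the limit: a single operation
rdma_chunks 2500000    # message above the limit: split into chunks
```

Lowering VIADEV_MAX_RDMA_SIZE in this sketch increases the chunk count
for the same message, which is the trade-off to keep in mind when
changing the limit.
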
-----------------------------------------------------------------------

VIADEV_DEFAULT_MTU:
Class: Run time

The internal MTU used for IB. This parameter must be given as a string
rather than an integer. Valid values are: "MTU256", "MTU512",
"MTU1024", "MTU2048", "MTU4096". The default value is "MTU1024". Note
that current Mellanox IB devices only support MTUs up to 2048.

-----------------------------------------------------------------------

VIADEV_MAX_FAST_EAGER_SIZE:

This specifies the maximum size of the messages that are sent using
"header caching". The default value is 255. Please note that this
value cannot exceed 255 in the current implementation.

-----------------------------------------------------------------------

VIADEV_DEFAULT_MAX_SG_LIST:

This specifies the maximum number of gather/scatter entries supported
for each queue pair. Currently, InfiniBand communication uses only one
gather/scatter entry. However, this parameter also affects the maximum
size of data that can be sent using "inline": a larger
VIADEV_DEFAULT_MAX_SG_LIST value allows larger messages to be sent
inline. The default value is 20.

-----------------------------------------------------------------------

VIADEV_RENDEZVOUS_THRESHOLD:
Class: Run time

This specifies the switch point between the eager and rendezvous
protocols in MVAPICH. Please note that since the size of the vbufs
puts an upper limit on this value, you can only decrease it at run
time, not increase it.

-----------------------------------------------------------------------

NDREG_ENTRIES:
Class: Run time

This defines the total number of buffers that can be stored in the
registration cache. It has no effect if LAZY_MEM_UNREGISTER is not
defined. A larger value leads to less frequent lazy de-registration.
However, the underlying IB layer may limit the total amount of memory
a process can register. If you are experiencing memory registration
failures, please try decreasing this value.
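
The run time parameters above are set the same way as in the
mpirun_rsh example earlier in this guide. The sketch below only builds
and prints such a command line; it does not launch anything. The value
100 for NDREG_ENTRIES is a made-up illustration (this guide does not
state its default), and the sketch assumes that several VAR=value
settings can be listed before the executable in the same way as a
single one.

```shell
# Hypothetical values for illustration; host names and ./cpi follow the
# mpirun_rsh example earlier in this guide.
NDREG_ENTRIES=100              # assumed value, smaller than the default
VIADEV_DEFAULT_MTU=MTU2048     # a string such as "MTU2048", not an integer

# Build the launch command (assumes multiple VAR=value pairs are accepted).
CMD="mpirun_rsh -np 4 n0 n1 n2 n3 NDREG_ENTRIES=$NDREG_ENTRIES VIADEV_DEFAULT_MTU=$VIADEV_DEFAULT_MTU ./cpi"

# Print the command instead of running it, so it can be inspected first.
echo "$CMD"
```
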
-----------------------------------------------------------------------

VIADEV_VBUF_POOL_SIZE:
Class: Run time

The number of vbufs in the initial pool. This pool is shared among all
the connections. A large value leads to more initial memory usage.
However, a small value may lead to memory allocation at run time and
degrade performance. The default value is 5000.

-----------------------------------------------------------------------

VIADEV_VBUF_SECONDARY_POOL_SIZE:
Class: Run time

The number of vbufs allocated each time the global pool runs out of
the vbufs in the initial pool. These vbufs are also shared among all
the connections. A large value may waste memory, but if the value is
too small, memory allocation may be frequent at run time and degrade
performance. The default value is 500.

-----------------------------------------------------------------------

Flow control related parameters:

VIADEV_INITIAL_PREPOST_DEPTH:
Class: Run time

This defines the initial number of pre-posted receive buffers for each
connection. If communication happens on a particular connection, the
number of buffers is increased to VIADEV_PREPOST_DEPTH. The default
value is 5.

VIADEV_PREPOST_DEPTH:
Class: Run time

This defines the number of buffers pre-posted for each connection to
handle send/receive operations. If RDMA_FAST_PATH is enabled, this
parameter can be set to a small value (such as 32 or 64). It should
not be set to a very large value on large systems, because memory
consumption grows with both this value and the number of connections.
The default value is 64.

VIADEV_CREDIT_NOTIFY_THRESHOLD:
VIADEV_DYNAMIC_CREDIT_THRESHOLD:
Class: Run time

Flow control information is usually piggybacked on other messages.
These two parameters determine when explicit flow control update
messages are sent. The default values are 5 and 10, respectively.

-----------------------------------------------------------------------

SMP communication related parameters:

SMPI_MAX_NUMLOCALNODES:
Class: Compiling time
Location: mpid_smpi.h

This macro has no effect if macro _SMP_ is not defined. It specifies
the upper limit on the number of processes MPI supports on a single
node. Usually it can be set to the maximum number of physical
processors on an SMP node (if you are not running more than one
process on a single processor). The default value is 4.

SMPI_LENGTH_QUEUE:
Class: Run time

This has no effect if macro _SMP_ is not defined. It defines the size
of the shared buffer between every two processes on the same node. A
larger value may allow more communication without waiting for flow
control. However, a smaller value saves resources. Note that this
variable is set in MBytes. The default value is 4.

SMP_EAGERSIZE:
Class: Run time

This has no effect if macro _SMP_ is not defined. It defines the
switch point from the Eager protocol to the Rendezvous protocol for
intra-node communication. If macro _SMP_RNDV_ is defined, then for
messages larger than SMP_EAGERSIZE the SMP Rendezvous protocol is
used: a message is split into smaller packets and sent through shared
memory in a pipelined manner. The packet size is the same as the value
of SMP_EAGERSIZE. If macro _SMP_RNDV_ is not defined, then IB is used
for the intra-node Rendezvous protocol. In the latter case, the value
of this variable should be determined by shared memory communication
performance (memory speed, cache size, ...) and IB performance. Note
that this variable is set in KBytes. The default value is architecture
dependent.

-----------------------------------------------------------------------

Multicast based broadcast related parameters:
Class: Compiling time

MCST_THRESHOLD:
Location: (src/coll/intra_fns_new.c)

MPI_Bcast uses hardware multicast for messages of up to MCST_THRESHOLD
bytes. This parameter is currently set to 8192, based on
experimentation on 16-node systems.
For large scale systems, this threshold may be increased to get the
benefit of hardware-based multicast for larger messages. For example,
16 KB for 32-node systems, 32 KB for 64-node systems, and so on.

VIADEV_UD_PREPOST_DEPTH:
Location: (mpid/vapi/bcast_info.h)

This parameter indicates the total number of buffers preposted for UD
messages. This value is currently 32.

VIADEV_UD_PREPOST_THRESHOLD:

This parameter defines a low water mark for the number of buffers
posted for UD messages. When this water mark is reached, buffers are
posted until their number reaches the value defined in
VIADEV_UD_PREPOST_DEPTH. Currently, this value is set to 16.

SENDER_WINDOW:
Location: (mpid/vapi/bcast_info.h)

This parameter indicates the maximum number of outstanding MPI_Bcasts
allowed at any root. After issuing this many broadcasts, the root
blocks if it has not received acks for any of these MPI_Bcasts. This
parameter is currently set to 512.

BCAST_TIME_OUT:
Location: (mpid/vapi/bcast_info.h)

This parameter indicates how long the root waits for the ack before
retransmitting the message. Currently, this is set to one second.

-----------------------------------------------------------------------

Parameters for Multi-Rail Configuration

In addition to the above environment variables, the Multi-Rail
configuration provides the following parameters.

NUM_PORTS:
Class: Run time

This parameter indicates the number of ports per adapter to be used
for communication on an end node. By default, this value is set to 2.

NUM_HCAS:
Class: Run time

This parameter indicates the number of adapters to be used for
communication on an end node. By default, this value is set to 1.

STRIPING_THRESHOLD:
Class: Run time

For a class of messages, a user may want to use the Rendezvous
protocol and not stripe the data across multiple ports/adapters. For
messages of size equal to or above this value, the data is striped
across multiple paths.
This value should be at least equal to VIADEV_RENDEZVOUS_THRESHOLD.
The value of STRIPING_THRESHOLD is currently equal to
VIADEV_RENDEZVOUS_THRESHOLD. For optimal performance, this value may
need to be changed depending on the number of ports and the number of
adapters in the system.

-----------------------------------------------------------------------

Other InfiniBand parameters used during initialization:
Class: Run time

These parameters are mostly used to initialize the underlying
InfiniBand layer used by MVAPICH communication.

VIADEV_DEFAULT_QP_OUS_RD_ATOM
VIADEV_DEFAULT_PSN
VIADEV_DEFAULT_PKEY_IX
VIADEV_DEFAULT_P_KEY
VIADEV_DEFAULT_MIN_RNR_TIMER
VIADEV_DEFAULT_SERVICE_LEVEL
VIADEV_DEFAULT_TIME_OUT
VIADEV_DEFAULT_STATIC_RATE
VIADEV_DEFAULT_SRC_PATH_BITS
VIADEV_DEFAULT_RETRY_COUNT
VIADEV_DEFAULT_RNR_RETRY
VIADEV_DEFAULT_MAX_RDMA_DST_OPS