Optimizing I/O performance on the Lustre file system
Lustre File Striping
Edison and Cori use Lustre as their $SCRATCH file systems. For many applications a technique called file striping will increase IO performance. File striping will primarily improve performance for codes doing serial IO from a single node or parallel IO from multiple nodes writing to a single shared file as with MPI-IO, parallel HDF5 or parallel NetCDF.
The Lustre file system is made up of an underlying set of IO servers and disks called Object Storage Servers (OSSs) and Object Storage Targets (OSTs) respectively. A file is said to be striped when read and write operations access multiple OSTs concurrently. File striping is a way to increase IO performance since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth.
NERSC Striping Shortcuts
- The default striping is 2 on Edison's $SCRATCH (backed by either /scratch1 or /scratch2), 8 on Edison's $SCRATCH3, and 1 on Cori's $SCRATCH.
- This means that each file created with the default striping is split across 2 OSTs on Edison's primary scratch filesystems, and 8 on Edison's specialized $SCRATCH3. On Cori, the default striping allocates 1 OST for the file.
- Selecting the best striping can be complicated since striping a file over too few OSTs will not take advantage of the system's available bandwidth but striping over too many will cause unnecessary overhead and lead to a loss in performance.
- NERSC has provided striping command shortcuts based on file size to simplify optimization on both Edison and Cori.
- Users who want more detail should read the sections below or contact the consultants at firstname.lastname@example.org.
|Striping Shortcut Commands|
|I/O Pattern||Single Shared-File I/O||File per Processor|
|Description||Either one processor does all the I/O for a simulation in serial or multiple processors write to a single shared file as with MPI-IO and parallel HDF5 or NetCDF||Each processor writes to its own file resulting in as many files as number of processors used (for a given output)|
|Size of File||Command (Single Shared-File I/O)||Command (File per Processor)|
|< 1GB||Do Nothing. Use default striping.||keep default striping|
|~1GB - ~10GB||stripe_small||keep default striping|
|~10GB - ~100GB||stripe_medium||keep default striping|
|~100GB - 1TB+||stripe_large||Ask consultants|
Striping must be set on a file before it is written. For example, one could simultaneously create an empty file that will later be 10-100 GB in size and set its striping appropriately with the command:
% stripe_medium [myOutputFile]
This could be done before running a job which will later populate this file. Striping of a file cannot be changed once the file has been written to, aside from copying the existing file into a newly created (empty) file with the desired striping.
stripe_small sets the stripe count to 8 OSTs, stripe_medium to 24 OSTs, and stripe_large to 72 OSTs. In all cases, the stripe size is 1MB.
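The size ranges in the shortcut table can be turned into a small selection helper. The sketch below is illustrative only: the function name pick_stripe_shortcut is hypothetical, and because the table gives approximate boundaries ("~1GB", "~10GB"), the hard cut-offs used here are an assumption. It applies to the single shared-file case; for file-per-processor output the table says to keep the default striping.

```shell
#!/bin/bash
# Hypothetical helper (not a NERSC command): map an expected file size,
# given as a whole number of GB, to the shortcut the table suggests for
# single shared-file I/O. The exact cut-offs are an assumption.
pick_stripe_shortcut() {
    local size_gb="$1"
    if   [ "$size_gb" -lt 1 ];   then echo "default"          # keep default striping
    elif [ "$size_gb" -lt 10 ];  then echo "stripe_small"     # 8 OSTs
    elif [ "$size_gb" -lt 100 ]; then echo "stripe_medium"    # 24 OSTs
    else                              echo "stripe_large"     # 72 OSTs
    fi
}

# Example use before a run whose output file should reach about 50 GB:
#   shortcut=$(pick_stripe_shortcut 50)
#   [ "$shortcut" != "default" ] && "$shortcut" myOutputFile
```

Keeping the decision in one function makes it easy to adjust the thresholds if benchmarking shows a different stripe count works better for a particular code.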
Files inherit the striping configuration of the directory in which they are created. Importantly, the desired striping must be set on the directory before creating the files (later changes of the directory striping are not inherited). When copying an existing striped file into a striped directory, the new copy will inherit the directory's striping configuration. This provides another approach to changing the striping of an existing file.
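The copy-based approach to changing the striping of an existing file can be sketched as a short shell function. This is a minimal illustration, not a NERSC-provided tool: the function name restripe_file and the temporary file naming are hypothetical, and it assumes there is enough free space to hold a second copy of the file while it runs.

```shell
# Hypothetical helper: give an existing file a new stripe count by copying
# it into a freshly created, empty file that already has the desired layout.
# Usage: restripe_file <stripe-count> <file>
restripe_file() {
    local count="$1" src="$2"
    local tmp="${src}.restripe.$$"     # temporary name; an assumption

    # Create the empty target with the requested striping first ...
    lfs setstripe --count "$count" "$tmp" || return 1

    # ... then fill it. cp writes into the already existing file, so the
    # layout chosen above is kept.
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }

    # Replace the original only after the copy has succeeded.
    mv "$tmp" "$src"
}

# Example: restripe_file 24 myOutputFile
```

The temporary file is created next to the original so that the final mv is a rename within the same directory rather than another copy.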
Inheritance of striping provides a convenient way to set the striping on multiple output files at once, if all such files are written to the same output directory. For example, if a job will produce multiple 10-100 GB output files in a known output directory, the striping of the latter can be configured before job submission:
% stripe_medium [myOutputDir]
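Or one could put the striping command directly into the job script, before the application is launched:
#!/bin/bash -l
#SBATCH -p debug
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH -J my_job
# Set the striping on the output directory before the application writes to it
stripe_medium myOutputDir
srun -n 10 ./a.out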
More Details on File Striping
To set striping for a file or directory use the command
% lfs setstripe
Each file and directory can have a separate striping pattern and a directory's striping setting can be overridden for a particular file by issuing the lfs setstripe command for individual files within that directory. However, as noted above, striping settings for a file must be set before it is created. If the settings for an existing file are changed, it will only get the new striping setting if the file is recreated. If the settings for an existing directory are changed, the files need to be copied elsewhere and then copied back to the directory in order to inherit the new settings. The lfs setstripe syntax is:
% lfs setstripe --size [stripe-size] --index [OST-start-index] --count [stripe-count] filename
|lfs setstripe arguments|
|Argument||Description||Default|
|stripe-size||Number of bytes written to one OST before cycling to the next. Use multiples of 1MB. The default has been most successful.||1MB|
|OST-start-index||Starting OST. The default is highly recommended.||-1 (System follows a round robin procedure to optimize creation of files by all users.)|
|stripe-count||Number of OSTs (disks) a file exists on||2 on Edison, 8 on Edison's /scratch3, 1 on Cori|
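To make the interplay of stripe-size and stripe-count concrete, the arithmetic of round-robin striping can be written out: a file is cut into stripe-size chunks, and consecutive chunks cycle through the file's OSTs. The function name stripe_slot below is hypothetical, and the slot number it prints is a position within the file's own set of OSTs (0 is the OST the file starts on), not a system-wide OST index.

```shell
# Hypothetical helper: which of a file's OSTs holds a given byte?
# Usage: stripe_slot <byte-offset> <stripe-size-in-bytes> <stripe-count>
# Prints a slot number between 0 and stripe-count - 1.
stripe_slot() {
    local offset="$1" size="$2" count="$3"
    # Chunk number within the file, wrapped around the number of OSTs.
    echo $(( (offset / size) % count ))
}

# With a 1MB stripe size (1048576 bytes) and a stripe count of 2:
#   stripe_slot 0       1048576 2    # first MB  -> slot 0
#   stripe_slot 1048576 1048576 2    # second MB -> slot 1
#   stripe_slot 2097152 1048576 2    # third MB  -> slot 0 again
```

This is why a large sequential transfer to a file with a stripe count of N can keep up to N OSTs busy at once, while a file with a stripe count of 1 is always served by a single OST.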
For large shared files, set the stripe-count to something large. On Edison, the limits on stripe-count are different: the /scratch1 and /scratch2 filesystems backing $SCRATCH each have 96 OSTs, while the specialized /scratch3 filesystem has 36 OSTs. On Cori, there are 248 OSTs (but currently, we recommend the striping on Cori to be less than 160 OSTs, otherwise it will cause filesystem problems).
% lfs setstripe myfile --count 80
For applications where each processor writes its own file, set the stripe-count to 1.
% lfs setstripe myfile --count 1
An application's best performance will likely fall somewhere between these two extremes.
The lfs getstripe command will give the striping pattern for a file or directory.
% lfs getstripe testdir
stripe_count: 11 stripe_size: 0 stripe_offset: -1
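In a script it can be handy to check a directory's striping before a job writes into it. The sketch below extracts a single field from a summary line of the form shown above. The function name getstripe_field is hypothetical, and the output format of lfs getstripe can differ between Lustre versions, so treat the parsing as an assumption to verify on your system.

```shell
# Hypothetical helper: pull one field out of the summary line printed by
# "lfs getstripe <dir>". Reads the text on standard input.
# Usage: lfs getstripe testdir | getstripe_field stripe_count
getstripe_field() {
    awk -v key="$1:" '{
        # Look for "<field>:" and print the word that follows it.
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Example: warn if the output directory was left at a stripe count of 1.
#   count=$(lfs getstripe myOutputDir | getstripe_field stripe_count)
#   [ "$count" = "1" ] && echo "myOutputDir is not striped" >&2
```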
MPI-IO Collective Mode
Collective mode refers to a set of optimizations available in many implementations of MPI-IO that improve the performance of large-scale IO to shared files. To enable these optimizations, you must use the collective calls in the MPI-IO library that end in _all, for instance MPI_File_write_at_all(). Also, all MPI tasks in the given MPI communicator must participate in the collective call, even if they are not performing any IO operations. The MPI-IO library has a heuristic to determine whether to enable collective buffering, the primary optimization used in collective mode.
Collective buffering, also called two-phase IO, breaks the IO operation into two stages. For a collective read, the first stage uses a subset of MPI tasks (called aggregators) to communicate with the IO servers (OSTs in Lustre) and read a large chunk of data into a temporary buffer. In the second stage, the aggregators ship the data from the buffer to its destination among the remaining MPI tasks using point-to-point MPI calls. A collective write does the reverse, aggregating the data through MPI into buffers on the aggregator nodes, then writing from the aggregator nodes to the IO servers. The advantage of collective buffering is that fewer nodes are communicating with the IO servers, which reduces contention while still attaining high performance through concurrent I/O transfers. In fact, Lustre prefers a one-to-one mapping of aggregator nodes to OSTs.
Since the release of mpt/3.3, Cray has included a Lustre-aware implementation of the MPI-IO collective buffering algorithm. This implementation is able to buffer data on the aggregator nodes into stripe-sized chunks so that all reads and writes to the Lustre filesystem are automatically stripe aligned without requiring any padding or manual alignment from the developer. Because of the way Lustre is designed, alignment is a key factor in achieving optimal performance.
Several environment variables can be used to control the behavior of collective buffering on Edison and Cori. The MPICH_MPIIO_HINTS variable specifies hints to the MPI-IO library that can, for instance, override the built-in heuristic and force collective buffering on:
% setenv MPICH_MPIIO_HINTS "*:romio_cb_write=enable:romio_ds_write=disable"
Placing this command in your batch script before the srun command will cause your program to use these hints. The * indicates that the hint applies to any file opened by MPI-IO, while romio_cb_write controls collective buffering for writes and romio_ds_write controls data sieving for writes, an older collective mode optimization that is no longer used and can interfere with collective buffering. The options for these hints are enable, disable, or automatic (the default value, which uses the built-in heuristic).
It is also possible to control the number of aggregator nodes using the cb_nodes hint, although the MPI-IO library will automatically set this to the stripe count of your file.
When set to 1, the MPICH_MPIIO_HINTS_DISPLAY variable causes your program to dump a summary of the current MPI-IO hints to stderr each time a file is opened. This is useful for debugging and as a sanity check against spelling errors in your hints.
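For job scripts written in bash rather than csh, the same variables are set with export. The sketch below also shows one way to assemble the hint string from its parts, which helps avoid typos when several hints are combined. The function name mpiio_hints is hypothetical, and the cb_nodes value in the commented example is only an illustration; as noted above, the library normally chooses it for you.

```shell
# Hypothetical helper: assemble a value for MPICH_MPIIO_HINTS from a file
# pattern and a list of key=value hints, joined with ":".
# Usage: mpiio_hints <file-pattern> <key=value> [<key=value> ...]
mpiio_hints() {
    local hints="$1"
    shift
    local kv
    for kv in "$@"; do
        hints="$hints:$kv"
    done
    echo "$hints"
}

# In a bash job script, before the launch command:
#   export MPICH_MPIIO_HINTS="$(mpiio_hints '*' romio_cb_write=enable romio_ds_write=disable)"
#   export MPICH_MPIIO_HINTS_DISPLAY=1    # print the hints in effect at file open
#
# Adding an explicit aggregator count (illustrative value only):
#   export MPICH_MPIIO_HINTS="$(mpiio_hints '*' romio_cb_write=enable cb_nodes=8)"
```

With MPICH_MPIIO_HINTS_DISPLAY set to 1, the summary written to stderr can be compared against the string built here to confirm the hints were accepted.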
More detail on MPICH runtime environment variables, including a full list and description of MPI-IO hints, is available from the intro_mpi man page on Edison.