
Performance Tuning

Hints and tips on how to optimize your Burst Buffer performance

Note: this only applies to the Cori Burst Buffer and should not be taken as general Burst Buffer advice (i.e. your mileage will vary on other systems). This page will be updated as the DataWarp software is updated and performance continues to improve. 

For larger files, ensure your Burst Buffer allocation will be striped over multiple nodes

Currently, the Burst Buffer granularity is 82GiB in the wlm_pool, and 20.14GiB in the sm_pool. If you request an allocation of this size or smaller, your files will sit on a single BB node. If you request more than this amount, your files will be striped over multiple BB nodes. For example, if you request 82GiB in the wlm_pool then your files all sit on the same BB server - but if you request 82GiB in the sm_pool then your files will be striped over 5 BB nodes. This matters because each BB server has a maximum possible bandwidth (BW) of roughly 6.5GB/s, so your aggregate BW scales with the number of BB servers your files are spread over. Of course, if other people are accessing data on the same BB server at the same time, you will share that BW and are unlikely to reach the theoretical peak.
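For example, a batch job can request a multi-granule allocation in the default wlm_pool so that its files are striped over several BB servers. Below is a minimal sketch of such a job script; the node count, walltime, capacity and application name are placeholders, and DW_JOB_STRIPED is the environment variable DataWarp sets to the mount point of the striped allocation.

#!/bin/bash
#SBATCH -N 16
#SBATCH -C haswell
#SBATCH -t 00:30:00
#DW jobdw capacity=250GB access_mode=striped type=scratch
# 250GB spans several 82GiB wlm_pool granules, so files written below
# are striped over several BB servers
srun ./my_io_app $DW_JOB_STRIPED/output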

In general, it is better to stripe your data over many BB servers, particularly if you have a large number of compute nodes trying to access the data. The wlm_pool is the default - you can request the sm_pool by adding "pool=sm_pool" to your #DW command, e.g. 

#DW jobdw capacity=10GB access_mode=striped type=scratch pool=sm_pool

Note that there are a total of 80 nodes in the sm_pool, so if you request more than (80*20.14GiB =) 1611GiB of BB allocation in this pool, you are guaranteed to have multiple stripes on the same BB server, thereby halving your maximum possible BW.

The following figure shows current IOR results for MPI-IO shared file (mssf) and POSIX file-per-process (pfpp) runs, using a 100MiB block size and 1MiB transfer size, with files striped over increasing numbers of BB nodes, for both the wlm_pool and sm_pool. 16 compute nodes were used, with 4 MPI ranks per node. The benefit of striping over [1,2,4,8,16,32] BB nodes is clear in both granularity pools, although the scaling weakens beyond 4 BB nodes because the compute-side I/O load stays constant. The maximum possible bandwidth is roughly 6.6 GiB/s per BB node - but this can only be achieved if you generate enough I/O load to keep the nodes busy.

As a general rule, the number of BB nodes used by an application should be scaled up with the number of compute nodes, to keep the BB nodes busy but not over-subscribed. The exact ratio of compute to BB nodes will depend on the amount of IO load produced by the application. 

Figure: IOR bandwidth for increasing numbers of BB nodes, 100MiB block size, wlm_pool and sm_pool
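For reference, runs broadly similar to those in the figure could be launched with IOR's standard options. This is a sketch rather than the exact benchmark configuration used above; the rank counts and output path are placeholders.

# POSIX file-per-process (pfpp): 100MiB blocks, 1MiB transfers, 4 ranks on each of 16 nodes
srun -N 16 -n 64 ./ior -a POSIX -F -b 100m -t 1m -o $DW_JOB_STRIPED/ior_test
# MPI-IO shared file (mssf): same sizes, one shared file
srun -N 16 -n 64 ./ior -a MPIIO -b 100m -t 1m -o $DW_JOB_STRIPED/ior_test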

Use large transfer sizes, >512KiB

We have seen that using transfer sizes less than 512KiB results in poor performance. In general, we recommend using as large a transfer size as possible. The following figure shows IOR results configured as described above, using an 800GiB BB allocation in both the wlm_pool and sm_pool, for 100MiB block size and varying transfer size. Optimal performance is seen at 512KiB transfer size in the wlm_pool, with some fall-off above that in most cases. Results for the sm_pool appear to favour a larger transfer size. 

Figure: IOR bandwidth for varying transfer size, 100MiB block size, 800GiB allocation, wlm_pool and sm_pool
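If an application naturally produces many small writes, one option is to aggregate them in user space so that larger transfers reach the file system. The C sketch below uses stdio buffering to do this; the 1 MiB buffer size, record size and file name are illustrative only.

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("out.dat", "w");
    if (!fp) return 1;

    /* Give stdio a 1 MiB buffer so that many small fwrite() calls reach
       the Burst Buffer as large transfers instead of tiny ones. */
    static char stream_buf[1 << 20];
    setvbuf(fp, stream_buf, _IOFBF, sizeof(stream_buf));

    double record[128] = {0};   /* 1 KiB per logical write */
    for (int i = 0; i < 100000; i++)
        fwrite(record, sizeof(double), 128, fp);

    fclose(fp);
    return 0;
}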

Optimizing MPI-IO shared file I/O

Collective buffering is enabled by default when performing shared-file I/O using the MPI-IO API (mssf), but it does not perform optimally at most scales.  Up to 5x better performance can be achieved by disabling collective buffering using the MPI-IO hint "romio_cb_write=disable":

export MPICH_MPIIO_HINTS="*:romio_cb_write=disable"
srun --export=MPICH_MPIIO_HINTS ./my_app

This has been found to provide better MPI-IO shared file write performance at scales up to 16,384 cores.
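The same hint can also be set from inside the application through an MPI_Info object at file-open time, instead of via the environment. A minimal sketch follows; the file name and write sizes are illustrative and error handling is omitted.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Set the ROMIO hint in code rather than through MPICH_MPIIO_HINTS. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "disable");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    static double buf[131072];                        /* 1 MiB per rank */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 131072, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}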

It is also beneficial to ensure that I/O is aligned on 8 MB DVS stripe boundaries, and many I/O libraries built on top of MPI-IO provide ways to do this. For example, using the parallel HDF5 library's H5Pset_alignment can yield a modest but consistent speedup, as the figure below shows.
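A sketch of setting this alignment on an HDF5 file access property list follows; the 1 MiB threshold and the file name are illustrative, and the 8 MB alignment is written here as 8*1024*1024 bytes to match the DVS stripe boundary.

#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* File access property list: MPI-IO driver plus 8 MB alignment. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_alignment(fapl, 1048576, 8 * 1024 * 1024);   /* align objects >= 1 MiB */

    hid_t file = H5Fcreate("aligned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create datasets and write collectively here ... */
    H5Fclose(file);
    H5Pclose(fapl);

    MPI_Finalize();
    return 0;
}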

Figure: moderate speedup by aligning parallel HDF5 writes to 8 MB DVS stripe boundaries

Use more than 4 MPI processes per BB node

We have seen that the Burst Buffer cannot be kept busy with fewer than 4 processes writing to each BB node - fewer writers than this will not achieve the peak potential performance of roughly 6.6 GiB/s per node.
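For example (numbers illustrative), an allocation striped over 8 BB nodes should have at least 4 x 8 = 32 MPI ranks actively performing I/O; the compute-node count is a separate choice.

# allocation striped over 8 BB nodes -> at least 32 ranks doing I/O
srun -N 8 -n 32 ./my_io_app $DW_JOB_STRIPED/output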

Use Cray library iobuf to improve performance for small file sizes

Unlike Lustre, DVS does not currently use client-side caching, so small-file reads and writes will suffer in comparison to performance on Cori's $SCRATCH disk. This can be alleviated by linking the Cray iobuf library into your application - see this page for more details (but note that other information on that page is out-of-date).
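A sketch of the usual iobuf workflow follows; the buffering parameters are placeholders, so consult the linked page and the iobuf man page for values suited to your I/O pattern.

module load iobuf                          # make the iobuf library available
cc -o my_app my_app.c                      # relink the application with iobuf loaded
export IOBUF_PARAMS='*:size=8M:count=4'    # buffer every file in four 8MB buffers
srun ./my_app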