NERSC: Powering Scientific Discovery Since 1974

Introduction to Scientific I/O

The Lustre File System

Lustre is a scalable, POSIX-compliant parallel file system designed for large, distributed-memory systems, such as Hopper and Edison at NERSC. It uses a client-server model with separate servers for file metadata and file content, as shown in Figure 2.1. For example, on Hopper, the /scratch and /scratch2 file systems each have a single metadata server (which can become a bottleneck when working with thousands of files) and 156 Object Storage Targets (OSTs) that store the contents of files.


Figure 2.1. The general architecture of cluster file systems such as Lustre that have separate object and metadata stores. In Lustre, the I/O servers are called Object Storage Targets (OSTs).

Although Lustre is designed to handle any POSIX-compliant I/O pattern correctly, in practice it performs much better when I/O accesses are aligned to Lustre's fundamental unit of storage, the stripe, which has a default size of 1 MB on NERSC systems.
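As a rough illustration of what "aligned" means here, the following sketch checks whether an I/O request starts on a stripe boundary and covers a whole number of stripes. The function name is_stripe_aligned is hypothetical (it is not a Lustre or NERSC tool), and the default assumes that a 1 MB stripe is 1048576 bytes.

```shell
# Hypothetical helper (not part of Lustre): succeeds (exit status 0) when a
# request starts on a stripe boundary and spans a whole number of stripes.
# Arguments: $1 = offset in bytes, $2 = length in bytes,
#            $3 = stripe size in bytes (optional).
# Assumption: the default stripe size of 1 MB means 1048576 bytes.
is_stripe_aligned() {
    [ $(( $1 % ${3:-1048576} )) -eq 0 ] && [ $(( $2 % ${3:-1048576} )) -eq 0 ]
}
```

For example, `is_stripe_aligned 0 4194304 && echo aligned` prints `aligned`, whereas a request starting at byte offset 512 fails the check.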

Striping is a method of dividing up a shared file across many OSTs, as shown in Figure 2.2. Each stripe is stored on a different OST, and the assignment of stripes to OSTs is round-robin.

Figure 2.2. A shared file striped across four OSTs.



The reason for striping a file is to increase the available bandwidth. Writing to several OSTs in parallel aggregates the bandwidth of the individual OSTs. This is the same principle behind the use of striping in a RAID array of disks. In fact, the disk arrays backing a Lustre file system are also RAID striped, so you can think of Lustre striping as a second layer of striping that allows you to access every physical disk in the file system if you want the maximum available bandwidth (i.e., by striping over all available OSTs).

Lustre allows you to modify three striping parameters for a shared file:

  • the stripe count controls how many OSTs the file is striped over (for example, the stripe count is 4 for the file shown in Figure 2.2);
  • the stripe size controls the number of bytes in each stripe; and
  • the start index chooses where in the list of OSTs to start the round-robin assignment (the default value -1 allows the system to choose the offset in order to load balance the file system).
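To make the round-robin assignment concrete, here is a minimal sketch that computes which of a file's OSTs holds a given byte offset. The function name ost_slot_for_offset is hypothetical, and the result is a position within the file's own list of OSTs (0 up to stripe count minus 1), not a global OST index; which physical OST that position corresponds to depends on the start index and on the system's allocation.

```shell
# Hypothetical helper: print the position (0 .. stripe_count-1) of the OST,
# within the file's own OST list, that holds a given byte offset.
# Arguments: $1 = byte offset, $2 = stripe size in bytes, $3 = stripe count.
ost_slot_for_offset() {
    # Integer division gives the stripe number; the modulo applies the
    # round-robin assignment of stripes to the file's OSTs.
    echo $(( ($1 / $2) % $3 ))
}
```

With a stripe count of 4 and 1048576-byte stripes, `ost_slot_for_offset 5242890 1048576 4` prints `1`: the offset falls in stripe number 5, and the round-robin wraps back to the second OST.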

The default parameters on Hopper are [count=2, size=1MB, index=-1], but these can be changed and viewed on a per-file or per-directory basis using the commands:

lfs setstripe [file,dir] -c [count] -s [size] -i [index]
lfs getstripe [file,dir]

A file automatically inherits the striping parameters of the directory it is created in, so changing the parameters of a directory is a convenient way to set the parameters for a collection of files you are about to create. For instance, if your application creates output files in a subdirectory called output/, you can set the striping parameters on that directory once before your application runs, and all of your output files will inherit those parameters.
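A job script might wrap this step in a small function along the following lines. This is only a sketch: prepare_output_dir is a hypothetical name, the stripe count of 8 in the usage note is an arbitrary illustration rather than a recommendation, and the argument order follows the lfs setstripe form shown above.

```shell
# Sketch: create an output directory and set its stripe count before the
# application writes any files into it.
# Arguments: $1 = directory path, $2 = stripe count.
# "prepare_output_dir" is a hypothetical name, not a NERSC or Lustre command.
prepare_output_dir() {
    mkdir -p "$1" &&
    lfs setstripe "$1" -c "$2" &&   # files created here later inherit this
    lfs getstripe "$1"              # print the layout to confirm the setting
}
```

Calling `prepare_output_dir output 8` before launching the application would then give every file created in output/ a stripe count of 8.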

NERSC has also created the following shortcut commands for applying roughly optimal striping parameters for a range of common file sizes and access patterns:

stripe_fpp [file,dir]
stripe_small [file,dir]
stripe_medium [file,dir]
stripe_large [file,dir]

For more details on these striping shortcuts, see our page on tuning Lustre I/O performance.

Unfortunately, striping parameters must be set before a file is written to (either by touching a file and setting the parameters or by setting the parameters on the directory). To change the parameters after a file has been written to, the file must be copied over to a new file with the desired parameters, for instance with these commands:

lfs setstripe newfile [new parameters]
cp oldfile newfile
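The two commands above could be combined into a small helper, sketched below. The name restripe_copy is hypothetical, only the stripe count is changed, and the existence check reflects an assumption drawn from the text: because parameters must be set before a file is written, the destination should not already exist.

```shell
# Sketch: copy an existing file into a new file created with a different
# stripe count. "restripe_copy" is a hypothetical name.
# Arguments: $1 = existing file, $2 = new file, $3 = stripe count.
restripe_copy() {
    # Assumption: refuse to reuse an existing destination, since striping
    # parameters have to be set before the file is written to.
    if [ -e "$2" ]; then
        echo "restripe_copy: $2 already exists" >&2
        return 1
    fi
    lfs setstripe "$2" -c "$3" &&   # create the empty file with the new layout
    cp "$1" "$2"                    # fill it with the old file's contents
}
```

After checking the copy (for example with lfs getstripe), the old file can be removed and the new one renamed into its place.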