
Tutorials and example batch scripts

A number of tutorials for the Burst Buffer are available. See for example the tutorial at CUG2017, the NUG Monthly Telecon tutorial, and the Data Day tutorial (video, slides). 

Access to the Burst Buffer is integrated with the system scheduler (Slurm), which can provision Burst Buffer resources to be shared by a set of users or jobs. Using the Burst Buffer on Cori can be as simple as adding a line to your Slurm batch script. Here we give examples of how to use the Burst Buffer as a scratch space and as a persistent reservation, and how to stage data in and out of the SSDs. 

Burst Buffer as scratch

The fast IO available with the Burst Buffer makes it an ideal scratch space for temporary files during the execution of an IO-intensive code. Note that all files in your Burst Buffer allocation are removed when the job ends, so you will need to stage_out any data that you want to retain. You do not need to delete unwanted data you leave on the BB - it is removed automatically when your BB allocation is torn down (performed by the DataWarp software after your job completes). To use the Burst Buffer as scratch, add a "#DW jobdw" command to your Slurm batch script and specify what type of allocation you require (striped or private) and how much space. The "jobdw" indicates that you are requesting an allocation that will last for the duration of the compute job. 

  • Access mode:  
    • Striped access mode means your data will be striped across several DataWarp nodes. The Burst Buffer has two granularities in two different pools - 80GiB (wlm_pool, the default) and 20GiB (sm_pool); the granularity is the minimum allocation on each Burst Buffer SSD. For example, if you request 1.2TiB of Burst Buffer space in the wlm_pool, it will be striped across 15 BB SSDs, each holding 80GiB (note that you may actually see more than one unit of granularity on the same BB SSD - there is no guarantee that your allocation will be spread evenly between SSDs). DataWarp nodes are allocated on a round-robin basis. 
    • Private access mode means each compute node in your job gets its own private space on the BB that is not visible to the other compute nodes (or to any other job). Data is still striped across BB nodes in private mode. Note that all compute nodes share the total allocation - so if one compute node fills up all the space, the other compute nodes will run out and you will see "out of space" errors. A private-mode example is given below. 
    • In general, we recommend using access_mode=striped and the default granularity pool. 
  • Type: The only type currently available is type=scratch

The path to the Burst Buffer allocation will then be available as $DW_JOB_PRIVATE or $DW_JOB_STRIPED. If you want your code to access the BB space, you will need to tell it where that space is (your code will not automatically stage data onto the BB for you), for example by passing the path as an option to your code: 

#!/bin/bash 
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:15:00
#DW jobdw capacity=10GB access_mode=striped type=scratch pool=sm_pool
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE
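
For comparison, here is a minimal sketch of the same job requesting a private allocation instead; the Burst Buffer path is then exposed as $DW_JOB_PRIVATE rather than $DW_JOB_STRIPED (the capacity is just a placeholder):

#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:15:00
#DW jobdw capacity=10GB access_mode=private type=scratch
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE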

The following example demonstrates how to copy data in or out of the Burst Buffer. The location of this data could be passed as an option to your executable (note that your executable will not know where this data is unless you tell it!). The only filesystem currently mounted on the datawarp nodes is the Lustre scratch system, accessible via $SCRATCH on Cori. Any files to be transferred to or from the Burst Buffer must be located on this disk.

Note that this is a different way of moving data compared to the Datawarp "stage_in" and "stage_out" commands, and may be significantly slower.

#!/bin/bash 
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:15:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
mkdir $DW_JOB_STRIPED/inputdir
mkdir $DW_JOB_STRIPED/outputdir
cp $SCRATCH/path/to/input/file $DW_JOB_STRIPED/inputdir/
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE
cp -r $DW_JOB_STRIPED/outputdir/ $SCRATCH/path/to/output/

 

Staging In Files

If you will be accessing a file multiple times in your code, or if you have large input files that would benefit from faster IO, you should consider using the stage_in feature of the Burst Buffer. This copies the named file or directory into the Burst Buffer, where it can be accessed using $DW_JOB_PRIVATE or $DW_JOB_STRIPED as above. Note that currently only the Cori scratch file system is accessible from the Burst Buffer, so only files on $SCRATCH can be staged in. Stage_in/out differs from a simple filesystem "cp" in that the BB nodes transfer the files directly to the Burst Buffer SSDs, without going through the compute nodes (as is the case with "cp"), so it is significantly faster. In addition, your files will be staged in before the start of the compute job - so you are not charged compute time for the time taken to stage the data in. 

  • You can stage_in (and stage_out) both files and directories - use "type=file" or "type=directory" accordingly. 
  • You need to have permission to access the files - if you try to stage_in someone else's files without sufficient permissions, you may see errors. 
  • Please use the full path to your scratch directory and not $SCRATCH (or other environment variables), as the DataWarp software will not be able to interpret the environment variable.
  • If you don't specify a destination directory name on the Burst Buffer when using type=directory, you will find the files contained in your source directory have been copied over to the Burst Buffer - not the directory itself (the second stage_in example below names a destination directory explicitly). 
  • Your compute job will wait in the queue until the data is staged in to the BB - so that your data will be available to you as soon as the compute job starts. If you have a large amount of data (e.g. TB-scale) or many (e.g. millions) of files to stage in, this may take tens of minutes, so your compute job could pend for longer than you'd expect based on your queue priority. In this case, just be patient!  
#!/bin/bash
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 01:00:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE
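
The following sketch stages in a whole directory rather than a single file; the path and the destination directory name "inputdir" are placeholders. Naming the destination keeps the staged files together under that directory on the Burst Buffer:

#!/bin/bash
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 01:00:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/inputdir destination=$DW_JOB_STRIPED/inputdir type=directory
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE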

Staging Out Files

Similarly, you can use the Burst Buffer to stage out files to your scratch directory. This stage_out will happen after the end of your compute job, and you are not charged compute time for it. So be aware that your staged-out data may not appear immediately when your compute job finishes. 

#!/bin/bash
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 01:00:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_out source=$DW_JOB_STRIPED/dirname destination=/global/cscratch1/sd/username/path/to/dirname type=directory
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE
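
The stage_in and stage_out directives can also be combined in a single job, giving a complete stage-in/compute/stage-out workflow. The paths below are placeholders:

#!/bin/bash
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 01:00:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/inputdir destination=$DW_JOB_STRIPED/inputdir type=directory
#DW stage_out source=$DW_JOB_STRIPED/outputdir destination=/global/cscratch1/sd/username/path/to/outputdir type=directory
# create the output directory that will be staged out after the job
mkdir $DW_JOB_STRIPED/outputdir
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE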

 

Persistent Reservations

If you have multiple jobs that need to access the same files, you can use a persistent reservation. This creates a space on the Burst Buffer that persists after the end of your job and can be accessed by subsequent jobs. Important: you must delete the reservation when you are finished with it, to free up the space for other users. The Burst Buffer is not intended for long-term storage and we do not guarantee your files will be recoverable over long periods of time. 

1. Create the reservation

You can create the named persistent reservation in a standalone job or at the start of your regular batch script. The only type of reservation available is with striped access and of type scratch. Note that creating (and destroying) the persistent reservation uses a "#BB" prefix rather than the "#DW" used in other examples. This is because the command is interpreted by Slurm and not by the Cray DataWarp software. This can result in the batch job behaving in a way you may not expect: as soon as the scheduler reads the job, the Burst Buffer resource is provisioned, even though the job has not yet executed. This means the persistent reservation will be available shortly after you submit the batch job, even if the job is not scheduled to execute for many hours. It also means that canceling the job after the reservation has been created will not cancel the reservation. If unsure, use "scontrol show burst" to see what reservations have been created on the Burst Buffer. 

  • You will need to give your persistent reservation a name - this must be a unique name otherwise the allocation will fail. Please avoid using simple names like "offline" which can cause confusion. 
  • You can create a PR using a batch script that does no compute work at all - but you still need to submit it to the batch system and therefore request compute time, even if it's not used.  
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB create_persistent name=myBBname capacity=100GB access=striped type=scratch 
2. Use the reservation

Tell Slurm that you will be accessing the Burst Buffer reservation by giving the DataWarp directive and the name of your persistent reservation: #DW persistentdw name=myBBname. The path to your Burst Buffer reservation can then be accessed during your job using $DW_PERSISTENT_STRIPED_myBBname, where "myBBname" is the name you gave your persistent reservation when you created it. In the example below, you could pass this path to your executable as an option (note that you have to actually write your code so that it accepts such an option; it will not magically know where your data sits).

  • Be careful when running many jobs that write to the same location - you may inadvertently overwrite your data. 
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#DW persistentdw name=myBBname
mkdir $DW_PERSISTENT_STRIPED_myBBname/test1
srun a.out INSERT_YOUR_CODE_OPTIONS_HERE 
3. Delete the reservation

It is very important to remember to delete the persistent reservation once you have finished using it, to avoid inconveniencing other users. At present, we recommend this task be submitted as an independent batch job. Similarly to creating the reservation, the reservation will be destroyed as soon as the scheduler reads the batch job. This means your reservation may be destroyed many hours before your batch job rises to the top of the queue and actually executes. 

Note that data left on a persistent reservation will be automatically deleted when the allocation is torn down - you do not need to remove the data yourself. 

#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB destroy_persistent name=myBBname

 

Interactive Sessions

You can also access the Burst Buffer during an interactive session, either as scratch space (usable only during your session) or via your persistent reservation. The simplest way to do this is to put the #DW or #BB directives you would otherwise use in your batch script into a configuration file, and specify that file when you request the interactive session with salloc (note that the quotation marks around the filename are required): 

salloc -N 1 -C haswell -p debug -t 00:10:00 --bbf="bbf.conf"

where the file bbf.conf contains the line:

#DW persistentdw name=myBBname

or 

#DW jobdw capacity=10GB access_mode=striped type=scratch

The path to the Burst Buffer space will then be available in your interactive session via the environment variables $DW_JOB_STRIPED or $DW_PERSISTENT_STRIPED_myBBname, as usual. More information about accessing the Burst Buffer during an interactive session can be found on the Slurm web page.
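
For example, a job-scratch configuration file can be created and used like this (a minimal sketch; the capacity is a placeholder):

cat > bbf.conf << EOF
#DW jobdw capacity=10GB access_mode=striped type=scratch
EOF
salloc -N 1 -C haswell -p debug -t 00:10:00 --bbf="bbf.conf"
# then, inside the resulting interactive session:
echo $DW_JOB_STRIPED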

 

Shifter 

The Burst Buffer can be used with Shifter. The example below uses the cvmfs image of the ATLAS experiment. For more details on Shifter, see the Shifter web page.

#!/bin/bash
#SBATCH -N 2 -C haswell -p debug
#DW jobdw type=scratch capacity=400GB access_mode=striped
module load shifter
srun -n 2 shifter --image custom:atlas_cvmfs:latest ./shiftertest.sh
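
The shiftertest.sh script above is simply whatever you want to run inside the container; a hypothetical example that writes one file per task to the Burst Buffer (assuming, as in the job above, that the DataWarp mount and $DW_JOB_STRIPED are visible to the containerized processes) might look like:

#!/bin/bash
# hypothetical shiftertest.sh: write one file per task to the Burst Buffer allocation
echo "hello from $(hostname)" > $DW_JOB_STRIPED/out.$SLURM_PROCID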

Finding out what's on the Burst Buffer

The command "scontrol show burst" is a very useful one. Using this you can see all allocations on the Burst Buffer - bother persistent and scratch. It also shows you the available pools, and how much space on the BB is allocated and how much is available. 

Accessing DataWarp configuration

The command "dwstat" can be used within your compute job (not available on login nodes) to find out information about where your BB allocation is assigned. The syntax can be found on the Cray web pages - here we give a couple of examples of useful ways of combining the commands to extract information about your BB usage. 

You can see all persistent reservations in dwstat, but only your per-job (i.e. scratch) instances. 
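
A quick way to see everything dwstat reports at once, from inside a job (a minimal sketch; see the Cray documentation for the full list of dwstat subcommands):

module load dws
dwstat most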

Finding which nodes your BB allocation has been assigned to for a scratch instance

This script will give you your session ID number, instance ID number and the list of BB fragments for a scratch instance on the BB. Note that you CANNOT use "srun dwstat" as this will cause problems with all compute nodes trying to query the datawarp status at the same time. You only need to run this command from the head node, without the "srun". 

#!/bin/bash
module load dws
sessID=$(dwstat sessions | grep $SLURM_JOBID | awk '{print $1}')
echo "session ID is: "${sessID}
instID=$(dwstat instances | grep $sessID | awk '{print $1}')
echo "instance ID is: "${instID}
echo "fragments list:"
echo "frag state instID capacity gran node"
dwstat fragments | grep ${instID}

Finding which nodes your BB allocation has been assigned to for a persistent reservation

This script will give you your session ID number, instance ID number and the list of BB fragments for a persistent reservation on the BB. Note that you CANNOT use "srun dwstat" as this will cause problems with all compute nodes trying to query the datawarp status at the same time. You only need to run this command from the head node, without the "srun". 

You will need to substitute the name of the persistent reservation you are trying to access for myBBname in this example. 

#!/bin/bash
module load dws
sessID=$(dwstat sessions | grep myBBname | awk '{print $1}')
echo "session ID is: "${sessID}
instID=$(dwstat instances | grep $sessID | awk '{print $1}')
echo "instance ID is: "${instID}
echo "fragments list:"
echo "frag state instID capacity gran node"
dwstat fragments | grep ${instID}

 

 

Using the datawarp API 

We do not recommend using the datawarp API unless you really want/need to - it remains under-tested. 

There is a demo script of staging functionality available here, and some description of usage here. This can be compiled as follows:

#!/bin/bash
module load datawarp
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/datawarp/default/lib64
export LIBRARY_PATH=$LIBRARY_PATH:/opt/cray/datawarp/default/lib64

gcc -I/opt/cray/datawarp/default/include -o datawarp_Stager datawarp_stager.c -ldatawarp
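
Once compiled, the stager can be run from inside a Burst Buffer job. The arguments below (a file on the Burst Buffer and a destination on the parallel filesystem) are only a guess at the demo's command-line interface - check the demo source for its actual usage:

#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:15:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
module load datawarp
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/datawarp/default/lib64
srun -n 1 ./datawarp_Stager $DW_JOB_STRIPED/myfile /global/cscratch1/sd/username/path/to/myfile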