Using the DLCache Library On the Cray XE6 System

Mike Davis, Cray Inc.
February 2012

Introduction
------------

The DLCache library is a set of functions that can be incorporated into a
dynamically-linked application to provide improved performance during the
loading of dynamic libraries when running the application at large scale.
The DLCache library interfaces with the dynamic linker (ld.so) and the
compute-node /tmp file system to cache the application's dynamic library
load operations.

Dynamic Linking Without DLCache
-------------------------------

When a dynamically-linked application is executed, the set of dependent
libraries is loaded in sequence by ld.so prior to the transfer of control to
the application's main function. This set of dependent libraries ordinarily
numbers a dozen or so; their names and paths are given by the ldd(1) command.
As ld.so processes each dependent library, it executes a series of system
calls to make the library's contents available to the application. These
include:

    fd = open ("/path/to/dependent/library", O_RDONLY);
    read (fd, elfhdr, 832);   // to get the library ELF header
    fstat (fd, &stbuf);       // to get library attributes, such as size
    mmap (buf, length, prot, MAP_PRIVATE, fd, offset);  // map text segment
    mmap (buf, length, prot, MAP_PRIVATE, fd, offset);  // map data segment
    close (fd);

When the application runs at very large width (that is, at high PE count or
large MPI comm size), every PE (or MPI rank) executes the same set of system
calls on the same file system object at more or less the same time. This can
cause serious file system contention and performance degradation.
DLCache Theory of Operation
---------------------------

The successful use of DLCache depends on three central assumptions: (1) the
application can be executed at either small or large width; (2) the number of
dynamic libraries to be loaded, and their order, does not change from
small-width to large-width runs; and (3) the number of dynamic libraries to
be loaded, and their order, is identical across PEs. The reasons for these
central assumptions are given in the next section.

Because the loading of dependent libraries occurs so early in the
application's execution, only a very limited set of I/O, communication, and
other system-related support is available to the application at that time.
Specifically, there is no opportunity at this stage to employ higher-level
operations such as MPI. For this reason, the DLCache functions are limited to
a small set of system-call operations, such as open(), read(), and write(),
to perform their optimizations. This is why compute-node /tmp was chosen to
hold the cache, rather than MPI buffers.

Dynamic Linking With DLCache
----------------------------

The DLCache library eases file-system contention and delivers improved
performance by caching the dependent library contents in a file local to the
compute node (/tmp/dlcache.dat). The calls normally made by ld.so to load a
dependent library are intercepted and redirected to the DLCache library,
which then accesses the dlcache file to make the dependent library's contents
available to the application.

Using the DLCache library in an application involves four steps. The first
two steps are performed when the application is built, and the last two steps
are performed when the application is executed.

Step 1: Specifying a Non-default Dynamic Linker
-----------------------------------------------

Since the system calls made by ld.so must be intercepted, the application
must be linked with a custom ld.so. The procedure for performing this custom
link is as follows.
Suppose that the normal command for linking the application (called 'myapp')
looks like this:

    cc main.o solve.o report.o \
       -L${HOME}/lib -lmytools -o myapp

To link with a custom ld.so, one would instead perform the following
commands:

    cp /lib64/ld-linux-x86-64.so.2 ld.so
    cc main.o solve.o report.o \
       ${DLCACHE_HOME}/lib/*.o \
       -L${HOME}/lib -lmytools -o myapp \
       -Wl,--dynamic-linker=`pwd`/ld.so

The cp command creates a copy of the system default ld.so in the current
working directory. The second line of the cc command specifies that the
object files that make up the DLCache library be statically linked into the
executable. The fourth line of the cc command specifies that the new copy of
ld.so be used as the dynamic linker.

Note that, at this point, the contents of the "custom" ld.so are no different
from those of the system default. The only difference between the two is
their path names. It turns out that this difference is important to the
success of the next step.

Step 2: Customizing the dynamic linker
--------------------------------------

Customizing the dynamic linker involves patching the code in ld.so so that,
instead of issuing the set of system calls to load a dependent library, the
linker calls the corresponding DLCache library functions that are now part of
the application code. The tool that performs this patching is called
'dlpatch'; it is part of the DLCache software package. The command to
customize the dynamic linker is:

    ${DLCACHE_HOME}/bin/dlpatch myapp

The dlpatch command locates the addresses of the DLCache library functions in
the executable 'myapp', and patches instructions into ld.so that cause it to
branch to the appropriate addresses in 'myapp' instead of executing the
system calls.

Step 3: Creating the dlcache.dat file
-------------------------------------

Before the application can make use of a dlcache.dat file, it must first
generate the file. This is done by executing the application in a "trial
mode."
Suppose that the job script to execute the application at a small width is as
follows (the line numbers at the left are for annotation only; they do not
appear in the actual script file):

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  aprun -n 24 -N 24 myapp
    11  #
    12  # All done
    13  #

The changes necessary to create the dlcache.dat file appear in the modified
script below:

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  DL_OP="cache-write"
    11  aprun -n 24 -N 24 ${DLCACHE_HOME}/bin/dlcache.pre ${DL_OP}
    12  aprun -n 24 -N 24 myapp
    13  aprun -n 24 -N 24 ${DLCACHE_HOME}/bin/dlcache.post ${DL_OP}
    14  #
    15  # All done
    16  #

The modified script has three new lines. Line 10 sets the shell variable
DL_OP to the type of caching operation to perform (cache-write). Line 11
executes a tool called 'dlcache.pre', which prepares the compute node's /tmp
space for the process of generating a dlcache.dat file. On line 12 of the
modified script, as on line 10 of the original script, the application is
executed at a width of 24 PEs, across the 24 cores of a single compute node.
It is assumed that 'myapp' has been built to do DLCaching, as described in
the sections above; as such, it will generate a dlcache.dat file as it runs.
Line 13 executes a tool called 'dlcache.post', which collects the dlcache.dat
file from the /tmp space on the compute node and writes it to the current
working directory of the job script. The file can then be used in a
subsequent, large-width execution of myapp.

Step 4: Reading the dlcache.dat file
------------------------------------

Now that the application has been run at small width to create the
dlcache.dat file, a large-width run can be made to read the file.
Suppose that the job script to execute the application at a large width is as
follows (the line numbers at the left are for annotation only; they do not
appear in the actual script file):

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24000
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  aprun -n 24000 -N 24 myapp
    11  #
    12  # All done
    13  #

The changes necessary to read the dlcache.dat file appear in the modified
script below:

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24000
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  DL_OP="cache-read"
    11  aprun -n 24000 -N 24 ${DLCACHE_HOME}/bin/dlcache.pre ${DL_OP}
    12  aprun -n 24000 -N 24 myapp
    13  aprun -n 24000 -N 24 ${DLCACHE_HOME}/bin/dlcache.post ${DL_OP}
    14  #
    15  # All done
    16  #

The modified script has three new lines. Line 10 sets the shell variable
DL_OP to the type of caching operation to perform (cache-read). Line 11
executes a tool called 'dlcache.pre', which prepares the compute nodes' /tmp
space for the process of reading the dlcache.dat file. (This step is
optimized such that only rank 0 reads the dlcache.dat file from the current
working directory; the contents are then broadcast to the lowest-rank PEs on
each node; the lowest-rank PEs then write the contents to their respective
/tmp directories.) On line 12 of the modified script, as on line 10 of the
original script, the application is executed at a width of 24000 PEs, across
the 24 cores of 1000 compute nodes. It is assumed that 'myapp' has been built
to do DLCaching, as described in the sections above; as such, it will read
the dlcache.dat file as it runs. Line 13 executes a tool called
'dlcache.post', which removes the dlcache.dat file from the /tmp space on the
compute nodes.