Using the DLCache Library On the Cray XE6 System

Mike Davis, Cray Inc.
February 2012

Introduction
------------

The DLCache library is a set of functions that can be incorporated into a
dynamically-linked application to provide improved performance during the
loading of dynamic libraries when running the application at large scale.
The DLCache library interfaces with the dynamic linker (ld.so) and the
compute-node /tmp file system to cache the application's dynamic library
load operations.

Dynamic Linking Without DLCache
-------------------------------

When a dynamically-linked application is executed, the set of dependent
libraries is loaded in sequence by ld.so prior to the transfer of control to
the application's main function. This set of dependent libraries ordinarily
numbers a dozen or so; their names and paths are given by the ldd(1) command.
As ld.so processes each dependent library, it executes a series of system
calls to make the library's contents available to the application. These
include:

    fd = open ("/path/to/dependent/library", O_RDONLY);
    read (fd, elfhdr, 832);   // to get the library ELF header
    fstat (fd, &stbuf);       // to get library attributes, such as size
    mmap (buf, length, prot, MAP_PRIVATE, fd, offset);  // map text segment
    mmap (buf, length, prot, MAP_PRIVATE, fd, offset);  // map data segment
    close (fd);

When the application runs at very large width (that is, at high PE count or
large MPI comm size), every PE (or MPI rank) executes the same set of system
calls on the same file system object at more or less the same time. This can
cause serious file system contention and performance degradation.
DLCache Theory of Operation
---------------------------

The successful use of DLCache depends on three central assumptions: (1) the
application can be executed at either small or large width; (2) the number of
dynamic libraries to be loaded, and their order, does not change from
small-width to large-width runs; and (3) the number of dynamic libraries to
be loaded, and their order, is identical across PEs. The reasons for these
central assumptions are given in the next section.

Because the loading of dependent libraries occurs so early in the
application's execution, only a very limited set of I/O, communication, and
other system-related support is available to the application at that time.
Specifically, there is no opportunity at this stage to employ higher-level
operations such as MPI. For this reason, the DLCache functions are limited to
a small set of system-call operations, such as open(), read(), and write(),
to perform their optimizations. This is why compute-node /tmp was chosen to
hold the cache, rather than MPI buffers.

Dynamic Linking With DLCache
----------------------------

The DLCache library eases file-system contention and delivers improved
performance by caching the dependent library contents in a file local to the
compute node (/tmp/dlcache.dat). The calls normally made by ld.so to load a
dependent library are intercepted and redirected to the DLCache library,
which then accesses the dlcache file to make the dependent library's contents
available to the application.

Using the DLCache library in an application involves four steps. The first
two steps are performed when the application is built, and the last two steps
are performed when the application is executed.

Step 1: Specifying a Non-default Dynamic Linker
-----------------------------------------------

Since the system calls made by ld.so must be intercepted, the application
must be linked with a custom ld.so. The procedure for performing this custom
link is as follows.
Suppose that the normal command for linking the application (called 'myapp')
looks like this:

    cc main.o solve.o report.o \
       -L${HOME}/lib -lmytools -o myapp

To link with a custom ld.so, one would instead perform the following
commands:

    cp /lib64/ld-linux-x86-64.so.2 ld.so
    cc main.o solve.o report.o \
       ${DLCACHE_HOME}/lib/*.o \
       -L${HOME}/lib -lmytools -o myapp \
       -Wl,--dynamic-linker=`pwd`/ld.so

The cp command creates a copy of the system default ld.so in the current
working directory. The second line of the cc command specifies that the
object files that make up the DLCache library be statically linked into the
executable. The fourth line of the cc command specifies that the new copy of
ld.so be used as the dynamic linker.

Note that, at this point, the contents of the "custom" ld.so are no different
from those of the system default. The only difference between the two is
their path names. It turns out that this difference is important to the
success of the next step.

Step 2: Customizing the dynamic linker
--------------------------------------

Customizing the dynamic linker involves patching the code in ld.so so that,
instead of issuing the set of system calls to load a dependent library, the
linker calls the corresponding DLCache library functions that are now part of
the application code. The tool that performs this patching is called
'dlpatch'; it is part of the DLCache software package. The command to
customize the dynamic linker is:

    ${DLCACHE_HOME}/bin/dlpatch myapp

The dlpatch command locates the addresses of the DLCache library functions in
the executable 'myapp', and patches instructions into ld.so that cause it to
branch to the appropriate addresses in 'myapp' instead of executing the
system calls.

Step 3: Creating the dlcache.dat file
-------------------------------------

Before the application can make use of a dlcache.dat file, it must first
generate the file. This is done by executing the application in a "trial
mode."
Suppose that the job script to execute the application at a small width is as
follows (the line numbers at the left are for annotation only; they do not
appear in the actual script file):

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  aprun -n 24 -N 24 myapp
    11  #
    12  # All done
    13  #

The changes necessary to create the dlcache.dat file appear in the modified
script below:

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  DL_OP="cache-write"
    11  aprun -n 24 -N 24 ${DLCACHE_HOME}/bin/dlcache.pre ${DL_OP}
    12  aprun -n 24 -N 24 myapp
    13  aprun -n 24 -N 24 ${DLCACHE_HOME}/bin/dlcache.post ${DL_OP}
    14  #
    15  # All done
    16  #

The modified script has three new lines. Line 10 sets the shell variable
DL_OP to the type of caching operation to perform (cache-write). Line 11
executes a tool called 'dlcache.pre', which prepares the compute node's /tmp
space for the process of generating a dlcache.dat file. On line 12 of the
modified script, as on line 10 of the original script, the application is
executed at a width of 24 PEs, across the 24 cores of a single compute node.
It is assumed that 'myapp' has been built to do DLCaching, as described in
the sections above; as such, it will generate a dlcache.dat file as it runs.
Line 13 executes a tool called 'dlcache.post', which collects the dlcache.dat
file from the /tmp space on the compute node and writes it to the current
working directory of the job script. The file can then be used in a
subsequent, large-width execution of myapp.

Step 4: Reading the dlcache.dat file
------------------------------------

Now that the application has been run at small width to create the
dlcache.dat file, a large-width run can be made to read the file.
Suppose that the job script to execute the application at a large width is as
follows (the line numbers at the left are for annotation only; they do not
appear in the actual script file):

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24000
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  aprun -n 24000 -N 24 myapp
    11  #
    12  # All done
    13  #

The changes necessary to read the dlcache.dat file appear in the modified
script below:

     1  #!/bin/bash
     2  #PBS -S /bin/bash
     3  #PBS -l mppwidth=24000
     4  #PBS -l mppnppn=24
     5  #PBS -l walltime=1:00:00
     6  #PBS -o run.out
     7  #PBS -j oe
     8  #
     9  test "${PBS_O_WORKDIR}" != "" && cd ${PBS_O_WORKDIR}
    10  DL_OP="cache-read"
    11  aprun -n 24000 -N 24 ${DLCACHE_HOME}/bin/dlcache.pre ${DL_OP}
    12  aprun -n 24000 -N 24 myapp
    13  aprun -n 24000 -N 24 ${DLCACHE_HOME}/bin/dlcache.post ${DL_OP}
    14  #
    15  # All done
    16  #

The modified script has three new lines. Line 10 sets the shell variable
DL_OP to the type of caching operation to perform (cache-read). Line 11
executes a tool called 'dlcache.pre', which prepares the compute nodes' /tmp
space for the process of reading the dlcache.dat file. (This step is
optimized such that only rank 0 reads the dlcache.dat file from the current
working directory; the contents are then broadcast to the lowest-rank PEs on
each node; the lowest-rank PEs then write the contents to their respective
/tmp directories.) On line 12 of the modified script, as on line 10 of the
original script, the application is executed at a width of 24000 PEs, across
the 24 cores of 1000 compute nodes. It is assumed that 'myapp' has been built
to do DLCaching, as described in the sections above; as such, it will read
the dlcache.dat file as it runs. Line 13 executes a tool called
'dlcache.post', which removes the dlcache.dat file from the /tmp space on the
compute nodes.