
CrayPat

Description

CrayPat is a performance analysis tool offered by Cray for the XT and XE platforms. It has a large feature set. Note that production codes should NOT be run with CrayPat instrumentation.

Perftools-lite is a simplified and easy-to-use version of the CrayPat tool.  It provides basic performance analysis information automatically with simple steps. Users can decide whether to use the full CrayPat tools after trying Perftools-lite.

Here we will highlight basic usage and point to other relevant documentation.

How to use Perftools-lite

The general workflow for getting performance data using Perftools-lite is as follows:

  1. Unload the darshan module (Perftools-lite conflicts with the "darshan" I/O performance monitoring tool, which is loaded by default).
  2. Load the perftools-lite module.
  3. Build your application as normal.
  4. In your batch script, set CRAY_ROOTFS to DSL (note: the pat_report binary is dynamically linked).
  5. If the job is submitted from a GPFS file system, then also in the batch script either set PAT_RT_EXPFILE_DIR to a directory in a Lustre file system, or set PAT_RT_EXPFILE_MAX to a value >= the number of PEs or to -1.
  6. Run as normal (see the example batch script below). Performance data is summarized at the end of the job's stdout.
  7. More detailed information can be gathered with pat_report or Cray Apprentice2 after the run.
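For illustration, a minimal csh batch script for a 48-way Perftools-lite run might look like the following (the executable name myprogram, the PE count, and the data directory are placeholders; adjust the PBS options to your job):

#!/bin/csh
#PBS -l mppwidth=48
...
cd $PBS_O_WORKDIR
setenv CRAY_ROOTFS DSL
setenv PAT_RT_EXPFILE_DIR $SCRATCH/data_dir   # only needed when submitting from GPFS; the directory must exist
aprun -n 48 ./myprogram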

Outputs from Perftools-lite

  • In the job's stdout file, basic information from the default "sample_profile" option: execution time, memory high-water mark, aggregate FLOPS rate, top time-consuming user functions, MPI information, etc.
  • A *.rpt text file with the same information as above.
  • A *.ap2 file that can be used with pat_report for more detailed information, and with app2 for graphical visualization (see the example commands after this list).
  • Possibly one or more suggested MPICH_RANK_ORDER_FILE files.
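For example, assuming the run produced a data file named myprogram+pat+5511-2558sdot.ap2 (the name here is only a placeholder; the actual name is generated at run time), the additional reports could be produced with:

% pat_report myprogram+pat+5511-2558sdot.ap2    # more detailed text report
% app2 myprogram+pat+5511-2558sdot.ap2          # graphical view with Cray Apprentice2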

How to Use CrayPat

The general workflow for getting performance data using CrayPat is as follows:

  1. Unload the darshan module if it is loaded.
  2. Load the perftools module.
  3. Build your application; keep .o files.
  4. Instrument the application using pat_build.
  5. Run the instrumented executable to get a performance data (".xf") file.
  6. Run pat_report on the data file to view the results.

Darshan is an I/O profiling tool that is loaded by default. When using CrayPat, the darshan module needs to be unloaded first. Otherwise, the following error will occur:

% pat_build myprogram
FATAL: The program '/.../myprogram' directly or indirectly uses the MPI profiling mechanism.

If you still see this error after unloading the darshan module, unload the darshan module in your .bashrc.ext (bash shell) or .cshrc.ext (tcsh or csh shell) dotfile and log in again.
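For example, for tcsh or csh you would add the following line to your .cshrc.ext (for bash, put the same module command in .bashrc.ext):

module unload darshan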

The perftools module needs to be loaded before you start building your application. Otherwise, the following error message appears during the link stage.

ERROR: Missing required ELF section 'link information' from the file ....
Load the correct 'craypat' module and rebuild the program.

Object files (*.o files) need to be made available to CrayPat to correctly build an instrumented executable for profiling or tracing. In other words, the compile and link stages should be separated by using the -c compile flag. Otherwise, you will see the following warning message:

% module load perftools
% ftn mytest.f90
WARNING: CrayPat is saving object files from temporary locations into directory '/global/homes/...'

Please note that the Cray compiler wrappers (ftn, cc and CC) must be used for building an executable, instead of native compiler commands (pgf90, pgcc, pgCC, gfortran, gcc, g++, etc.), since pat_build cannot instrument an executable built with a native compiler.
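For example, a correct build sequence for a single-file Fortran code (myprogram.f90 is a placeholder) keeps the compile and link steps separate and uses the ftn wrapper for both:

% module load perftools
% ftn -c myprogram.f90            # compile only; produces myprogram.o
% ftn -o myprogram myprogram.o    # link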

Try to run a CrayPat-instrumented executable in the $SCRATCH or $SCRATCH2 space or make sure that CrayPat writes its performance data to those spaces via the PAT_RT_EXPFILE_DIR environment variable (see the intro_craypat man page):

#!/bin/csh
#PBS -l mppwidth=48
...
cd $PBS_O_WORKDIR
setenv PAT_RT_EXPFILE_DIR $SCRATCH/data_dir # the directory must exist
aprun -n 48 ./myprogram+pat

Your parallel application may fail when performance data is created in a non-Lustre file system.

Sampling Experiments

Sampling (sometimes called an "asynchronous" experiment) records the program counter (PC) or the call stack at given time intervals or when a specified counter overflows. The default experiment type is to sample the PC at regular time intervals (i.e., samp_pc_time). Other sampling experiment types are available, and the type can be set with the environment variable PAT_RT_EXPERIMENT (see the intro_craypat man page).

% module load perftools
% ftn -c myprogram.f90
% ftn -o myprogram myprogram.o
% pat_build -S myprogram (or simply pat_build myprogram)

This generates a new executable, myprogram+pat. Run this executable on compute nodes, as you would the regular executable. This will generate a performance data file with the suffix 'xf' (e.g., myprogram+pat+5511-2558sdot.xf). To view the contents, run pat_report on it:

% pat_report myprogram+pat+5511-2558sdot.xf

This command prints an ASCII text report to your terminal and creates files with two different suffixes, 'ap2' and 'apa'. The former is used to view performance data graphically with the Cray Apprentice2 tool, and the latter contains suggested pat_build options for more detailed tracing experiments. To see source line information, use the "-O ca+src" option to pat_report.
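For example (reusing the illustrative file name from above):

% pat_report -O ca+src myprogram+pat+5511-2558sdot.xf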

A more detailed source line-by-line profile can be obtained by the following:

pat_build a.out                        # do not use any pat_build options
aprun -n ... ./a.out+pat
pat_report -O samp_profile+src ....

This will produce an output similar to the following:

Table 1: Profile by Group, Function, and Line

 Samp%  |   Samp  |  Imb.  |  Imb.  | Group
        |         |  Samp  |  Samp% |  Function
        |         |        |        |   Source
        |         |        |        |    Line
 100.0% |  3654.0 |     -- |     -- | Total
|---------------------------------------------------------------
|  99.6% |  3640.0 |     -- |     -- | USER
||--------------------------------------------------------------
||  82.5% |  3015.0 |     -- |     -- | dim3_sweep_module_dim3_sweep_
3|         |         |        |        |  HOPPER/src/./dim3_sweep.f90
||||------------------------------------------------------------
4|||   1.3% |    47.0 |     -- |     -- | line.122
4|||   6.2% |   226.0 |     -- |     -- | line.217
4|||  11.0% |   402.0 |     -- |     -- | line.218
4|||  11.1% |   407.0 |     -- |     -- | line.228
4|||   7.5% |   274.0 |     -- |     -- | line.229
4|||   7.3% |   266.0 |     -- |     -- | line.238
4|||   4.3% |   158.0 |     -- |     -- | line.240
4|||   5.8% |   212.0 |     -- |     -- | line.241
4|||   5.4% |   198.0 |     -- |     -- | line.242
4|||   6.7% |   245.0 |     -- |     -- | line.243
4|||   3.3% |   120.0 |     -- |     -- | line.322
4|||   4.5% |   164.0 |     -- |     -- | line.371
4|||   3.9% |   142.0 |     -- |     -- | line.377

which shows that the SUBROUTINE "dim3_sweep" takes up nearly all the time (~82%), and that source lines 218 and 228 account for the bulk of that. Note that the individual source lines shown do not add up to 82%, probably because some source lines have fallen below the pat_report printing threshold.

Tracing Experiments

pat_build can also instrument an executable to trace calls to user-defined functions and Cray-provided library functions (e.g., MPI functions). Again, to generate an instrumented executable, one needs to load the perftools module first and then compile and link in separate steps.

% module load perftools
% ftn -c myprogram.f90
% ftn -o myprogram myprogram.o

The -w flag enables tracing. If only this flag is used, the entire program is traced as a whole (as "main"), with no individual function being traced.

% pat_build -w myprogram

To instrument user-defined functions, func1 and func2, use the -T option, together with the -w option:

% pat_build -w -T func1,func2 myprogram

Be careful to choose the func1 and func2 names properly; the compiler may have appended underscore characters to the Fortran routine name.
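One way to check the exact symbol names in the executable is with the standard nm utility (func1 below is just a placeholder):

% nm myprogram | grep -i func1    # shows, e.g., whether the symbol is func1 or func1_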

To trace a group of functions you list in a text file, tracefile, use the -t option:

% pat_build -w -t tracefile myprogram

where the file, tracefile, contains the function names to be traced.
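A minimal sketch of such a file, assuming one function name per line (the names are placeholders and should match the symbols in your executable):

func1_
func2_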

To trace all the user-defined functions, use the -u option.

% pat_build -u myprogram

Tracing all user functions can slow down the code significantly if it contains many small, frequently called functions. To avoid such excessive overhead, one can restrict tracing to functions of a certain text size or larger by using the trace-text-size directive (see the pat_build man page):

% pat_build -u -Dtrace-text-size=800 myprogram    # to trace those with text size >= 800 bytes

To trace a Cray-provided library function group (e.g., MPI, OpenMP, ...), specify the function group name after the -g flag. The supported function groups are listed in the pat_build man page. For example, to trace the MPI, OpenMP and heap memory related functions as well as all the user functions, one can do:

% pat_build -g mpi,omp,heap -u myprogram

After running the instrumented executable on compute nodes via aprun, run pat_report on the generated data ("xf") file. This prints an ASCII text report to the terminal and creates a file with the same basename but the ap2 suffix, which can be viewed with the Cray Apprentice2 tool.

Automatic Program Analysis (APA)

Since a sampling experiment runs with little overhead while a detailed tracing experiment generally incurs large overhead, a good strategy for analyzing a code whose performance characteristics are not yet known is to run a sampling experiment first, in order to identify the routines that should be instrumented for a more detailed tracing experiment later.

CrayPat's Automatic Program Analysis (APA) feature provides an easy way to do this. Using this feature, one generates an instrumented executable for a sampling experiment. When the binary is executed, it produces an ASCII text file containing CrayPat's suggested pat_build tracing options, which can then be used to re-instrument the executable for detailed tracing experiments.

The general workflow for using APA is as follows.

  1. Generate the executable for sampling, using the special '-O apa' flag. This generates an instrumented executable, myprogram+pat:
     % pat_build -O apa myprogram    # generates myprogram+pat
  2. Run the executable on compute nodes via aprun to generate an xf file. Let's call it myprogram+pat+4571-19sdot.xf in this example.
  3. Run pat_report on the data file:
     % pat_report myprogram+pat+4571-19sdot.xf
     This generates myprogram+pat+4571-19sdot.ap2 and myprogram+pat+4571-19sdot.apa. The latter contains suggested pat_build options for building an executable for tracing experiments.
  4. Examine the myprogram+pat+4571-19sdot.apa file and, if necessary, customize it for your needs using your favorite text editor.
  5. Rebuild the executable using pat_build's -O option with the apa file name as the argument. This generates a new instrumented executable, myprogram+apa:
     % pat_build -O myprogram+pat+4571-19sdot.apa    # generates myprogram+apa
  6. Run the new executable, myprogram+apa, for a tracing experiment.
  7. Run pat_report on the newly created xf file to see the tracing results:
     % pat_report myprogram+apa+4590-19tdot.xf

Monitoring Hardware Performance Counters

One can monitor hardware performance counter (HWPC) events while running sampling or tracing experiments (although doing this with a sampling experiment is discouraged). The supported PAPI standard and AMD native event names can be found by running the papi_avail and papi_native_avail commands on a compute node using a batch script:

#!/bin/csh
#PBS -l mppwidth=1
...
cd $PBS_O_WORKDIR
module load perftools
aprun -n 1 papi_avail
aprun -n 1 papi_native_avail

By default, hardware performance counters are not monitored during sampling or tracing experiments. To enable monitoring, one has to explicitly specify up to four event names in the PAT_RT_HWPC environment variable before the aprun command is executed. For example, one can monitor floating point operations and L1 data cache misses with the following:

setenv PAT_RT_HWPC "PAPI_FP_OPS,PAPI_L1_DCM"

Or one can set the environment variable to a predefined hardware counter group number:

setenv PAT_RT_HWPC 1   # for PAPI_L1_DCM, PAPI_L1_DCA, PAPI_TLB_DM and PAPI_FP_OPS

As another example, to find the "computational intensity" (CI) of your code, use

setenv PAT_RT_HWPC "PAPI_FP_OPS,PAPI_L1_DCA"

The CI is the ratio of the counts returned by these two events.

There are 19 such groups (numbered 0 to 19, with group 4 not supported for AMD Opteron). These groups and their meanings are explained in the hwpc man page.

Multiplexing (monitoring more than 4 events) is not yet supported in the current version.
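Putting this together, a minimal csh batch-script sketch (the counter group, PE count, and program name are only illustrative) sets the variable before launching the instrumented binary:

setenv PAT_RT_HWPC 1
aprun -n 48 ./myprogram+pat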

Some Advanced CrayPat Topics

By default, CrayPat will aggregate values from multiple processing elements during a run (and there are options to control the aggregation).  If you want to look at HWPC values on a per-processing element (PE) basis, you can just do the following:

pat_report -s pe=ALL file.ap2 ...

The following should work as well:

pat_report -d counters -b pe -s aggr_pe_counters=select0 ...

If you want to sort the report by PEs, you can add

-s sort_by_pe=yes
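Combining these, a per-PE report sorted by PE might be requested like this (the .ap2 file name is a placeholder):

% pat_report -s pe=ALL -s sort_by_pe=yes myprogram+pat+5511-2558sdot.ap2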

If you'd like HWPC values for just the whole program (not per function, etc.), you can do

pat_report -O hwpc -s pe=ALL ...

Also, if the code is an OpenMP-MPI hybrid rather than pure MPI, then it is recommended to sample with OMP_NUM_THREADS=1.

Measuring Load Imbalance

You can use CrayPat to measure load imbalance in programs instrumented to trace MPI functions. By default, CrayPat causes the trace wrapper for each MPI collective subroutine to measure the time spent in a barrier call prior to entering the collective. This time is reported by pat_report as MPI_SYNC, which is separate from the MPI function group itself. The MPI_SYNC time essentially represents the time spent waiting for the MPI call to synchronize; it shows whether or not the MPI ranks arrive at the collectives together. If the environment variable PAT_RT_MPI_SYNC is set to 1, the time spent waiting at a barrier and synchronizing processes is reported under MPI_SYNC, while the time spent executing after the barrier is reported under MPI. The default value is 1 for tracing experiments and 0 for sampling experiments.
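As a sketch (the program name and PE count are placeholders), a load-imbalance measurement uses MPI tracing as described above, optionally setting the synchronization variable explicitly in the batch script:

% pat_build -g mpi myprogram    # trace MPI functions; MPI_SYNC times appear in the pat_report output

setenv PAT_RT_MPI_SYNC 1        # in the batch script; 1 is already the default for tracing experiments
aprun -n 48 ./myprogram+pat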

Cray Apprentice2

Cray Apprentice2 is a tool used to visualize performance data collected with the CrayPat tool. There are many options for viewing the results. Please refer to the app2 man page or Cray's documentation for more details.

% module load perftools
% app2 myprogram+pat+####-####tdot.ap2

To enable the "Mosaic" and "Traffic Report" views you need to set an environment variable as follows:

  setenv PAT_RT_SUMMARY 0

Doing this may create enormous CrayPat data files and may take a long time.

Cray Reveal

Cray Reveal is a tool developed by Cray to help develop codes using the hybrid MPI/OpenMP programming model. It is part of the Cray Perftools software package. It utilizes the Cray CCE program library (hence it only works under PrgEnv-cray) for loopmark and source code analysis, combined with performance data collected by CrayPat. Reveal helps identify the top time-consuming loops, with compiler feedback on dependency and vectorization. Its loop scope analysis provides variable scope and compiler directive suggestions for inserting OpenMP parallelism into a serial or pure-MPI code.

Detailed steps for using Reveal

Further Information

NERSC has prepared a detailed tutorial on Cray's perftools.  You can view the presentation material.

Useful hands-on exercise materials accompanying the tutorial are available in the /project/projectdirs/training/XE6-feb-2011/perftools directory, which is accessible from any NERSC machine.

Please refer to Cray's documentation for more details.  Man pages are available but only after you load the perftools modulefile.  Try man pat_help for a tutorial.  Other man pages include craypat, pat_build, pat_report, app2, hwpc, intro_perftools, papi, and papi_counters.

For questions on using CrayPat at NERSC contact the consultants, consult@nersc.gov.

Availability

Package         Platform  Category                  Version  Module           Install Date  Date Made Default
Cray Perftools  hopper    libraries/performance     6.0.1    perftools/6.0.1  2012-12-19    2013-02-27
Cray Perftools  hopper    libraries/performance     6.1.1    perftools/6.1.1  2013-12-04
perftools       edison    applications/performance  6.1.3    perftools/6.1.3  2013-12-13    2013-12-10
perftools       edison    applications/performance  6.1.4    perftools/6.1.4  2014-04-23    2014-04-23
perftools       hopper    libraries/performance     6.1.0    perftools/6.1.0  2013-04-12    2013-06-20
perftools       hopper    libraries/performance     6.1.2    perftools/6.1.2  2013-09-30    2013-12-11