NERSC Powering Scientific Discovery Since 1974

VTune

Introduction

Intel VTune Amplifier XE is a performance analysis tool that helps you find serial and parallel code bottlenecks and speed up execution. VTune provides both a GUI and a command line interface.

VTune is available on Edison and Cori at NERSC. We strongly recommend using the command line tool, "amplxe-cl", to collect profiling data via batch jobs, and then displaying the profiling data using the GUI, "amplxe-gui", on a login node.

Note that the performance of the X Windows-based Graphical User Interface can be greatly improved if used in conjunction with the free NX software.

NOTE for Edison and Cori users: Run VTune in the Lustre file systems ($SCRATCH or $SCRATCH3)! Do not run it in the global file systems, such as /project directories and global homes ($HOME). Please read the section 'Known Issues and Workarounds' for more information.

Using VTune on Edison and Cori

As we mentioned in the introduction, we recommend using the command line tool, "amplxe-cl", to collect profiling data via batch jobs, and then using the GUI, "amplxe-gui," to display the results on a login node. Below are the steps to use VTune:

Compiling codes to run with VTune

In principle, VTune works with executables compiled with all three compilers (Intel, GNU, Cray) available on Edison and Cori. However, since VTune is an Intel product, we recommend using the Intel compilers to avoid unexpected issues. To compile codes to work with VTune, use the "-g" flag when compiling source code and the "-dynamic" flag at link time when using the compiler wrappers (ftn, cc, and CC). For example, to compile a Fortran MPI+OpenMP code, do

module unload darshan
ftn -g -dynamic -openmp -O3 jacobi.f90 

To compile a C MPI + OpenMP code, do

module unload darshan
cc -g -dynamic -openmp -O3 jacobi.c

Here the "-g" option is needed so that VTune can associate addresses with source lines, and the "-dynamic" option is needed to build dynamically linked applications with the compiler wrappers (the compiler wrappers, ftn, cc, and CC, link applications statically by default). Sometimes it is desirable to use the "-debug inline-debug-info" compile option to generate enhanced debug information for inlined code.

ftn -g -dynamic -openmp -O3 -debug inline-debug-info jacobi.f90 

VTune works best with dynamically linked applications, although it also works with statically linked applications (see Intel website for more details). When you execute dynamically linked applications, you may need to set the LD_LIBRARY_PATH environment variable and/or load the same modules that you used to compile your codes.

You can build your applications with full compiler optimizations as you would normally do. However, some compiler flags, such as the Intel compiler's "-fast" flag, may not work well with VTune. Please check this Intel website for more compiler options. There you will find both recommended and unrecommended options for working with VTune.

Note that the 'darshan' module is an I/O profiler that is enabled by default on Edison and Cori. VTune does not work with the 'darshan' module, so you need to unload it before compiling your codes.
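The unload-then-compile step can be wrapped in a small helper so that it is safe to run whether or not darshan is loaded. This is only a sketch: the function name unload_darshan_if_loaded is hypothetical, and it assumes that "module list" writes its listing to stderr, as many module implementations do.

```shell
# Hypothetical helper (not part of VTune or the NERSC environment):
# unload darshan only if it is currently loaded.
unload_darshan_if_loaded() {
    # Do nothing on systems without the "module" command.
    command -v module >/dev/null 2>&1 || return 0
    # Assumption: "module list" prints to stderr, hence the 2>&1.
    if module list 2>&1 | grep -q 'darshan'; then
        module unload darshan
    fi
}

# Typical use before building (flags as described in the text above):
#   unload_darshan_if_loaded
#   ftn -g -dynamic -openmp -O3 jacobi.f90
```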

Collecting profiling data via batch jobs on Edison and Cori

For a pure MPI code, this is an example job script on Edison. Note that the vtune module must be loaded on the login node from which the job script is submitted, i.e., run "module load vtune" before "sbatch".

#!/bin/bash
#SBATCH -q debug
#SBATCH --perf=vtune
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j

# on 2 Edison nodes

module load vtune # this is optional
srun -n 48 amplxe-cl -collect general-exploration -r res_dir -trace-mpi -- ./a.out

A 2-node collection on Cori Haswell nodes is similar, with only the CPU counts and the Slurm "-C" flag being different:

#!/bin/bash
#SBATCH -q debug
#SBATCH --perf=vtune
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j
#SBATCH -C haswell

# on 2 Cori Haswell nodes

module load vtune #this is optional
srun -n 64 -c 2 amplxe-cl -collect general-exploration -r res_dir -trace-mpi -- ./a.out

Finally, here is a Cori KNL node collection in quadrant+cache mode. Note that you need to load the "vtune" module before submitting the following batch script via "sbatch".

#!/bin/bash
#SBATCH -q debug
#SBATCH --perf=vtune
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j
#SBATCH -C knl,quad,cache

# on 2 Cori KNL nodes

module load vtune
srun -n 128 -c 4 amplxe-cl -collect general-exploration -r res_dir -trace-mpi -no-auto-finalize -- ./a.out

Each of these scripts runs the "general-exploration" analysis on two nodes. The "-trace-mpi" option allows VTune to trace MPI code (the default is "-no-trace-mpi") and to determine the MPI rank IDs if the code is linked to a non-Intel MPI library. The "-trace-mpi" flag also aggregates performance data from MPI tasks on the same node into a single collection database (so the above examples produce 2 collection databases, since the code ran on 2 nodes). The "-trace-mpi" flag is not necessary if you do a collection with a single MPI process. The "-r" option specifies the directory in which to write the collection database. The result for each node will be saved in a separate directory, named "res_dir.<nodename>". Note that you must use the "--perf=vtune" Slurm flag to run any hardware event-based analyses, such as "memory-access," "advanced-hotspots," and "general-exploration." This flag inserts an Intel kernel module that is needed for these hardware event-based analyses.

For a version of vtune other than the default, you can specify it in the #SBATCH directive, for example:

#!/bin/bash
#SBATCH -q debug
#SBATCH --perf=vtune/2017.up2 # note that vtune string must match loaded vtune module name
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j
#SBATCH -C knl,quad,cache

# on 2 Cori KNL nodes

module load vtune/2017.up2 # note that we load a specific version of vtune here
srun -n 128 -c 4 amplxe-cl -collect general-exploration -r res_dir -trace-mpi -no-auto-finalize -- ./a.out

The vtune version passed to #SBATCH must match the version loaded at submit time, and the version loaded in the script.
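A quick consistency check on a job script before submitting can catch a mismatch between the two version strings inside it. The sketch below is illustrative only: the function name check_vtune_versions is hypothetical, and it assumes the script contains one "#SBATCH --perf=..." line and one "module load vtune..." line. It cannot see which module is loaded in your login shell, so that still needs to be checked by hand (e.g., with "module list").

```shell
# Hypothetical helper: compare the version named in "#SBATCH --perf=" with
# the one in the "module load vtune..." line of the same job script.
check_vtune_versions() {
    script=$1
    # First "--perf=" value, up to the first whitespace.
    perf=$(sed -n 's/^#SBATCH[[:space:]][[:space:]]*--perf=\([^[:space:]]*\).*/\1/p' "$script" | head -n 1)
    # First "module load vtune..." argument, up to the first whitespace.
    loaded=$(sed -n 's/^[[:space:]]*module load[[:space:]][[:space:]]*\(vtune[^[:space:]]*\).*/\1/p' "$script" | head -n 1)
    if [ "$perf" = "$loaded" ]; then
        echo "OK: $perf"
    else
        echo "MISMATCH: --perf=$perf but module load $loaded"
        return 1
    fi
}

# Usage: check_vtune_versions myjob.sh && sbatch myjob.sh
```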

Note that the Cori KNL example includes an extra flag, "-no-auto-finalize" (renamed to "-finalization-mode=none" in VTune 2017). This flag is useful because the last step of a VTune collection, called "finalization", can take a very long time on KNL nodes. It is generally faster to skip finalization on the KNL compute nodes, and then manually finalize the collection database on a login node via

amplxe-cl -finalize -r <collection_dir>

Finalization is relatively fast on Edison and Cori Haswell compute nodes, so for collections on those nodes this manual finalization procedure is not necessary.
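Because a multi-node collection writes one result directory per node ("res_dir.<nodename>"), manual finalization has to be repeated for each of them. A minimal sketch of such a loop follows; the function name finalize_results and the DRY_RUN variable are hypothetical conveniences, and only the "amplxe-cl -finalize -r" command itself comes from the text above.

```shell
# Hypothetical helper: finalize every per-node result directory PREFIX.<nodename>.
# Set DRY_RUN=1 to print the commands instead of running them.
finalize_results() {
    prefix=$1
    for dir in "$prefix".*; do
        [ -d "$dir" ] || continue   # skips the unexpanded pattern when nothing matches
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "amplxe-cl -finalize -r $dir"
        else
            amplxe-cl -finalize -r "$dir"
        fi
    done
}

# Usage on a login node, after "module load vtune":
#   finalize_results res_dir
```

As recommended under 'Known Issues and Workarounds', the loop handles one database at a time rather than launching the finalizations in parallel.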

Here is an example of running an MPI/OpenMP hybrid code with VTune on 2 nodes of Edison:

#!/bin/bash
#SBATCH -q debug
#SBATCH --perf=vtune
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j

# collect MPI/OpenMP code profile on 2 nodes of Edison

module load vtune

export OMP_NUM_THREADS=12
srun -n 4 -c 24 amplxe-cl -collect memory-access -knob analyze-openmp=true -knob analyze-mem-objects=true -r res_dir -trace-mpi -- ./a.out

(A collection of this type on Cori Haswell or Cori KNL would look similar, differing only in the "-C haswell" or "-C knl,quad,cache" flags, the "--perf=vtune" flag, as well as the task/thread values.)

The above example is a VTune "memory-access" analysis running on two nodes with 4 MPI tasks and 12 threads per task. In addition to the "-collect," "-r," and "-trace-mpi" options (as explained above), a "-knob" option, "-knob analyze-mem-objects=true," is used to map the hardware events to memory objects (by instrumenting memory allocation and de-allocation). This option, which is turned off by default, is especially useful for identifying the data objects in the source code that generate high memory traffic. You may also want to use the "-knob mem-object-size-min-thres=<size>" option to specify the minimum size of memory allocations to analyze, which helps reduce the runtime overhead of the instrumentation. The default value is 1024 bytes. For example, the following amplxe-cl command line will analyze only those memory allocations that are larger than 4096 bytes.

export OMP_NUM_THREADS=12
srun -n 4 -c 24 amplxe-cl -collect memory-access -knob analyze-openmp=true -knob analyze-mem-objects=true -knob mem-object-size-min-thres=4096 -r res_dir -trace-mpi -- ./a.out

To use hyperthreading, for example, with two nodes and 4 MPI tasks (2 per node), each task with 24 threads, you can do

export OMP_NUM_THREADS=24
srun -n 4 -c 24 amplxe-cl -collect memory-access -knob analyze-openmp=true -knob analyze-mem-objects=true -r res_dir -trace-mpi -- ./a.out
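The "-c" values in the srun lines above follow from the node layout. The sketch below shows one way to compute them; the function name cpus_per_task is hypothetical, and the node sizes passed to it (24 cores with 2 hardware threads each on Edison, 32 cores with 2 on Cori Haswell, 68 cores with 4 on Cori KNL) are not stated in this page and are treated here as assumptions that reproduce the values used in the examples.

```shell
# Hypothetical helper: logical CPUs to request per MPI task ("srun -c").
# Arguments: cores per node, hardware threads per core, MPI tasks per node.
cpus_per_task() {
    cores=$1
    hw_threads=$2
    tasks_per_node=$3
    # Whole cores per task (integer division), times hardware threads per core.
    echo $(( cores / tasks_per_node * hw_threads ))
}

cpus_per_task 24 2 2    # Edison, 2 tasks per node   -> 24
cpus_per_task 32 2 32   # Cori Haswell, 32 per node  -> 2
cpus_per_task 68 4 64   # Cori KNL, 64 per node      -> 4
```

OMP_NUM_THREADS is then chosen separately: half of the "-c" value on Edison to run one thread per core (12 in the first example), or the full value to use hyperthreading (24 in the last one).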

To run VTune interactively on compute nodes, do

module load vtune                     # or load a specific version
salloc -N 2 -q debug --perf=vtune

When the interactive session starts and you get a shell prompt, you can run (e.g., on Edison):

module load vtune
export OMP_NUM_THREADS=12
srun -n 4 -c 24 amplxe-cl -collect memory-access -knob analyze-openmp=true -knob analyze-mem-objects=true -r res_dir -trace-mpi -- ./a.out

The available analysis types and the amplxe-cl command line options

The available analysis types are:

Option               Description
advanced-hotspots    Advanced Hotspots
concurrency          Concurrency
cpugpu-concurrency   CPU/GPU Concurrency (not relevant on Edison/Cori)
general-exploration  General Exploration
hotspots             Basic Hotspots
hpc-performance      HPC Performance Characterization
locksandwaits        Locks and Waits
memory-access        Memory Access

You can run VTune with more options to customize the analyses. The available command line options for amplxe-cl can be found at https://software.intel.com/en-us/node/544244. You can also get the options (not a complete list) by running the following commands on Edison

module load vtune
amplxe-cl -help
amplxe-cl -help collect

You can also use the configuration options, the "-knob" options, for each analysis type to further customize the analysis. The available "-knob" options can be found at https://software.intel.com/en-us/node/544270. You can also get the list of the available "-knob" options via the "amplxe-cl -help collect <analysis type>" command.
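To compare the knobs of several analysis types at once, the help command can be run in a loop. This is a small sketch: the function name show_knobs and the AMPLXE_CL override variable are hypothetical, and the only VTune command used is the "amplxe-cl -help collect <analysis type>" form mentioned above.

```shell
# Hypothetical helper: print the "-knob" help for each analysis type given.
# AMPLXE_CL can point to a different executable; it defaults to amplxe-cl.
show_knobs() {
    for analysis in "$@"; do
        echo "== $analysis =="
        "${AMPLXE_CL:-amplxe-cl}" -help collect "$analysis"
    done
}

# Usage, after "module load vtune":
#   show_knobs memory-access general-exploration | less
```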

Using the VTune GUI to display profiling results on a login node

After the profiling data is collected, do the following to display the results on a login node, e.g., on Edison:

ssh -Y edison.nersc.gov -l <YourUserName>
module load vtune 
amplxe-gui

Then choose the "Open Result" option in the GUI to display your data (open the file with the .amplxe extension under the directory where your profiling data was saved, e.g., res_dir.nid00576/run_dir.nid00576.amplxe). 
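If you are unsure where the result files ended up, they can be located from the shell before starting the GUI. The following is a sketch only; the function name list_result_files is hypothetical, and it simply searches for the .amplxe extension mentioned above.

```shell
# Hypothetical helper: list all VTune result files (*.amplxe) below a directory.
list_result_files() {
    find "${1:-.}" -type f -name '*.amplxe' | sort
}

# Usage, from the directory where the job ran:
#   list_result_files .
```

Each path printed is a file that can be selected with "Open Result" in the GUI.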

Remote users are strongly encouraged to use the NX software, which can speed up the VTune GUI significantly.

Please refer to this Intel presentation about how to optimize your code using VTune on Edison.

Using the VTune GUI

To use the VTune GUI, log in to the desired system with X11 forwarding enabled (e.g., "ssh -XY"). Then execute the following commands:

% module load vtune
% amplxe-gui

This will open a VTune window. If this is the first time that you are using the tool, you need to create a "project" for performance profiling sessions. Otherwise, you will see the previous projects and their data collection events in the left frame of the window. You may also create new projects using the "New Project" button even if you have existing projects.

To create a new project, click on 'New Project...'

[Figure: VTune Welcome screen]

In the 'Create a Project' window, choose a name for the project and the location where performance data for the project will be stored, as shown below. A directory with the same name as the project name will be created under the directory specified in the 'Location' field, and all the analysis data under this project will be stored under the project directory.

[Figure: VTune create project]

Once the project name and its location are set, you need to specify run and data collection configurations in the 'Project Properties' window. Please see the snapshot below for an example.

[Figure: VTune project properties, part 1]

You need to do the following:

  • Select the target system from the options on the left-hand side.
  • Enter the application name in the required text box. You may also browse for and select it.
  • In the following text box, enter the application parameters.
  • Uncheck the checkbox for 'Use application directory as working directory' and enter your working directory in the 'Working directory' text field.
  • Enter environment variables and their values by clicking the 'Modify...' button under the 'User-defined environment variables' area.

Scroll down to set the runtime estimate so that VTune chooses a proper sampling frequency. You can also set a limit on the amount of sampling data collected. Note that you can also specify the directory where the results will be created. By default, it will be the directory specified in the 'Create a Project' window earlier (see above for more information).

[Figure: VTune project properties, part 2]

You should use the 'Binary Symbol Search' and 'Source Search' buttons on the right to specify the paths to the additional binary files and the source file. Then click on the 'Choose Analysis' button to select the type of analysis.

You can review the settings for a project at any time by clicking on the 'Project Properties...' button in the top task bar (the third one from the left) or by pressing 'Ctrl+P'.

To open an existing Project, click on 'Open Project...', browse for the project directory and select the '.amplxeproj' file in it.

[Figure: VTune open project]

You can start a new performance analysis by clicking on any of the analysis menu items listed in the main window ('Concurrency Analysis', 'Basic Hotspots Analysis',..., 'New Analysis...'). Otherwise, you can click on the orange triangle button from the toolbar at the top of the pane.

[Figure: VTune Analyze]

Select an analysis type in the left frame among the preset types: 'Advanced Hotspots', 'General Exploration' and 'Bandwidth'. You can even create your own custom type (see the 'Custom Analysis' menu).

[Figure: VTune Running analysis]

Then click on the 'Start' button on the right to start the analysis.

Analysis Runs

Analysis cases will appear under the current project in the 'Project Navigator' region in the far-left window frame. Also, when a new analysis is performed, a tab is added to the toolbar at the very top of the main window. If you want to examine a previous result, double-click on the result name in the 'Project Navigator' area, and a tab for the analysis results will appear in the main window.

A case name includes the sequence number (incrementing from 0) in the project and the analysis type ('ah', 'ge', and 'bw' for 'Advanced Hotspots', 'General Exploration' and 'Bandwidth' analysis types, respectively), as in 'r000ah', 'r001ge', 'r002bw', etc. 
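The naming scheme can be decoded mechanically, which is handy when a project directory holds many runs. A minimal sketch, assuming only the three type codes listed above; the function name describe_result is hypothetical, and any other code is reported as unknown rather than guessed.

```shell
# Hypothetical helper: split a case name such as "r001ge" into its
# sequence number and analysis type.
describe_result() {
    name=$1
    num=$(echo "$name" | sed 's/^r\([0-9]*\).*/\1/')   # digits after the leading "r"
    code=$(echo "$name" | sed 's/^r[0-9]*//')          # what follows the digits
    case "$code" in
        ah) type="Advanced Hotspots" ;;
        ge) type="General Exploration" ;;
        bw) type="Bandwidth" ;;
        *)  type="unknown type '$code'" ;;
    esac
    echo "run $num: $type"
}

describe_result r000ah   # prints: run 000: Advanced Hotspots
describe_result r001ge   # prints: run 001: General Exploration
```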

For information on some analysis types available in VTune, refer to the page on Analysis Types in VTune Amplifier.

Helpful Resources for More Information

Known Issues and Workarounds

  • Unload the darshan module on Edison or Cori before compiling your code; otherwise, VTune fails to collect data.
  • Running static binaries with VTune may segfault, so link your codes dynamically.
  • On Edison and Cori, do not run VTune jobs on GPFS-based file systems (/project and $HOME). Instead, run them on the Lustre file systems ($SCRATCH and $CSCRATCH) so that VTune saves the profiling data there. VTune requires a file system that supports mmap, but the GPFS file systems on Edison and Cori are projected onto the compute nodes via the Cray DVS layer, which does not support mmap.
  • It is recommended to run one analysis at a time in a single job script. VTune jobs may hang if several analysis jobs are run in a single job script. 
  • Collection finalization is very slow on KNL compute nodes. To avoid this problem, add the "-no-auto-finalize" (or "-finalization-mode=none" in version 2017 and newer) flag to the compute node collection, and then manually finalize the database on a login node via the "-finalize" flag to "amplxe-cl".
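A job script can guard against the file system issue above by refusing to start a collection outside of scratch. The following is a sketch under stated assumptions: the function name is_under is hypothetical, and the check is purely textual (it compares path prefixes and does not resolve symbolic links or query the file system type).

```shell
# Hypothetical helper: succeed if PATH ($1) is BASE ($2) or lies below it.
is_under() {
    [ -n "$2" ] || return 1          # an unset or empty base never matches
    case "$1/" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage at the top of a job script:
#   is_under "$PWD" "$SCRATCH" || { echo "run VTune from \$SCRATCH" >&2; exit 1; }
```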

Installed Versions

Package                   Platform  Category                  Version       Module              Install Date  Date Made Default
Intel VTune Amplifier XE  carl      applications/performance  2016.up2      vtune/2016.up2      2016-04-26    2016-04-26
Intel VTune Amplifier XE  carl      applications/performance  2016.up3      vtune/2016.up3      2016-04-26    2016-07-06
Intel VTune Amplifier XE  carl      applications/performance  2016.up4      vtune/2016.up4      2016-07-12
Intel VTune Amplifier XE  cori      applications/performance  2016.up4      vtune/2016.up4      2016-07-28    2016-07-28
Intel VTune Amplifier XE  cori      applications/performance  2017.up2      vtune/2017.up2      2017-03-08
Intel VTune Amplifier XE  cori      applications/performance  2018.0        vtune/2018.0        2017-10-04
Intel VTune Amplifier XE  cori      applications/performance  2018.up1      vtune/2018.up1      2017-12-19
Intel VTune Amplifier XE  edison    applications/performance  2016.up4      vtune/2016.up4      2017-07-27
Intel VTune Amplifier XE  edison    applications/performance  2017.up2      vtune/2017.up2      2017-07-27    2017-07-27
Intel VTune Amplifier XE  edison    applications/performance  2018.up1      vtune/2018.up1      2018-01-06    2018-01-09
Intel VTune Amplifier XE  gerty     applications/performance  2017_pre_zip  vtune/2017_pre_zip  2016-10-04
Intel VTune Amplifier XE  gerty     applications/performance  2018.0        vtune/2018.0        2017-09-16    2017-09-16
Intel VTune Amplifier XE  gerty     applications/performance  2018.beta     vtune/2018.beta     2017-09-15    2017-09-15
Intel VTune Amplifier XE  gerty     applications/performance  2018.up1      vtune/2018.up1      2018-01-08    2018-01-08