NERSC: Powering Scientific Discovery Since 1974

VTune Temp

Introduction

Intel VTune Amplifier XE is a performance analysis tool that enables you to find serial and parallel code bottlenecks and speed execution. VTune provides both the GUI (executable: amplxe-gui) and the command line interface (executable: amplxe-cl).

VTune is available at NERSC on Edison, Babbage (the testbed for Cori), and the KNL white boxes (whose login node is carl.nersc.gov). VTune on Edison (Cray XC30) is not yet an officially supported product. Specifically, using the VTune GUI to launch applications onto Edison compute nodes is not officially supported by Intel. We strongly recommend using the command line tool, amplxe-cl, to collect profiling data via batch jobs, and then displaying the profiling data with the GUI, amplxe-gui, on an Edison login node.

Note that the performance of the X Windows-based Graphical User Interface can be greatly improved if used in conjunction with the free NX software.

NOTE for Edison users: Run VTune in the Lustre file systems ($SCRATCH or $SCRATCH3)! Do not run in the global file systems such as a project directory, and global homes ($HOME). Read the section 'Known issues and workarounds' for more info.
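The note above can be turned into a small pre-flight check. The following is a minimal sketch, assuming $SCRATCH and $SCRATCH3 hold absolute paths as they do on Edison; the function name in_lustre_scratch is hypothetical and is not part of VTune or any NERSC tool. Symbolic links are not resolved.

```shell
# Hypothetical helper: succeed if a directory lies under $SCRATCH or $SCRATCH3.
# Unset or empty variables are skipped. The body runs in a subshell, so the
# variables it sets do not leak into the caller.
in_lustre_scratch() (
    dir=$1
    for root in "${SCRATCH:-}" "${SCRATCH3:-}"; do
        [ -n "$root" ] || continue
        root=${root%/}              # drop a trailing slash, if any
        case "$dir/" in
            "$root"/*) exit 0 ;;    # dir is the root itself or below it
        esac
    done
    exit 1
)

# Usage (prints a warning when the current directory is elsewhere):
#   in_lustre_scratch "$PWD" || echo "run VTune from \$SCRATCH or \$SCRATCH3" >&2
```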

Using VTune on Edison

As we mentioned in the introduction, we recommend using the command line tool, "amplxe-cl", to collect profiling data via batch jobs, and then using the GUI, amplxe-gui, to display the results on a login node on Edison. Here are the steps to use VTune on Edison.

Compiling codes to run with VTune

In principle, VTune works with executables compiled with all three compilers (Intel, GNU, Cray) available on Edison. However, we recommend using Intel compilers to avoid running into unexpected issues, as VTune is an Intel product. To compile codes to work with VTune, use the -g and the -dynamic options with the compiler wrappers (ftn, cc, and CC). For example, to compile a Fortran MPI+OpenMP code, do

% module unload darshan
% ftn -g -dynamic -openmp -O3 jacobi.f90 

To compile a C MPI + OpenMP code, do

% module unload darshan
% cc -g -dynamic -openmp -O3 jacobi.c

Here, the -g option is needed to assist VTune to associate addresses to source lines, and the -dynamic option is needed to build dynamically linked applications with the compiler wrappers on Edison (the compiler wrappers, ftn, cc, and CC, link applications statically by default). Sometimes, it is desirable to use the -debug inline-debug-info compile option to generate enhanced debug information for inlined code.

% ftn -g -dynamic -openmp -O3 -debug inline-debug-info jacobi.f90 

VTune works best with dynamically linked applications, although it also works with statically linked applications (see Intel website for more details). When you execute a dynamically linked application, you may need to set the LD_LIBRARY_PATH environment variable and/or load the same modules that you used when compiling your code.

You can build your applications with full compiler optimizations as you would normally do. However, some compiler flags, such as -fast, may not work well with VTune. Please check this Intel website for more compiler options. There you will find both recommended and unrecommended options to work with VTune.

Note that the darshan module is an I/O profiler that is enabled by default on Edison. VTune does not work with the darshan module, so you need to unload it before compiling your codes.
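The compile steps above can be sketched as two small helpers. This is an illustration under stated assumptions: it relies on the colon-separated $LOADEDMODULES variable that the Environment Modules system maintains, and the function names darshan_loaded and vtune_compile_cmd are hypothetical. The second helper only prints the compile line so that it can be reviewed before it is run.

```shell
# Succeed if a darshan module appears in $LOADEDMODULES
# (assumption: modules are listed as name or name/version, separated by colons).
darshan_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:darshan:*|*:darshan/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print a VTune-friendly compile line.
# $1 is a compiler wrapper (ftn, cc, or CC); the rest are source files.
vtune_compile_cmd() {
    wrapper=$1; shift
    echo "$wrapper -g -dynamic -openmp -O3 $*"
}

# Usage:
#   darshan_loaded && echo "run 'module unload darshan' before compiling" >&2
#   vtune_compile_cmd ftn jacobi.f90
```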

Collecting profiling data via batch jobs on Edison

For a pure MPI code, this is an example job script:

#!/bin/bash -l
#SBATCH -q regular
#SBATCH --vtune
#SBATCH -t 02:00:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j

module load vtune
srun -n 48 amplxe-cl -collect general-exploration -r res_dir --trace-mpi -- ./a.out

This will run the general-exploration analysis (-collect <analysis type>) on two nodes with 48 tasks in total. The --trace-mpi option allows the collectors to trace MPI code (the default is --no-trace-mpi) and to determine the MPI rank IDs if the code is linked to a non-Intel MPI library. The -r option specifies where to write the result. The result for each node will be saved in a separate directory named res_dir.<nodename>. Note that you must use the --vtune flag to run any hardware event-based analysis, such as memory-access, advanced-hotspots, and general-exploration, because the kernel drivers needed for these analyses are installed dynamically at job start time only if the --vtune flag is used in the job script.
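Forgetting the --vtune flag is easy to do, so a job script can be checked before submission. The sketch below is an illustration only: the function names are hypothetical, and the list of hardware event-based analysis types contains just the three that the text names, so it may be incomplete.

```shell
# Analysis types the text names as hardware event-based (possibly incomplete).
is_hw_event_analysis() {
    case $1 in
        memory-access|advanced-hotspots|general-exploration) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if the job script ($1) contains an "#SBATCH --vtune" directive.
has_vtune_flag() {
    grep -Eq '^#SBATCH[[:space:]]+--vtune([[:space:]]|$)' "$1"
}

# $1 = job script, $2 = analysis type. Prints a warning and fails when a
# hardware event-based analysis is requested without the flag.
check_job_script() {
    if is_hw_event_analysis "$2" && ! has_vtune_flag "$1"; then
        echo "warning: $1 runs $2 without #SBATCH --vtune"
        return 1
    fi
    echo "ok: $1"
}

# Usage (myjob.sh is a placeholder file name):
#   check_job_script myjob.sh general-exploration
```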

Here is an example of running an MPI/OpenMP hybrid code (applicable to pure OpenMP codes as well) with VTune:

#!/bin/bash -l
#SBATCH -q regular
#SBATCH --vtune
#SBATCH -t 02:00:00
#SBATCH -N 2
#SBATCH -J myjob
#SBATCH -o myjob.o%j

module load vtune

export OMP_NUM_THREADS=12
srun -n 4 -c 12 amplxe-cl -collect memory-access -knob analyze-mem-objects=true -r res_dir --trace-mpi -- ./a.out

This is a VTune memory-access analysis running on two nodes with 4 MPI tasks and 12 threads per task. In addition to the -collect, -r, and --trace-mpi options (explained above), the -knob analyze-mem-objects=true option is used to map the hardware events to memory objects (by instrumenting memory allocation and de-allocation). This option, which is turned off by default, is especially useful for identifying the data objects in the source code that generate high memory traffic. You may also want to use the -knob mem-object-size-min-thres=<size> option to specify a minimal size of memory allocations to analyze, which helps reduce the runtime overhead of the instrumentation. The default value is 1024 bytes. For example, the following amplxe-cl command line will analyze only those memory allocations that are larger than 4096 bytes.

export OMP_NUM_THREADS=12
srun -n 4 -c 12 amplxe-cl -collect memory-access -knob analyze-mem-objects=true -knob mem-object-size-min-thres=4096 -r res_dir --trace-mpi -- ./a.out
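When several runs differ only in the size threshold, it can help to assemble the command line in one place. The following is a minimal sketch that only prints the amplxe-cl command, using the options shown above; the function name mem_access_cmd and its argument order are hypothetical.

```shell
# Hypothetical helper: print an amplxe-cl memory-access command line.
# $1 = result directory
# $2 = minimal object size in bytes (pass "" to keep the default)
# remaining arguments = the application and its arguments
mem_access_cmd() {
    res_dir=$1; min_size=$2; shift 2
    cmd="amplxe-cl -collect memory-access -knob analyze-mem-objects=true"
    if [ -n "$min_size" ]; then
        cmd="$cmd -knob mem-object-size-min-thres=$min_size"
    fi
    echo "$cmd -r $res_dir --trace-mpi -- $*"
}

# Print the command for a 4096-byte threshold.
mem_access_cmd res_dir 4096 ./a.out
```

The printed line can then be placed after "srun -n 4 -c 12" in the job script.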

To use hyperthreading, for example with two nodes and 4 MPI tasks in total, each task with 24 threads, you can do

export OMP_NUM_THREADS=24
srun -n 4 -c 12 amplxe-cl -collect memory-access -knob analyze-mem-objects=true -r res_dir --trace-mpi -- ./a.out
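A quick arithmetic check shows how many threads such a layout places on each node. This is a sketch under two assumptions that are not stated in the text: tasks are spread evenly across nodes, and an Edison node offers 48 logical CPUs with hyperthreading (24 physical cores, two hardware threads each). The function names are hypothetical.

```shell
# Print the number of OpenMP threads that land on each node.
# $1 = nodes, $2 = total MPI tasks, $3 = threads per task.
# Assumes tasks are distributed evenly across the nodes.
threads_per_node() {
    echo $(( $2 * $3 / $1 ))
}

# Succeed if the layout fits on a node with $4 logical CPUs.
fits_on_node() {
    [ "$(threads_per_node "$1" "$2" "$3")" -le "$4" ]
}

# Two nodes, 4 tasks, 24 threads per task, 48 logical CPUs per node (assumed).
threads_per_node 2 4 24
fits_on_node 2 4 24 48 && echo "layout fits"
```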

To run VTune interactively on Edison compute nodes, do

% salloc -N 2 -q debug --vtune

When the shell prompt appears in the interactive session, you can do

% module load vtune
% export OMP_NUM_THREADS=12
% srun -n 4 -c 12 amplxe-cl -collect memory-access -knob analyze-mem-objects=true -r res_dir --trace-mpi -- ./a.out

Available analysis types and the amplxe-cl command line options

The current default VTune on Edison is VTune 2016 Update 2 (module name is vtune/2016.up2). To specify an analysis type, use -collect. The syntax for performing an analysis with VTune Amplifier from the command line is as follows:

% amplxe-cl -collect  <analysis-type>  -- ./a.out

The available analysis types for Edison compute nodes (Ivy Bridge processors) are:

Option                 Description

advanced-hotspots      Advanced Hotspots
concurrency            Concurrency
general-exploration    General exploration
locksandwaits          Locks and Waits
memory-access          Memory Access
tsx-exploration        TSX Exploration
tsx-hotspots           TSX Hotspots
hotspots               Basic Hotspots
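A script that takes the analysis type as a parameter can validate it against the list above before submitting a job. The following is a minimal sketch; the function name analysis_label is hypothetical, and the list reflects only the types given above for Edison compute nodes.

```shell
# Print the description for a -collect analysis type, as listed above.
# Fails with a message on stderr for any type that is not in the list.
analysis_label() {
    case $1 in
        advanced-hotspots)   echo "Advanced Hotspots" ;;
        concurrency)         echo "Concurrency" ;;
        general-exploration) echo "General exploration" ;;
        locksandwaits)       echo "Locks and Waits" ;;
        memory-access)       echo "Memory Access" ;;
        tsx-exploration)     echo "TSX Exploration" ;;
        tsx-hotspots)        echo "TSX Hotspots" ;;
        hotspots)            echo "Basic Hotspots" ;;
        *) echo "unknown analysis type: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   analysis_label "$type" >/dev/null || exit 1
```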

You can run VTune with more options to customize the analysis. The available command line options for amplxe-cl can be found at https://software.intel.com/en-us/node/544244. You can also list the options (not a complete list) by running the following commands on Edison:

% module load vtune
% amplxe-cl -help
% amplxe-cl -help collect

You can also use the configuration options, the -knob options, for each analysis type to further customize the analysis. The available -knob options can be found at https://software.intel.com/en-us/node/544270 . You can also get the list of the available -knob options using the following command:

% amplxe-cl -help collect <analysis type> 

Note that this command has to be run on a compute node after the kernel drivers are installed (the kernel drivers are installed on the compute nodes dynamically, only at the start of a VTune job). To do so, first request an interactive session:

% salloc -N 1 -q debug --vtune

Wait for the job allocation and when the batch shell prompts, execute the following commands:

% module load vtune
% amplxe-cl -help collect <analysis-type>

Using the VTune GUI to display profiling results on a login node

After the profiling data is collected, do the following to display the results on an Edison login node:

% ssh -XY edison.nersc.gov -l <YourUserName>
% module load vtune 
% amplxe-gui

Then choose the "Open Result" option in the GUI to display your data.

[Screenshot: the "Open Result" option in the VTune GUI]

Open the file with the .amplxe extension under the directory where your result data was saved.

[Screenshot: selecting the .amplxe result file]

Remote users are strongly encouraged to use the NX software, which can speed up the VTune GUI significantly.
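Because a batch run writes one result directory per node (res_dir.<nodename>), it can be convenient to list the .amplxe files before opening them in the GUI. The sketch below is an illustration; the function name find_amplxe_files is hypothetical, and the node name in the comment is made up.

```shell
# Print every .amplxe file found in result directories named
# <prefix>.<nodename>, for example res_dir.nid00123 (illustrative node name).
# $1 = the prefix that was passed to the -r option.
find_amplxe_files() {
    for d in "$1".*; do
        [ -d "$d" ] || continue     # skips the unmatched pattern as well
        find "$d" -name '*.amplxe' -type f
    done
}

# Usage, from the directory where the job ran:
#   find_amplxe_files res_dir
```

Each path that is printed can then be selected through "Open Result" in the GUI.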

Please refer to this Intel presentation about how to optimize your code using VTune on Edison.

Using VTune GUI

To use the VTune GUI, log in to the desired system with the ssh -XY option. Then execute the following commands:

% module load vtune
% amplxe-gui

This will open a VTune window. If this is the first time you are using the tool, you need to create a "project" for performance profiling sessions. Otherwise, you will see the previous projects and their data collection events in the left frame of the window. You may also create new projects using the "New Project" button even if you have existing projects.

To create a new project, click on 'New Project...'.

[Screenshot: the 'New Project...' button]

In the 'Create a Project' window, choose a name for the project and the location where performance data for the project will be stored, as shown below. A directory with the same name as the project name will be created under the directory specified in the 'Location' field, and all the analysis data under this project will be stored under the project directory.

[Screenshot: the 'Create a Project' window]

Once the project name and its location are set, you need to specify run and data collection configurations in the 'Project Properties' window. Please see the snapshot below for an example.

[Screenshot: the 'Project Properties' window]

You need to do the following:

  • Select the target system from the options on the left-hand side.
  • Enter the application name in the required text box. You may also browse for and select it.
  • In the following text box, enter the application parameters.
  • Uncheck the checkbox for 'Use application directory as working directory' and enter your working directory in the 'Working directory' text field.
  • Enter environment variables and their values by clicking the 'Modify...' button under the 'User-defined environment variables' area.

Scroll down to set the runtime estimate so that VTune chooses a proper sampling frequency. You can also set a limit on the amount of sampling data collected. Note that you can also specify the directory where the results will be created. By default, it will be the directory specified in the 'Create a Project' window earlier (see above for more information).

[Screenshot: the lower part of the 'Project Properties' window]

Use the 'Binary Symbol Search' and 'Source Search' buttons on the right to specify the paths to the additional binary files and the source files. Then click on the 'Choose Analysis' button to select the type of analysis.

You can review the settings for a project at any time by clicking on the 'Project Properties...' button in the top task bar (the third one from the left) or by pressing 'Ctrl+P'.

You can start a new performance analysis by clicking on any of the analysis menu items listed in the main window ('Concurrency Analysis', 'Basic Hotspots Analysis', ..., 'New Analysis...'). Alternatively, click on the orange triangle button in the toolbar at the top of the pane.

[Screenshot: the analysis menu items and the toolbar]

Select an analysis type in the left frame among the preset types: 'Advanced Hotspots', 'General Exploration' and 'Bandwidth'. You can even create your own custom type (see the 'Custom Analysis' menu).

[Screenshot: selecting an analysis type]

Then click on the 'Start' button on the right to start the analysis.

Analysis Runs

Analysis cases will appear under the current project in the 'Project Navigator' region in the far-left window frame. Also, when a new analysis is performed, a tab is added to the toolbar at the very top of the main window. If you want to examine a previous result, double-click on the result name in the 'Project Navigator' area, and a tab for the analysis results will appear in the main window.

A case name includes the sequence number (incrementing from 0) in the project and the analysis type ('ah', 'ge', and 'bw' for 'Advanced Hotspots', 'General Exploration' and 'Bandwidth' analysis types, respectively), as in 'r000ah', 'r001ge', 'r002bw', etc. 
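The naming scheme can be reproduced with printf, which is handy when a script needs to predict or match result names. This is a minimal sketch; the function names are hypothetical, and only the three suffixes mentioned above are mapped.

```shell
# Print a result name such as r000ah.
# $1 = sequence number (decimal, without leading zeros), $2 = suffix.
result_name() {
    printf 'r%03d%s\n' "$1" "$2"
}

# Map the analysis types named above to their two-letter suffixes.
analysis_suffix() {
    case $1 in
        "Advanced Hotspots")   echo ah ;;
        "General Exploration") echo ge ;;
        "Bandwidth")           echo bw ;;
        *) return 1 ;;
    esac
}

# Third result in a project, from a 'Bandwidth' analysis.
result_name 2 "$(analysis_suffix "Bandwidth")"
```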

Some Analysis Types in VTune Amplifier

'Advanced Hotspots' analysis

The following example uses the 'Advanced Hotspots' type to search for performance hotspots in the code. This analysis type identifies time-consuming sections in your application.

[Screenshot: the 'Advanced Hotspots' analysis configuration]

The equivalent CLI command for the chosen analysis type is displayed when clicking the 'Command Line...' button at the lower right corner of the window:

[Screenshot: the 'Command Line...' dialog]

If you prefer to run VTune in CLI mode, you can use the command displayed in this box. The command for each of the other analysis types can be obtained in the same way.

Click the 'Start' button to run your code while collecting performance data. The results are presented in various ways under many tabs. Below are the results in the Summary and Bottom-up tabs:

[Screenshot: Advanced Hotspots results, Summary tab]

[Screenshot: Advanced Hotspots results, Bottom-up tab]

'General Exploration' analysis

A 'General Exploration' analysis identifies where micro-architectural issues affect the performance of your application. This analysis type uses hardware event-based sampling collection. Select the 'General Exploration' analysis type, select the relevant check boxes and then click the 'Start' button.

[Screenshot: the 'General Exploration' analysis configuration]

The Summary and Event Count pages from a General Exploration run look as follows:

[Screenshot: General Exploration results, Summary page]

There were a large number of hardware events, so we have collapsed them in the above screenshot.

[Screenshot: General Exploration results, Event Count page]

'Memory Access' analysis

A 'Memory Access' run identifies where memory bandwidth issues affect the performance of your application. This analysis uses hardware event-based sampling collection.

There are 3 configuration options that can be specified:

  • Analyze memory objects:
    • This enables the instrumentation of memory allocation/de-allocation and maps hardware events to memory objects.
    • It may cause additional runtime overhead due to the instrumentation of all system memory allocation/de-allocation API.
  • Minimal memory object size to track, in bytes: 
    • This allows the user to specify a minimal size of memory allocations to analyze. This option helps to reduce runtime overhead of the instrumentation.
  • Evaluate max DRAM bandwidth:
    • This option is enabled by default for the Memory Access analysis type. It measures peak DRAM bandwidth at the beginning of collection, which lets the user see whether the bandwidth used reaches the maximum.

Select the 'Memory Access' type of analysis, specify the other options, and then click the 'Start' button.

[Screenshot: the 'Memory Access' analysis configuration]

Below are the Summary and Bottom-up pages from a Memory Access run:

[Screenshot: Memory Access results, Summary page]

[Screenshot: Memory Access results, Bottom-up page]

'Concurrency' analysis

Concurrency analysis helps to identify hotspot functions where processor utilization is poor. Having a lot of cores idle in the hotspot region indicates that it is possible to improve performance by ensuring that there are more cores working simultaneously.

Concurrency analysis provides information on how many threads were running at each moment during the application execution. 

To perform a Concurrency analysis using the GUI, select the 'Concurrency Analysis' type from the Algorithm Analysis group.

[Screenshot: 'Concurrency Analysis' selected in the Algorithm Analysis group]

The Summary and the Bottom-up pages of the above run are:

[Screenshot: Concurrency results, Summary page]

This shows the number of cores used over time.

[Screenshot: Concurrency results, Bottom-up page]

The thread count includes threads that are currently running or ready to run, and that are therefore not waiting at a defined waiting or blocking API. CPU times marked by a red bar indicate that it is worthwhile to explore opportunities for improving performance by improving the utilization of processors during that time.

'Locks and Waits' analysis

If the Concurrency analysis shows that many of the cores are idle for a large portion of the time, it is worthwhile to perform a 'Locks and Waits' analysis. Locks and Waits analysis helps identify the cause of ineffective processor utilization. One of the most common problems is threads waiting too long on synchronization objects. This analysis helps you estimate the impact each synchronization object has on the application and shows how long the application had to wait on each synchronization object or in blocking APIs.

Select the 'Locks and Waits' analysis type and then click the 'Start' button.

[Screenshot: the 'Locks and Waits' analysis configuration]

The Summary and Bottom-up pages of the result of this run are given below:

[Screenshot: Locks and Waits results, Summary page]

[Screenshot: Locks and Waits results, Bottom-up page]

Helpful Resources for More Information

Installed Versions

Package                   Platform  Category                   Version       Module              Install Date  Date Made Default
Intel VTune Amplifier XE  carl      applications/ performance  2016.up2      vtune/2016.up2      2016-04-26    2016-04-26
Intel VTune Amplifier XE  carl      applications/ performance  2016.up3      vtune/2016.up3      2016-04-26    2016-07-06
Intel VTune Amplifier XE  carl      applications/ performance  2016.up4      vtune/2016.up4      2016-07-12
Intel VTune Amplifier XE  cori      applications/ performance  2018.up2      vtune/2018.up2      2018-05-09
Intel VTune Amplifier XE  edison    applications/ performance  2018.up2      vtune/2018.up2      2018-06-11    2018-06-11
Intel VTune Amplifier XE  edison    applications/ performance  2018up2       vtune/2018up2       2018-04-03    2018-06-11
Intel VTune Amplifier XE  gerty     applications/ performance  2017_pre_zip  vtune/2017_pre_zip  2016-10-04
Intel VTune Amplifier XE  gerty     applications/ performance  2018.0        vtune/2018.0        2017-09-16    2017-09-16
Intel VTune Amplifier XE  gerty     applications/ performance  2018.beta     vtune/2018.beta     2017-09-15    2017-09-15
Intel VTune Amplifier XE  gerty     applications/ performance  2018.up1      vtune/2018.up1      2018-01-08    2018-01-08