NERSC: Powering Scientific Discovery Since 1974

False Sharing Detection using VTune Amplifier

Introduction

VTune Amplifier has a dedicated analysis type called “Memory Access”. It focuses on memory-related issues, including:

  • Performance problems caused by the memory hierarchy (e.g. L1-, L2-, LLC-, or DRAM-bound code)
  • Bandwidth-limited accesses, including DRAM, QPI, and MCDRAM (on KNL) bandwidth
  • NUMA problems

It also provides performance metrics for individual memory objects or data structures.

For information on how to use VTune Amplifier on the NERSC systems, please refer to the page on VTune.

Collecting Data

Using the GUI

Select "Memory Access" from the left-hand pane of the "Analysis Type" tab.

[Screenshot: Memory Access analysis configuration options]

As shown in the above screenshot, three configuration options can be specified:

  • Analyze memory objects: This enables the instrumentation of memory allocation/de-allocation and maps hardware events to memory objects. It may cause additional runtime overhead because all system memory allocation/de-allocation APIs are instrumented.
  • Minimal memory object size to track, in bytes: This sets the minimum size of the memory allocations to analyze. Raising it helps to reduce the runtime overhead of the instrumentation.
  • Evaluate max DRAM bandwidth: This option is enabled by default for the Memory Access analysis type. It measures the peak DRAM bandwidth at the beginning of collection, so that the user can see whether the bandwidth used reaches the maximum.

Using the Command Line

To run a "Memory Access" analysis from the command line, use the following command:

amplxe-cl -c memory-access -knob analyze-mem-objects=true -knob mem-object-size-min-thres=1024 -- <app>

The options used here are briefly described below:

  • <app> denotes the application name.
  • mem-object-size-min-thres=1024 sets the minimal memory object size to 1024 bytes.
  • analyze-mem-objects=true enables the instrumentation of memory allocation/de-allocation.

Viewing Performance Metrics for Different Memory Objects 

In the grid view, select a grouping level containing Memory Object or Memory Object Allocation Source. The following screenshot shows a sample grid view.

[Screenshot: sample grid view grouped by memory object]

Only metrics based on DLA-capable hardware events are applicable to the memory object grouping levels. Thus, the "CPU Time" metric, which is based on the regular "Clockticks" event, is not applicable and is empty for memory object rows in the grid view.

The grid view shows information about memory object allocations. For example, a row labeled “lin_stream.cpp:100 (152 MB)” means that the memory object was allocated on line 100 of the source file lin_stream.cpp and that the size of the allocation was 152 MB.

The types of memory objects are:

  • Dynamic: Memory objects that are allocated on the heap using malloc, new, and similar functions.
  • Global: Global or static variables.
  • Stack: Local variables. VTune Amplifier does not recognize individual local variables, so all references to stack memory are associated with one virtual memory object named "[Stack]".
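As an illustration of the three categories above, the following minimal C sketch touches one buffer of each kind. The names global_table, stack_buf, heap_buf, and touch_all_kinds are hypothetical and chosen only for this example; the comments restate the grouping described in the list, not the exact labels VTune would display for this code.

```c
#include <assert.h>
#include <stdlib.h>

#define N 1024

/* Global/static storage: falls under the "Global" category. */
static double global_table[N];

/* Writes to and then sums one global, one stack, and one heap buffer. */
static double touch_all_kinds(void)
{
    double stack_buf[N];                              /* Stack: local variable */
    double *heap_buf = malloc(N * sizeof *heap_buf);  /* Dynamic: heap allocation */
    double total = 0.0;

    assert(heap_buf != NULL);
    for (int i = 0; i < N; i++) {
        global_table[i] = 1.0;
        stack_buf[i] = 2.0;
        heap_buf[i] = 3.0;
    }
    for (int i = 0; i < N; i++)
        total += global_table[i] + stack_buf[i] + heap_buf[i];

    free(heap_buf);
    return total;
}
```

Calling touch_all_kinds() from main() in a profiled run would generate accesses to all three kinds of memory objects; the heap buffer is 8 KB here, so it would pass a minimal object size threshold of 1024 bytes.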

Instrumented APIs

VTune instruments the following memory APIs out of the box:

  • Standard memory allocation/de-allocation API from libc: for example, mmap, malloc, calloc, free, etc.
  • Jemalloc: the same API as libc, plus mallocx, dallocx, etc.
  • Memkind: this includes hbw_malloc, hbw_free, etc.

Custom allocators can also be marked with the __itt_heap_* functions:

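#include <ittnotify.h>

// my_allocator is an __itt_heap_function handle, created once beforehand with
// __itt_heap_function_create(); user_defined_malloc() is the custom allocator.
void* my_malloc(size_t s)
{
    void* p;
    __itt_heap_allocate_begin(my_allocator, s, 0);
    p = user_defined_malloc(s);
    __itt_heap_allocate_end(my_allocator, &p, s, 0);
    return p;
}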

Analyzing False Sharing

False sharing occurs when several threads access different data items that happen to lie on the same cache line. Each write by one thread invalidates the line in the other threads' caches, so the resulting coherence traffic degrades performance and should be avoided. To detect a false sharing issue, run a Memory Access analysis with the "Analyze Memory Objects" option enabled.
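Whether two data items can falsely share a line comes down to address arithmetic. The following minimal sketch assumes a 64-byte cache line (typical on x86, but an assumption here); the helper names cache_line_of, same_cache_line, and lines_spanned are hypothetical and are not part of VTune or any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; check the target CPU before relying on it. */
#define CACHE_LINE_SIZE 64

/* Index of the cache line that contains address p. */
static uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* Nonzero when both addresses fall on the same cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return cache_line_of(a) == cache_line_of(b);
}

/* Number of cache lines between base[0] and base[n - 1], inclusive. */
static size_t lines_spanned(const int *base, size_t n)
{
    assert(n > 0);
    return (size_t)(cache_line_of(&base[n - 1]) - cache_line_of(base)) + 1;
}
```

For a 64-byte-aligned array of sixteen 4-byte integers, same_cache_line(&a[0], &a[15]) is nonzero and lines_spanned(a, 16) is 1, so threads that each write to their own element still contend for a single line.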

Code to Cause False Sharing

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 16

int main(int argc, char *argv[])
{
    // All 16 counters (64 bytes in total) fit on one 64-byte cache line
    int A[NUM_THREADS] __attribute__((aligned(64)));
    int n = (argc > 1) ? atoi(argv[1]) : 0;
    int i;
    long long sum = 0;  // the total exceeds the range of int for n = 1000000000

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int thread = omp_get_thread_num();
        A[thread] = 0;
        // Each thread repeatedly writes to its own element of A
        #pragma omp for schedule(static)
        for (i = 0; i <= n; i++)
            A[thread] += i % 10;
        #pragma omp atomic
        sum += A[thread];
    }
    printf("The sum is %lld\n", sum);
    return 0;
}

Compiling the Code

The code must be compiled with the -g option and with all optimizations turned off (-O0), so that the compiler does not remove the false sharing on its own, for example by keeping the per-thread accumulator in a register. The -g option is needed so that VTune can associate addresses with source lines.

% cc -qopenmp -g -O0 -o false_prob ./false_prob.c

Detection of False Sharing using the VTune Amplifier

Use the following commands to run the code and collect results with VTune Amplifier:

% module load vtune
% salloc -p debug -t 30:00 --vtune
% amplxe-cl -collect memory-access -knob analyze-mem-objects=true -knob mem-object-size-min-thres=32 --search-dir sym:p=/global/homes/a/ahanarc --search-dir bin:p=/global/homes/a/ahanarc -- ./false_prob 1000000000

After relinquishing the job allocation, use the following command to open VTune in GUI mode:

% amplxe-gui

Opening the Result

To open the result, click on 'Open Result' in the welcome screen.

[Screenshot: 'Open Result' on the welcome screen]

Then browse for and open the .amplxe file in the desired result directory.

[Screenshot: selecting the .amplxe file]

Identifying the False Sharing Issue

In the result for the above code, the elapsed time is quite high, given that the operations performed are inexpensive and that 16 threads run in parallel. The Memory Bound percentage is also high. Together, these suggest that there could be a false sharing issue.

[Screenshot: false sharing, Summary view]

Click on the 'Memory Bound' link to switch to the Bottom-Up view. In the Bottom-Up view, select the 'OpenMP Region/ Function/ Call Stack' grouping to see the time taken and other information for the serial and OpenMP parts of the code separately.

Here, we see that the parallel region is indeed memory bound and also has a relatively high average latency. Since most of the time is spent executing the parallel part of the code, the parallel region being memory bound has a greater impact than the serial region being memory bound.

[Screenshot: false sharing, Bottom-Up view]

Double click on the row for the parallel region to view the source code and the specific part of the code that causes the memory bound issue.

[Screenshot: false sharing, source view]

From the above screenshot, it is evident that the for loop executed in parallel is memory bound, even though it does not use any large arrays. Each thread writes to its own element of the array A, but all 16 elements (64 bytes in total) lie on the same cache line. Hence, this is a case of false sharing.

Modifying the Code to Resolve the False Sharing Issue

The most common method of resolving false sharing is padding, and we use it in this example. Each thread's element is padded to a full row of 16 integers (64 bytes), so that each thread writes to a separate cache line and the false sharing is eliminated.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 16

int main(int argc, char *argv[])
{
    // Each row is 16 ints (64 bytes), so every thread's counter has its own cache line
    int A[NUM_THREADS][16] __attribute__((aligned(64)));
    int n = (argc > 1) ? atoi(argv[1]) : 0;
    int i;
    long long sum = 0;  // the total exceeds the range of int for n = 1000000000

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int thread = omp_get_thread_num();
        A[thread][0] = 0;
        // Only the first element of each row is used; the rest is padding
        #pragma omp for schedule(static)
        for (i = 0; i <= n; i++)
            A[thread][0] += i % 10;
        #pragma omp atomic
        sum += A[thread][0];
    }
    printf("The sum is %lld\n", sum);
    return 0;
}
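The padding arithmetic behind this change can be checked with a small sketch. It assumes a 64-byte cache line and a 64-byte-aligned array base; the names padded_counter, slot_stride_bytes, and slots_are_separated are hypothetical. The struct shows an alternative way to express the same padding, not the layout used in the code above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes; check the target CPU before relying on it. */
#define CACHE_LINE_SIZE 64

/* Alternative layout: one counter plus explicit padding up to a full line. */
struct padded_counter {
    int value;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
};

/* Distance in bytes between the elements written by consecutive threads
   in an int[NUM_THREADS][cols] array. */
static size_t slot_stride_bytes(size_t cols)
{
    assert(cols > 0);
    return cols * sizeof(int);
}

/* Nonzero when consecutive threads write to different cache lines,
   assuming the array base is aligned to CACHE_LINE_SIZE. */
static int slots_are_separated(size_t cols)
{
    return slot_stride_bytes(cols) >= CACHE_LINE_SIZE;
}
```

With one column per thread, as in the original array, the stride is sizeof(int) bytes and the slots are not separated; with 64 / sizeof(int) columns (16 for 4-byte integers) the stride reaches the assumed line size, which is what the modified code relies on.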

Compiling and Executing the Modified Code

As before, the code must be compiled with the -g option and with all optimizations turned off (-O0), so that the measurement is comparable with the original version.

% cc -qopenmp -g -O0 -o false_sol ./false_sol.c

Use the following commands to run the code and collect results with VTune Amplifier:

% module load vtune
% salloc -p debug -t 30:00 --vtune
% amplxe-cl -collect memory-access -knob analyze-mem-objects=true -knob mem-object-size-min-thres=32 --search-dir sym:p=/global/homes/a/ahanarc --search-dir bin:p=/global/homes/a/ahanarc -- ./false_sol 1000000000

After relinquishing the job allocation, use the following command to open VTune in GUI mode:

% amplxe-gui

Use the GUI to open the .amplxe file in the desired result directory.

Analyzing the Modified Code

Once again, run a Memory Access analysis with the "Analyze Memory Objects" option enabled.

On analyzing the modified code, we get the following result:

[Screenshot: false sharing resolved, Summary view]

The Bottom-Up tab also shows that the memory bound percentage of the parallel region has decreased to 0.6%.

[Screenshot: false sharing resolved, Bottom-Up view]

Thus, introducing the padding decreased the execution time of the code by resolving the false sharing issue.