hpmcount and hpmlib on Seaborg

hpmcount runs an application and then reports execution wall clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics. The overhead involved in using hpmcount is very small. hpmlib is a library that provides an API to HPM from within a program.

IPM is a NERSC utility that provides aggregate performance metrics derived from hardware counter data as well as MPI timings. Detailed IPM results are also available on the web. IPM cannot be used concurrently with hpmcount or hpmlib.

How to use hpmcount

To use the hpmcount utility, you do not need to modify your code. Note: To guarantee correct hpmcount results, your code must be compiled with -qarch=pwr3 (or -qarch=auto). By default, hpmcount writes performance statistics for each task to standard output. hpmcount is part of the HPM Toolkit. NERSC has compiled an explanation of some of the hpmcount output. All users will be interested in the reported value of Floating point instructions + FMA rate, which is the quantity usually called the "FLOP" rate. The units reported are "Mflip/s", which is equivalent to "Mflop/s" for most purposes.

Usage
Aggregate numbers for parallel jobs can be obtained by using IPM.

Caution: a common mistake is to run either of the following:

    hpmcount poe ./executable_name
    hpmcount ./executable_name     (if compiled with an mp* compiler)

Both of these are incorrect for parallel codes. They return the hardware performance data for the poe controlling process, not for your code.

Examples

Three examples of the use of hpmcount are provided to illustrate various optimization levels for a matrix-matrix multiply computation. Examples are generally provided in Fortran, C and C++.
We've compiled an explanation of the hardware counter output. The IBM documentation for the HPM Toolkit provides detailed descriptions of all the output measures provided by hpmcount, as well as information on the lower-level API libhpm and on hpmviz, a visualization tool that displays output files produced using libhpm. An example is provided in Fortran, C and C++ demonstrating the use of the low-level API to instrument distinct code sections, and the use of the hpmviz tool to visualize the analysis results. For parallel codes, IPM is available to aggregate hpmcount information across multiple processors. The use of IPM is demonstrated with the following example.

Example 1: Unoptimized Matrix-Matrix Multiply

The following is a sample code which performs a matrix-matrix multiply. It is provided in Fortran, C and C++. Running this code under hpmcount produces the following output. Note that some of these results vary slightly for the same compiler depending on machine loading. The initial "adding counter" lines of the output indicate the events that hpmcount is tracking. After the code runs, hpmcount provides four sections of output: the execution wall clock time, the resource usage statistics, the hardware performance counter values, and the derived hardware metrics.
The Mflip/s metric near the end of the output indicates Millions of (arithmetic) FLoat Instructions Per Second. This reflects the floating point arithmetic operations performed by the code. These values may be used to deduce the computational efficiency of the code and possibly suggest optimization strategies. In general:

    Mflip/s between   0- 100   (code needs optimization)
    Mflip/s between 100- 400   (code may need some optimization)
    Mflip/s between 400- 800   (well optimized code)
    Mflip/s between 800-1500   (very well optimized code or IBM library)

The Fortran example code reports only around 8 Mflip/s, a very small percentage of the peak for a Power3 processor (1500 Mflip/s). The C and C++ example codes report about 200 Mflip/s, which is better. Some other useful values are:

    Maximum resident set size         : Amount of memory the code used
    PM_TLB_MISS (TLB misses)          : Use of cache (should be small)
    Avg number of loads per TLB miss  : Should be high (about 300 or more)

This code used only about 23 megabytes of memory. The Fortran version had about 2 billion TLB misses and only about 1 load per miss, which indicates that the code was making inefficient use of the memory bandwidth. The C and C++ versions had only about 2 million TLB misses and over 1000 loads per miss, so they made much more efficient use of memory bandwidth.

Example 2: Optimized Matrix-Matrix Multiply

The memory access pattern of the Fortran code in Example 1 can be made more efficient if the order of the indices in the nested loops is changed from i, k, j to j, k, i. Fortran stores arrays in column-major order, so making i the innermost loop index results in a memory stride of one.
      do j=1,index
         do k=1,index
            do i=1,index
The results of running the rearranged Fortran code under hpmcount are shown here. Notice that all of the examples shown thus far execute the same number of Floating point instructions + FMAs, just over 2 billion operations. Reordering the indices in the Fortran code has improved its performance to the level of the C and C++ versions shown earlier.

Example 3: Using ESSL Library Function

Here is a third example, which uses the IBM Engineering and Scientific Subroutine Library (ESSL) function for the matrix-matrix multiply. The function name is DGEMM.

Fortran

The Fortran code is the same as that shown above, except that the triple-nested loop that does the matrix-matrix multiply is replaced by a single library call:
call DGEMM('N','N',N,N,N,1.0d0,matrixa,N,matrixb,N,0.0d0,mres,N)
C/C++

Notice that C and C++ store matrices differently from Fortran, so the transpose of the matrices is used to obtain the same result as the Fortran code. The C++ version of Example 3 is quite similar to the C++ version of Example 1, with the additional #include <essl.h> and the replacement of the triple nested for loops with a call to dgemm, as shown in the C example above. The output from hpmcount is shown below:

    % xlf90 -o ex3 -lessl ex3.f
    ** _main === End of Compilation 1 ===
    1501-510 Compilation successful for file ex3.f.
    % hpmcount ./ex3
    adding counter 5 event 12 Cycles
    adding counter 0 event 1 Instructions completed
    adding counter 7 event 0 TLB misses
    adding counter 2 event 9 Stores completed
    adding counter 3 event 5 Loads completed
    adding counter 4 event 5 FPU 0 instructions
    adding counter 1 event 35 FPU 1 instructions
    adding counter 6 event 9 FMAs executed
    mres( 1000 1000 )= 666166500.000000000
    hpmcount (V 2.3.1) summary
    Total execution time (wall clock time): 2.377406 seconds
    ######## Resource Usage Statistics ########
    Total amount of time in user mode            : 2.140000 seconds
    Total amount of time in system mode          : 0.180000 seconds
    Maximum resident set size                    : 31704 Kbytes
    Average shared memory use in text segment    : 924 Kbytes*sec
    Average unshared memory use in data segment  : 6367120 Kbytes*sec
    Number of page faults without I/O activity   : 7933
    Number of page faults with I/O activity      : 1
    Number of times process was swapped out      : 0
    Number of times file system performed INPUT  : 0
    Number of times file system performed OUTPUT : 0
    Number of IPC messages sent                  : 0
    Number of IPC messages received              : 0
    Number of signals delivered                  : 0
    Number of voluntary context switches         : 2
    Number of involuntary context switches       : 238
    ####### End of Resource Statistics ########
    PM_CYC (Cycles)                               : 797311835
    PM_INST_CMPL (Instructions completed)         : 1634801891
    PM_TLB_MISS (TLB misses)                      : 3127858
    PM_ST_CMPL (Stores completed)                 : 14477496
    PM_LD_CMPL (Loads completed)                  : 518258411
    PM_FPU0_CMPL (FPU 0 instructions)             : 506399346
    PM_FPU1_CMPL (FPU 1 instructions)             : 499241814
    PM_EXEC_FMA (FMAs executed)                   : 1000001941
    Utilization rate                              : 89.418 %
    Avg number of loads per TLB miss              : 165.691
    Load and store operations                     : 532.736 M
    Instructions per load/store                   : 3.069
    MIPS                                          : 687.641
    Instructions per cycle                        : 2.050
    HW Float points instructions per Cycle        : 1.261
    Floating point instructions + FMAs            : 2005.643 M
    Float point instructions + FMA rate           : 843.627 Mflip/s
    FMA percentage                                : 99.719 %
    Computation intensity                         : 3.765

    % xlc -o ex3 -lessl ex3.c
    % hpmcount ./ex3

Summary

This demonstrates a striking improvement in efficiency when using the highly optimized library routines in ESSL. The Fortran code runs 100 times faster than the original version in Example 1 with inefficient memory stride, and even the C version runs 50 times faster than the original version with an efficient memory stride. The performance results for the C++ version are quite similar to those for the C version of this example.

How to use hpmlib

Example 4: Sections of Unoptimized Matrix-Matrix Multiply

The following is a sample code which performs a matrix-matrix multiply. The code has three separate sections instrumented using the libhpm functions: the matrix initialization, the matrix-matrix multiply, and the final output.
The example is provided in Fortran, C and C++. HPM data collection is initialized with a call to f_hpminit for Fortran or hpmInit for C/C++. The data is identified by an integer task identifier (e.g., zero for serial codes, MPI rank for MPI codes), and a text string. Data collection is concluded with a call to f_hpmterminate for Fortran or hpmTerminate for C/C++. When HPM is terminated, the data is written to a .viz file. Individual sections for HPM data collection are delimited by calls to f_hpmstart and f_hpmstop for Fortran or hpmStart and hpmStop for C/C++. Individual sections are identified by a number and a text string. Notice that the module hpmtoolkit must be loaded before the code can be compiled.
! filename: ex4.f
! compile:  module load hpmtoolkit
!           xlf -o ex4 ex4.f -qsuffix=cpp=f $HPMTOOLKIT
! run:      ./ex4
      implicit none
      integer, PARAMETER :: index=1000
      REAL*8 matrixa(index,index),matrixb(index,index)
      REAL*8 mres(index,index)
      INTEGER i,j,k,n
#include "f_hpm.h"
! Start hpm monitoring
      call f_hpminit (0, "ex4.f")
! Initialize the Matrix arrays
      call f_hpmstart(1, "initialize matrices")
      do i=1,index
         do j=1,index
            matrixa(i,j) = real(i+j)
            matrixb(i,j) = real(j-i)
            mres(i,j) = 0.0
         end do
      end do
      call f_hpmstop(1)
! Matrix-Matrix Multiply
      call f_hpmstart(2, "matrix-matrix multiply")
      N = index
      do i=1,index
         do k=1,index
            do j=1,index
               mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
            end do
         end do
      end do
      call f_hpmstop(2)
      call f_hpmstart(3, "final output")
      write(*,*)'mres(',n,n,')=',mres(n,n)
      call f_hpmstop(3)
! End hpm monitoring
      call f_hpmterminate(0)
      stop
      end
The C version of the example:
/* filename: ex4.c
   compile:  module load hpmtoolkit
             xlc -o ex4 ex4.c $HPMTOOLKIT
   run:      ./ex4
*/
#include <stdio.h>
#include "libhpm.h"
#define INDEX 1000
int main ()
{
    int index=INDEX;
    double matrixa[INDEX][INDEX], matrixb[INDEX][INDEX],
           mres[INDEX][INDEX];
    int i,j,k,n;
    /* Start HPM monitoring */
    hpmInit(0,"ex4.c");
    /* Initialize the Matrix arrays */
    hpmStart(1, "initialize matrices");
    for (i=0; i<INDEX; i++) {
        for (j=0; j<INDEX; j++) {
            matrixa[i][j] = i+j+2;
            matrixb[i][j] = j-i;
            mres[i][j] = 0;
        }
    }
    hpmStop(1);
    /* Matrix-Matrix Multiply */
    hpmStart(2, "matrix-matrix multiply");
    n = INDEX;
    for (i=0; i<INDEX; i++) {
        for (k=0; k<INDEX; k++) {
            for (j=0; j<INDEX; j++) {
                mres[i][j] = mres[i][j] + matrixa[i][k]*matrixb[k][j];
            }
        }
    }
    hpmStop(2);
    hpmStart(3, "final output");
    printf("mres(%d,%d)=%f\n", n, n, mres[n-1][n-1]);
    hpmStop(3);
    /* End hpm monitoring */
    hpmTerminate(0);
    return 0;
}
And the C++ version of the example. Notice that the name of the include file has changed from the C version.
// filename: ex4.C
// compile:  module load hpmtoolkit
//           xlC -o ex4 ex4.C $HPMTOOLKIT
// run:      ./ex4
#include <iostream.h>
#include <libhpm.H>
#define INDEX 1000
int main ()
{
    int index=INDEX;
    double matrixa[INDEX][INDEX], matrixb[INDEX][INDEX],
           mres[INDEX][INDEX];
    int i,j,k,n;
    // Start HPM monitoring
    hpmInit(0,"ex4.C");
    // Initialize the Matrix arrays
    hpmStart(1, "initialize matrices");
    for (i=0; i<INDEX; i++) {
        for (j=0; j<INDEX; j++) {
            matrixa[i][j] = i+j+2;
            matrixb[i][j] = j-i;
            mres[i][j] = 0;
        }
    }
    hpmStop(1);
    // Matrix-Matrix Multiply
    hpmStart(2, "matrix-matrix multiply");
    n = INDEX;
    for (i=0; i<INDEX; i++) {
        for (k=0; k<INDEX; k++) {
            for (j=0; j<INDEX; j++) {
                mres[i][j] = mres[i][j] + matrixa[i][k]*matrixb[k][j];
            }
        }
    }
    hpmStop(2);
    hpmStart(3, "final output");
    cout.setf(ios::fixed);
    cout << "mres(" << n << ","
         << n << ")=" << mres[n-1][n-1] << endl;
    hpmStop(3);
    // End hpm monitoring
    hpmTerminate(0);
    return 0;
}
Compiling and running the Example 4 code produces the following output.

    % module load hpmtoolkit
    % xlf -o ex4 ex4.f -qsuffix=cpp=f $HPMTOOLKIT
    ** _main === End of Compilation 1 ===
    1501-510 Compilation successful for file ex4.f.
    % ./ex4
    adding counter 5 event 12 Cycles
    adding counter 0 event 1 Instructions completed
    adding counter 7 event 0 TLB misses
    adding counter 2 event 9 Stores completed
    adding counter 3 event 5 Loads completed
    adding counter 4 event 5 FPU 0 instructions
    adding counter 1 event 35 FPU 1 instructions
    adding counter 6 event 9 FMAs executed
    mres( 1000 1000 )= 666166500.000000000
    libHPM output in perfhpm0000.64482

    % module load hpmtoolkit
    % xlc -o ex4 ex4.c $HPMTOOLKIT
    % ./ex4
    adding counter 5 event 12 Cycles
    adding counter 0 event 1 Instructions completed
    adding counter 7 event 0 TLB misses
    adding counter 2 event 9 Stores completed
    adding counter 3 event 5 Loads completed
    adding counter 4 event 5 FPU 0 instructions
    adding counter 1 event 35 FPU 1 instructions
    adding counter 6 event 9 FMAs executed
    mres(1000,1000)=666166500.000000
    libHPM output in perfhpm0000.81814

    % module load hpmtoolkit
    % xlC -o ex4 ex4.C $HPMTOOLKIT
    % ./ex4
    adding counter 5 event 12 Cycles
    adding counter 0 event 1 Instructions completed
    adding counter 7 event 0 TLB misses
    adding counter 2 event 9 Stores completed
    adding counter 3 event 5 Loads completed
    adding counter 4 event 5 FPU 0 instructions
    adding counter 1 event 35 FPU 1 instructions
    adding counter 6 event 9 FMAs executed
    mres(1000,1000)=666166500.000000
    libHPM output in perfhpm0000.70542

The .viz files that are created are: hpm0000_ex4.f_64482.viz for Fortran, hpm0000_ex4.c_81814.viz for C, and hpm0000_ex4.C_70542.viz for C++. Notice that these names include the arguments to the hpmInit function, and are also connected, by means of the process id, to the text output files. The text output file for the Fortran version of this example is shown below. Notice that there is an overall summary of resource usage, and then separate information for each instrumented section of the code.

    % cat perfhpm0000.64482
    libhpm (Version 2.3.1) summary - running on POWER3-II
    Total execution time of instrumented code (wall time): 289.56 seconds
    ######## Resource Usage Statistics ########
    Total amount of time in user mode            : 289.140000 seconds
    Total amount of time in system mode          : 0.350000 seconds
    Maximum resident set size                    : 23784 Kbytes
    Average shared memory use in text segment    : 166252 Kbytes*sec
    Average unshared memory use in data segment  : 685331496 Kbytes*sec
    Number of page faults without I/O activity   : 5966
    Number of page faults with I/O activity      : 22
    Number of times process was swapped out      : 0
    Number of times file system performed INPUT  : 0
    Number of times file system performed OUTPUT : 0
    Number of IPC messages sent                  : 0
    Number of IPC messages received              : 0
    Number of signals delivered                  : 0
    Number of voluntary context switches         : 10
    Number of involuntary context switches       : 28960
    ####### End of Resource Statistics ########

    Instrumented section: 1 - Label: initialize matrices - process: 0
    file: ex4.f, lines: 17 <--> 25
    Count: 1
    Wall Clock Time: 0.739748 seconds
    Total time in user mode: 0.582817780163786 seconds
    PM_CYC (Cycles)                               : 218591825
    PM_INST_CMPL (Instructions completed)         : 44007010
    PM_TLB_MISS (TLB misses)                      : 3006426
    PM_ST_CMPL (Stores completed)                 : 6002003
    PM_LD_CMPL (Loads completed)                  : 12001002
    PM_FPU0_CMPL (FPU 0 instructions)             : 4000023
    PM_FPU1_CMPL (FPU 1 instructions)             : 999988
    PM_EXEC_FMA (FMAs executed)                   : 0
    Utilization rate                              : 78.786 %
    Avg number of loads per TLB miss              : 3.992
    Load and store operations                     : 18.003 M
    Instructions per load/store                   : 2.444
    MIPS                                          : 59.489
    Instructions per cycle                        : 0.201
    HW Float points instructions per Cycle        : 0.023
    Floating point instructions + FMAs            : 5.000 M
    Float point instructions + FMA rate           : 6.759 Mflip/s
    FMA percentage                                : 0.000 %
    Computation intensity                         : 0.278

    Instrumented section: 2 - Label: matrix-matrix multiply - process: 0
    file: ex4.f, lines: 28 <--> 37
    Count: 1
    Wall Clock Time: 288.821541 seconds
    Total time in user mode: 287.304251770298 seconds
    PM_CYC (Cycles)                               : 107756496862
    PM_INST_CMPL (Instructions completed)         : 29007007012
    PM_TLB_MISS (TLB misses)                      : 2001221941
    PM_ST_CMPL (Stores completed)                 : 2002002004
    PM_LD_CMPL (Loads completed)                  : 8001001002
    PM_FPU0_CMPL (FPU 0 instructions)             : 1000016230
    PM_FPU1_CMPL (FPU 1 instructions)             : 0
    PM_EXEC_FMA (FMAs executed)                   : 1000016230
    Utilization rate                              : 99.475 %
    Avg number of loads per TLB miss              : 3.998
    Load and store operations                     : 10003.003 M
    Instructions per load/store                   : 2.900
    MIPS                                          : 100.432
    Instructions per cycle                        : 0.269
    HW Float points instructions per Cycle        : 0.009
    Floating point instructions + FMAs            : 2000.032 M
    Float point instructions + FMA rate           : 6.925 Mflip/s
    FMA percentage                                : 100.000 %
    Computation intensity                         : 0.200

    Instrumented section: 3 - Label: final output - process: 0
    file: ex4.f, lines: 39 <--> 41
    Count: 1
    Wall Clock Time: 0.001785 seconds
    Total time in user mode: 0.000283860815953749 seconds
    PM_CYC (Cycles)                               : 106320
    PM_INST_CMPL (Instructions completed)         : 25864
    PM_TLB_MISS (TLB misses)                      : 116
    PM_ST_CMPL (Stores completed)                 : 6008
    PM_LD_CMPL (Loads completed)                  : 4985
    PM_FPU0_CMPL (FPU 0 instructions)             : 345
    PM_FPU1_CMPL (FPU 1 instructions)             : 85
    PM_EXEC_FMA (FMAs executed)                   : 75
    Utilization rate                              : 15.881 %
    Avg number of loads per TLB miss              : 42.974
    Load and store operations                     : 0.011 M
    Instructions per load/store                   : 2.353
    MIPS                                          : 14.490
    Instructions per cycle                        : 0.243
    HW Float points instructions per Cycle        : 0.004
    Floating point instructions + FMAs            : 0.001 M
    Float point instructions + FMA rate           : 0.283 Mflip/s
    FMA percentage                                : 29.703 %
    Computation intensity                         : 0.046

The hpmviz tool is an X-Windows application that provides a graphical user interface (GUI) to examine the separate sections of the code.
The tool is started with the command

    % hpmviz

After hpmviz has started, the various .viz files can be opened with the pull-down File menu. A sample screen shot, after loading the three .viz files for this example, is shown by following the link below.
hpmviz screen shot
The left side of the screen contains the instrumented sections of the code and timing information. The right side contains the source code. Tabs at the top of each half-screen window control the file being displayed. Clicking with the left mouse button on the name of an instrumented section on the left highlights the corresponding source code section on the right. Clicking with the right mouse button on the name of an instrumented section on the left opens a new window with detailed HPM information.
hpmviz screen shot
The hpmviz tool is particularly useful for multiprocessor computations with multiple instrumented sections.

Example 5: Parallel Library Function (PESSL)

Here is an example which uses the IBM Parallel Engineering and Scientific Subroutine Library (PESSL) function for the matrix-matrix multiply. The function name is PDGEMM. The code is provided in separate source files. Running this code under IPM for parallel HPM produces the following output. The parallel code is slower than the serial code for a problem of this size (1000 by 1000 matrices), due in part to communication overhead. For larger problems (such as 10,000 by 10,000 matrices), which cannot fit on a single node, performance approaches 800 Mflip/s per processor for parallel computations. Performance values are comparable whether the IBM PESSL library or the NERSC/ScaLAPACK library is used. Instructions for compiling and linking the codes using either library set are included in the source code files above.
Page last modified: Thu, 24 Jan 2008 20:19:14 GMT
Page URL: http://www.nersc.gov/nusers/resources/software/ibm/hpmcount/
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov