National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Hardware Counter Information

Thanks to Scientific Supercomputing Center Karlsruhe for some of the following information.

The following is a description of the output from the hardware counters:

PM_CYC (Cycles)
The number of machine cycles used by the program. The CPU time used, in seconds, is this number divided by the clock frequency of 375 MHz.
PM_INST_CMPL (Instruction completed)
The number of instructions that were executed by the program.
PM_TLB_MISS (TLB misses)
When a program is running it uses virtual memory addresses. To access physical memory, the system must map these virtual addresses to physical memory addresses during program execution. This mapping is done in units of 4 kB pages, and the system keeps the translations for recently used pages in fast memory, the Translation Lookaside Buffer (TLB). When the program accesses memory whose page is mapped in the TLB, the translation is done very quickly. When it accesses memory whose page is not in the TLB, a TLB miss occurs and the address translation takes many cycles, which slows down memory access significantly. This counter shows the number of TLB misses.
PM_ST_CMPL (Stores completed)
The number of store instructions which move data from a register to memory.
PM_LD_CMPL (Loads completed)
The number of load instructions which copy data into a register.
PM_FPU0_CMPL (FPU 0 instructions)
The POWER3 processor has two Floating Point Units (FPU) which operate in parallel. Each FPU can start a new instruction at every cycle. This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by the first FPU.
PM_FPU1_CMPL (FPU 1 instructions)
This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by the second FPU.
PM_EXEC_FMA (FMAs executed)
The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA) and occurs commonly in many codes, particularly those that perform matrix operations. The compiler will generate FMA instructions as often as possible. This counter shows the number of FMAs executed by the program.
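The effect of memory access patterns on PM_TLB_MISS can be illustrated with a toy model. The following is a minimal sketch, not a model of the real POWER3 hardware: the TLB size (`TLB_ENTRIES`), the least-recently-used replacement policy, and the function name `loads_per_tlb_miss` are assumptions chosen only for illustration. It counts TLB misses for a sequence of 8-byte loads at a fixed stride over 4 kB pages.

```python
from collections import OrderedDict

PAGE_BYTES = 4096     # 4 kB pages, as described under PM_TLB_MISS
ELEMENT_BYTES = 8     # one 8-byte data element
TLB_ENTRIES = 256     # hypothetical TLB size, for illustration only


def loads_per_tlb_miss(num_loads, stride_elements):
    """Return loads per TLB miss for loads at a fixed element stride.

    The TLB is modelled as a small LRU cache of page numbers (an assumption).
    """
    tlb = OrderedDict()  # page numbers, ordered from least to most recently used
    misses = 0
    for i in range(num_loads):
        address = i * stride_elements * ELEMENT_BYTES
        page = address // PAGE_BYTES
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark page as most recently used
        else:
            misses += 1                  # miss: translation must be fetched
            tlb[page] = None
            if len(tlb) > TLB_ENTRIES:
                tlb.popitem(last=False)  # evict the least recently used page
    return num_loads / misses


sequential = loads_per_tlb_miss(100_000, 1)    # consecutive elements
strided = loads_per_tlb_miss(100_000, 512)     # one element per page
print(f"sequential: {sequential:.1f} loads per TLB miss")
print(f"strided:    {strided:.1f} loads per TLB miss")
```

In this model, sequential access gives roughly 512 loads per miss (one miss per page of 512 elements), while a stride of 512 elements touches a new page on every load and gives one load per miss.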
Various derived quantities are reported by hpmcount:
Utilization Rate
The ratio of CPU time (see PM_CYC above) to wall clock time. For a task on a dedicated compute node this ratio will be extremely close to 1.
Average number of loads per TLB miss
The ratio PM_LD_CMPL/PM_TLB_MISS. Each time a TLB miss occurs, the address translation for a new page is brought into the buffer. Each page is 4 kB and holds 512 data elements of 8 bytes each, so an average number of loads per TLB miss in the 500 range indicates that each data element is being accessed once (on average). Higher values indicate that data is being reused while its page address is in the TLB. A small value indicates that needed data is stored in widely separated places in memory, and a redesign of data structures may help performance significantly.
Load and store operations
The sum of PM_ST_CMPL and PM_LD_CMPL.
Instructions per load/store
The ratio of PM_INST_CMPL to the number of load and store operations. A low value of this metric indicates that the code is dominated by data movement operations (load/store) rather than computation. A value of 2 means that half the instructions are used for moving data. A larger value can be reached by reusing data in registers.
MIPS
The average number of instructions completed per second, in millions. The POWER3 can execute several instructions in parallel.
Instructions per cycle
The average number of instructions completed per clock cycle. Well-tuned programs may reach more than 2 instructions per cycle.
Float point instructions + FMAs
The number of floating point operations performed by the code. The number of FMAs is added to the number of floating point instructions because one FMA instruction performs two floating point operations.
Float point instructions + FMA rate
This is the MFlops rate. The POWER3 has a peak performance of 1500 MFlops (2 FPUs, each executing one FMA, i.e. two floating point operations, per cycle at a clock frequency of 375 MHz).
FMA percentage
The percentage of floating point instructions that are FMA calculations of the form x=s*a+b.
Computation intensity
The ratio of floating point operations to the total number of loads and stores.
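The derived quantities above can be reproduced from the raw counters with a few lines of arithmetic. The following is a minimal sketch based only on the descriptions in this page, not on the hpmcount source: the function name `derived_metrics`, the dictionary keys, and the sample counter values are hypothetical, and the use of wall clock time as the denominator for the MIPS and MFlops rates is an assumption, since the text does not say which time is used.

```python
CLOCK_HZ = 375e6  # POWER3 clock frequency quoted above


def derived_metrics(c, wall_seconds):
    """Compute derived quantities from a dict of raw counter values."""
    cpu_seconds = c["PM_CYC"] / CLOCK_HZ
    load_store = c["PM_LD_CMPL"] + c["PM_ST_CMPL"]
    fp_instructions = c["PM_FPU0_CMPL"] + c["PM_FPU1_CMPL"]
    # Each FMA counts once as an instruction and once more here,
    # because it performs two floating point operations.
    flops = fp_instructions + c["PM_EXEC_FMA"]
    return {
        "utilization_rate": cpu_seconds / wall_seconds,
        "loads_per_tlb_miss": c["PM_LD_CMPL"] / c["PM_TLB_MISS"],
        "load_store_ops": load_store,
        "instructions_per_load_store": c["PM_INST_CMPL"] / load_store,
        # Assumption: rates are taken over wall clock time.
        "mips": c["PM_INST_CMPL"] / wall_seconds / 1e6,
        "instructions_per_cycle": c["PM_INST_CMPL"] / c["PM_CYC"],
        "flops": flops,
        "mflops": flops / wall_seconds / 1e6,
        # Follows the wording above: FMAs as a share of FP instructions.
        "fma_percentage": 100.0 * c["PM_EXEC_FMA"] / fp_instructions,
        "computation_intensity": flops / load_store,
    }


# Hypothetical counter values, invented for illustration only.
sample = {
    "PM_CYC": 3_750_000_000,
    "PM_INST_CMPL": 6_000_000_000,
    "PM_TLB_MISS": 4_000_000,
    "PM_ST_CMPL": 1_000_000_000,
    "PM_LD_CMPL": 2_000_000_000,
    "PM_FPU0_CMPL": 1_000_000_000,
    "PM_FPU1_CMPL": 500_000_000,
    "PM_EXEC_FMA": 750_000_000,
}

metrics = derived_metrics(sample, wall_seconds=10.0)
for name, value in metrics.items():
    print(f"{name:30s} {value:,.2f}")
```

With these invented counters the program reports 225 MFlops, well below the 1500 MFlops peak, and 500 loads per TLB miss, which matches the case where each data element is accessed about once. The exact formulas used by hpmcount may differ in detail from this sketch.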

Page last modified: Wed, 24 Nov 2004 18:28:49 GMT
Page URL: http://www.nersc.gov/nusers/resources/software/ibm/hpmcount/counter.php