National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

IBM Compiler Optimization Argument Examples

This page describes how a variety of compiler optimization arguments affect the compile time and run time of several publicly available benchmarks.

Introduction

Publicly available benchmarks are compiled and run with several different sets of optimization options, and the performance is recorded. The time required to compile and link the code is also recorded, and the results are summarized.

The following information is given for each benchmark:

  • The source of the benchmark, the source code changes required to enable it to compile and run on the SP, and the way the compiler is invoked to produce the executable, or, if a makefile is used, the changes to the makefile required for the code to run on the SP.
  • The elapsed time for a single-threaded compile and link from source code for the given set of optimization arguments.
  • A performance measurement for the given set of options: the figure reported by the benchmark's internal timer (such as MFLOPS), if it has one; otherwise, either the elapsed time for a (possibly multi-threaded) run of the code on a dedicated system or the GFlops figure returned by the poe+ program.

The numbers given in the tables below for the individual benchmarks are the best of several dedicated runs in batch mode.

Linpack

The Linpack benchmark solves a dense system of linear equations. The version tested here is the 1000x1000 double precision version obtained from 1000d. It consists of a single 755-line source file containing 11 subroutines and functions in addition to the main driver. It is a simple Fortran 77 code originally written in 1978 and last modified in 1992.

Source Code Changes

All references to second() in the source were replaced with references to rtc().

Compile Changes

The code was compiled with the xlf compiler with no options beyond the optimization options.

Times

These runs were done with version 8.1.1.3 of xlf in December 2003.

The Compile Time is the wall-clock time for the compile and link, as returned by the Unix time command.

The MFLOPS result is that returned by the internal timer in the code.

Results

Optimization                          Compile Time  MFLOPS
unoptimized                                   0.52      49
-O2                                           1.36     243
-O3 -qstrict -qarch=pwr3 -qtune=pwr3          1.98     256
-O3 -qarch=pwr3 -qtune=pwr3                   2.07     257
-O3 -qhot -qarch=pwr3 -qtune=pwr3             3.75     256
-O4 -qnohot                                   4.08     264
-O4                                           6.56     264
-O5 -qnohot                                   4.19     266
-O5                                           5.96     263

Comments

There was no significant difference in the code's performance at any optimization level when the threaded compiler (xlf_r) was used or when the MASS library was included. In this case, the recommended optimization options, -O3 -qstrict -qarch=pwr3 -qtune=pwr3, give performance close to the best obtained with any other set of optimization options.

This example shows the limits of using compiler options alone to improve a code's performance: the best single-processor result obtained here reaches only 18% of the processor's theoretical peak performance of 1.5 GFlops.
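The percentage quoted above can be reproduced from the results table. The following is a small sketch in Python, not part of the benchmark; the function name percent_of_peak and the choice of rows are illustrative assumptions.

```python
# Figures copied from the Linpack results table and the text above.
PEAK_MFLOPS = 1500.0  # POWER3 theoretical peak, 1.5 GFlops

linpack_mflops = {
    "unoptimized": 49,
    "-O2": 243,
    "-O3 -qstrict -qarch=pwr3 -qtune=pwr3": 256,
    "-O5 -qnohot": 266,  # best result in the table
}


def percent_of_peak(mflops, peak=PEAK_MFLOPS):
    """Return a measured rate as a percentage of the theoretical peak."""
    return 100.0 * mflops / peak


for options, rate in linpack_mflops.items():
    print(f"{options:38s} {rate:4d} MFLOPS  {percent_of_peak(rate):5.1f}% of peak")
```

The best entry, 266 MFLOPS, works out to a little under 18% of peak, and the recommended options come within about one percentage point of that.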

Most of the work in this example is done by four BLAS routines, daxpy, ddot, dscal, and idamax, that are also in the IBM high performance ESSL library. However, when the benchmark versions of these routines are replaced with the ESSL routines the performance attained is no better than 250 MFlops, worse than with the benchmark versions.

Fortran Livermore Loops

This version of the double precision Livermore Loop benchmark was obtained from livermore. This is the 1991 update of the benchmark whose earliest version dates from the 1970's. It contains 24 numeric kernels written in fairly straightforward, uncomplicated Fortran 77. Several summary figures are returned by the program at the end of the run.

Source Code Changes

The only changes to the original source required were to the timing routines. The SECOND function definition in the main routine at line 556 was uncommented:

	REAL*8 SECOND

These three lines, 4469-4471, in the SECOND function were commented out and replaced with a call to the system elapsed time measurement function rtc():

C         REAL*4 CPUTYM(4), ETIME
C         XT= ETIME( CPUTYM)
C         SECOND=    CPUTYM(1)
	second=rtc()

Compile Changes

The code was compiled with the xlf compiler with no additional arguments beyond those for optimization.

Timers

The Compile Time result is the time in seconds required to compile and link the test program. To compare the effects of the various optimization levels, the Average (mean), Minimum, and Maximum MFLOPS rates for the loops, as returned by the code, are listed.

Livermore Loop MFLOPS

Optimization                          Compile Time (s)  Average  Minimum  Maximum
unoptimized                                       1.63       48       11      141
-O2                                               8.73      245       22      995
-O3 -qstrict -qarch=pwr3 -qtune=pwr3             14.22      322       60     1259
-O3 -qarch=pwr3 -qtune=pwr3                      15.07      339       60     1257
-O3 -qhot -qarch=pwr3 -qtune=pwr3                67.62      315       58      996
-O4 -qnohot                                      34.08      353       58     1532
-O4                                              80.04      321       58      998
-O5 -qnohot                                      49.10      360       58     1541
-O5                                              91.23      319       58      997

Comments

The performance of this benchmark is significantly degraded when the -qhot option is specified. Not only is the compile time greatly increased, but both the Average and Maximum MFLOPS are significantly worse than at the corresponding optimization level without -qhot. This may be because all of the loops are fairly small and uncomplicated, so that the sophisticated analysis and loop restructuring done by this option add too much overhead at execution time.
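To put numbers on this comparison, the sketch below pairs each optimization level with its -qhot counterpart and computes the compile-time ratio and the change in MFLOPS. It is written in Python for illustration only; the names LIVERMORE, PAIRS, and qhot_effect are assumptions, and the data are copied from the table above.

```python
# (compile seconds, average MFLOPS, maximum MFLOPS), copied from the table above.
# The -qarch=pwr3 -qtune=pwr3 arguments are omitted from the -O3 labels for brevity.
LIVERMORE = {
    "-O3": (15.07, 339, 1257),
    "-O3 -qhot": (67.62, 315, 996),
    "-O4 -qnohot": (34.08, 353, 1532),
    "-O4": (80.04, 321, 998),
    "-O5 -qnohot": (49.10, 360, 1541),
    "-O5": (91.23, 319, 997),
}

# Each pair is (level without -qhot, the same level with -qhot).
PAIRS = [("-O3", "-O3 -qhot"), ("-O4 -qnohot", "-O4"), ("-O5 -qnohot", "-O5")]


def qhot_effect(without_hot, with_hot):
    """Return (compile-time ratio, % change in average, % change in maximum)."""
    compile_ratio = with_hot[0] / without_hot[0]
    avg_change = 100.0 * (with_hot[1] - without_hot[1]) / without_hot[1]
    max_change = 100.0 * (with_hot[2] - without_hot[2]) / without_hot[2]
    return compile_ratio, avg_change, max_change


for base, hot in PAIRS:
    ratio, avg, mx = qhot_effect(LIVERMORE[base], LIVERMORE[hot])
    print(f"{base:12s} -> {hot:10s} compile x{ratio:.1f}, "
          f"average {avg:+.1f}%, maximum {mx:+.1f}%")
```

For every pair the average and maximum rates fall while the compile time rises, by a factor of roughly 4.5 at -O3.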

Another interesting feature is that two of the higher optimization levels that do not include -qhot, -O4 -qnohot and -O5 -qnohot, are reported as attaining a maximum MFLOPS rate greater than the theoretical peak performance of the POWER3 processor, 1.5 GFLOPS. The loop that attains this speed is Kernel 7, a very short loop representing an equation of state fragment:

 1007 DO 7 k= 1,n
        X(k)=     U(k  ) + R*( Z(k  ) + R*Y(k  )) +
     1        T*( U(k+3) + R*( U(k+2) + R*U(k+1)) +
     2        T*( U(k+6) + Q*( U(k+5) + Q*U(k+4))))
    7 CONTINUE

Very likely, these optimizations make use of the fact that several of the elements in the equation are used in more than one iteration of the loop and need only be computed once. When the POWER3 hardware performance monitor is applied to this loop by means of hpmcount, the measured rate for this loop is around 750 MFLOPS.
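For readers who want to see how much arithmetic the loop body contains, the following Python sketch transcribes one iteration of Kernel 7 and tallies the additions and multiplications written in the source expression. The Counted class and the kernel7_iteration function are hypothetical helpers introduced only for this illustration. The tally is a source-level count and says nothing about the instructions the compiler actually generated or about how hpmcount counts them.

```python
class Counted:
    """Float wrapper that tallies source-level additions and multiplications."""

    adds = 0
    muls = 0

    def __init__(self, value):
        self.value = float(value)

    def __add__(self, other):
        Counted.adds += 1
        return Counted(self.value + other.value)

    def __mul__(self, other):
        Counted.muls += 1
        return Counted(self.value * other.value)


def kernel7_iteration(u, y, z, r, t, q, k):
    """One iteration of Kernel 7, transcribed term by term (k is 0-based here)."""
    return (u[k] + r * (z[k] + r * y[k])
            + t * (u[k + 3] + r * (u[k + 2] + r * u[k + 1])
                   + t * (u[k + 6] + q * (u[k + 5] + q * u[k + 4]))))


# Arbitrary test data: every array element and scalar set to 1.0.
ones = [Counted(1.0) for _ in range(7)]
one = Counted(1.0)
result = kernel7_iteration(ones, ones, ones, one, one, one, 0)
print(result.value, Counted.adds, Counted.muls)
```

The expression contains 8 additions and 8 multiplications per iteration, and each multiplication feeds directly into an addition. Whether that pairing has any bearing on the gap between the rate the benchmark reports and the rate hpmcount measures is not established by the results given here.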

NAS Kernels

The NAS Kernel Benchmark consists of seven Fortran test kernels that perform calculations typical of scientific applications run at the NASA Ames Research Center. It was written in the 1980's and consists of approximately 1000 lines of Fortran code, organized into seven separate tests.

Source Code Changes

The only changes made were to the CPTIME internal timer routine. The original version was replaced by this SP-specific version:

      common /savetime/tx
      real*8 rtc,tx,t
      T = rtc()
      if (tx.gt.t) tx=0
      CPTIME = real(T - TX)
      TX = T
      RETURN
      END
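The replacement routine is an interval timer: each call returns the time elapsed since the previous call and then remembers the current time. The same pattern can be sketched in Python as below. This is an illustration rather than a translation used by the benchmark; the class name IntervalTimer is hypothetical, time.time stands in for rtc(), and starting the saved time at zero is an assumption about the initial contents of the common block.

```python
import time


class IntervalTimer:
    """Return the elapsed time since the previous call, as CPTIME does above."""

    def __init__(self, clock=time.time):
        self._clock = clock  # stand-in for the rtc() elapsed-time function
        self._last = 0.0     # plays the role of TX; assumed to start at zero

    def cptime(self):
        now = self._clock()
        if self._last > now:  # saved value is not usable, so reset it
            self._last = 0.0
        elapsed = now - self._last
        self._last = now
        return elapsed


timer = IntervalTimer()
timer.cptime()                       # first call only sets the reference point
total = sum(i * i for i in range(100000))
print(f"elapsed: {timer.cptime():.6f} s")
```

As in the Fortran version, the value returned by the first call is measured from the initial saved time rather than from a previous call, so only the second and later calls give meaningful intervals.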

Compile Changes

The code was compiled with the xlf compiler with no options beyond the optimization options.

Timers

The Compile Time result is the wall-clock time in seconds for the compile and link, as returned by the Unix time command. The table gives the average MFLOPS over all the kernels, as returned by the program's internal timer.

Timings

Optimization                          Compile Time (s)  Average MFLOPS
unoptimized                                       0.72              31
-O2                                               3.07              75
-O3 -qstrict -qarch=pwr3 -qtune=pwr3              8.31              79
-O3 -qarch=pwr3 -qtune=pwr3                       8.49              79
-O3 -qhot -qarch=pwr3 -qtune=pwr3                42.01              91
-O4 -qnohot                                      15.96              79
-O4                                              45.16              90
-O5 -qnohot                                      20.32              79
-O5                                              44.78              90

Comments

This benchmark provides a contrast with the Livermore kernels: adding the -qhot option to the other optimization options significantly improves performance, at the cost of an almost fivefold increase in compile time in some cases.
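The trade-off can be read off the timings table directly. The Python sketch below is an illustration only; the names NAS, PAIRS, and tradeoff are assumptions. It computes the compile-time ratio and the run-time speedup for each level with and without -qhot.

```python
# (compile seconds, average MFLOPS), copied from the NAS Kernels timings table.
# The -qarch=pwr3 -qtune=pwr3 arguments are omitted from the -O3 labels for brevity.
NAS = {
    "-O3": (8.49, 79),
    "-O3 -qhot": (42.01, 91),
    "-O4 -qnohot": (15.96, 79),
    "-O4": (45.16, 90),
    "-O5 -qnohot": (20.32, 79),
    "-O5": (44.78, 90),
}

# Each pair is (level without -qhot, the same level with -qhot).
PAIRS = [("-O3", "-O3 -qhot"), ("-O4 -qnohot", "-O4"), ("-O5 -qnohot", "-O5")]


def tradeoff(without_hot, with_hot):
    """Return (compile-time ratio, MFLOPS speedup) for adding -qhot."""
    return with_hot[0] / without_hot[0], with_hot[1] / without_hot[1]


for base, hot in PAIRS:
    compile_ratio, speedup = tradeoff(NAS[base], NAS[hot])
    print(f"{base:12s} -> {hot:10s} compile x{compile_ratio:.2f}, "
          f"MFLOPS x{speedup:.2f}")
```

At -O3 the compile takes just under five times as long for roughly a 15% gain in average MFLOPS; at -O4 and -O5 the compile-time penalty is smaller because the -qnohot builds are already slower to compile.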

Individual Kernels

This benchmark also provides timings for the seven individual kernels.

  • MXM - 4-way unrolled matrix multiply routine.
  • FFT - Complex radix 2 FFT.
  • CHOL - Cholesky decomposition/substitution.
  • BTRIX - Block tri-diagonal solver.
  • GMTRY - Compute solid-related arrays, Gauss eliminate the matrix of wall influence coefficients.
  • EMIT - Emit new vortices to satisfy boundary condition.
  • VPENTA - Invert 3 pentadiagonals simultaneously.

Individual Kernel MFLOPS

Optimization                               MXM     FFT    CHOL   BTRIX   GMTRY    EMIT  VPENTA
unoptimized                                 53      34      18      57      13      97      26
-O2                                        236     162     136     210      18     150      37
-O3 -qstrict -qarch=pwr3 -qtune=pwr3       753     157     192     218      18     153      36
-O3 -qarch=pwr3 -qtune=pwr3                765     146     206     221      17     153      36
-O3 -qhot -qarch=pwr3 -qtune=pwr3          424     164     210     287      18     472      57
-O4 -qnohot                                763     149     207     220      18     156      37
-O4                                        503     151     212     285      18     473      57
-O5 -qnohot                                749     153     210     220      18     157      37
-O5                                        431     162     211     303      18     473      57
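One way to read this table is to compare a single pair of rows kernel by kernel. The Python sketch below does so for -O3 with and without -qhot. It is an illustration only; the names KERNELS, O3, O3_QHOT, and per_kernel_ratio are assumptions, and the figures are copied from the two -O3 rows that include -qarch=pwr3 -qtune=pwr3 but not -qstrict.

```python
KERNELS = ["MXM", "FFT", "CHOL", "BTRIX", "GMTRY", "EMIT", "VPENTA"]

# MFLOPS per kernel, copied from the table above.
O3 = [765, 146, 206, 221, 17, 153, 36]        # -O3 -qarch=pwr3 -qtune=pwr3
O3_QHOT = [424, 164, 210, 287, 18, 472, 57]   # same options plus -qhot


def per_kernel_ratio(base, hot):
    """Return {kernel: MFLOPS with -qhot / MFLOPS without} for each kernel."""
    return {name: h / b for name, b, h in zip(KERNELS, base, hot)}


for name, ratio in per_kernel_ratio(O3, O3_QHOT).items():
    print(f"{name:7s} x{ratio:.2f}")
```

In this pair of rows the effect of -qhot is far from uniform: EMIT runs about three times faster, MXM drops to a little over half its previous rate, and the other five kernels improve by smaller amounts.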

Page last modified: Mon, 24 May 2004 19:26:15 GMT
Page URL: http://www.nersc.gov/nusers/resources/software/ibm/opt_options/optex.php