National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory

IBM Compiler Optimization Flags

Introduction

IBM Fortran, C, and C++ compiles are done without any optimization by default. Any optimization must be requested explicitly by means of flags to the compiler at compile and link time. This differs from the previous Cray platforms at NERSC, whose compilers provided a fairly high level of optimization by default, so that an unoptimized compile had to be requested explicitly.

For the most part, the IBM Fortran, C, and C++ compilers all have the same optimization arguments. The description of these arguments below applies to all three compilers unless otherwise stated.

These are the currently recommended optimization options for compiling on the machine on which you will be running the code:

	-O3 -qstrict -qarch=auto -qtune=auto

These options provide a compromise between minimizing compilation time and maximizing the performance of the compiled code. All of these options as well as several other useful optimization options will be described below.

Some specific examples of the compile time and run time impact of several different sets of optimization arguments on several public benchmarks are given at Compiler Optimization Argument Examples.

-On

The compilers allow you to specify a general level of optimization by giving a numeric optimization level with the -O flag. The higher the number, the greater the amount of optimization the compiler does, the longer the compile takes, and the more memory the compile uses. The lowest numeric optimization level is -O2. Neither -O0 nor -O1 is currently supported by the compilers.

-O2 (-O)

The -O2 option is designed to provide an intermediate level of optimization that does not require an excessive amount of time to perform the compile and will produce numeric results identical to those produced by an unoptimized compile. It avoids certain types of optimizations that have the potential to produce different numeric results. See the section on the -qstrict argument below for a discussion on how the exact equality of numeric results is accomplished.

The -O option is identical to the -O2 option.

The optimizations done at the -O2 level include:

This is a dot product example of the store motion optimization done at the -O2 level.

Fortran

	x=0.0
	do i=1,ilim
	   x=a(i)*b(i)+x
	enddo

C

	x = 0.0;
	for (i = 0; i < ilim; i++)
	{
	        x += a[i] * b[i];
	}

The unoptimized, default compile follows all the source code instructions literally. In this case, for each iteration of the loop, there would be a new load and a new store of the variable x. With -O2 optimization, the compiler would recognize that there is no need to store the value of x until the loop is completed, and intermediate values would be kept in registers. Even if the loads and stores are cached, the optimization could lead to an order of magnitude or better improvement in the performance of this loop.
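The effect of store motion can be imitated by hand in C. The following is only a sketch: the function names are hypothetical, and volatile is used solely to force a load and a store of the accumulator on every iteration, which approximates what an unoptimized compile does. It is not a description of the code the IBM compilers actually generate.

```c
#include <stddef.h>

/* Approximates the unoptimized compile: the accumulator lives in memory,
 * and volatile forces a load and a store of *x on every iteration. */
void dot_store_every_iteration(const double *a, const double *b,
                               size_t n, volatile double *x)
{
    *x = 0.0;
    for (size_t i = 0; i < n; i++) {
        *x = a[i] * b[i] + *x;
    }
}

/* Approximates the result of store motion: the running sum is kept in a
 * local variable (a register candidate) and stored once after the loop. */
void dot_store_after_loop(const double *a, const double *b,
                          size_t n, double *x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum = a[i] * b[i] + sum;
    }
    *x = sum;
}
```

Both functions compute the same dot product; only the number of memory accesses to the accumulator differs.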

-O3

The -O3 level of optimization performs all of the optimizations done at the -O2 level as well as several other optimizations that require more memory or time to accomplish.

Some optimizations may be done that will change the semantics of the program slightly, and might cause numeric differences between the results of the program and the same program compiled at the -O2 optimization level or with no optimization. To disable those optimizations that might produce different results, include the -qstrict option on the compile line after -O3 is specified.

These are the types of optimizations done at this level that are not done at the -O2 level:

Some limitations of the -O3 level are:

-O4

The -O4 level of optimization performs all of the optimizations done at the -O3 level as well as several other optimizations. This argument is equivalent to:

	-O3 -qarch=auto -qtune=auto -qcache=auto -qhot -qipa

This flag should be specified both at compile and link time.

-O5

The -O5 level of optimization performs all of the optimizations done at the -O4 level as well as the optimization specified by -qipa=level=2, which is described below. This argument is equivalent to:

	-O4 -qipa=level=2

This flag should be specified both at compile and link time.

-qstrict

The -qstrict option ensures that any optimization done will not alter the semantics of a program, and that the numeric results of a program will be identical with those produced by an unoptimized program. This option actually limits the amount of optimization done when it is included with any other optimization argument.

Specifically, optimizations that perform operations like these are not done:

With this option, a strict computational order is observed based on the language rules for operator precedence and left to right operation regardless of the potential negative effect on performance.

The following example of the potentially inhibiting effects of this argument on optimization is taken from the IBM document "Power3 Introduction and Tuning Guide", SG24-5155-00, which is available at the IBM Documentation website.

When evaluating an expression like this:

      A*B*C + B*C*D

an optimized compile would recognize that B*C is a "common sub-expression" and evaluate it only once. However, if -qstrict is specified, this optimization would not be done, since that would violate the left to right ordering rule on A*B*C. Since floating point arithmetic is not associative, there is no guarantee that the results of (A*B)*C would be bitwise identical to those of A*(B*C).

In practice, we have observed that compiling with this argument rarely has a negative impact on the performance of a code. It can even be the case that a code compiled with -qstrict is actually faster at run time than it was when compiled without this argument.
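The lack of associativity can be demonstrated with a small C sketch. The function names are hypothetical, and the inputs suggested afterwards are chosen to exaggerate the effect by forcing an overflow. In typical cases the two orderings differ, if at all, only in the last bits of the result. The volatile temporaries are an assumption made to ensure that each intermediate product is rounded to double precision.

```c
/* Strict left-to-right evaluation, as the language rules require:
 * A*B*C is (A*B)*C and B*C*D is (B*C)*D. */
double eval_strict(double a, double b, double c, double d)
{
    volatile double ab  = a * b;
    volatile double abc = ab * c;
    volatile double bc  = b * c;
    volatile double bcd = bc * d;
    return abc + bcd;
}

/* Evaluation with B*C computed once and reused, which implicitly
 * regroups the first term as A*(B*C). */
double eval_shared_bc(double a, double b, double c, double d)
{
    volatile double bc  = b * c;
    volatile double abc = a * bc;
    volatile double bcd = bc * d;
    return abc + bcd;
}
```

With A = 1e308, B = 10, C = 0.1, and D = 0, the strict ordering overflows to infinity when A*B is formed, while the regrouped version first forms B*C and returns a finite value. For moderate inputs such as 2, 3, 4, and 5 the two functions agree.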

-qhot

The -qhot option is by far the most expensive option in terms of adding to the elapsed time of a compile. Adding this option often more than triples the compile time of a code.

The -qhot option performs several loop oriented optimizations:

The Fortran option -qreport=hotlist will produce a listing describing all of the transformations done by -qhot. The listing is in a file called program.lst where program.f is the name of the source code file that was compiled.

It is possible for the -qhot option to decrease the performance of a program if the compiler does not have enough information about loop bounds and array dimensions, so that it attempts inappropriate optimizations.

-qipa

The -qipa option enables interprocedural analysis (IPA) by the compiler. This enables the compiler to identify optimization opportunities across procedural boundaries. It does this by extending the area that is examined during optimization and inlining from a single procedure to multiple procedures (possibly in different source files) and the linkage between them. This option should be included in both the compile and link phases.

There is a rich collection of suboptions of the form -qipa=suboption. See the compiler man pages for more details about these suboptions. Some useful options are the -qipa=level=n options, where n can be 0, 1, or 2. These determine the amount of interprocedural analysis and optimization that is performed. As with the -On options, the higher the number, the greater the amount of optimization performed.

-qipa=level=0

This does only minimal interprocedural analysis and optimization.

-qipa=level=1 (-qipa)

This is the default level for ipa. The two options -qipa=level=1 and -qipa are identical.

This level turns on inlining, limited alias analysis, and limited call-site tailoring.

Inlining a procedure causes a procedure call to be replaced by the procedure itself to eliminate the overhead of the call.

An alias occurs when different variables in a program refer to the same area of storage. If the compiler is unsure whether a given global variable is aliased, it will assume that every procedure call might cause the variable to be read or changed. For this reason it will generate extra loads and stores to preserve the value of the variable when the procedure is called instead of keeping it in a register.
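A related illustration, using C pointers rather than a global variable and a procedure call, shows why the compiler has to be conservative when it cannot rule out aliasing. The function names are hypothetical and the example is only a sketch of the general problem, not of the analysis that -qipa performs.

```c
#include <stddef.h>

/* The compiler must assume that v and factor may overlap, so *factor
 * is reloaded from memory on every iteration. */
void scale_reload(double *v, const double *factor, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] * (*factor);
    }
}

/* Keeps the factor in a local variable (a register candidate). This is
 * equivalent to scale_reload only when factor does not point into v. */
void scale_cached(double *v, const double *factor, size_t n)
{
    const double f = *factor;
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] * f;
    }
}
```

If factor points at v[0], the first iteration changes the factor itself, and the two functions produce different results. Once alias analysis shows that no such overlap is possible, the faster register form is safe.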

Call-site tailoring is IBM's generic term for optimizations that are performed on a function-call basis like cloning and inlining.

-qipa=level=2

This ipa level performs full interprocedural data flow and alias analysis.

-qarch

The -qarch argument specifies the type of processor on which the executable code will be run, and produces an executable program that contains machine instructions specific to that processor. This allows the compiler to take advantage of processor-specific instructions that can improve performance at the cost of producing an executable program that will run on only one type of processor. The default for this argument is -qarch=com, which will produce a program that is runnable on any POWER or POWERPC processor.

The recommended value for this argument is

      -qarch=auto

The -qarch=auto option tells the compiler to produce a program with machine instructions specific to the processor on which it is compiled.

-qtune

The -qtune argument specifies the type of processor for which the program should be tuned to produce the best performance. Tuning for a processor involves instruction selection, scheduling, taking advantage of cache sizes and setting up pipelining to take advantage of the specified processor's hardware. Unlike the -qarch option, this option does not produce processor specific code. A code that is compiled with a -qtune argument designated for one type of processor will run correctly on any other POWER or POWERPC processor, although its performance may be worse than it would be if it were compiled with the appropriate -qtune argument.

The recommended value for this argument is

      -qtune=auto

The -qtune=auto option tells the compiler to produce a program tuned for the processor on which it is compiled.

-qcache

The -qcache argument specifies the cache configuration of the processor on which the program will be run. The argument must be used in combination with the -O4, -O5, or -qipa options with a C or C++ compile and in combination with the -qhot option with a Fortran compile. The compiler uses the information provided by this argument to determine how loop operations can be structured or blocked to process only the amount of data that can fit into the data cache. The default value for this argument is the same as that of the -qtune option.

As with -qtune, a code compiled with this argument designated for one type of processor will run correctly on any other POWER or POWERPC processor, although its performance may be worse than it would be if it were compiled with the appropriate -qcache argument.

This option is designed for those cases in which the cache on the processor on which the code will be run is different from the standard cache for that processor.

If this argument is used in a compile, NERSC recommends it be given this value:

      -qcache=auto

The -qcache=auto option tells the compiler to produce a program for the cache configuration of the processor on which it is compiled.
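The kind of loop blocking that cache information makes possible can be sketched by hand in C. This is an illustration under stated assumptions: the function names and the tile size BLOCK are hypothetical, a matrix transpose is used only because it is a compact example, and the transformations the compiler actually applies may differ.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile edge; a compiler would derive it from the cache size */

/* Straightforward transpose: the writes to out stride through memory,
 * so for large n each one may touch a different cache line. */
void transpose_plain(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            out[j * n + i] = in[i * n + j];
        }
    }
}

/* Blocked transpose: works on BLOCK x BLOCK tiles so that the data
 * touched by the inner loops can stay resident in the data cache. */
void transpose_blocked(const double *in, double *out, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    out[j * n + i] = in[i * n + j];
                }
            }
        }
    }
}
```

Both functions produce the same output; the blocked version only changes the order in which the elements are visited.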

-lessl

The Engineering and Scientific Subroutine Library (ESSL) is a collection of high performance numerical routines. These routines are very highly optimized for the POWER3 architecture, and using them will almost always produce the fastest possible code for any given numerical algorithm.

If the -lessl argument is specified at link time, single threaded versions of these routines will be used from the ESSL serial library.

-lesslsmp

Some of these subroutines are also available in multithreaded versions that will run in parallel over a node using the shared memory parallel processing programming model. If the -lesslsmp argument is used at link time, these versions will be loaded if they are available. The number of user threads that a routine will use can be controlled by the OMP_NUM_THREADS environment variable in the manner described at Changing the number of threads and tasks. If this variable is not set, the routine will run with sixteen user threads.

-qessl

The -qessl option for Fortran compiles directs the compiler to substitute the much faster routines from the Engineering and Scientific Subroutine Library (ESSL) for their equivalent Fortran 90 intrinsic procedures when it is safe to do so. Both 32 and 64 bit datatypes are supported. In addition to -qessl, at least one of these other options must be specified for this optimization to take effect: -qsmp, -qipa, -qhot, -O3, -O4, or -O5.

In addition to -qessl, either -lessl or -lesslsmp must be specified at link time. If -lessl is specified, the single threaded versions from the ESSL library will be used. If -lesslsmp is specified, the multi-threaded versions will be used. Codes linked with -lesslsmp will be run with OMP_NUM_THREADS user threads or with one user thread per processor if this environment variable has not been set.

-qsmp

The -qsmp option, which is equivalent to -qsmp=auto, tells the compiler to attempt to parallelize the user code. It will do this by attempting to parallelize explicitly coded loops and, with Fortran, loops that are generated by the compiler for array language. This option must be used with a "thread safe" version of the compiler, one with a "_r" suffix, e.g. xlf90_r, mpCC_r, xlc_r, etc.

Codes compiled with this option will be run with OMP_NUM_THREADS user threads or with one user thread per processor if this environment variable has not been set.
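Whether an explicitly coded loop can be parallelized depends largely on whether its iterations are independent. The two hypothetical C functions below sketch the distinction. They are illustrations only: whether a particular compiler parallelizes the first loop depends on its own analysis, including whether it can establish that the arrays do not overlap.

```c
#include <stddef.h>

/* Independent iterations: each c[i] depends only on a[i] and b[i], so
 * the iterations could be divided among threads (assuming the arrays
 * do not overlap). */
void vector_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Loop-carried dependence: iteration i reads the value written by
 * iteration i-1, so the iterations cannot simply run concurrently. */
void running_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}
```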

IBM Compiler Optimization Argument Examples

This section describes the impact of a variety of compiler optimization arguments on the compile time and run time of several publicly available benchmarks.

Introduction

Publicly available benchmarks are compiled and run with several different sets of optimization options and the performance recorded. The time required to compile and link the code is also recorded, and the results summarized.

The following information is given for each benchmark:

The numbers given in the tables below for the individual benchmarks are the best of several dedicated runs in batch mode.

Linpack

The Linpack benchmark solves a dense set of linear equations. The version tested here is the 1000x1000 double precision version obtained from 1000d. It is contained in a single 755 line source code file containing 11 subroutines and functions in addition to the main driver. It is a simple Fortran 77 code originally written in 1978 and last modified in 1992.

Source Code Changes

All references to second() in the source were replaced with references to rtc().

Compile Changes

The code was compiled with the xlf compiler with no options beyond the optimization options.

Times

These runs were done with the 8.1.1.3 version of xlf in December, 2003.

The Compile Time is the wall-clock time for the compile and link returned by the unix time command.

The MFLOPS result is that returned by the internal timer in the code.

Results

Optimization                          Compile Time  MFLOPS
unoptimized                                   0.52      49
-O2                                           1.36     243
-O3 -qstrict -qarch=pwr3 -qtune=pwr3          1.98     256
-O3 -qarch=pwr3 -qtune=pwr3                   2.07     257
-O3 -qhot -qarch=pwr3 -qtune=pwr3             3.75     256
-O4 -qnohot                                   4.08     264
-O4                                           6.56     264
-O5 -qnohot                                   4.19     266
-O5                                           5.96     263

Comments

There was no significant difference in the code's performance at any optimization level when the threaded compiler (xlf_r) was used or when the MASS library was included. In this case, the recommended optimization options, -O3 -qstrict -qarch=pwr3 -qtune=pwr3, give close to the best performance obtained with any of the other sets of optimization options.

This example exhibits the limitations of using compiler options alone to improve a code's performance, since the best performance obtained on a single processor in this example is only 18% of the processor's theoretical peak performance of 1.5 GFLOPS.
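The 18% figure follows directly from the results table: the best rate measured, 266 MFLOPS for -O5 -qnohot, is divided by the peak of 1.5 GFLOPS, that is, 1500 MFLOPS. A minimal C sketch of this arithmetic, with a hypothetical function name, is:

```c
/* Percentage of theoretical peak attained; both rates must be
 * expressed in the same unit (here MFLOPS). */
double percent_of_peak(double measured_mflops, double peak_mflops)
{
    return 100.0 * measured_mflops / peak_mflops;
}
```

Calling percent_of_peak(266.0, 1500.0) gives a value between 17.7 and 17.8, which rounds to the 18% quoted above.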

Most of the work in this example is done by four BLAS routines, daxpy, ddot, dscal, and idamax, that are also in the IBM high performance ESSL library. However, when the benchmark versions of these routines are replaced with the ESSL routines the performance attained is no better than 250 MFlops, worse than with the benchmark versions.

Fortran Livermore Loops

This version of the double precision Livermore Loop benchmark was obtained from livermore. This is the 1991 update of the benchmark whose earliest version dates from the 1970's. It contains 24 numeric kernels written in fairly straightforward, uncomplicated Fortran 77. Several summary figures are returned by the program at the end of the run.

Source Code Changes

The only changes to the original source required were to the timing routines. The SECOND function definition in the main routine at line 556 was uncommented:

	REAL*8 SECOND

These three lines, 4469-4471, in the SECOND function were commented out and replaced with a call to the system elapsed time measurement function rtc():

C         REAL*4 CPUTYM(4), ETIME
C         XT= ETIME( CPUTYM)
C         SECOND=    CPUTYM(1)
	second=rtc()

Compile Changes

The code was compiled with the xlf compiler with no additional arguments beyond those for optimization.

Timers

The Compile Time result is the time in seconds required to compile and link the test program. To compare the effects of the various optimization levels, the Average (mean), Minimum, and Maximum MFLOPS rates for the loops returned by the code are listed.

Livermore Loop MFLOPS

Optimization                          Compile Time (s)  Average  Minimum  Maximum
unoptimized                                       1.63       48       11      141
-O2                                               8.73      245       22      995
-O3 -qstrict -qarch=pwr3 -qtune=pwr3             14.22      322       60     1259
-O3 -qarch=pwr3 -qtune=pwr3                      15.07      339       60     1257
-O3 -qhot -qarch=pwr3 -qtune=pwr3                67.62      315       58      996
-O4 -qnohot                                      34.08      353       58     1532
-O4                                              80.04      321       58      998
-O5 -qnohot                                      49.10      360       58     1541
-O5                                              91.23      319       58      997

Comments

The performance of this benchmark is significantly degraded when the -qhot option is specified. Not only is the compile time greatly increased, but both the Average and Maximum MFLOPS are significantly worse than the corresponding optimization level without the -qhot option. This may be due to the fact that all of the loops are fairly small and uncomplicated, and the sophisticated analysis and loop restructuring done by this option add too much overhead at execution time.

Another interesting feature is that two of the higher level optimizations that do not include -qhot, -O4 -qnohot and -O5 -qnohot, are reported as attaining a maximum MFLOPS rate greater than the theoretical peak performance of the POWER3 processor, 1.5 GFLOPS. The loop that attains this speed is Kernel 7, a very short loop representing an equation of state fragment:

 1007 DO 7 k= 1,n
        X(k)=     U(k  ) + R*( Z(k  ) + R*Y(k  )) +
     1        T*( U(k+3) + R*( U(k+2) + R*U(k+1)) +
     2        T*( U(k+6) + Q*( U(k+5) + Q*U(k+4))))
    7 CONTINUE

Very likely, these optimizations make use of the fact that several of the elements in the equation are used in more than one iteration of the loop and need only be computed once. When the POWER3 hardware performance monitor is applied to this loop by means of hpmcount, the measured MFLOPS for this loop are around 750.

NAS Kernels

The NAS Kernel Benchmark consists of seven Fortran test kernels that perform calculations typical of scientific applications run at the NASA Ames Research Center. It was written in the 1980's and consists of approximately 1000 lines of Fortran code, organized into seven separate tests.

Source Code Changes

The only changes made were to the CPTIME internal timer routine. The original version was replaced by this SP-specific version:

      common /savetime/tx
      real*8 rtc,tx,t
      T = rtc()
      if (tx.gt.t) tx=0
      CPTIME = real(T - TX)
      TX = T
      RETURN
      END

Compile Changes

The code was compiled with the xlf compiler with no options beyond the optimization options.

Timers

The Compile Time result is the wall clock seconds for the compile and link returned by the unix time command. This table gives the average MFLOPS for all the kernels as returned by the program's internal timer.

Timings

Optimization                          Compile Time (s)  Average MFLOPS
unoptimized                                       0.72              31
-O2                                               3.07              75
-O3 -qstrict -qarch=pwr3 -qtune=pwr3              8.31              79
-O3 -qarch=pwr3 -qtune=pwr3                       8.49              79
-O3 -qhot -qarch=pwr3 -qtune=pwr3                42.01              91
-O4 -qnohot                                      15.96              79
-O4                                              45.16              90
-O5 -qnohot                                      20.32              79
-O5                                              44.78              90

Comments

This provides a contrast with the Livermore kernels in that the -qhot option significantly improves performance when it is added to other optimization options, at the cost of an almost fivefold increase in compile time in some cases.

Individual Kernels

This benchmark also provides timings for the seven individual kernels.

Individual Kernel MFLOPS

Optimization                              MXM    FFT   CHOL  BTRIX  GMTRY   EMIT VPENTA
unoptimized                                53     34     18     57     13     97     26
-O2                                       236    162    136    210     18    150     37
-O3 -qstrict -qarch=pwr3 -qtune=pwr3      753    157    192    218     18    153     36
-O3 -qarch=pwr3 -qtune=pwr3               765    146    206    221     17    153     36
-O3 -qhot -qarch=pwr3 -qtune=pwr3         424    164    210    287     18    472     57
-O4 -qnohot                               763    149    207    220     18    156     37
-O4                                       503    151    212    285     18    473     57
-O5 -qnohot                               749    153    210    220     18    157     37
-O5                                       431    162    211    303     18    473     57



Page last modified: Thu, 27 May 2004 04:56:56 GMT
Page URL: http://www.nersc.gov/nusers/resources/software/ibm/opt_options/print.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
