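XL Fortran for AIX® Version 6 provides several methods for improving your program's performance. This article presents eight methods, in order from simplest to most advanced, that you can use to boost the performance of most programs. You can choose an optimization level; target specific processors with -qarch and -qtune; use interprocedural analysis; adjust the floating-point options; use a Fortran preprocessor; use -qcache for single-platform programs; use SMP; and use asynchronous I/O.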

Many compiler users shy away from a compiler's optimization capabilities because, historically, optimizers were overly aggressive or simply bug-ridden, and tended to introduce errors into valid Fortran programs. The XL Fortran compiler optimizer is highly reliable. The -O2 option provides an intermediate level of optimization that avoids any technique that could alter the semantics of valid Fortran programs. Although results at -O2 may not be identical to those produced when you do not select any optimization options, -O2 typically provides better precision than unoptimized code.

Why should you use -O2? Our performance tests suggest that -O2 typically doubles the performance of both fixed-point and floating-point programs. You do not need to hand-tune your code or use any other compiler options to obtain this improvement, and you do not have to worry about -O2 introducing semantic changes into your program. Note that -O and -O2 provide the same level of optimization.

The compiler provides a higher level of optimization when you use -O3. This option widens the range of optimizations the compiler performs, but it can also increase compilation time and the compiler's memory use. Our experience suggests that -O3 usually improves performance, although in a few cases it can decrease it. If you use -O3, you may want to compare your program's performance at -O3 with its performance when compiled with -O2. In certain programs -O3 can change the behavior or results of your program unless you also specify the -qstrict option, which we recommend for novice users.

The -qhot option provides a different set of optimizations than -O3 does. Its tuning efforts concentrate on efficient scalarization of array language and iteration-reordering transformations. These transformations may produce results that are not bitwise identical to those produced at -O2 or -O3 alone.
We have found that -qhot is primarily beneficial when used in conjunction with the -qarch, -qtune, and -qcache options (for iteration-reordering transformations), or with programs that use array language, but you may wish to experiment with it in other situations as well. -qhot can occasionally reduce performance if it does not have enough information about loop bounds and array dimensions, so you may want to time your program to determine whether -qhot improves its performance. You must specify -qhot for any of the -qcache settings to have an effect.

The -O4 option aggressively optimizes the source program, trading additional compile time for potential improvements in the generated code. You can specify the option at compile time or link time. If you specify it at link time, it has no effect unless you also specify it at compile time for at least the file that contains the main program. Specifying -O4 implies the following other options: -qhot, -qipa, -O3 (and all of the options and settings that it implies), -qarch=auto, -qtune=auto, and -qcache=auto. You can specify additional options after -O4; these override the implied options listed in the previous sentence. For example, if you are compiling on a 604e machine, you can specify -O4 -qarch=pwr2 to produce executables for a POWER2 target machine.

Recommendation: Use -O4 if you are not concerned about compilation time. Otherwise, use -O2 or -O3 -qstrict for any production-level program you compile. Try -qhot if you have time to test different versions of your executable file for performance, or if your program uses Fortran 90/95 array language.
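
One way to act on this recommendation is to build the same source at several optimization levels and time each executable. The script below is only a sketch: the source file name prog.f, the output names, the run helper, and the RUN switch are all hypothetical, and it assumes the xlf95 invocation command is on your PATH. With RUN unset it just prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: SRC, the output names, and the RUN switch are hypothetical.
# Assumes the xlf95 invocation command is installed when RUN=1.
SRC=${SRC:-prog.f}

# Print the command (dry run) unless RUN=1, in which case execute it.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

run xlf95 -O2 -o prog_O2 "$SRC"                    # baseline recommended level
run xlf95 -O3 -qstrict -o prog_O3 "$SRC"           # more optimization, semantics preserved
run xlf95 -O3 -qstrict -qhot -o prog_hot "$SRC"    # adds loop and array-language tuning

# Time each variant on the same input to see which options actually help.
if [ "${RUN:-0}" = 1 ]; then
  for exe in prog_O2 prog_O3 prog_hot; do
    echo "== $exe"
    time "./$exe"
  done
fi
```

Running it once without RUN set lets you check the command lines; setting RUN=1 on a machine with the compiler installed performs the builds and timings.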

Using -qarch and -qtune

The RISC System/6000 includes models based on four different chip configurations: the original POWER processor, the PowerPC® processor (including the 601 processor, which is a bridge between the POWER and PowerPC processors), the POWER2 processor, and the POWER3 processor.

You can use -qarch and -qtune to target your program to particular machines. If you intend your program to run only on a particular architecture, you can use the -qarch option to instruct the compiler to generate code specific to that architecture. This allows the compiler to take advantage of machine-specific instructions that can improve performance. -qarch provides arguments for you to specify certain chip models; for example, you can specify -qarch=604 to indicate that your program is to be executed on any PowerPC 604 hardware platform.

Note: The -qarch option may result in a program that cannot be run on machines with processors other than those supported by the option. If you run such a program on an unsupported processor under AIX Version 4, your program may fail at execution time.

If you want your program to run on more than one architecture, but to be tuned to a particular architecture, you can use a combination of the -qarch and -qtune options.

-qarch and -qtune are primarily of benefit for floating-point intensive programs. On PowerPC systems, programs that process mainly unpromoted single-precision variables are more efficient when you specify -qarch=ppc. On POWER2 and POWER3 systems, programs that process mainly double-precision variables (or single-precision variables promoted to double by one of the -qautodbl options) become more efficient with -qarch=pwr2 and -qarch=pwr3.

If your program is likely to be run on all four types of processors equally often, do not specify any -qarch or -qtune options. The default for these options is to support only the common subset of instructions of all processors.
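
The difference between restricting the instruction set with -qarch and merely tuning with -qtune can be sketched with three builds of the same file. The source file name, the output names, and the run helper with its RUN switch are hypothetical, and the script assumes xlf95 is on your PATH; the -qarch and -qtune values are ones the compiler documents for this generation of hardware. With RUN unset the commands are only printed.

```shell
#!/bin/sh
# Sketch only: SRC, the output names, and the RUN switch are hypothetical.
SRC=${SRC:-prog.f}

# Print the command (dry run) unless RUN=1, in which case execute it.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Code specific to PowerPC 604 hardware; may not run on other processors.
run xlf95 -O2 -qarch=604 -o prog_604 "$SRC"

# Uses only PowerPC instructions, but is tuned for the 604.
run xlf95 -O2 -qarch=ppc -qtune=604 -o prog_ppc_604 "$SRC"

# Uses only the common instruction subset, tuned for POWER2.
run xlf95 -O2 -qarch=com -qtune=pwr2 -o prog_com "$SRC"
```

The second form is the combination described above: the executable still runs on the whole PowerPC family, while instruction scheduling favors one member of it.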

If you specify -q32, the defaults are -qarch=com and -qtune=pwr2. If you specify -q64, the defaults are -qarch=ppc and -qtune=pwr3.

If you specify the auto suboption for the -qarch option, XL Fortran automatically detects the specific architecture of the compiling machine. If you specify the auto suboption for the -qtune option, XL Fortran automatically detects the specific processor type of the compiling machine. For both options, XL Fortran assumes that the execution environment will be the same as the compilation environment.

You can further enhance the performance of programs intended for specific machines by using the -qcache and -qhot options.

Recommendation: If your program is intended for the full range of RISC System/6000 implementations, and is not intended primarily for one processor type, do not use either -qarch or -qtune.

Using Interprocedural Analysis

Interprocedural analysis (IPA) enhances the -O optimizations by performing detailed analysis across procedures. It extends the area examined during optimization and inlining from a single procedure to multiple procedures (which can be in different source files) and the linkage between them.

You request IPA by specifying the -qipa option. You can fine-tune the optimizations performed by specifying -qipa suboptions. You must also specify one of -O, -O2, -O3, or -O4. For additional performance benefits, you can also specify the -Q option. To use IPA, you must:

Recommendation: We strongly recommend that you specify -qipa on both the compile and link steps, and that you specify -qipa=noobject on the compile step.

Using Floating-Point Options

This section explains which default floating-point options you can change to improve the performance of floating-point intensive programs. Some of these options can affect conformance to floating-point standards. They can change the results of computations, but in many cases the result is an increase in accuracy.

Recommendation for POWER, POWER2, and POWER3 platforms: For single-precision programs, you can improve performance while preserving accuracy by using these floating-point options:

-qfloat=fltint:rsqrt:hssngl

If your single-precision program is not memory-intensive (for example, if it does not access more data than the available cache space), you can obtain equal or better performance, and greater precision, by using:

-qfloat=fltint:rsqrt -qautodbl=dblpad4

For programs that do not contain single-precision variables, use -qfloat=rsqrt:fltint only. Note that -O3 without -qstrict automatically sets -qfloat=rsqrt:fltint.

Recommendation for PowerPC platform: Single-precision programs are generally more efficient than double-precision programs on PowerPC systems, so promoting default REAL values to REAL(8) can reduce performance. Use the following -qfloat suboptions:

-qfloat=hssngl:fltint:rsqrt

Using a Fortran Preprocessor

The KAP and VAST preprocessors for Fortran can produce tuned versions of your source code. You can obtain these preprocessors directly from Kuck and Associates (for KAP) and Pacific Sierra Research (for VAST). You may find that compiling with these preprocessors improves performance for some programs. The preprocessors perform memory management optimizations, algebraic transformations, inlining, interprocedural analysis, and other optimizations.

If your program contains a large proportion of common algebraic algorithms, these algorithms may already exist in specially tuned libraries such as the BLAS (Basic Linear Algebra Subprograms) library or ESSL (Engineering and Scientific Subroutine Library). The BLAS library is currently shipped with the AIX operating system, while ESSL is a separate program product that provides a greater range of algorithms and improved performance. ESSL algorithms are tuned to individual hardware implementations, and take advantage of whatever memory and processor configuration is detected at run time.
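
If your program already calls routines from one of these libraries, using the tuned versions is mostly a matter of linking. The commands below are a sketch under assumptions: the -lblas and -lessl link flags are the customary library names on AIX rather than something this article states, the file name solver.f and the output names are hypothetical, and the run helper with its RUN switch only prints the commands unless RUN=1. Check your installation's documentation for the exact library names.

```shell
#!/bin/sh
# Sketch only: SRC, the output names, and the RUN switch are hypothetical.
# The -lblas and -lessl library names are assumptions; verify them locally.
SRC=${SRC:-solver.f}

# Print the command (dry run) unless RUN=1, in which case execute it.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Link against the BLAS library shipped with AIX.
run xlf95 -O3 -qstrict -o solver_blas "$SRC" -lblas

# Link against ESSL instead, if that product is installed.
run xlf95 -O3 -qstrict -o solver_essl "$SRC" -lessl
```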

Both the KAP and VAST preprocessors can generate calls to these libraries.

Using -qcache for Single-Platform Programs

If your program is intended to run exclusively on a single machine or configuration, you can help the compiler tune your program to the memory layout of that machine by using the -qcache option. (You must also specify the -qhot option for -qcache to have any effect. -qhot uses information on cache configurations to determine appropriate memory management optimizations.)

There are three types of cache: Data, Instruction, and Combined. Models generally fall into two categories: those with both data and instruction caches, and those with a single, combined data/instruction cache. The type=C|D|I suboption lets you identify the type of cache to which the -qcache option refers.

The -qcache options can also be used to identify the size and set associativity of a model's level-2 cache, or the Translation Lookaside Buffer (TLB), which is a table used to locate recently referenced pages of memory. In most cases you do not need to specify the -qcache entry for a TLB unless your program uses more than 512 KB of data space.

If you specify the auto suboption of -qcache, XL Fortran automatically detects the specific cache configuration of the compiling machine. XL Fortran assumes that the execution environment will be the same as the compilation environment.

Using SMP

XL Fortran for AIX, Version 6.1 exploits the RS/6000 symmetric multi-processing (SMP) architecture. It supports both automatic parallelization of a Fortran program and explicit parallelization (through a set of directives that you can use to parallelize selected portions of your program). SMP support includes the following:

You may need to use asynchronous I/O to gain speed and efficiency in scientific programs that perform I/O for large amounts of data. Synchronous I/O blocks the execution of an application until the I/O operation is completed. Asynchronous I/O allows an application to continue processing while the I/O operation is performed in the background. You can modify applications to take advantage of the ability to overlap processing and I/O operations. Multiple asynchronous I/O operations may also be performed simultaneously, on multiple files residing on independent devices.

Further Reading

The XL Fortran for AIX User's Guide, SC09-2719, describes how to compile, link, and run your programs using XL Fortran Version 6.1. See in particular Chapter 5, "XL Fortran Compiler-Option Reference", and Chapter 7, "XL Fortran Floating-Point Processing."

The XL Fortran for AIX Language Reference, SC09-2718, describes the XL Fortran programming language. See in particular Chapter 8, "Input/Output Concepts", for more information on asynchronous I/O.

The following books describe the hardware architecture of the RISC System/6000 family of processors:
AIX, RISC System/6000, PowerPC, and IBM are trademarks of International Business Machines Corporation in the
United States and/or other countries.