NERSCPowering Scientific Discovery Since 1974

CESM Case Study

CESM MG2 Kernel Code Description

The Community Earth System Model (CESM) is a coupled multi-physics code which consists of multiple model components: Atmosphere, Ocean, Sea-ice, Land-ice, Land, River Runoff, and Coupler. During the course of a CESM run, the model components integrate forward in time, periodically stopping to exchange information with the coupler.  The active (dynamical) components are generally fully prognostic, and they are state-of-the-art climate prediction and analysis tools. Cost of Atmosphere components typically dominates the total run time (~60%). CESM code has more than 1.5 Million lines of primarily Fortran code, using 1977-2008 standards. It has been developed for over 20+ years, very challenging code for compilers.

The MG2 kernel is the version 2 of the Morrison-Gettleman micorphysics package that has been extracted from CESM 1.3. It represents the radiative transfer workload in the atmosphere model and typically consumes about 10% of total CESM run time. The MG2 kernel is core bound, not bandwidth limited at all, and shows very little vectorization due to the following reasons:  Some loop bounds are short (e.g. 10);  There are dependent sequence of instructions; Heavy use of math instrinsics that do not vectorize: pow(), gamma(), log10(); Intel intrinsic gamma() is 2.6x slower than MG2 internal function; Kernel has long complex loops with interleaved conditionals and elemental function calls (Mixed conditionals and non-inlined functions inhibit vectorization; Some send array sections to elemental functions).

The work described here on MG2 was performed by the the CESM NESAP team, consists of NCAR CESM developers: John Dennis (PI), Christopher Kerr,  Sean Santos;  Intel engineers: Nadezhda Plotnikova, Martyn Corden; Cray Center of Excellence: Marcus Wagner; and NERSC Liaison: Helen He.

MG2 Vectorization Prototype

The MG2 kernel was brought to Intel for a Dungeon Session in March 2015.  There, we used Intel compiler report to check and make sure key functions are vectorized (and all functions on the call stack are vectorized too). Some of the lessons learned are: Elemental functions need to be inlined; “-qopt-report=5” reports highest level of details; “-ipo” is needed if functions are in different source codes.

The following figure shows the example call stack for vectorization and inlining.  !$OMP DECLARE SIMD were added for funcA (which calls funcB and intrinsic function “pow”), and !DIR$ ATTRIBUTE FORCEINLINE were added for funcC (which calls funcD and intrinsic function “pow”) when needed. The result funC was inlined, and the complete call tree from funcA down to fundD were vectorized.

MG2 prototype

Some high level recommendations from the Dungeon Session were: Divide major loops when possible and localize vectorization: work to be done by MG2 developers; Implement inlining as close to hotspot as possible; or use vector functions on the low level; Follow up with MKL team on Gamma function vectorization; and work with compiler team for a flag to replace FORCE INLINE, and portable options for other compilers.

Optimization Strategies

With lessons learned from the dungeon session, the following changes were made to the code.

Changes #1: Remove ‘elemental’ attribute and move the ‘mgncol’ loop inside routine

The problems are that routines with ‘elemental’ attribute don’t inline; And without ‘elemental’ attribute routines still don’t inline!  The ‘mgncol’ loop also needed to be moved inside routine in order for the function to be inline.

Before change:
elemental function wv_sat_svp_to_qsat(es, p) result(qs)
  real(r8), intent(in) :: es  ! SVP
  real(r8), intent(in) :: p real(r8) :: qs
  ! If pressure is less than SVP, set qs to maximum of 1.
  if ( (p - es) <= 0._r8 ) then
      qs = 1.0_r8
     qs = epsilo*es / (p - omeps*es)
  end if
end function wv_sat_svp_to_qsat

After change:
function wv_sat_svp_to_qsat(es, p, mgncol) result(qs)
  integer, intent(in) :: mgncol
  real(r8), dimension(mgncol), intent(in) :: es  ! SVP
  real(r8), dimension(mgncol), intent(in) :: p
  real(r8), dimension(mgncol) :: qs
  integer :: i
  do i=1,mgncol
  if ( (p(i) - es(i)) <= 0._r8 ) then
     qs(i) = 1.0_r8
     qs(i) = epsilo*es(i) / (p(i) - omeps*es(i))
  end if
end function wv_sat_svp_to_qsat

Impact of these code changes for Elemental functions are: No changes to algorithm; Algorithm gives same answers; Code readability not effected; Revised code looks almost identical to original; Provide scalar and vector version of functions; Overload function names to maintain single naming convention.

Changes #2: Do not use assumed-shaped arrays

Before change:
subroutine size_dist_param_liq (qcic, ...)
  real(r8), intent(in) :: qcic(:)
  do i =1, SIZE(qcic)

After change:
subroutine size_dist_param_liq (qcic, ..., mgncol)
  real, dimension(mgncol), intent(in) :: qcic
  do i =1, mgncol

Changes #3:

  • Divide loop blocks into manageable sizes. Allows compiler to vectorize loops. Can fuse loops during optimization step.
  • Remove array syntax: plev(:,:) and replace with loops
  • Replace divides: 1.0/plev(i,k) with *plev_inv(i,k)
  • Remove initialization of variables that are over written

Changes #4:

  • Rearrange loop order to allow for data alignment

Before change:

do i = 1, mgncol
  do k = 1, nlev
  plev(i,k) = ...

After change:

do i = 1, nlev
  do k = 1, mgncpl
  plev(i,k) = ...

  • Use more aggressive compiler options: -O3 -xAVX -fp-model fast=2 -no-prec-div -no-prec-sqrt -ip -fimf-precision=low -override-limits -qopt-report=5 -no-inline-max-total-size -inline-factor=200
  • Use Profile-guided Optimization (PGO) to improve code performance
  • Compare performance of code with different vendors compilers

Changes #5: Add OMP SIMD ALIGNED

Using OMP SIMD was suggested and tested at the Dungeon session to improve vectorization.  As compared to: !DIR$ VECTOR ALIGNED, which tells the compiler that the arrays are aligned; It is Intel compiler specific, not portable.

!$OMP SIMD ALIGNED is a portable solution that it tells the compiler that the arrays are aligned on specific byte boundaries; asserts that there are no dependencies; and requires to use PRIVATE or REDUCTION clauses to ensure correctness.It forces the compiler to vectorize, whether or not it thinks if it is a good idea or not.  It is independent of vendor, however it can be overly intrusive in code.

The “ALIGNED” attribute for the OMP SIMD directive approved to be important in this test case that it achieved 8% performance gain when the list was explicitly provided. !$OMP SIMD ALIGNED was added in 48 loops in MG2 kernel, many with list of 10+ variables.  However, the process was tedious and error-prone, and often times impossible in large real applications.

Our concerns are how can compilers know better which arrays are aligned so users do not have to specify. We would like that a variable can be declared as aligned, or a variable can be set to aligned with a compiler flag, and compiler would know it is aligned when it is in scope. We hope that the Fortran standard can adopt something like “!$DIR ATTRIBUTES ALIGNED: 64 :: A” that is similar to what is available in C/C++ standard: "float A[1000] __attribute__((aligned(64)))", or all compilers will support a compiler flag such as "“-align array64byte” as supported by Intel and Cray compiler currently.

Optimization Steps and Performance Summary

Version 1
    •    Simplify expressions to minimize #operations
    •    Use internal GAMMA function

Version 2
    •    Remove “elemental” attribute, move loop inside.
    •    Inline subroutines. Divide, fuse, exchange loops.
    •    Replace assumed shaped arrays with loops
    •    Replace division with inversion of multiplication
    •    Remove initialization of loops to be overwritten later
    •    Use more aggressive compiler flags
    •    Use profile-guided optimization (PGO)

Version 3 (Intel compiler only)
    •    Use !$OMP SIMD ALIGNED to force vectorization

The final run time is shown in the figure below. Using Intel compiler, Ver3 is about 2X from the Original version. The work performed on MG2 as summarized here has been put into the main CESM code trunk.

MG2 performance

Performance Comparisons on Different Compilers and Hardware

The following table shows the final run time performance with different compilers on different architecture. Cray compiler performs the best for this kernel. The exploration of using Cray compiler for the full CESM code is under exploration.

HardwareCompilerPerformance [usec per iteration]
Sandy-Bridge Intel/15.0.2 541
Sandy-Bridge PGI/15.5 600
Ivy-Bridge Intel/15.0.1 407
Ivy-Bridge CCE/8.3.11 347

Lessons Learned

  • Directives and flags can be helpful, however not a replacement for programmers’ work on code modifications.
  • Break up loops and push loops into functions where vectorization can be dealt with directly and can expose logic to compiler.
  • Incremental improvements not necessary a BIG win from any one thing. Accumulative results matter.
  • Performance and portability is a major goal: use !$OMP SIMD proves to be beneficial but very hard to use regarding the need of providing the aligned list.
  • Requested optional alignment declaration in Fortran Language Standard.