
Vectorization

Introduction

Vectorization allows you to execute a single instruction on multiple data objects in parallel within a single CPU core, thus improving performance. This is different from task parallelism using MPI, OpenMP or other parallel libraries, where additional cores or nodes are added to handle data belonging to separate tasks placed on different cores or nodes. This form of parallelism (data parallelism) is possible because modern computer architectures include SIMD (Single Instruction, Multiple Data) instructions, which perform a single instruction on multiple data objects on a single CPU core at the same time.

SSE (Streaming SIMD Extensions) is a SIMD extension to the x86 instruction set architecture, first introduced by Intel in 1999 and expanded in subsequent generations. 128-bit registers are provided to handle SSE instructions, which operate simultaneously on two 64-bit (8-byte) double precision numbers, four 32-bit (4-byte) single precision floating point numbers, two 64-bit integers, four 32-bit integers, eight 16-bit short integers or sixteen 8-bit bytes.

Advanced Vector Extensions (AVX) are vector instructions added to the x86 instruction set architecture whose vector width is further extended to 256 bits (32 bytes). 256-bit wide registers are available to handle these instructions. They were introduced by Intel with the Sandy Bridge processor shipping in 2011 and later that year by AMD with the Bulldozer processor. AVX instructions operate simultaneously on 8 pairs of single precision (4-byte) operands or 4 pairs of double precision (8-byte) operands, etc. This can speed up a vectorizable loop by up to 8 times in the case of single precision (4 bytes), or 4 times in the case of double precision (8 bytes), etc.

AVX and SSE instructions are available on Edison nodes. On Hopper, only SSE instructions are available.

We often use the term 'vector length' when talking about vectorization. This is the number of variables that can be simultaneously operated on by a vector instruction. Since SSE registers are 128 bits (16 bytes) wide, the vector length is four if single precision (4-byte) variables are used, and two with double precision variables. AVX registers are 256 bits (32 bytes) wide, so the vector length is double that of SSE instructions: eight for single precision and four for double precision.

Vectorization

To vectorize a loop, the compiler first unrolls the loop by the vector length, and then packs multiple scalar instructions into a single vector instruction. This optimization process is called 'vectorization'. Assuming a vector length of 4 in this example, the following loop

      do i=1,100
         a(i) = b(i) + c(i)
      end do

is transformed into

      do i=1,100,4        ! unrolled 4 times
         a(i)   = b(i)   + c(i)
         a(i+1) = b(i+1) + c(i+1)
         a(i+2) = b(i+2) + c(i+2)
         a(i+3) = b(i+3) + c(i+3)
      end do

and then into

      do i=1,100,4
         <load b(i), b(i+1), b(i+2), b(i+3) into a vector register, vB>
         <load c(i), c(i+1), c(i+2), c(i+3) into a vector register, vC>
         vA = vB + vC      ! 4 additions are now done in a single step
         <store a(i), a(i+1), a(i+2), a(i+3) from the vector register, vA>
      end do

By executing multiple operations in a single step, performance can potentially improve by a factor of up to the vector length (4 in this example) over scalar mode, where one pair of operands is operated on at a time.

Compilers can automatically vectorize loops that they consider safe for vectorization. In the case of the Intel compiler, this happens when you compile at the default optimization level (-O2) or higher. On the other hand, if you want to disable vectorization for the loops in a source file for any reason, you can do so by specifying the '-no-vec' compile flag.

For the Cray compiler, there are four levels of automatic vectorization that you can specify with the -h vectorN flag, where N ranges from 0 to 3.  -hvector0 specifies minimal vectorization, which turns off vectorization for all loops except for Fortran statements using array syntax.  You can turn off vectorization for those statements as well by combining -hvector0 with -hfp0 or -hfp1.  -hvector1 turns on conservative vectorization, which does not restructure loop nests.  -hvector2 specifies moderate vectorization, which will restructure loop nests and is the default for the Cray compilers.  -hvector3 specifies aggressive vectorization.

The following test code demonstrates the effect of vectorization on a simple vector add operation. The main computation loop is repeated itmax times, partly so that small arrays can be used to show the effect of vectorization without it being obscured by the limited memory bandwidth seen with large arrays, and partly so that the computation time is larger than the time resolution of the timing function used (1 microsecond). For a finer resolution, MPI_Wtime() could be used.

      program vecadd

!...  Prepared by NERSC User Services Group
!...  April, 2014

      implicit none
      integer :: n = 5184
      integer :: itmax = 10000000
#ifdef REAL4
      real, allocatable :: a(:), b(:), c(:)
#else
      real*8, allocatable :: a(:), b(:), c(:)
#endif
      integer i, it
      integer*8 c1, c2, cr, cm
      real*8 dt

      allocate (a(n), b(n), c(n))

!...  Initialization

      do i=1,n
         a(i) = cos(i * 0.1)
         b(i) = sin(i * 0.1)
         c(i) = 0.
      end do

!...  Main loop

      call system_clock(c1, cr, cm)
      do it=1,itmax
         do i=1,n
            c(i) = a(i) + b(i)
         end do
         b(n) = sin(b(n))
      end do
      call system_clock(c2, cr, cm)

      dt = dble(c2-c1)/dble(cr)
      print *, c(1)+c(n/2)+c(n), dt

      deallocate(a, b, c)
      end

The plot below shows performance speedup with vectorization for different array sizes observed on Edison using single ('R*4') and double ('R*8') precisions with the Intel compiler. The speedup is over 4 in single precision and over 2 in double precision when the data is all in the L1 cache.

Vectorization inhibitors

Not all loops can be vectorized. A good introduction to vectorization with the Intel compiler ('A Guide to Vectorization with Intel C++ Compilers') summarizes the criteria for vectorization quite well:

(1) The loop trip count must be known at entry to the loop at runtime. Statements that can change the trip count dynamically at runtime (such as Fortran's 'EXIT', computed 'IF', etc. or C/C++'s 'break') must not be present inside the loop.

(2) In general, branching in the loop inhibits vectorization. Thus, C/C++'s 'switch' statements are not allowed. However, 'if' statements are allowed as long as they can be implemented as masked assignments. The calculation is done for all 'if' branches but the result is stored only for those elements for which the mask evaluates to true.
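
As an illustration (a hypothetical sketch, not taken from the Intel guide), the 'if' in the following loop can be implemented as a masked assignment, so the loop remains vectorizable:

      do i=1,n
         if (a(i) > 0.) then
            b(i) = sqrt(a(i))   ! evaluated for all elements; stored only where the mask is true
         end if
      end do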

(3) Only the innermost loop is eligible for vectorization. If the compiler transforms an outer loop into an inner loop as a result of optimization, then the loop may be vectorized.

(4) A function call or I/O inside a loop prohibits vectorization. Intrinsic math functions such as 'cos()', 'sin()', etc. are allowed because vectorized versions of these library functions are usually available. A loop containing a function that is inlined by the compiler can also be vectorized because the function call is eliminated.
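
For example, a loop like the following sketch can be vectorized because the compiler can call a vectorized version of the math routine:

      do i=1,n
         a(i) = exp(b(i))   ! the intrinsic can be mapped to a vectorized math library routine
      end do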

(5) There should be no data dependency in the loop. Some dependency types are as follows:

Read-after-write (also known as "flow dependency"): An example is shown below.

      do i=2,n
         a(i) = a(i-1) + 1
      end do

This type of dependency produces incorrect numerical results when performed in vector mode. Let's assume that the vector length is again 4. If the above is computed in vector mode, the old values of a(1), a(2), a(3), and a(4) will be loaded into the right hand side in the first vector iteration. In scalar mode, on the other hand, the new value computed in the previous scalar iteration, a(i-1), is loaded into the right hand side. Therefore, the loop cannot be vectorized:

% ifort -vec-report3 atest.f
atest.F(14): (col. 7) remark: loop was not vectorized: existence of vector dependence
atest.F(16): (col. 7) remark: vector dependence: assumed FLOW dependence between atest line 16 and atest line 16

Write-after-read (also known as "anti-dependency"): An example is:

      do i=2,n
         a(i-1) = a(i) + 1
      end do

This loop can be vectorized.

Write-after-write (also known as "output dependency"): The following loop cannot be vectorized because incorrect numerical results would be obtained in vector mode.

      do i=2,n
         a(i-1) = x(i)
         ...
         a(i)   = 2. * i
      end do

However, reduction operations can be vectorized. In the following example, the compiler will vectorize and compute partial sums. At the end, the total sum is computed from partial sums.

      s = 0.
      do i=1,n
         s = s + a(i) * b(i)
      end do
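
Conceptually (a sketch assuming a vector length of 4, in the spirit of the unrolling example earlier), the compiler keeps one partial sum per vector lane and combines them after the loop:

      s1 = 0.
      s2 = 0.
      s3 = 0.
      s4 = 0.
      do i=1,n,4                        ! assumes n is a multiple of 4 for simplicity
         s1 = s1 + a(i)   * b(i)
         s2 = s2 + a(i+1) * b(i+1)
         s3 = s3 + a(i+2) * b(i+2)
         s4 = s4 + a(i+3) * b(i+3)
      end do
      s = s1 + s2 + s3 + s4             ! the total sum is computed from the partial sums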

(6) Non-contiguous memory access hampers vectorization efficiency. Eight consecutive ints or floats, or four consecutive doubles, may be loaded directly from memory in a single AVX instruction. But if they are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient.
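
For example, the stride-2 loads of 'b' in the following sketch touch non-adjacent elements and therefore cannot be done with a single contiguous vector load:

      do i=1,n
         a(i) = b(2*i)   ! stride-2 access: the elements of b are not adjacent in memory
      end do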

Useful tools

Intel provides two very useful vector analysis tools.

The Intel compiler option '-vec-report=<n>' generates a diagnostic report about vectorization. Users can see why the compiler vectorizes certain loops and why it doesn't vectorize others. The report level can be selected by setting the flag to any number from 0 to 7. See the ifort, icc or icpc man page for details. This is an example of the report:

% ftn -o v.out -vec-report=3 yax.f
yax.f(12): (col. 10) remark: LOOP WAS VECTORIZED
yax.f(13): (col. 22) remark: loop was not vectorized: not inner loop
yax.f(18): (col. 10) remark: LOOP WAS VECTORIZED
yax.f(26): (col. 13) remark: LOOP WAS VECTORIZED
yax.f(25): (col. 10) remark: loop was not vectorized: not inner loop
yax.f(24): (col. 7) remark: loop was not vectorized: not inner loop

The compiler option '-guide' turns on the Guided Auto-Parallelization (GAP) feature. It generates parallelization diagnostics, suggesting ways to improve auto-vectorization, auto-parallelization, and data transformation. The option requires an optimization level of '-O2' or higher.

% ftn -real-size 64 -vec-report=3 -parallel -guide -o v.out alg.F
alg.F(20): (col. 12) remark: LOOP WAS VECTORIZED
alg.F(21): (col. 12) remark: LOOP WAS VECTORIZED
alg.F(22): (col. 7) remark: LOOP WAS VECTORIZED
alg.F(22): (col. 7) remark: loop was not vectorized: not inner loop
alg.F(35): (col. 7) remark: LOOP WAS VECTORIZED
alg.F(34): (col. 7) remark: loop was not vectorized: not inner loop
alg.F(43): (col. 16) remark: LOOP WAS VECTORIZED
alg.F(20): (col. 12) remark: LOOP WAS VECTORIZED
alg.F(21): (col. 12) remark: LOOP WAS VECTORIZED
alg.F(22): (col. 7) remark: loop was not vectorized: vectorization possible but seems inefficient
alg.F(35): (col. 7) remark: LOOP WAS VECTORIZED
alg.F(43): (col. 16) remark: LOOP WAS VECTORIZED
GAP REPORT LOG OPENED ON Thu Apr 24 13:04:28 2014

alg.F(20): remark #30525: (PAR) Insert a "!dir$ loop count min(512)" statement right before the loop at line 20 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 512 iterations.
alg.F(21): remark #30525: (PAR) Insert a "!dir$ loop count min(512)" statement right before the loop at line 21 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 512 iterations.
alg.F(22): remark #30525: (PAR) Insert a "!dir$ loop count min(512)" statement right before the loop at line 22 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 512 iterations.
Number of advice-messages emitted for this compilation session: 3.
END OF GAP REPORT LOG

The advice may include suggestions for source code modifications, applying specific pragmas, or adding compiler options. In all cases, the user must verify that it is safe to apply a particular suggestion before doing so.

The compiler does not produce any objects or executables when the -guide option is specified.

Useful Tools from the Cray Compiler

The Cray compiler listing option -rm produces a "loopmark" source code listing with the applied optimizations indicated in the actual source listing.  It writes this to a file source.lst, where source.ext is the source file, e.g. source.f, source.F, source.f90, etc.

The -hnopattern option in this example is included to turn off the Cray pattern matching optimization, which replaces source code lines with calls to math libraries, in this case to the DGEMM BLAS routine.

% ftn -hnopattern -hvector3 -rm matmat.F

produces a matmat.lst file which includes a source file listing with all loop optimizations indicated.

   57.  + 1------------<       do it=1,itmax
   58.  + 1 br4--------<          do j=1,n
   59.  + 1 br4 b------<             do k=1,n
   60.    1 br4 b Vr2--<                do i=1,nr
   61.    1 br4 b Vr2                      c(i,j) = c(i,j) + a(i,k) * b(k,j)
   62.    1 br4 b Vr2-->                end do
   63.    1 br4 b------>             end do
   64.    1 br4-------->          end do
   65.    1------------>       end do


Elsewhere in the .lst file these marks are defined and the specific optimizations for each line described:

b - blocked
r - unrolled
V - Vectorized
....
ftn-6254 ftn: VECTOR File = matmat.F, Line = 57
A loop starting at line 57 was not vectorized because a recurrence was found on "c" at line 61.

ftn-6294 ftn: VECTOR File = matmat.F, Line = 58
A loop starting at line 58 was not vectorized because a better candidate was found at line 60.

ftn-6049 ftn: SCALAR File = matmat.F, Line = 58
A loop starting at line 58 was blocked with block size 8.

ftn-6005 ftn: SCALAR File = matmat.F, Line = 58
A loop starting at line 58 was unrolled 4 times.

ftn-6254 ftn: VECTOR File = matmat.F, Line = 59
A loop starting at line 59 was not vectorized because a recurrence was found on "c" at line 61.

ftn-6049 ftn: SCALAR File = matmat.F, Line = 59
A loop starting at line 59 was blocked with block size 16.

ftn-6005 ftn: SCALAR File = matmat.F, Line = 60
A loop starting at line 60 was unrolled 2 times.

ftn-6204 ftn: VECTOR File = matmat.F, Line = 60
A loop starting at line 60 was vectorized.

Compiler directives and pragmas

The compiler will not vectorize a loop if it perceives that there may be a data dependency. However, when a user knows clearly that there is no data dependency, that information can be given to the compiler so that the loop is vectorized. This information is specified in source code using compiler directives in Fortran or pragmas in C/C++. They are placed just before the start of the loop of interest. Other directives/pragmas instruct the compiler to perform a certain task, such as allocating memory for an array in a certain fashion.

Some of the commonly used ones in the Intel compiler are shown below. Please see the Intel documentation for a complete list.

'!dir$ ivdep' or '#pragma ivdep'

This tells the compiler to ignore assumed vector dependencies in the loop that immediately follows the directive/pragma. However, this is just a recommendation, and the compiler will not vectorize the loop if there is a clear dependency.

The Cray compiler supports the same directive.
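
For example (an illustrative sketch, not taken from the Intel samples): the compiler must assume that 'k' could be negative, and therefore that a flow dependence may exist in the loop below. If the programmer knows that 'k' is always non-negative, the directive lets the compiler vectorize the loop:

      subroutine ignore_dep(a, b, n, k)
      integer n, k
      real a(n), b(n)
      integer i
!dir$ ivdep
      do i=1,n-k
         a(i) = a(i+k) + b(i)   ! safe when k >= 0: a(i+k) is read before iteration i+k overwrites it
      end do
      end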

'!dir$ vector' or '#pragma vector'

This overrides the default heuristics for vectorization of the loop. A user can provide a clause for a specific purpose. For example, '!dir$ vector always' (#pragma vector always) tells the compiler to try to vectorize the immediately following loop even when it would normally choose not to for efficiency reasons. As another example, '!dir$ vector aligned' (#pragma vector aligned) informs the compiler that all data in the loop is aligned on a certain byte boundary so that aligned SSE or AVX load and store instructions can be used (these are more efficient and less expensive than unaligned load and store instructions).

This directive may be ignored by the compiler when it thinks that there is a data dependency in the loop.

This directive is interpreted differently by the Cray Fortran and C compilers.  It turns on vectorization for all succeeding loop nests until a novector compiler directive is encountered or the end of the programming unit is reached.
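
A minimal sketch of how these clauses look in Fortran source (hypothetical loops; with the Intel compiler the directive applies to the loop that follows):

!dir$ vector aligned
      do i=1,n
         a(i) = b(i) + c(i)   ! the user guarantees that a, b and c are suitably aligned
      end do

!dir$ vector always
      do i=1,n
         a(i) = b(ix(i))      ! vectorize even if the compiler judges the gather inefficient
      end do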

'!dir$ novector' or '#pragma novector'

This tells the compiler to disable vectorization for the loop that follows.

The Cray C compiler interprets it solely for the following loop in the same way as the Intel compiler, but the Cray Fortran compiler turns off vectorization for all succeeding loop nests until a vector compiler directive is encountered or the end of the programming unit is reached.
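
For example (an illustrative sketch), to keep a loop in scalar mode when the vectorization overhead would not pay off:

!dir$ novector
      do i=1,n
         a(i) = b(i) + c(i)   ! keep this loop scalar, e.g. when n is known to be very small
      end do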

'!dir$ simd' or '#pragma simd'

This is used to enforce vectorization for a loop that the compiler doesn't auto-vectorize even with the use of vectorization hints such as '!dir$ vector always' ('#pragma vector always') or '!dir$ ivdep' ('#pragma ivdep'). Because of this enforcing nature, it is called user-mandated vectorization. A clause can accompany the directive to give a more specific direction.

The loop in the following subroutine will not be vectorized since the compiler cannot know what the value of 'k' might be.

      subroutine sub(a,b,n,k)
      integer n, k
      real a(n), b(n)
      integer i
      do i=k+1,n
         a(i) = 0.5 * (a(i-k) + b(i-k))
      end do
      end
% ftn -c -vec-report=3 sub.F
sub.F(5): (col. 7) remark: LOOP WAS VECTORIZED
sub.F(5): (col. 7) remark: loop was not vectorized: existence of vector dependence
sub.F(6): (col. 9) remark: vector dependence: assumed ANTI dependence between a line 6 and a line 6
sub.F(6): (col. 9) remark: vector dependence: assumed FLOW dependence between a line 6 and a line 6
sub.F(6): (col. 9) remark: vector dependence: assumed FLOW dependence between a line 6 and a line 6
sub.F(6): (col. 9) remark: vector dependence: assumed ANTI dependence between a line 6 and a line 6
sub.F(5): (col. 7) remark: loop skipped: multiversioned

Depending on the actual value of 'k', there can be a data dependency. However, a user who knows for sure that the value of k will always be larger than 8 can use '!dir$ simd vectorlength(8)' (or simply '!dir$ simd') to vectorize the loop.

!dir$ simd vectorlength(8)
      do i=k+1,n
         a(i) = 0.5 * (a(i-k) + b(i-k))
      end do
% ftn -c -vec-report=3 sub.F
sub.F(6): (col. 7) remark: SIMD LOOP WAS VECTORIZED

Vectorization guidelines

Memory alignment

The data movement instructions of SSE (i.e., loading into a vector register and storing from a register) operate more efficiently on data objects when they are aligned on 16-byte boundaries in memory (that is, the starting memory address of a data object is a multiple of 16 bytes). Data objects for AVX instructions should be aligned on 32-byte boundaries for good performance.

One can enforce memory alignment using the clause '__attribute__((aligned(...)))' in the declaration statement and use '#pragma vector aligned' in relevant for-loops. In Fortran you can use the 'attributes align' directive in the declaration statement and 'vector aligned' directive for loops. Example codes on this issue can be found in the Intel compiler package: in the ${INTEL_PATH}/Samples/en_US/{Fortran,C++}/vec_samples directory (the environment variable INTEL_PATH is defined whenever an 'intel' module is loaded into your programming environment).

% cd ${INTEL_PATH}/Samples/en_US/C++/vec_samples
% cat Driver.c
...
#ifdef ALIGNED
...
        FTYPE a[ROW][COLWIDTH]  __attribute__((aligned(16)));
        FTYPE b[ROW]                    __attribute__((aligned(16)));
        FTYPE x[COLWIDTH]               __attribute__((aligned(16)));
...
#else
...
#endif
...

% cat Multiply.c
...
#ifdef ALIGNED
// The pragma vector aligned below tells the compiler to assume that the data in
// the loop is aligned on 16-byte boundary so the vectorizer can use
// aligned instructions to generate faster code.
#pragma vector aligned
#endif

                for (j = 0;j < size2; j++) {
                        b[i] += a[i][j] * x[j];
                }
...

The following test code examines the effect of memory alignment in a simple-minded matrix-matrix multiplication case. Here, we pad the matrices with extra rows to make them aligned at certain boundaries. We examine memory alignment at 16, 32, and also 64 byte boundaries. The 64-byte case is also considered here since Edison and Hopper's cache line size of 64 bytes has a performance implication (in terms of better cache utilization).

      program matmat

!...  Prepared by NERSC User Services Group
!...  April, 2014

      implicit none
      integer :: n = 31
      integer :: itmax = 200000
#ifdef REAL4
      real, allocatable :: a(:,:), b(:,:), c(:,:)
#else
      real*8, allocatable :: a(:,:), b(:,:), c(:,:)
#endif
#ifdef ALIGN16
!dir$ attributes align : 16 :: a,b,c
#elif defined(ALIGN32)
!dir$ attributes align : 32 :: a,b,c
#elif defined(ALIGN64)
!dir$ attributes align : 64 :: a,b,c
#endif
      integer i, j, k, it
      integer :: vl, nr
      integer*8 c1, c2, cr, cm
      real*8 dt

!...  Vector length

#ifdef ALIGN16
      vl = 16 / (storage_size(a) / 8)
#elif defined(ALIGN32)
      vl = 32 / (storage_size(a) / 8)
#elif defined(ALIGN64)
      vl = 64 / (storage_size(a) / 8)
#else
      vl = 1
#endif

      nr = ((n + (vl - 1)) / vl) * vl      ! padded row dimension
      allocate (a(nr,n), b(nr,n), c(nr,n))

!...  Initialization

      do j=1,n 
#if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64)
!dir$ vector aligned
#endif
         do i=1,nr
            a(i,j) = cos(i * 0.1 + j * 0.2)
            b(i,j) = sin(i * 0.1 + j * 0.2)
            c(i,j) = 0.
         end do
      end do

!...  Main loop

      call system_clock(c1, cr, cm)
      do it=1,itmax
         do j=1,n
            do k=1,n
#if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64)
!dir$ vector aligned
#endif
               do i=1,nr
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
               end do
            end do
         end do
      end do
      call system_clock(c2, cr, cm)

      print *, c(1,1)+c(n,n), dble(c2-c1)/dble(cr)
      deallocate(a, b, c)
      end

The speedup factors observed on Edison are plotted for small array dimensions using single ('R*4') precision.

Note that the i index value goes up to nr, the padded row dimension of the matrix. Performance is better when looping up to nr instead of n; the data in rows n+1 through nr is simply ignored.

AoS (Array of Structures) vs. SoA (Structure of Arrays)

A data object can become complex, with multiple component elements or attributes. Programmers often represent a group of such data objects using an array of Fortran derived-type or C struct objects (i.e., an array of structures, or AoS). Although an AoS provides a natural way to represent such data, referencing any single component across the array requires non-unit-stride access. Such a situation is illustrated in the following example code. When the main loop is transformed into a vector loop, the x, y and z components of the 'coords' objects will be loaded into three separate vector registers, one for each component. With the AoS data layout, loading such a register requires stride-3 (or larger) access, reducing the efficiency of the vector load.

A better data structure for vectorization is to separate each component of the objects into its own array, and then form a data object composed of three arrays (i.e., a structure of arrays or SoA). When the main loop is vectorized, each component will be loaded into a separate register but this will be done with unit-stride access. Therefore, vectorization will be more efficient.

      program aossoa

!...  Prepared by NERSC User Services Group
!...  April, 2014

      implicit none
      integer :: n = 1000
      integer :: itmax = 10000000
#ifdef SOA
      type coords
         real, pointer :: x(:), y(:), z(:)
      end type
      type (coords) :: p
#else
      type coords
         real :: x, y, z
      end type
      type (coords), allocatable :: p(:)
#endif
      real, allocatable :: dsquared(:)
      integer i, it
      integer*8 c1, c2, cr, cm
      real*8 dt

!...  Initialization

#ifdef SOA
      allocate(p%x(n), p%y(n), p%z(n), dsquared(n))
      do i=1,n
         p%x(i) = cos(i + 0.1)
         p%y(i) = cos(i + 0.2)
         p%z(i) = cos(i + 0.3)
      end do
#else
      allocate(p(n), dsquared(n))
      do i=1,n
         p(i)%x = cos(i + 0.1)
         p(i)%y = cos(i + 0.2)
         p(i)%z = cos(i + 0.3)
      end do
#endif

!...  Main loop

      call system_clock(c1, cr, cm)
      do it=1,itmax
#ifdef SOA
         do i=1,n
            dsquared(i) = p%x(i)**2 + p%y(i)**2 + p%z(i)**2
         end do
#else
         do i=1,n
            dsquared(i) = p(i)%x**2 + p(i)%y**2 + p(i)%z**2
         end do
#endif
      end do
      call system_clock(c2, cr, cm)

      dt = dble(c2-c1)/dble(cr)
      print *, dsquared(1)+dsquared(n/2)+dsquared(n), dt
#ifdef SOA
      deallocate(p%x, p%y, p%z, dsquared)
#else
      deallocate(p, dsquared)
#endif
      end

The following performance comparison observed on Edison shows that the SoA data layout gives better performance.

Elemental function

Elemental functions are functions that can also be invoked with an array actual argument, returning an array result of the same shape as the argument array. This convenient feature is quite common in Fortran, as it is widely used in many intrinsic functions.

The Intel compiler allows a user to declare a function as elemental for vectorization purposes. The compiler then generates a vector version of the function as well as a scalar version.

Although a function call inside a loop generally inhibits vectorization, the loop may be executed in vector mode if an elemental function is involved. Below is a simple example code showing how to declare an elemental function for the Intel compiler.

!...  Prepared by NERSC User Services Group
!...  April, 2014

      module fofx
        implicit none
      contains
        function f(x)
#ifdef ELEMENTAL
!dir$ attributes vector :: f
#endif
#ifdef REAL4
          real f, x
#else
          real*8 f, x
#endif
          f = cos(x * x + 1.) / (x * x + 1.)
        end function f
      end module fofx

      program elemental
        use fofx
        implicit none
        integer :: n = 1024
        integer :: itmax = 1000000
#ifdef REAL4
        real, allocatable :: a(:), x(:)
#else
        real*8, allocatable :: a(:), x(:)
#endif
        integer i, it
        integer*8 c1, c2, cr, cm
        real*8 dt

        allocate (a(n), x(n))

!...    Initialization

        do i=1,n
          x(i) = cos(i * 0.1)
        end do

!...    Main loop

        call system_clock(c1, cr, cm)
        do it=1,itmax
          do i=1,n
            a(i) = f(x(i))
          end do
          x(n) = cos(x(n))
        end do
        call system_clock(c2, cr, cm)

        dt = dble(c2-c1)/dble(cr)
        print *, a(1)+a(n/2)+a(n), dt

        deallocate(a, x)
      end

Again, the following performance was obtained on Edison.

 

'restrict' keyword

The 'restrict' keyword, supported by the Intel C/C++ compiler, may be used for a pointer argument in a C/C++ function to indicate that the pointer provides exclusive access to the memory it references within the function and that no other pointer can access it. This resolves the ambiguity about whether two pointers in a loop may reference a common memory region, which could create a vector dependency. Example code is again found under the ${INTEL_PATH} directory.

% cat ${INTEL_PATH}/Samples/en_US/C++/vec_samples/Multiply.h
...
#ifndef FTYPE
#define FTYPE double
#endif
...

% cat ${INTEL_PATH}/Samples/en_US/C++/vec_samples/Multiply.c
...
// The "restrict" qualifier tells the compiler that pointer b does
// not alias with pointers a and x meaning that pointer b does not point
// to the same memory location as a or x and there is no overlap between
// the corresponding arrays.
...
#ifdef NOALIAS
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[restrict], FTYPE x[])
#else
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[], FTYPE x[])
#endif
...
                for (j = 0;j < size2; j++) {
                        b[i] += a[i][j] * x[j];
                }
...

As the comment in the code explains, the 'restrict' keyword declares that the memory referenced by 'b' doesn't overlap the regions that the pointers a and x reference (see the 'NOALIAS' portion). Therefore, there is no vector dependency in the loop. Copy the files to your working directory and compile them as follows:

% cp ${INTEL_PATH}/Samples/en_US/C++/vec_samples/Multiply.{c,h} .
% cc -c -vec-report=3 -restrict -DNOALIAS -I. Multiply.c
Multiply.c(55): (col. 3) remark: LOOP WAS VECTORIZED
...

Vectorization with respect to arithmetic operations

In this section, we present floating-point and integer arithmetic operation rates measured with respect to vectorization compile flags and operand data types on compute nodes of Edison and Hopper as well as Babbage's KNC (Intel Knights Corner) cards. This work was inspired by a study for Sandy Bridge and Westmere CPUs. By comparing the performance with and without vectorization-enabling compile flags, we can see the expected vectorization speedup associated with a particular data type and operation type.

A Fortran code is built with the Intel compiler, and a single task is run on a node. Elapsed times are measured as each arithmetic operation is applied up to 20 times, and the per-operation cost is estimated by least-squares fitting of the elapsed times. Below we show plots of arithmetic operation rates for different data types and operation types on different NERSC machines. The results are in billions of floating point operations per second (Gflops) in the case of floating point operands, or billions of integer operations per second in the case of integer operands. 'R*4' is for 4-byte reals, 'R*8' is for 8-byte reals, 'I*4' is for 4-byte integers, 'I*8' is for 8-byte integers, 'C*8' is for 8-byte complex variables (4 bytes each for the real and imaginary parts), and 'C*16' is for 16-byte complex variables (8 bytes each for the real and imaginary parts).

Edison's Ivy Bridge

Edison compute nodes (Ivy Bridge) support both SSE and AVX instructions. By default, the compiler uses AVX instructions, but we can also select a type with the Intel compile flags '-xSSE4.2' and '-xAVX'. Vector widths with SSE and AVX instructions are 128 bits (16 bytes) and 256 bits (32 bytes), respectively. The Intel compiler version used is 14.0.2.144.

For some data and operation types, we see the expected vector speedup for the vectorization type used (4 times with SSE and 8 times with AVX in the case of 4-byte operands; 2 times with SSE and 4 times with AVX in the case of 8-byte operands). Others (such as SQRT() operations with AVX or integer operations with AVX) don't show the expected vectorization speedup. For some data and operation types, such as all integer operations, no additional speedup with AVX over SSE is observed. Some operations (for example, EXP()) show worse performance with vectorization.

Hopper's Magny-Cours

Hopper's compute nodes (Magny-Cours) support SSE2 instructions. Vector registers are 128 bits (16 bytes) wide. The Intel compiler version used is 14.0.0.080.

Babbage's KNC

NERSC's Intel Xeon Phi testbed cluster, Babbage, has Intel Knights Corner (KNC) cards. KNC vector registers are 512 bits (64 bytes) wide. The Intel compiler version used is 14.0.0.080.