Optimizing for Intel Xeon Phi Knights Landing

Steven Warren

swarren@cray.com
Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: CHAPEL, CLUSTER CONNECT, CLUSTERSTOR, CRAYDOC, CRAYPAT, CRAYPORT, DATAWARP, ECOPLEX, LIBSCI, NODEKARE, REVEAL. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used on this website are the property of their respective owners.
Outline

• Make KNL run my code faster!
  • Vectorization
  • Cache blocking

• --exclusive
Optimizing for Intel Xeon Phi Knights Landing

- Many KNL-specific optimizations involve MCDRAM in “Flat” mode
  - Since Cori uses “cache” mode, these optimizations generally do not apply
  - If application can strongly scale efficiently, can use enough nodes such that the memory footprint/node is less than 16 GB and fit into MCDRAM

- KNL is an x86 processor, thus many of the things you would do for any x86 processor will apply
  - i.e., work done to improve KNL performance will generally improve performance on other modern processors as well
KNL strengths and weaknesses

• Strengths
  • MCDRAM memory bandwidth
    • Effectively a large L3 in “cache” mode (no dedicated L3 on KNL)
  • Larger L2 per core (1 MB / 2-core tile)
  • AVX512 vectors
    • Allows more operations per cycle than previous generations of processors

• Weaknesses
  • Clock GHz
    • Affects scalar operations

• Optimization strategy
  • Vectorize and/or cache block important kernels
But first, a note about affinity…
Process / Thread / Memory Affinity

• Correct process, thread and memory affinity is the basis for getting optimal performance on KNL. It is also essential for guiding further performance optimizations.
  — Process Affinity: bind MPI tasks to CPUs
  — Thread Affinity: bind threads to CPUs allocated to its MPI process
  — Memory Affinity: allocate memory from specific NUMA domains

• Our goal is to promote OpenMP standard settings for portability. For example, OMP_PROC_BIND and OMP_PLACES are preferred to Intel specific KMP_AFFINITY and KMP_PLACE_THREADS settings.

The following NERSC slides stolen from Helen. Thanks Helen!
xthi.c

• XTHI is a **very** useful application that will tell you whether or not you are getting the expected placement behavior.
  
  • [https://github.com/olcf/XC30-Training/blob/master/affinity/Xthi.c](https://github.com/olcf/XC30-Training/blob/master/affinity/Xthi.c)

• Different compilers and MPI stacks have different affinity rules
  
  • i.e., what works for Intel likely will not work for Cray or GNU

• Replace the call to your application binary to the xthi binary in your srun line to check affinity.
  
  • Can do this at any scale, but it’s best to change the number of PEs to use a single node to avoid confusion of the output.
“numactl -H” displays NUMA info

68-core Quad Cache node:
NUMA Domain 0: all 68 cores (272 logic cores)

```
yunhe@cori01:~$ salloc -N 1 --qos=interactive -C knl,quad,cache -t 30:00
salloc: Granted job allocation 5291739

yunhe@nid02305:~$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 96762 MB
node 0 free: 93067 MB
node distances:
  node 0
    0: 10
```

- The quad,cache mode has only 1 NUMA node with all CPUs on the NUMA node 0 (DDR memory)
- The MCDRAM is hidden from the numactl -H command (it is a cache).
Can We Just Do a Naïve Srun?

Example: 16 MPI tasks x 8 OpenMP threads per task on a single 68-core KNL quad, cache node:

% export OMP_NUM_THREADS=8
% export OMP_PROC_BIND=spread
% export OMP_PLACES=threads

(ways to specify explicit lists, etc.)

(other choice are “close”, “master”, “true”, “false”)

% srun -n 16 ./xthi | sort -k4n,6n
Hello from rank 0, thread 0, on nid02304. (core affinity = 0)  (on physical core 8)
Hello from rank 0, thread 1, on nid02304. (core affinity = 144)
Hello from rank 0, thread 2, on nid02304. (core affinity = 17)
Hello from rank 0, thread 3, on nid02304. (core affinity = 161)  (on physical core 25)
Hello from rank 0, thread 4, on nid02304. (core affinity = 34)
Hello from rank 0, thread 5, on nid02304. (core affinity = 178)  (on physical core 42)
Hello from rank 0, thread 6, on nid02304. (core affinity = 51)
Hello from rank 0, thread 7, on nid02304. (core affinity = 195)  (on physical core 59)
Hello from rank 1, thread 0, on nid02304. (core affinity = 0)
Hello from rank 1, thread 1, on nid02304. (core affinity = 144)

It is a mess!
Importance of -c and --cpu_bind Options

• The reason: 68 is not divisible by #MPI tasks!
  — Each MPI task is getting 68x4/#MPI tasks of logical cores as the domain size
  — MPI tasks are crossing tile boundaries

• Set number of logical cores per MPI task (-c) manually by wasting extra 4 cores on purpose: 256/#MPI_tasks_per_node.
  — Meaning to use 64 cores only on the 68-core KNL node, and spread the logical cores allocated to each MPI task evenly among these 64 cores.
  — Now it looks good!
  — % srun -n 16 -c 16 --cpu_bind=cores ./xthi
    Hello from rank 0, thread 0, on nid09244. (core affinity = 0)
    Hello from rank 0, thread 1, on nid09244. (core affinity = 136)
    Hello from rank 0, thread 2, on nid09244. (core affinity = 1)
    Hello from rank 0, thread 3, on nid09244. (core affinity = 137)
Now It Looks Good!

Process/thread affinity are good! (Marked first 6 and last MPI tasks only)

<table>
<thead>
<tr>
<th></th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
<th>12</th>
<th>13</th>
<th>14</th>
<th>15</th>
<th>16</th>
<th>17</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>68</td>
<td>69</td>
<td>70</td>
<td>71</td>
<td>72</td>
<td>73</td>
<td>74</td>
<td>75</td>
<td>76</td>
<td>77</td>
<td>78</td>
<td>79</td>
<td>80</td>
<td>81</td>
<td>82</td>
<td>83</td>
<td>84</td>
</tr>
<tr>
<td>1</td>
<td>138</td>
<td>137</td>
<td>136</td>
<td>139</td>
<td>140</td>
<td>141</td>
<td>142</td>
<td>143</td>
<td>144</td>
<td>145</td>
<td>146</td>
<td>147</td>
<td>148</td>
<td>149</td>
<td>150</td>
<td>151</td>
<td>152</td>
</tr>
<tr>
<td>2</td>
<td>204</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>204</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>220</td>
<td>221</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>220</td>
<td>221</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>18</td>
<td>19</td>
<td>20</td>
<td>21</td>
<td>22</td>
<td>23</td>
<td>24</td>
<td>25</td>
<td>26</td>
<td>27</td>
<td>28</td>
<td>29</td>
<td>30</td>
<td>31</td>
<td>32</td>
<td>33</td>
<td>34</td>
</tr>
<tr>
<td>5</td>
<td>86</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>86</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>154</td>
<td>155</td>
<td>156</td>
<td>157</td>
<td>158</td>
<td>159</td>
<td>160</td>
<td>161</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>222</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>36</td>
<td>37</td>
<td>38</td>
<td>39</td>
<td>40</td>
<td>41</td>
<td>42</td>
<td>43</td>
<td>44</td>
<td>45</td>
<td>46</td>
<td>47</td>
<td>48</td>
<td>49</td>
<td>50</td>
<td>51</td>
<td></td>
</tr>
<tr>
<td>9</td>
<td>104</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>172</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>240</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12</td>
<td>52</td>
<td>53</td>
<td>54</td>
<td>55</td>
<td>56</td>
<td>57</td>
<td>58</td>
<td>59</td>
<td>60</td>
<td>61</td>
<td>62</td>
<td>63</td>
<td>64</td>
<td>65</td>
<td>66</td>
<td>67</td>
<td>64</td>
</tr>
<tr>
<td>13</td>
<td>120</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>14</td>
<td>188</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td>256</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

And so on for other MPI tasks and threads ....
Essential Runtime Settings for Process/Thread Affinity

• Use `srun -c` and `--cpu_bind` flags to bind tasks to CPUs
  — `-c <n>` (or `--cpus-per-task=n`) allocates (reserves) `n` CPUs per task (process). It helps to evenly spread MPI tasks, can use up to `n` OpenMP threads per MPI task.
  — Use `--cpu_bind=cores` (no hyperthreads) or `--cpu_bind=threads` (if hyperthreads are used)

• Use OpenMP envs: `OMP_PROC_BIND`, `OMP_PLACES` to fine pin each thread to a subset of CPUs allocated to the host task
  — Different compilers may have different implementations
  — The following provide compatible thread affinity among Intel, GNU and Cray compilers:
    
    `OMP_PROC_BIND=true` # Specifying threads may not be moved between CPUs
    `OMP_PLACES=threads` # Specifying a thread should be placed on a single CPU

• Verify with XTHI before running your code!
But second… Dynamic vs Static linking on KNL
Dynamic vs Static Linking for Qbox on KNL

• Lines 248 – 249 in qb.C require glibc, which is a collection of dynamic libraries in many current operating systems

  • if ( getlogin() != 0 )
    cout << "<user> " << getlogin() << " </user>" << endl;

• Performance can be greatly increased on KNL for statically linked executables.
Options to link statically

• Can statically link in Cray Libsci libraries (executable remains dynamic) to alleviate some of the performance loss by setting:

  • \texttt{LIBS = -Wl,-Bstatic -lsci\_cray\_mpi\_mp -lsci\_cray\_mp \\ -lfftw3f\_mpi -lfftw3f\_omp -lfftw3f -lfftw3\_mpi \\ -lfftw3\_omp -lfftw3 -Wl,-Bdynamic}

• Or compile fully static but add extra compile flags to \texttt{qb.C}:

  • \	exttt{'-Dmain=stealthy()\{return 0;\} char* stealth()\{return getenv("USER");\} int main' -Dgetlogin=stealth}

• Or one could simply modify the code in \texttt{qb.C} to use \texttt{getenv()} instead of \texttt{getlogin()} and compile fully static.
Dynamic vs Static Linking for Qbox on KNL

- For a 256 node, 880 atom Qbox run using 32 MPI ranks/node and 2 OpenMP threads/rank with nrowmax set to 256 yields the following results:

<table>
<thead>
<tr>
<th>Link type</th>
<th>Dynamic Linking</th>
<th>Static Linking</th>
<th>Dynamic Linking with Statically Linked Cray Libsci libraries</th>
</tr>
</thead>
<tbody>
<tr>
<td>max time (run time)</td>
<td>330 s</td>
<td>198 s</td>
<td>215 s</td>
</tr>
</tbody>
</table>
Example Analysis and Optimizations:

Vectorization
What is vectorization?

• Vectorization is the practice of converting an algorithm to work on a set of values simultaneously instead of a single value one-by-one.

What prevents vectorization?

• Complexity in loops which the compiler can not interpret
  • Indirect memory accesses
  • Logical statements
  • Recurrences on variables
**How To Know If Your Loops Are Vectorizing**

- CCE can provide “listing” files with compilation which will give an easily interpreted and detailed description of every line in your source
  - `hlist=a`

- Intel and GNU compiler provide similar capabilities.

- Use the listing file to determine if your changes allow the compiler to apply better optimizations
  - You do NOT need to execute the code to check if the compiler applies optimizations

<table>
<thead>
<tr>
<th>%%%</th>
<th>Loop mark</th>
<th>Legend</th>
<th>%%%</th>
</tr>
</thead>
<tbody>
<tr>
<td>Primary Loop Type</td>
<td>Modifiers</td>
<td></td>
<td></td>
</tr>
<tr>
<td>A - Pattern matched</td>
<td>a - atomic memory operation</td>
<td></td>
<td></td>
</tr>
<tr>
<td>b - blocked</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>C - Collapsed</td>
<td>c - conditional and/or computed</td>
<td></td>
<td></td>
</tr>
<tr>
<td>D - Deleted</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>E - Cloned</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>F - Flat - No calls</td>
<td>f - fused</td>
<td></td>
<td></td>
</tr>
<tr>
<td>G - Accelerated</td>
<td>g - partitioned</td>
<td></td>
<td></td>
</tr>
<tr>
<td>I - Inlined</td>
<td>i - interchanged</td>
<td></td>
<td></td>
</tr>
<tr>
<td>M - Multithreaded</td>
<td>m - partitioned</td>
<td></td>
<td></td>
</tr>
<tr>
<td>n - non-blocking remote transfer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>p - partial</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>R - Rerolling</td>
<td>r - unrolled</td>
<td></td>
<td></td>
</tr>
<tr>
<td>s - shortloop</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>V - Vectorized</td>
<td>w - unwound</td>
<td></td>
<td></td>
</tr>
<tr>
<td>+ - More messages listed at end of listing</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

© 2019 Cray Inc.
Example Loop

67. 1 2  PF = 0.0
68. + 1 2 3--< DO 44030 I = 2, N
69. 1 2 3  AV = B(I) * RV
70. 1 2 3  PB = PF
71. 1 2 3  PF = C(I)
72. 1 2 3  IF ((D(I) + D(I+1)) .LT. 0.) PF = -C(I+1)
73. 1 2 3  AA = E(I) - E(I-1) + F(I) - F(I-1)
74. 1 2 3  1  + G(I) + G(I-1) - H(I) - H(I-1)
75. 1 2 3  BB = R(I) + S(I-1) + T(I) + T(I-1)
76. 1 2 3  1  - U(I) - U(I-1) + V(I) + V(I-1)
77. 1 2 3  2  - W(I) + W(I-1) - X(I) + X(I-1)
78. 1 2 3  A(I) = AV * (AA + BB + PF - PB + Y(I) - Z(I)) + A(I)
79. 1 2 3--> 44030 CONTINUE

ftn-6254 ftn: VECTOR LP44030, File = lp44030.f, Line = 68
A loop starting at line 68 was not vectorized because a recurrence was found on "pf"
at line 71.

• There is a recurrence on the scalar ‘PF’
• Use the ‘explain’ tool to learn more about what a recurrence is
  > explain ftn-6254
Example

Baseline Performance
Single Core

Mflops

knl  hsw  bdw1

baseline

© 2019 Cray Inc.
What’s Preventing Vectorization?

• Let’s do a vector dependency analysis assuming VL=2

```
68. + 1 2 3--< DO 44030 I = 2, N
...
70. 1 2 3 PB = PF
71. 1 2 3 PF = C(I)
72. 1 2 3 IF ((D(I) + D(I+1)) .LT. 0.) PF = -C(I+1)
...
78. 1 2 3 A(I) = AV * (AA + BB + PF - PB + Y(I) - Z(I)) + A(I)
```

\[
PB \begin{pmatrix} 2 \\ 3 \end{pmatrix} \propto PF \begin{pmatrix} 1 \\ 2 \end{pmatrix} \quad PF \begin{pmatrix} 2 \\ 3 \end{pmatrix} \propto C \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix} \quad A \begin{pmatrix} 2 \\ 3 \end{pmatrix} \propto PB, PF \rightarrow PF \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
\]

• Vectorization may be possible with modification, but loop is not concurrent safe
What can we do to vectorize this loop?

- Convert PF from a scalar to a vector (1-D array)
  - Warning! Be cognizant of how changing this variable may affect other regions of the code
    - Is PF a global or local variable? Is the final result of PF used elsewhere?
      - May need to use a temporary variable array for the loop and store back into PF if needed
  - Eliminates the need for the PB scalar variable in the loop
Optimization changes

• What optimizations did the compiler apply to our new version?

66.  1 2                      VPF(1) = 0.0
67.  1 2 Vr2--<               DO 44031 I = 2, N
68.  1 2 Vr2                  AV = B(I) * RV
69.  1 2 Vr2                  VPF(I) = C(I)
70.  1 2 Vr2                  IF ((D(I) + D(I+1)) .LT. 0.) VPF(I) = -C(I+1)
71.  1 2 Vr2                  AA = E(I) - E(I-1) + F(I) - F(I-1)
72.  1 2 Vr2                  1   + G(I) + G(I-1) - H(I) - H(I-1)
73.  1 2 Vr2                  BB = R(I) + S(I-1) + T(I) + T(I-1)
74.  1 2 Vr2                  1   - U(I) - U(I-1) + V(I) + V(I-1)
75.  1 2 Vr2                  2   - W(I) + W(I-1) - X(I) + X(I-1)
76.  1 2 Vr2                  A(I) = AV * (AA + BB + VPF(I) - VPF(I-1) + Y(I) - Z(I)) +
A(I)
77.  1 2 Vr2---> 44031 CONTINUE

ftn-6005  ftn: SCALAR LP44030, File = lp44030.f, Line = 67
A loop starting at line 67 was unrolled 2 times.

ftn-6204  ftn: VECTOR LP44030, File = lp44030.f, Line = 67
A loop starting at line 67 was vectorized.

• How does the performance of this version compare with the original?
Original vs Vectorized performance

Performance Single Core

Mflops

- knl
- hsw
- bdw1

baseline
Example
Analysis and Optimizations:
Cache blocking
Data Reuse will be important

• Data reuse will be critical to performance

• Reuse out of MCDRAM will reduce requirements on main memory

• Reuse out of lower levels of cache will lower requirements on MCDRAM

• In order to know how to cache block properly we need to know the trip counts of loops and the sizes of various arrays as accurately as possible
A SIMPLE EXAMPLE

- 2D 5-point Laplacian
  
  \[
  \text{do } j = 1, 8 \\
  \text{do } i = 1, 16 \\
  d(i,j) = u(i-1,j) + u(i+1,j) \& \\
  \quad - 4*u(i,j) \& \\
  \quad + u(i,j-1) + u(i,j+1) \\
  \text{end do} \\
  \text{end do}
  \]

- Simple cache structure for this example:
  - Assume each cache line holds 4 array elements
  - And cache can hold 12 lines of u data
  - No cache reuse between outer loop iterations
BLOCKING = STRIPMINE + INTERCHANGE

**Stripmine**

\[
\text{do } j = 1, 8 \\
\text{do } i = 1, 16 \\
\text{d}(i,j) = \text{stencil} \\
\text{end do} \\
\text{end do}
\]

**Interchange**

\[
\text{do } j = 1, 8 \\
\text{do } \text{IBLOCK} = 1, 16, 4 \\
\text{do } i = \text{IBLOCK}, \text{IBLOCK}+3 \\
\text{d}(i,j) = \text{stencil} \\
\text{end do} \\
\text{end do} \\
\text{end do}
\]

**Blocked!**

\[
\text{do IBLOCK} = 1, 16, 4 \\
\text{do } j = 1, 8 \\
\text{do } i = \text{IBLOCK}, \text{IBLOCK}+3 \\
\text{d}(i,j) = \text{stencil} \\
\text{end do} \\
\text{end do} \\
\text{end do}
\]
• Block the inner loop

\[
\begin{align*}
\text{do } IBLOCK &= 1, 16, 4 \\
\text{do } j &= 1, 8 \\
\text{do } i &= IBLOCK, IBLOCK + 3 \\
\text{d}(i,j) &= u(i-1,j) + u(i+1,j) & \text{&} \\
&\quad - 4*u(i,j) & \text{&} \\
&\quad + u(i,j-1) + u(i,j+1) \\
\text{end do} \\
\text{end do} \\
\text{end do}
\end{align*}
\]

• Now we have reuse of the \(j+1\) data
EVEN BETTER!

- Iterate over $4 \times 4$ blocks for better spatial locality

```plaintext
do JBLOCK = 1, 8, 4
    do IBLOCK = 1, 16, 4
        do j = JBLOCK, JBLOCK + 3
            do i = IBLOCK, IBLOCK + 3
                d(i,j) = u(i-1,j) + u(i+1,j) &
                - 4*u(i,j) &
                + u(i,j-1) + u(i,j+1)
            end do
        end do
    end do
end do
```

- CCE has directives for this
  - `!dir$ blockable(i,j)`
  - `!dir$ blockingsize(4)`
Example Analysis and Optimization:
miniGhost
Example app: miniGhost

• “mini-app” from the NERSC8 procurement.

• 27-point 3-D stencil application

• Simulates diffusion

• Like most stencil codes, it is main memory bandwidth bound
  • Data reuse will lessen contention for memory accesses
Main compute loop

- Craypat suggests the following loop is about ~50% of the run time

```fortran
288. + b------------< DO K = 1, NZ
289. + b------------< DO J = 1, NY
290. b b Vb--------< DO I = 1, NX
291. b b Vb
292. b b Vb SLICE_BACK = GRID(I-1,J-1,K-1) + GRID(I-1,J,K-1) + GRID(I-1,J+1,K-1) + &
293. b b Vb GRID(I,J-1,K-1) + GRID(I,J,K-1) + GRID(I,J+1,K-1) + &
294. b b Vb GRID(I+1,J-1,K-1) + GRID(I+1,J,K-1) + GRID(I+1,J+1,K-1)
295. b b Vb
296. b b Vb SLICE_MINE = GRID(I-1,J-1,K) + GRID(I-1,J,K) + GRID(I-1,J+1,K) + &
297. b b Vb GRID(I,J-1,K) + GRID(I,J,K) + GRID(I,J+1,K) + &
298. b b Vb GRID(I+1,J-1,K) + GRID(I+1,J,K) + GRID(I+1,J+1,K)
299. b b Vb
300. b b Vb SLICE_FRONT = GRID(I-1,J-1,K+1) + GRID(I-1,J,K+1) + GRID(I-1,J+1,K+1) + &
301. b b Vb GRID(I,J-1,K+1) + GRID(I,J,K+1) + GRID(I,J+1,K+1) + &
302. b b Vb GRID(I+1,J-1,K+1) + GRID(I+1,J,K+1) + GRID(I+1,J+1,K+1)
303. b b Vb
304. b b Vb WORK(I,J,K) = ( SLICE_BACK + SLICE_MINE + SLICE_FRONT ) / 27.0
305. b b Vb
306. b b Vb--------> END DO
307. b b--------> END DO
308. b--------> END DO
```

- CCE does vectorize and also attempts to cache block the inner loop, but can we do better?
Listing file explanations

• CCE may attempt to cache block for L2 based upon the targeted architecture.
  • Generally, L1 is too small and L3 is too “slow”

```plaintext
ftn-6294 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 287
  A loop starting at line 287 was not vectorized because a better candidate was found at line 289.

ftn-6049 ftn: SCALAR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 287
  A loop starting at line 287 was blocked with block size 8.

ftn-6294 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 288
  A loop starting at line 288 was not vectorized because a better candidate was found at line 289.

ftn-6049 ftn: SCALAR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 288
  A loop starting at line 288 was blocked with block size 8.

ftn-6049 ftn: SCALAR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 289
  A loop starting at line 289 was blocked with block size 256.

ftn-6204 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 289
  A loop starting at line 289 was vectorized.
```
Blocking = Stripmine + Interchange

287.  + 1------------<  DO KK = 1, NZ, block_k
288.  + 1 2-----------<   DO JJ = 1, NY, block_j
289.  + 1 2 3--------<   DO II = 1, NX, block_i
290.  + 1 2 3 4------<   DO K = KK, KK+(block_k-1)
291.  + 1 2 3 4 5-----<   DO J = JJ, JJ+(block_j-1)
292. 1 2 3 4 5 V--<>   DO I = II, II+(block_i-1)
293. 1 2 3 4 5 V
294. 1 2 3 4 5 V
295. 1 2 3 4 5 V
296. 1 2 3 4 5 V
297. 1 2 3 4 5 V
298. 1 2 3 4 5 V
299. 1 2 3 4 5 V
300. 1 2 3 4 5 V
301. 1 2 3 4 5 V
302. 1 2 3 4 5 V
303. 1 2 3 4 5 V
304. 1 2 3 4 5 V
305. 1 2 3 4 5 V
306. 1 2 3 4 5 V
307. 1 2 3 4 5 V
308. 1 2 3 4 5 V-->
309. 1 2 3 4 5----->
310. 1 2 3 4------>
311. 1 2 3-------->
312. 1 2---------->
313. 1----------->

SLICE_BACK = GRID(I-1,J-1,K-1) + GRID(I-1,J,K-1) + GRID(I-1,J+1,K-1) + &
   GRID(I ,J-1,K-1) + GRID(I ,J,K-1) + GRID(I ,J+1,K-1) + &
   GRID(I+1,J-1,K-1) + GRID(I+1,J,K-1) + GRID(I+1,J+1,K-1)

SLICE_MINE = GRID(I-1,J-1,K) + GRID(I-1,J,K) + GRID(I-1,J+1,K) + &
   GRID(I ,J-1,K) + GRID(I ,J,K) + GRID(I ,J+1,K) + &
   GRID(I+1,J-1,K) + GRID(I+1,J,K) + GRID(I+1,J+1,K)

SLICE_FRONT = GRID(I-1,J-1,K+1) + GRID(I-1,J,K+1) + GRID(I-1,J+1,K+1) + &
   GRID(I ,J-1,K+1) + GRID(I ,J,K+1) + GRID(I ,J+1,K+1) + &
   GRID(I+1,J-1,K+1) + GRID(I+1,J,K+1) + GRID(I+1,J+1,K+1)

WORK(I,J,K) = ( SLICE_BACK + SLICE_MINE + SLICE_FRONT ) / 27.0

END DO
END DO
END DO
END DO
END DO
Listing file explanations

ftn-6306 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 287
   A loop starting at line 287 was not vectorized because the iteration space is too irregular.

ftn-6306 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 288
   A loop starting at line 288 was not vectorized because the iteration space is too irregular.

ftn-6303 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 289
   A loop starting at line 289 was not vectorized because an inter-loop dependence relation is too complicated.

ftn-6303 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 290
   A loop starting at line 290 was not vectorized because an inter-loop dependence relation is too complicated.

ftn-6303 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 291
   A loop starting at line 291 was not vectorized because an inter-loop dependence relation is too complicated.

ftn-6204 ftn: VECTOR MG_STENCIL_3D27PT, File = MG_STENCIL_COMPS.F, Line = 292
   A loop starting at line 292 was vectorized.
How to set the correct block sizes

• Typically, you want a larger amount of the inner iteration with smaller amounts in the other loops
  • Depends on the loop characterization and what data should be / could be / need to be reused
  • Powers of 2 generally are best if full index can not be held in cache

• Depending on the particular problem size, a proper cache blocking can provide a 50% speed-up for this particular loop on KNL
  • May see smaller impact on earlier Xeon processors since L2 misses are supported by an L3 cache.
Summary

• Code Characterization is an important first step in preparing for KNL
  • Target Science
  • Target Scaling
  • Hotspot identification

• Process affinity is critical for run performance

• Statically linked binaries likely to perform better than dynamically linked binaries.

• KNL node is different from XEON node
  • Single node optimizations will be an early focus
  • A properly designed kernel will help with optimization efforts
  • Vectorization is important and will become even more so with future processors

• Data reuse is important, but how important will depend on memory footprints and access patterns
THANK YOU

QUESTIONS?

swarren@cray.com

@cray_inc

linkedin.com/company/cray-inc-

cray.com