
TensorFlow at NERSC

We offer a number of different versions of TensorFlow at NERSC. Depending on the level of compute performance you require, you can choose between the TensorFlow version (1.1) available in the standard Anaconda distribution of Python installed on our systems, or a more up-to-date version of TensorFlow (see "module avail tensorflow"). Note that we recommend using the versions of TensorFlow that have been compiled with optimizations contributed by the Intel team to improve compute performance on KNL.

Intel Optimized TensorFlow

The TensorFlow modules on Cori contain various optimizations at the computation-graph level and utilize the MKL and MKL-DNN libraries. See this page and these docs for further information about the optimizations, as well as performance-tuning tips.

At the time of this writing, the latest TensorFlow version available on Cori is "tensorflow/intel-1.9.0-py27". You can load this software with

module load tensorflow/intel-1.9.0-py27

Note that this module provides Python and a number of additional analytics libraries, so there is no need to also load the standard Cori Anaconda Python module.
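As a quick sanity check, you can confirm which version you have loaded by printing the version string; the version shown should match the module you loaded:

module load tensorflow/intel-1.9.0-py27
python -c "import tensorflow as tf; print(tf.__version__)"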

How to optimize TensorFlow performance on KNL

There are a number of settings you can adjust to get optimal compute performance with TensorFlow on KNL (and indeed Haswell) nodes. Again, you can find more complete documentation in these docs and this page. Here we summarize some recommended parameters to set, but the optimal values for your network will depend on its specific architecture; these are only recommended starting points.

Affinity settings:

  • KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
    • Enables the run-time library to bind threads to physical processing units.
  • KMP_SETTINGS=1
    • Enables (true) or disables (false) the printing of OpenMP run-time library environment variables during program execution.
  • KMP_BLOCKTIME=1
    • Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.

These can be set as environment variables in your batch submission script, as in the sketch below.
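For example, a minimal batch-script sketch might look like the following; the script name my_train.py is a placeholder, and the SBATCH options shown are illustrative assumptions that you should adapt to your own job:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=knl,quad,cache
#SBATCH --time=01:00:00

module load tensorflow/intel-1.9.0-py27

# Affinity settings recommended above
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_SETTINGS=1
export KMP_BLOCKTIME=1

# Thread setting (see the next section); 68 physical cores on KNL
export OMP_NUM_THREADS=68

srun -n 1 python my_train.py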

Thread settings: 

  • OMP_NUM_THREADS: This defaults to the number of physical cores (i.e. 68 on KNL). Increasing this parameter beyond the number of physical cores can have a negative impact.
  • intra_op_parallelism_threads: TensorFlow can use multiple threads to parallelize execution, and will schedule the individual pieces into this pool. Setting this equal to the number of physical cores (i.e. 68 on KNL) is recommended. The default value of 0 results in the value being set to the number of logical cores, which is an option to try for some architectures. This value and OMP_NUM_THREADS should be equal.
  • inter_op_parallelism_threads: Setting this equal to the number of sockets (i.e. 1 on KNL) is recommended. The default value of 0 results in the value being set to the number of logical cores.

These can be set in your TensorFlow code as follows:

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 68  # number of physical cores on KNL
config.inter_op_parallelism_threads = 1   # number of sockets on KNL
sess = tf.Session(config=config)
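If you would rather keep these settings alongside your model code, the environment variables above can also be set from Python. As a sketch: note that they must be set before TensorFlow (and hence the OpenMP runtime) is first imported, or they may have no effect:

import os

# Must happen before the first "import tensorflow"
os.environ['KMP_AFFINITY'] = 'granularity=fine,verbose,compact,1,0'
os.environ['KMP_BLOCKTIME'] = '1'
os.environ['OMP_NUM_THREADS'] = '68'

import tensorflow as tf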

Performance Benchmarks

We run a suite of benchmarks on our various TensorFlow installations to monitor performance, particularly of the Intel-optimized releases. These results are now somewhat outdated; we will be updating them in the near future.

We measure the standard TensorFlow benchmarks, augmented with a benchmark based on a DC-GAN architecture used in this paper. We will continue to add benchmarks derived from our real-world science use cases. Please let us know if you have a good candidate for a benchmark.

All benchmarks were run on a single KNL node of Cori in quad,cache mode, using the following affinity settings. We recommend using these settings when running TensorFlow on KNL at NERSC.

  • KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
  • KMP_SETTINGS=1
  • KMP_BLOCKTIME=1

There are several parameters that can be tuned to obtain optimal performance, listed in the following table; the exact settings will depend on the structure of the network you are trying to train. The settings used in these benchmarks have not been exhaustively optimized, but they provide a starting point for running effectively on KNL nodes.

The data used in each benchmark is dummy data generated by the benchmarking code itself. Please note that the batch size and parameter settings can be tuned further to achieve higher performance; you can find more advice on performance tuning on the pages linked above. The numbers reported here are images processed per second.
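For reference, the standard benchmarks are driven by the tf_cnn_benchmarks script from the TensorFlow benchmarks repository. A typical CPU invocation looks roughly like the following (flag values taken from the alexnet row below; check the script's --help, since available flags vary between releases):

python tf_cnn_benchmarks.py --device=cpu --model=alexnet \
    --batch_size=256 --data_format=NCHW \
    --num_intra_threads=136 --num_inter_threads=2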

| Benchmark (settings) | tensorflow/intel-head NCHW | tensorflow/intel-head NHWC | tensorflow/intel-1.3 NCHW | tensorflow/intel-1.3 NHWC | tensorflow/1.3.0rc2 NHWC | tensorflow/intel-1.2 NHWC | python/2.7-anaconda-4.4 (TF 1.1) NHWC |
|---|---|---|---|---|---|---|---|
| alexnet (batch size=256; intra_threads=136; inter_threads=2) | 540.2 | 453.7 | 507.5 | 436.2 | 12.9 | 16.3 | 13.3 |
| googlenet (batch size=256; intra_threads=66; inter_threads=4) | 174.8 | 107.4 | 172.4 | 107.4 | 6.6 | 8.6 | 6.7 |
| inception3 (batch size=32; intra_threads=66; inter_threads=4) | 30.0 | 21.5 | 26.5 | 19.8 | 1.5 | 6.5 | 1.4 |
| vgg11 (batch size=128; intra_threads=68; inter_threads=2) | 65.5 | 61.1 | 64.1 | 62.2 | 1.8 | 13.8 | 1.8 |
| resnet50 (batch size=128; intra_threads=66; inter_threads=3) | 55.7 | 56.2 | 48.7 | 50.9 | 1.5 | 11.0 | 1.5 |
| DCGAN (batch size=64; intra_threads=66; inter_threads=2) | 109 | 111 | - | 112 | 16 | 7 | 14 |


Note that the NCHW data format has only been available in the more recent releases of TensorFlow.

Profiling

This link gives a useful recipe for profiling your TensorFlow code. If you are having performance issues with TensorFlow, please try profiling your code and send the results to the NERSC consultants, and we can help optimize your setup.

To profile your code, you'll want to use the Timeline object (code taken from this link):

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000,1000])
y = tf.random_normal([1000,1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)

You can then open Google Chrome, go to the page chrome://tracing and load the timeline.json file to see which stages are taking the most time. 

TensorBoard

NERSC does not have browsers installed on our login or compute nodes, so if you want to use TensorBoard you will either need to install your own browser (e.g. Firefox) in your home directory, or use ssh forwarding as follows. Note that you will need your ssh public key stored in NIM for this to work.

  1. On a Cori login node, start up TensorBoard:
    tensorboard --logdir=path/to/logs --port 9998
  2. From your laptop/desktop/remote host, ssh into the same Cori login node on which you started TensorBoard, forwarding the port:
    ssh -L 9998:localhost:9998 cori.nersc.gov
  3. Now open http://localhost:9998/ in a browser on your laptop/desktop.
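TensorBoard visualizes event files written by your training code. As a minimal sketch of how such logs can be produced (the log directory here matches the --logdir argument above; the tiny graph is purely illustrative):

import tensorflow as tf

# A tiny graph with one scalar summary
x = tf.placeholder(tf.float32, shape=[], name='x')
loss = tf.square(x, name='loss')
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # Write the graph and summaries where TensorBoard can find them
    writer = tf.summary.FileWriter('path/to/logs', sess.graph)
    summary, _ = sess.run([merged, loss], feed_dict={x: 3.0})
    writer.add_summary(summary, global_step=0)
    writer.close()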