
TensorFlow at NERSC

We offer a number of different versions of TensorFlow at NERSC. Depending on the level of compute performance you require, you can choose between the TensorFlow version (1.1) available in the standard Anaconda distribution of Python installed on our systems, or a more up-to-date version in a dedicated module (see "module avail tensorflow"). Note that we recommend using the versions of TensorFlow that have been compiled to include specific optimizations contributed by the Intel team to improve compute performance on KNL.

Intel Optimized TensorFlow

The "tensorflow/intel-head" module contains optimizations for the NHWC image data format (which is the recommended data format). Users can choose either the NHWC or NCHW format; a short sketch of selecting the format follows the list below. There are also many other optimizations for the CPU architecture, including:

  • MKL Convolution filter optimization to avoid recomputation in the backward pass

  • Element-wise optimization to avoid conversions by using Eigen ops

  • AddN implemented using MKL-ML
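As a rough illustration of choosing between the two layouts, here is a minimal sketch using the standard TensorFlow 1.x API (the shapes and names are illustrative only, not taken from the NERSC modules):

import tensorflow as tf

# NHWC layout: [batch, height, width, channels]
images = tf.random_normal([32, 224, 224, 3])
# Filter layout: [filter_height, filter_width, in_channels, out_channels]
filters = tf.random_normal([3, 3, 3, 64])

# data_format defaults to 'NHWC'; pass 'NCHW' (with the input and strides
# transposed accordingly) to use the channels-first layout instead.
conv = tf.nn.conv2d(images, filters, strides=[1, 1, 1, 1],
                    padding='SAME', data_format='NHWC')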

See these docs for more information on Intel-optimized TensorFlow and for tips on performance tuning. This page from TensorFlow is an excellent reference for the variables you can set to optimize performance.

To use Intel's specially optimized TensorFlow version for KNL and Haswell, you need to do two things:

  1. Rather than "module load python", use:

    module load tensorflow/intel-head

  2. Replace:
    sess = tf.Session()
    in your code with the following (note that this requires "import os" at the top of your script):
    sess = tf.Session(config=tf.ConfigProto(
        inter_op_parallelism_threads=int(os.environ['NUM_INTER_THREADS']),
        intra_op_parallelism_threads=int(os.environ['NUM_INTRA_THREADS'])))

The tensorflow/intel-head module loads the very latest version of TensorFlow, compiled with all available Intel optimizations. We also offer an optimized Python 3 version under tensorflow/intel-head-python3.
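Putting the two steps together, a minimal sketch of a session configured this way might look like the following (it assumes the module exports NUM_INTER_THREADS and NUM_INTRA_THREADS as described above; the fallback defaults are our own illustration, not something the module guarantees):

import os
import tensorflow as tf

# The intel-head module sets these environment variables; the fallbacks
# here (1 socket, 68 physical cores on KNL) are illustrative defaults.
inter = int(os.environ.get('NUM_INTER_THREADS', 1))
intra = int(os.environ.get('NUM_INTRA_THREADS', 68))

config = tf.ConfigProto(inter_op_parallelism_threads=inter,
                        intra_op_parallelism_threads=intra)

with tf.Session(config=config) as sess:
    # ... build and run your graph as usual ...
    print(sess.run(tf.constant('session configured')))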

Note that this loads a conda environment containing a small set of additional Python modules (h5py, matplotlib, keras, sklearn, yaml, scipy), not the full suite of modules available in the Anaconda distribution. If there are other Python modules you would like to see included in this environment, please let us know.

 

Older Intel-optimized TensorFlow releases 1.2 and 1.3 are also available under tensorflow/intel-1.2 and tensorflow/intel-1.3.

Standard TensorFlow

Standard TensorFlow is available in the Anaconda distribution that is the default Python module on both Cori and Edison. Simply use:

module load python/2.7-anaconda-4.4

to access it. This contains TensorFlow v1.1. Note that compute performance with this version will be significantly lower than with the Intel-optimized builds.

 

We also have a (typically more up-to-date) version of non-optimized TensorFlow in a separate module. However, we recommend that you use either the version within Anaconda, which has most other necessary packages already set up, or the Intel-optimized version, which will give you better performance. Check which versions are available using:

module avail tensorflow

You can access the most recent version (as of 10/2017) using:

module load tensorflow/1.4.0rc0

Note that this is "vanilla" TensorFlow, with no optimizations for the KNL (or Haswell) architecture, so performance may not be what you expect!

How to optimize TensorFlow performance on KNL

There are a number of settings you can tune to get optimal compute performance with TensorFlow on KNL (and indeed Haswell) nodes. You can find more complete documentation in these docs and this page. Here we summarize some recommended parameters to set, but the optimal values for your network will depend on its specific architecture; treat these as recommended starting points.

Affinity settings:

  • KMP_AFFINITY="granularity=fine,verbose,compact,1,0". 
    • Enables the run-time library to bind threads to physical processing units.
  • KMP_SETTINGS=1
    • Enables (true) or disables (false) the printing of OpenMP* run-time library environment variables during program execution.
  • KMP_BLOCKTIME=1
    • Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.

These can be set as environment variables in your batch submission script, or from Python before importing TensorFlow, as in the sketch below.
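For example, a minimal sketch of setting them from Python (equivalent to export lines in a batch script; the values are the recommendations above) is shown here. The assignments must happen before TensorFlow is imported, so that the OpenMP runtime sees them when it initializes:

import os

# Affinity settings must be in the environment before the OpenMP runtime
# initializes, i.e. before the first import of tensorflow.
os.environ['KMP_AFFINITY'] = 'granularity=fine,verbose,compact,1,0'
os.environ['KMP_SETTINGS'] = '1'
os.environ['KMP_BLOCKTIME'] = '1'

import tensorflow as tf  # imported only after the environment is set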

Thread settings: 

  • OMP_NUM_THREADS: This defaults to the number of physical cores (i.e. 68 on KNL). Increasing this parameter beyond the number of physical cores can have a negative impact.
  • intra_op_parallelism_threads: TensorFlow can use multiple threads to parallelize execution, and will schedule the individual pieces into this pool. Setting this equal to the number of physical cores (i.e. 68 on KNL) is recommended. The default value of 0 results in the value being set to the number of logical cores, which is worth trying on some architectures. This value and OMP_NUM_THREADS should be equal.
  • inter_op_parallelism_threads: Setting this equal to the number of sockets (i.e. 1 on KNL) is recommended. The default value of 0 results in the value being set to the number of logical cores.

These can be set in your TensorFlow code using:

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 68
config.inter_op_parallelism_threads = 4
sess = tf.Session(config=config)

Performance Benchmarks

We run a suite of benchmarks on our various TensorFlow installations to monitor performance, particularly of the Intel-optimized releases. The results emphasize that, for optimal performance, you should use the tensorflow/intel-head module and the NCHW data format.

We measure the standard TensorFlow benchmarks, augmented with a benchmark based on the DC-GAN architecture used in this paper. We will continue to add benchmarks derived from our real-world science use cases; please let us know if you have a good candidate for a benchmark.

All benchmarks were run on a single KNL node of Cori in quad, cache mode, using the following affinity settings. We recommend using these settings when running TensorFlow on KNL at NERSC.

  • KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
  • KMP_SETTINGS=1
  • KMP_BLOCKTIME=1

There are several parameters that can be tuned to obtain optimal performance, listed in the following table; the exact settings will depend on the structure of the network you are trying to train. The settings used in these benchmarks have not been exhaustively optimized, but they provide a starting point for running effectively on KNL nodes.

The data used in each benchmark is dummy data generated by the benchmarking code itself. Note that the batch size and parameter settings can be tuned further to achieve higher performance; you can find more advice on performance tuning on these pages. The numbers reported here are images processed per second.

TF benchmarks (images processed per second)

Benchmark (settings)                        intel-head  intel-head  intel-1.3  intel-1.3  1.3.0rc2  intel-1.2  anaconda (v1.1)
                                            NCHW        NHWC        NCHW       NHWC       NHWC      NHWC       NHWC
alexnet (batch=256; intra=136; inter=2)     540.2       453.7       507.5      436.2      12.9      16.3       13.3
googlenet (batch=256; intra=66; inter=4)    174.8       107.4       172.4      107.4      6.6       8.6        6.7
inception3 (batch=32; intra=66; inter=4)    30.0        21.5        26.5       19.8       1.5       6.5        1.4
vgg11 (batch=128; intra=68; inter=2)        65.5        61.1        64.1       62.2       1.8       13.8       1.8
resnet50 (batch=128; intra=66; inter=3)     55.7        56.2        48.7       50.9       1.5       11.0       1.5
DCGAN (batch=64; intra=66; inter=2)         109         111         -          112        16        7          14

(Columns refer to the tensorflow/intel-head, tensorflow/intel-1.3, tensorflow/1.3.0rc2, and tensorflow/intel-1.2 modules, and to the TensorFlow v1.1 shipped with python/2.7-anaconda-4.4.)


Note that the NCHW format has only been available in the more recent releases of TensorFlow.

Profiling

This link gives a useful recipe for profiling your TensorFlow code. If you are having performance issues with TensorFlow, please try profiling your code and sending the results to NERSC consultants, and we can help optimize your setup.

To profile your code, you'll want to use the Timeline object (code taken from this link):

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)

You can then open Google Chrome, go to the page chrome://tracing and load the timeline.json file to see which stages are taking the most time. 

TensorBoard

NERSC does not have browsers installed on our login or compute nodes, so if you want to use TensorBoard you will either need to install your own browser (e.g. Firefox) in your home directory, or use ssh forwarding as follows. Note that you will need your ssh public key stored in NIM for this to work.

  1. On a Cori login node start up TensorBoard:
    tensorboard --logdir=path/to/logs --port 9998
  2. From your laptop/desktop/remote host, make sure you ssh into the same Cori login node on which you started TensorBoard:
    ssh -L 9998:localhost:9998 cori.nersc.gov
  3. Now open http://localhost:9998/ in a browser on your laptop/desktop.
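For TensorBoard to have something to display, your code must first write event files to the log directory. Here is a minimal sketch using the standard TF 1.x summary API (the 'loss' values and the 'path/to/logs' directory are illustrative only):

import tensorflow as tf

# An illustrative scalar to track; any graph tensor can be summarized.
loss = tf.placeholder(tf.float32, name='loss')
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # Write events (and the graph) under the directory passed to --logdir.
    writer = tf.summary.FileWriter('path/to/logs', sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, step)
    writer.close()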