NERSC: Powering Scientific Discovery Since 1974

TensorFlow at NERSC

We offer a number of different versions of TensorFlow at NERSC. Depending on the level of compute performance you require, you can choose between the TensorFlow version (1.1) available in the standard Anaconda distribution of Python installed on our systems, or a more up-to-date version of TensorFlow (see "module avail tensorflow"). Note that we recommend using the versions of TensorFlow that have been compiled with optimizations contributed by the Intel team to improve compute performance on KNL.

Intel Optimized TensorFlow:

The "tensorflow/intel-head" module contains optimizations for the NHWC image data format (which is the recommended data format); users can choose either the NHWC or NCHW format. There are also many other optimizations for the CPU architecture, including:

  • MKL Convolution filter optimization to avoid recomputation in the backward pass

  • Element-wise optimization to avoid conversions by using Eigen ops

  • AddN implemented using MKL-ML

See these docs for more information on Intel-optimized TensorFlow, and for tips on performance-tuning.

This page from TensorFlow is an excellent reference for the variables you can set to optimize performance.

To use Intel's specially optimized TensorFlow version for KNL and Haswell, you need to do two things:

  1. Rather than "module load python", use:

    module load tensorflow/intel-head

  2. Replace:
    sess = tf.Session()
    in your code with this (note that you will also need to "import os"):
    sess = tf.Session(config=tf.ConfigProto(
        inter_op_parallelism_threads=int(os.environ['NUM_INTER_THREADS']),
        intra_op_parallelism_threads=int(os.environ['NUM_INTRA_THREADS'])))

This loads the very latest version of TensorFlow, compiled with all available Intel optimizations. We also offer an optimized Python 3 version under tensorflow/intel-head-python3.
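Putting the two steps above together, a minimal session setup might look like the following sketch. The fallback thread counts and the small matmul are illustrative only, not NERSC recommendations; the tensorflow/intel-head module exports NUM_INTER_THREADS and NUM_INTRA_THREADS for you.

```python
import os
import tensorflow as tf

# Read the thread settings exported by the tensorflow/intel-head module.
# The fallbacks (2 inter-op, 66 intra-op) are illustrative defaults only.
inter = int(os.environ.get('NUM_INTER_THREADS', 2))
intra = int(os.environ.get('NUM_INTRA_THREADS', 66))

config = tf.ConfigProto(inter_op_parallelism_threads=inter,
                        intra_op_parallelism_threads=intra)

with tf.Session(config=config) as sess:
    # Any graph built here runs with the tuned thread pools.
    x = tf.random_normal([1024, 1024])
    y = tf.matmul(x, x)
    sess.run(y)
```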

Intel-optimized TensorFlow releases 1.2 and 1.3 are also available under tensorflow/intel-1.2 and tensorflow/intel-1.3.

Standard TensorFlow:

Standard TensorFlow is available in the Anaconda distribution that is the default Python module on both Cori and Edison. Simply use:

module load python/2.7-anaconda-4.4

to access it. This contains TensorFlow v1.1. Note that compute performance with this version will be significantly lower than with the Intel-optimized builds.


We also have a (typically more up-to-date) version of non-optimized TensorFlow in a separate module. However, we recommend that you use either the version within Anaconda, as that has most other necessary packages already set up, or the Intel-optimized version, as that will give you better performance. Check which versions are available using:

module avail tensorflow

You can access the most recent (as of 10/2017) version module using:

module load tensorflow/1.4.0rc0

Note that this is the "vanilla" TensorFlow, and has no optimization for KNL (or Haswell) architecture. Your performance may not be as you expect!

Performance Benchmarks

We run a suite of benchmarks on our various TensorFlow installations to monitor performance, particularly of the Intel-optimized releases. The results show that for optimal performance you should use the tensorflow/intel-head module with the NCHW data format.

We measure the standard TensorFlow benchmarks, augmented with a benchmark based on a DCGAN architecture used in this paper. We will continue to add benchmarks derived from our real-world science use cases. Please let us know if you have a good candidate for a benchmark.

All benchmarks were run on a single KNL node of Cori in quad,cache mode, using the affinity settings:

  • KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
  • KMP_SETTINGS=1
  • KMP_BLOCKTIME=1
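For reference, the settings above can be exported in a batch script before launching Python. The NUM_INTER_THREADS/NUM_INTRA_THREADS values shown here match the resnet50 benchmark row below; they are a starting point, not universal recommendations.

```shell
# Affinity settings used for the benchmarks on this page
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_SETTINGS=1
export KMP_BLOCKTIME=1

# Thread counts read by tf.ConfigProto (values from the resnet50 benchmark)
export NUM_INTER_THREADS=3
export NUM_INTRA_THREADS=66
```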

The data used in each benchmark is the dummy data generated by the benchmarking code itself. Please note that the batch size and parameter settings can be tuned further to achieve higher performance. You can find more advice on performance tuning on these pages. The numbers reported here are the number of images processed per second.

TF benchmarks

Performance Benchmarks (images processed per second)

| Benchmark (settings) | tensorflow/intel-head NCHW | tensorflow/intel-head NHWC | tensorflow/intel-1.3 NCHW | tensorflow/intel-1.3 NHWC | tensorflow/1.3.0rc2 NHWC | tensorflow/intel-1.2 NHWC | python/2.7-anaconda-4.4 (v1.1) NHWC |
|---|---|---|---|---|---|---|---|
| alexnet (batch size=256; num_intra_threads=136; num_inter_threads=2) | 540.2 | 453.7 | 507.5 | 436.2 | 12.9 | 16.3 | 13.3 |
| googlenet (batch size=256; num_intra_threads=66; num_inter_threads=4) | 174.8 | 107.4 | 172.4 | 107.4 | 6.6 | 8.6 | 6.7 |
| inception3 (batch size=32; num_intra_threads=66; num_inter_threads=4) | 30.0 | 21.5 | 26.5 | 19.8 | 1.5 | 6.5 | 1.4 |
| vgg11 (batch size=128; num_intra_threads=68; num_inter_threads=2) | 65.5 | 61.1 | 64.1 | 62.2 | 1.8 | 13.8 | 1.8 |
| resnet50 (batch size=128; num_intra_threads=66; num_inter_threads=3) | 55.7 | 56.2 | 48.7 | 50.9 | 1.5 | 11.0 | 1.5 |
| DCGAN (batch size=64; num_intra_threads=66; num_inter_threads=2) | 109 | 111 | - | 112 | 16 | 7 | 14 |

Note that the NCHW format has only been available in the more recent releases of Tensorflow.
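If you want to request the NCHW layout in your own model, most TensorFlow 1.x layer constructors accept a data_format argument ('channels_first' is TensorFlow's name for NCHW). A sketch with arbitrary layer sizes:

```python
import tensorflow as tf

# NCHW orders the tensor axes as (batch, channels, height, width).
images = tf.placeholder(tf.float32, shape=[None, 3, 224, 224])

conv = tf.layers.conv2d(images, filters=64, kernel_size=3,
                        data_format='channels_first')
pool = tf.layers.max_pooling2d(conv, pool_size=2, strides=2,
                               data_format='channels_first')
```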

Profiling

This link gives a useful recipe for profiling your TensorFlow code. If you are having performance issues with TensorFlow, please try profiling your code and send the results to NERSC consultants, and we can help optimize your setup.

To profile your code, you'll want to use the Timeline object (code taken from this link):

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)

You can then open Google Chrome, go to the page chrome://tracing and load the timeline.json file to see which stages are taking the most time. 

TensorBoard

NERSC does not have browsers installed on our login or compute nodes. If you want to use TensorBoard you will either need to install your own browser (e.g. Firefox), or use ssh-forwarding as follows:

  1. On a Cori login node start up TensorBoard:
    tensorboard --logdir=path/to/logs --port 9998
  2. From your laptop/desktop/remote host you will want to make sure you ssh into the same Cori login node on which you started up the tensorboard:
    ssh -L 9998:localhost:9998 cori.nersc.gov
  3. Now open http://localhost:9998/ in a browser on your laptop/desktop.

Using Distributed TensorFlow at NERSC:

Cori's SLURM scheduler is pretty handy for using TensorFlow's tools for multinode execution. See this example of training a neural network with TensorFlow on Cori.
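As a starting point, a single-node KNL batch job that combines the settings described on this page might look like the sketch below. The queue, time limit, and train.py script are placeholders to adapt to your own job, and the thread counts are illustrative only.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -C knl,quad,cache
#SBATCH -q regular
#SBATCH -t 01:00:00

# Intel-optimized TensorFlow build
module load tensorflow/intel-head

# Affinity and thread settings (illustrative; tune for your model)
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_SETTINGS=1
export KMP_BLOCKTIME=1
export NUM_INTER_THREADS=2
export NUM_INTRA_THREADS=66

# train.py is a placeholder for your own training script
srun python train.py
```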