TensorFlow at NERSC
We offer several versions of TensorFlow at NERSC. Depending on the level of compute performance you require, you can choose between the TensorFlow version (1.1) available in the standard Anaconda distribution of Python installed on our systems, or a more up-to-date version (see "module avail tensorflow"). Note that we recommend the versions of TensorFlow that have been compiled with optimizations contributed by the Intel team to improve compute performance on KNL.
The "tensorflow/intel-head" module contains optimizations for the NCHW image data format (the recommended format for best performance; see the benchmarks below). Users can choose either the NHWC or NCHW format. There are also many other optimizations for the CPU architecture, including:
MKL Convolution filter optimization to avoid recomputation in the backward pass
Element-wise optimization to avoid conversions by using Eigen ops
AddN implemented using MKL-ML
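To make the two data formats concrete: NHWC orders tensor axes as (batch, height, width, channels), while NCHW orders them as (batch, channels, height, width). A minimal sketch using NumPy (the shapes here are illustrative, not from the benchmarks below) shows that converting between them is just an axis transpose:

```python
import numpy as np

# NHWC: (batch, height, width, channels); NCHW: (batch, channels, height, width)
batch = np.zeros((8, 32, 32, 3))           # 8 RGB images of 32x32 in NHWC layout

# Converting NHWC -> NCHW is a transpose that moves channels to axis 1
nchw = np.transpose(batch, (0, 3, 1, 2))
print(nchw.shape)                          # (8, 3, 32, 32)
```

In TensorFlow itself the layout is selected via the `data_format` argument of ops such as convolutions, rather than by transposing data yourself.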
See these docs for more information on Intel-optimized TensorFlow, and for tips on performance-tuning.
This page from TensorFlow is an excellent reference for the variables you can set to optimize performance.
To use Intel's specially optimized TensorFlow version for KNL and Haswell, you need to do two things:
- Rather than "module load python", use:
module load tensorflow/intel-head
- Replace
sess = tf.Session()
in your code with (remember to import os):
sess = tf.Session(config=tf.ConfigProto(
    inter_op_parallelism_threads=int(os.environ['NUM_INTER_THREADS']),
    intra_op_parallelism_threads=int(os.environ['NUM_INTRA_THREADS'])))
This loads the very latest version of TensorFlow compiled with all available Intel optimizations. We also offer an optimized Python 3 version under tensorflow/intel-head-python3.
Intel-optimized TensorFlow releases 1.2 and 1.3 are also available under tensorflow/intel-1.2 and tensorflow/intel-1.3.
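The session configuration above reads the thread counts from environment variables, so you need to export them before running your script, for example in your batch script. A sketch with illustrative values for a KNL node (tune these for your own workload; `OMP_NUM_THREADS` is commonly also set for the MKL-backed ops, but the exact values are an assumption here, not a prescription):

```shell
# Illustrative thread settings for a Cori KNL node -- tune for your workload
export NUM_INTER_THREADS=2     # independent ops run in parallel
export NUM_INTRA_THREADS=66    # threads available inside a single op
export OMP_NUM_THREADS=66      # threads used by the MKL-backed kernels
```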
Standard TensorFlow is available in the anaconda distribution that is the default python module on both Cori and Edison. Simply use:
module load python/2.7-anaconda-4.4
to access it. This contains TensorFlow v1.1. Note that compute performance with this version will be poor compared to the Intel-optimized builds.
We also have a (typically more up-to-date) version of non-optimized tensorflow in a separate module. However, we do recommend that you use either the version within Anaconda, as that has most other necessary packages already set up, or the Intel-optimized version, as that will give you better performance. Check what versions are available using:
module avail tensorflow
You can access the most recent (as of 10/2017) version module using:
module load tensorflow/1.4.0rc0
Note that this is the "vanilla" TensorFlow, and has no optimization for KNL (or Haswell) architecture. Your performance may not be as you expect!
We run a suite of benchmarks on our various TensorFlow installations to monitor performance, particularly of the Intel-optimized releases. The results show that for optimal performance you should use the tensorflow/intel-head module and the NCHW data format.
We measure the standard TensorFlow benchmarks, augmented with a benchmark based on a DC-GAN architecture used in this paper. We will continue to add benchmarks derived from our real-world science use cases. Please let us know if you have a good candidate for a benchmark.
All benchmarks were run on a single KNL node of Cori in quad, cache mode; the batch-size and thread settings used for each benchmark are listed below.
The data used in each benchmark is the dummy data generated by the benchmarking code itself. Please note that the batch size and parameter settings can be tuned further to achieve higher performance. You can find more advice on performance tuning on these pages. The numbers reported here are the number of images processed per second.
- batch size = 256; num_intra_threads = 136; num_inter_threads = 2
- batch size = 256; num_intra_threads = 66; num_inter_threads = 4
- batch size = 32; num_intra_threads = 66; num_inter_threads = 4
- batch size = 128; num_intra_threads = 68; num_inter_threads = 2
- batch size = 128; num_intra_threads = 66; num_inter_threads = 3
- batch size = 64; num_intra_threads = 66; num_inter_threads = 2
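The intra-op thread counts in these settings line up with the KNL core layout. As a sanity check (this assumes the standard Cori KNL configuration of 68 physical cores with 4 hardware threads each; the "reserve two cores" reading of the 66-thread settings is our interpretation):

```python
# Cori KNL node layout: 68 physical cores, 4 hardware threads per core
cores = 68
hw_threads_per_core = 4
print(cores * hw_threads_per_core)   # 272 logical CPUs per node

# 66 intra-op threads = one thread per core, with 2 cores left free
print(cores - 2)                     # 66

# 136 intra-op threads = 2 hardware threads on every core
print(2 * cores)                     # 136
```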
Note that the NCHW format is only available in the more recent releases of TensorFlow.
This link gives a useful recipe for profiling your TensorFlow code. If you are having performance issues with TensorFlow, please try profiling your code and send the results to NERSC consultants, and we can help optimize your setup.
To profile your code, you'll want to use the Timeline object (code taken from this link):
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)
You can then open Google Chrome, go to chrome://tracing, and load the timeline.json file to see which stages are taking the most time.
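If you prefer to summarize a trace without a browser, the file is ordinary JSON in the Chrome trace-event format, so you can total the time per op with a few lines of Python. A sketch (the inline `trace` dict is a hypothetical stand-in with made-up durations; a real run would load `timeline.json` instead):

```python
import json
from collections import defaultdict

# Hypothetical Chrome-trace fragment; a real trace loaded with
# json.load(open('timeline.json')) has the same structure.
trace = {"traceEvents": [
    {"name": "MatMul", "ph": "X", "dur": 950},
    {"name": "RandomStandardNormal", "ph": "X", "dur": 120},
    {"name": "RandomStandardNormal", "ph": "X", "dur": 110},
]}

# Sum the wall-clock time ("dur", in microseconds) spent in each op;
# "X" marks complete (timed) events in the trace-event format.
totals = defaultdict(int)
for event in trace["traceEvents"]:
    if event.get("ph") == "X":
        totals[event["name"]] += event["dur"]

for name, dur in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(name, dur)   # MatMul 950, then RandomStandardNormal 230
```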
NERSC does not have browsers installed on our login or compute nodes - if you want to use TensorBoard you will either need to install your own browser (e.g. Firefox), or use ssh-forwarding as follows:
- On a Cori login node start up TensorBoard:
tensorboard --logdir=path/to/logs --port 9998
- From your laptop/desktop/remote host, make sure you ssh into the same Cori login node on which you started TensorBoard:
ssh -L 9998:localhost:9998 cori.nersc.gov
- Now open http://0.0.0.0:9998/ in a browser on your laptop/desktop.
Using Distributed TensorFlow at NERSC:
Cori's SLURM scheduler works well with TensorFlow's tools for multi-node execution. See this example of training a neural network with TensorFlow on Cori.
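Distributed TensorFlow needs a cluster specification mapping job names ("ps", "worker") to host:port addresses, and under SLURM you can derive one from the nodes in your allocation. A minimal sketch (the hostnames and port are hypothetical; on Cori you would obtain the real node list from SLURM_NODELIST, e.g. via `scontrol show hostnames`, and pass the resulting dict to tf.train.ClusterSpec):

```python
# Hypothetical node names; in a real job, parse SLURM_NODELIST instead
hosts = ["nid00001", "nid00002", "nid00003"]
port = 2222

# First node acts as the parameter server, the rest as workers
cluster = {
    "ps":     [f"{hosts[0]}:{port}"],
    "worker": [f"{h}:{port}" for h in hosts[1:]],
}
print(cluster["worker"])   # ['nid00002:2222', 'nid00003:2222']
```

Each task in the job would then construct tf.train.ClusterSpec(cluster) and a tf.train.Server with its own job name and task index.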