TensorFlow at NERSC
We offer a number of different versions of TensorFlow at NERSC. Depending on the level of compute performance you require, you can choose between the TensorFlow version (1.1) available in the standard Anaconda distribution of Python installed on our systems, or a more up-to-date version of TensorFlow (see "module avail tensorflow"). Note that we recommend using the versions of TensorFlow that have been compiled to include specific optimizations contributed by the Intel team to improve compute performance on KNL.
The TensorFlow modules on Cori contain various optimizations at the computation-graph level and utilize the MKL and MKL-DNN libraries. See this page and these docs for further information about the optimizations as well as performance-tuning tips.
At the time of this writing the latest TensorFlow version available on Cori is "tensorflow/intel-1.9.0-py27". You can load this software with
module load tensorflow/intel-1.9.0-py27
Note that this module contains Python and a number of additional analytics Python libraries, so there is no need to also load the standard Cori Anaconda Python module.
How to optimize TensorFlow performance on KNL
There are a number of settings you can tune to get optimal compute performance with TensorFlow on KNL (and indeed Haswell) nodes. Again, you can find more complete documentation in these docs and this page. Here we summarize some recommended parameters to set, but the optimal values for your network will depend on its specific architecture; these are only recommended starting points.
- KMP_AFFINITY: Enables the run-time library to bind threads to physical processing units.
- KMP_SETTINGS: Enables (true) or disables (false) the printing of OpenMP run-time library environment variables during program execution.
- KMP_BLOCKTIME: Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.
These can be set as environment variables in your batch submission script.
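For example, in a batch submission script you might set (the values below are illustrative starting points taken from common Intel recommendations, not tuned settings for your network):

```shell
# Illustrative starting values for the OpenMP/KMP variables above;
# tune these for your own network (see the Intel performance guides).
export OMP_NUM_THREADS=68                           # one thread per physical KNL core
export KMP_AFFINITY="granularity=fine,compact,1,0"  # bind threads to physical cores
export KMP_SETTINGS=TRUE                            # print OpenMP settings at startup
export KMP_BLOCKTIME=0                              # sleep immediately after parallel regions
```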
- OMP_NUM_THREADS: This defaults to the number of physical cores (i.e. 68 on KNL). Setting this parameter to more than the number of physical cores can have a negative impact.
- intra_op_parallelism_threads: TensorFlow can use multiple threads to parallelize execution, and will schedule the individual pieces of work into this pool. Setting this equal to the number of physical cores (i.e. 68 on KNL) is recommended. The default value of 0 causes it to be set to the number of logical cores, which is worth trying on some architectures. This value and OMP_NUM_THREADS should be equal.
- inter_op_parallelism_threads: Setting this equal to the number of sockets (i.e. 1 on KNL) is recommended. Setting the value to 0, which is the default, results in the value being set to the number of logical cores.
These can be set in your TensorFlow code using:
config = tf.ConfigProto(intra_op_parallelism_threads=68,
                        inter_op_parallelism_threads=1)
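As a sanity check, the relationships between the settings above can be captured in a small helper. This is only an illustrative sketch (the function name `knl_thread_settings` is hypothetical, not a NERSC-provided API); 68 cores and 1 socket are the Cori KNL values from the list above:

```python
# Sketch: derive the recommended TensorFlow threading settings for a node.
# The function name and structure are illustrative, not a provided API.
def knl_thread_settings(physical_cores=68, sockets=1):
    return {
        "OMP_NUM_THREADS": physical_cores,               # one thread per physical core
        "intra_op_parallelism_threads": physical_cores,  # should equal OMP_NUM_THREADS
        "inter_op_parallelism_threads": sockets,         # one per socket (1 on KNL)
    }

settings = knl_thread_settings()
print(settings["intra_op_parallelism_threads"])  # 68
```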
We run a suite of benchmarks on our various TensorFlow installations to monitor performance, particularly of the Intel-optimized releases. These results are somewhat outdated now; we will be updating them in the near future.
We measure the standard TensorFlow benchmarks, augmented with a benchmark based on a DC-GAN architecture used in this paper. We will continue to add benchmarks derived from our real-world science use cases. Please let us know if you have a good candidate for a benchmark.
All benchmarks were run on a single KNL node of Cori in quad,cache mode, using the following affinity settings. We recommend using these settings when running TensorFlow on KNL at NERSC.
There are several parameters that can be tuned to obtain optimal performance, listed in the following table; the exact settings will depend on the structure of the network you are trying to train. The settings used in these benchmarks have not been exhaustively optimized, but they give you a starting point for running effectively on KNL nodes.
The data used in each benchmark are the dummy data generated by the benchmarking code itself. Please note that the batch size and parameter settings can be tuned further to achieve higher performance; you can find more advice on performance tuning on these pages. The numbers reported here are images processed per second.
- (batch size=256; intra_threads=136; inter_threads=2)
- (batch size=256; intra_threads=66; inter_threads=4)
- (batch size=32; intra_threads=66; inter_threads=4)
- (batch size=128; intra_threads=68; inter_threads=2)
- (batch size=128; intra_threads=66; inter_threads=3)
- (batch size=64; intra_threads=66; inter_threads=2)
Note that the NCHW data format has only been available in the more recent releases of TensorFlow.
This link gives a useful recipe for profiling your TensorFlow code. If you are having performance issues with TensorFlow, please try profiling your code and send the results to the NERSC consultants, and we can help optimize your setup.
To profile your code, you'll want to use the Timeline object (code taken from this link):
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)
You can then open Google Chrome, go to the page chrome://tracing, and load the timeline.json file to see which stages are taking the most time.
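If you prefer to inspect the trace without a browser, the file written above uses the standard Chrome trace format, which is plain JSON with a `traceEvents` list, so a short script can summarize it. The sketch below builds a minimal example trace inline (the event names and durations are made up for illustration); in practice you would `json.load` the timeline.json produced by the profiling code:

```python
import json

# Minimal example trace in Chrome trace format; in practice, replace this
# with: trace = json.load(open("timeline.json"))
trace = {
    "traceEvents": [
        {"name": "MatMul", "ph": "X", "ts": 0, "dur": 1500},
        {"name": "RandomStandardNormal", "ph": "X", "ts": 0, "dur": 300},
    ]
}

# Sum the duration (microseconds) of each complete ("X") event by op name.
totals = {}
for ev in trace["traceEvents"]:
    if ev.get("ph") == "X":
        totals[ev["name"]] = totals.get(ev["name"], 0) + ev["dur"]

# Report the most expensive op.
slowest = max(totals, key=totals.get)
print(slowest)  # MatMul
```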
NERSC does not have browsers installed on our login or compute nodes, so if you want to use TensorBoard you will either need to install your own browser (e.g. Firefox) in your home directory, or use ssh forwarding as follows. Note that you will need your ssh public key stored in NIM for this to work.
- On a Cori login node start up TensorBoard:
tensorboard --logdir=path/to/logs --port 9998
- From your laptop/desktop/remote host, make sure you ssh into the same Cori login node on which you started TensorBoard:
ssh -L 9998:localhost:9998 cori.nersc.gov
- Now open http://localhost:9998/ in a browser on your laptop/desktop.