Performance and optimization
Case Studies -
1) SHOC benchmark from ORNL - CUDA vs OpenCL
The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.
In addition to OpenCL-based benchmark programs, SHOC also includes a Compute Unified Device Architecture (CUDA) version of many of its benchmarks for comparison with the OpenCL version.
Features
- Multiple benchmark applications written in both OpenCL and CUDA
- Cluster-level parallelism with MPI
- Node-level parallelism for multiple GPUs per node
- Harness for running the suite and easy reporting of results (in spreadsheet format)
- Stability tests for large-scale cluster resiliency testing
The Benchmarks
The SHOC benchmark suite is divided into two primary categories: stress tests and performance tests. The stress tests use computationally demanding kernels to identify OpenCL devices with bad memory, insufficient cooling, or other component problems. The performance tests are further subdivided according to their complexity and the nature of the device capability they exercise. This categorization is similar in spirit to that used in the BLAS API. Currently, the levels are:
- Performance Tests
A comparison of some of the kernels, shown below, indicates that CUDA performs better than OpenCL on NVIDIA GPUs. Note that GFLOPS is measured for the kernel only.

Here is the detailed output from a single node of Dirac using 1 GPU, running CUDA 3.2/CUDA 4.0.
[virajp83@dirac20 tools]$ perl driver.pl -cuda -s 4
--- Welcome To The SHOC Benchmark Suite ---
Hostname: dirac20
Number of available devices: 1
Device 0: 'Tesla C2050'
--- Starting Benchmarks ---
- Level 0: "Feeds and Speeds" -
-- This can take several minutes. --
-PCIe Bandwidth Tests (GB/s)-
Dev 0: 'Tesla C2050' H->D: 6.10089
Dev 0: 'Tesla C2050' D->H: 5.81925
-MaxFlops Test (GFLOPS)-
Dev 0: 'Tesla C2050' SP: 1002.38
Dev 0: 'Tesla C2050' DP: 503.241
-Device Memory Bandwidth Tests (GB/s) (Read / Write)-
Dev 0: 'Tesla C2050'
Global Memory Contiguous: 90.1328 / 100.912
Global Memory Strided: 11.8911 / 3.98538
Shared Memory: 373.061 / 421.558
Texture (Random Access): 70.1163
--- Level 1 - Basic Algorithms and Parallel Primitives ---
-- This can take several minutes. --
-FFT (GFLOPS) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SP FFT: 297.578 / 30.8169
Dev 0: 'Tesla C2050' SP IFFT+Norm: 299.055 / 30.8326
Dev 0: 'Tesla C2050' DP FFT: 139.893 / 15.3166
Dev 0: 'Tesla C2050' DP IFFT+Norm: 140.24 / 15.3207
-GEMM (GFLOPS/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SGEMM: 613.999 / 533.981
Dev 0: 'Tesla C2050' DGEMM: 297.31 / 239.43
-MD (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SP MD: 28.4851 / 12.6922
Dev 0: 'Tesla C2050' DP MD: 39.8846 / 19.4913
-Reduction (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SP Reduction: 92.1877 / 5.70531
Dev 0: 'Tesla C2050' DP Reduction: 92.8906 / 5.70657
-S3D (GFLOPS) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SP S3D: 43.3052 / 38.1263
Dev 0: 'Tesla C2050' DP S3D: 23.9854 / 20.857
-Scan (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SP Scan: 25.8179 / 0.00580344
Dev 0: 'Tesla C2050' DP Scan: 18.4115 / 0.00580535
-Sort (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' Sort: 1.65577 / 1.06452
-Sparse Matrix-Vector Multiply (SpMV) (GFLOPS) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' CSR-Scalar: 0.70517 / 0.47554
Dev 0: 'Tesla C2050' Padded CSR-Scalar: 0.644673 / 0.448017
Dev 0: 'Tesla C2050' CSR-Vector: 9.91805 / 1.27311
Dev 0: 'Tesla C2050' Padded CSR-Vector: 10.5387 / 1.28894
Dev 0: 'Tesla C2050' DP CSR-Scalar: 0.672379 / 0.399681
Dev 0: 'Tesla C2050' Padded DP CSR-Scalar: 0.674062 / 0.400994
Dev 0: 'Tesla C2050' DP CSR-Vector: 8.76532 / 0.885867
Dev 0: 'Tesla C2050' Padded DP CSR-Vector: 9.29075 / 0.894539
Dev 0: 'Tesla C2050' SP ELLPACKR: 7.8153
Dev 0: 'Tesla C2050' DP ELLPACKR: 6.18968
-Stencil2D (s) (Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' SP Sten2D: 3.66513
Dev 0: 'Tesla C2050' DP Sten2D: 5.30471
-Triad (GB/s) (Kernel + PCIe transfer)-
Dev 0: 'Tesla C2050' Triad: 6.49068
[virajp83@dirac20 tools]$ perl driver.pl -opencl -s 4
--- Welcome To The SHOC Benchmark Suite ---
Hostname: dirac20
Number of available devices: 1
Device 0: Tesla C2050
--- Starting Benchmarks ---
- Level 0: "Feeds and Speeds" -
-- This can take several minutes. --
-PCIe Bandwidth Tests (GB/s)-
Dev 0: Tesla C2050 H->D: 6.00652
Dev 0: Tesla C2050 D->H: 5.84119
-MaxFlops Test (GFLOPS)-
Dev 0: Tesla C2050 SP: 1005.74
Dev 0: Tesla C2050 DP: 503.943
-Device Memory Bandwidth Tests (GB/s) (Read / Write)-
Dev 0: Tesla C2050
Global Memory Contiguous: 91.8004 / 100.444
Global Memory Strided: 14.7917 / 3.89989
Local Memory: 371.921 / 468.07
Image (Random Access): 69.953
-OpenCL Kernel Compilation (s)-
Dev 0: Tesla C2050 Kernel Compilation: 0.183856
-OpenCL Queuing Delay (ms)-
Dev 0: Tesla C2050 Submit-Start Delay: 1.94722e-06
--- Level 1 - Basic Algorithms and Parallel Primitives ---
-- This can take several minutes. --
-FFT (GFLOPS) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SP FFT: 55.9078 / 55.542
Dev 0: Tesla C2050 SP IFFT+Norm: 53.8754 / 53.5356
Dev 0: Tesla C2050 DP FFT: 21.9327 / 21.8189
Dev 0: Tesla C2050 DP IFFT+Norm: 18.5506 / 18.4692
-GEMM (GFLOPS/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SGEMM: 412.116 / 361.885
Dev 0: Tesla C2050 DGEMM: 170.732 / 138.13
-MD (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SP MD: 23.6974 / 11.586
Dev 0: Tesla C2050 DP MD: 23.7171 / 14.5336
-Reduction (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SP Reduction: 90.4555 / 5.63373
Dev 0: Tesla C2050 DP Reduction: 91.1109 / 5.63569
-S3D (GFLOPS) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SP S3D: 45.9158 / 36.2952
Dev 0: Tesla C2050 DP S3D: 21.4861 / 17.2808
-Scan (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SP Scan: 145.561 / 2.9019
Dev 0: Tesla C2050 DP Scan: 134.314 / 2.89856
-Sort (GB/s) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 Sort: 196.608 / 2.90564
-Sparse Matrix-Vector Multiply (SpMV) (GFLOPS) (Kernel Only / Kernel + PCIe transfer)-
Dev 0: Tesla C2050 CSR-Scalar: 0.711484 / 0.317706
Dev 0: Tesla C2050 Padded CSR-Scalar: 0.639609 / 0.302956
Dev 0: Tesla C2050 CSR-Vector: 4.48412 / 0.50889
Dev 0: Tesla C2050 Padded CSR-Vector: 4.81349 / 0.51411
Dev 0: Tesla C2050 DP CSR-Scalar: 0.656797 / 0.24467
Dev 0: Tesla C2050 Padded DP CSR-Scalar: 0.676247 / 0.22831
Dev 0: Tesla C2050 DP CSR-Vector: 3.6134 / 0.351946
Dev 0: Tesla C2050 Padded DP CSR-Vector: 3.95269 / 0.317033
Dev 0: Tesla C2050 SP ELLPACKR: 6.2704
Dev 0: Tesla C2050 DP ELLPACKR: 4.21125
-Stencil2D (s) (Kernel + PCIe transfer)-
Dev 0: Tesla C2050 SP Sten2D: 3.91131
Dev 0: Tesla C2050 DP Sten2D: 5.42798
-Triad (GB/s) (Kernel + PCIe transfer)-
Dev 0: Tesla C2050 Triad: 5.53576
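The two runs above can be compared directly; a minimal Python sketch, using a few of the kernel-only figures quoted verbatim from the driver.pl output (GFLOPS on the same Tesla C2050):

```python
# Compare kernel-only GFLOPS from the two SHOC runs above (CUDA vs. OpenCL).
# The numbers are copied from the console output; no new measurements here.
cuda = {"SP FFT": 297.578, "DP FFT": 139.893, "SGEMM": 613.999, "DGEMM": 297.31}
opencl = {"SP FFT": 55.9078, "DP FFT": 21.9327, "SGEMM": 412.116, "DGEMM": 170.732}

for kernel in cuda:
    ratio = cuda[kernel] / opencl[kernel]
    print(f"{kernel}: CUDA is {ratio:.2f}x the OpenCL rate")
```

The gap is largest for FFT (over 5x), where the CUDA run uses the vendor-tuned path, and smaller for GEMM (about 1.5x); bandwidth-bound Level 0 tests above are nearly identical between the two APIs.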
2) NAMD
NAMD, recipient of a 2002 Gordon Bell Award, is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using gigabit ethernet. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
The STMV1 apoa benchmark was run, which includes a simulation of 1 million atoms. The first graph shows that it is better to use 1 MPI process per GPU: performance degrades as one increases the number of MPI processes per node sharing a GPU, due to the CPU-GPU communication bottleneck.

A scaling study was done on the same benchmark. The following graph shows early multi-GPU results. The Y-axis is walltime (sec) and the X-axis is cores per node used (i.e., ppn), with the number of nodes scaling from 1 to 4.

If we focus on two cases, one with 1 node using 2 ppn plus the GPU, and one with 4 nodes using 4 ppn (16 cores total), the bars have the same height.
So can we say
2 cores + 1 GPU = 16 cores
1 GPU = 14 cores?
If we have explored enough parallelism within the node, this might be a good indication of how much CPU-equivalent performance a single GPU delivers for NAMD. Of course, this is open to debate.
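The back-of-the-envelope equivalence above can be written out explicitly. A small Python sketch, using only the core counts from the two matching-walltime bars singled out in the graph:

```python
# Rough CPU-equivalence estimate for one GPU in NAMD, based on the two
# runs above with equal walltime: 1 node x 2 ppn + 1 GPU vs. 4 nodes x 4 ppn.
cpu_only_cores = 4 * 4   # 4 nodes, 4 cores per node -> 16 cores, no GPU
gpu_run_cores = 1 * 2    # 1 node, 2 cores per node, plus one GPU

# Equal walltime suggests: gpu_run_cores + 1 GPU ~= cpu_only_cores,
# so the GPU stands in for the remaining cores.
gpu_core_equivalent = cpu_only_cores - gpu_run_cores
print(f"1 GPU ~= {gpu_core_equivalent} cores")
```

This is only an estimate under the stated caveat (enough intra-node parallelism explored, and interconnect effects ignored), not a general NAMD performance claim.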
3) HPL Benchmark
HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
The algorithm used by HPL can be summarized by the following keywords: Two-dimensional block-cyclic data distribution - Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths - Recursive panel factorization with pivot search and column broadcast combined - Various virtual panel broadcast topologies - bandwidth reducing swap-broadcast algorithm - backward substitution with look-ahead of depth 1.
The HPL package provides a testing and timing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. The best performance achievable by this software on your system depends on a large variety of factors. Nonetheless, with some restrictive assumptions on the interconnection network, the algorithm described here and its attached implementation are scalable in the sense that their parallel efficiency is maintained constant with respect to the per processor memory usage.
Following are the performance results of multithreaded HPL using CUDA, i.e., one GPU per node. The results are obtained running 1 MPI process per node with 8 OpenMP threads. Each MPI process communicates with the NVIDIA card (C2050) to offload work to the GPU. Computational efficiency (the GPU's contribution) is calculated by the formula
GPU FLOPS / (CPU + GPU) FLOPS
The first graph shows that Fermi cards give 4 times the double-precision FLOPS of Tesla cards (the previous generation). The second graph shows that, as the number of nodes grows, the GPU contribution decreases to less than half. This comes from the MPI communication bottleneck that kicks in as you scale across nodes over InfiniBand.
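Reading the efficiency formula as the GPU's share of the combined CPU+GPU FLOPS, it can be sketched as a small helper. The sample numbers below are made up for illustration and are not taken from the graphs:

```python
def gpu_contribution(gpu_flops: float, cpu_flops: float) -> float:
    """Fraction of total HPL FLOPS delivered by the GPU:
    GPU FLOPS / (CPU + GPU) FLOPS."""
    return gpu_flops / (cpu_flops + gpu_flops)

# Illustrative (hypothetical) numbers: a GPU delivering 300 GFLOPS alongside
# CPUs delivering 80 GFLOPS contributes ~79% of the total.
print(f"{gpu_contribution(300.0, 80.0):.2f}")
```

As the node count grows and MPI communication eats into the offloaded work, the GPU term shrinks relative to the total, which is the trend the second graph shows.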