An Investigation of Reported Anomolies When Running the BeamBeam3D Code on Seaborg

Richard Gerber

Followup to this report.

Abstract

A number of small runs with Ji Qiang's BeamBeam 3D code (used by Rob Ryne's and Kwok Ko's SciDAC Accelerator project) have been performed with a variety of task/node configurations on Seaborg. Global performance results are reported. It is shown that the runtime characteristics of the code depend strongly on the number of tasks per node used. Comparisons to results obtained on Oak Ridge's Eagle, which has only 4 CPUs/node, are probably not relevant for this code if running with more than 4 tasks/node on Seaborg. This example further emphasizes that interpreting simple performance and scaling metrics can be difficult.

Background

Ji Qiang reported that the scaling of his code has changed since the upgrade of Seaborg to the NERSC 3E system; concurrent to that upgrade major updates to the compilers and parallel operating system were installed. Ji said that his single node performance has remained the same, but when scaling to more than 1 node, his code's performance is reduced by approximately a factor of 4 when compared to results obtained before the upgrade. Ji said he does not see this reduction in performance on the POWER 3 SP system at Oak Ridge. Ji speculated that off-node MPI communication was being degraded when compared to on-node MPI, which used shared memory and does not involve the SP switch.

Ji gave me a version of his code and an input deck that he said exhibited the slowdown. The run was short enough to permit running a number of times.

I originally intended to explore possible Seaborg off-node effects, but the results of the initial runs led me to explore the scaling performance more generally. I performed a number of runs all at fixed problem size, but with different numbers of nodes and tasks per node.

Below are the results, as extracted from the output of the NERSC poe+ utility and the code's internal timers. IBM's hpmcount, from which poe+ derives its output, was found to have an almost constant overhead of 6-9 seconds per run. This would have affected the quoted run times and MFlop rates, especially for the shorter runs. For that reason, the code's internal timer was used to record run times and poe+ was used to record the number of floating point operations (Flops) performed by the CPUs. The number of Flops was approximately contstant for a given number of tasks. The run time as reported by the code's internal timer was very close to the sum of user time and system time reported by hpmcount, which is a indicator of time spent executing user code. Quoted MFlops/s rates were calculated as (Number of MFlops/run time).

With this small test case, communication overhead costs likely dominate with 256 or more tasks and a different run configuration will be needed to perform scaling studies with a large number of tasks.

Code and Compilation Parameters

Compile options -O3 -qtune=pwr3 -qarch=pwr3 -qstrict -qsave -bmaxdata:0x70000000 -bmaxstack:0x10000000
Memory usage~700MB
Number of particles2,000,000
3D mesh size64x64x32

Results

The shortest run time is shown in red.

Code execution time in seconds (fixed problem size)
Number of Nodes
Number of Tasks124 81632
4 209.0207.5 201.7
8 115.6100.8 98.096.1
16 106.653.2 45.944.7 44.0
32 62.0 27.222.8 21.6
64 73.921.7 14.112.7
128 75.4 16.712.1
256 181.232.9

Another way to look at the data is in tables that emphasize the importance of tasks per node.

Code execution time in seconds (fixed problem size)
Tasks per Node
Number of Nodes1684 21
1 106.6115.6 209.0
2 62.053.2 100.8207.5
4 73.927.2 45.998.0 201.7
8 75.421.7 22.844.7 96.1
16 181.216.7 14.121.6 44.0
32 32.9 12.112.7
Code execution time in seconds (fixed problem size)
Tasks per Node
Number of Tasks1684 21
4 209.0207.5 201.7
8 115.6 100.898.0 96.1
16 106.653.2 45.944.7 44.0
32 62.027.2 22.821.6
64 73.921.7 14.112.7
128 75.416.7 12.1
256 181.232.9

Given this data, what does it mean? There are a number of different questions that can be asked.

What configuration gives the shortest execution time?

From the data available, the problem will be solved in the least amount of time - 12.1 seconds - by running 128 tasks on 32 nodes: 4 tasks per node.

What run configuration incurs the lowest MPP charge?

The cheapest jobs are highlighted in blue. (MPP charges based on internal timer run times are reported. Actual MPP charges would likely include the parallel operating system startup and shutdown times; however, these are ignored here and are assumed to be insignificant when extrapolating results to long-running production jobs.)

MPP charge seconds (fixed problem size)
Number of Nodes
Number of Tasks124 81632
4 8,36016,600 32,272
8 4,6248,064 15,68030,750
16 4,2644,256 7,34414,304 28,160
32 4,960 4,3527,296 13,824
64 11,8246,944 9,02416,230
128 24,128 10,68815,539
256 115,96842,112

A number of run configurations accrue similar charges. Running this job on 4 nodes with 32 total tasks is probably the best strategy from a combined allocations/runtime perspective since that gives the shortest run time (27.2 seconds) among the cheapest jobs. The job can be run in less than half the time on 32 nodes (shown in red), but at 3.7 times the cost. (Tests at other configurations, say 12 tasks on 1 node, might reveal a cost savings.)

What about the MFlops metric?

Since the problem size, and therefore the number of Flops, is fixed the total MFlops for the code should scale inversely with the run time. This is not strictly true due to some system and MPI overheads that scale with task and node numbers, as well as the details of how the code partitions work among the tasks. In particular, the code appears to scale better than "perfectly" for small numbers of tasks. This is likely due to the way the code divides computional work among tasks for a fixed problem size. Runs with different numbers of tasks produced different numbers of floating point operations. For reference:
Number of tasks MFlops
469810
870060
1676035
3276659
6472215
12874681
25679650
Here are the total MFlops and MFlops/task.

Total MFlops/s (fixed problem size)
Tasks per Node
Number of Tasks1684 21
4 334336 346
8 605 694714 728
16 7131,429 1,6561,701 1,728
32 1,2362,818 3,3623,549
64 9773,328 5,1225,695
128 9904,472 6,152
256 4402,241

As expected the top aggregate rate of 6.15 GFlops/s is achieved by the 128-task, 4 tasks/node job.

Total MFlops/s/task (fixed problem size)
Tasks per Node
Number of Tasks1684 21
4 8484 86
8 76 8789 91
16 4589 104106 108
32 3988 105111
64 1552 8089
128 835 48
256 210

How well does the code scale?

From a scaling perspective, the code scales to 128 processors running 4 tasks/node with a respectable efficiency of 56% (48/86, see previous table), but its efficiency at 128 tasks at 16 tasks/node is only 9% (8/86). So it depends on how you look at it. Here is a scaling plot with results grouped by tasks/node; shown in red are the N3 results reported by Ji.

The lousy scaling at 16 tasks/node is apparent and is even much worse at 256 tasks. Ji's results from the N3 system appear to be in line with the results obtained using 4-8 tasks/node. Scaling results Ji produced on the Oak Ridge machine can not be compared to Seaborg's 8 and 16 tasks per node runs because that POWER 3 machine, Eagle, has only 4 processors/node.

Since Ji reported having run with as many tasks/node as possible on Seaborg, here is a plot comparing N3 and N3E runs.

The factor of 4 difference quoted by Ji and readily seen at 64 tasks. This discrepancy is investigated further in the followup to this report.

What is the best run strategy?

At first glance, you might say "Don't run with 16 tasks/node." That configuration gives the worst performance metrics. On the other hand, if the results are not time critical, Running on 1 node with 16 tasks/node maximizes ease of scheduling and nearly minimizes cost.

Conclusions

There does not appear to be a slowdown on Seaborg when running on more than one node, compared to running on a single node, as Ji had speculated. However, Ji claims his code scaled much better before the upgrade and that difference is still unexplained. The issue is addressed in the followup to this report.

This example again emphasizes that simple performance metrics can be difficult to interpret. An array of runs was necessary to extract useful results. Even apparently simple questions like "How does your code scale" and "How many MFlops does your code achieve?" do not always have simple meaningful answers.