Richard Gerber
Report on RANDOM_NUMBER on Seaborg
This report updates and extends the initial scaling report on this code, in which it was shown that the runtime performance of the beam-beam code used by Rob Ryne's and Kwok Ko's SciDAC Accelerator project depended strongly on the number of tasks per node used. Further investigation reveals that the poor scaling on individual nodes is due to the inefficient default threading strategy used by IBM to implement the Fortran RANDOM_NUMBER() intrinsic function. Enabling a runtime setting, found in an obscure IBM reference, increases code performance by a factor of 2 when running 16 tasks on one node.
This report updates and extends the initial scaling report. Ji Qiang reported that the scaling of his code has changed since the upgrade of Seaborg to the NERSC 3E system, an upgrade that also included major updates of compilers and run time system software. Ji said that his single node performance has remained the same, but when scaling to more than 1 node, his code's performance is reduced by approximately a factor of 4 when compared to results obtained before the upgrade. Ji gave me a version of his code and an input deck that he said exhibited the slowdown. The run was short enough to permit running a number of times.
Initial results showed that the code's runtime performance was severely degraded when running with 16 MPI tasks per node on Seaborg, which has 16 processors per node, compared to runs with 1, 2, 4, or 8 tasks per node. Further investigation of the issues was proposed, with comparisons to any runtime data that was collected before the N3E upgrade, which was requested from Ji.
Ji provided the following runtime data he collected using the poe+ utliity before the the N3E upgrade. He stated that runs were performed with 16 tasks/node where possible. He could not definitively recall the exact compilation options used.
N_procs running_time(secs) Mflop/s memory_usage(MB) time/per_step/per_particle/per_procs 4 212.9 429.6 606.0 1.01e-5 8 105.8 803.5 920.0 1.01e-5 16 63.2 1350.9 992.0 1.20e-5 32 33.2 2593.0 1143.0 1.26e-5 64 27.6* 3156.8* 1441.0 2.10e-5*
The output from his code's internal timer when run on the N3 system is:
4 205.243764480110258 8 97.6185375999193639 16 53.1187435067258775 32 23.9866705865133554 64 17.8390514401253313
This information was used to create the plots in the initial report, in which a significant difference was apparent between N3 and N3E runs.
Since Ji was not sure about the compiler options used with his earlier runs, runs using a number of reasonable options were performed and compared to Ji's performance numbers. No significant difference was noted. Compilations were made with and without -q64, -qsave, -qtune=pwr3, -qarch=pwr3, -bmaxdata:, -baxstack:, and thread-safe compiler invocations.
The difference in runtime performance between Ji's N3 runs and runs on the current Seaborg with 16 or more tasks is striking. The code has internal timers which allows easy access to run times for specific routines. In addition the poe+ output is available. When comparisons are made among N3E runs with 16 tasks/node, fewer than 16 tasks/node, and Ji's N3 runs, the follow patterns emerge:
N3E:4 tasks/node: Average amount of time in system mode : 2.287500 seconds N3E:16 tasks/node: Average amount of time in system mode : 40.655000 seconds
Code
routine
name
_______
N3E:16 tasks/node init 40.67322 seconds
N3E:4 tasks/node init 3.13134 seconds
N3:16 tasks/node init 1.35488 seconds
N3E:16 tasks/node kvdist 40.59779 seconds
N3E:4 tasks/node kvdist 3.04916 seconds
N3:16 tasks/node kvdist 1.19429 seconds
Since I had Ji's source code, profiling using IBM's HPM library routine calls would have been possible; but because the code already had internal timers it was easier to move those around than to insert new HPM calls. Using that strategy the slowness at 16 tasks/node was tracked to the Distributionclass module. The routines in that module make heavy use of the Fortran RANDOM_NUMBER() intrinsic function. This was significant because I had previously noted poor scaling performance of RANDOM_NUMBER in multi-threaded OpenMP test programs.
As an initial test, I replaced the RANDOM_NUMBER intrinsic calls with the ESSL calls SURAND and DURAND. No significant performance difference was noted.
But when the RANDOM_NUMBER calls were commented out of the code, the poor performance using 16 tasks/node disappeared.
Having identified the problem, but with no solution, I solicited advice from the rest of the User Services Group members. A search of the IBM documentation resulted in no further information. A Google search similarly turned up nothing useful.
Jonathan Carter found the following in a README file in the XLF installation directory on Seaborg:
The following run-time option introduces the ability to control the number
of threads used by the MATMUL and RANDOM_NUMBER intrinsic procedures:
intrinthds={num_threads}
num_threads specifies the number of threads for parallel
execution of the MATMUL and RANDOM_NUMBER intrinsic procedures.
The default value for num_threads is equal to the number of
processors online.
Mike Stewart then located the following website http://msgs.sp2.net/cgi-bin/get/sp0207/45.html which has a buried comment:
A number of customers have experienced the problem that
multi-threaded intrinsic function floods their systems with
uncontrollable number of threads, which can result in
unacceptable run-time performance. In order to
solve this problem, we have to provide a new run-time option to
give them the ability to control the number of threads used in
those functions. Currently only matmul and random_number have
the multi-threaded version.
Syntax:
intrinthds={num_threads}
The runtime option is specified by setting the proper string in the XLFRTEOPS environment variable. The following command was executed (under csh) before running the code:
% setenv XLFRTEOPTS 'intrinthds=1'
N3E:16 tasks/node init 1.98664 seconds N3E:16 tasks/node kvdist 1.84401 secondsNot quite what Ji had obtained on N3, but a factor of 20 times better than N3E with default values. Interestingly, setting intrinthds=16, which would appear to be the intended default, did not degrade performance.
As a follow-on, Mike Stewart investigated the performance of the MATMUL intrinsic with compiler versions 7.1.1.1 and the current default version 8.1.0.3. Mike found a significant slowdown in MATMUL using 8.1.0.3 in its default configuration and filed a informational PMR (problem report) with IBM. No such effect was seen in tests with RANDOM_NUMBER.
With the intrinthds parameter set, the runs from the initial study were resubmitted. The results follow.
| Compile options | -O3 -qtune=pwr3 -qarch=pwr3 -qstrict -qsave |
|---|---|
| Memory usage | ~700MB |
| Number of particles | 2,000,000 |
| 3D mesh size | 64x64x32 |
| XLFRTEOPTS environment variable runtime setting | intrinthds=1 |
The values using the default configuration (i.e., no XLFRTEOPTS setting) are in parentheses.
Another way to look at the data is in tables that emphasize the importance of tasks per node.
| Tasks per Node | |||||
|---|---|---|---|---|---|
| Number of Nodes | 16 | 8 | 4 | 2 | 1 |
| 1 | 53.1 (106.6) | 103.6 (115.6) | 206.1 (209.0) | ||
| 2 | 25.0 (62.0) | 47.0 (53.2) | 101.6 (100.8) | 205.2 (207.5) | |
| 4 | 15.6 (73.9) | 21.9 (27.2) | 45.3 (45.9) | 97.6 (98.0) | 202.1 (201.7) |
| 8 | 20.8 (75.4) | 14.1 (21.7) | 21.6 (22.8) | 44.2 (44.7) | 96.1 (96.1) |
| 16 | 38.1 (181.2) | 13.5 (16.7) | 12.7 (14.1) | 21.2 (21.6) | 43.7 (44.0) |
| 32 | 21.8 (32.9) | 12.2 (12.1) | 12.3 (12.7) | ||
| Tasks per Node | |||||
|---|---|---|---|---|---|
| Number of Tasks | 16 | 8 | 4 | 2 | 1 |
| 4 | 206.1 (209.0) | 205.2 (207.5) | 202.1 (201.7) | ||
| 8 | 103.6 (115.6) | 101.6 (100.8) | 97.6 (98.0) | 96.1 (96.1) | |
| 16 | 53.1 (106.6) | 47.0 (53.2) | 45.3 (45.9) | 44.2 (44.7) | 43.7 (44.0) |
| 32 | 25.0 (62.0) | 21.9 (27.2) | 21.6 (22.8) | 21.2 (21.6) | |
| 64 | 15.6 (73.9) | 14.1 (21.7) | 12.7 (14.1) | 12.3 (12.7) | |
| 128 | 20.8 (75.4) | 13.5 (16.7) | 12.2 (12.1) | ||
| 256 | 38.1 (181.2) | 21.8 (32.9) | |||
Given this data, what does it mean? There are a number of different questions that can be asked.
From the data available, the problem will be solved in the least amount of time - 12.2 seconds - by running 128 tasks on 32 nodes: 4 tasks per node. This result is unchanged by the RANDOM_NUMBER problem.
| Number of Nodes | ||||||
|---|---|---|---|---|---|---|
| Number of Tasks | 1 | 2 | 4 | 8 | 16 | 32 |
| 4 | 8,244 (8,360) | 16,416 (16,600) | 32,336 (32,272) | |||
| 8 | 4,144 (4,624) | 8,128 (8,064) | 15,616 (15,680) | 30,750 (30,750) | ||
| 16 | 2,124 (4,264) | 3,760 (4,256) | 7,248 (7,344) | 14,144 (14,304) | 27,968 (28,160) | |
| 32 | 2,000 (4,960) | 3,504 (4,352) | 6,912 (7,296) | 13,568 (13,824) | ||
| 64 | 2,496 (11,824) | 4,512 (6,944) | 8,122 (9,024) | 15,744 (16,230) | ||
| 128 | 6,656 (24,128) | 8,640 (10,688) | 15,616 (15,539) | |||
| 256 | 24,384 (115,968) | 27,904 (42,112) | ||||
Running this job on 2 node with 32 tasks is the best strategy from a purely allocations standpoint. With the RANDOM_NUMBER fix, the minimum cost is cut by more than half and the time to run the cheapest job is reduced from 53 seconds to 25 seconds.
Since the minimum run time is not affected by the RANDOM_NUMBER effect, the aggregate MFlops/s remains approximately 6,100.
| Tasks per Node | |||||
|---|---|---|---|---|---|
| Number of Tasks | 16 | 8 | 4 | 2 | 1 |
| 4 | 339 (334) | 340 (336) | 345 (346) | ||
| 8 | 676 (605) | 689 (694) | 717 (714) | 728 (728) | |
| 16 | 1,432 (713) | 1,618 (1,429) | 1,678 (1,656) | 1,720 (1,701) | 1,740 (1,728) |
| 32 | 3,066 (1,236) | 3,500 (2,818) | 3,549 (3,362) | 3,613 (3,549) | |
| 64 | 4,629 (977) | 5,122 (3,328) | 5,691 (5,122) | 5,871 (5,695) | |
| 128 | 3,590 (990) | 5,532 (4,472) | 6,121 (6,152) | ||
| 256 | 2,091 (440) | 3,654 (2,241) | |||
The cheapest job (shown in blue) now gets better than 3 GFlops/s.
| Tasks per Node | |||||
|---|---|---|---|---|---|
| Number of Tasks | 16 | 8 | 4 | 2 | 1 |
| 4 | 85 (84) | 85 (84) | 86 (86) | ||
| 8 | 85 (76) | 86 (87) | 90 (89) | 91 (91) | |
| 16 | 90 (45) | 101 (89) | 105 (104) | 108 (106) | 109 (108) |
| 32 | 96 (39) | 109 (88) | 111 (105) | 113 (111) | |
| 64 | 72 (15) | 80 (52) | 89 (80) | 92 (89) | |
| 128 | 28 (8) | 43 (35) | 47 (48) | ||
| 256 | 8 (2) | 14 (10) | |||
With the RANDOM_NUMBER fix, the code now scales slightly hetter than Ji had reported for the N3 system (shown in red). Here is the updated scaling plot with results grouped by tasks/node. These runs of the code do not scale well beyound 128 tasks; they are probably too short and small to use to measure large concurrency performance.
The updated plot showing Ji's N3 result and the current results when run with the same number of tasks per node shows that the N3E discrepancy has disappeared.
The discrepancy between run times on Seaborg before and after the N3E upgrade can be eliminated by using a runtime setting that controls the number of threads used by the Fortran intrinsic function RANDOM_NUMBER(). It is (as yet) unknown whether the behavior of RANDOM_NUMBER() changed with the N3E upgrade, which included major compiler version upgrades. However, there is strong evidence - the large run time of the routines containing RANDOM_NUMBER calls and the documented change in the behaviour of the MATMUL intrinsic - that this is the case. If so, this can explain the observed behavior of the BeamBeam code.
NERSC should make efforts to determine if other codes could benefit from making the runtime setting discussed in this report.