NERSCPowering Scientific Discovery Since 1974

Application Performance Variability on Hopper

Introduction

The Hopper system is a Cray XE6 system with roughly 6300 compute nodes.  In normal day to day operations, Hopper can be running hundreds of individual applications at any given time.  Some users have reported application runtime variability, in some cases as large as 30-40%.  Non-uniform runtimes makes it more difficult for scientists to measure the performance of their codes and to estimate the how many simulations can be run with their current allocation of computer time.  A common cause of variability on HPC systems is competition for I/O resources.  However, even users whose applications did little I/O reported runtime variability.  We hypothesized that an application's placement across the 3D torus network could impact performance.  In this study we proposed to understand the impact of Node Placement on performance and thus application runtime variability.

When the job scheduler on Hopper first attempts to place applications onto nodes after a reboot, applications are placed on contiguous nodes, following the order of a node list in order to reduce communication costs for applications.  Those first jobs scheduled after a reboot are placed very optimally on the system.  But because Hopper runs so many different jobs at a single time, which request different numbers of nodes and wall clock times, over time, the system can become fragmented which creates 'holes' in the system where when job has finished.  The job scheduler continues to place applications on nodes in order of the node list, starting from the beginning of the list, and placing an application on the first available nodes whether they are contiguous or not.  This can result in an application spread widely across nodes in the system.

All images and graphs can be viewed in full size by clicking on them.

Visual Example

The below graphic shows a visual picture of node placement for the MILC application run on 8192 cores three different times.  It's clear to see that in the blue case, the nodes are tightly packed together, while in the orange and red cases the nodes are more scattered around the system.   Our hypothesis is that the more tightly the nodes are packed the faster the runtime.

Shown here are three different node placements of MILC

As the chart below shows, the NERSC_TIME (benchmark time of MILC) of the three runs seem to fit their given node placement. The blue one, with the most compact and contiguous fit, runs almost three times faster than the red one, which has points scattered throughout the entire system.

The goal of this study is to find a way to mathematically describe the shape and placement of the nodes (or a property of the placement) that corresponds with the resulting NERSC_TIME.

Job NERSC_TIME
Blue 887.12 Seconds
Orange 1298.88 Seconds
Red 2462.33 Seconds

 

 

 

 

 

Methodology

Comparing the recorded variability of the different benchmarks, the MILC code stood out as being one of the programs with more variability and was chosen as the program for this study. The batch script created for this study ran the MILC code three times per job, while running a script that reports the network coordinates of each node between each MILC run. The coordinates are not the physical location on the machine, but the actual logical network location on the torus. By running the MILC code three times per job, the variability caused by noise (interference) can be found as the node placement is assigned with the batch script. The best way to eliminate noise as much as possible is to use only the only the lowest NERSC_TIME in a job. Since it is known that the MILC code can run at least as fast as the lowest time reported in the job, the difference of the two other times can be attributed to system noise. All of the following graphs have been graphed against the minimum NERSC_TIME in a job.

The analysis also included a basic weighting system, which gives a penalty to the Y axis; The bandwidth on Hopper in the Y direction in approximately half of the X and the Z directions.

As shown by this graph, the variability in a single run (with the same node placement) is very small; Many look as if the points are on top of each other.

Lognormal Distribution

Looking at the histogram and quantile comparison charts, the NERSC_TIME data seems to be of a lognormal distribution. This means there is a lot more statistical analysis left to do.

 

Histogram and QQ plot of NERSC_TIME

Data Analysis

The key to figuring out performance variability is to find the correct metric(s). Using the node coordinates and the NERSC_TIME (MILC's benchmark time), serveral metrics were tried.

Average Distance

This is the average distance of all possible communication.

Between Gemini

This is the average distance in XYZ between all of the Gemini.

This is the average distance (weighted) in XYZ between all of the Gemini.

 
Between Nodes

Average distance between two nodes

 

Maximum Distance

This is the maximum distance between any two Gemini.

 

Hops Needed

This is the number of hops in total that MILC needed to run. The metric was created using the output given by the IPM profiler (it provided the ID of each send and recieve). MILC is mainly composed of three calls: MPI_Isend, MPI_Irecv, and MPI_Allreduce. These are the three functions that were processed.

IPM Distance (number of hops required)

 

Volume and Surface Area

These calculations were done by doing a delaunay triangulation on the coordinates and then running a convex hull on the triangulated coordinates.

Note that this is NOT the cross section or bandwidth of the placement, but the surface area of the convex hull of the coordinates.

This is the volume of the convex hull of the coordinates.

 

Results

The metrics tested all showed some amount of positive correlation, but the r-squared values are not high enough to be conclusive and more work needs to be done.

  Total Triplets Minimum Times Difference
Minimum 887.1 secs 904 secs 904 secs -16.9 secs
Mean 1114.8 secs 1102 secs 1081 secs 33.8 secs
Maximum 2462.3 secs 1727 secs 1156 secs 906.3 secs

The ratio of the difference of means (of minimum times versus total times) is about 3.1%. That means the variability on a ~60 minute run (constant node placement) is about 3.1%.

The Next Steps

Variance in a Run

Part of the output produced by MILC is FFTIME, which is the time for one step of calculations. Each step is doing the similar calculations and should take around 2 seconds on Hopper, so any change in calculation can be attributed to system noise.

This run, for example, was subject to large amounts of noise for most of the run.

 

Some preliminary work was done comparing the metrics used above to the minimum FFTIME of each MILC run (rather than the minimum NERSC_TIME of each job). The results were also not very conclusive, as the minimum FFTIME of the runs very just too close for most runs.

Future Work

Cra is working to improve the scheduler on Hopper so that jobs are scheduled based on contiguous nodes, rather than first available in the node list.  When this new version of Moab becames available we will update this study.  (Expected spring 2012).

This work was performed by Dylan Wang from Aragon High School in San Mateo.