SNAP

Description

SNAP is a proxy application that models the performance of a modern discrete ordinates neutral particle transport application. SNAP may be considered an update to Sweep3D [1], intended for hybrid computing architectures. It is modeled after the Los Alamos National Laboratory code PARTISN. PARTISN solves the linear Boltzmann transport equation (TE), a governing equation for determining the number of neutral particles (e.g., neutrons and gamma rays) in a multidimensional phase space [2]. SNAP itself is not a particle transport application; it incorporates no actual physics in its available data, nor does it use numerical operators specifically designed for particle transport. Rather, SNAP mimics the computational workload, memory requirements, and communication patterns of PARTISN. The equation it solves has been composed to use the same number of operations, use the same data layout, and load elements of the arrays in approximately the same order. Although the equation SNAP solves looks similar to the TE, it has no real-world relevance.

Download

SNAP.tar.gz (June 13 edition)

Problem Sizing

Running a time-dependent problem requires two copies of the angular flux ('f' in the manual): one for the incoming and one for the outgoing flux of a time step. The angular flux arrays are sized by the number of cells (nx, ny, nz), the number of angles (8*nang), and the number of groups (ng). Memory requirements are as follows:

Memory in double precision words = 2 * (nx * ny * nz) * (8 * nang) * ng

Relevant problem ranges -- Single processor:
nx*ny*nz = ~1000-4000
nang = ~50-250
ng = ~30-100

All other arrays are significantly smaller than the angular flux arrays, so memory estimates for different calculations typically focus on these arrays; a worked example follows.
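The sketch below (not part of the SNAP distribution) applies the formula above to the per-core parameters of the large problem defined later (16x16x16 cells, nang=50, ng=32). The two angular flux copies alone come to about 0.78 GB, consistent with the quoted ~0.9 GB/core once the smaller arrays are added.

  program flux_memory_estimate
    implicit none
    integer :: nx, ny, nz, nang, ng
    real(kind=8) :: words, gbytes
    nx = 16; ny = 16; nz = 16      ! cells per core
    nang = 50                      ! angles per octant (8 octants)
    ng = 32                        ! energy groups
    ! 2 time copies * cells * angles * groups, in double precision words
    words  = 2.0d0 * real(nx*ny*nz, 8) * real(8*nang, 8) * real(ng, 8)
    gbytes = words * 8.0d0 / 1024.0d0**3
    print '(a,f6.2,a)', 'Angular flux memory: ', gbytes, ' GB per core'
  end program flux_memory_estimate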

Compiling

SNAP requires a Fortran compiler with OpenMP capabilities.

(1) Untar the source
(2) cd SNAP/src
(3) Edit the Makefile for your compiler environment
(4) Run make to build snap. The executable, snap, will be in the src/ directory.

Enabling OpenMP requires no changes to the source code and no macro definitions. Simply add the relevant compiler flag (e.g., -fopenmp for GNU or -qopenmp for Intel compilers) to the Makefile.

Running

Scripts are provided for running the small and large problems; they are named run-<problem size>.sh for the respective problem sizes. Sample outputs from the NERSC Hopper system are included, as well as instrumented performance output from IPM.

For runs where the use of OpenMP is allowed, using OpenMP threads requires two changes. First, the OMP_NUM_THREADS environment variable must be set. Second, the input deck variable named "nthreads" (default value 1) must also be changed. This variable allows the user to run with fewer threads than the number specified by OMP_NUM_THREADS, as sketched below.
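The following sketch (illustrative only, not SNAP's actual source) shows the kind of logic implied: the input value can only cap the thread count below the environment setting.

  program thread_cap_sketch
    use omp_lib
    implicit none
    integer :: nthreads, navail, nused
    navail   = omp_get_max_threads()   ! upper bound set by OMP_NUM_THREADS
    nthreads = 4                       ! example value from the input deck
    nused    = min(nthreads, navail)   ! the input can only reduce the count
    call omp_set_num_threads(nused)
    print '(a,i0,a,i0)', 'available threads: ', navail, ', threads used: ', nused
  end program thread_cap_sketch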

Where allowed, the number of MPI ranks may be changed. This requires modifying npey and npez (the total number of MPI tasks is npey*npez) as well as nx, ny, and nz in the input file, because the total problem size must stay at ~200M cells for the large problem and ~50K cells for the small problem; a quick consistency check is sketched below.
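As an illustration (not part of the SNAP distribution), the sketch below checks a candidate decomposition against those targets; the global mesh and MPI grid shown come from the last row of the large-problem table in the next section.

  program decomp_check
    implicit none
    integer :: nx, ny, nz, npey, npez, nranks
    integer(kind=8) :: total_cells
    nx = 512; ny = 512; nz = 768     ! global mesh (large-problem example)
    npey = 128; npez = 384           ! 49152 MPI ranks
    nranks = npey * npez
    total_cells = int(nx, 8) * int(ny, 8) * int(nz, 8)
    ! the sample inputs keep ny divisible by npey and nz divisible by npez
    if (mod(ny, npey) /= 0 .or. mod(nz, npez) /= 0) &
      print '(a)', 'Warning: ny/npey or nz/npez does not divide evenly'
    print '(a,i0)', 'MPI ranks     : ', nranks
    print '(a,i0)', 'Total cells   : ', total_cells
    print '(a,i0)', 'Cells per rank: ', total_cells / nranks
  end program decomp_check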

Verification

If SNAP runs to completion you will see:
Success! Done in a SNAP! 

Reporting

You must report two (2) items from the output in the procurement spreadsheet:

(1) the "Solve" time, found in the Timing Summary at the end of the standout output.
Solve time is the total time minus the initialization and input/output times.

(2) the number of total iterations.

The number of total iterations can be found by searching for the following text
in the stdout file and reporting the value at the end of the line:

"Total inners for all time steps, outers"

For example, for the large test case, the sample output shows

Total inners for all time steps, outers = 3049

Results are written to stdout. Please include all output in separate files with unique names that can easily be associated with the required runs.
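A minimal sketch (not provided with SNAP) that scans a saved output file for the two reported quantities is shown below; the file name 'snap.out' is an assumption, so point it at your own output file.

  program report_scan
    implicit none
    character(len=512) :: line
    integer :: ios
    open(unit=10, file='snap.out', status='old', action='read')
    do
      read(10, '(a)', iostat=ios) line
      if (ios /= 0) exit
      ! the Timing Summary line containing the Solve time
      if (index(line, 'Solve') > 0) print '(a)', trim(line)
      ! the total iteration count
      if (index(line, 'Total inners for all time steps, outers') > 0) &
        print '(a)', trim(line)
    end do
    close(10)
  end program report_scan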

Grind time is another important performance metric, representing the time to solve for a single phase-space variable, i.e., the angular flux of a single cell, in a single direction, for a single group, over one iteration (mesh sweep).
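SNAP reports its own grind time; the sketch below (not part of the distribution) only shows how the definition above can be turned into a back-of-the-envelope estimate, assuming every reported inner iteration sweeps the full phase space of nx*ny*nz cells, 8*nang angles, and ng groups. The numbers are placeholders to be replaced with values from your input and output.

  program grind_time_estimate
    implicit none
    integer :: nx, ny, nz, nang, ng, total_inners
    real(kind=8) :: solve_time, unknowns, grind_ns
    nx = 512; ny = 512; nz = 768     ! global mesh (placeholder)
    nang = 50                        ! angles per octant (placeholder)
    ng = 32                          ! groups (placeholder)
    total_inners = 3049              ! from "Total inners for all time steps, outers"
    solve_time = 100.0d0             ! seconds, placeholder for the reported Solve time
    unknowns = real(nx,8)*real(ny,8)*real(nz,8) * real(8*nang,8) * real(ng,8)
    grind_ns = solve_time / (unknowns * real(total_inners,8)) * 1.0d9
    print '(a,es10.3,a)', 'Estimated grind time: ', grind_ns, ' ns'
  end program grind_time_estimate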

Problem Descriptions

We require weak scaling results, where the problem size per processor is constant regardless of the number of processors (sockets), with 1 MPI rank per core and no OpenMP threads. We focus on 16x16x16 cells per core on Hopper and Cielo.

**************************************************************************

Large - 1 MPI rank per core + no OpenMP threads [located in the directory "large"]

An example for NERSC's Hopper (2 sockets, 12 cores/socket), 2048-node run of the large problem: scale npey*npez to the total number of ranks (2048 nodes * 24 cores/node = 49152, i.e., 12 MPI ranks per socket; the table below uses npey=128, npez=384). Absent OpenMP, we focus on a problem size suitable for a single core. For the large problem we want 4096 cells per core: 4096 cells/core * 49152 cores = ~200M cells.

#cores/MPI ranks     nx     ny     nz      npey     npez    ichunk
----------------     --     --     --      ----     ----    ------
1                    16     16     16         1        1      8
2                    16     16     32         1        2      8
4                    16     32     32         2        2      8
8                    32     32     32         2        4      8
16                   32     32     64         4        4      8
... (continue this cycle of increasing ranks and dimensions)
2048                128    256    256        32       64      16
...
32768               512    512    512       128      256      32
49152               512    512    768       128      384      32

Large problem memory requirements:
Time-dependent
src_opt=0 (no angular source array)
ng = 32
nang = 50 (will be 50 per octant, 400 total)

Memory = ~0.9 GB/core

**************************************************************************
Small - 1 MPI rank per core + no OpenMP threads [located in the directory "small"]

An example for NERSC's Hopper: a 4-node run of the small problem with 96 MPI ranks
total. Absent OpenMP, we again construct a weak scaling problem by considering the
problem size per core.

#cores/MPI ranks     nx     ny     nz      npey     npez    ichunk
----------------     --     --     --      ----     ----    --------
1                     8      8      8         1        1      8
2                     8      8     16         1        2      8
4                     8     16     16         2        2      8
8                    16     16     16         2        4      8
16                   16     16     32         4        4      8
32                   16     32     32         4        8      8
64                   32     32     32         8        8      8
96                   32     32     48         8       12      8

Time-dependent
src_opt=0 (no angular source array)
ng = 32
nang = 24 (will be 25 per octant, 200 total)

Memory = ~0.07 GB/core

More Details

The solution to the time-dependent TE is a "flux" function of seven independent
variables: three spatial (3-D spatial mesh), two angular (set of discrete
ordinates, directions in which particles travel), one energy (particle speeds
binned into "groups"), and one temporal. PARTISN, and therefore SNAP, uses
domain decomposition over these dimensions to coherently distribute the data and
the tasks associated with solving the equation. The parallelization strategy is
expected to be the most efficient compromise between computing resources and the
iterative strategy necessary to converge the flux.

The iterative strategy consists of two nested loops. These nested
loops are performed for each step of a time-dependent calculation, wherein any
particular time step requires information from the preceding one. No
parallelization is performed over the temporal domain. However, for time-
dependent calculations two copies of the unknown flux must be stored, each copy
an array of the six remaining dimensions. The outer iterative loop involves
solving for the flux over the energy domain with updated information about
coupling among the energy groups. Typical calculations require tens to hundreds
of groups, making the energy domain suitable for threading with the node's (or
nodes') provided accelerator. [3] The inner loop involves sweeping across the
entire spatial mesh along each discrete direction of the angular domain. The
spatial mesh may be immensely large. Therefore, SNAP spatially decomposes the
problem across nodes and communicates needed information according to the KBA
method [4]. KBA is a transport-specific application of general parallel
wavefront methods. Lastly, although KBA efficiency is improved by pipelining
operations according to the angle, current chipsets operate best with vectorized
operations. During a mesh sweep, SNAP operations are vectorized over angles to
take advantage of the modern hardware.
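
As a concrete illustration of the general wavefront idea only (not SNAP's sweep code and not the KBA decomposition itself), the sketch below orders a small 2-D mesh by diagonals: for a sweep entering at the (1,1) corner, each cell depends only on its upstream neighbours, so all cells on a diagonal i+j = const are independent once the previous diagonal is finished.

  program wavefront_sketch
    implicit none
    integer, parameter :: nx = 4, ny = 4
    integer :: d, i, j
    do d = 2, nx + ny                      ! process diagonals in sweep order
      do i = max(1, d - ny), min(nx, d - 1)
        j = d - i
        ! cell (i,j) uses upstream values from (i-1,j) and (i,j-1), both of
        ! which lie on earlier diagonals and are therefore already done
        print '(a,i0,a,i0,a,i0,a)', 'diagonal ', d - 1, ': cell (', i, ',', j, ')'
      end do
    end do
  end program wavefront_sketch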

SNAP is written to the Fortran 90/95 standard. Modules are used to provide
explicit interfacing among the different procedures. Distributed memory
communications are performed using MPI commands, and threading is achieved with
OpenMP directives. [See the docs subdirectory for more detailed information.]