NERSC - Powering Scientific Discovery Since 1974

SNAP

Description

SNAP is a proxy application that models the performance of a modern discrete ordinates neutral particle transport application. SNAP may be considered an update to Sweep3D [1], intended for hybrid computing architectures. It is modeled on the Los Alamos National Laboratory code PARTISN. PARTISN solves the linear Boltzmann transport equation (TE), a governing equation for determining the number of neutral particles (e.g., neutrons and gamma rays) in a multi-dimensional phase space [2]. SNAP itself is not a particle transport application; it incorporates no actual physics in its available data, nor does it use numerical operators specifically designed for particle transport. Rather, SNAP mimics the computational workload, memory requirements, and communication patterns of PARTISN. The equation it solves has been composed to use the same number of operations, use the same data layout, and load elements of the arrays in approximately the same order. Although the equation SNAP solves looks similar to the TE, it has no real-world relevance.

Download

SNAP-May06 tar file (May 06, 2013 version)

How to Build

SNAP is written to the Fortran 90/95 standard. Modules are used to provide explicit interfacing among the different procedures. Distributed-memory communication is performed with MPI calls, and threading is achieved with OpenMP directives.
[See the 'docs' subdirectory for more detailed information]

To build SNAP, change to the 'src' subdirectory, edit the Makefile as needed, and type

make

How to Run

Because SNAP currently must be built with MPI, run it with a command similar to the following:

mpirun -np [#] [path]/snap [infile] [outfile]

The number of OpenMP threads is taken from the input file. That value overrides any environment variable that sets the number of threads.

The command line is read for the input/output file names. If one of the names is missing, the code will not execute. Moreover, the output file overwrites any pre-existing files of the same name.
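
As a minimal sketch of the run rules above, the following Python helper assembles the command line and refuses to proceed when either file name is missing. The function names and the './snap' path are illustrative assumptions and are not part of the SNAP distribution; only the 'mpirun -np' form shown above is used.

```python
import os


def build_snap_command(nranks, snap_path, infile, outfile):
    """Return the argument list for: mpirun -np [#] [path]/snap [infile] [outfile]."""
    if nranks < 1:
        raise ValueError("the number of MPI ranks must be at least 1")
    if not infile or not outfile:
        # SNAP will not execute unless both names are on the command line.
        raise ValueError("both an input and an output file name are required")
    return ["mpirun", "-np", str(nranks), snap_path, infile, outfile]


def will_overwrite(outfile):
    """True if running SNAP would overwrite an existing output file."""
    return os.path.exists(outfile)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(build_snap_command(96, "./snap", "in.small", "out.small")))
```

Checking will_overwrite() before launching is one way to avoid losing an earlier output file of the same name.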

The 'benchmark' subdirectory contains scripts for running the small and large problems, along with sample output files.

The scripts are configured for running on NERSC's Hopper machine. The number of MPI ranks can be modified to suit the Offeror's target architecture (this will require modifying npey and npez).

Required Runs

We require weak-scaling results, in which the problem size per processor (socket) stays constant regardless of the number of processors, run with 1 MPI rank per core and no OpenMP threads. Focus on 16x16x16 cells per processor on Hopper.

In *addition* to the above, you may adjust the number of MPI ranks and OpenMP threads to whatever gives the best performance on your architecture, and report that configuration and its results.

Problem sizing

A time-dependent problem requires two copies of the angular flux ('f' in the manual): one incoming and one outgoing for each time step. The angular flux arrays are sized by the number of cells (nx, ny, nz), the number of angles (8*nang), and the number of groups (ng). Memory requirements are as follows:

Memory in double precision words = 2 * (nx * ny * nz) * (8 * nang) * ng

Relevant problem ranges -- Single processor:
nx*ny*nz = ~1000-4000
nang = ~50-250
ng = ~30-100
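
The memory formula above can be evaluated with a short Python sketch. The function names are illustrative, and the conversion assumes 8 bytes per double precision word; only the two angular flux copies are counted, so the actual footprint of a run will be somewhat larger.

```python
BYTES_PER_WORD = 8  # assumption: one double precision word is 8 bytes


def angular_flux_words(nx, ny, nz, nang, ng):
    """Double precision words for the two angular flux copies."""
    return 2 * (nx * ny * nz) * (8 * nang) * ng


def angular_flux_bytes(nx, ny, nz, nang, ng):
    """Same quantity expressed in bytes."""
    return angular_flux_words(nx, ny, nz, nang, ng) * BYTES_PER_WORD


if __name__ == "__main__":
    # Example: a 16x48x48 mesh with nang=50 and ng=32.
    total = angular_flux_bytes(16, 48, 48, 50, 32)
    print(f"{total / 1e9:.2f} Gbytes for the whole mesh")
```

Dividing the result by the number of sockets gives the per-socket figure when the mesh is spread evenly.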

Large - 1 MPI rank per core, no OpenMP threads [located in the directory "large"]

As an example, for a 2048-node run of the Large problem on NERSC's Hopper (2 sockets/node, 12 cores/socket):
Scale npey*npez to the total number of ranks, 256*192 = 49152, with 12 MPI ranks per socket.
For the large problem we want around 16x16x16 cells per socket: 16x16x16 * 4096 sockets = approx. 16M cells. nx=16, ny=1024, nz=1152 gives 18874368 cells, keeping ny divisible by npey and nz divisible by npez.
for 2048 nodes, 2048*24 = (256*192) = (npey*npez) = 49152 cores,
for 49152 cores == large == 16x(4*256)x(6*192) == 16x1024x1152
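
The sizing rule in the Hopper example can be sketched in Python as follows. The function names are illustrative, and the defaults (nx=16, 4 cells per rank in y, 6 cells per rank in z) are assumptions read off the example above rather than SNAP inputs.

```python
def global_mesh(npey, npez, nx=16, ny_per_rank=4, nz_per_rank=6):
    """Global mesh (nx, ny, nz) for an npey x npez rank grid with fixed cells per rank."""
    return nx, ny_per_rank * npey, nz_per_rank * npez


def is_valid_decomposition(ny, nz, npey, npez):
    """ny must divide evenly among npey ranks, and nz among npez ranks."""
    return ny % npey == 0 and nz % npez == 0


if __name__ == "__main__":
    npey, npez = 256, 192            # 2048 Hopper nodes * 24 cores
    nx, ny, nz = global_mesh(npey, npez)
    ranks = npey * npez
    cells = nx * ny * nz
    print(f"{ranks} ranks, mesh {nx}x{ny}x{nz}, {cells} cells")
    print("cells per rank:", cells // ranks)
    print("valid:", is_valid_decomposition(ny, nz, npey, npez))
```

Holding the cells per rank fixed while npey and npez grow is what makes the series a weak-scaling study.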

Large problem memory requirements:
nx*ny*nz = 16x1024x1152 = 18874368 = 4608 per socket
nang = 200 (will be 200 per octant)
ng = 72
Memory = approx. 8.5 Gbytes/socket, 35 TB total

Small - 1 MPI rank per core, no OpenMP threads [located in the directory "small"]
As an example, for a 4-node run of the Small problem on NERSC's Hopper:
8 sockets, therefore 8*12 = 96 MPI ranks: npey=12, npez=8, nx=16, ny=48, nz=48
for 4 nodes, 4*24 = (12*8) = (npey*npez) = 96 cores,
for 96 cores == small == 16x(4*12)x(6*8) == 16x48x48

Small problem memory requirements:
nx*ny*nz = 16x48x48 = 36864 = 4608 per socket
nang = 50
ng = 32
Memory = 0.94 Gbytes/socket, ~7.5 Gbytes total

Timing

Report the "Solve" time.

Verification

If SNAP runs to completion you will see:
Success! Done in a SNAP!

Record the MMS Verification deltas as shown below:

MMS Verification
********************************************************************************
Manufactured/Computed Solutions Max Diff= 2.712446E-01
Manufactured/Computed Solutions Min Diff= 2.190662E-08
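
For automated runs, the completion message and the two MMS deltas can be pulled out of the captured output with a short Python sketch. It assumes the text passed in contains the lines in exactly the form shown above; the function name and the returned dictionary keys are illustrative.

```python
import re

# Matches lines such as: Manufactured/Computed Solutions Max Diff= 2.712446E-01
_DIFF = re.compile(
    r"Manufactured/Computed Solutions (Max|Min) Diff=\s*([0-9.+\-Ee]+)"
)


def parse_snap_output(text):
    """Return the success flag and the MMS max/min deltas found in text."""
    deltas = {kind.lower(): float(value) for kind, value in _DIFF.findall(text)}
    return {
        "success": "Success! Done in a SNAP!" in text,
        "max_diff": deltas.get("max"),   # None if the line is absent
        "min_diff": deltas.get("min"),
    }


if __name__ == "__main__":
    sample = (
        "Manufactured/Computed Solutions Max Diff= 2.712446E-01\n"
        "Manufactured/Computed Solutions Min Diff= 2.190662E-08\n"
        "Success! Done in a SNAP!\n"
    )
    print(parse_snap_output(sample))
```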

 

More Details on SNAP

The solution to the time-dependent TE is a “flux” function of seven independent variables: three spatial (3-D spatial mesh), two angular (set of discrete ordinates, directions in which particles travel), one energy (particle speeds binned into “groups”), and one temporal. PARTISN, and therefore SNAP, uses domain decomposition over these dimensions to coherently distribute the data and the tasks associated with solving the equation. The parallelization strategy is expected to be the most efficient compromise between computing resources and the iterative strategy necessary to converge the flux.

The iterative strategy consists of a set of two nested loops. These nested loops are performed for each step of a time-dependent calculation, wherein any particular time step requires information from the preceding one. No parallelization is performed over the temporal domain. However, for time-dependent calculations two copies of the unknown flux must be stored, each copy an array of the six remaining dimensions. The outer iterative loop involves solving for the flux over the energy domain with updated information about coupling among the energy groups. Typical calculations require tens to hundreds of groups, making the energy domain suitable for threading with the node’s (or nodes’) provided accelerator. [3] The inner loop involves sweeping across the entire spatial mesh along each discrete direction of the angular domain. The spatial mesh may be immensely large. Therefore, SNAP spatially decomposes the problem across nodes and communicates needed information according to the KBA method. [4] KBA is a transport-specific application of general parallel wavefront methods. Lastly, although KBA efficiency is improved by pipelining operations according to the angle, current chipsets operate best with vectorized operations. During a mesh sweep, SNAP operations are vectorized over angles to take advantage of modern hardware.
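
To illustrate the general wavefront idea that KBA builds on, the Python sketch below groups the ranks of an npey x npez grid by the earliest stage at which each can start a sweep that enters at corner (0, 0). This is a generic wavefront ordering written for illustration, not SNAP's implementation; the function name is an assumption, and the sketch ignores the pipelining over angles and the other sweep directions.

```python
def wavefront_stages(npey, npez):
    """Group ranks (j, k) by the first stage at which their upstream data is ready.

    A rank needs incoming data from (j-1, k) and (j, k-1), so for a sweep
    entering at corner (0, 0) it can start at stage j + k.
    """
    stages = [[] for _ in range(npey + npez - 1)]
    for j in range(npey):
        for k in range(npez):
            stages[j + k].append((j, k))
    return stages


if __name__ == "__main__":
    for stage, ranks in enumerate(wavefront_stages(3, 2)):
        print(stage, ranks)
```

The number of stages, npey + npez - 1, is the pipeline fill time in this simplified picture, which is why keeping many units of work in flight matters for sweep efficiency.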