
NERSC Initiative for Scientific Exploration (NISE) 2011 Awards

CACHE novel architecture research

Brian Van Straalen, Lawrence Berkeley National Lab

Associated NERSC Project: Algorithms and Software for Communication Avoidance and Communication Hiding at the Extreme Scale (m1270)
Principal Investigator: Erich Strohmaier, Lawrence Berkeley National Lab

NISE Award: 470,000 Hours
Award Date: June 2011

We plan to conduct experiments on scalability, parallelization strategies (such as OpenMP, MPI, GPU, and their hybrids), and general and architecture-specific code optimization techniques. Another benefit NERSC brings to this research is the ability to test the portability of autotuning techniques among HPC architectures, such as Intel-based architectures and Cray Opteron-based architectures. We estimate 120,000 core hours for our experimental study: evaluation of the strengths and limitations of autotuning = 15,000 core hours; experimental analysis on multiple problem classes on different architectures = 30,000 core hours; empirical analysis for new approaches = 30,000 core hours; tests on parallel autotuning approaches = 25,000 core hours; studies on autotuning GPU codes = 20,000 core hours.
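As a hedged illustration of the kind of empirical autotuning experiment described above, the sketch below (plain C with OpenMP; the grid size, kernel, and candidate block sizes are placeholders, not part of the m1270 codes) sweeps a set of cache-block sizes for a simple stencil sweep, times each variant, and keeps the fastest. Architecture-specific tuning would repeat this search on each target machine.

```c
/* Hypothetical autotuning harness (illustrative only, not the m1270 code).
 * Sweeps candidate cache-block sizes for a blocked Jacobi-style sweep,
 * times each variant with OpenMP, and reports the fastest. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 2048   /* grid dimension (illustrative) */

/* One blocked Jacobi-style sweep over an N x N grid with block size B. */
static void sweep(double *restrict out, const double *restrict in, int B)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jj = 1; jj < N - 1; jj += B)
        for (int ii = 1; ii < N - 1; ii += B)
            for (int j = jj; j < jj + B && j < N - 1; j++)
                for (int i = ii; i < ii + B && i < N - 1; i++)
                    out[j*N + i] = 0.25 * (in[j*N + i - 1] + in[j*N + i + 1]
                                         + in[(j-1)*N + i] + in[(j+1)*N + i]);
}

int main(void)
{
    double *a = calloc((size_t)N * N, sizeof *a);
    double *b = calloc((size_t)N * N, sizeof *b);
    int candidates[] = {16, 32, 64, 128, 256};   /* block sizes to try */
    int best_B = candidates[0];
    double best_t = 1e30;

    for (size_t k = 0; k < sizeof candidates / sizeof candidates[0]; k++) {
        int B = candidates[k];
        sweep(b, a, B);                      /* warm-up: touch all memory */
        double t0 = omp_get_wtime();
        sweep(b, a, B);
        double t = omp_get_wtime() - t0;
        printf("B = %4d  time = %.4f s\n", B, t);
        if (t < best_t) { best_t = t; best_B = B; }
    }
    printf("best block size: %d\n", best_B);
    free(a); free(b);
    return 0;
}
```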

Edgar Solomonik, Jim Demmel, and Brian Van Straalen will explore a novel approach to topology-aware multicasts and multicore threaded reduction kernels to achieve dramatic reductions in communication contention for dense linear algebra and 2.5D algorithms on Cray systems. We have already done similar work on BG/P systems. This will take at least 10,000 node hours; since allocations are charged by the core hour, we need about 200,000 additional core hours. Some of the preliminary work will be done on Franklin with the Portals interface, and the XE6 DMAPP interface will also be targeted. This addresses all three elements of the NISE call.
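The sketch below is a minimal, hypothetical illustration of how a 2.5D algorithm organizes its processes, assuming a q x q x c grid with q = sqrt(p/c) and replication factor c: data is replicated across the c depth layers, multicast along a row of each layer, and partial results are summed back with a depth reduction. The block size, replication factor, and communicator layout are assumptions for illustration; the row broadcast is the contended step where a topology-aware multicast (Portals on Franklin, DMAPP on the XE6) would be substituted for the generic MPI collective.

```c
/* Hypothetical 2.5D process layout sketch (not the m1270 implementation). */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int c = 2;                               /* replication factor (assumed) */
    int q = (int)llround(sqrt((double)(p / c)));   /* each layer is q x q */
    if (q * q * c != p) {
        if (rank == 0) fprintf(stderr, "run with p = q*q*c processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int layer = rank / (q * q);                    /* which of the c layers */
    int row   = (rank % (q * q)) / q;
    int col   =  rank % q;

    /* depth_comm: same (row, col) across layers; row_comm: one row of one layer. */
    MPI_Comm depth_comm, row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row * q + col, layer, &depth_comm);
    MPI_Comm_split(MPI_COMM_WORLD, layer * q + row, col, &row_comm);

    const int nb = 256;                            /* local block size (assumed) */
    double *block = malloc((size_t)nb * nb * sizeof *block);
    for (int i = 0; i < nb * nb; i++) block[i] = (layer == 0) ? 1.0 : 0.0;

    /* Replicate layer 0's block across the depth dimension ... */
    MPI_Bcast(block, nb * nb, MPI_DOUBLE, 0, depth_comm);
    /* ... multicast it along the row: the step a topology-aware
     * (Portals/DMAPP) multicast would replace ... */
    MPI_Bcast(block, nb * nb, MPI_DOUBLE, 0, row_comm);
    /* ... and sum the partial contributions back onto layer 0. */
    const void *sendbuf = (layer == 0) ? MPI_IN_PLACE : (const void *)block;
    MPI_Reduce(sendbuf, block, nb * nb, MPI_DOUBLE, MPI_SUM, 0, depth_comm);

    free(block);
    MPI_Comm_free(&depth_comm);
    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```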

Sam Williams, Noel Keen, and Brian Van Straalen are working on fine-grained threading of stencil kernels, automatic code transformations, and communication-hiding schemes on multicore architectures. All three were proposed separately, but it appears they must be combined for any of them to work at the finest point-wise parallelism scales. This work will likely require a similar allocation to that used for the ultrascale development and the APDEC SciDAC work, roughly 150,000 core hours. It involves hybrid programming, threading, code transformation, and novel approaches to multicore development, and is part of moving Chombo to exascale production computing.
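A minimal sketch of the communication-hiding pattern, assuming a 1-D decomposition of a regular grid rather than Chombo's block-structured AMR: nonblocking ghost-zone exchanges are posted first, the interior rows (which need no ghost data) are updated by OpenMP threads while the messages are in flight, and the two boundary rows are updated only after the wait completes. All sizes and names are illustrative.

```c
/* Hypothetical hybrid MPI+OpenMP stencil with communication hiding. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NX 1024          /* owned rows per rank (illustrative) */
#define NY 1024          /* row length (illustrative) */

static void update_row(double *restrict out, const double *restrict in, int j)
{
    for (int i = 1; i < NY - 1; i++)
        out[j*NY + i] = 0.25 * (in[j*NY + i - 1] + in[j*NY + i + 1]
                              + in[(j-1)*NY + i] + in[(j+1)*NY + i]);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int up = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int dn = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Rows 0 and NX+1 are ghost rows; rows 1..NX are owned. */
    double *in  = calloc((size_t)(NX + 2) * NY, sizeof *in);
    double *out = calloc((size_t)(NX + 2) * NY, sizeof *out);

    /* Post the nonblocking ghost-row exchange with both neighbors. */
    MPI_Request req[4];
    MPI_Irecv(&in[0],         NY, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&in[(NX+1)*NY], NY, MPI_DOUBLE, dn, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&in[1*NY],      NY, MPI_DOUBLE, up, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&in[NX*NY],     NY, MPI_DOUBLE, dn, 0, MPI_COMM_WORLD, &req[3]);

    /* Interior rows 2..NX-1 need no ghost data: update them while the
     * messages are in flight. */
    #pragma omp parallel for schedule(static)
    for (int j = 2; j <= NX - 1; j++)
        update_row(out, in, j);

    /* Ghost data must arrive before the two boundary rows are updated. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_row(out, in, 1);
    update_row(out, in, NX);

    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```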