NERSC: Powering Scientific Discovery Since 1974

NERSC Initiative for Scientific Exploration (NISE) 2011 Awards

Evaluating PGAS scientific graph analysis codes on the Gemini interconnect

Jason Riedy, Georgia Institute of Technology

Associated NERSC Project: Linear Algebra Algorithms on High Performance Computers (mp156)
Principal Investigator: James Demmel, University of California Berkeley

NISE Award: 250,000 Hours
Award Date: June 2011

Processing graph-structured data is central to many applications, ranging from disease tracking through social networks to financial regulatory monitoring. Graph-structured analysis is also an increasingly important performance bottleneck even in traditional scientific simulation codes. Current common programming systems for high-performance computers leave building and optimizing graph analysis codes to the few dedicated experts who still have a high tolerance for mental pain and anguish. We are expanding existing practice and evaluating new approaches on emerging programming and computing systems more amenable to high-performance graph analysis.

We propose porting our MPI-based, memory scalable weighted bipartite matching code to UPC and evaluating its performance on Hopper's Gemini interconnect. This memory-scalable code computes a static pivoting order used in SuperLU's scalable unsymmetric LU factorization.

The current MPI implementation is saddled with extra communication to coordinate semi-asynchronous tasks and uses extra memory to maintain explicit local views of global information. Our matching algorithm is based on an iterative auction algorithm. The extra MPI communication calls needed as the problem shrinks and moves across the graph (sparse matrix) defeat performance scalability, leaving our matcher only memory scalable in general.

UPC's more flexible memory model permits easier coordination as the problem size and active location change. We expect at least a moderate constant-factor improvement in performance on long-tailed iterations (e.g., on the sparse matrix av41092). UPC's relaxed memory consistency over the fine-grained Gemini network may reduce the communication and synchronization throughout our semi-asynchronous algorithm. Auction prices rise monotonically, and the algorithm tolerates inconsistent views during computation so long as only the best price is accepted. More fine-grained memory access will permit working with less memory overhead per process, enhancing our current memory scalability.
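To illustrate why stale price views are harmless, here is a minimal serial sketch of a forward auction for maximum-weight bipartite matching. This is an illustrative toy, not the project's distributed code; the function name, the `eps` bidding increment, and the dense weight matrix are assumptions for the example.

```python
def auction_match(weights, eps=0.01):
    """Serial forward-auction sketch (illustrative only).

    weights[i][j]: benefit of assigning bidder i to object j (square matrix).
    Returns a dict mapping each bidder to its object.
    """
    n = len(weights)
    prices = [0.0] * n            # object prices; these only ever increase
    owner = [None] * n            # current owner of each object, if any
    assigned = {}                 # bidder -> object
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # The bidder evaluates net value (benefit minus price) of every object.
        values = [weights[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=lambda j: values[j])
        second = max(values[j] for j in range(n) if j != best)
        # Raise the best object's price by the value gap plus eps. Because
        # prices rise monotonically, a bidder working from a stale price
        # merely wastes a round; it cannot corrupt the final matching.
        prices[best] += values[best] - second + eps
        prev = owner[best]
        if prev is not None:      # outbid the previous owner
            del assigned[prev]
            unassigned.append(prev)
        owner[best] = i
        assigned[i] = best
    return assigned
```

In a distributed setting the same monotonicity lets processes read prices with relaxed consistency, so long as the bid acceptance step compares against the authoritative price.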

Naturally, these are ideals that may not be achieved by all UPC implementations. Investigating them on an architecture like the XE6, which supports distributed PGAS operations at a low level, helps evaluate what needs to be improved in UPC implementations.

For baseline performance modeling and evaluation, we also propose to port our existing microbenchmarks and to implement a simple Graph500 graph traversal suitable for use as a reference code. The bipartite matching algorithm, an auction algorithm, resists straightforward performance characterization. Having predictable baseline results helps evaluate our matching code's performance optimizations.
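For reference, the Graph500 search kernel is a breadth-first traversal that produces a parent array, which the benchmark then validates. A minimal serial sketch (adjacency-list input and function name are assumptions for the example, not the benchmark's actual interface):

```python
from collections import deque

def bfs_parents(adj, root):
    """Level-order BFS producing the parent array a Graph500-style
    validator checks. adj: list of adjacency lists.
    Returns parent[v] for each vertex, -1 if v is unreached;
    the root is its own parent."""
    parent = [-1] * len(adj)
    parent[root] = root
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if parent[v] == -1:   # first visit claims v
                parent[v] = u
                frontier.append(v)
    return parent
```

Unlike the auction matcher, this kernel's work per level is easy to model from the frontier sizes, which is what makes it a useful baseline.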