NERSCPowering Scientific Discovery Since 1974


Massive Acceleration of New Techniques In Science with Scalable Algorithms


Scalable Statistics and Machine Learning Algorithms are essential for extracting insights from Big Data. Our interdisciplinary team is trying to address a number of challenging analysis problems from a number of science domains at Lawrence Berkeley National Lab (LBNL). We are developing novel algorithms, and using state-of-the-art methods from High Performance Computing, Scientific Data Management and Parallel I/O for running software at scale on NERSC resources.

Problem Domains

Science DomainData SourceData CharacteristicsDataset SizeAnalysis Requirements
DESI Multiple Telescopes Multi-band, Noisy, Artifacts O(100)TB Data fusion, Inference
Daya Bay Detectors Spatio-Temporal, Noisy, Artifacts O(100)TB Pattern/Anomaly Detection
Life Sciences Mass Spectroscopy Imaging Multi-Modal, Noisy, Artifacts O(1)TB Dim. Reduction, Clustering
Genomics Sequencers Missing Data, Acquisition errors O(1-10)TB Clustering, Pattern Detection
Climate Satellites, Simulations Multi-Variate, Spatio-Temporal O(100)TB Pattern Detection

DESI (Astronomy)

We are addressing the problem of extracting a unified object catalog of objects in the sky from multiple telescope datasets (Sloan Digital Sky Survey, Palomar Transient Factory, Wide-Field Infrared Survey Explorer). The catalog (with associated uncertainties) will be used for target selection phase for the DESI project. We are formulating the problem as a graphical model; the major challenge is in performing efficient inference on the trillion-pixel dataset. We will explore stochastic variational inference and distributed Markov Chain Monte Carlo approaches for parameter estimation.

Daya Bay

Daya Bay is a reactor neutrino experiment located in China, with a host of international collaborators. A range of anti-neutrino detectors capture signatures corresponding to neutron decay, supernovae explosions, cosmogenic activity and background radiation. The overall dataset is more than 500TB. Current approaches for analysis of the Daya-Bay time-series consists of a manual exploration of short time windows. We are interested in applying unsupervised machine learning technology for automatically inferring high level categories of event types.

Life Sciences (Mass Spectroscopy Imaging)

Mass-spec imaging datasets provide unprecedented fidelity in measuring spatial distribution and abundances of molecules at locations within a sample. MSI has widespread applications in life sciences, bio-engineering, medical diagnosis and scientific experiments. Unfortunately, the lack of scalable data analysis capabilities is hindering progress in this field. We are exploring randomized linear algebra techniques for processing raw TB-sized MSI data volumes consisting of billions of voxels. We are exploring CUR decomposition and Nystrom approximations of spatial kernels; with the goal of producing scientifically interpretable basis vectors and meaningful clusters.


Graph-theoretical methods are prevalent in genomics and forms the basis of genetic mapping, gene-expression analysis, and genome assembly. Constructing graphs from the input data involve computing pairwise similarities and runs in at least O(n2k) time for n points with k features. The graph construction can hence be costlier than subsequent analysis. We will explore methods to compute important graph kernels such as minimum spanning forests directly on the input data itself without computing all pairwise similarities. We are exploring metric data structures as well as techniques that exploit the underlying low dimensional geometry.


We are interested in mining multi-variate, spatio-temporal climate datasets for extreme weather patterns (Hurricanes, Extra-Tropical Cyclones, Atmospheric Rivers, Heat waves, Blocking events, etc). High-resolution climate datasets produce O(100) TB output per run; current CMIP-5 archives consist of 5PB datasets, posing major scalability challenges for analysis tools. We will explore DBN methodologies and evaluate their performance for supervised learning tasks on climate datasets. We will also extend these methods for operating on multi-variate datasets.

Project Team


  • Prabhat, Aydin Buluc, Ben Bowen (LBL)
  • Michael Mahoney, Jiyang Yang (UC-Berkeley/ICSI/Stanford)
  • Jon McAuliffe, Jeff Regier, Fritz Sommer, Urs Koester (UC-Berkeley Stats/Redwood)
  • Ryan Adams, Brenton Patridge, Oren Rippel, Albert Wu, Elaine Angelos (Harvard/MIT)

LBL Collaborators

  • David Schlegel, Dan Dwyer (Physics)
  • Peter Nugent, Craig Tull, Michael Wehner, Daithi Stone, Oliver Ruebel, Lenny Oliker (CRD)
  • Wahid Bhimji, Evan Racah (NERSC)
  • Trent Northern, Katherine Louie (Life Sciences)
  • Dan Roshkar, Jarrod Chapman (JGI)

External Collaborators

    • Joey Gonzalez, Rashmi K V (UCB AMPLab)
    • Matt Hoffman (Adobe Research)
    • Dustin Lang (CMU)
    • Mohit Singh, Nadathur Satish, Narayanan Sundaram, Mostafa Patwari (Intel Research)
    • Amir Khosorowshahi