Data Intensive Computing Pilot Program 2012 Awards
NERSC's new data-intensive science pilot program is aimed at helping scientists capture, analyze and store the increasing stream of scientific data coming out of experiments, simulations and instruments.
High Throughput Computational Screening of Energy Materials
Gerbrand Ceder, Massachusetts Institute of Technology
NERSC Repository: matdat
The project plans to use 10 terabytes of NERSC project disk space, 20 terabtyes of archival storage, 10 million core hours on Carver, flash storage, and Carver's 1 terabyte memory nodes.
This project plans to bring data analysis inside the computational workflow of high-throughput computational surveys in materials science. If this can be automated in a sufficiently efficient and scalable way we propose an end result that puts data in the driver's seat where researchers interact solely with a growing and responsively dynamic knowledge base of materials while the compute-centric aspects happen automatically in the background.
The HPC community is familiar with simulation science that produces large data sets. Traditionally these have been from large scale simulations the inputs to which are driven by experts asking a question about a certain particular system. This proposal concerns the case where it is instead data which drives the simulation. High-throughput computational (HTC) surveys in materials science traverse large spaces of possible materials and use a simulation engine to record the properties of the materials as the survey progresses. With modern data technologies such as flash-storage and large memory (~1TB) nodes the possibility arises to continually review the body of produced data to adapt the workflow in interesting and productive ways. The most basic is to advise local search directions in the materials space. At a larger scale homology and consistency “all-to-all” checks can identify global directions for search, patterns of errors or discrepancies in simulation, and advise computational and algorithmic switching as boundaries in the search space are crossed that define applicability or efficiency of methods. Putting data in the driver’s seat is a clear direction to explore, but the impact on the overall HTC workflow must be minimized, making good use of “fast data” technologies in order to not impede the computational progress. The alternative, the current state of the are, is to engage in course grained steps of computation, data analysis, and subsequent revised computation. Off-line data analysis typically takes place with experts in the loop and informs the next computational "push".
By scalable non-relational databases (such as MongoDB, which is already used in a post processing mode) and perhaps basing them on flash filesystems we may be able to answer complex data heavy queries with low enough latency to advise a large number of compute workers about their next steps. Parallel databases such as sharded MongoDB are key prospects in this direction. The questions asked are both forward, i.e., about next material(s) to be computed, but also present-tense regarding the likelihood of the current simulation to complete based on the known material homologs in the database. We view this research as a benchmarking effort for new data centric technologies that measures the limits of how HTC computational surveys may be designed. The outcomes should be useful to both simulation scientists and HPC providers.
This project will use PyMatGen: python materials genomics is the python library that powers the Materials Project (http://www.materialsproject.org). It provides a framework for materials analysis with flexible classes for the representation of Element, Site, Structure objects, extensible io capabilities to manipulate many VASP input and output files and the crystallographic information file formats, tools to generate and view compositional and grand canonical phase diagrams, and electronic structure (DOS and band) analysis.
The project will also use a scaled up MongoDB database (preferably sharded to provide parallelism) and the existing materialsproject database. By benchmarking the time to complete an all-to-all comparison on a new material we can estimate the latency introduce to the computation on a per material basis. From there we can look at overall scalability (modulo contention) for a stream of requests arriving from compute workers being fedback in a real-time basis. A new science gateway that has reports and data specific to methodological details about the technologies themselves will be created.
Analysis and Serving of Data from Large-Scale Cosmological Simulations
Salman Habib, Argonne National Laboratory
NERSC Repository: hacc
The project plans to use up to 300 terabytes of NERSC project disk space, 750 terabtyes of archival storage, 2 million core hours on Carver, flash storage, and Carver's 1 terabyte memory nodes.
Cosmology is undergoing a period characterized by remarkable discoveries and explosive growth in the power and number of interconnected cosmic probes. An essential aspect of these "cosmological experiments" is that the Universe is part of the apparatus, significantly blurring the line between theory and experiment. Cosmological simulations are an essential aspect of the connection between theory and observations. As simulations become ever more complex, they will begin to rival in size and complexity the data produced from sky surveys using ground and space-based telescopes
covering a wide range of wavebands. The analysis and serving of the data products from simulations will greatly amplify the science return from theoretical and observational projects now being carried out by the community, and those planned for the near future.
Large-scale cosmological simulations are a vital component in the effort to extract the maximum science from ongoing
and proposed next-generation sky surveys (e.g. BOSS, BigBOSS, DES, LSST, Planck, SPT). These simulations play three essential roles:
- they provide direct means for cosmological discoveries that require a strong connection between theory and observations ("precision cosmology");
- they are an essential "tool of discovery" in dealing with large datasets generated by complex instruments, and,
- are a source of high-fidelity simulations that are necessary to understand and control systematics, especially astrophysical systematics. In all three use case, the simulations create a significant need for a large data storage and analysis capability.
Simulation data can be usefully partitioned into three classes. Level 1 is the raw data produced within the simulations, Level 2 -- the 'science' level is reduced data produced via in situ or post-processing analyses and Level 3 is a further reduction from Level 2 to the 'catalog' level suitable for query-based analysis exploits and other simpler analysis tasks. Large-scale simulations are typically carried out by HPC experts but their results, suitably exposed, heavily impact the science carried out by the much larger sky survey community. An important aim of this project is to facilitate a much closer connection between the simulation and survey analysis communities.
The simulation data products at Levels 2 and 3 can be used for analysis of observational data from surveys. We will work with observational teams from BOSS, DES, and SPT to directly impact cosmology. This work will be related to topics such as covariance analyses for baryon acoustic oscillation measurements (BOSS), weak lensing measurements (DES), and CMB lensing and cross-correlations (SPT). We will also produce mock catalog data for future LSST observations, primarily targeted at weak lensing probes of cosmology.
In this project, we propose to work primarily with Level 2 data from large simulations carried out at NERSC on Hopper, at ANL on the new 10 PFlops Mira system (we have an early science project on Mira), and later on with data from Jaguar/Titan at ORNL. We will "seed" the project with current simulation results from our "Coyote Universe" code suite,
roughly 20 TB of data.
The data-intensive resources will be used to carry out three main tasks: 1) analyze and reduce Level 2 data to Level 3, and 2) organize Level 2 and Level 3 data so that external users and collaborators can interact with the datasets via a web portal for both serving and analysis, and 3) develop an analysis framework that will allow the users the ability to run "canned' analyses under their (parametric) control through the web portal. Some of this analysis may be possible
on the portal front-end, while some of it may require running routines on a back-end cluster.
The main simulation source for the data will be the HACC (Hybrid/Hardware Accelerated Cosmology Code) framework designed to carry out very large cosmological N-body simulations. HACC is currently operational on a variety of supercomputer architectures (CPU, CPU/Cell, CPU/GPU, IBM BG/P and BG/Q) and can scale to machine sizes significantly larger than those available currently.
The data analysis methods cover graph-theoretic algorithms applied to particle data as well as grid-based algorithms (spectral and configuration space). HACC contains an in situ and post-processing analysis framework (e.g., halo/sub-halo finder, statistical analysis, lensing maps). The halo merger trees from HACC will be further ingested by the semi-analytic code Galacticus to produce galaxy catalogs.
A major science activity in our project is the analysis of a large number of cosmological simulations run at different values of cosmological parameters. I/O bandwidth has been a continuous problem because of the large amount of data that has to be moved in and out of computational analysis resources (TBs). The availability of Flash memory should help considerably with this, and we will be happy to try out and exploit this resource.
Most of our parallel analysis tools are MPI-based; nevertheless there are cases where a single large memory node can be useful, especially when using OpenMP-based analysis tools (a good mode for prototyping analysis methods and algorithms).
We are very interested in using a web interface to provide access to our data. The web interface would support different operational modes:
- for serving public data (open to all) and private data (to collaborators),
- allowing the capability to run pre-loaded analysis packages on targeted data 'objects', with analysis parameters under the control of the user,
- allowing collaborators the possibility of initiating analysis runs on major NERSC resources (via the usual account system).
Additionally, we would hope to develop the web interface so that action can be initiated on remote data at remote sites (say at ANL or ORNL).
We will begin with a relatively simple database structure relying on initial work for the Coyote Universe dataset. This will evolve over time depending on the use patterns and science cases of interest. We expect to work with NERSC experts to decide on the appropriate database technologies available.
Interactive Real-time Analysis of Hybrid Kinetic-MHD Simulations with NIMROD
Charlson C. Kim, University of Washington
NERSC Repository: m1552
The project plans to use up to 100 terabytes of NERSC project disk space, 10 terabtyes of archival storage, 0.7 million core hours on Carver, flash storage, and Carver's 1 terabyte memory nodes. The use of Hadoop will be investigated as well.
Wave-particle interactions are critical to understanding plasmas in high temperature fusion regimes. The presence of energetic particles is known to stabilize instabilities such as the internal kink mode beyond their expected ideal stability. The internal kink mode is the magnetohydrodynamic (MHD) analogue of the kinking that results in a twisted rope. As in the case of the rope, the internal kink is a release of the built up energy. The extended stability leads to a greater build-up of stored energy until another stability threshold is reached resulting in a rapid release of energy known, the sawtooth crash. This build-up and crash are not well understood due, in part, to the complex dynamics of energetic particles and their sensitivity to details of geometry and distribution function. Other energetic particle interactions can destabilize the plasma such as the toroidal Alfven eigenmode (TAE).
This proposed research will attempt to elucidate the details of these interactions by exploring the vast data generated by hybrid kinetic-MHD simulations that model these phenomena. Understanding these interactions are critical to provide a sound basis for future efforts towards predictive modeling of energetic particle effects in fusion experiments such as ITER.
This project will perform real-time interactive analysis of energetic particle interactions with MHD instabilities on both an intra-timestep basis (one timestep) and inter-timestep basis (multiple timesteps) which will require the accumulation of millions-to-billions of particle over hundreds-to-thousands of timesteps for each simulation. Utilizing the large capacity storage and fast I/O capabilities offered by the Data Intensive Program, we will have the freedom to write, store, and analyze temporally and statistically resolved energetic PIC data, typically limited to infrequent restart files or a small subset of tracers. We will perform these simulations using the hybrid kinetic-MHD model in the NIMROD code[C.C. Kim, PoP15, 072507(2008), C.R. Sovinec, et al., JCP195, 355(2004)]. This analysis will be performed using the extensive capabilities offered by VisIt.
Analysis of PIC simulations have been limited to 0-D diagnostics such as growth rates and real frequencies and profiles and contour plots, leaving the bulk of the data under analyzed, particularly in comparison to the massive computing power that is required to generate the volumes of data. Due to the limits of practical storage and I/O bandwidth, the vast amounts of data generated by PIC simulations are, at best, integrated over and reduced to lower dimensional moments, for example, 0D time-history traces or profiles and contour plots. Interactive real time analysis of the PIC data is prohibitively expensive and usually impractical for anything but a small subset of tracers determined at launch time, restricting the available statistics and phase space resolution.
This research will apply new diagnostics techniques coupled with advanced visualization tools offered by VisIt and capabilities offered by the Data Intensive Program. We will utilize FastBit indexing on the H5Part formatted PIC data and use the Parallel Coordinate Analysis Tool to explore correlations in the 6 dimensional phase space (x,v) and identify trends and correlations in particle interactions with the instability. By applying the Selection/Subset capabilities integrated into the Parallel Coordinate Tool in VisIt, we can identify and propagate critical subsets of the energetic particles to visualizations of the PIC with the fluid instabilities. Availability of the temporally and statistically resolved PIC data coupled with these advanced visualization tools and resources will allow interactive exploration of the data at an unprecedented level. We believe this in-depth interactive analysis of the PIC data coupled with advanced visualization will lead to greater physics insight and provide better signatures to identify energetic particle interactions that are lost in the reductions typically necessary.
Both the Flash file system and 1TB memory node will be critical to accessing and interactive real time analysis of the large volumes of PIC data. Both memory and I/O speed have been the bottleneck in past attempts at interactive real time analysis of even modest volumes of PIC data.
Conversations with VisIt developers have indicated an interest in applying Hadoop as either a replacement of augmentation of the FastBit indexing that is required to parse through the PIC data. FastBit has been critical to any PIC analysis using VisIt for any but trivial volume sizes.
Next-generation genome-scale in silico modeling: the unification of metabolism, macromolecular synthesis, and gene expression regulation
Bernhard Palsson and Joshua Lerman, University of California San Diego
NERSC Repository: m1244
The project plans to use 5 terabytes of NERSC project disk space, 1 terabtye of archival storage, and 20 million core hours on Carver. A database will be used for storing components/interactions/functional-states omics data and the large-scale biochemical network reconstructions. Currently, our database is a relational database in PostgreSQL, but will need to migrate to a different platform to accommodate a large amount of newly generated omics data.
Systems analysis of growth functions in microorganisms is rapidly maturing. The use of experimental, mathematical, and computational techniques in a synergistic, iterative process is increasingly gaining acceptance as the optimal method for improving our understanding of emergent behavior in biochemical systems.
The primary goal of our computational and experimental research program is the prediction of phenotype from genotype. Genome-scale metabolic network reconstructions have been under development in our group for the purpose of predicting phenotype for over 10 years. The metabolic network reconstruction process is now at an advanced stage of development and has been translated into a 96-step standard operating procedure. Mathematical representations of metabolic network reconstructions have found a wide range of applications. In facilitating further progress, there is a great need for a robust computational infrastructure capable of supporting multi-scale genome annotation, integrated network reconstruction, constraint-based analysis, and omics data mapping. Unfortunately, this infrastructure is traditionally not available to individual research groups primarily focused on conceptual development.
We have created the next generation of predictive models and it moves far beyond metabolism to cover many other operational features of microbial genomes (including gene expression) with high chemical and genetic resolution. In a major departure from the past, the construction and application of these networks depends heavily on omics data analysis and integration. Omics data sets describing virtually all biomolecules in the cell are now available. These data can be generally classified into three distinct categories: components, interactions, and functional-states data. Components (or parts) data detail the molecular content of the cell or system, interactions data specify links between molecular components, and functional-states data provide insight into cellular phenotype.
The successful integration of the first two types of data (parts data and interactions data) results in the underlying biochemical network. The final type of data, functional-states data, is used to drive discovery. For example, we can test the completeness of a network by supplying data regarding the phenotypes rendered by the knockout of various genes. In cases where the model predicts no growth but growth occurs experimentally, the failure is likely due to the incompleteness of the model. More advanced uses such as iteratively designing a strain for metabolic engineering are envisioned, but are more computationally intensive.
How resources will be used:
- Access to a database for storing components/interactions/functional-states omics data and the large-scale biochemical network reconstructions. Currently, our database is a relational database, but will need to be updated to accommodate a large amount of newly generated omics data.
- 30 million NERSC MPP hours to facilitate building the networks (many steps involve multi-omics data analysis) and computing solutions to the models using 80- or 128-bit or exact precision.
Most our our analysis will be comprised of database queries, linear programming instances, and sequence alignment algorithms.
While our model is not strictly linear, we have formulated model simulations as a set of linear programming problems. These problems sample and search portions of the solution space over the nonlinear variables. Thus, much of our computation will comprise of many instances of linear programming problems, most of which can be executed in parallel.
Each problem instance is generated through queries of our model database. Thus, hosting of our database on NERSC is important for our project. As the resulting constraint matrix is ill-scaled we utilize 80-, 128- bit and exact solvers, including qsopt_ex (esolver) and soplex. The ill-scaled nature of problem instances and the need for many parallel simulations in a given experiment make this model much more computationally intensive than its predecessors.
To analyze sequencing data in the context of the model, the data will first be aligned using Burrows-Wheeler transform -based methods and the results stored in our database. Storing sequencing results and model content in the same database allows them to be directly queried in relation to each other. The sequencing results then can be compared to model simulations and modulate optimization problem formulation.
Transforming X-ray Science Toward Data-Centrism
Amedeo Perazzo, Stanford University
NERSC Repository: lcls
The project plans to use 20 terabytes of NERSC project disk space, 100 terabtyes of archival storage, 2 million core hours on Carver, flash storage, and Carver's 1 terabyte memory nodes.
X-rays, with wavelengths that are even smaller than a molecule, are ideal for imaging at the atomic scale. Molecules and atoms react quickly to forces that act on them. Chemical reactions, in which molecules join or split, can take place in a few femtoseconds. The ultrafast LCLS X-ray flash captures images of these events with a shutter speed of less than 100 femtoseconds, much like flashes from a high-speed strobe light, enabling scientists to take stop-motion pictures of atoms and molecules in motion. Another critical advantage of the intense coherent beam at LCLS is the ability to take diffraction snapshots of nanocrystals and single molecules. This ability will, for example, expand our understanding of a large number of membrane proteins which are characterized by the difficulty of growing large well-diffracting crystals. These features of the LCLS beam can dramatically improve our understanding of the fundamental processes of matter and life. One of the key challenges of the LCLS experiments is the ability to acquire, filter, analyze and visualize the huge amount of data created by these experiments. Only by solving this challenge we'll be able to fully exploit the unique features of the LCLS beam. This pilot may inform other lightsource to HPC transitions and also guide HPC providers in impact of cutting edge data-facing technologies.
The LCLS experiments have demonstrated the critical role that computing is playing in revolutionizing photon science from data collection to the visualization of the structures. Particularly of note are the data reduction challenge, to find a few usable events out of the data collected, and the classification challenge, which allows the individual patterns to be assembled into a 3-D data set.
In the next couple years the size of the LCLS detectors is expected to increase by a factor of four and the fraction of useful events to improve by a factor ten. This will cause a deluge of data requiring advances in the areas of data storage, data processing and data management.
The goal of this proposal is twofold: investigate the potentially critical improvements offered by systems which don't exist at SLAC, but which are available at NERSC, like flash based storage or large memory machines, and explore methods and techniques to allow LCLS users to run, seamlessly as possible, their analysis on non-SLAC systems.
We propose to initiate this collaboration by focusing on one specific LCLS experiment, the study of the 'Structure and Mechanism of the Photosynthetic Mn4Ca Water-Splitting Complex: Simultaneous Crystallography and Spectroscopy' (LCLS proposal 498, Vittal Yachandra PI).
Use of the X-ray free-electron laser is necessary for this experiment since it allows to outrun the radiation damage that would normally accrue to the catalytic cluster at a conventional X-ray synchrotron source. Yet there is a significant price to pay in terms of the challenge of data reduction. A predominant concern is the sheer number of crystal specimens (> 10^5 individual diffraction images) required to sample the diffraction pattern in all orientations and with sufficient measurement accuracy. A related issue arises from the underlying physics, which causes the individual Bragg spots to be sliced (only partially sampled) on any given image; thus it is a challenging numerical problem to reduce these numerous partial measurements in the raw data to a self-consistent list of intensities for structure solution. Performing this data reduction in an optimal, robust manner has not yet been achieved, but there is a clear requirement for future general applicability and use.
Most of the LCLS analysis operations are currently sequential and single-pass by necessity. That’s the most effective way to analyze large amount of data given the disk based storage technology we have available. One could envision more effective approaches where multiple passes, or even random accesses, are performed on the data to take advantages of features extracted from previous analysis or to perform cross correlation calculations among different events. A flash based storage system would allow much more efficient random access and it could store an entire data set at once. A terabyte memory machine could store an entire run or even the whole data set after an effective data selection step.
The project also plans to build a web interface to provide access to the data and to the analysis.
Data Processing for the Dayabay Reactor Neutrino Experiment's Search for Theta_13
Craig Tull, Lawrence Berkeley National Laboratory
NERSC Repository: dayabay
The project plans to use up to 200 terabytes of NERSC project disk space, 800 terabtyes of archival storage, and 11 million core hours on Carver. The project's codes access a MySQL database in order to retrieve configuration and calibration information. Placing this database on a high performance node within the Data Intensive Computing Pilot Program will improved the performance of these programs.
The standard model of particle physics is used to describe how sub-atomic particles, such as electrons, interact. There are around two dozen parameters that are inputs to this model, i.e. quantities that the model can not predict but rather their measured values are used to generate predictions.
Until recently all but five of the parameters were known, but now, using its first two months of its data, the Dayabay experiment has measured one of these unknowns, named theta_1_3, sufficiently accurately to state that it is non-zero. In order to improve the accuracy with which this parameter is known the experiment is planning on taking data for another three years. The better this value is known the more accurate the predictions by the standard model can be, leading to: constrains on the remaining unknown parameters; the exploration of new physics beyond the standard model; and the design of the next generation of experiments for studying matter and antimatter symmetry.
As the Dayabay collaboration is a collaboration of scientists from U.S.A., China, Russia, and Czech Republic it not only provides an opportunity to refine the standard model, but it also provides an opportunity for international culture and scientific exchange.
The resources will be used to handle the main processing and reprocessing of data collected by the Dayabay Reactor Neutrino Experiment. This experiment is already up and running and producing physics data at the rate of 3.5MB/s using 3/4 of its final number of detectors. The final data rate, expected later this year, is 4.5MB/s. Over the next 1 1/2 year this will lead to a total of 250TB of raw data (including 50TB of data taken over the first six months of running). Processing this data to produce quantities that physicists can consume is done with a production executable that current runs at 0.5 GB/hr. The processing of the initial 2 months of data from the experiment, in order to publish its first two results, has caused disk access degradation and network interface saturation. Given the increase in data over the experiments remaining lifetime, using the Data Intensive Computing Pilot Program will ensure timely analysis and publication of its results.
The physics data files for the experiment are collected at the Dayabay reactor in China and transferred NERSC within 20 minutes of the file closing. One portion of the resources, 400K hours, will be dedicated to executing immediate production processing in order to provide near real-time feedback to the researchers on site in order to recognize any issues with data collection. This immediate production will also allow researchers everywhere to run their own analysis codes against the most recent data.
As well as data taken at Dayabay, significant amounts of Monte Carlo data are required in order to provide an understanding of the features seen in the experimental data. This Monte Carlo data will be generated both at NERSC and other collaborating laboratories such as BNL, University of Wisconsin and Princeton, and gathered together at NERSC for storage and analysis. It is expected to generate about the same amount of data as taken by the experiment, i.e. 250 TB, using around 5M hours for the portion of data generated at NERSC.
The main code used by Dayabay is the Gaudi Framework. This C++ codebase was originally created for the LHCb and Atlas experiments on the Large Hadron Collider, but has since be adopted by many other experiments in High Energy Physics. Both the production processing and Monte Carlo generation use this framework to execute algorithm written by the experiment's researchers.
The production processing will calibrate the experimental data and create ROOT output files containing physical quantities researcher will use as input to their analysis. This provides a consistent basis from which researchers can compare analyses.
The Monte Carlo executables use the GEANT4 library to simulate the detector and is interactions outputting ROOT files similar to the ones created by production, but augmented with 'truth' information.
The submission of jobs is managed by code developed by the experiment, psquared, that keeps track of which data files have been processed by which configuration of the code, the state of those jobs and whether there are issues with particular nodes or sub-sets of the data.
The Dayabay Experiment uses a NESRC science gateway to provide the results of immediate production processing to researchers:
John Wu, Lawrence Berkeley National Laboratory
NERSC Repository: m1248
The project plans to use up to 100 terabytes of NERSC project disk space, 300 terabtyes of archival storage, 1 million core hours on Carver, flash storage, and Carver's 1 terabyte memory nodes.
A revolution in scientific data analysis is upon us, driven by a combination of large scale scientific projects combined with the need to contextualize scientific data at unprecedented scales. This revolution has been witnessed within Google Earth where millions of users understand their world through the combination of various information sources overlaid on three-dimensional visualization of every square meter of the globe. However, in science, contextualization must be quantified: visualization is insufficient. The ability of a system to combine high-quality scientific data from multiple sources and semantic classification of features on the surface of the Earth within a common interface will enable scientists to interact quantitatively with large scale data providing not just an overlay, but a true correlation between global features and scientific observation.
Multimode sensor data is widely available and extensively used in science and technology. For example, physics experiments and astronomy observations utilize specially designed instruments to gather a large amount of data for scientific exploration. Some buildings are instrumented with many sensors for capturing data including temperature, humidity, electricity usage and so on. In national security applications, multiple sensors are deployed for activities such as border surveillance and interdiction of special nuclear materials. In these use cases, one sensor can be regarded as the primary sensor, while the others as the secondary sensors that provide contextual information. For example, a nuclear science application might have the gamma-ray sensor as the primary feed, while the video recorder, LIDAR, infrared sensors, and GPS are regarded as secondary sensors.
An environmental energy application might regard the infrared sensor as the primary and others including the gamma-ray detectors as the secondary. To understand the output from the primary sensor, application scientists often review the secondary sensor data and manually annotate the context for the analysis operations. As these multimodal sensors grow in capability and popularity, there is an urgent need to automate the annotation and data integration process.
The project combines the domain expertise from Berkeley Lab's Environmental Energy Technologies and Nuclear Science Divisions (EETD and NSD), and computer expertise from the Computational Research Division (CRD) to solve the large data analysis problems from urban-scale energy and environmental challenges. Instead of Street Views, we aim to develop a digital data globe.
Large quantities of data continue to be generated and stored as output from complex and expensive experiments. To preserve the meaning of this data for future analysis, extensive contextual information is required. On a global scale, contextual information can be gained by multimode sensors from mobile or satellite platforms. For the purposes of scientific research, contextual data must be quantified in a manner that relates to the scientific question at hand. Our approach is to develop a common framework for storing and analyzing multimode sensor data such that researchers may develop feature extraction algorithms geared to their specific scientific questions. An initial source of multimode data will be the NSD MISTI truck, which has high qualify detectors in infrared and gamma-ray spectra. This data can be used to analyze energy inefficiency of buildings and nuclear background radiation at large scale. By working together with both NSD and EETD, a proof of principle demonstration will be constructed.
Many of our analysis algorithms will be I/O bound and can benefit from the faster I/O rates offered by the Flash file system. We will also be able to more efficiently process the data with more memory by loading more data into memory and avoiding I/O operations.
Integrating Compression with Parallel I/O for Ultra-Large Climate Data Sets
Jian Yin, Pacific Northwest national Laboratory
NERSC Repository: m1551
The project plans to use 20 terabytes of NERSC project disk space, 100 terabtyes of archival storage, 10,000 core hours on Carver, flash storage, and Carver's 1 terabyte memory nodes.
The rapid advance in high performance computing makes it possible to simulate climate in high resolution to achieve more realistic and reliable simulations. However, high-resolution climate simulations can generate a huge amount of data. For instance, by 2015 it will be possible to do short term 2 km resolution simulations in Global Cloud Resolving Models, where a single time step of a three-dimensional variable requires 256 GB of storage. The huge amount of data can easily overwhelm the current data analysis and visualization tools and thus prevents climate scientists from extracting insight from these data. Integrating compression with parallel I/O can give us the capability to process the data several orders of magnitude higher than we can currently do. This capability can enable climate scientists to make important discoveries.
In this project, we will investigate how to integrate compression into parallel I/O to enable us to process ultra-large climate datasets. Although High-resolution climate models can achieve realistic and reliable simulations, the huge amount of data generated from these simulations can easily overwhelm the current climate data analysis and visualization tools. Integrating compression into parallel I/O can give us the capability to process data several orders of magnitude higher than we can do currently.
High-resolution climate data offer great opportunities for compression. Our previous work shows that we can potentially reduce data volume by orders of magnitude through compression. Additionally, we observed the performance gaps between CPU, networks, and I/O devices continue to widen and climate data analysis and visualization applications are generally I/O bound. Compression can exploit idle CPU cycles to reduce the amount of data to be transferred over networks and I/O devices and hence increase effective network and I/O bandwidth.
However, a naïve approach that first decompresses the data and then invokes analysis and visualization tools cannot reduce end-to-end application running times because this approach only adds the decompression delay. Tightly integrating compression with parallel I/O to overlap decompression with I/O operations is essential. We plan to divide the original data into multiple blocks and each block can be compressed and decompressed independently. We can pipeline those blocks through different pipeline stages, which can include retrieving blocks from storage devices, transferring blocks over networks, decompressing blocks, and processing blocks. Prefetching and caching can also be used to eliminate decompression latency.
However, there are various design options to employ pipelining, prefetching, and caching. There are also many parameter settings such as block size. It is difficult to predict which design options and parameter settings would achieve better performance without experimental evaluation. We plan to use NERSC resources to evaluate various design options and parameter settings in term of performance and scalability. The experimental results will allow us to choose the best design to achieve the highest end-to-end application performance.
Some of our analysis and visualization tools are built on MOAB and the PnetCDF library, which in turn use MPI-IO for collective I/O. We must modify both the PnetCDF and MPI-IO layers to enable effective pipelining, prefetching, and caching. Another set of tools use Hadoop.
We will use the NERSC resources for performance and scalability studies of our systems. We will compare a set of design options and parameter settings. The hardware configuration can affect which design option and parameter setting can perform better. We will experiment with various hardware configurations, including using the Flash file systems and using nodes with various amount of memory, to understand the effects of various hardware configurations.