Bird's Eye View - SciDB Testbed at NERSC Pioneers a Highly Usable Big Data Analytics Infrastructure
Motivation? It’s painful to manage and analyze terabytes of data.
Solution? SciDB is a parallel database for array-structured data, well suited to terabytes of:
- Time series, spectra, imaging, etc.
The greatest benefit of SciDB is usability: use HPC hardware without learning parallel programming and parallel I/O.
Motivation - Pain Points of Data Analytics for Science
Current DOE experimental facilities, such as light sources, telescopes, and sequencing centers, are producing data at ever larger volumes and higher rates. Some of these facilities have reasonably prepared for this onslaught of data, while others are just beginning to grapple with the challenge. However, none of them is in a position to fully exploit the scientific potential of these data sets. The three major problems these science projects face are managing, analyzing, and sharing the data (across the collaboration and with the public). As the data volume and variety grow, several challenges emerge:
- Traditional file-system-based solutions can no longer efficiently store and access large numbers of small files under random I/O patterns, and efficiently filtering large data sets is a difficult task.
- As the analytic algorithms used on big data have become more complex, implementing them efficiently with respect to parallel I/O and computation has become more difficult.
- Sharing data is no longer feasible in many cases due to the data volume. Web-based portals that enable "smart queries" on large data sets are the future of data sharing.
Moreover, with the advance of detector technologies, a small experimental group can generate an extremely large amount of data, and the number of such groups with access to large data sets is growing quickly as well. Scientists are burdened with data management tasks (filtering and querying) in addition to complex analytic tasks (modeling, machine learning, visualization, etc.). Currently these emerging projects are developing their own data management and analysis workflows, which often produces sub-optimal results as the wheel is reinvented repeatedly and poorly.
A unified, scalable, easy-to-use, and extensible system is needed to tackle these extreme data challenges. Hosting such a solution at NERSC is especially important given the number of data-intensive projects there.
SciDB as a Solution for Array-Structured Data
Scientific data contains two major parts: structured data (e.g., an image) and unstructured metadata (e.g., the detector parameters when that image was taken). Some solutions (such as MongoDB) are available to manage and query unstructured metadata at scale, where the metadata is less well organized but smaller in size. The key to scientific data analysis is to efficiently query, filter, aggregate, or perform more complicated operations on the structured data. Furthermore, such a solution should be able to take advantage of the hardware available at current and future DOE high-performance computing (HPC) facilities. While not all structured data has the same form, many experimental facilities are based on detectors that generate array-based structured data (e.g., CCD images) or on analysis techniques that naturally fit into arrays (such as comparative genomic hybridization). In this project we evaluate SciDB as a unified, scalable, and easy-to-use solution to manage, analyze, and share array-based structured scientific data at extreme scale.
SciDB is an open-source database system designed to store and analyze extremely large array-structured data. Some examples of array-structured data include:
- Imaging data: digital pictures from light sources or telescopes
- Time series data collected from sensors
- Spectral data produced by spectrometers and spectrographs
- Graph-like structures representing relations between entities (sparse matrices)
The benefits of SciDB include scaling to hundreds of nodes, ease of deployment on commodity hardware or large DOE HPC systems, efficient parallel I/O, a large variety of built-in generic analysis tools, and ease of integrating new analytic algorithms that transparently gain efficient I/O and parallelization.
SciDB Testbed Project at NERSC
In 2013 we set up a SciDB cluster at NERSC and started the SciDB testbed project. Our approach was to create partnerships pairing one or two scientists with NERSC staff to work on real data problems. In each partnership, we helped the scientists port their data onto the SciDB testbed at NERSC and provided help with initial data analysis. Our aim was to help the science projects build SciDB into their workflows.
For each science partnership, we focused on three aspects of the science workflow and tried to answer the following questions:
- Data Management: Can SciDB be used as an efficient way to store and organize terabytes of raw and derived data? Can SciDB help efficiently filter the data in parallel or find the interesting portion of the data?
- Data Analysis: Can the project implement complex modeling algorithms in SciDB and take advantage of the automatic parallelism? How much analysis can be done with the built-in SciDB functionality, and how much requires custom SciDB extensions? How difficult is it to extend SciDB?
- Data Sharing: Can SciDB efficiently allow multiple users to share the same dataset and perform different analyses? Can SciDB be integrated with a science gateway (web portal) and allow collaborators to submit smart queries?
Usability is at the core of all three aspects; we give the greatest weight to how easily a non-expert user can start using SciDB.
Best-Suited Application Areas
Spectra
Description of Data: A collection of millions of spectra, either from simulation or from detectors.
Data Management Challenge: Efficiently store millions of spectra on disk in a distributed way, and allow quick filtering to select a subset of the spectra based on some metadata condition or on patterns in the spectra themselves.
Data Analysis Challenge: Calculate per-spectrum features in parallel. Calculate aggregate features of many spectra in parallel.
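The filter/per-spectrum/aggregate logic that SciDB would distribute across nodes can be sketched serially in plain Python. The spectra, metadata, and signal-to-noise cut below are toy values invented for illustration; the SciDB operator names in the comments (filter, apply, aggregate) refer to real AFL operators, but the mapping shown here is a sketch, not the project's actual queries.

```python
# Serial sketch of the spectra workflow that SciDB would run in parallel.
# Toy data: each inner list is one spectrum's flux over 3 wavelength bins.
spectra = [
    [0.1, 0.9, 0.3],   # spectrum 0
    [0.2, 0.2, 0.8],   # spectrum 1
    [0.5, 0.4, 0.1],   # spectrum 2
]
metadata = [{"snr": 12.0}, {"snr": 3.5}, {"snr": 9.1}]

# Filter: keep spectra whose metadata satisfies a condition (AFL: filter()).
good = [s for s, m in zip(spectra, metadata) if m["snr"] > 5.0]

# Per-spectrum feature: index of the peak flux bin (AFL: apply() per row).
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in good]

# Aggregate feature: mean flux per wavelength bin across selected spectra
# (AFL: aggregate() along the spectrum-id dimension).
mean_spectrum = [sum(col) / len(good) for col in zip(*good)]

print(len(good), peak_bins, mean_spectrum)
```

In SciDB both the per-spectrum loop and the cross-spectrum aggregate run in parallel across instances without the user writing any parallel code.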
Time Series
Description of Data: Similar to spectra, but in this case the data contains a collection of time series.
Data Management Challenge: Efficiently store millions of time series on disk in a distributed way, and allow quick filtering of the data to select a subset of the time series based on some metadata condition or patterns in the time series itself.
Data Analysis Challenge: Calculate per-time-series features in parallel. Calculate aggregate features of many time series in parallel.
Grid Structure (including dense matrix and image processing)
Description of Data: Data that naturally fits in a N-dimensional grid (such as climate simulation output) or collections of 3D images.
Data Management Challenge: Efficiently store large dense arrays in a distributed way, and efficiently access specific regions of an array.
Data Analysis Challenge: Parallel aggregate calculations on part of the grid or the whole grid. Parallel aggregate calculations across multiple grids. Linear algebra operations in parallel.
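The grid case can likewise be sketched serially in plain Python on a toy 2-D grid (real climate output would be N-dimensional and far larger). SciDB's between()/subarray() operators select dimension ranges and gemm() provides dense matrix multiply; the comments assume that mapping as an illustration.

```python
# Serial sketch of the grid workflow that SciDB would parallelize.
# Toy data: a small 2-D grid of values (e.g., one field of a simulation).
grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

# Region access: a 2x2 sub-grid by dimension range (AFL: between()/subarray()).
region = [row[0:2] for row in grid[0:2]]

# Aggregates over the region and the whole grid (AFL: aggregate()).
region_mean = sum(sum(r) for r in region) / 4
grid_max = max(max(r) for r in grid)

# A linear-algebra operation: grid times a vector
# (SciDB: gemm() in its dense linear-algebra library).
v = [1.0, 0.0, -1.0]
product = [sum(a * b for a, b in zip(row, v)) for row in grid]
print(region_mean, grid_max, product)
```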
Sparse Relation Matrix/Graph
Description of Data: A large, sparse matrix representing the relationship between multiple categories of entities, such as a large graph.
Data Management Challenge: Efficiently store a large sparse matrix in a distributed way.
Data Analysis Challenge: Parallel aggregate calculations on part of the matrix or the whole matrix. Parallel aggregate calculations across multiple matrices. Sparse linear algebra operations in parallel.
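Finally, a serial sketch of the sparse case using a toy adjacency matrix stored as a coordinate dictionary, which mirrors how a sparse SciDB array keeps only non-empty cells. The spgemm() reference in the comments names SciDB's sparse matrix-multiply operator; the dictionary representation and the example graph are illustrative assumptions.

```python
# Serial sketch of the sparse-matrix workflow that SciDB would parallelize.
# Toy data: a graph's weighted adjacency matrix as {(row, col): weight};
# only non-empty cells are stored, as in a sparse SciDB array.
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 0): 3.0, (0, 2): 1.0}

# Aggregate grouped by the row dimension: out-degree of each node
# (AFL: aggregate(count(*)) grouped by row).
out_degree = {}
for (i, j), w in edges.items():
    out_degree[i] = out_degree.get(i, 0) + 1

# Sparse linear algebra: sparse matrix times a dense vector
# (SciDB: spgemm() in its linear-algebra library).
v = [1.0, 1.0, 1.0]
result = [0.0, 0.0, 0.0]
for (i, j), w in edges.items():
    result[i] += w * v[j]
print(out_degree, result)
```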
To request access to NERSC SciDB, please email email@example.com and briefly explain your use case, expected data volume, sample operations, and current solution.