
H5Spark (dev)

Description and Overview

The H5Spark (dev) package is a Python and Scala plugin for reading the HDF5 binary data format in Spark.

H5Spark provides an easy-to-use, high-level interface that lets you convert hierarchical HDF5/netCDF4 data into RDDs and conduct data analytics in Spark. H5Spark's design was initially driven by large-scale matrix factorization. The data formats traditionally supported in Spark include JSON, Parquet, and text, whereas HDF5, the most commonly used scientific data format, is not natively supported. With H5Spark, science users can manipulate their HDF5 data in Spark.

Availability at NERSC

H5Spark has been used in various applications but is still in an experimental stage with limited features. You can test it in either the Python or the Scala version of Apache Spark.

  • Input: HDF5 file(s)
  • Output: RDD

Launching Spark and Loading H5Spark on Edison/Cori

# On Edison: allocate 2 nodes in the debug queue for 20 minutes
salloc -N 2 -p debug -t 20
# Or on Cori (Haswell nodes)
salloc -N 2 -p debug -t 20 -C haswell
# Load Spark, H5Spark, and Python, then start the standalone Spark cluster
module load spark
module load h5spark
module load python/2.7-anaconda
start-all.sh
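
start-all.sh is the stock Spark standalone launch script; assuming its counterpart stop-all.sh from the same Spark sbin directory is also on your PATH (an assumption here), you can shut the cluster down cleanly before your allocation ends:

# Assumed available alongside start-all.sh; stops the standalone Spark cluster
stop-all.sh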

Using H5Spark in PySpark on Edison/Cori

1. Prepare a test file

cp $H5SPARK/src/resources/test.h5 temp_dir/

2. Start PySpark

pyspark

3. Run the code

import os, sys, h5py, read
# Read dataset 'charge' from a single HDF5 file into an RDD with 1 partition
rdd = read.h5read(sc, ('absolute_path/temp_dir/test.h5', 'charge'), mode='single', partitions=1)
print "rdd count:", rdd.count()
sc.stop()
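
If you want to see how H5Spark represents the dataset in Python (the element layout is not documented here), sample an element before calling sc.stop(); an exploratory sketch:

# Run before sc.stop(): pull one element to inspect its layout
print "first element:", rdd.take(1)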

The function h5read takes four parameters:

  • SparkContext
  • (FileName, DatasetName)
  • mode: 'single' for a single-file read, or 'multi' for a multi-file read (see the sketch after this list)
  • partitions: the number of Spark partitions
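
A multi-file read might look like the following sketch. The exact form of the path argument in 'multi' mode is an assumption here (a directory holding several HDF5 files that all contain the named dataset); check the H5Spark sources if your layout differs.

# Hypothetical multi-file read: 'absolute_path/temp_dir/' is assumed to be a
# directory of HDF5 files that each contain a 'charge' dataset
rdd = read.h5read(sc, ('absolute_path/temp_dir/', 'charge'), mode='multi', partitions=4)
print "total count across files:", rdd.count()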

Using H5Spark in Scala Spark on Edison/Cori 

1. Prepare a test file (optional)

cp $H5SPARK/src/resources/test.h5 your_dir/

2. Start Spark-shell

# Put every H5Spark jar on the driver classpath (join the list with ':')
spark-shell --driver-class-path $(echo $H5SPARK_LIB/*.jar | tr ' ' ':')

3. Run the code:

import org.nersc.io._
// Read dataset 'charge' from test.h5 into an RDD with 2 partitions
val rdd = read.h5read_array(sc, "test.h5", "charge", 2)
rdd.count()
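
Because h5read_array returns RDD[Array[Double]] (see the list of variants below), ordinary RDD operations apply directly; for example, a minimal sketch that sums every value in the dataset:

// Sum each row's values locally, then reduce across partitions
val total = rdd.map(_.sum).reduce(_ + _)
println(s"sum of all elements: $total")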

h5read has several variants with different return types:

  • val rdd = read.h5read_point(sc, inputpath, variablename, partition) // load n-D data into RDD[(value: Double, key: Long)]
  • val rdd = read.h5read_array(sc, inputpath, variablename, partition) // load n-D data into RDD[Array[Double]]
  • val rdd = read.h5read_vec(sc, inputpath, variablename, partition) // load n-D data into RDD[DenseVector]
  • val rdd = read.h5read_irow(sc, inputpath, variablename, partition) // load n-D data into RDD[IndexedRow]
  • val rdd = read.h5read_imat(sc, inputpath, variablename, partition) // load n-D data into IndexedRowMatrix (see the SVD sketch after this list)
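
Since h5read_imat yields an MLlib IndexedRowMatrix, the distributed factorizations that motivated H5Spark's design are one call away. A minimal sketch, assuming the dataset is 2-D and reusing the file and dataset from the example above; the choice of k = 5 singular values is arbitrary:

import org.nersc.io._
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

// Load the 2-D 'charge' dataset as a distributed row matrix
val mat: IndexedRowMatrix = read.h5read_imat(sc, "test.h5", "charge", 2)
// Compute the top 5 singular values (and the left singular vectors)
val svd = mat.computeSVD(5, computeU = true)
println(s"top singular values: ${svd.s}")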

Further Information

Reference manuals