
H5py

Description and Overview

The h5py package is a Pythonic interface to the HDF5 binary data format.

H5py provides an easy-to-use, high-level interface that lets you store huge amounts of numerical data and easily manipulate that data from NumPy. H5py uses straightforward NumPy and Python metaphors, like dictionary and NumPy array syntax. For example, you can iterate over datasets in a file, or check out the .shape or .dtype attributes of datasets. You don't need to know anything special about HDF5 to get started. H5py rests on an object-oriented Cython wrapping of the HDF5 C API. Almost anything you can do from C in HDF5, you can do from h5py.

Availability at NERSC

On Cori/Edison, the h5py module provides version 2.7.1, built against HDF5 1.10.1.

Loading H5py on Edison/Cori

  • Serial H5py

module load python
module load h5py

  • Parallel H5py 

module load python
module load h5py-parallel

Using H5py in Your Code

  • Serial H5py
import h5py
fx = h5py.File('output.h5', 'w')
fx.close()
  • Parallel H5py
from mpi4py import MPI
import h5py
fx = h5py.File('output.h5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
fx.close()
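
With the 'mpio' driver, file operations are cooperative across MPI ranks. Below is a minimal sketch of a full parallel write, assuming an MPI launch (e.g., via srun) and the h5py-parallel module; the dataset name 'mydata' is illustrative:

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

fx = h5py.File('output.h5', 'w', driver='mpio', comm=comm)
# Dataset creation is collective: every rank must make the same call
dset = fx.create_dataset('mydata', (size, 10), dtype='f8')
dset[rank, :] = rank  # each rank writes its own row
fx.close()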

Basic Usage

  • 1. Open the file with a different IO driver

Available drivers in h5py: sec2 (unbuffered), stdio (buffered), core (in-memory), and family (fixed-length chunks). We recommend the default driver, which is sec2 on Unix.

Different open modes: r (read-only, file must exist); r+ (read/write, file must exist); w (create file, truncate if it exists); w- or x (create file, fail if it exists); a (read/write if the file exists, create otherwise; default)

import h5py
fx = h5py.File('output.h5', 'w', driver=<driver name>)
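
For instance, here is a minimal sketch using the 'core' driver to build the file in memory and flush it to disk on close (backing_store is part of the h5py core-driver API):

import numpy as np
import h5py

# Build the file in memory; write it out to disk when closed
fx = h5py.File('output.h5', 'w', driver='core', backing_store=True)
fx.create_dataset('mydset', data=np.arange(10))
fx.close()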
  • 2. Slice the data like NumPy
dx = fx['4857/55711/4/coadd'][('FLUX','IVAR')] # read two columns of the 'coadd' table dataset
dx = fx['4857/55711/4/coadd'][()] # read the whole 'coadd' dataset in the group '4857/55711/4'
dx = fx['path_to_dataset'][0:10] # read the first 10 elements of the dataset
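
The snippets above assume an existing file; here is a small self-contained sketch (the group path and field names mirror the example above and are illustrative):

import numpy as np
import h5py

# Create a table-like compound dataset, then slice it as above
coadd = np.zeros(5, dtype=[('FLUX', 'f8'), ('IVAR', 'f8')])
with h5py.File('example.h5', 'w') as fx:
    fx.create_dataset('4857/55711/4/coadd', data=coadd)

with h5py.File('example.h5', 'r') as fx:
    cols = fx['4857/55711/4/coadd'][('FLUX', 'IVAR')]  # two columns
    whole = fx['4857/55711/4/coadd'][()]               # whole dataset
    first = fx['4857/55711/4/coadd'][0:3]              # first 3 rows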
  • 3. Be aware of implicit writes

There is no explicit write function in h5py; all writes happen implicitly when you assign to a dataset or create one.

import numpy as np
import h5py

f = h5py.File('output.h5', 'w')

# Initialize a dataset with an existing numpy array
arr = np.arange(100)
dset = f.create_dataset('mydset', data=arr) # write happens here

# Rewrite an h5py dataset with a numpy array
dset = f.create_dataset('mydset2', (10, 10), dtype='f8')
temp = np.random.random((2, 10))
dset[0:2, :] = temp # write happens here

Common Errors

  • Unable to create file (unable to lock file, errno = 524, error message = 'Unknown error 524')

This usually happens on the Burst Buffer or Project file systems. The root cause is documented in the HDF5 known issues. A simple fix is:

export HDF5_USE_FILE_LOCKING=FALSE     
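
The same workaround can be applied from Python, as long as the variable is set before h5py opens any file; a minimal sketch:

import os

# Must be set before h5py/HDF5 opens the file
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

import h5py
fx = h5py.File('output.h5', 'w')
fx.close()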

 

  • numpy.dtype has the wrong size, try recompiling. Expected 88, got 96

This means you are not using the Anaconda Python module, which you should: the h5py module is built against the default python module on Cori/Edison, and the system-provided Python does not work with it. A simple fix is:

module load python     

Advanced Usage

This section offers recommendations for tuning the I/O performance. 

  • 1. Choose the most modern format in file creation (Sample Codes)

By default, every file created by HDF5/h5py uses the most compatible version of the on-disk format, which forces the underlying storage layout to remain compatible with the earliest, least efficient format. Turning on the 'latest' format during file creation may save I/O cost. Refer to 'version bounding' in the h5py documentation for more information.

f = h5py.File('name.hdf5', libver='earliest') # most compatible
f = h5py.File('name.hdf5', libver='latest')   # most modern

"Using 1 process to create 1 file with 8000 objects (including subgroups, different datasets, etc), 'latest' version achieved 2.25X speedup." 

  • 2. Use collective IO in parallel H5py

In parallel h5py, each process writes to the file independently by default. Wrapping the write in the collective context manager lets MPI-IO aggregate the requests from all processes (see the sketch after the quote below):

with dset.collective:
    dset[start:end, :] = temp

"Using 1k process to write 1TB file, collective IO achieved 2X speedup on Cori." 

  • 3. Avoid type casting in H5py

By default, numpy uses float64. Whenever h5py detects inconsistent types, it forces a type conversion on the fly, which is a costly operation. For example, if we create a dataset with dtype 'f' (32-bit floats) and then assign a 64-bit numpy array to it, h5py will do the type casting:

# bad performance: dataset is 32-bit, numpy array is 64-bit, cast on the fly
dset = f.require_dataset('field', shape, 'f')
dset[...] = temp # temp is a numpy array, which defaults to 64-bit floats
# good performance: dtypes match, no conversion needed
dset = f.require_dataset('field', shape, 'f8')
dset[...] = temp

Matching the types largely reduces the I/O time: from 527 seconds down to 1.3 seconds when writing a 100x100x100 array with a 5x5x5 process grid.

  • 4. Tune performance with the H5py low-level API

H5py provides a clean object-oriented interface by hiding many details that are exposed in the HDF5 C library. Fortunately, users can still leverage the flexible HDF5 C features for tuning I/O performance; h5py allows you to do so through its low-level API. For example:

space = h5py.h5s.create_simple((100,))           # low-level dataspace for 100 elements
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE) # dataset-creation property list
plist.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)  # allocate storage at creation time
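
To put the property list to use, the dataset itself can also be created through the low-level API and then wrapped in a high-level Dataset; a sketch, with 'mydset' as an illustrative name:

import h5py

f = h5py.File('output.h5', 'w')
space = h5py.h5s.create_simple((100,))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)  # allocate storage up front
dsid = h5py.h5d.create(f.id, b'mydset', h5py.h5t.NATIVE_DOUBLE, space, dcpl=plist)
dset = h5py.Dataset(dsid)  # wrap the low-level identifier in a high-level Dataset
f.close()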
  • 5. Use H5py in Spark

A NERSC-developed HDF5 Python plugin for Spark is available; it is currently in development, and only limited features are available, e.g., reading a 2D array from an HDF5 file.

Science Use Case

H5py has been used in various scientific applications. One use case comes from astronomy, in which we built a customized file structure based on HDF5 and developed basic query, subsetting, and updating functions for managing BOSS spectra data from SDSS-II. More details: H5Boss

Performance vs. Productivity

NERSC Data Seminar May 26 2017

 

Downloads

  • H5py-2017-May26-public.pdf | Adobe Acrobat PDF file
    Productivity and High Performance, Can we have both? An Exploration of Parallel-H5py from I/O Perspective