Description and Overview
H5py provides an easy-to-use, high-level interface that lets you store huge amounts of numerical data and easily manipulate that data from NumPy. H5py uses straightforward NumPy and Python metaphors, such as dictionary and NumPy array syntax. For example, you can iterate over the datasets in a file, or inspect the .shape and .dtype attributes of a dataset. You don't need to know anything special about HDF5 to get started. H5py rests on an object-oriented Cython wrapping of the HDF5 C API: almost anything you can do from C in HDF5, you can do from h5py.
Availability at NERSC
On Cori and Edison, H5py has been upgraded to 2.7.1 and is built against HDF5 1.10.1.
Loading H5py on Edison/Cori
- Serial H5py (we recommend using the Anaconda Python module)
module load python
- Parallel H5py
module load python
module load h5py-parallel
Using H5py in Your Code
- Serial H5py
import h5py
- Parallel H5py
import h5py
from mpi4py import MPI
f = h5py.File('parallel.h5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
- 1. Open the file with a specific I/O driver
Available drivers in h5py: sec2 (unbuffered), stdio (buffered), core (memory-mapped), and family (file split into fixed-length chunks). We recommend the default driver, which is sec2 on Unix.
Available open modes: r (read-only, file must exist), r+ (read/write, file must exist), w (create file, truncate if it exists), w- or x (create file, fail if it exists), a (read/write if the file exists, create otherwise; default).
fx = h5py.File('output.h5', 'w', driver='sec2') # the mode comes before the driver keyword
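As a minimal sketch of the driver keyword (the file name is illustrative), the 'core' driver keeps the whole file image in memory, and backing_store=False means nothing is ever written to disk:

```python
import numpy as np
import h5py

# 'core' driver: the file lives entirely in memory;
# backing_store=False skips writing it out on close.
with h5py.File('in_memory.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('x', data=np.arange(5))
    total = int(f['x'][()].sum())
```

This is handy for tests and scratch data, since no file ever appears on disk.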
- 2. Slice the data like NumPy
dx = fx['4857/55711/4/coadd'][('FLUX','IVAR')] # read 2 columns in the 'coadd' table dataset
dx = fx['4857/55711/4/coadd'][()] # read the whole 'coadd' dataset in the group '4857/55711/4'
dx = fx['path_to_dataset'][0:10] # slice the first 10 in the dataset
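The patterns above can be sketched in a self-contained example (all file, group, and column names here are illustrative, not the BOSS layout):

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), 'slicing_demo.h5')
with h5py.File(path, 'w') as f:
    # a plain 1-D dataset to slice
    f.create_dataset('path_to_dataset', data=np.arange(100))
    # a compound ("table") dataset with named columns
    tbl = np.zeros(4, dtype=[('FLUX', 'f4'), ('IVAR', 'f4')])
    f.create_dataset('coadd', data=tbl)

with h5py.File(path, 'r') as f:
    whole = f['path_to_dataset'][()]      # read the entire dataset
    first10 = f['path_to_dataset'][0:10]  # slice only the first 10 elements
    flux = f['coadd']['FLUX']             # read one named column of the table
```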
- 3. Be aware of the implicit write
There is no explicit write function in h5py; all writes happen implicitly when you assign to a dataset or create one.
# Initialize the dataset with existing numpy array
dset=f.create_dataset('mydset',data=arr) # write happens here
# Rewrite h5py dataset with numpy array
dset[0:2,:]=temp # write happens here
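Putting both implicit-write forms together in one runnable sketch (names are illustrative):

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), 'write_demo.h5')
arr = np.arange(12, dtype='f8').reshape(3, 4)

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('mydset', data=arr)  # write happens here
    temp = np.zeros((2, 4))
    dset[0:2, :] = temp                          # write happens here too

with h5py.File(path, 'r') as f:
    back = f['mydset'][()]  # first two rows are zeros, last row untouched
```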
- Unable to create file (unable to lock file, errno = 524, error message = 'Unknown error 524')
This usually happens on the Burst Buffer or Project file systems. The root cause is documented in the HDF5 known issues. A simple fix is to disable HDF5 file locking before running your code:
export HDF5_USE_FILE_LOCKING=FALSE
- numpy.dtype has the wrong size, try recompiling. Expected 88, got 96
This means you are not using Anaconda Python. The h5py module is built against the default python module on Cori/Edison, and the system-provided Python does not work with it. The simple fix is:
module load python
Tuning I/O Performance
This section offers recommendations for tuning I/O performance.
- 1. Choose the most modern format in file creation (Sample Codes)
By default, every file created by HDF5/h5py uses the most compatible version, which forces the underlying storage format to be compatible with the earliest (and least efficient) format. By turning on the 'latest' format during file creation, you may save I/O cost. Refer to 'version bounding' in the h5py documentation for more information.
f = h5py.File('name.hdf5', libver='earliest') # most compatible
f = h5py.File('name.hdf5', libver='latest') # most modern
"Using 1 process to create 1 file with 8000 objects (including subgroups, different datasets, etc), 'latest' version achieved 2.25X speedup."
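A minimal sketch of opting in to the modern format (file and group names are illustrative); note that a file written with libver='latest' may be unreadable by older HDF5 libraries:

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), 'latest_demo.h5')

# Creating many small objects is where the 'latest' format helps most.
with h5py.File(path, 'w', libver='latest') as f:
    for i in range(100):
        grp = f.create_group('group%03d' % i)
        grp.create_dataset('d', data=[i])

with h5py.File(path, 'r') as f:
    n_groups = len(f.keys())
```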
- 2. Use collective I/O in parallel writes
With parallel H5py (the 'mpio' driver), switching from independent to collective I/O can significantly reduce write time when many processes write to the same file.
"Using 1k processes to write a 1 TB file, collective IO achieved 2X speedup on Cori."
- 3. Avoid type casting in H5py
By default, NumPy uses float64. Whenever h5py detects inconsistent types, it forces a type conversion on the fly, which is a costly operation. For example, if we create a dataset with dtype 'f' (a 32-bit float) and then assign a NumPy array to it, h5py will cast on every write:
dset = f.require_dataset('field', shape, 'f') # 32-bit dataset
dset[...] = temp # temp is a NumPy array of (by default) 64-bit floats: cast on every write
Instead, match the dataset dtype to the array:
dset = f.require_dataset('field', shape, 'f8') # 64-bit dataset
dset[...] = temp # dtypes match, no conversion
Matching the dtypes largely reduces the I/O time: from 527 seconds to 1.3 seconds when writing a 100x100x100 array on a 5x5x5 process grid.
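The dtype mismatch is easy to check programmatically; this sketch (dataset names are illustrative) shows the two cases side by side:

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), 'dtype_demo.h5')
temp = np.random.rand(10, 10)  # float64 by default

with h5py.File(path, 'w') as f:
    # 'f' is 32-bit: every write of temp triggers a float64 -> float32 cast
    slow = f.require_dataset('field32', temp.shape, 'f')
    # 'f8' matches temp.dtype: no conversion on write
    fast = f.require_dataset('field64', temp.shape, 'f8')
    slow[...] = temp
    fast[...] = temp
    dt_slow = slow.dtype
    dt_fast = fast.dtype
```

Comparing `dataset.dtype` against `array.dtype` before a large write is a cheap way to catch this.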
- 4. Use low-level API in H5py (Sample Codes)
H5py provides a very nice object-oriented interface by hiding many details that are exposed in the HDF5 C library. Fortunately, users can still leverage the flexible HDF5 C features for tuning I/O performance: h5py allows you to do so through its low-level API. A concrete example is disabling pre-filling via the low-level API. The following code creates a 900 GB dataset; using 'FILL_TIME_NEVER' reduced the I/O cost from 40 minutes to less than 1 second:
fx = h5py.File('test_nofill.h5', 'w')
spaceid = h5py.h5s.create_simple((30000, 8000000))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER) # never pre-fill the dataset
datasetid = h5py.h5d.create(fx.id, b'data', h5py.h5t.NATIVE_FLOAT, spaceid, plist)
dset = h5py.Dataset(datasetid)
- 5. Use H5py in Spark
A NERSC-developed HDF5 Python plugin for Spark is available, but it is currently in development mode and only limited features are available, e.g., reading a 2D array from an HDF5 file.
Science Use Case
H5py has been used in various scientific applications. One use case is from astronomy, in which we built a customized file structure based on HDF5 and developed basic query, subsetting, and updating functions for managing BOSS spectral data from SDSS-II. More details: H5Boss.
Performance vs. Productivity
Productivity and High Performance: Can We Have Both? An Exploration of Parallel-H5py from an I/O Perspective (H5py-2017-May26-public.pdf)