The h5py package is a Pythonic interface to the HDF5 binary data format.1
An HDF5 file is a container for two kinds of objects: groups and datasets. Groups work like dictionaries, and datasets work like NumPy arrays.
File Objects
File objects serve as your entry point into the world of HDF5.
Opening & creating files
f = h5py.File('myfile.hdf5','r')
The file name may be a byte string or unicode string. Valid modes are:
r
: Readonly, file must exist (default)r+
: Read/write, file must existw
: Create file, truncate if existsw-
orx
: Create file, fail if existsa
: Read/write if exists, create otherwise
Groups
Groups are the container mechanism by which HDF5 files are organized. From a Python perspective, they operate somewhat like dictionaries. In this case the “keys” are the names of group members, and the “values” are the members themselves (Group
and Dataset
) objects.
Groups are like dictionaries
We can actually map h5py
functions to bash command
Create, pwd and ls
create_group
:mkdir
name
:pwd
keys
:ls
import h5py
f = h5py.File('foo.hdf5','w')
grp1 = f.create_group("grp1")
grp2 = f.create_group("grp2")
subgrp = grp1.create_group("subgrp")
print('''
{}
{}
{}
{}
'''.format(f.name, grp1.name, grp2.name, subgrp.name))
print('''
{}
{}
{}
'''.format(f.keys(), grp1.keys(), subgrp.keys()))
/ /grp1 /grp2 /grp1/subgrp <KeysViewHDF5 ['grp1', 'grp2']> <KeysViewHDF5 ['subgrp']> <KeysViewHDF5 []>
Remove
del f[<group_name>]
: rm -rf
del f["grp2"]
f.keys()
<KeysViewHDF5 ['grp1']>
Hard Links
f[<link name>]=f[<original group name>]
: ln <original filename> <link name>
f["grp1_hardlink"] = f["grp1"]
print(f.keys())
del f["grp1"]
print(f["grp1_hardlink"].keys())
<KeysViewHDF5 ['grp1', 'grp1_hardlink', 'grp2']> <KeysViewHDF5 ['subgrp']>
Soft Links
f[<link name>]=h5py.SoftLink(f.<original group>.name)
: ln -s <original filename> <link name>
f["grp2_softlink"] = h5py.SoftLink(f["grp2"].name)
print(f.keys())
del f["grp2"]
print(f["grp2_softlink"].keys())
<KeysViewHDF5 ['grp1_hardlink', 'grp2', 'grp2_softlink']> --------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-71-8eb7848ea8a1> in <module> 2 print(f.keys()) 3 del f["grp2"] ----> 4 print(f["grp2_softlink"].keys()) h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/_objects.pyx in h5py._objects.with_phil.wrapper() /usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py in __getitem__(self, name) 286 raise ValueError("Invalid HDF5 object reference") 287 else: --> 288 oid = h5o.open(self.id, self._e(name), lapl=self._lapl) 289 290 otype = h5i.get_type(oid) h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/h5o.pyx in h5py.h5o.open() KeyError: 'Unable to open object (component not found)'
Datasets
Datasets are very similar to NumPy arrays. They are homogeneous collections of data elements, with an immutable datatype and (hyper)rectangular shape. Unlike NumPy arrays, they support a variety of transparent storage features such as compression, error-detection, and chunked I/O.
Descriptive attributes
shape
size
ndim
dtype
nbytes
len()
Creating dataset
create_dataset(name, shape=None, dtype=None, data=None, **kwds)
# create a dataset of shape (100,) and dtype='i8'
dset = f.create_dataset("default", (100,), dtype='i8')
# create from numpy array
dset = f.create_dataset("init", data=np.arange(100))
Reading & writing data
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file. The following slicing arguments are recognized:
- Indices: anything that can be converted to a Python long
- Slices (i.e.
[:]
or[0:10]
) - Field names, in the case of compound data
- At most one
Ellipsis
(...
) object - An empty tuple (
()
) to retrieve all data or scalar data
The following restrictions exist:
- Selection coordinates must be given in increasing order
- Duplicate selections are ignored
- Very long lists (> 1000 elements) may produce poor performance
Broadcasting is implemented using repeated hyperslab selections, and is safe to use with very large target selections. It is supported for the above “simple” (integer, slice and ellipsis) slicing only.
Multiple indexing only affect the loaded array.
f = h5py.File('my_hdf5_file.h5', 'w')
dset = f.create_dataset("test", (2, 2))
dset[0][1] = 3.0
print(dset[0, 1])
dset[0, 1] = 3.0
print(dset[0, 1])
0.0 3.0
Attributes
https://docs.h5py.org/en/stable/high/attr.html