h5py: HDF5 for Python

The h5py package is a Pythonic interface to the HDF5 binary data format.¹

An HDF5 file is a container for two kinds of objects: groups and datasets. Groups work like dictionaries, and datasets work like NumPy arrays.

File Objects

File objects serve as your entry point into the world of HDF5.

Opening & creating files

import h5py
f = h5py.File('myfile.hdf5', 'r')  # read-only; the file must already exist

The file name may be a byte string or unicode string. Valid modes are:

r: Read-only, file must exist (default since h5py 3.0)
r+: Read/write, file must exist
w: Create file, truncate if exists
w- or x: Create file, fail if exists
a: Read/write if exists, create otherwise
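
As a rough illustration of how the modes differ, the sketch below opens the same file with several of them. The scratch path and the variable names are made up for this example; using File as a context manager is one common way to make sure the file is closed afterwards.

```python
import os
import tempfile

import h5py

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "myfile.hdf5")

# 'w' creates the file, truncating it if it already exists.
with h5py.File(path, "w") as f:
    f.create_group("grp1")

# 'r' opens read-only; the file must already exist.
with h5py.File(path, "r") as f:
    names_read_only = list(f.keys())
    mode_read_only = f.mode

# 'a' opens read/write, and would create the file if it were missing.
with h5py.File(path, "a") as f:
    f.create_group("grp2")
    names_after_append = sorted(f.keys())

# 'x' (or 'w-') refuses to overwrite an existing file.
try:
    h5py.File(path, "x")
    exclusive_create_failed = False
except OSError:
    exclusive_create_failed = True

print(names_read_only, mode_read_only, names_after_append, exclusive_create_failed)
```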

Groups

Groups are the container mechanism by which HDF5 files are organized. From a Python perspective, they operate somewhat like dictionaries: the “keys” are the names of group members, and the “values” are the members themselves (Group and Dataset objects).

Groups are like dictionaries

Several h5py group operations map directly onto familiar bash commands.

Create, pwd and ls

create_group: mkdir
name: pwd
keys: ls

import h5py
f = h5py.File('foo.hdf5','w')
grp1 = f.create_group("grp1")
grp2 = f.create_group("grp2")
subgrp = grp1.create_group("subgrp")

print('''
{}
{}
{}
{}
'''.format(f.name, grp1.name, grp2.name, subgrp.name))
print('''
{}
{}
{}
'''.format(f.keys(), grp1.keys(), subgrp.keys()))

/
/grp1
/grp2
/grp1/subgrp


<KeysViewHDF5 ['grp1', 'grp2']>
<KeysViewHDF5 ['subgrp']>
<KeysViewHDF5 []>
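
The dictionary analogy goes a little further than keys(). The sketch below, which rebuilds the same small hierarchy in a hypothetical scratch file, shows membership tests, path-style lookup, items(), and a recursive walk with visit() (roughly `ls -R`).

```python
import os
import tempfile

import h5py

# Hypothetical scratch file so the example does not touch foo.hdf5.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.hdf5")

with h5py.File(path, "w") as f:
    grp1 = f.create_group("grp1")
    f.create_group("grp2")
    grp1.create_group("subgrp")

    # Membership test, as with a dict.
    has_grp1 = "grp1" in f

    # A slash-separated path reaches nested members in one lookup.
    subgrp_name = f["grp1/subgrp"].name

    # items() yields (name, object) pairs for direct members only.
    members = {name: type(obj).__name__ for name, obj in f.items()}

    # visit() calls the function with every path below the group.
    all_paths = []
    f.visit(all_paths.append)

print(has_grp1, subgrp_name, members, sorted(all_paths))
```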

Remove

del f[<group_name>]: rm -rf

del f["grp2"]
f.keys()

<KeysViewHDF5 ['grp1']>

Hard Links

f[<link name>]=f[<original group name>]: ln <original filename> <link name>

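f.require_group("grp2")  # grp2 was deleted above; re-create it so the output below matches
f["grp1_hardlink"] = f["grp1"]
print(f.keys())
del f["grp1"]  # removes only the name "grp1"; the group stays reachable via the hard link
print(f["grp1_hardlink"].keys())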

<KeysViewHDF5 ['grp1', 'grp1_hardlink', 'grp2']>
<KeysViewHDF5 ['subgrp']>
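
A hard link is a second name for the same object rather than a copy. The following sketch, written against a hypothetical scratch file, checks this by comparing the two handles (h5py objects compare equal when they refer to the same HDF5 object) and then deleting the original name.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "hardlink_demo.hdf5")

with h5py.File(path, "w") as f:
    grp1 = f.create_group("grp1")
    grp1.create_group("subgrp")

    # Create a second name for the same group.
    f["grp1_hardlink"] = f["grp1"]

    # Both names resolve to the same underlying HDF5 object.
    same_object = f["grp1_hardlink"] == f["grp1"]

    # Deleting one name leaves the object reachable through the other.
    del f["grp1"]
    remaining = list(f.keys())
    still_reachable = list(f["grp1_hardlink"].keys())

print(same_object, remaining, still_reachable)
```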

Soft Links

f[<link name>] = h5py.SoftLink(f[<original group name>].name): ln -s <original filename> <link name>

f["grp2_softlink"] = h5py.SoftLink(f["grp2"].name)
print(f.keys())
del f["grp2"]
print(f["grp2_softlink"].keys())

<KeysViewHDF5 ['grp1_hardlink', 'grp2', 'grp2_softlink']>
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-71-8eb7848ea8a1> in <module>
      2 print(f.keys())
      3 del f["grp2"]
----> 4 print(f["grp2_softlink"].keys())

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py in __getitem__(self, name)
    286                 raise ValueError("Invalid HDF5 object reference")
    287         else:
--> 288             oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
    289 
    290         otype = h5i.get_type(oid)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5o.pyx in h5py.h5o.open()

KeyError: 'Unable to open object (component not found)'
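
The KeyError above is what a dangling soft link looks like: the link stores only a path, and that path no longer points at anything. A minimal sketch of how one might inspect and guard against this, using a hypothetical scratch file, is:

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "softlink_demo.hdf5")

with h5py.File(path, "w") as f:
    f.create_group("grp2")
    f["grp2_softlink"] = h5py.SoftLink("/grp2")

    # getlink=True returns the link object itself, not its target.
    link = f.get("grp2_softlink", getlink=True)
    link_path = link.path

    # Remove the target: the soft link now dangles.
    del f["grp2"]
    try:
        f["grp2_softlink"]
        dangling = False
    except KeyError:
        dangling = True

    # get() falls back to a default instead of raising.
    fallback = f.get("grp2_softlink")

    # Re-creating an object at the stored path makes the link resolve again.
    f.create_group("grp2")
    resolves_again = isinstance(f["grp2_softlink"], h5py.Group)

print(link_path, dangling, fallback, resolves_again)
```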

Datasets

Datasets are very similar to NumPy arrays. They are homogeneous collections of data elements, with an immutable datatype and (hyper)rectangular shape. Unlike NumPy arrays, they support a variety of transparent storage features such as compression, error-detection, and chunked I/O.
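
The storage features mentioned above are requested through keyword arguments to create_dataset. The sketch below shows one possible combination (chunking, gzip compression, and the Fletcher-32 checksum); the file path, dataset name, chunk shape, and compression level are arbitrary choices for illustration, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "storage_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "compressed",
        shape=(1000, 1000),
        dtype="f4",
        chunks=(100, 100),    # stored on disk as 100x100 tiles
        compression="gzip",   # each chunk is compressed transparently
        compression_opts=4,   # gzip level, 0-9
        fletcher32=True,      # checksum for error detection
    )

    # Reads and writes look the same as for an uncompressed dataset.
    dset[0:100, 0:100] = np.ones((100, 100), dtype="f4")
    corner_sum = float(dset[0:100, 0:100].sum())

    info = (dset.chunks, dset.compression, dset.compression_opts, dset.fletcher32)

print(info, corner_sum)
```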

Descriptive attributes

shape
size
ndim
dtype
nbytes
len()
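
For a concrete feel of these attributes, the sketch below queries them on a small two-dimensional dataset in a hypothetical scratch file. Note that nbytes is assumed to be available, which requires h5py 3.0 or later, and that len() reports the length of the first axis only.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("default", (20, 5), dtype="i8")

    summary = {
        "shape": dset.shape,        # extent along each axis
        "size": dset.size,          # total number of elements
        "ndim": dset.ndim,          # number of axes
        "dtype": str(dset.dtype),   # element type
        "nbytes": dset.nbytes,      # size * bytes per element
        "len": len(dset),           # length of the first axis
    }

print(summary)
```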

Creating dataset

create_dataset(name, shape=None, dtype=None, data=None, **kwds)

import numpy as np

# create a zero-filled dataset of shape (100,) and dtype='i8'
dset = f.create_dataset("default", (100,), dtype='i8')
# create a dataset from a NumPy array; shape and dtype are taken from the data
dset = f.create_dataset("init", data=np.arange(100))

Reading & writing data

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file. The following slicing arguments are recognized:

Indices: anything that can be converted to a Python integer
Slices (i.e. [:] or [0:10])
Field names, in the case of compound data
At most one Ellipsis (...) object
An empty tuple (()) to retrieve all data or scalar data
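
Each of the slicing arguments listed above can be tried on a small dataset. The sketch below uses a hypothetical scratch file and made-up dataset names ("grid", "answer", "points") to exercise every item in the list.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "slicing_demo.hdf5")

with h5py.File(path, "w") as f:
    grid = f.create_dataset("grid", data=np.arange(12).reshape(3, 4))

    single = int(grid[1, 2])       # plain indices
    block = grid[0:2, 1:3]         # slices
    last_col = grid[..., 3]        # one Ellipsis stands for the leading axes
    everything = grid[()]          # empty tuple: the whole dataset

    # The empty tuple is also how a scalar dataset is read.
    answer = f.create_dataset("answer", data=42)
    scalar_value = int(answer[()])

    # Field names select one column of a compound dataset.
    records = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=[("x", "f8"), ("y", "f8")])
    points = f.create_dataset("points", data=records)
    xs = points["x"]

print(single, block.tolist(), last_col.tolist(), everything.shape, scalar_value, xs.tolist())
```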

h5py also supports a subset of NumPy fancy indexing (index lists and boolean masks), for which the following restrictions exist:

Selection coordinates must be given in increasing order
Duplicate selections are ignored
Very long lists (> 1000 elements) may produce poor performance

Broadcasting is implemented using repeated hyperslab selections, and is safe to use with very large target selections. It is supported for the above “simple” (integer, slice and ellipsis) slicing only.
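
As a small sketch of broadcasting on write, the example below assigns a scalar to a whole dataset and then a single row to a two-row slice. The file path and dataset name are placeholders chosen for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "broadcast_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("grid", (3, 4), dtype="f8")

    # A scalar is broadcast to every element of the selection.
    dset[...] = 1.5

    # A length-4 row is broadcast across the two selected rows.
    dset[0:2, :] = np.arange(4.0)

    result = dset[()]

print(result)
```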

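Chained indexing only affects the loaded array: in dset[0][1] = 3.0, dset[0] first reads row 0 into a temporary NumPy array, and the assignment modifies that copy instead of the file. Use a single index tuple such as dset[0, 1] to write to the dataset.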

f = h5py.File('my_hdf5_file.h5', 'w')
dset = f.create_dataset("test", (2, 2))
dset[0][1] = 3.0   # modifies a temporary copy of row 0; the file is unchanged
print(dset[0, 1])
dset[0, 1] = 3.0   # writes to the dataset
print(dset[0, 1])

0.0
3.0
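
When several elements of a row need to change, one option is an explicit read-modify-write: load the row into NumPy, edit it there, and assign it back in a single step. This is only a sketch of that pattern, using a hypothetical scratch file.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "rmw_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("test", (2, 2))

    row = dset[0]        # NumPy copy of row 0, detached from the file
    row[1] = 3.0         # edits the in-memory copy only
    before = float(dset[0, 1])

    dset[0] = row        # write the edited row back to the dataset
    after = float(dset[0, 1])

print(before, after)
```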

Attributes

https://docs.h5py.org/en/stable/high/attr.html
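
Attributes are small pieces of metadata attached to a group or dataset and accessed through its dictionary-like attrs property. The names and values below ("units", "sample_rate_hz", "title") are invented for this short sketch.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "attributes_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("temperature", data=np.zeros(10))

    # Attributes are set and read with dictionary syntax.
    dset.attrs["units"] = "kelvin"
    dset.attrs["sample_rate_hz"] = 50
    f.attrs["title"] = "demo file"   # files and groups carry attributes too

    units = dset.attrs["units"]
    rate = int(dset.attrs["sample_rate_hz"])
    attr_names = sorted(dset.attrs.keys())
    missing = dset.attrs.get("calibration", "n/a")

print(units, rate, attr_names, missing)
```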

Dimension Scales

https://docs.h5py.org/en/stable/high/dims.html
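
Dimension scales let you label the axes of a dataset and associate another dataset (for example, coordinate values) with an axis. The sketch below follows the pattern in the linked documentation and assumes h5py 2.10 or later for make_scale; the dataset names and values are placeholders.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this illustration.
path = os.path.join(tempfile.mkdtemp(), "dims_demo.hdf5")

with h5py.File(path, "w") as f:
    data = f.create_dataset("data", data=np.ones((4, 3, 2), dtype="f4"))

    # Label individual axes through the dims property.
    data.dims[0].label = "z"
    data.dims[2].label = "x"

    # Turn a 1-D dataset into a scale and attach it to the last axis.
    x1 = f.create_dataset("x1", data=[1, 2])
    x1.make_scale("x1 name")
    data.dims[2].attach_scale(x1)

    labels = [dim.label for dim in data.dims]
    n_scales = len(data.dims[2])

print(labels, n_scales)
```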

¹ h5py Documentation
