
H5PY examples

HDF5 can be an excellent tool to help organize, index, and consolidate research data. In particular, applications that read or write large numbers of small text or image files can often benefit significantly from consolidating those data into JSON-like, hierarchical data structures. Consolidating IO-based data management with HDF5 (or similar tools) can significantly improve compute performance by mitigating IO overhead and improving read/write performance.

Arbitrary array structures can be stored in HDF5 objects. In this example, we demonstrate how to stack multiple .png images into a single HDF5 object. Applications and motivations include:

  • Principal documentation:
    https://docs.h5py.org/en/stable/high/dataset.html

  • Consolidate many, many image files into a single container, especially valuable for lots of small images
  • Expect significantly better ML training performance, thanks to improved IO
  • Much more efficient storage (filesystems are limited by inodes, or discrete numbers of file-like objects)
  • Simpler organization of data

Images and HDF5:

Some docs: https://matplotlib.org/stable/tutorials/images.html

Overview:

  • You have a bunch of image (.png) files…
  • Print some thumbnails of those images
  • Show that an image file is just an array of numbers
  • Create an HDF5 object
  • Add the images to the HDF5 object as numerical arrays
  • Separate those images into two groups
  • Extract and plot those images from the HDF5 object
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy
import h5py
#import hdf5plugin
import matplotlib as mpl
import matplotlib.pyplot as plt
import os

import imageio.v3 as iio

Define a source directory and create a list of image files. This is a sloppy way to sort and select only the files we want, but it does the trick; a tidier variant is sketched just after the output below. Ideally, you should select your files in code, allowing auxiliary files, metadata, etc. to live in the same directory.

img_path = './images_h5'
image_names = os.listdir(img_path)
image_names = [os.path.join(img_path, f) for f in image_names if f.endswith('.png') and 'Small' in f]
image_names = [os.path.abspath(f) for f in image_names]
print('** Image files: ', image_names)
** Image files:  ['/Users/myoder96/Codes/src_python_demos/images_h5/IMG_6989_Small.png', '/Users/myoder96/Codes/src_python_demos/images_h5/IMG_6974_Small.png', '/Users/myoder96/Codes/src_python_demos/images_h5/IMG_6947_Small.png', '/Users/myoder96/Codes/src_python_demos/images_h5/IMG_6968_Small.png', '/Users/myoder96/Codes/src_python_demos/images_h5/IMG_6970_Small.png']
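
The listing above is in arbitrary filesystem order. A slightly tidier, deterministic variant might use glob and sorted(); this is a sketch only, and is not used below, since the group indices ks_yo/ks_beru below assume the listing order printed above:

# a deterministic selection using glob (sketch only; not used below)
import glob
image_names_sorted = sorted(os.path.abspath(f)
                            for f in glob.glob(os.path.join(img_path, '*Small*.png')))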
# let's do some thumbnails, for reference
w=5
h=1.5*w
#
N = len(image_names)
fg = plt.figure(figsize=(N*w, h))  # width scales with the number of columns
#
axes = [fg.add_subplot(1,N,k+1) for k in range(N)]
for ax, im in zip(axes, image_names):
    ax.imshow(iio.imread(im))
#

(figure: thumbnails of the five source images)

First, let’s just look at .png image layers:

  • Load some image names into a list
  • use imshow() to look at the image layers
  • Note that the depth of images may vary: some .png images are (N,M,3), while others are (N,M,4).
  • Ideally, we would dynamically determine the depth (see the sketch after the next code block); for this demo, we just use the first three channels and the composite image.
img_array = iio.imread(image_names[0])
print(f'image array shape: {img_array.shape}')
image array shape: (320, 240, 4)
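
If we did want to handle the depth dynamically, a minimal sketch might look like the following (the n_channels helper is hypothetical and not used in the rest of the demo):

# sketch: determine channel depth dynamically (hypothetical helper; not used below)
def n_channels(img):
    # grayscale images are (N, M); color images are (N, M, 3) or (N, M, 4)
    return 1 if img.ndim == 2 else img.shape[-1]
print(f'n_channels: {n_channels(img_array)}')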
fg = plt.figure(figsize=(6,10))
axes = [fg.add_subplot(2,2,k+1) for k in range(4)]
#
for k, ax in enumerate(axes[0:3]):
    ax.imshow(img_array[:,:,k])
axes[3].imshow(img_array)


<matplotlib.image.AxesImage at 0x129378790>

(figure: the first three channels and the composite image)

Create an HDF5 object

Add images as arrays…

  • Quick Start: https://docs.h5py.org/en/stable/quick.html
  • We have two classes of images:
    • Photos from “The Yo”
    • Photos of Beru
  • Separate those photos into two groups:
    • TheYo
    • Beru
  • More practically, for ML applications you might have:
    • training_images
    • validation_images
    • demo_images
h5_images = 'yo_images.h5'
ks_yo = [0,1,2]
ks_beru = [3,4]
#
h5_images_fp = os.path.join(img_path, h5_images)
#
with h5py.File(h5_images_fp, 'w') as h5_obj:
    #
    # Create group(s). Note that we do not necessarily need to add the groups in advance; we can
    # create them by providing a full(er) path to create_dataset().
    h5_obj.create_group('TheYo')
    #h5_obj.create_group('Beru')
    #
    # add Yo images:
    for k in ks_yo:
        h5_obj.create_dataset(f'TheYo/im_{k}', data=iio.imread(image_names[k]))
    #
    # add Beru images:
    for k in ks_beru:
        h5_obj.create_dataset(f'Beru/im_{k}', data=iio.imread(image_names[k]))
    #
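
One optional refinement, not part of the original example: HDF5 attributes can carry per-dataset metadata, such as the source filename, so each image remains traceable to the file it came from. A minimal sketch, assuming the file and index lists defined above:

# sketch: record each image's source file as an HDF5 attribute (hypothetical refinement)
with h5py.File(h5_images_fp, 'a') as h5_obj:
    for k in ks_yo + ks_beru:
        grp = 'TheYo' if k in ks_yo else 'Beru'
        h5_obj[f'{grp}/im_{k}'].attrs['source_file'] = os.path.basename(image_names[k])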

Reading HDF5 data

  • basically a nested, dict-like (JSON-like) structure:
    • {‘image_set’: {img_name:img_data, …}, …}
with h5py.File(h5_images_fp, 'r') as fout:
    for ky in fout:
        print(f'ky: {ky}')
        for ds,vl in fout[ky].items():
            print(f'  ds: {ds}[{vl.shape}]')
#
# and show some thumbnails in a semi-dynamic way:
with h5py.File(h5_images_fp, 'r') as fout:
    n_groups = len(fout.keys())
    #
    n_cols = 0
    for ky in fout:
        print(f'ky: {ky}')
        n_cols = max(n_cols, len(fout[ky].keys()))
    #
    fg = plt.figure(figsize=(w*n_cols, h*n_groups))  # width scales with columns, height with rows
    #
    #
    print(f'** subplot dimensions: {n_groups}, {n_cols}')
    
    # introduce some more efficient syntax:
    for k_rw, (ky,grp) in enumerate(fout.items()):
        
        for k_col, (nm,data) in enumerate(grp.items()):
            k_img = k_rw*n_cols + k_col + 1
            print(f'** image index: {k_img}')
            #
            ax = fg.add_subplot(n_groups, n_cols, k_img)
            ax.imshow(data)
        #
    #
        
ky: Beru
  ds: im_3[(320, 240, 4)]
  ds: im_4[(240, 320, 4)]
ky: TheYo
  ds: im_0[(320, 240, 4)]
  ds: im_1[(320, 240, 4)]
  ds: im_2[(240, 320, 4)]
ky: Beru
ky: TheYo
** subplot dimensions: 2, 3
** image index: 1
** image index: 2
** image index: 4
** image index: 5
** image index: 6

(figure: extracted images, one row per HDF5 group)
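
As an aside, h5py's visititems() will recursively walk every group and dataset, which scales better than hand-rolled nested loops for deeper hierarchies; a quick sketch:

# alternative traversal: visititems() recursively visits all groups and datasets
with h5py.File(h5_images_fp, 'r') as h5_obj:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f'{name}: {obj.shape}')
    h5_obj.visititems(show)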

Compression

The astute reader will likely be disappointed to observe that the HDF5 object created above is some 2x as large as the collection of PNG files. This is (presumably) because the HDF5 data are stored as raw arrays, forgoing the compression built into the PNG format.

Fortunately, h5py, as well as other HDF5 interfaces, support a variety of compression options. Again, the primary h5py documentation is quite helpful: https://docs.h5py.org/en/stable/high/dataset.html

Three popular lossless compression algorithms, supported by h5py, include:

  • compression="gzip" compression_opts={0,9} # variable compression; supported by all h5py builds
  • compresion="lzf" # fast, low to moderate compression; all h5py builds
  • compression="szip" # available with some h5py builds

GZIP:

# gzip example:

h5_images_gz = 'yo_images_gz9.h5'
#ks_yo = [0,1,2]
#ks_beru = [3,4]
#
h5_images_gz_fp = os.path.join(img_path, h5_images_gz)
#
with h5py.File(h5_images_gz_fp, 'w') as h5_obj:
    #
    # Create group(s). Note that we do not necessarily need to add the groups in advance; we can
    # create them by providing a full(er) path to create_dataset().
    h5_obj.create_group('TheYo')
    #h5_obj.create_group('Beru')
    #
    # add Yo images:
    for k in ks_yo:
        h5_obj.create_dataset(f'TheYo/im_{k}', data=iio.imread(image_names[k]), compression="gzip", compression_opts=9)
    #
    # add Beru images:
    for k in ks_beru:
        h5_obj.create_dataset(f'Beru/im_{k}', data=iio.imread(image_names[k]), compression="gzip", compression_opts=9)
    #
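
To confirm that a filter was actually applied, the settings are recoverable from the dataset properties (Dataset.compression and Dataset.compression_opts are standard h5py attributes); for example:

# verify the compression settings on one of the stored datasets
with h5py.File(h5_images_gz_fp, 'r') as h5_obj:
    ds = h5_obj['TheYo/im_0']
    print(f'compression: {ds.compression}, opts: {ds.compression_opts}')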

LZF:

h5_images = 'yo_images_lzf.h5'
#ks_yo = [0,1,2]
#ks_beru = [3,4]
#
h5_images_fp = os.path.join(img_path, h5_images)
#
with h5py.File(h5_images_fp, 'w') as h5_obj:
    #
    # Create group(s). Note that we do not necessarily need to add the groups in advance; we can
    # create them by providing a full(er) path to create_dataset().
    h5_obj.create_group('TheYo')
    #h5_obj.create_group('Beru')
    #
    # add Yo images:
    for k in ks_yo:
        h5_obj.create_dataset(f'TheYo/im_{k}', data=iio.imread(image_names[k]), compression="lzf")
    #
    # add Beru images:
    for k in ks_beru:
        h5_obj.create_dataset(f'Beru/im_{k}', data=iio.imread(image_names[k]), compression="lzf")
    #

HDF5 Plugin

Additional compression filters are available from HDF5 Plugin, amongst other sources. These filters are made available by importing the hdf5plugin Python module, e.g., import hdf5plugin.

It may be necessary to install the hdf5plugin package: https://hdf5plugin.readthedocs.io/en/stable/install.html

Typically, hdf5plugin can be installed via pip or conda; for conda, it may be necessary to add the conda-forge channel, e.g.:

conda config --add channels conda-forge
conda install hdf5plugin

or for a one-time use of that channel,

conda install -c conda-forge hdf5plugin

or

pip install hdf5plugin

On Stanford Research Computing’s Sherlock platform, this is likely:

pip3 install --user hdf5plugin

Note that hdf5plugin supports a variety of compression algorithms; see their docs for details:

http://www.silx.org/doc/hdf5plugin/latest/usage.html

Note that, presumably because hdf5plugin is developed by a third party and so is not intrinsically integrated with h5py, the syntax is a little different than when using the built-in options compression="gzip" or compression="lzf".

import hdf5plugin
#
h5_images = 'yo_images_bzip.h5'
#ks_yo = [0,1,2]
#ks_beru = [3,4]
#
h5_images_fp = os.path.join(img_path, h5_images)
#
with h5py.File(h5_images_fp, 'w') as h5_obj:
    #
    # Create group(s). Note that we do not necessarily need to add the groups in advance; we can
    # create them by providing a full(er) path to create_dataset().
    h5_obj.create_group('TheYo')
    #h5_obj.create_group('Beru')
    #
    # add Yo images:
    for k in ks_yo:
        h5_obj.create_dataset(f'TheYo/im_{k}', data=iio.imread(image_names[k]),
                              **hdf5plugin.BZip2(blocksize=5))
    #
    # add Beru images:
    for k in ks_beru:
        h5_obj.create_dataset(f'Beru/im_{k}', data=iio.imread(image_names[k]),
                             **hdf5plugin.BZip2(blocksize=5))
    #
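
One caution worth noting: reading these data back also requires the plugin filters to be registered, so import hdf5plugin before opening the file in any downstream script; otherwise the read will fail. For example:

# plugin filters are needed for decompression too, so import hdf5plugin before reading
import hdf5plugin
with h5py.File(h5_images_fp, 'r') as h5_obj:
    print(h5_obj['TheYo/im_0'].shape)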

Performance and iNodes

Compression performance

For the examples provided here, we observe approximately:

  • Raw PNG: 840 KB
  • Raw HDF5: 1.5 MB
  • HDF5, gz9: 940 KB
  • HDF5, lzf: 1.2 MB
  • HDF5, bz2: 794 KB

Even with compression, or at least with the compression options demonstrated here, the total file size of the HDF5 object is greater than the combined size of the original .png images. On the other hand, all of the compressed variants aggregate the data into a single file, and so represent only a single “inode” on the filesystem.
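
Those figures came from the author's run; for your own files, the comparison is easy to script (file names taken from the examples above):

# compare on-disk sizes of the HDF5 variants created above
for fname in ['yo_images.h5', 'yo_images_gz9.h5', 'yo_images_lzf.h5', 'yo_images_bzip.h5']:
    fp = os.path.join(img_path, fname)
    print(f'{fname}: {os.path.getsize(fp)/1024:.0f} KB')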

iNodes

An “inode” is a file-like object: a file, directory, link, etc. Generally speaking, since each inode consists of metadata describing itself, it constitutes data, and so counts against the total storage capacity of the filesystem. Additionally, and perhaps more significantly, most filesystems spend more time and compute power finding a file than reading it; there is typically an intrinsic latency to locating and opening a file on disk.

iNode quotas

For any filesystem, even your private lab machine or personal laptop, inode restrictions exist by the simple fact that inodes cost storage space, in the form of metadata, and the configuration of the disk (filesystem) explicitly defines some finite number of inodes that can be supported.

On shared filesystems, especially on HPC systems, including SRC’s Sherlock and Oak platforms, inode quotas are explicitly defined and enforced. For example, Oak quotas are restricted to 150k inodes/TB (iTB); Sherlock’s $SCRATCH system sets a quota of 200k iTB. This means that the mean file size on those systems should be greater than about 6.7 MB and 5 MB, respectively (see the quick check below).
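
As a back-of-the-envelope check of those numbers:

# mean file size implied by an inode quota (quota values from the text above)
for name, inodes_per_tb in [('Oak', 150_000), ('Sherlock $SCRATCH', 200_000)]:
    print(f'{name}: {1e12/inodes_per_tb/1e6:.1f} MB mean file size')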

Note, then, that data stored as thousands of small files, for example <3.5 MB/file, effectively displace much more space, as a percentage of an allocated quota, than their data volume alone would suggest; in this case, approximately twice as much. In this respect, even if HDF5 compression underperforms image-format compression, consolidating inodes still increases overall capacity.

Compute performance

Input-output (IO) performance is almost always significantly compromised by storing data as large numbers of small files. One way to describe this, from a computer-engineering perspective, is that the filesystem typically does not perform well as a data-indexing system. Conversely, it is almost always true that filesystem, and therefore overall compute, performance can be significantly improved by consolidating large numbers of small data files into a (much) smaller number of files, especially if the consolidated format supports indexing.

This can be especially true for ML and AI applications that read in large numbers of images, or other data collections, and especially if those data are loaded in only one or a few passes.

As suggested here, a simple and effective option is HDF5. Various relational and federated databases might also be considered, depending on available resources, expected applications, and other factors.