Design of the Versioned HDF5 Library
In a previous post, we introduced the Versioned HDF5 library and described some of its features. In this post, we'll go into detail on how the underlying design of the library works on a technical level.
Versioned HDF5 is a library that wraps h5py and offers a versioned abstraction for HDF5 groups and datasets. Versioned HDF5 works fundamentally as a copy-on-write system. The basic idea of copy-on-write is that all data is effectively immutable in the backend. Whenever a high-level representation of data is modified, it is copied to a new location in the backend, leaving the original version intact. Any references to the original will continue to point to it.
At a high level, in a versioned system built with copy-on-write, all data in the system is stored in immutable blobs in the backend. The immutability of these blobs is often enforced by making them content-addressable, where the blobs are always referred to in the system by a cryptographic hash of their contents. Cryptographic hashes form a mapping from any blob of data to a fixed set of bytes, which is effectively one-to-one, meaning if two blobs have the same hash, they must be exactly the same data. This means that it is impossible to mutate a blob of data in-place. Doing so would change its hash, which would make it a different blob, since blobs are referenced only by their hash.
Whenever data for a version is committed, its data is stored as blobs in the backend. It may be put into a single blob, or split into multiple blobs. If it is split, a way to reconstruct the data from the blobs is stored. If a later version modifies that data, any blobs that are different are stored as new blobs. If the data is the same, the blobs will also be the same, and hence will not be written to twice, because they will already exist in the backend.
At a high level, this is how the git version control system works, for example (this is a good talk on the internals of git). It is also how versioning constructs in some modern filesystems like APFS and Btrfs.
Versioned HDF5 Implementation
In Versioned HDF5, this idea is implemented using two key HDF5 primitives: chunks and virtual datasets.
In HDF5, datasets are split into multiple chunks. Each chunk is of equal size, which is configurable, although some chunks may not be completely full. A chunk is the smallest part of a dataset that HDF5 operates on. Whenever a subset of a dataset is to be read, the entire chunk containing that dataset is read into memory. Picking an optimal chunk size is a nontrivial task, and depends on things such as the size of your L1 cache and the typical shape of your dataset. Furthermore, in Versioned HDF5 a chunk is the smallest amount of data that is stored only once across versions if it has not changed. If the chunk size is too small, it would affect performance, as operations would require reading and writing more chunks, but if it is too large, it would make the resulting versioned file unnecessarily large, as changing even a single element of a chunk requires rewriting the entire chunk. Versioned HDF5 does not presently contain any logic for automatically picking a chunk size. The pytables documentation has some tips on picking an optimal chunk size.
Because chunks are the most basic HDF5 primitive, Versioned HDF5 uses them as the underlying blobs for storage. This way operations on data can be as performant as possible.
Virtual datasets are a special kind of dataset that reference data from other datasets in a seamless way. The data from each part of a virtual dataset comes from another dataset. HDF5 does this seamlessly, so that a virtual dataset appears to be a normal dataset.
The basic design of Versioned HDF5 is this: whenever a dataset is created for
the first time (the first version containing the dataset), it is split into
chunks. The data in each chunk is hashed and stored in a hash table. The
unique chunks are then appended into a raw_data
dataset corresponding to the
dataset. Finally, a virtual dataset is made that references the corresponding
chunks in the raw dataset to recreate the original dataset. When later
versions modify this dataset, each modified chunk is appended to the raw
dataset, and a new virtual dataset is created pointing to corresponding
chunks.
For example, say we start with the first version, version_1
, and create a
dataset my_dataset
with n
chunks. The dataset chunks will be written into the
raw dataset, and the final virtual dataset will point to those chunks.
If we then create a version version_2
based off version_1
, and modify only
data contained in CHUNK 2, that new data will be appended to the raw dataset,
and the resulting virtual dataset for version_2
will look like this:
Since both versions 1 and 2 of my_dataset
have identical data in chunks other than
CHUNK 2, they both point to the exact same data in raw_data
. Thus, the
underlying HDF5 file only stores the data in version 1 of my_dataset
once, and
only the modified chunks from version_2
's my_dataset
are stored on top of that.
All extra metadata, such as attributes, is stored on the virtual dataset. Since virtual datasets act exactly like real datasets and operate at the HDF5 level, each version is a real group in the HDF5 file that is exactly that version. However, these groups should be treated as read-only, and you should never access them outside of the Versioned HDF5 API (see below).
HDF5 File Layout
Inside of the HDF5 file, there is a special _versioned_data
group that holds
all the internal data for Versioned HDF5. This group contains a versions
group, which contains groups for each version that has been created. It also
contains a group for each dataset that exists in a version. These groups each
contain two datasets, hash_table
, and raw_data
.
For example, consider a Versioned HDF5 file that contains two versions,
version1
, and version2
, with datasets data1
and data2
. Suppose also
that data1
exists in both versions and data2
only exists in version2
.
The HDF5 layout would look like this
/_versioned_data/
├── data1/
│ ├── hash_table
│ └── raw_data
│
├── data2/
│ ├── hash_table
│ └── raw_data
│
└── versions/
├── __first_version__/
│
├── version1/
│ └── data1
│
└── version2/
├── data1
└── data2
__first_version__
is an empty group that exists only for internal
bookkeeping purposes.
The hash_table
datasets store the hashes for each chunk of data, so that
duplicate data will not be written twice, and the raw_data
dataset stores
the chunks for all versions of a given dataset. It is referenced by the
virtual datasets in the corresponding version groups in versions/
. For
example, the chunks for the data data1
in versions version1
and version2
are stored in _versioned_data/data1/raw_data
.
Versioned HDF5 API
The biggest challenge of this design is that the virtual datasets representing
the data in each versioned data all point to the same blobs in the backend.
However, in HDF5, if a virtual dataset is written to, it will write to the
location it points to. This is at odds with the immutable copy-on-write
design. As a consequence, Versioned HDF5 needs to wrap all the h5py APIs that
write into a dataset to disallow writing for versions that are already
committed, and to do the proper copy-on-write semantics for new versions that
are being staged. Several classes that wrap the h5py Dataset and Group objects
are present in the versioned_hdf5.wrappers
submodule. These wrappers act
just like their h5py counterparts, but do the right thing on versioned data.
The top-level versioned_hdf5.VersionedHDF5File
API returns these objects
whenever a versioned dataset is accessed. They are designed to work seamlessly
like the corresponding h5py objects.
The Versioned HDF5 library was created by the D. E. Shaw group in conjunction with Quansight.
Comments