# FAQ

## Usage questions

### I’m an Xarray user but unfamiliar with Zarr/Cloud - might I still want this?
Potentially yes.
Let’s say you have a bunch of archival files (e.g. netCDF) which together tile along one or more dimensions to form a large dataset. Let’s also imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don’t then read the xarray docs page on combining data.)
```python
import xarray as xr

# open_mfdataset does a lot of checks, so can take a while
ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='h5netcdf',
    combine='nested',
)
ds  # the complete lazy xarray dataset
```
However, you don’t want to run this set of xarray operations every time you open this dataset, as running commands like `xr.open_mfdataset` can be expensive. Instead you would prefer to just be able to open a single pre-saved virtual store that points to all your data, as that would open instantly (using `xr.open_dataset('my_virtual_store.zarr')`), but still give access to the same data underneath.
VirtualiZarr aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.

You can think of this as effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store, xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly.
> **Note:** This means you should not change or add to any of the files comprising the store once created. If you want to make changes or add new data, you should look into using Icechunk instead.
As Zarr can read data that lives on filesystems too, this can be useful even if you don’t plan to put your data in the cloud. You can create the virtual store once (e.g. as soon as your HPC simulation finishes) and then opening that dataset will be much faster than using `open_mfdataset` each time.
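For example, if you have already written the combined references to disk in the Kerchunk JSON format (see "How does this actually work?" below for how they are created), reopening the virtual store later can go through the standard fsspec reference filesystem machinery. This is a minimal sketch, not the only way to read references: the filename `my_virtual_store.json` is hypothetical, and it assumes the references were saved as Kerchunk JSON rather than to a Zarr or Icechunk store.

```python
import xarray as xr

# Requires fsspec and kerchunk to be installed.
# Reopening the pre-saved virtual store is fast: no per-file consistency checks are needed,
# and chunk reads are redirected to byte ranges inside the original archival files.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "my_virtual_store.json"},
    },
)
```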
### Is this compatible with Icechunk?
Very much so! VirtualiZarr allows you to ingest data as virtual references and write those references into an Icechunk Store. See the Icechunk documentation on creating virtual datasets.
In general, once the Icechunk specification reaches a stable v1.0, we would recommend using that over Kerchunk’s references format, in order to take advantage of transactional updates, version-controlled history, and faster access speeds.
### I have already Kerchunked my data, do I have to redo that?
No - you can simply open the Kerchunk-formatted references you already have into VirtualiZarr directly. Then you can manipulate them, or re-save them into a new format, such as Icechunk:
```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('refs.json')
# vds = open_virtual_dataset('refs.parq')  # kerchunk parquet files are supported too

# icechunkstore is an existing Icechunk store object
vds.virtualize.to_icechunk(icechunkstore)
```
### I already have some data in Zarr, do I have to resave it?
No! VirtualiZarr can (well, soon will be able to) create virtual references pointing to existing Zarr stores in the same way as for other file formats.
### Can I add a new reader for my custom file format?
There are a lot of archival file formats which could potentially be represented as virtual zarr references (see this issue listing some examples). VirtualiZarr ships with some readers for common formats (e.g. netCDF/HDF5), but you may want to write your own reader for some other file format.
VirtualiZarr is designed to make this as straightforward as possible. If you want to do this, then this comment will be helpful.
You can also use this approach to write a reader that starts from a kerchunk-formatted virtual references dict.
Currently, if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow third-party libraries to plug in via an entrypoint (see issue #245).
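To give a rough feel for what a reader ultimately needs to produce, here is a sketch that builds the relevant in-memory objects by hand. Treat the import paths and constructor signatures for `ZArray`, `ChunkManifest`, and `ManifestArray` as assumptions based on VirtualiZarr v1, and the file path, offsets, lengths, and array metadata as made-up illustration values:

```python
import numpy as np
import xarray as xr
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray

# Zarr-style array metadata for one variable in the custom file format (hypothetical values)
zarray = ZArray(shape=(4, 10), chunks=(2, 10), dtype=np.dtype("float32"))

# Record where each chunk's bytes live inside the original file (hypothetical offsets/lengths)
manifest = ChunkManifest(
    entries={
        "0.0": {"path": "/my/custom_file.bin", "offset": 0, "length": 80},
        "1.0": {"path": "/my/custom_file.bin", "offset": 80, "length": 80},
    }
)

# Wrap the metadata and manifest into a lazy ManifestArray, then into an xarray.Dataset
marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)
var = xr.Variable(dims=["y", "x"], data=marr)
vds = xr.Dataset({"my_variable": var})
```

A real reader would extract exactly this kind of information by parsing the file format’s own metadata; once it returns such a dataset, xarray’s combining machinery can treat it like any other virtual dataset.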
### How does this actually work?
I’m glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of archival files in some other format as a series of steps:
1. **Read byte ranges** - We use various virtualizarr readers to determine which byte ranges within a given archival file would have to be read in order to get a specific chunk of data we want. Several of these readers work by calling one of the kerchunk file format backends and parsing the output.
2. **Construct a representation of a single file (or array within a file)** - Kerchunk’s backends return a nested dictionary representing an entire file, but we instead immediately parse this dict and wrap it up into a set of `ManifestArray` objects. The record of where to look to find the file and the byte ranges is stored under the `ManifestArray.manifest` attribute, in a `ChunkManifest` object. Both steps (1) and (2) are handled by the `virtualizarr.open_virtual_dataset` function, which returns one `xarray.Dataset` object for the given file, wrapping multiple `ManifestArray` instances (as opposed to e.g. numpy/dask arrays).
3. **Deduce the concatenation order** - The desired order of concatenation can either be inferred from the order in which the datasets are supplied (which is what `xr.combine_nested` assumes), or it can be read from the coordinate data in the files (which is what `xr.combine_by_coords` does). If the ordering information is not present as a coordinate (e.g. because it’s in the filename), a pre-processing step might be required.
4. **Check that the desired concatenation is valid** - Whether called explicitly by the user or implicitly via `xr.combine_nested`/`combine_by_coords`/`open_mfdataset`, `xr.concat` is used to concatenate/stack the wrapped `ManifestArray` objects. When doing this xarray will spend time checking that the array objects and any coordinate indexes can be safely aligned and concatenated. Along with opening files and loading coordinates in step (3), this is the main reason why `xr.open_mfdataset` can take a long time to return a dataset created from a large number of files.
5. **Combine into one big dataset** - `xr.concat` dispatches to the `concat`/`stack` methods of the underlying `ManifestArray` objects. These perform concatenation by merging their respective chunk manifests. Using xarray’s `combine_*` methods means that we can handle multi-dimensional concatenations as well as merging many different variables.
6. **Serialize the combined result to disk** - The resultant `xr.Dataset` object wraps `ManifestArray` objects which contain the complete list of byte ranges for every chunk we might want to read. We now serialize this information to disk, either using the Kerchunk specification or the Icechunk specification (a short code sketch of steps (1)-(6) follows this list).
7. **Open the virtualized dataset from disk** - The virtualized zarr store can now be read from disk, avoiding redoing all the work we did above and instead just opening all the virtualized data immediately. Chunk reads will be redirected to read the corresponding bytes in the original archival files.
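A minimal sketch of steps (1)-(6) follows, assuming two hypothetical netCDF files that tile along a `time` dimension; the combine keyword arguments and the `.virtualize.to_kerchunk` accessor follow VirtualiZarr v1’s usage documentation, but the exact signatures may differ between versions:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Steps (1) & (2): build a ManifestArray-backed virtual dataset for each archival file
vds1 = open_virtual_dataset("file1.nc")
vds2 = open_virtual_dataset("file2.nc")

# Steps (3), (4) & (5): xarray deduces and checks the concatenation,
# then the ManifestArray concat/stack methods merge the chunk manifests
combined_vds = xr.combine_nested(
    [vds1, vds2],
    concat_dim="time",
    coords="minimal",
    compat="override",
)

# Step (6): serialize the combined byte-range references to disk (Kerchunk JSON here)
combined_vds.virtualize.to_kerchunk("combined_refs.json", format="json")
```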
The above steps could also be performed using the `kerchunk` library alone, but because steps (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, kerchunk’s design is much less modular, and the use cases are limited by kerchunk’s API surface.
### How do VirtualiZarr and Kerchunk compare?
You have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
Users of Kerchunk may find the following comparison table useful, which shows which features of Kerchunk map on to which features of VirtualiZarr.
| Component / Feature | Kerchunk | VirtualiZarr |
|---|---|---|
| **Generation of references from archival files (1)** | | |
| From a netCDF4/HDF5 file | | |
| From a netCDF3 file | | |
| From a COG / tiff file | | |
| From a Zarr v2 store | | |
| From a Zarr v3 store | ❌ | |
| From a GRIB2 file | | |
| From a FITS file | | |
| From a HDF4 file | | |
| From a DMR++ metadata file | ❌ | |
| From existing kerchunk JSON/parquet references | | |
| **In-memory representation (2)** | | |
| In-memory representation of byte ranges for single array | Part of a “reference | |
| In-memory representation of actual data values | Encoded bytes directly serialized into the “reference | |
| In-memory representation of entire file / store | Nested “reference | |
| **Manipulation of in-memory references (3, 4 & 5)** | | |
| Combining references to multiple arrays representing different variables | | |
| Combining references to multiple arrays representing the same variable | | |
| Combining references in coordinate order | | |
| Combining along multiple dimensions without coordinate data | ❌ | |
| Dropping variables | | |
| Renaming variables | ❌ | |
| Renaming dimensions | ❌ | |
| Renaming manifest file paths | | |
| Splitting uncompressed data into chunks | | |
| Selecting specific chunks | ❌ | |
| **Parallelization** | | |
| Parallelized generation of references | Wrapping kerchunk’s opener inside | Wrapping |
| Parallelized combining of references (tree-reduce) | | Wrapping |
| **On-disk serialization (6) and reading (7)** | | |
| Kerchunk reference format as JSON | | |
| Kerchunk reference format as parquet | | |
| Zarr v3 store with | ❌ | |
| Icechunk store | ❌ | |
## Development

### Why a new project?
The reasons why VirtualiZarr has been developed as a separate project, rather than by contributing to the Kerchunk library upstream, are:

- Kerchunk aims to support non-Zarr-like formats too (1) (2), whereas VirtualiZarr is more strictly scoped, and may eventually be very tightly integrated with the Zarr-Python library itself.
- Whilst some features of VirtualiZarr currently require importing Kerchunk, Kerchunk is an optional dependency, and the VirtualiZarr roadmap aims to at some point not share any code with the Kerchunk library, nor ever require importing it. (You would nevertheless still be able to write out references in the Kerchunk format!)
- The API design of VirtualiZarr is deliberately completely different to Kerchunk’s API, so integration into Kerchunk would have meant duplicated functionality.
- Refactoring Kerchunk’s existing API to maintain backwards compatibility would have been challenging.
### What is the Development Status and Roadmap?
VirtualiZarr version 1 (mostly) achieves feature parity with kerchunk’s logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.
Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.
We have a lot of ideas, including:
- “Virtual concatenation” of separate Zarr arrays
- ManifestArrays as an intermediate layer in-memory in Zarr-Python
If you see other opportunities then we would love to hear your ideas!