FAQ#
Usage questions#
Can my specific data be virtualized?#
It depends on some details of your data.
VirtualiZarr works by mapping your data from whatever data model the original format uses to the zarr data model. This means that if your data contains anything that cannot be represented within the zarr data model, it cannot be virtualized.
When virtualizing multi-file datasets, it is sometimes possible to virtualize one file but not to virtualize all the files together as part of one datacube, because of inconsistencies between the files. The following restrictions apply across every file in the datacube you wish to create!
- Recognized format - Firstly, there must be a VirtualiZarr reader that understands how to parse the file format your data is in. The VirtualiZarr package ships with readers for a number of common formats; if your data is not supported you may first have to write your own dedicated VirtualiZarr reader which understands your format.
- Rectilinear arrays - The zarr data model is one of a set of rectilinear arrays, so your data must be decodable as a set of rectilinear arrays, each of which will map to a single zarr array (via the `ManifestArray` class). If your data cannot be directly mapped to a rectilinear array, for example because it has inconsistent lengths along a common dimension (known as "ragged data"), then it cannot be virtualized.
- Homogeneous chunk shapes - The zarr data model assumes that every chunk of data in a single array has the same chunk shape. For multi-file datasets each chunk often corresponds to (part of) one file, so if your files do not all have consistent chunking your data cannot be virtualized. This is a big restriction, and there are plans to relax it in future by adding support for variable-length chunks to the zarr data model.
- Homogeneous codecs - The zarr data model assumes that every chunk of data in a single array uses the same set of codecs for compression etc. For multi-file datasets each chunk often corresponds to (part of) one file, so if your files do not all have consistent compression or other codecs your data cannot be virtualized. This is another big restriction, and there are also plans to relax it in the future.
- Registered codecs - The codecs needed to decompress and deserialize your data must be known to zarr. This might require defining and registering a new zarr codec.
- Homogeneous data types - The zarr data model assumes that every chunk of data in a single array decodes to the same data type (i.e. dtype). For multi-file datasets each chunk often corresponds to (part of) one file, so if your files do not all have consistent data types your data cannot be virtualized. This is arguably inherent to the concept of what an array is.
- Registered data types - The dtype of your data must be known to zarr. This might require registering a new zarr data type.
If you attempt to use virtualizarr to create virtual references for data which violates any of these restrictions, it should raise an informative error telling you why it’s not possible.
Sometimes you can get around some of these restrictions for specific variables by loading them into memory instead of virtualizing them - see the section in the usage docs about loadable variables.
I’m an Xarray user but unfamiliar with Zarr/Cloud - might I still want this?#
Potentially yes.
Let’s say you have a bunch of archival files (e.g. netCDF) which together tile along one or more dimensions to form a large dataset. Let’s also imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don’t then read the xarray docs page on combining data.)
```python
import xarray as xr

# open_mfdataset does a lot of checks, so can take a while
ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='h5netcdf',
    combine='nested',
)
ds  # the complete lazy xarray dataset
```
However, you don't want to run this set of xarray operations every time you open this dataset, as running commands like `xr.open_mfdataset` can be expensive.
Instead you would prefer to just be able to open a single pre-saved virtual store that points to all your data, as that would open instantly (using `xr.open_dataset('my_virtual_store.zarr')`), but still give access to the same data underneath.
VirtualiZarr aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.
You can think of this as effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store, xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly.
Note: This means you should not change or add to any of the files comprising the store once created. If you want to make changes or add new data, you should look into using Icechunk instead.
As Zarr can read data that lives on filesystems too, this can be useful even if you don't plan to put your data in the cloud. You can create the virtual store once (e.g. as soon as your HPC simulation finishes) and then opening that dataset will be much faster than using `open_mfdataset` each time.
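As a rough sketch (assuming the kerchunk xarray backend is installed; the filenames, the choice of combine keywords, and writing to the kerchunk JSON format rather than another virtual store format are all illustrative assumptions):

```python
import glob

import xarray as xr
from virtualizarr import open_virtual_dataset

# Do the expensive opening and combining once, on virtual datasets
vds_list = [open_virtual_dataset(path) for path in sorted(glob.glob('/my/files*.nc'))]
combined_vds = xr.combine_nested(vds_list, concat_dim='time', coords='minimal', compat='override')

# Cache the combined references to disk as a virtual store
combined_vds.virtualize.to_kerchunk('my_virtual_store.json', format='json')

# Later: open the pre-saved virtual store near-instantly
ds = xr.open_dataset('my_virtual_store.json', engine='kerchunk')
```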
Is this compatible with Icechunk?#
Very much so! VirtualiZarr allows you to ingest data as virtual references and write those references into an Icechunk Store. See the Icechunk documentation on creating virtual datasets.
In general, once the Icechunk specification reaches a stable v1.0 we would recommend using it over Kerchunk's references format, in order to take advantage of transactional updates, version-controlled history, and faster access speeds.
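As a minimal sketch (assuming a recent Icechunk release; the repository/session API shown here may differ between Icechunk versions, and reading the virtual references back may additionally require configuring virtual chunk containers, so check the Icechunk docs):

```python
import icechunk
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('file1.nc')  # hypothetical filename

# Create a local Icechunk repository and a writable session
storage = icechunk.local_filesystem_storage('./my_icechunk_repo')
repo = icechunk.Repository.create(storage)
session = repo.writable_session('main')

# Write the virtual references into the Icechunk store, then commit
vds.virtualize.to_icechunk(session.store)
session.commit('add virtual references')
```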
I have already Kerchunked my data, do I have to redo that?#
No - you can simply open the Kerchunk-formatted references you already have into VirtualiZarr directly. Then you can manipulate them, or re-save them into a new format, such as Icechunk:
```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('refs.json')
# vds = open_virtual_dataset('refs.parq')  # kerchunk parquet files are supported too
vds.virtualize.to_icechunk(icechunkstore)
```
I already have some data in Zarr, do I have to resave it?#
No! VirtualiZarr can create virtual references pointing to existing Zarr stores in the same way as for other file formats. Note: Currently only reading Zarr V3 is supported.
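For example, a minimal sketch (the store path is hypothetical, and depending on your VirtualiZarr version you may need to indicate the file type explicitly):

```python
from virtualizarr import open_virtual_dataset

# Create virtual references pointing at the chunks of an existing Zarr v3 store
vds = open_virtual_dataset('existing_store.zarr')
```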
Can I add a new reader for my custom file format?#
There are a lot of archival file formats which could potentially be represented as virtual zarr references (see this issue listing some examples). VirtualiZarr ships with some readers for common formats (e.g. netCDF/HDF5), but you may want to write your own reader for some other file format.
VirtualiZarr is designed in a way to make this as straightforward as possible. If you want to do this then this comment will be helpful.
You can also use this approach to write a reader that starts from a kerchunk-formatted virtual references dict.
Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see issue #245).
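At its core, a reader's job is to record the byte range of every chunk of every array in a ChunkManifest, wrap each manifest (together with the array's metadata) in a ManifestArray, and assemble those into an xarray.Dataset. A very rough sketch of the first part (the chunk keys, paths, offsets and lengths are made up for illustration, and the exact constructor signature may vary between VirtualiZarr versions):

```python
from virtualizarr.manifests import ChunkManifest

# Map each zarr chunk key ("0.0", "0.1", ...) to the location of its bytes
manifest = ChunkManifest(
    entries={
        "0.0": {"path": "s3://bucket/data.bin", "offset": 100, "length": 400},
        "0.1": {"path": "s3://bucket/data.bin", "offset": 500, "length": 400},
    }
)
```

Each ManifestArray built from such a manifest can then be placed into an xarray.Dataset variable, just as the built-in readers do.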
Why would I want to load variables using loadable_variables?#
Loading variables can be useful in a few scenarios (see the example after this list):
- You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
- You want in-memory indexes to use with `xr.combine_by_coords`,
- Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
- The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding,
- Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data). Loading them allows you to rechunk them however you like.
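For example, a minimal sketch of loading a few small coordinate variables while keeping the data variables virtual (the filename and variable names are hypothetical):

```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
    'file1.nc',
    loadable_variables=['time', 'lat', 'lon'],  # loaded into memory as real arrays
)
```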
How does this actually work?#
I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of archival files in some other format as a series of steps (see the short code sketch after this list):
1. Read byte ranges - We use various virtualizarr readers to determine which byte ranges within a given archival file would have to be read in order to get a specific chunk of data we want. Several of these readers work by calling one of the kerchunk file format backends and parsing the output.
2. Construct a representation of a single file (or array within a file) - Kerchunk's backends return a nested dictionary representing an entire file, but we instead immediately parse this dict and wrap it up into a set of `ManifestArray` objects. The record of where to look to find the file and the byte ranges is stored under the `ManifestArray.manifest` attribute, in a `ChunkManifest` object. Both steps (1) and (2) are handled by `virtualizarr.open_virtual_dataset`, which returns one `xarray.Dataset` object for the given file, wrapping multiple `ManifestArray` instances (as opposed to e.g. numpy/dask arrays).
3. Deduce the concatenation order - The desired order of concatenation can either be inferred from the order in which the datasets are supplied (which is what `xr.combine_nested` assumes), or it can be read from the coordinate data in the files (which is what `xr.combine_by_coords` does). If the ordering information is not present as a coordinate (e.g. because it's in the filename), a pre-processing step might be required.
4. Check that the desired concatenation is valid - Whether called explicitly by the user or implicitly via `xr.combine_nested`/`combine_by_coords`/`open_mfdataset`, `xr.concat` is used to concatenate/stack the wrapped `ManifestArray` objects. When doing this xarray will spend time checking that the array objects and any coordinate indexes can be safely aligned and concatenated. Along with opening files, and loading coordinates in step (3), this is the main reason why `xr.open_mfdataset` can take a long time to return a dataset created from a large number of files.
5. Combine into one big dataset - `xr.concat` dispatches to the `concat`/`stack` methods of the underlying `ManifestArray` objects. These perform concatenation by merging their respective chunk manifests. Using xarray's `combine_*` methods means that we can handle multi-dimensional concatenations as well as merging many different variables.
6. Serialize the combined result to disk - The resultant `xr.Dataset` object wraps `ManifestArray` objects which contain the complete list of byte ranges for every chunk we might want to read. We now serialize this information to disk, either using the Kerchunk specification or the Icechunk specification.
7. Open the virtualized dataset from disk - The virtualized zarr store can now be read from disk, avoiding redoing all the work we did above and instead just opening all the virtualized data immediately. Chunk reads will be redirected to read the corresponding bytes in the original archival files.
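To make steps (1), (2) and (5) concrete, here is a rough sketch of inspecting the intermediate objects (the filenames and variable name are hypothetical, and the exact attribute/method spellings may differ between VirtualiZarr versions):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Steps (1) and (2): each virtual dataset wraps ManifestArray objects
vds1 = open_virtual_dataset('day1.nc')
vds2 = open_virtual_dataset('day2.nc')
marr = vds1['air'].data           # a ManifestArray, not a numpy/dask array
print(marr.manifest.dict())       # chunk key -> {"path": ..., "offset": ..., "length": ...}

# Steps (3)-(5): concatenating the virtual datasets merges their chunk manifests
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
```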
The above steps could also be performed using the kerchunk library alone, but because (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, kerchunk's design is much less modular, and the use cases are limited by kerchunk's API surface.
How do VirtualiZarr and Kerchunk compare?#
You have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
Users of Kerchunk may find the following comparison table useful, which shows which features of Kerchunk map on to which features of VirtualiZarr.
| Component / Feature | Kerchunk | VirtualiZarr |
|---|---|---|
| **Generation of references from archival files (1)** | | |
| From a netCDF4/HDF5 file | ✅ | ✅ |
| From a netCDF3 file | ✅ | ✅ |
| From a COG / tiff file | ✅ | |
| From a Zarr v2 store | ✅ | ❌ |
| From a Zarr v3 store | | ✅ |
| From a GRIB2 file | ✅ | |
| From a FITS file | ✅ | |
| From a HDF4 file | | |
| From a DMR++ metadata file | ❌ | ✅ |
| From existing kerchunk JSON/parquet references | ✅ | ✅ |
| **In-memory representation (2)** | | |
| In-memory representation of byte ranges for single array | Part of a "reference dict" | `ManifestArray` |
| In-memory representation of actual data values | Encoded bytes directly serialized into the "reference dict" | `numpy` arrays (via `loadable_variables`) |
| In-memory representation of entire file / store | Nested "reference dict" | `xarray.Dataset` wrapping `ManifestArray` objects |
| **Manipulation of in-memory references (3, 4 & 5)** | | |
| Combining references to multiple arrays representing different variables | ✅ via `MultiZarrToZarr` | ✅ via `xr.merge` |
| Combining references to multiple arrays representing the same variable | ✅ via `MultiZarrToZarr` | ✅ via `xr.concat` |
| Combining references in coordinate order | ✅ via `MultiZarrToZarr` | ✅ via `xr.combine_by_coords` |
| Combining along multiple dimensions without coordinate data | ❌ | ✅ via `xr.combine_nested` |
| Dropping variables | ✅ | ✅ |
| Renaming variables | ❌ | ✅ |
| Renaming dimensions | ❌ | ✅ |
| Renaming manifest file paths | ✅ | ✅ |
| Splitting uncompressed data into chunks | ✅ | |
| Selecting specific chunks | ❌ | |
| **Parallelization** | | |
| Parallelized generation of references | Wrapping kerchunk's opener inside `dask.delayed` | Wrapping `open_virtual_dataset` inside `dask.delayed` |
| Parallelized combining of references (tree-reduce) | | Wrapping … |
| **On-disk serialization (6) and reading (7)** | | |
| Kerchunk reference format as JSON | ✅ | ✅ |
| Kerchunk reference format as parquet | ✅ | ✅ |
| Zarr v3 store with chunk manifests | ❌ | |
| Icechunk store | ❌ | ✅ |
Development#
Why a new project?#
The reasons why VirtualiZarr has been developed as a separate project, rather than by contributing to the Kerchunk library upstream, are:
- Kerchunk aims to support non-Zarr-like formats too (1) (2), whereas VirtualiZarr is more strictly scoped, and may eventually be very tightly integrated with the Zarr-Python library itself.
- Whilst some features of VirtualiZarr currently require importing Kerchunk, Kerchunk is an optional dependency, and the VirtualiZarr roadmap aims to eventually share no code with the Kerchunk library, nor ever require importing it. (You would nevertheless still be able to write out references in the Kerchunk format!)
- The API design of VirtualiZarr is deliberately completely different to Kerchunk's API, so integration into Kerchunk would have meant duplicated functionality.
- Refactoring Kerchunk's existing API to maintain backwards compatibility would have been challenging.
What is the Development Status and Roadmap?#
VirtualiZarr version 1 (mostly) achieves feature parity with kerchunk’s logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.
Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.
We have a lot of ideas, including:
- "Virtual concatenation" of separate Zarr arrays
- ManifestArrays as an in-memory intermediate layer in Zarr-Python
If you see other opportunities then we would love to hear your ideas!