Usage#

This page explains how to use VirtualiZarr today, by introducing the key concepts one-by-one.

Opening files as virtual datasets#

VirtualiZarr is for manipulating “virtual” references to pre-existing data stored on disk in a variety of formats, by representing it in terms of the Zarr data model of chunked N-dimensional arrays.

If we have a pre-existing netCDF file on disk,

import xarray as xr

# create an example pre-existing netCDF4 file
ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')

We can open a virtual representation of this file using open_virtual_dataset.

from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('air.nc')

(Notice we did not have to explicitly indicate the file format, as open_virtual_dataset will attempt to automatically infer it.)

Note

In future we would like for it to be possible to just use xr.open_dataset, e.g.

import virtualizarr

vds = xr.open_dataset('air.nc', engine='virtualizarr')

but this requires some upstream changes in xarray.

Printing this “virtual dataset” shows that although it is an instance of xarray.Dataset, unlike a typical xarray dataset, it does not contain numpy or dask arrays, but instead it wraps ManifestArray objects.

vds
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
    time     (time) float32 12kB ManifestArray<shape=(2920,), dtype=float32, ...
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

These ManifestArray objects are each a virtual reference to some data in the air.nc netCDF file, with the references stored in the form of “Chunk Manifests”.

Opening remote files#

To open remote files as virtual datasets pass the reader_options options, e.g.

aws_credentials = {"key": ..., "secret": ...}
vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})

Chunk Manifests#

In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.

A “Chunk Manifest” is a list of chunk keys and their corresponding byte ranges in specific files, grouped together such that all the chunks form part of one Zarr-like array. For example, a chunk manifest for a 3-dimensional array made up of 4 chunks might look like this:

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

Notice that the "path" attribute points to a netCDF file "foo.nc" stored in a remote S3 bucket. There is no need for the files the chunk manifest refers to to be local.

Our virtual dataset we opened above contains multiple chunk manifests stored in-memory, which we can see by pulling one out as a python dictionary.

marr = vds['air'].data
manifest = marr.manifest
manifest.dict()
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}

In this case we can see that the "air" variable contains only one chunk, the bytes for which live in the air.nc file at the location given by the 'offset' and 'length' attributes.

The ChunkManifest class is virtualizarr’s internal in-memory representation of this manifest.

ManifestArray class#

A Zarr array is defined not just by the location of its constituent chunk data, but by its array-level attributes such as shape and dtype. The ManifestArray class stores both the array-level attributes and the corresponding chunk manifest.

marr
ManifestArray<shape=(2920, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
marr.manifest
ChunkManifest<shape=(1, 1, 1)>
marr.zarray
ZArray(shape=(2920, 25, 53), chunks=(2920, 25, 53), dtype=int16, compressor=None, filters=None, fill_value=None)

A ManifestArray can therefore be thought of as a virtualized representation of a single Zarr array.

As it defines various array-like methods, a ManifestArray can often be treated like a “duck array”. In particular, concatenation of multiple ManifestArray objects can be done via merging their chunk manifests into one (and re-labelling the chunk keys).

import numpy as np

concatenated = np.concatenate([marr, marr], axis=0)
concatenated
ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
concatenated.manifest.dict()
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000},
 '1.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}

This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.

Note

As a single Zarr array has only one array-level set of compression codecs by definition, concatenation of arrays from files saved to disk with differing codecs cannot be achieved through concatenation of ManifestArray objects. Implementing this feature will require a more abstract and general notion of concatenation, see GH issue #5.

Remember that you cannot load values from a ManifestArray directly.

vds['air'].values
NotImplementedError: ManifestArrays can't be converted into numpy arrays or pandas Index objects

The whole point is to manipulate references to the data without actually loading any data.

Note

You also cannot currently index into a ManifestArray, as arbitrary indexing would require loading data values to create the new array. We could imagine supporting indexing without loading data when slicing only along chunk boundaries, but this has not yet been implemented (see GH issue #51).

Virtual Datasets as Zarr Groups#

The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed ManifestArray cannot store all of this information, an xarray.Dataset wrapping multiple ManifestArrays maps really nicely to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of in-memory arrays.

The problem of combining many legacy format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using open_virtual_dataset and using xarray’s various combining functions to combine them into one aggregate virtual dataset.

Combining virtual datasets#

In general we should be able to combine all the datasets from our legacy files into one using some combination of calls to xarray.concat and xarray.merge. For combining along multiple dimensions in one call we also have xarray.combine_nested and xarray.combine_by_coords. If you’re not familiar with any of these functions we recommend you skim through xarray’s docs on combining.

Let’s create two new netCDF files, which we would need to open and concatenate in a specific order to represent our entire dataset.

ds1 = ds.isel(time=slice(None, 1460))
ds2 = ds.isel(time=slice(1460, None))

ds1.to_netcdf('air1.nc')
ds2.to_netcdf('air2.nc')

Note that we have created these in such a way that each dataset has one equally-sized chunk.

TODO: Note about variable-length chunking?

Manual concatenation ordering#

The simplest case of concatenation is when you have a set of files and you know which order they should be concatenated in, without looking inside the files. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.

We can actually avoid creating any xarray indexes, as we won’t need them. Without indexes we can avoid loading any data whatsoever from the files, making our opening and combining much faster than it normally would be. Therefore if you can do your combining manually you should. However, you should first be confident that the legacy files actually do have compatible data, as only the array shapes and dimension names will be checked for consistency.

You can specify that you don’t want any indexes to be created by passing indexes={} to open_virtual_dataset.

vds1 = open_virtual_dataset('air1.nc', indexes={})
vds2 = open_virtual_dataset('air2.nc', indexes={})

We can see that the datasets have no indexes.

vds1.indexes
Indexes:
    *empty*

As we know the correct order a priori, we can just combine along one dimension using xarray.concat.

combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
combined_vds
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
    time     (time) float32 12kB ManifestArray<shape=(2920,), dtype=float32, ...
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

We can see that the resulting combined manifest has two chunks, as expected.

combined_vds['air'].data.manifest.dict()
{'0.0.0': {'path': 'air1.nc', 'offset': 15419, 'length': 3869000},
 '1.0.0': {'path': 'air2.nc', 'offset': 15419, 'length': 3869000}}

Note

The keyword arguments coords='minimal', compat='override' are currently necessary because the default behaviour of xarray will attempt to load coordinates in order to check their compatibility with one another. In future this default will be changed, such that passing these two arguments explicitly will become unnecessary.

The general multi-dimensional version of this concatenation-by-order-supplied can be achieved using xarray.combine_nested.

combined_vds = xr.combine_nested([vds1, vds2], concat_dim=['time'], coords='minimal', compat='override')

In N-dimensions the datasets would need to be passed as an N-deep nested list-of-lists, see the xarray docs.

Note

In future we would like for it to be possible to just use xr.open_mfdataset to open the files and combine them in one go, e.g.

vds = xr.open_mfdataset(
    ['air1.nc', 'air2.nc'],
    combine='nested',
    concat_dim=['time'],
    coords='minimal',
    compat='override',
    indexes={},
)

but this requires some upstream changes in xarray.

Automatic ordering using coordinate data#

TODO: Reinstate this part of the docs once GH issue #18 is properly closed.

Automatic ordering using metadata#

TODO: Use preprocess to create a new index from the metadata

Loading variables#

Whilst the values of virtual variables (i.e. those backed by ManifestArray objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like xr.open_dataset normally returns. These variables are specified using the loadable_variables argument:

vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'])
<xarray.Dataset> Size: 31MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

You can see that the dataset contains a mixture of virtual variables backed by ManifestArray objects, and loadable variables backed by (lazy) numpy arrays.

Loading variables can be useful in a few scenarios:

  1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,

  2. Storing a variable on-disk as a set of references would be inefficient, e.g. because it’s a very small array (saving the values like this is similar to kerchunk’s concept of “inlining” data),

  3. The variable has encoding, and the simplest way to decode it correctly is to let xarray’s standard decoding machinery load it into memory and apply the decoding.

Writing virtual stores to disk#

Once we’ve combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.

Writing to Kerchunk’s format and reading via fsspec#

The kerchunk library has its own specification for how byte range references should be serialized (either as a JSON or parquet file).

To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the ds.virtualize.to_kerchunk accessor method.

combined_vds.virtualize.to_kerchunk('combined.json', format='json')

These references can now be interpreted like they were a Zarr store by fsspec, using kerchunk’s built-in xarray backend (kerchunk must be installed to use engine='kerchunk').

combined_ds = xr.open_dataset('combined.json', engine="kerchunk")

In-memory (“loadable”) variables backed by numpy arrays can also be written out to kerchunk reference files, with the values serialized as bytes. This is equivalent to kerchunk’s concept of “inlining”, but done on a per-array basis using the loadable_variables kwarg rather than a per-chunk basis using kerchunk’s inline_threshold kwarg.

Note

Currently you can only serialize in-memory variables to kerchunk references if they do not have any encoding.

When you have many chunks, the reference file can get large enough to be unwieldy as json. In that case the references can be instead stored as parquet. Again this uses kerchunk internally.

combined_vds.virtualize.to_kerchunk('combined.parq', format='parquet')

And again we can read these references using the “kerchunk” backend as if it were a regular Zarr store

combined_ds = xr.open_dataset('combined.parq', engine="kerchunk")

By default references are placed in separate parquet file when the total number of references exceeds record_size. If there are fewer than categorical_threshold unique urls referenced by a particular variable, url will be stored as a categorical variable.

Writing as Zarr#

Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the proposed “Chunk Manifest” ZEP. To do this we simply use the ds.virtualize.to_zarr accessor method.

combined_vds.virtualize.to_zarr('combined.zarr')

The result is a zarr v3 store on disk which contains the chunk manifest information written out as manifest.json files, so the store looks like this:

combined/zarr.json  <- group metadata
combined/air/zarr.json  <- array metadata
combined/air/manifest.json <- array manifest
...

The advantage of this format is that any zarr v3 reader that understands the chunk manifest ZEP could read from this store, no matter what language it is written in (e.g. via zarr-python, zarr-js, or rust). This reading would also not require fsspec.

Note

Currently there are not yet any zarr v3 readers which understand the chunk manifest ZEP, so until then this feature cannot be used for data processing.

This store can however be read by open_virtual_dataset(), by passing filetype="zarr_v3".