Usage#
This page explains how to use VirtualiZarr today, by introducing the key concepts one-by-one.
Opening files as virtual datasets#
VirtualiZarr is for manipulating “virtual” references to pre-existing data stored on disk in a variety of formats, by representing it in terms of the Zarr data model of chunked N-dimensional arrays.
If we have a pre-existing netCDF file on disk:
import xarray as xr
# create an example pre-existing netCDF4 file
ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')
We can open a virtual representation of this file using open_virtual_dataset
.
from virtualizarr import open_virtual_dataset
vds = open_virtual_dataset('air.nc')
(Notice we did not have to explicitly indicate the file format, as open_virtual_dataset
will attempt to automatically infer it.)
Printing this “virtual dataset” shows that although it is an instance of xarray.Dataset
, unlike a typical xarray dataset, in addition to a few in-memory numpy arrays, it also wraps ManifestArray
objects.
vds
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ManifestArray<shape=(2920, 25, 53)...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
Generally a “virtual dataset” is any xarray.Dataset
which wraps one or more ManifestArray
objects.
These particular ManifestArray
objects are each a virtual reference to some data in the air.nc
netCDF file, with the references stored in the form of “Chunk Manifests”.
As the manifest contains only addresses at which to find large binary chunks, the virtual dataset takes up far less space in memory than the original dataset does:
ds.nbytes
30975672
vds.virtualize.nbytes
23704
Important
Virtual datasets are not normal xarray datasets!
Although the top-level type is still xarray.Dataset
, they are intended only as an abstract representation of a set of data files, not as something you can do analysis with. If you try to load, view, or plot any data you will get a NotImplementedError
. Virtual datasets only support a very limited subset of normal xarray operations, particularly functions and methods for concatenating, merging and extracting variables, as well as operations for renaming dimensions and variables.
The only use case for a virtual dataset is combining references to files before writing out those references to disk.
Opening remote files#
To open remote files as virtual datasets pass the reader_options
options, e.g.
aws_credentials = {"key": ..., "secret": ...}
vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
Chunk Manifests#
In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.
A “Chunk Manifest” is a list of chunk keys and their corresponding byte ranges in specific files, grouped together such that all the chunks form part of one Zarr-like array. For example, a chunk manifest for a 3-dimensional array made up of 4 chunks might look like this:
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
Notice that the "path"
attribute points to a netCDF file "foo.nc"
stored in a remote S3 bucket. There is no need for the files the chunk manifest refers to to be local.
Our virtual dataset we opened above contains multiple chunk manifests stored in-memory, which we can see by pulling one out as a python dictionary.
marr = vds['air'].data
manifest = marr.manifest
manifest.dict()
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
In this case we can see that the "air"
variable contains only one chunk, the bytes for which live in the file:///work/data/air.nc
file, at the location given by the 'offset'
and 'length'
attributes.
The ChunkManifest
class is virtualizarr’s internal in-memory representation of this manifest.
ManifestArray
class#
A Zarr array is defined not just by the location of its constituent chunk data, but by its array-level attributes such as shape
and dtype
. The ManifestArray
class stores both the array-level attributes and the corresponding chunk manifest.
marr
ManifestArray<shape=(2920, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
marr.manifest
ChunkManifest<shape=(1, 1, 1)>
marr.metadata
ArrayV3Metadata(shape=(2920, 25, 53),
data_type=<DataType.float64: 'float64'>,
chunk_grid=RegularChunkGrid(chunk_shape=(2920, 25, 53)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
fill_value=np.float64(-327.67),
codecs=(FixedScaleOffset(codec_name='numcodecs.fixedscaleoffset', codec_config={'scale': 100.0, 'offset': 0, 'dtype': '<f8', 'astype': '<i2'}),
BytesCodec(endian=<Endian.little: 'little'>)),
attributes={'GRIB_id': 11,
'GRIB_name': 'TMP',
'actual_range': [185.16000366210938,
322.1000061035156],
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'long_name': '4xDaily Air temperature at sigma '
'level 995',
'parent_stat': 'Other',
'precision': 2,
'statistic': 'Individual Obs',
'units': 'degK',
'var_desc': 'Air temperature'},
dimension_names=None,
zarr_format=3,
node_type='array',
storage_transformers=())
A ManifestArray
can therefore be thought of as a virtualized representation of a single Zarr array.
As it defines various array-like methods, a ManifestArray
can often be treated like a “duck array”. In particular, concatenation of multiple ManifestArray
objects can be done via merging their chunk manifests into one (and re-labelling the chunk keys).
import numpy as np
concatenated = np.concatenate([marr, marr], axis=0)
concatenated
ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
concatenated.manifest.dict()
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.
Note
As a single Zarr array has only one array-level set of compression codecs by definition, concatenation of arrays from files saved to disk with differing codecs cannot be achieved through concatenation of ManifestArray
objects. Implementing this feature will require a more abstract and general notion of concatenation, see GH issue #5.
Remember that you cannot load values from a ManifestArray
directly.
vds['air'].values
NotImplementedError: ManifestArrays can't be converted into numpy arrays or pandas Index objects
The whole point is to manipulate references to the data without actually loading any data.
Note
You also cannot currently index into a ManifestArray
, as arbitrary indexing would require loading data values to create the new array. We could imagine supporting indexing without loading data when slicing only along chunk boundaries, but this has not yet been implemented (see GH issue #51).
Virtual Datasets as Zarr Groups#
The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed ManifestArray
cannot store all of this information, an xarray.Dataset
wrapping multiple ManifestArray
s maps neatly to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of as in-memory arrays.
The problem of combining many archival format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using open_virtual_dataset
and using xarray’s various combining functions to combine them into one aggregate virtual dataset.
But before we combine our data, we might want to consider loading some variables into memory.
Loading variables#
Whilst the values of virtual variables (i.e. those backed by ManifestArray
objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy arrays, just like xr.open_dataset
normally returns.
Which variables to open this way can be specified using the loadable_variables
argument:
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'])
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
You can see that the dataset contains a mixture of virtual variables backed by ManifestArray
objects (lat
and lon
), and loadable variables backed by (lazy) numpy arrays (air
and time
).
Loading variables can be useful in a few scenarios:
You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
You want in-memory indexes to use with
xr.combine_by_coords
,Storing a variable on-disk as a set of references would be inefficient, e.g. because it’s a very small array (saving the values like this is similar to kerchunk’s concept of “inlining” data),
The variable has encoding, and the simplest way to decode it correctly is to let xarray’s standard decoding machinery load it into memory and apply the decoding,
Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data). Loading them allows you to rechunk them however you like.
The default value of loadable_variables
is None
, which effectively specifies all the “dimension coordinates” in the file, i.e. all one-dimensional coordinate variables whose name is the same as the name of their dimensions. Xarray indexes will also be automatically created for these variables. Together these defaults mean that your virtual dataset will be opened with the same indexes as it would have been if it had been opened with just xarray.open_dataset()
.
Note
In general, it is recommended to load all of your low-dimensional variables.
Whilst this does mean the original data will be duplicated in your new virtual zarr store, by loading your coordinates into memory they can be inlined in the reference file for fast reads from the virtual store.
However, you should not do this for higher-dimensional variables, as then you might use a lot of storage duplicating them, defeating the point of the virtual zarr approach. Also, anything duplicated could become out of sync with the referenced origial files, especially if not using a transactional storage engine like Icechunk
.
CF-encoded time variables#
To decode time variables according to the CF conventions, you must ensure time
is one of the loadable_variables
and the decode_times
argument of open_virtual_dataset
is set to True
(decode_times
defaults to None).
vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
decode_times=True,
)
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
Combining virtual datasets#
In general we should be able to combine all the datasets from our archival files into one using some combination of calls to xarray.concat
and xarray.merge
. For combining along multiple dimensions in one call we also have xarray.combine_nested
and xarray.combine_by_coords
. If you’re not familiar with any of these functions we recommend you skim through xarray’s docs on combining.
Let’s create two new netCDF files, which we would need to open and concatenate in a specific order to represent our entire dataset.
ds1 = ds.isel(time=slice(None, 1460))
ds2 = ds.isel(time=slice(1460, None))
ds1.to_netcdf('air1.nc')
ds2.to_netcdf('air2.nc')
Note that we have created these in such a way that each dataset has one equally-sized chunk.
TODO: Note about variable-length chunking?
Manual concatenation ordering#
The simplest case of concatenation is when you have a set of files and you know which order they should be concatenated in, without looking inside the files. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.
vds1 = open_virtual_dataset('air1.nc')
vds2 = open_virtual_dataset('air2.nc')
As we know the correct order a priori, we can just combine along one dimension using xarray.concat
.
combined_vds = xr.concat([vds1, vds2], dim='time')
combined_vds
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ManifestArray<shape=(2920, 25, 53)...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
We can see that the resulting combined manifest has two chunks, as expected.
combined_vds['air'].data.manifest.dict()
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
Note
If you have any virtual coordinate variables, you will likely need to specify the keyword arguments coords='minimal'
and compat='override'
to xarray.concat()
, because the default behaviour of xarray will attempt to load coordinates in order to check their compatibility with one another. In future this default will be changed, such that passing these two arguments explicitly will become unnecessary.
The general multi-dimensional version of this concatenation-by-order-supplied can be achieved using xarray.combine_nested()
.
combined_vds = xr.combine_nested([vds1, vds2], concat_dim=['time'])
In N-dimensions the datasets would need to be passed as an N-deep nested list-of-lists, see the xarray docs.
Note
For manual concatenation we can actually avoid creating any xarray indexes, as we won’t need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).
Ordering by coordinate values#
If you’re happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!
vds1 = open_virtual_dataset('air1.nc')
vds2 = open_virtual_dataset('air2.nc')
combined_vds = xr.combine_by_coords([vds2, vds1])
Notice we don’t have to specify the concatenation dimension explicitly - xarray works out the correct ordering for us. Even though we actually passed in the virtual datasets in the wrong order just now, the manifest still has the chunks listed in the correct order such that the 1-dimensional time
coordinate has ascending values:
combined_vds['air'].data.manifest.dict()
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
Ordering using metadata#
TODO: Use preprocess to create a new index from the metadata. Requires open_virtual_mfdataset
to be implemented in PR #349.
Writing virtual stores to disk#
Once we’ve combined references to all the chunks of all our archival files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
Writing to Kerchunk’s format and reading data via fsspec#
The kerchunk library has its own specification for how byte range references should be serialized (either as a JSON or parquet file).
To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the vds.virtualize.to_kerchunk
accessor method.
combined_vds.virtualize.to_kerchunk('combined.json', format='json')
These references can now be interpreted like they were a Zarr store by fsspec, using kerchunk’s built-in xarray backend (kerchunk must be installed to use engine='kerchunk'
).
combined_ds = xr.open_dataset('combined.json', engine="kerchunk")
In-memory (“loadable”) variables backed by numpy arrays can also be written out to kerchunk reference files, with the values serialized as bytes. This is equivalent to kerchunk’s concept of “inlining”, but done on a per-array basis using the loadable_variables
kwarg rather than a per-chunk basis using kerchunk’s inline_threshold
kwarg.
Note
Currently you can only serialize in-memory variables to kerchunk references if they do not have any encoding.
When you have many chunks, the reference file can get large enough to be unwieldy as json. In that case the references can be instead stored as parquet. Again this uses kerchunk internally.
combined_vds.virtualize.to_kerchunk('combined.parquet', format='parquet')
And again we can read these references using the “kerchunk” backend as if it were a regular Zarr store
combined_ds = xr.open_dataset('combined.parquet', engine="kerchunk")
By default references are placed in separate parquet file when the total number of references exceeds record_size
. If there are fewer than categorical_threshold
unique urls referenced by a particular variable, url will be stored as a categorical variable.
Writing to an Icechunk Store#
We can also write these references out as an IcechunkStore. Icechunk
is an open-source, cloud-native transactional tensor storage engine that is compatible with Zarr version 3. To export our virtual dataset to an Icechunk
Store, we simply use the vds.virtualize.to_icechunk
accessor method.
# create an icechunk repository, session and write the virtual dataset to the session
import icechunk
storage = icechunk.local_filesystem_storage("./local/icechunk/store")
# By default, local virtual references and public remote virtual references can be read wihtout extra configuration.
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
# write the virtual dataset to the session with the IcechunkStore
vds1.virtualize.to_icechunk(session.store)
session.commit("Wrote first dataset")
Append to an existing Icechunk Store#
You can append a virtual dataset to an existing Icechunk store using the append_dim
argument. This is especially useful for datasets that grow over time. Note that Zarr does not currently support concatenating datasets with different codecs or chunk shapes.
session = repo.writeable_session("main")
# write the virtual dataset to the session with the IcechunkStore
vds2.virtualize.to_icechunk(session.store, append_dim="time")
session.commit("Appended second dataset")
See the Icechunk documentation for more details. icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr
Opening Kerchunk references as virtual datasets#
You can open existing Kerchunk json
or parquet
references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like Icechunk.
vds = open_virtual_dataset('combined.json', filetype='kerchunk')
# or
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk')
One difference between the kerchunk references format and virtualizarr’s internal manifest representation (as well as icechunk’s format) is that paths in kerchunk references can be relative paths. Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the fsspec
filesystem which the filepath was defined relative to.
You can dis-ambuiguate kerchunk references containing relative paths by passing the fs_root
kwarg to virtual_backend_kwargs
.
# file `relative_refs.json` contains a path like './file.nc'
vds = open_virtual_dataset(
'relative_refs.json',
filetype='kerchunk',
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
)
# the path in the virtual dataset will now be 'file:///some_directory/file.nc'
Note that as the virtualizarr vds.virtualize.to_kerchunk
method only writes absolute paths, the only scenario in which you might come across references containing relative paths is if you are opening references that were previously created using the kerchunk
library alone.
Rewriting existing manifests#
Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.
Rewriting file paths#
You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the vds.virtualize.rename_paths
accessor method.
For example, you may want to rename file paths according to a function to reflect having moved the location of the referenced files from local storage to an S3 bucket.
def local_to_s3_url(old_local_path: str) -> str:
from pathlib import Path
new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
filename = Path(old_local_path).name
return str(new_s3_bucket_url / filename)
renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
renamed_vds['air'].data.manifest.dict()
{'0.0.0': {'path': 'http://s3.amazonaws.com/my_bucket/air.nc', 'offset': 15419, 'length': 7738000}}