
zarr-specs's Introduction

Zarr Specification

Zarr core protocol for storage and retrieval of N-dimensional typed arrays


For the v1 and v2 specs, please see https://github.com/zarr-developers/zarr-python/tree/main/docs/spec.

The rendered docs of the main branch are available at https://zarr-specs.readthedocs.io

Usage

The following steps install the necessary packages to render the specs with automatic updating and reloading of changes:

## optionally set up a venv
# python3 -m venv .venv
# . .venv/bin/activate
pip install -r docs/requirements.txt
pip install sphinx-autobuild
sphinx-autobuild -a docs docs/_build/html

zarr-specs's People

Contributors

alimanfoo, brandon-neth, carreau, clbarnes, d-v-b, davidbrochart, dimitripapadopoulos, grlee77, jakirkham, jbms, joshmoore, jrbourbeau, jstriebel, mkitti, msankeys963, normanrz, rabernat, rouault, zoj613


zarr-specs's Issues

Publish Zarr spec with OGC

http://www.opengeospatial.org/standards:

OGC(R) standards are technical documents that detail interfaces or encodings. Software developers use these documents to build open interfaces and encodings into their products and services. These standards are the main "products" of the Open Geospatial Consortium and have been developed by the membership to address specific interoperability challenges. Ideally, when OGC standards are implemented in products or online services by two different software engineers working independently, the resulting components plug and play, that is, they work together without further debugging.

Considering that zarr already has a mature spec, I believe it would be advantageous for us to register that spec with OGC. The effort required would be minimal, since the spec is well written, and it would give us a certain level of credibility with certain communities.

cc @percivall, who first made me aware of OGC standards in pangeo-data/pangeo#450

Chunk Spec

Following up on today's call and #3, define a specification for how chunks are represented in memory before going through (compression) filters and storage.

Minimum requirement: a chunk can store nd-tensors of primitive datatypes.
There was also a consensus to support big and little endian data (and C/F layout where appropriate).

On top, we discussed these questions:

  1. Does a chunk have a header? Is the header required or optional?
  2. Is the chunk size (shape in Python lingo) fixed or variable? Variable chunk size would require a header or information about chunks in the metadata.
  3. Is the number of values stored in the chunk always determined by the shape (i.e. the product of the shape)? If not, we would need to implement something akin to the n5 varlength mode, which would require a header.
  4. How do we support non-primitive datatypes, e.g. strings or VLen / ragged arrays? This could be implemented via 3. or something akin to the current zarr spec.

Regarding 2.:
Use case 1: storing edge chunks that are not fully covered.
@axtimwalde pointed out that this allows direct mapping to memory without copying data in n5-imglib implementation.
Use case 2: appending / prepending to datasets. This could be used to implement prepending to datasets without modifying existing chunks. Note that one of @alimanfoo's motivations to NOT implement variable chunk size was to always have valid chunks when appending to a dataset.

Regarding 3:
The n5 use cases we discussed were simple examples like storing unique values in the spatial block corresponding to a chunk and more complicated examples like the n5-label-multiset.
Also, this could be useful to define non-primitive datatypes, e.g. strings encoded via offsets and values. See also 4.

Regarding 4:
During the discussion, several additional datatypes that could be supported came up:

More generally, there is the question of how we could provide a mechanism for extensions to the spec that define new datatypes.
In the current zarr implementation, numpy arrays of objects can be stored via a special filter, see #6.
In the current n5 implementation, non-primitive datatypes can be encoded into a varlength chunk and need to be decoded again with a separate library (i.e. not part of n5-core).
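
To make the header questions above more concrete, here is a rough sketch of what an optional chunk header carrying the shape and an n5-style value count could look like (purely illustrative; the field layout and names are assumptions, not a proposal):

import struct
from typing import Optional

import numpy as np

def encode_chunk(data: np.ndarray, num_values: Optional[int] = None) -> bytes:
    # Hypothetical layout: 1-byte mode flag (0 = default, 1 = varlength),
    # 1-byte ndim, the chunk shape as little-endian uint64 values, an optional
    # uint64 count of stored values (akin to the n5 varlength mode), and then
    # the raw chunk bytes in the agreed memory layout.
    mode = 0 if num_values is None else 1
    header = struct.pack('<BB', mode, data.ndim)
    header += struct.pack('<%dQ' % data.ndim, *data.shape)
    if num_values is not None:
        header += struct.pack('<Q', num_values)
    return header + data.tobytes()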

Non-JSON metadata and attributes

As briefly discussed in the group chat, I would like to propose a change to how metadata and attributes are accessed. The current spec requires that this data be readable and writable as JSON. This is compatible with all current storage backends of Zarr and with the filesystem and cloud storage backends of N5. It is not compatible with the current HDF5 backend of N5, where attributes and metadata are represented as HDF5 attributes. Instead of requiring JSON, I suggest that metadata and attribute access should be specified similarly to the group and array access protocol of the spec, i.e. as access primitives (an API). The most basic primitives would be:

getAttribute - Retrieve the value associated with a given key and attributeKey.

| Parameters: `key`, `attributeKey`, [`type`]
| Output: `value`

setAttribute - Store a (key, attributeKey, value) triple.

| Parameters: `key`, `attributeKey`, `value`
| Output: none

Probably also something to list attributes and maybe infer their types if necessary.
The N5 API does it this way and I find it very straightforward to use across JSON and non-JSON backends:

https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Reader.java#L214

https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Reader.java#L271

https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Writer.java#L43

https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Writer.java#L59

and the default JSON implementation, which is only bloated in order to support version 0 with non-auto-inferred compressors:

https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/AbstractGsonReader.java
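
For illustration, a minimal sketch of what such attribute access primitives could look like in Python (method names and signatures are assumptions, loosely mirroring the N5 interfaces linked above):

from abc import ABC, abstractmethod
from typing import Any, Iterator


class AttributeAccess(ABC):
    """Hypothetical attribute access primitives, independent of JSON."""

    @abstractmethod
    def get_attribute(self, key: str, attribute_key: str) -> Any:
        """Retrieve the value associated with the given key and attributeKey."""

    @abstractmethod
    def set_attribute(self, key: str, attribute_key: str, value: Any) -> None:
        """Store a (key, attributeKey, value) triple."""

    @abstractmethod
    def list_attributes(self, key: str) -> Iterator[str]:
        """List the attributeKeys stored under the given key."""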

The Amazon S3 limit on the length of keys

I noticed that Amazon S3 (and apparently also Google) define a
limit of 1024 bytes for object keys. This limit apparently
applies to the whole key and not, say, to segments of the key,
where a segment is the name between '/' occurrences.

I know that for atmospheric sciences netcdf-4 datasets, variable
names are used to encode a variety of properties such as dates
and locations. This often results in long variable
names. Additionally, deeply nested groups are used to also
classify sets of variables. Bottom line: it is probable that
such datasets will run up against the 1024 byte limit in the
near future.

So my question to the community is: how do we deal with the 1024
byte limit? Or do we ignore it?

One might hope that Amazon will up that limit Real-Soon-Now. My
guess is that a limit of 4096 bytes would be adequate to push
the problem off to a more distant future.

If such a length increase does not happen, then we may need to
rethink the Zarr layout so that this limit is circumvented.
Below are some initial thoughts about this. I hope I am not
overthinking this and that there is some simpler approach that I
have not considered.

One possible proposal is to use a structure where
the long key is replaced with the hash of the long key.
This leads to an inode-like system with a flat space of hash keys,
where the objects for those hash keys contain metadata and chunk data.
In order to represent the group structure, one
would need to extend this so that some "inodes" are directory-like
objects that map a key segment to the hash keys of the inodes
"contained" in the directory.

I am sure there are other ways to do this. It may also be worth
asking about the purpose of the groups. Right now they serve
as a namespace and as a primitive indexing mechanism for the leaf
content-bearing objects. Perhaps they are superfluous.

In any case, the 1024 byte key-length limit is likely
to be a problem for Zarr in the near future.
The community needs to decide if it wants to ignore this
limitation or address it in some general way.

=Dennis Heimbigner
Unidata

Proposal: Object versioning...

I've written a blog post about this, "How to (and not to) handle metadata in high momentum datasets", so for a more thorough dive please read that, but in short:

I'm really interested in fast-moving datasets backed by an object store (S3 in my case). S3 is eventually consistent, and so there is an issue whenever you make changes to more than one object at (approximately) the same time, since on read you don't know what combination of versions you'll get.

This could be an issue if I update .zarray to grow my array and also update .zattrs with some metadata to reflect this change. On read I could get the new metadata and the old shape, or vice versa. Both of which would be bad.

This becomes more pronounced when working with complex datasets with coordinates etc.; when saved as Zarr by Xarray these end up in different zarrays in the same group. But there is no tie between the versions of the objects you get, and an update followed by a read could result in all kinds of corruption.

Some of this needs to be resolved higher up in the tooling (xarray, etc.), but I think Zarr development needs to be aware of the challenge and support it.

Core protocol v3.0 status

Hi All!

I spent some time looking through the work surrounding the v3.0 core protocol over in #16. My goal for this issue is to summarize the current status of this work and help spur conversation in the community. Any feedback can then be used to guide and prioritize future work on the core protocol and protocol extensions.

cc @alimanfoo @jakirkham @joshmoore @ryan-williams

Specification development process document (current status)

  • Defines concept of a core protocol, protocol extensions, stores, and codecs
  • Will define the process for minor/major changes to the core protocol and how decisions are made
  • Could use feedback from the community

Core protocol (current status)

  • Core concepts and terminology

    • E.g. arrays, groups, chunks, etc.
    • These all seem to be well defined and in good shape overall
  • Node names

    • Restrictions on node name characters and on some possible names
    • Case-insensitive uniqueness of sibling names
    • Question: Are the restrictions on node names too restrictive?
  • Data types

    • Core data types are boolean, integer, and floating point
    • Complex and datetime dtypes can be implemented as protocol extensions
    • Question: What about languages that don't easily support the full list of core data types? (xref zarr-developers/community#25)
  • Chunking

    • Core protocol consists of regular grid. Other grid types, e.g. non-uniform chunking or unknown chunk sizes, can be defined via protocol extensions
    • Core protocol uses C- and F-order for the memory layout of each chunk. Other layouts, e.g. sparse memory layouts, are possible via protocol extensions
    • Chunk encoding consists of a compressor codec. Note this does not include filters, which can be supported via protocol extensions
  • Metadata

    • Three types of metadata documents: bootstrap metadata, array metadata, and group metadata
    • The bootstrap metadata doc must be encoded in JSON, while the array and group metadata docs can use other encodings
    • Bootstrap metadata document contains the protocol specification used (e.g. v3.0, v3.1, etc.), how the array and group metadata documents are encoded (default is JSON), and a list of protocol extensions used
    • Array metadata document contains the array shape, data type, user-defined attributes, etc.
      • Protocol extension points include: data type, chunk grid type, and chunk memory layout
      • The extensions metadata value needs to be defined in the protocol spec
      • Question: How to specify the fill_value for dtypes other than bool and int still seems open
    • Group metadata document contains protocol extensions and user-defined attributes
  • Stores

    • Defines abstract store interface which can be implemented on top of different storage technology backends
    • Abstract interface methods for operating on keys and values in a store include get, set, delete, etc. (a minimal sketch is given after this list)
    • Not all abstract methods need to be implemented (e.g. can have a read-only store)
    • Core protocol does not define any store implementations, but gives examples of possible implementations
    • Some protocol operations still need to be filled out
  • Protocol extensions

    • This section needs to be completed
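
For reference, a minimal sketch of the abstract store interface summarised in the list above (method names are paraphrased and partly assumed, not normative):

from abc import ABC, abstractmethod
from typing import Iterator


class Store(ABC):
    """Sketch of an abstract key/value store; a read-only store would
    implement only the retrieval and listing methods."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Retrieve the value for the given key."""

    @abstractmethod
    def set(self, key: str, value: bytes) -> None:
        """Store a value under the given key."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove the given key and its value."""

    @abstractmethod
    def list_prefix(self, prefix: str) -> Iterator[str]:
        """List all keys starting with the given prefix."""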

Protocol extensions (current status)

Three protocol extensions are currently in progress:

  • Datetime data types - looks relatively filled out
  • Complex data types - currently a scaffolding
  • Filters - currently a scaffolding

Several other possible extensions are outlined in #49

Stores (current status)

  • Currently one store spec in progress, the file system store

Terminology comparison with TensorStore

The TensorStore docs include a page defining some key terminology about N-D arrays, including "index space", "index domain" and "index transformation". It may be worth considering whether to adopt any of this terminology within the v3 core protocol spec.

Sparse chunk memory layout

This is a placeholder for a potential protocol extension to define sparse memory layouts for chunks.

The idea is to enable use of a sparse memory layout (e.g., CSR, CSC or COO) within each chunk of a Zarr array. I.e., a Zarr array has a regular chunk grid as normal, but instead of using a dense C contiguous or F contiguous layout for the data within each chunk, use a sparse memory layout.

E.g., in the case of COO the memory layout would comprise two memory blocks, one storing the coordinates, the other storing the data values. For the purposes of encoding and storage, these two memory blocks could be concatenated into a single memory block, which could then be passed down through filter and compressor codecs and stored as normal. When retrieving and decoding the chunk, the coordinates and the data values could be presented as views of different regions of the memory block, to avoid extra memory copies.
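
A minimal sketch of that concatenation for a COO chunk (illustrative only; the 16-byte header and little-endian int64 coordinates are assumptions, not part of any spec):

import numpy as np

def encode_coo_chunk(coords: np.ndarray, values: np.ndarray) -> bytes:
    # coords has shape (ndim, nnz); values has shape (nnz,).
    coords = np.ascontiguousarray(coords, dtype='<i8')
    values = np.ascontiguousarray(values)
    header = np.array(coords.shape, dtype='<i8').tobytes()   # ndim, nnz
    return header + coords.tobytes() + values.tobytes()      # single block for the codec pipeline

def decode_coo_chunk(buf: bytes, value_dtype) -> tuple:
    ndim, nnz = np.frombuffer(buf[:16], dtype='<i8')
    split = 16 + ndim * nnz * 8
    coords = np.frombuffer(buf[16:split], dtype='<i8').reshape(ndim, nnz)
    values = np.frombuffer(buf[split:], dtype=value_dtype)   # views, no extra copy
    return coords, values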

In terms of the Zarr v3 core protocol, this could be specified as a protocol extension, defining new memory layouts that could be used within the chunk_memory_layout array metadata property.

An implementation in Python could be relatively straightforward, by using an existing sparse array library like SciPy (for 2D chunks) or sparse (for ND chunks) to manage the chunks, instead of numpy.

This could also integrate nicely with blocked parallel computing frameworks like Dask, because each chunk would be presented as a sparse array, and so any computational steps within the task graph that could operate directly on the sparse representation could do so, rather than forcing data into a dense representation.

Note that this is different from discussions about defining conventions for storing sparse arrays in Zarr, where a collection of two or more Zarr arrays are used to store a single sparse array. (E.g., for a COO array, the coords would be stored in one Zarr array, and the data in a second Zarr array.) That may be equally worthwhile to pursue, but is a different concept and probably serves slightly different use cases.

Would like to be involved in spec change discussions.

Hello, @alimanfoo pointed me in this direction following my post on a desired spec at zarr-developers/zarr-specs/#9.

I'd be really keen to get involved with these discussions/calls.

At the Met Office we've lots of big fast-moving data that most (all) formats currently struggle with.

Misc. Minor Comments from Dennis Heimbigner

Node name character set

Currently the core protocol v3.0 draft includes a section on node names which defines the set of characters that may be used to construct the name of a node (array or group).

The set of allowed characters is currently extremely narrow, being only the union of a-z, A-Z, 0-9 and the characters "-", "_" and ".". There is no support for non-Latin characters, which obviously creates a barrier for many people. I'd like to relax this and allow any Unicode character. Could we do this, and if we did, what problems would we need to anticipate and address in the core protocol spec?
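
For reference, the current restriction amounts to roughly the following check (a sketch only; the additional restrictions on a few specific names are omitted):

import re

# a-z, A-Z, 0-9 and the characters "-", "_" and "." per the current draft.
NODE_NAME_RE = re.compile(r'^[A-Za-z0-9._-]+$')

def is_valid_node_name(name: str) -> bool:
    return bool(NODE_NAME_RE.match(name))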

Some points to bear in mind for this discussion:

Currently in the core protocol, node names (e.g., "bar") are used to form node paths (e.g., "/foo/bar"), which are then used to form storage keys for metadata documents (e.g., "meta/root/foo/bar.array") and data chunks (e.g., "data/foo/bar/0.0"). These storage keys may then be handled by a variety of different store types, including file system stores where storage keys are translated into file paths, cloud object stores where storage keys are translated into object keys, etc.

Different store types will have different abilities to support the full Unicode character set for node names. For example, although most file systems support Unicode file names, there are still reserved characters and words which cannot be used, which differ between operating systems and file system types. However, these constraints might not apply at all to other store types, such as cloud object stores. In general, other store types may have different constraints. Do we need to anticipate any of this, or can we delegate these issues to be dealt with in the different store specs?

One last thought, whatever we decide, the set of allowed characters should probably be defined with respect to some standard character set, e.g., Unicode. I.e., we should probably reference the appropriate standard when discussing which characters are allowed.

Proposal: group to list its children

If I read the 2.2 spec correctly, then when opening a group there is no way of knowing the children of that group without doing a list. This seems sub-optimal to me. I'm usually working on S3 and very large Zarr files (1000s/100,000s of objects), and a list operation in this setting is not very efficient.

I feel that if .zgroup listed its children then this would relieve the problem. Something like:

.zgroup -> contains: ["foo", "foo2"]
foo/.zgroup -> contains:["bar"]
foo/bar/.zarray
foo2/.zarray

Support suffix for metadata files

In the current v3 core protocol draft, keys for metadata documents have no suffix to represent the document format, they are just keys like "meta/root/foo.group" and "meta/root/foo/bar.array". In some situations (e.g., where file systems or web servers are being used as storage) it would be useful to have a suffix that indicates the format, i.e., ".json" for the default. Consider adding support for this.

Clarify status and semantics of object ('O') data type in storage spec

The current storage spec in the section on "data type encoding" describes how the data type of an array should be encoded in the array metadata. Here is the content from the beginning of the section:

Simple data types are encoded within the array metadata as a string,
following the `NumPy array protocol type string (typestr) format
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format
consists of 3 parts:

* One character describing the byteorder of the data (``"<"``: little-endian;
  ``">"``: big-endian; ``"|"``: not-relevant)
* One character code giving the basic type of the array (``"b"``: Boolean (integer
  type where all values are only True or False); ``"i"``: integer; ``"u"``: unsigned
  integer; ``"f"``: floating point; ``"c"``: complex floating point; ``"m"``: timedelta;
  ``"M"``: datetime; ``"S"``: string (fixed-length sequence of char); ``"U"``: unicode
  (fixed-length sequence of Py_UNICODE); ``"V"``: other (void * – each item is a
  fixed-size chunk of memory))
* An integer specifying the number of bytes the type uses.

The byte order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
``"|S12"`` are valid data type encodings.

The spec then goes on to describe how datetime and timedelta data types are encoded:

For datetime64 ("M") and timedelta64 ("m") data types, these MUST also include the
units within square brackets. A list of valid units and their definitions are given in
the `NumPy documentation on Datetimes and Timedeltas
<https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units>`_.
For example, ``"<M8[ns]"`` specifies a datetime64 data type with nanosecond time units.

...and also how structured data data types are encoded:

Structured data types (i.e., with multiple named fields) are encoded as a list
of two-element lists, following `NumPy array protocol type descriptions (descr)
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html#>`_. For
example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a
data type composed of three single-byte unsigned integers labelled "r", "g" and
"b".

Implicit in all of this is that the spec is inheriting the numpy definition of data types, and deferring to the numpy documentation as much as possible.

In addition to fixed-memory data types, numpy also defines an "object" data type (character code 'O'). In numpy, an array with object data type is an array of memory pointers, where each pointer dereferences to a Python object. Although the object data type is described in the numpy documentation, it is not mentioned at all in the zarr storage spec. It is therefore unclear whether it is or is not a valid data type for use in a zarr array, and if it is, what its semantics are.

The Python zarr implementation has in fact fully supported the object data type since version 2.2 (zarr-developers/zarr-python#212). The implementation follows numpy in the sense that, when data are retrieved from a zarr array with object data type, they are returned to the user as a numpy array with object data type, i.e., as an array of Python objects.

However, this does not mean that the encoded zarr data are necessarily Python-specific. When storing data into a zarr array with object data type, how the objects are encoded is delegated to the first codec in the filter chain. For example, if the first codec in the filter chain is the MsgPack codec, then data will be encoded using MessagePack encoding, which is a language-independent encoding and can in principle be decoded in a variety of programming languages. Similarly, if the array contains only string objects, then the VLenUTF8 codec can be used, which will encode data in a format similar to parquet encoding, and which could be decoded in any programming language. Further examples are given in the sections on string arrays and object arrays in the zarr tutorial.
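
For example, a minimal sketch using the zarr-python 2.x API described in the tutorial (assuming zarr >= 2.2 and numcodecs are installed):

import numpy as np
import zarr
from numcodecs import VLenUTF8

# The object_codec becomes the first codec in the filter chain and defines
# the (language-independent) encoding of the Python objects.
z = zarr.array(np.array(['foo', 'bar', 'baz'], dtype=object),
               dtype=object, object_codec=VLenUTF8())
print(z[:])  # returned as a numpy array with object data type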

In the longer term, the community may want to revisit the approach to encoding of arrays with variable-length data types, and to produce a new major revision of the storage spec. However, I suggest that we first aim to resolve this issue by adding some clarifying text to the version 2 storage spec, to make explicit the status and semantics of the object data type. As precedent, we have previously made a number of edits to the version 2 storage spec to make clarifications, see the changes section of the spec. For example, we added clarifications regarding the datetime and timedelta data types, and we added clarifications regarding the encoding of fill values, so I am hoping for a similar resolution here.

Negative chunk indexes, offset chunk origin

This is a feature request. If you think that it is worth doing, I volunteer to update the zarr specification and implementation.

The v2 specification states

For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key "0.0"; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key "2.4"; etc.

Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.

Together, these statements imply the following restrictions:

  • the [0, 0, ..., 0] origin "corner" of a N-dimensional array must coincide with the origin corner of the [0, 0, ..., 0]th chunk
  • overhanging chunks may only appear at the edges of the array that are far from the origin

These restrictions make it convenient to grow/shrink an N-dimensional array along the edges that are far from zero but inconvenient to grow/shrink along the zero-index edges. For example, consider this chunked array with shape [3, 3]:

[[ 0, 1, 2],
 [ 3, 4, 5],
 [ 6, 7, 8]]

If this array is split into chunks of [2, 2], the chunks are

  • chunk [0, 0] is [[0, 1], [3, 4]]
  • chunk [0, 1] is [[2, undefined], [5, undefined]]
  • chunk [1, 0] is [[6, 7], [undefined, undefined]]
  • chunk [1, 1] is [[8, undefined], [undefined, undefined]]

To concatenate an array of zeroes [[0, 0, 0]] on the non-zero edge of the 0th dimension, I only need to change the shape of the array to [4, 3] and update chunks [1, 0] and [1, 1]:

[[ 0, 1, 2],
 [ 3, 4, 5],
 [ 6, 7, 8],
 [ 0, 0, 0]]
  • chunk [0, 0] is unchanged
  • chunk [0, 1] is unchanged
  • chunk [1, 0] becomes [[6, 7], [0, 0]]
  • chunk [1, 1] becomes [[8, undefined], [0, undefined]]

However, to concatenate on the opposite edge, I need to shift the chunk origin and can not reuse any of the existing chunks:

[[ 0, 0, 0],
 [ 0, 1, 2],
 [ 3, 4, 5],
 [ 6, 7, 8]]
  • chunk [0, 0] becomes [[0, 0], [0, 1]]
  • chunk [0, 1] becomes [[0, undefined], [2, undefined]]
  • chunk [1, 0] becomes [[3, 4], [6, 7]]
  • chunk [1, 1] becomes [[5, undefined], [8, undefined]]

This rechunking is expensive for big arrays that are repeatedly grown in the "negative" direction.

I propose relaxing restrictions to make this append easier. Specifically, I propose the following:

  • Overhanging chunks may appear along any edge of the array
  • The array metadata has a new key, chunk_origin (feel free to invent a better name!), which specifies the location in the N-dimensional array of the origin "corner" of the [0, 0, ..., 0]th chunk. If unspecified, the default value of chunk_origin is [0, 0, ..., 0] (meaning that the chunk origin coincides with the origin of the array itself), to preserve backwards compatibility.
  • Chunk indices may be negative

In the example above, we can efficiently append along the zero edge by changing the shape to [4, 3] and changing the chunk origin from [0, 0] to [1, 0] and adding new chunks with negative indexes:

  • add chunk [-1, 0] with contents [[undefined, undefined], [0, 0]]
  • add chunk [-1, 1] with contents [[undefined, undefined], [0, undefined]]
  • chunk [0, 0] is unchanged
  • chunk [0, 1] is unchanged
  • chunk [1, 0] is unchanged
  • chunk [1, 1] is unchanged
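
To make the proposed semantics concrete, here is a rough sketch (illustrative only; function names are assumptions) of how a reader could map array coordinates to chunk indices and keys when a chunk_origin is present and indices may be negative:

def chunk_index_for(coords, chunk_shape, chunk_origin):
    # chunk_origin gives the array coordinates of the origin corner of the
    # [0, 0, ..., 0]th chunk; floor division keeps negative indices correct.
    return tuple((c - o) // s for c, o, s in zip(coords, chunk_origin, chunk_shape))

def chunk_key(chunk_index, separator='.'):
    return separator.join(str(i) for i in chunk_index)

# With chunk shape (2, 2) and chunk_origin (1, 0), array element (0, 0) now
# lives in chunk (-1, 0), stored under the key "-1.0".
assert chunk_index_for((0, 0), (2, 2), (1, 0)) == (-1, 0)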

What do you think?

Migrate storage spec

Assuming we want to at least migrate the storage spec versions 1 and 2, then:

  • Migrate the source files for the storage spec from the zarr (python) repo to this repo.
  • Set up RTFD for this repo and (re)publish the specs from here (I guess they'd end up at zarr-specs.readthedocs.io).
  • Edit the docs back at zarr.readthedocs.io to explain that specs have moved and provide links to new locations.

Partial chunk reads

The ability for zarr to support partial chunk reads has come up a couple of times (xref zarr-developers/zarr-python#40, zarr-developers/zarr-python#521). One benefit of supporting this would be improvements to slicing operations that are poorly aligned with chunk boundaries. As @alimanfoo pointed out, some compressors also support partial decompression which would allow for extracting out part of a compressed chunk (e.g. the blosc_getitem method in Blosc).

One potential starting point would be to add a new method, e.g. decode_part, to the Codec interface. Compressors which don't support partial decompression could have a fallback implementation where the entire chunk is decompressed and then sliced. We would also need a mechanism for mapping chunk indices to the appropriate parameters needed for decode_part to extract a part of a chunk.
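
A rough sketch of what this could look like; decode_part is hypothetical and not part of the current numcodecs Codec interface, and the parameter names are assumptions:

class PartialCodec:
    """Hypothetical codec supporting partial decoding of an encoded chunk."""

    def decode(self, buf: bytes) -> bytes:
        raise NotImplementedError  # full decode, as in the existing Codec API

    def decode_part(self, buf: bytes, start: int, nitems: int, itemsize: int) -> bytes:
        # Fallback for compressors without partial decompression support:
        # decode the whole chunk, then slice out the requested items.
        decoded = self.decode(buf)
        return decoded[start * itemsize:(start + nitems) * itemsize]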

With the current work on the v3.0 spec taking place, I wanted to open this issue to discuss whether partial chunk reads are something we'd like to support as a community.

Docs file structure

Here's a proposal for how to organise the documentation files within this repository. Comments very welcome.

  • docs - Top level folder for all documentation. Docs will be in RST format and built via sphinx.
    • conf.py - Sphinx configuration file.
    • index.rst - Main documentation page. Provides a brief introduction to Zarr and the organisation of the documentation. Includes a complete table of contents.
    • process.rst - Describes the processes for proposing new specs or changes to existing specs.
    • protocol - Folder containing specifications of the core protocol.
      • v1.rst - Version 1 of the core protocol. Migrated without change from here.
      • v2.rst - Version 2 of the core protocol. Migrated without change from here.
      • v3.0.rst - Version 3.0 of the core protocol, to be written.
    • transformations - Folder containing specifications of transformations of the core protocol.
      • consolidated-metadata - Folder containing specifications of the consolidated metadata transformation.
        • v1.rst - Version 1 of the consolidated metadata format, to be written.
      • chunk-key-separator.rst - Documentation of the chunk key protocol transformation, which involves rewriting chunk keys to use a different character as the separator character between chunk grid indices (e.g., '/' instead of '.').
    • extensions - Folder containing specifications of extensions to the core protocol.
      • zcdf - Folder containing specifications of the NetCDF-style extensions to the core protocol.
        • v1.rst - Version 1 of the ZCDF extension spec.
    • storage - Folder containing specifications of storage layers. Each storage layer spec describes how operations in the abstract storage interface (get, set, delete key/value pairs) are translated into concrete operations in a storage system such as a file system or cloud object store.
      • file-system.rst - Spec that maps the abstract storage interface onto file system operations.
      • zip-file.rst - Spec that maps the abstract storage interface onto operations on a zip file.
      • dbm.rst - Spec that maps the abstract storage interface onto operations on a dbm-style database (including gdbm, ndbm and Berkeley DB).
      • lmdb.rst - Spec that maps the abstract storage interface onto operations on an LMDB database.
      • sqlite.rst
      • mongodb.rst
      • redis.rst
      • abs.rst
      • gcs.rst
      • s3.rst
      • ...
    • codecs - Folder containing codec specifications. Codecs include filters and compressors. A codec specification describes the chunk encoding/decoding process and the encoded format. These may just be references to documentation published elsewhere, and/or a reference implementation.
      • adler32.rst
      • astype.rst
      • blosc.rst
      • bz2.rst
      • categorize.rst
      • crc32.rst
      • delta.rst
      • fixedscaleoffset.rst
      • gzip.rst
      • json.rst
      • json2.rst
      • lz4.rst
      • lzma.rst
      • msgpack.rst
      • msgpack2.rst
      • packbits.rst
      • pickle.rst
      • quantize.rst
      • vlen-array.rst
      • vlen-bytes.rst
      • vlen-utf8.rst
      • zlib.rst
      • zstd.rst

To elaborate on a couple of things...

Here I'm using "Zarr core protocol" to mean the core spec that defines the array and group metadata formats, the abstract interfaces for storage layers and codecs, and the logical model for how arrays are divided into chunks, and how storage keys are constructed for storing metadata and chunk data. Previously this has been called the "Zarr storage specification" but I think that "protocol" is a better word as it's closer to what this spec is actually defining.

In this structure I'm envisaging that specs may be decoupled and versioned separately. I.e., the core protocol is decoupled from codec and storage layer specs. This is intended to allow for new storage layers or codecs to be defined without requiring any changes or versioning to the core protocol.

I've also tentatively included extensions here as a place that might hold extensions like ZCDF. Previously there has been discussion of having a separate repo for extensions/conventions, however I'm wondering if we should try and keep everything together.

xref #8 which identifies some components of the overall system architecture, and thus has some bearing on how specs are organised.

zgdal

Just landed here and find zarr very promising!

In addition to zhdf and znetcdf mentioned here, it would be very nice to have something similar for the broadly used GDAL data model (zgdal). I think that the implementation would be straightforward too.

Probably too soon and not the best place, but I think this is worth mentioning.

Versioned arrays

From @shoyer: What about indirection for chunks? The use case here is incrementally updated and/or versioned arrays (e.g., each day you update the last chunk with the latest data, without altering the rest of the data). In the typical way this is done, you hash each chunk and use hashes as keys in another store. The map from chunk indices to hashes is also something that you might want to store differently (e.g., in memory). I guess this would also be pretty easy to implement on top of the Storage abstraction.
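
A minimal sketch of such chunk indirection layered on top of the storage abstraction (the class and the use of SHA-256 are assumptions, not part of any spec):

import hashlib

class HashedStore:
    """Store chunks under their content hash, with a separate index mapping
    chunk keys to hashes; old versions remain addressable by hash."""

    def __init__(self, blobs, index):
        self.blobs = blobs   # mapping: hash -> bytes
        self.index = index   # mapping: chunk key -> hash (could live in memory)

    def __setitem__(self, key: str, value: bytes) -> None:
        digest = hashlib.sha256(value).hexdigest()
        self.blobs[digest] = value
        self.index[key] = digest

    def __getitem__(self, key: str) -> bytes:
        return self.blobs[self.index[key]]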

Storing arrays with unknown chunk sizes

In some cases ( like dask/dask#3293 ), it is useful to be able to store an array where the chunk sizes are not known until the array is written out. Admittedly this may require things like chunk headers and some more sophistication when it comes to working with the data after storing it. However there are definite use cases like in this SO question.

Extension slicing: or what to do when things go wrong

cf. https://doc.zeroc.com/ice/3.6/client-server-features/slicing-values-and-exceptions

Given this class hierarchy defined for the Ice protocol:

module ex {

  // Some base data type
  class ZString {
       Bytes data;
  };

  // A particular interpretation of that type
  class ZUTF8 extends ZString {
  };

  // Something arbitrarily more complex ...
  class ZHTML extends ZUTF8 {
      Language lang;
  };

};

a ZHTML instance that was serialized would contain an ordered list of its extension specs as:

In[1]: print(ZHTML().ice_ids())
Out[1]: ["::ex::ZHTML", "::ex::ZUTF8", "::ex::ZString", "::Ice::Object"]

A client which wanted to deserialize this object but did not have access to the ZHTML spec would slice off that interpretation and try the next one. If the only spec that was available was ZString, then that instance would be of type ZString and the client would use the regular handling for that type.

Types that might benefit from this mechanism:

  • complex numerics (e.g. extends LongPair if the spec doesn't support the more flexible Long[2])
  • timestamps (e.g. extends String)
  • Perhaps #23 (e.g. Chunk[] of different sizes)

Scope?

There are at least three types of spec that could live here:

  1. The zarr storage spec. There have been two versions (1 and 2). Content currently lives in the zarr python repo and is published on RTFD within the zarr python docs.

  2. The zarr codec registry. Currently this is undocumented, and is effectively defined as the set of codecs implemented in the zarr-developers/numcodecs repository, which serves as the reference implementation. However, the codec registry could (should?) be documented independently of the numcodecs implementation, and have a community process for registering new codecs.

  3. Community extensions/conventions. E.g., the set of conventions supported by xarray to implement shared dimensions, or the set of conventions that ultimately becomes NCZarr.

Should they all live here, or should some live elsewhere?

Questions regarding the DirectoryStore design and expected functionality.

Hi there,

I'm trying to understand the reasoning, choices and trade-offs made when the DirectoryStore on-disk layout was created.

As far as I can tell it has been done to be:

  1. relatively simple
  2. close to the default zarr protocol.
  3. internals meant to be inspected by humans, with subgroups in the hierarchy accessible independently, without opening the root via the zarr lib.

Are these assumptions of mine correct? To what extent could they be changed – for the internal implementation of DirectoryStore for the v3 spec, assuming the end-user API does not change?

For multi-language implementations of the DirectoryStore in v3, I'm supposing we also care about a few other things, mainly:

  1. On-disk layout should be relatively friendly to machines and many languages.
  2. Robust to key casing (as the zarr protocol may allow Unicode and may require stores to be case sensitive?).
  3. Efficient when possible.

There are a few other questions that I have not seen mentioned in discussions/specs of the DirectoryStore, mainly whether soft/hard links are allowed, how permissions are handled, and whether writing over a chunk keeps the inode and writes in place or should replace the file.

I believe some of the current constraints on casing and efficiency of the current DirectoryStore can be overcome with minimal loss of readability for humans exploring the internals of such a data store.

For example, we could change the encoding of keys as follows.

  • In the Zarr protocol, allow arbitrary Unicode for keys, or at least relax casing.
  • For the DS, encode the key as follows.

A key in the DS would be HUMAN_PART-MACHINE_PART (a sketch of this encoding is given after the list below):

  • the HUMAN_PART would be an ASCII-restricted, non-empty version of the key, mostly informative for the user exploring the filesystem.
  • the MACHINE_PART would be a base32-encoded version of the key, stripped of trailing =. This would ensure the ability to store complex Unicode keys without any issues with casing or reserved names (like COM on Windows, names starting with dashes, dots, etc.).
  • The MACHINE_PART could also end with d or g depending on whether a key is a dataset or a group, which should limit the number of stats/reads when listing a store with a large number of groups/datasets.

I'm happy to come up with a more detailed description, but I don't want to engage in this if I don't properly understand the trade-offs that need to be balanced.

spec v3: progressive encoding

I'm wondering if progressive encoding could be supported in Zarr. It is a technique often used on the web, where a low resolution image can be downloaded and displayed first, and then refined as the download continues (see e.g. https://cloudinary.com/blog/progressive_jpegs_and_green_martians).
Zarr currently supports only full-resolution contiguous chunks, so if you want to have a global view of the data, even if you are going to coarsen it afterwards, you have to first get all the data. Progressive encoding would allow saving a lot of bandwidth in this case, which is particularly useful for e.g. visualization.
But I'm not sure if it would be easy to fit into the current architecture, or if there is interest in it.

Node name case sensitivity

Currently the v3.0 core protocol draft section on node names states the following:

Note that node names are used to form storage keys, and that some storage systems will perform a case-insensitive comparison of storage keys during retrieval. Therefore, within a hierarchy, all nodes within a set of sibling nodes must have a name that is unique under case-insensitive comparison. E.g., the names "foo" and "FOO" are not allowed for sibling nodes.

This constraint, however, is problematic in scenarios where nodes are being created in parallel. This is because requiring that sibling nodes have a unique name under case-insensitive comparison would require that a check is made before node creation to ensure no name collisions, and any such check would require synchronisation of node creation at least within the same group.

Currently a design goal for the spec is to avoid constraints which require operations to be synchronised, thus allowing multiple parallel processes to work simultaneously, including creating nodes within the same hierarchy. This avoids implementation complexity (no need for locks or other synchronisation mechanisms) and also works better on stores with eventual consistency behaviour.

I suggest we remove this constraint from the core protocol. I.e., implementations of the core protocol are not expected to check for case-insensitive node name collisions.

However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.

I.e., we push this to the application layer.

Define system architecture

Zarr and n5 naturally decouple different components of their architecture, in a way that allows clear and simple interfaces to be defined between them, and that allows for pluggability. For example, both define a storage layer interface, based on storage and retrieval of key/value pairs, which can then have pluggable implementations (file system, S3, mongodb, ...). Zarr (via numcodecs) also defines a codec API which enables new filters or compressors to be plugged in.

It could be very helpful to draw a picture of this system architecture, and then to discuss what aspects of the architecture should be covered in a "core" spec, versus where additional specs can be layered on top or plugged in.

For example, we might have a core spec that formally defines the system architecture, defines abstract APIs for storage and chunk encoding/decoding, and defines how essential metadata are formatted. Then we might have a separate spec for each storage layer implementation, that defines how keys and values are mapped into concrete storage entities like file paths and file contents. And we might have a separate spec for each codec, that defines the encoding process and format.

In other words, a modular spec architecture, that allows new specs for things like storage or encoding to be plugged in without affecting the core spec.

Dimension names as core array metadata

Several domains make use of named dimensions, i.e., for a given array with N dimensions, each of those N dimensions is given a human-readable name.

Given the broad utility of this, should we include this within the core array metadata in the v3 protocol? E.g., add a dimensions property within the array metadata document, whose value should be a list of strings:

    "shape": [10000, 1000],
    "dimensions": ["space", "time"],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

One question this raises is how to handle the case where no names are provided, or only some dimensions are named but not others. I.e., dimension names should probably be optional.

The alternative is that we leave this to the community to define a usage convention to store dimension names in the user attributes, e.g., similar to what xarray currently does using the "_ARRAY_DIMENSIONS" attribute name.

Protocol extensions for awkward arrays

This issue is a starting point for discussing possible protocol extensions to allow support for various types of "awkward" arrays that do not fit into the model of arrays with fixed size dimensions and/or with fixed size data types.

For example, this includes "ragged" or "jagged" arrays which include at least one dimension of variable size. The simplest example of this would be an array of variable sized arrays of some simple type like integers or floats.

It may be necessary to divide up this space and split out into separate issues if it would make sense to break the problem down. Suggestions for how to do this welcome.

Some related material:

Option to use `/` as a separator instead of `.` in spec v3.

From @alimanfoo on today's meeting.

Can we add the option to use / as a separator instead of . in spec v3?
This comes from the potential need on filesystems to limit the number of files in a directory, and using / would help. Even if it could be made store-only, this may need to be exposed through the protocol, as other tools may be copying data from a filesystem to a cloud store.

My main concern would then be the difficulty of knowing whether a path, either given to or returned from a store, is an implicit group, an array or a chunk.

Say we have the following path:

/g1/g2/g3/a1/0.1.2.3.4; it would become /g1/g2/g3/a1/0/1/2/3/4, where for clarity I use g for groups, a for arrays and numbers for chunk indices.

I think that a store or protocol implementation may have to query anything from /meta/g1/g2/g3/a1/0/1/2/3/4/.array up to /meta/g1/g2/g3/a1/.array in order to know how to interpret a key.

For example, this may complicate things like list_dir(), which now needs some understanding of the structure. Listing /g1/g2/g3/a1/0/ should typically fail (it is not really a dir/group), unless we walk up to find the .array.

Listing /g1/g2/g3/a1/ would return all the chunks in the case where the separator is ., but only one entry per chunk index along the first dimension if the separator is /, unless it is made aware of the separator.

This can likely be taken care of in the protocol, though that feels like extra complexity.

Zarr N5 spec diff

Overview of the diff between zarr and n5 specs with the potential goal of consolidating the two formats.
@alimanfoo, @jakirkham / @axtimwalde please correct me if I am misrepresenting the zarr / N5 spec or if you think there is something to add here.
Note that the zarr and n5 specs have different naming conventions.
The data-containers are called arrays in zarr and datasets in n5.
Zarr refers to the nested storage of data-containers as hierarchies or groups (it is not quite clear to me what the actual difference is, see below); n5 only refers to groups.
I will use the group / dataset notation.

Edit:
Some corrections from @alimanfoo; I left in the original statements but struck them out.

Groups

  1. attributes
  • zarr: groups MUST contain a json file .zgroup which MUST contain zarr_format and MUST NOT contain any other keys. They CAN contain additional attributes in .zattrs
  • n5: groups CAN contain a file attributes.json containing arbitrary json serializable attributes. The root group "/" MUST contain the key n5 with the n5 version.
  2. zarr makes a distinction between hierarchies and groups. I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset. (Correction: Zarr does not allow nested datasets, i.e. a dataset containing another dataset.) This is not allowed in n5 either, I think; the spec does not explicitly forbid it though.

Datasets

  1. metadata
  • zarr: metadata is stored in .zarray.
  • n5: metadata is stored in attributes.json.
  2. layout:
  • zarr: supports C (row-major) and F (column-major) indexing, which determines how chunks are indexed and how chunks are stored. This is determined via the key order. Chunks are always indexed as row-major.
  3. dtype:
  • zarr: the key dtype holds the numpy type encoding. Importantly, it supports big- and little-endian, which MUST be specified.
  • n5: the key dataType, only numerical types and only big endian.
  4. compression:
  • zarr: supports all numcodecs compressors (and no compression), stored in the key compressor.
  • n5: by default supports raw (= no compression), bzip2, gzip, lz4 and xz. There is a mechanism to support additional compressors. Stored in the key compression.
  5. filters:
  • zarr: supports additional filters from numcodecs that can be applied to chunks before (de)serialization. Stored in the key filters.
  • n5: does not support this. However, the mechanism for additional compressors could be hijacked to achieve something similar.
  6. fill-value:
  • zarr: the fill value determines how chunks that don't exist are initialised. Stored in the key fill_value.
  • n5: the fill value is hard-coded to 0 (and hence not part of the spec).
  7. attributes:
  • zarr: additional attributes can be stored in .zattrs.
  • n5: additional attributes can be stored in attributes.json. They MUST NOT override keys reserved for metadata.
    In addition, zarr and n5 store the shape of the dataset and of the chunks in the metadata with the keys shape and chunks (zarr) / dimensions and blockSize (n5).

Chunk storage

  1. header:
  • zarr: chunks are stored without a header.
  • n5: chunks are stored with a header that encodes the chunk's mode (see 3.) and the shape of the chunk.
  2. shape of edge chunks:
  • zarr: chunks are always stored with the full chunk shape, even if they are over-hanging (e.g. chunk shape (30, 30) and dataset shape (100, 100)).
  • n5: only the valid part of a chunk is stored. This is possible due to 1.
  3. varlength chunks
  • zarr: as far as I know not supported.
  • n5: supports a var-length mode (specified in the header). In this case, the size of the chunk is not determined by the chunk's shape, but is additionally defined in the header. This is useful for ND storage of "less structured" data, e.g. a histogram of the values in the ROI corresponding to the chunk.
  4. indexing / storage
  • zarr: chunks are indexed by . separated keys, e.g. 2.4. I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec. These keys get mapped to a representation appropriate for the implementation. E.g. on the filesystem, keys can be trivially mapped to files called 2.4 or nested as 2/4.
  • n5: chunks are stored nested, e.g. 2/4. (This is also implementation dependent. There are implementations where nesting might not make sense. The difference is only . separated vs. / separated.)

Store type of store in top level metadata ?

I'm not 100% sure whether this is a spec question or an implementation one, and it is mostly being driven by this morning's questions on the community call.

Right now it looks to me like users need to specify which store they want to use, and that some guesses can be made based on file extensions (normalize_store_arg?).

Currently this prevents deeply changing or experimenting with stores that have a similar structure without being aware of the kind of store one is working with.

Would it be interesting to have the (top-level?) metadata contain a description of the kind of store that should be expected?

Obviously for some stores this is hard, but for URL-based or directory-based stores it should be pretty easy, and it would give some flexibility with respect to changes of internal data structure and/or bug fixes.

WIP: Multiscale use-case

Motivation

In imaging applications, especially interactive ones, the usability of a data array is greatly increased by having pre-computed sub-resolutions of the array. For example, an array of size (10**5, 10**5) might have halving-steps pre-computed, providing arrays of sizes 50000, 25000, 12500, 6250, 3125, etc. Users can quickly load a low-resolution representation to choose which regions are worth loading in higher or even full resolution. A few examples of this trend in imaging file formats are provided under Related Reading.

The current zarr spec has the following issues when trying to naively specify such sub-resolutions:

  • Arrays of differing size can only represent the individual resolution by naming convention
    ("Resolution_0", "Resolution_1", etc.). This issue exists in a number of existing formats.
  • Storing data of differing dimensions in the same chunk is not intended.
  • Even if data of differing dimensions (compression)

Generalization

In other domains, a generalization of this functionality might enable "summary data" to be stored,
where along a given dimension a function has been applied, e.g. averaging. This is usually most
beneficial when the function is sufficiently time-costly that it's worth trading storage for speed.

Potential implementations

Filter / Memory-layout

Each chunk could be passed to a function which stores or reads the multiscale representation
with a given chunk. (TBD)

Array relationships

Metadata on a given array could specify one or both inheritance relationships to other arrays.
For example, if a child array links to its parent, it might store the following metadata:

{
    "summary_of": {
        "key": "Resolution_0",
        "method": "halving",
        "dimensions": [0, 1]
    }
}

One issue with only having the parent relationship defined is how one determines the lowest
resolution. The child relationships could be represented with:

{
    "summarized_by": [
        {
            "key": "Resolution_1",
            "method": "having",
            "dimensions": [0, 1]
        }, ...

    ]
}

but this would require updating source arrays when creating a summary.

An alternative would be to provide a single source of metadata on the relationships between arrays.

Related reading

Possible synonyms / Related concepts

  • Global lossy compression
  • Progressive compression
  • Pyramidal images
  • Sub-resolutions
  • Summary views

v3 types, extensions and fallback type.

Reading on the extensions and in particular that extensions can define fallback types, I'm wondering if it's reasonable to reintroduce a "passthrough" or raw type that indicates how many bits a given type uses.

  • I'm quasi-certain that not all the existing types in the base spec can safely be used as passthrough, even for types of the same size. Float NaN having multiple binary values is one example; I am not sure about signed int (will any system use one's complement, which would raise issues?).

  • I'm guessing some people will use this with types that are not 1/2/4/8 bytes, like packed structs, so would we want to support sizes that are not powers of 2? Still, LLVM has support for 128-bit floats, for example.

  • Such a type would better convey what extensions mean, and would not prevent implementations from using those types internally.

Handling arrays with non-uniform chunking

Currently Zarr handles arrays with uniform chunking, meaning all chunks have the same size (except end chunks, which can be shorter). It would be nice if Zarr could also handle non-uniform chunking, meaning that chunks would still live on a grid but may vary in size based on their location (in other words, not only end chunks would have this feature).

Motivating use cases include saving a Dask Array with non-uniform chunking. Admittedly these could be rechunked if the chunk sizes are known, though this comes with some overhead compared to not rechunking. When the chunk sizes are unknown, this cannot easily be accomplished.
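
For illustration, one way this might look is a protocol extension defining a chunk grid type whose per-dimension chunk lengths are listed explicitly (all names here are hypothetical, not part of any spec):

# Hypothetical array metadata fragment, written as a Python dict for brevity.
array_metadata = {
    "shape": [2000, 1000],
    "chunk_grid": {
        "type": "rectilinear",                            # hypothetical grid type
        "chunk_lengths": [[1000, 500, 500], [800, 200]],  # per dimension; sums match shape
    },
}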

Extension for separate user attributes

In the current v3 protocol spec draft, an array metadata document contains both core metadata (like array data type, shape, chunk shape, etc.) and user attributes. E.g.:

{
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The same is true for groups, i.e., both core metadata and user attributes are stored together in a single group metadata document, e.g.:

{
    "extensions": [],
    "attributes": {
        "spam": "ham",
        "eggs": 42,
    }
}

The zarr v3 approach of storing core metadata and user attributes together is similar to N5.

Note that this is different from the zarr v2 protocol, where the core metadata and user attributes are stored in separate documents.

Raising this issue to surface any comments or discussion on this approach.

Some possible reasons for bringing core metadata and user attributes together into a single document:

  • Creating an array with user attributes requires only a single storage request.
  • Reading all metadata for an array requires only a single storage request.

Some possible reasons for separating core metadata and user attributes into different documents:

  • If user attributes for an array are reasonably large and rarely used, then there is some overhead when reading core metadata. I.e., you can't read just the core metadata without reading the user attributes too.
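
For comparison, a hypothetical split layout (echoing the v2 .zarray/.zattrs separation) would omit the "attributes" member from the array metadata document and store the user attributes from the example above in a sibling document of their own, e.g.:

{
    "foo": 42,
    "bar": "apples",
    "baz": [1, 2, 3, 4]
}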

spec v3: expose ability to see data/ and meta content ?

Currently in the v2 spec, some store operations like listdir() are pretty straightforward, as there is a single namespace.

In v3, though, there are meta/ and data/ prefixes, and some operations return ambiguous results:

# v2
>>> ds.listdir('Tair')

['.zarray',
 '.zattrs',
 '0.0.0',
 '0.0.1',
 '0.1.0',
 '0.1.1',
 '0.2.0',
 '0.2.1',
 '0.3.0',
 '0.3.1']
# v3
>>> ds.listdir('Tair')

['.array',
 '0',
 '1',
 '2',
 ...
# note that `/` is the default separator in v3.

As a human it's clear where those are; for an implementation, you have to understand the meaning of each of those keys to know whether they are in meta/ or data/, which may require actually poking at and loading the content of .array (or .group?).

It also seems like some extensions may add keys.

Do we want to add some functionality to tell (without an extra round trip) whether keys are under meta/ or data/?
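
One possible shape for such a convenience, purely as a hypothetical sketch (neither the function nor the list_prefix store method name is defined by the draft; adjust to whatever listing operation the store interface ends up with):

# Hypothetical helper: group a node's child keys by whether they live under the
# meta/ or data/ prefix, so callers need not inspect .array/.group documents.
def list_children(store, path):
    return {
        "meta": list(store.list_prefix("meta/" + path)),
        "data": list(store.list_prefix("data/" + path)),
    }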

Making chained compression an extension

Sorry if I'm rehashing things. Asked some colleagues to take a look at the spec and provide feedback. This is basically just me transcribing their thoughts. Also if I missed things or made errors, would encourage them to hop in and correct me. πŸ™‚

One of the things that they found somewhat concerning about the existing Zarr implementation was support of chained compression. Namely that one could do things like apply GZIP and then LZ4. It would be preferable to just allow one form of compression by default. Allowing general purpose chaining like this complicates implementation and likely adds little value. Maybe this is already the common case?

That said, they did note that other storage specs (like Parquet or ORC) may have a data packing step prior to compression (like RLE or dictionary coding). This seems to still make sense to support, but it would be preferable to have only one optional compressor follow an optional data packing step.
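
Concretely, the restricted model could look like the following sketch, with at most one optional packing step and one optional compressor; the "packer" member and the delta codec URI are illustrative only, not part of the draft spec:

{
    "packer": {
        "codec": "https://example.org/codec/delta/1.0"
    },
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    }
}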

Thoughts?

Proposal: open ended zarrs

Problem

We store gridded fields which are continuously being appended to and having old data expunged.

An example of this would be a weather forecast dataset which is being modified hourly as new simulation data is generated and old expired data is removed. Currently we are storing this data as many netCDF files in S3 which expire after 24 hours. However the downside to this is that we have to maintain a metadata index somewhere describing all the files (around half a million at any one time).

We would like to explore storing this data using zarr, but have concerns about how to logically partition the data. Ideally we would like to maintain one zarr group which stores each field as a zarr array, however we are facing some challenges with this:

  • Every time we append data to the array we have to update the attributes with a new shape. Data is ingested in parallel and we are storing it in S3, so maintaining a file lock on the metadata is not possible and updating the object may not replicate fast enough.
  • We wish to remove old data from the beginning of the array. Currently it is my understanding that this would require re-indexing all chunks each time as well as rewriting the metadata.

Proposal

One idea we've had to resolve this is to change the way the attributes are stored and chunks are loaded.

  • We could make storing the shape of the array optional. This would mean that when we append new chunks we would not need to update the metadata.
  • If zarr attempts to load a chunk which does not exist (either because it hasn't been created yet or it has been expunged) it could return a NaN array of the correct shape.

In practice this would mean that I could load my weather forecast zarr and request a slice of temperature data, for 12pm the following day from the 48 simulations run up until that time. This would result in an array with NaN values for old runs which have been removed and new runs which are yet to be run, but it will contain the data which does currently exist including the very latest simulation available.
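
A rough sketch of the read path this implies, assuming a store-level get that returns None for a missing key (all names here are illustrative, not an agreed interface):

# Illustrative read path: a chunk key that is absent, either not yet written or
# already expunged, decodes to an array full of the fill value instead of raising.
import numpy as np

def read_chunk(store, key, chunk_shape, dtype, decode, fill_value=np.nan):
    raw = store.get(key)          # assumed to return None when the key is absent
    if raw is None:
        return np.full(chunk_shape, fill_value, dtype=dtype)
    return decode(raw)            # decompress and reshape as usual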

This would result in ever increasing indices, which would eventually hit limits on object key length. One solution to that would be to create a parallel zarr occasionally which would reset the index and ingest into both and eventually remove the older one when the new one gets to a certain age. This would result in some duplication of data but would avoid this problem.

Conclusion

I would be really keen to hear feedback from the community on this. I discussed it last week with @mrocklin and @jhamman. Would be good to hear more thoughts on this.

Protocol extensions

Protocol extensions are mentioned throughout the v3.0 core protocol docs but they are still undocumented. Has there been any design discussions around how to specify and implement protocol extensions yet?

License?

Currently we don't have a license here.

Many other projects in this org have chosen MIT, which might make sense. N5 uses BSD 2-Clause. Other Zarr/N5 projects use various licenses ranging from MIT to Apache. Maybe one of these makes sense.

As a counterpoint, the idea of this repo is to include documents, not code. Perhaps it would be better to pick a different license given the intended usage. One of the CC licenses might be a reasonable choice.

There could be other choices that I've missed that would be relevant to discuss. Feel free to mention ideas here and we can try to reach consensus on one.

Extension for data integrity

In a recent conversation with @balaji-gfdl, we discussed the importance of data integrity. Verifying data integrity is crucial for many data providers. For standard single-file-based formats (e.g. netcdf, hdf5, csv), the md5 checksum is the gold standard for verifying that a file is binary identical after network transmission. What best practice for data integrity will be recommended for Zarr data?

Fletcher checksums have been proposed as a possible filter / compressor option in #38. AFAIU, these would tell us whether a single chunk has been corrupted in transit. They do not address some broader questions, such as:

  • How do we verify that two arrays are the same (down to the bit) in a chunked Zarr store and a legacy data format?
  • How do we verify that two arrays with different chunk structure are the same (down to the bit)?

There is obviously an expensive way to do this: open each array in python and say np.testing.assert_array_equal. This is a lot more expensive and inconvenient than just running md5 on the command line.

It seems like this is a question that a certain type of computer scientist may already know the answer to. Some sort of hierarchical checksum, extension of the Fletcher algorithm, etc....
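
For example, here is a minimal (and still relatively expensive) sketch of a digest that is independent of chunking and storage format, computed over the decoded values in canonical C order; it is not a proposal, just an illustration of what a "logical" checksum could mean:

# Digest the decoded array values slab by slab along axis 0, in C order, so the
# result does not depend on chunk layout or container format (it does depend on
# dtype and byte order). Works on anything sliceable into NumPy arrays, e.g. a
# zarr array or an h5py dataset, which allows comparing a chunked copy against a
# legacy single-file copy.
import hashlib
import numpy as np

def logical_md5(array, slab=1024):
    h = hashlib.md5()
    for start in range(0, array.shape[0], slab):
        h.update(np.ascontiguousarray(array[start:start + slab]).tobytes())
    return h.hexdigest()

print(logical_md5(np.arange(12.0).reshape(4, 3)))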

Whatever the technical solution, it seems essential that we have some sort of answer to this common question.

z5 library (Zarr/N5 interoperability)

Ran across z5 recently, which allows reading and writing of both Zarr and N5 in C++ and Python. As Zarr and N5 have both grown, FWICT for similar reasons but in different languages (Python and Java respectively), I am interested to understand the similarities and differences between them. Along those lines, it would be good to learn in what areas interoperability between Zarr and N5 can be improved. I think we would be in a really great place if data could move more smoothly between these two formats and different languages.

cc @constantinpape @saalfeldlab

Support zero-padding chunk indices when generating chunk keys

In the current v3 core protocol draft, chunk keys are formed by concatenating chunk indices without any zero padding, e.g., "0.0" and "100.200", etc. However, this means chunk files/objects do not sort lexically, which can be convenient when accessing zarr data via generic tools. A lexical sort could be achieved with zero padding, e.g., "0.0" becomes "000.000". This is hard to generalise, because fixing a number of zeros to pad would constrain the number of chunks on any dimension, and it is impossible in general to know ahead of time how many chunks are needed given that array dimensions can be resized. However, it might be possible to add this as an option, with the expectation that it is not the default but may in some circumstances be specified by the user.
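
A minimal sketch of the optional behaviour described above; the pad_width parameter is a hypothetical user-facing option, not part of the draft spec:

# Build a chunk key, optionally zero-padding each index so keys sort lexically.
def chunk_key(chunk_coords, pad_width=0, separator="."):
    return separator.join(str(i).zfill(pad_width) for i in chunk_coords)

assert chunk_key((0, 0)) == "0.0"
assert chunk_key((100, 200), pad_width=3) == "100.200"
assert chunk_key((0, 0), pad_width=3) == "000.000"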

Extension proposal: multiscale arrays v0.1

This issue has been migrated to an image.sc topic after the 2020-05-06 community discussion. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Feedback and change requests are welcome either on this repository or on image.sc.


As a first draft of support for the multiscale use-case (#23), this issue proposes an intermediate nomenclature for describing groups of Zarr arrays which are scaled down versions of one another, e.g.:

example/
β”œβ”€β”€ 0    # Full-sized array
β”œβ”€β”€ 1    # Scaled down 0, e.g. 0.5; for images, in the X&Y dimensions
β”œβ”€β”€ 2    # Scaled down 1, ...
β”œβ”€β”€ 3    # Scaled down 2, ...
└── 4    # Etc.

This layout was independently developed in a number of implementations and has since been implemented in others, including:

Using a common metadata representation across implementations:

  1. fosters a common vocabulary between existing implementations
  2. enables other implementations to reliably detect multiscale arrays
  3. permits the upgrade of v0.1 arrays to future versions of this or other extensions
  4. tests this extension for limitations against multiple use cases

A basic example of the metadata that is added to the containing Zarr group is seen here:

{
  "multiscales": [
    {
      "datasets" : [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
      "version" : "0.1"
    }
     // See the detailed example below for optional metadata
  ]
}

Process

An RFC process for Zarr does not yet exist. Additionally, the v3 spec is a work-in-progress. However, since the implementations listed above as well as others are already being developed, I'd propose that if a consensus can be reached here, this issue should be turned into an .rst file similar to those in the v3 branches (e.g. filters) and used as a temporary spec for defining arrays, with the understanding that this is a prototype intended to be amended and brought into the general extension mechanism as it develops.

I'd welcome any suggestions/feedback, but especially around:

  • Better terms for "multiscale" and "series"
  • The most useful enum values
  • Is this already too complicated? (Limit to one series per group?) Or, on the flip side:
  • Are there existing use cases that aren't supported? (Note: I'm aware of some examples like BDV's N5 format but I'd suggest they are higher-level than just "multiscale arrays".)

Deadline for a first round of comments: March 15, 2020
Deadline for a second round of comments: April 15, 2020

Detailed example

Color key (according to https://www.ietf.org/rfc/rfc2119.txt):

- MUST     : If these values are not present, the multiscale series will not be detected.
! SHOULD   : Missing values may cause issues in future versions.
+ MAY      : Optional values which can be readily omitted.
# UNPARSED : When updating between versions, no transformation will be performed on these values.

Color-coded example:

-{
-  "multiscales": [
-    {
!      "version": "0.1",
!      "name": "example",
-      "datasets": [
-        {"path": "0"},
-        {"path": "1"},
-        {"path": "2"}
-      ],
!      "type": "gaussian",
!      "metadata": {
+        "method":
#          "skimage.transform.pyramid_gaussian",
+        "version":
#          "0.16.1",
+        "args":
#          [true],
+        "kwargs":
#          {"multichannel": true}
!      }
-    }
-  ]
-}

Explanation

  • Multiple multiscale series of datasets can be present in a single group.
  • By convention, the first multiscale should be chosen if all else is equal.
  • Alternatively, a multiscale can be chosen by name or, with slightly more effort, by the zarray metadata such as chunk size.
  • The paths to the arrays are ordered from largest to smallest.
  • These paths could potentially point to datasets in other groups via "../foo/0" in the future. For now, the identifiers MUST be local to the annotated group.
  • These values SHOULD (MUST?) come from the enumeration below.
  • The metadata example is taken from https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.pyramid_reduce

Type enumeration:

Sample code

#!/usr/bin/env python
import argparse
import zarr
import numpy as np
from skimage import data
from skimage.transform import pyramid_gaussian, pyramid_laplacian

parser = argparse.ArgumentParser()
parser.add_argument("zarr_directory")
ns = parser.parse_args()


# 1. Setup of data and Zarr directory
base = np.tile(data.astronaut(), (2, 2, 1))

gaussian = list(
    pyramid_gaussian(base, downscale=2, max_layer=4, multichannel=True)
)

laplacian = list(
    pyramid_laplacian(base, downscale=2, max_layer=4, multichannel=True)
)

store = zarr.DirectoryStore(ns.zarr_directory)
grp = zarr.group(store)
grp.create_dataset("base", data=base)


# 2. Generate datasets
series_G = []
for g, dataset in enumerate(gaussian):
    if g == 0:
        path = "base"
    else:
        path = "G%s" % g
        grp.create_dataset(path, data=gaussian[g])
    series_G.append({"path": path})

series_L = []
for l, dataset in enumerate(laplacian):
    if l == 0:
        path = "base"
    else:
        path = "L%s" % l
        grp.create_dataset(path, data=laplacian[l])
    series_L.append({"path": path})


# 3. Generate metadata block
multiscales = []
for name, series in (("gaussian", series_G),
                     ("laplacian", series_L)):
    multiscale = {
      "version": "0.1",
      "name": name,
      "datasets": series,
      "type": name,
    }
    multiscales.append(multiscale)
grp.attrs["multiscales"] = multiscales

which results in a .zattrs file of the form:

{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "G1"
                },
                {
                    "path": "G2"
                },
                {
                    "path": "G3"
                },
                {
                    "path": "G4"
                }
            ],
            "name": "gaussian",
            "type": "gaussian",
            "version": "0.1"
        },
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "L1"
                },
                {
                    "path": "L2"
                },
                {
                    "path": "L3"
                },
                {
                    "path": "L4"
                }
            ],
            "name": "laplacian",
            "type": "laplacian",
            "version": "0.1"
        }
    ]
}

and the following on-disk layout:

/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmp77n1ga3r.zarr
β”œβ”€β”€ G1
β”‚   β”œβ”€β”€ 0.0.0
...
β”‚   └── 3.1.1
β”œβ”€β”€ G2
β”‚   β”œβ”€β”€ 0.0.0
β”‚   β”œβ”€β”€ 0.1.0
β”‚   β”œβ”€β”€ 1.0.0
β”‚   └── 1.1.0
β”œβ”€β”€ G3
β”‚   β”œβ”€β”€ 0.0.0
β”‚   └── 1.0.0
β”œβ”€β”€ G4
β”‚   └── 0.0.0
β”œβ”€β”€ L1
β”‚   β”œβ”€β”€ 0.0.0
...
β”‚   └── 3.1.1
β”œβ”€β”€ L2
β”‚   β”œβ”€β”€ 0.0.0
β”‚   β”œβ”€β”€ 0.1.0
β”‚   β”œβ”€β”€ 1.0.0
β”‚   └── 1.1.0
β”œβ”€β”€ L3
β”‚   β”œβ”€β”€ 0.0.0
β”‚   └── 1.0.0
β”œβ”€β”€ L4
β”‚   └── 0.0.0
└── base
    β”œβ”€β”€ 0.0.0
...
    └── 1.1.1

9 directories, 54 files
Revision | Source | Date | Description
6 | External feedback on twitter and image.sc | 2020-05-06 | Remove "scale"; clarify ordering and naming
5 | External bug report from @mtbc | 2020-04-21 | Fixed error in the simple example
4 | #50 (comment) | 2020-04-08 | Changed "name" to "path"
3 | Discussions up through #50 (comment) | 2020-04-01 | Updated naming schema
2 | #50 (comment) | 2020-03-07 | Fixed typo
1 | @joshmoore | 2020-03-06 | Original text from in person discussions

Thanks to @ryan-williams, @jakirkham, @freeman-lab, @petebankhead, @jni, @sofroniewn, @chris-allan, and anyone else whose GitHub account I've forgotten for the preliminary discussions.

v3 spec development process & milestones

Repurposing this issue to discuss process and milestones for development of the v3 protocol spec.

Straw man proposal for discussion:

  • Phase 1 (current)
    • Continue discussing technical points in the current editor's draft via zarr-specs github and the v3 spec dev calls
    • Accumulate some early implementation experience from zarr-python (Matthias), zarrita (Alistair), and at least one other language (ideally statically-typed)
    • Take tentative positions on any remaining open questions, update and complete the v3 protocol and associated specs, highlighting any particular points where we'd like feedback
    • Milestone: Publish "First Working Draft"
    • Request for comments
  • Phase 2
    • Accumulate some more implementation experience, continue technical discussions
    • Compile and discuss comments from community review on first working draft, decide how to address comments
    • Revise v3 protocol spec
    • Establish a test suite for verifying completeness and compatibility between different implementations
    • Milestone: Publish "Second Working Draft"
    • Call for implementations
  • Phase 3
    • Focus on implementations, getting completeness and compatibility
    • Deal with any further (hopefully minor) technical issues that arise from new implementations
    • Finalise v3 protocol and associated specs
    • Request approval from Zarr Steering Group to publish
    • Milestone: Publish "Recommendation"

Name?

Have named this spec for lack of creativity. Though maybe it should be zarr-spec or something else to aid users forking the repo and to avoid conflicts with similarly named things. There may be other good choices for a name. If people have suggestions, feel free to voice them. Maybe we can pick the best name by consensus?
