Comments (9)

falexwolf commented on August 17, 2024

Sorry for the late response, I was on holidays.

I looked into xarray in the beginning and decided against it because of the missing sparse data support and the plain fact that things like scikit-learn only accept numpy arrays and sparse matrices as input.

These days, we're putting a lot of thought into improving the backed infrastructure of anndata for chunked calculations. We might return to xarray for that reason. I can also keep you posted here on the benchmarks soon.

shoyer commented on August 17, 2024

> However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.

Well, to be clear -- it could handle the wrapping by itself. We would need to define a metadata convention (but this should be pretty simple/straightforward).

flying-sheep commented on August 17, 2024

We now support zarr, which is feature-comparable, so I guess this can be closed.

ivirshup commented on August 17, 2024

@Hoeze, do you have a sense of how sparse data could be handled with netCDF, or whether anyone is working on it? I saw you had mentioned this on the xarray sparse issue, but I haven't been able to find out much myself.

If we could conform more to a standard like netCDF, that could help with interchange as mentioned here: ivirshup/sc-interchange#5.

Hoeze commented on August 17, 2024

@ivirshup Yes, there are some things ongoing.
The best bet for native sparse array support in xarray will be pydata/sparse.
However, you should talk to @shoyer for the native integration into xarray.
It would be awesome if someone would push this!

This solution will likely only support COO format for some time until pydata/sparse supports CSD (see pydata/sparse#258).
However, a lot of frameworks like TileDB or Tensorflow support only COO anyway.

In the meantime, you can still save the data in sparse format and wrap it yourself.
I.e., take the coordinate index and the data array from your sparse matrix and save them as NetCDF4.
This of course requires some wrapping inside AnnData or any other framework you want to use.
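
As a rough sketch of that write-side wrapping (assuming scipy and xarray; the variable names `X_data`, `X_row`, `X_col` are my own choices, not any convention):

```python
import scipy.sparse
import xarray as xr

# Toy stand-in for an expression matrix; ~95% of entries are zero.
X = scipy.sparse.random(1000, 2000, density=0.05, format="coo")

# Store the coordinate index and the data array as ordinary dense
# variables along an "nnz" (number of non-zeros) dimension.
ds = xr.Dataset(
    {
        "X_data": ("nnz", X.data),
        "X_row": ("nnz", X.row),
        "X_col": ("nnz", X.col),
    },
    attrs={"X_shape": list(X.shape)},  # needed to rebuild the matrix later
)
ds.to_netcdf("matrix.nc")
```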


IMHO, if possible I would prefer a dense matrix over a sparse one.
Everything with a sparsity ratio lower than 90-95% will very likely cost more processing power to decode than it can theoretically save, especially in cases where you have to convert it to dense format anyway.
Also, compression algorithms can save comparable amounts of storage.
In either case, you save a lot of engineering effort.
However, @falexwolf might have another opinion, as he did a lot of benchmarking on this.
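
For reference, a minimal sketch of that compressed-dense alternative using h5py's built-in gzip filter (the chunk shape and file name here are arbitrary choices):

```python
import h5py
import numpy as np

# Dense matrix with ~95% zeros, stored densely but compressed.
X = np.random.rand(1000, 2000)
X[X < 0.95] = 0.0

with h5py.File("matrix_dense.h5", "w") as f:
    # gzip collapses the long runs of zeros; chunking keeps partial
    # reads along either dimension possible.
    f.create_dataset("X", data=X, chunks=(100, 200), compression="gzip")
```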

ivirshup commented on August 17, 2024

Thanks for the feedback!

There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.

> However, a lot of frameworks like TileDB or Tensorflow support only COO anyway.

I think this is fine. On-the-fly conversion from COO to CSR or CSC should be easy enough. The main issue with COO right now is that scipy.sparse's version doesn't have subsetting, which makes it a pain to use here.
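
For example, with scipy.sparse the conversion is a one-liner, while COO itself rejects subsetting:

```python
import scipy.sparse

coo = scipy.sparse.random(100, 50, density=0.1, format="coo")

# coo[10, :]     # TypeError: scipy's COO doesn't support indexing
csr = coo.tocsr()  # cheap on-the-fly conversion
row = csr[10, :]   # CSR supports row subsetting
```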

> I.e., take the coordinate index and the data array from your sparse matrix and save them as NetCDF4.
> This of course requires some wrapping inside AnnData or any other framework you want to use.

I'm not entirely sure what this entails. Will I be able to have a COO array and a dense array with shared coordinates in a netCDF file? Or is that the wrapping you were referring to?

> IMHO, if possible I would prefer a dense matrix over a sparse one.

I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5.

To me, the main pain points with sparse representation are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.

Hoeze commented on August 17, 2024

> There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.

Yes, with pydata/xarray#3117 this could finally happen soon!

> > I.e., take the coordinate index and the data array from your sparse matrix and save them as NetCDF4.
> > This of course requires some wrapping inside AnnData or any other framework you want to use.
>
> I'm not entirely sure what this entails. Will I be able to have a COO array and a dense array with shared coordinates in a netCDF file? Or is that the wrapping you were referring to?

Yes, that's the wrapping problem:
NetCDF does not have (as far as I know) any conventions about storing sparse structures, trees, etc.
This means you have to store e.g. a sparse COO matrix as a coordinate matrix and a value vector.
When reading this data, you then have to wrap it with e.g. pydata/sparse, scipy.sparse, or another language-dependent library.
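
A read-side sketch of that wrapping, matching the ad-hoc variable names from the write-side example above (`sparse` here is pydata/sparse):

```python
import scipy.sparse
import sparse  # pydata/sparse
import xarray as xr

ds = xr.open_dataset("matrix.nc")
shape = tuple(int(n) for n in ds.attrs["X_shape"])

# Rewrap the stored components with a language-dependent sparse library.
X_scipy = scipy.sparse.coo_matrix(
    (ds["X_data"].values, (ds["X_row"].values, ds["X_col"].values)),
    shape=shape,
)
X_pydata = sparse.COO.from_scipy_sparse(X_scipy)
```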

However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.

> > IMHO, if possible I would prefer a dense matrix over a sparse one.
>
> I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5.
>
> To me, the main pain points with sparse representation are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.

TileDB will be very useful in this case. It is multithreaded and stores data in chunks, i.e. even non-compressed dimension lookups should be quite fast.
Unfortunately, TileDB's Python and R libraries are still in their infancy.

ivirshup commented on August 17, 2024

@shoyer, what would the goals of a netCDF-storable sparse array be for xarray? Would you just want to target reading the whole array into memory at once via xarray?

I see how this would be straightforward. If partial/chunked access for dask or keeping the data compatible with netCDF libraries are goals, I think it gets more complicated. Are these cases in scope for an xarray solution, or would this have to happen downstream?

shoyer commented on August 17, 2024

Reading whole sparse arrays from a netCDF file at once seems like a good start, and something that could easily be done in xarray.

Eventually, it would probably be nice to have chunked/partial access, but that does seem much more complicated. I'm not sure netCDF is the right file format in that case, since you probably want a more intelligent (tree-like) on-disk indexing scheme and netCDF's compression filters are not very flexible. Maybe this could be done more easily with zarr? Either way, xarray could wrap a third-party library that implements sparse arrays on disk.
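
As a minimal sketch of that zarr direction (the group layout and names here are my own, not an established scheme; zarr v2-style API assumed), storing the CSR components as separately chunked arrays:

```python
import scipy.sparse
import zarr

X = scipy.sparse.random(10_000, 2_000, density=0.01, format="csr")

g = zarr.open_group("matrix.zarr", mode="w")
g.attrs["shape"] = list(X.shape)
# indptr is small enough to read whole; a row range then maps to a
# contiguous slice of data/indices, which chunked reads can serve.
g.create_dataset("data", data=X.data, chunks=(100_000,))
g.create_dataset("indices", data=X.indices, chunks=(100_000,))
g.create_dataset("indptr", data=X.indptr)
```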
