Comments (8)
OK! At some point this discussion had to be reopened.
Here's the structure of the AnnData object: https://scanpy.readthedocs.io/en/latest/api/scanpy.api.AnnData.html#scanpy.api.AnnData
The convention is that samples/observations (e.g. cells) are stored in the rows of a matrix, and the columns correspond to variables/features (e.g. genes). This is the convention of the modern classics of statistics (Hastie et al., 2009) and machine learning (Murphy, 2012), of dataframes in both R and Python, and of the established machine learning and statistics packages in Python (statsmodels, scikit-learn).
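To make the orientation concrete, here is a minimal sketch in plain Python. The cell names, gene names, and values are made up for illustration, and nested lists stand in for a real matrix type.

```python
# Hypothetical toy data: 3 cells (observations) and 2 genes (variables).
obs_names = ["cell_1", "cell_2", "cell_3"]
var_names = ["gene_a", "gene_b"]

# Rows are observations, columns are variables.
X = [
    [1.0, 0.0],  # cell_1
    [3.0, 2.0],  # cell_2
    [0.0, 5.0],  # cell_3
]

# There is exactly one annotation entry per row and one per column.
assert len(X) == len(obs_names)
assert all(len(row) == len(var_names) for row in X)

# A genes x cells matrix, as often used in genomics, is simply the transpose.
genes_by_cells = [list(column) for column in zip(*X)]
```

Naming the annotations after what they describe (observations, variables) rather than after their position (rows, columns) keeps them meaningful even when the matrix is transposed.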
To come up with a name for the annotation of rows/samples/observations we voted for .smp (short for .samples). One could also think about .obs or .observations.
Initially I thought that we should stick with .rows, but all Statistics and ML algorithms expect an oriented data matrix, and some communities (genomics) use conventions different from the one mentioned above, so we thought it would be less confusing to go with .smp for the annotation of samples.
What do you think?
from anndata.
I realized only recently that in the single-cell community, a 'sample' could mean a 'batch' of single-cell measurements.
So I thought that maybe 'observation' is better. Would you be happier with that? Or what is your suggestion?
from anndata.
Hi, I think I’ll quickly write up why anything is better than row/col. The best name for the two dimensions is up for discussion, but there are basically two ways to store data:
- The most generic way is a tidy/long format. pandas.DataFrame solves this.
- More special kinds of data can be optimized for their specific features. We observed that many biological datasets have a specific shape: A rectangular (sparse or dense) numeric matrix of observations/samples/individuals × features/variables, and dense metadata for observations and features.
The long format would be both wasteful in space and less semantic for the same reason – each observation would have all feature and sample metadata, e.g. [Expression, GeneID, GeneSymbol, SampleID, BatchID]
AnnData is designed for the second case.
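The difference between the two layouts can be sketched with plain Python containers. All names and values below are made up for illustration; in practice the long records would typically live in a pandas.DataFrame and the matrix in a NumPy or SciPy array.

```python
# Matrix layout: one numeric matrix plus one metadata table per dimension.
X = [
    [1.0, 0.0],
    [3.0, 2.0],
    [0.0, 5.0],
]
obs = {"SampleID": ["s1", "s2", "s3"], "BatchID": ["b1", "b1", "b2"]}
var = {"GeneID": ["g1", "g2"], "GeneSymbol": ["A", "B"]}

n_obs, n_var = len(X), len(X[0])

# Long layout: one record per (observation, feature) pair, with all
# observation and feature metadata repeated in every record.
long_rows = [
    {
        "Expression": X[i][j],
        "GeneID": var["GeneID"][j],
        "GeneSymbol": var["GeneSymbol"][j],
        "SampleID": obs["SampleID"][i],
        "BatchID": obs["BatchID"][i],
    }
    for i in range(n_obs)
    for j in range(n_var)
]

# Count the stored values in each layout.
wide_values = (
    n_obs * n_var
    + sum(len(column) for column in obs.values())
    + sum(len(column) for column in var.values())
)
long_values = sum(len(row) for row in long_rows)
```

Even for this 3 × 2 toy example the long layout stores 30 values against 16 for the matrix layout, and the gap grows with every additional metadata column.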
from anndata.
OK, we will transition to observations / variables instead of samples / variables. Of course, code will remain backwards compatible; we will remove the .smp attributes only in some future version. This transition should happen within the next few days.
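One common way to keep an old attribute name working during such a transition is a property that forwards to the new name and emits a deprecation warning. The sketch below uses a hypothetical `AnnotatedMatrix` class to show the pattern; it is not the actual anndata implementation.

```python
import warnings


class AnnotatedMatrix:
    """Hypothetical stand-in for illustration, not the real AnnData class."""

    def __init__(self, X, obs):
        self.X = X
        self.obs = obs  # new name: annotation of observations

    @property
    def smp(self):
        # Old name: still readable, but warns and forwards to `.obs`.
        warnings.warn(
            "`.smp` is deprecated, use `.obs` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.obs

    @smp.setter
    def smp(self, value):
        # Old name: still writable, but warns and forwards to `.obs`.
        warnings.warn(
            "`.smp` is deprecated, use `.obs` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.obs = value


adata = AnnotatedMatrix(X=[[1.0, 0.0]], obs={"cell_type": ["B cell"]})

# Record warnings so the old-style access can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = adata.smp
```

Old code that reads or writes `.smp` keeps running and sees the same data as `.obs`, while the warning tells users which name to migrate to before the alias is removed.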
A bit further in the future, we might also provide an easy way of switching the row/column storage convention of AnnData. Since people working with genomic data often store the observations in the columns, one could think of a global switch that allows changing between both conventions... Let's see if this would become too complicated...
from anndata.
The new versions of anndata and Scanpy use the notion "observations" instead of "samples". This is reflected in the attribute .obs. Of course, .smp will continue to work for some time.
from anndata.
Please see http://anndata.readthedocs.io and https://scanpy.readthedocs.io.
from anndata.