For example, "column_annotations" instead of "smp"

Consider changing variable names to be more generic (anndata issue, closed)

scverse commented on July 18, 2024

Consider changing variable names to be more generic


Comments (8)

falexwolf commented on July 18, 2024

OK! At some point this discussion had to be reopened.

Here's the structure of the AnnData object: https://scanpy.readthedocs.io/en/latest/api/scanpy.api.AnnData.html#scanpy.api.AnnData

The convention is that samples/observations (e.g. cells) of variables/features (e.g. genes) are stored in the rows of a matrix. The columns correspond to variables/features. This is the convention of the modern classics of statistics (Hastie et al., 2009) and machine learning (Murphy, 2012), the convention of dataframes in both R and Python, and that of the established machine learning and statistics packages in Python (statsmodels, scikit-learn).
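As a rough illustration of this convention, here is a minimal sketch using NumPy. The data are made up, and the names `X`, `cell_names` and `gene_names` are hypothetical, chosen only for this example; they are not part of the anndata API.

```python
import numpy as np

# Hypothetical toy data: 3 observations (cells) x 4 variables (genes).
cell_names = ["cell_0", "cell_1", "cell_2"]
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d"]
X = np.array([
    [0, 3, 1, 0],
    [2, 0, 0, 5],
    [1, 1, 4, 0],
])

# Rows index observations, columns index variables.
n_obs, n_vars = X.shape

# One row holds all variables measured for a single observation (cell).
first_cell = X[0, :]
# One column holds a single variable (gene) across all observations.
first_gene = X[:, 0]
```

This is the same orientation that scikit-learn estimators expect for their input matrix (samples in rows, features in columns).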

To come up with a name for the annotation of rows/samples/observations we voted for .smp (short for .samples). One could also think about .obs or .observations.

Initially I thought that we should stick with .rows, but all Statistics and ML algorithms expect an oriented data matrix, and some communities (genomics) use different conventions than the one mentioned above. So we thought it's less confusing if we go with .smp for the annotation of samples.

What do you think?


joshua-gould commented on July 18, 2024

I find "samples" to be confusing, as its meaning changes depending on who you're talking to. "Variables" seems to be self-explanatory.


falexwolf commented on July 18, 2024

I realized only recently that in the single-cell community, a 'sample' could mean a 'batch' of single-cell measurements.

So just then I thought that maybe 'observation' is better. Would you be happier with that? Or what is your suggestion?


joshua-gould commented on July 18, 2024

I do think observation is better.


flying-sheep commented on July 18, 2024

Hi, I think I’ll quickly write up why anything is better than row/col. The best name for the two dimensions is up for discussion, but there are basically two ways to store data:

1. The most generic way is a tidy/long format. pandas.DataFrame solves this.
2. More specialized kinds of data can be optimized for their specific features. We observed that many biological datasets have a specific shape: a rectangular (sparse or dense) numeric matrix of observations/samples/individuals × features/variables, and dense metadata for observations and features.

The long format would be both wasteful in space and less semantic for the same reason – each observation would have all feature and sample metadata, e.g. [Expression, GeneID, GeneSymbol, SampleID, BatchID]

AnnData is designed for the second case.
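To make the space argument concrete, here is a small sketch in plain Python with made-up values. The variable names and toy data are assumptions for illustration only; the field names follow the example list above. It counts how many values each layout has to store.

```python
# Hypothetical toy data: 2 observations (samples) x 3 features (genes).
expression = [
    [5.0, 0.0, 2.0],
    [1.0, 7.0, 0.0],
]
sample_meta = [
    {"SampleID": "s1", "BatchID": "b1"},
    {"SampleID": "s2", "BatchID": "b1"},
]
gene_meta = [
    {"GeneID": "g1", "GeneSymbol": "A"},
    {"GeneID": "g2", "GeneSymbol": "B"},
    {"GeneID": "g3", "GeneSymbol": "C"},
]

# Long format: one record per (sample, gene) pair, so the metadata of
# each sample and each gene is repeated in every record it appears in.
long_records = [
    {"Expression": expression[i][j], **gene_meta[j], **sample_meta[i]}
    for i in range(len(sample_meta))
    for j in range(len(gene_meta))
]

# Number of stored values in the long format: 6 records x 5 fields.
long_cells = sum(len(record) for record in long_records)

# Matrix plus per-axis metadata: every value is stored exactly once.
matrix_cells = (
    sum(len(row) for row in expression)
    + sum(len(meta) for meta in sample_meta)
    + sum(len(meta) for meta in gene_meta)
)
```

Even for this tiny example the long layout stores 30 values against 16 for the matrix with separate annotations, and the gap widens as the number of observations and features grows.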


falexwolf commented on July 18, 2024

OK, we will transition to observations / variables instead of samples / variables. Of course, code will remain backwards compatible and only in some future version, we will remove the .smp attributes. This transition should happen within the next few days.

A bit further in the future, we might also provide an easy way of switching the row/column storage convention of AnnData. As people working with genomic data often store the observations in the columns, one could think of a global switch that allows changing between both conventions... Let's see if this would become too complicated...


falexwolf commented on July 18, 2024

The new versions of anndata and Scanpy use the notion "observations" instead of "samples". This is reflected in the attribute .obs. Of course, .smp will continue to work for some time.
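One common way to keep an old attribute name working during such a transition is a property that forwards to the new attribute and emits a deprecation warning. The sketch below is only an illustration of that general pattern: the class `AnnotatedMatrix` is hypothetical and is not how anndata actually implements the alias.

```python
import warnings


class AnnotatedMatrix:
    """Hypothetical container; not the actual anndata implementation."""

    def __init__(self, obs):
        # Annotation of observations (rows), stored under the new name.
        self.obs = obs

    @property
    def smp(self):
        # Old name: still readable, but warns and forwards to `.obs`.
        warnings.warn(
            "`.smp` is deprecated, use `.obs` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.obs

    @smp.setter
    def smp(self, value):
        # Old name: still writable, but warns and forwards to `.obs`.
        warnings.warn(
            "`.smp` is deprecated, use `.obs` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.obs = value


adata = AnnotatedMatrix(obs={"cell_type": ["B", "T"]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = adata.smp  # old code path keeps working
```

Existing code that reads or writes `.smp` continues to run, while the warning points users to `.obs` before the old name is removed.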


falexwolf commented on July 18, 2024

Please, see http://anndata.readthedocs.io and https://scanpy.readthedocs.io.
