
Comments (8)

falexwolf commented on July 18, 2024

OK! At some point this discussion had to be reopened.

Here's the structure of the AnnData object: https://scanpy.readthedocs.io/en/latest/api/scanpy.api.AnnData.html#scanpy.api.AnnData

The convention is that the rows of the data matrix store samples/observations (e.g. cells) and the columns store variables/features (e.g. genes). This is the convention of the modern classics of statistics (Hastie et al., 2009) and machine learning (Murphy, 2012), of dataframes in both R and Python, and of the established machine learning and statistics packages in Python (statsmodels, scikit-learn).
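To make the orientation concrete, here is a minimal sketch in plain Python. It does not use AnnData; the names `X`, `cell_names`, `gene_names` and all values are made up for illustration. Each row is one observation (cell) and each column is one variable (gene).

```python
# Hypothetical toy data: 3 cells (observations) x 2 genes (variables).
cell_names = ["cell_0", "cell_1", "cell_2"]  # one name per row
gene_names = ["gene_a", "gene_b"]            # one name per column

X = [
    [1.0, 0.0],  # cell_0
    [3.0, 2.0],  # cell_1
    [0.0, 5.0],  # cell_2
]

n_obs, n_vars = len(X), len(X[0])
# The annotations must match the matrix shape along their own axis.
assert n_obs == len(cell_names) and n_vars == len(gene_names)


def row(i):
    """All variables measured for observation i."""
    return X[i]


def column(j):
    """One variable measured across all observations."""
    return [r[j] for r in X]
```

With this layout, `row(1)` gives everything measured in one cell, while `column(1)` gives one gene across all cells.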

To come up with a name for the annotation of rows/samples/observations we voted for .smp (short for .samples). One could also think about .obs or .observations.

Initially I thought that we should stick with .rows, but all statistics and ML algorithms expect an oriented data matrix, and some communities (genomics) use a different convention than the one mentioned above. So we thought it would be less confusing to go with .smp for the annotation of samples.

What do you think?

from anndata.

joshua-gould commented on July 18, 2024


falexwolf commented on July 18, 2024

I only recently realized that in the single-cell community, a 'sample' can mean a 'batch' of single-cell measurements.

So I thought that maybe 'observation' is better. Would you be happier with that? Or what is your suggestion?


joshua-gould commented on July 18, 2024


flying-sheep commented on July 18, 2024

Hi, I think I’ll quickly write up why anything is better than row/col. The best name for the two dimensions is up for discussion, but there are basically two ways to store data:

  1. The most generic way is a tidy/long format. pandas.DataFrame solves this.

  2. More special kinds of data can be optimized for their specific features. We observed that many biological datasets have a specific shape: A rectangular (sparse or dense) numeric matrix of observations/samples/individuals × features/variables, and dense metadata for observations and features.

    The long format would be both wasteful in space and less semantic for the same reason: each row would hold a single value together with all of its feature and sample metadata, e.g. [Expression, GeneID, GeneSymbol, SampleID, BatchID]

AnnData is designed for the second case.
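A rough sketch of the space argument, in plain Python rather than pandas or AnnData. The data, the helper `to_long`, and the variable names are all hypothetical; the column names are the ones from the example above.

```python
# Hypothetical toy data; all names and values are illustrative.
X = [[1.0, 0.0], [3.0, 2.0], [0.0, 5.0]]  # observations x variables

obs = [  # one metadata record per row of X
    {"SampleID": "s0", "BatchID": "b0"},
    {"SampleID": "s1", "BatchID": "b0"},
    {"SampleID": "s2", "BatchID": "b1"},
]
var = [  # one metadata record per column of X
    {"GeneID": "g0", "GeneSymbol": "A"},
    {"GeneID": "g1", "GeneSymbol": "B"},
]


def to_long(X, obs, var):
    """Flatten to one record per matrix entry, repeating all metadata."""
    return [
        {"Expression": X[i][j], **var[j], **obs[i]}
        for i in range(len(obs))
        for j in range(len(var))
    ]


long_rows = to_long(X, obs, var)

# Count the metadata values each layout has to store.
long_metadata_values = sum(len(r) - 1 for r in long_rows)
matrix_metadata_values = sum(len(o) for o in obs) + sum(len(v) for v in var)
```

In the long layout the metadata count grows with the number of matrix entries (n_obs × n_vars), whereas in the matrix-plus-annotations layout it grows only with n_obs + n_vars.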


falexwolf commented on July 18, 2024

OK, we will transition to observations / variables instead of samples / variables. Of course, code will remain backwards compatible, and we will remove the .smp attributes only in some future version. This transition should happen within the next few days.

A bit further in the future, we might also provide an easy way of switching the row/column storage convention of AnnData. As people working with genomic data often store the observations in the columns, one could think of a global switch that allows changing between both conventions... Let's see if this would become too complicated...


falexwolf commented on July 18, 2024

The new versions of anndata and Scanpy use the notion "observations" instead of "samples". This is reflected in the attribute .obs. Of course, .smp will continue to work for some time.
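One common way to keep an old attribute name working during such a transition is a deprecated property alias. The sketch below shows that general pattern only; `AnnotatedMatrix` is a hypothetical stand-in class, and this is not necessarily how anndata implements the `.smp` fallback.

```python
import warnings


class AnnotatedMatrix:
    """Hypothetical stand-in, not the real AnnData class."""

    def __init__(self, obs):
        self.obs = obs  # annotation of observations (rows)

    @property
    def smp(self):
        """Deprecated alias for .obs, kept for backwards compatibility."""
        warnings.warn(
            "`.smp` is deprecated; use `.obs` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.obs

    @smp.setter
    def smp(self, value):
        warnings.warn(
            "`.smp` is deprecated; use `.obs` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.obs = value


# Usage: old code that reads .smp still works, but triggers a warning.
adata = AnnotatedMatrix(obs={"cell_type": ["B", "T"]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = adata.smp
```

Both names refer to the same underlying object, so old and new code see the same annotations until the alias is removed.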


falexwolf commented on July 18, 2024

Please see http://anndata.readthedocs.io and https://scanpy.readthedocs.io.

