
single-cell-best-practices's Introduction

Single-cell best practices

The most recent version of the book is rendered here.

Looking for help and maintainers

We are looking for help in maintaining the book. There are lots of open tasks for both newcomers and seasoned analysts. Please contact us if you are interested in helping out!

Accompanying expert recommendation and citation

This book builds upon our expert recommendation "Best practices for single-cell analysis across modalities": https://www.nature.com/articles/s41576-023-00586-w. If you found the expert recommendation or this book helpful for your research article, please cite it as:

Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet (2023). https://doi.org/10.1038/s41576-023-00586-w

Mission

We are writing a book on current best practices in single-cell analysis, with extensive tutorials and code examples.

Contributing

We would like to invite the community to further improve the tutorial and the teaching material. Please read contributing for further instructions.

In case of questions or problems, please get in touch by posting an issue in this repository.

Adapting the notebooks to other datasets

All notebooks for the various steps can be found in the jupyter book folder together with minimal Conda environments. Alternatively, the notebooks can be downloaded directly from the rendered version.

Acknowledgements

This tutorial would not be possible without the input of all Theislab members and the countless benchmarks and reviews of various single-cell tools by the community.

single-cell-best-practices's People

Contributors

abid-abrar, alitinet, amitfrish, annachristina, cramsuig, danielstrobl, dbdimitrov, dongzehe, dreast, emdann, ilibarra, ivirshup, jdhenaos, johannesostner, lazappi, lisasikkema, luckymd, m0hammadl, mt1022, padix-key, soroorh, toniecrumley, vladimirshitov, weilerp, wisdomadingo, xinyuejohn, xlancelottx, yugeji, zethson, zoepiran

single-cell-best-practices's Issues

GRNs

GRNs
Explore concepts in ATAC-RNA matching (and potentially link to multi-modal data). Define current best practices in gene regulation and regulatory patterns.

Potential starting points:
SCENIC
cellOracle

Datasets:
GRN specific dataset, check for dimensions

Technical Jupyter book requirements

(Technical) requirements engineering:

  • Visual design and functionality of https://inria.github.io/scikit-learn-mooc/ <- https://github.com/INRIA/scikit-learn-mooc
  • Custom domain
  • Trivial command to build the complete book while rerunning all jupyter notebooks
  • Command to selectively build a single chapter
  • Caching already built jupyter notebooks that were not edited. Don't rerun them when building the book!
  • Interactive sessions for specific chapters via Binder, with runnable instances that have sufficient compute power. #1
  • Environment(s) to build the complete book. Identify whether a single environment works for the complete book or whether we need one environment per chapter.
  • Jupyter lab + a couple of extensions (black, [...]) must be part of the environments so that notebooks are formatted cleanly
  • A way of comparing Jupyter Notebook diffs on PRs -> likely just reviewnb
  • A way to build a single PDF from the book.
  • Well written documentation explaining how to build (parts of) the book, exploring the content on your own, [...]
  • Build the book via CI?

Perturbation Analysis

Perturbations

  1. Discern many perturbations (>=5 maybe)
  2. Differentiate between biological and confounded effects
  3. Detect perturbed and non-perturbed cells -> how to remove cells that escaped the perturbations
  4. Visualizing similarities and differences across different perturbations
  5. Applications of tools like scGen to predict the effect of unseen perturbations

Feature Selection

Feature selection for marker genes
Provide updated best practices for feature selection for marker genes and introduce feature selection methods for multi-sample scRNA-seq data.

Newer methods for DE gene detection combat p-value inflation by learning a latent space, clustering, and identifying marker genes in one go.

Potential starting points:
(maybe) Jan Hasenauer's Wilcoxon test method (no separate publication for this method; check/discuss whether it is worth adding)
Plus: marker gene methods which aren’t based on differential expression (check for benchmarks)

https://www.biorxiv.org/content/10.1101/2022.05.09.490241v1

Preamble chapter

We should write a preamble chapter.

Possible content:

  1. What this book is.
  2. Who the book is aimed at and who is not the target audience
  3. People and experts who contributed to this book
  4. Links to both papers
  5. How to use this book -> the "traffic light" system and more

We need to ensure that it does not overlap with the introduction chapter.

Prior art

  • References to other existing single-cell books and tutorials
  • References to best-practice papers and books (aka our earlier work)
  • [...]

Differential gene expression

Dealing with conditions Introduction

Dealing with conditions

Dataset

https://singlecell.broadinstitute.org/single_cell/study/SCP548/an-immune-cell-signature-of-bacterial-sepsis-patient-pbmcs?cluster=All%20Cells&spatialGroups=--&annotation=Cohort--group--study&subsample=100000#study-visualize

https://www.nature.com/articles/s41591-020-0752-4

  • UTI = urinary-tract infection
  • Leuk-UTI = UTI with leukocytes, but no organ dysfunction
  • Int-URO = UTI with mild or transient organ dysfunction
  • URO = UTI with clear or persistent organ dysfunction
  • Bac-SEP = bacteremic individuals with sepsis in hospitals

Figure

  • From top to bottom with arrows
  • Start with dataset with several perturbations/conditions
  • Split into the two major schools of thought -> DE (volcano plot) and perturbation analysis (something like this? [image])
  • Split each of the two major schools of thought up again:
  1. DE: pseudobulk vs cell-wise
  2. Perturbation analysis: scGen aka DNN (VAE) approaches vs something else (Mixscape style)?
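
The pseudobulk branch of the figure can be illustrated with a minimal sketch. It assumes a dense count matrix given as a list of per-cell rows, plus per-cell sample and cell type labels; the function name `pseudobulk` and the toy labels are hypothetical and chosen only for illustration. Counts are summed per (sample, cell type) pair, so that samples rather than individual cells act as replicates in the downstream DE test.

```python
def pseudobulk(counts, samples, cell_types):
    """Sum raw counts over all cells sharing a (sample, cell type) pair.

    counts: list of rows, one row per cell and one column per gene.
    samples, cell_types: per-cell labels, same length as counts.
    Returns a dict mapping (sample, cell_type) to a summed count vector.
    """
    profiles = {}
    for row, sample, cell_type in zip(counts, samples, cell_types):
        key = (sample, cell_type)
        if key not in profiles:
            profiles[key] = [0] * len(row)
        # Element-wise addition of this cell's counts to the running sum.
        profiles[key] = [total + c for total, c in zip(profiles[key], row)]
    return profiles


# Toy data: four cells, three genes, two samples (labels are made up).
counts = [[1, 0, 2], [3, 1, 0], [0, 5, 1], [2, 2, 2]]
samples = ["ctrl_1", "ctrl_1", "sepsis_1", "sepsis_1"]
cell_types = ["T", "T", "T", "B"]
profiles = pseudobulk(counts, samples, cell_types)
```

In the cell-wise branch, by contrast, each row of `counts` would enter the test as its own observation.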

Experts

Yuge

Interoperability

Discusses how to transition from the R to the Python ecosystem and vice versa.
Introduces packages like anndata2ri, zellkonverter, Seurat functions, AnndataR.

Outlook

  • Living book -> will change
  • Open Problems will benchmark methods and we will update the book accordingly
    -(End-to-end pipelines)
  • Map onto reference which include low quality stuff as well
  • Filter low quality stuff out on the fly
  • Embed based on references
  • Technical method outlook
  • Longitudinal data modelling

Interactive visualization with cellxgene

Adding this issue as a reminder for the interactive visualization section, which requires the following steps:

  • finalization of preprocessing section #10
  • host our preprocessed example dataset on the cellxgene data portal
  • add the Chan Zuckerberg lab to the acknowledgements in the paper
  • embed a link to the cellxgene data portal in the tutorial

Introduction chapter

1.0 Prior art (bioconductor book, etc.)
1.1 Experimental data collection
1.2 Raw data processing
1.3 Data infrastructure (maybe as separate chapter)
1.4 Interoperability (anndata2ri, zellkonverter, Seurat functions, AnndataR)

Details to follow.

Environment Party

As discussed with @le-ander

  • Based on @le-ander's existing container, but our own container for the book. Eventually copy the container source into our repository
  • Python 3.8
  • Scanpy 1.8.2
  • R 4.1
  • Maybe also pin Pandas to 1.2.5. Not sure anymore, feel free to ignore for now.
  • The rest of the packages that you already have in the existing container
  • jupyter-book
  • https://jupyterlab-code-formatter.readthedocs.io/en/latest/ <- configured with black (if we can pre configure this)

@Zethson will provide a book to start from scratch with in 2 weeks.

Variant calling

Variant calling & SNP analysis

Write new Docker container

Ideally we could use one environment per notebook, but this may not be feasible. I shall think of approaches to do this and evaluate them briefly. This would involve partially executing the scripts.

We should start out by developing a new R >= 4 based container (look at @le-ander's work) and trying to run and integrate @LuckyMD's old notebook first. Ideally I could also test notebook-to-script conversion (and vice versa) here. While doing so I shall ensure that as many dependencies as possible are defined in the Mamba environment file.

Cell type annotation

  • How to best annotate cells.
  • How to get good markers
  • Tips on color scale when plotting UMAPs (red on white instead of blue on green)
  • Automated cell type annotation
  • Dendrograms to QC cell type annotations
  • [...]

Includes:

Feature selection and Manual annotation
Automated cell type annotation
Reference mapping

Compositional analysis

Cell type composition analysis is difficult to write best practices for because there are no reviews or benchmarks for it yet. Therefore, we should focus on which tools are available, give one or two comments on their strengths and weaknesses, and describe what should be kept in mind when conducting cell type composition analysis.
Potential starting points:

Experts: Maren
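
As one point to keep in mind for compositional analysis, the sketch below computes per-sample cell type proportions from per-cell labels. It is a minimal illustration, not one of the tools to be reviewed; the function name `cell_type_proportions` and the toy labels are hypothetical. Because the proportions within a sample sum to one, an increase in one cell type necessarily lowers the proportions of the others, so cell types cannot be tested as if they were independent.

```python
from collections import Counter, defaultdict


def cell_type_proportions(samples, cell_types):
    """Return {sample: {cell_type: proportion}} from per-cell labels."""
    counts = defaultdict(Counter)
    for sample, cell_type in zip(samples, cell_types):
        counts[sample][cell_type] += 1
    proportions = {}
    for sample, counter in counts.items():
        total = sum(counter.values())
        # Proportions within one sample always sum to one.
        proportions[sample] = {ct: n / total for ct, n in counter.items()}
    return proportions


# Toy labels for six cells from two samples (made up for illustration).
samples = ["a", "a", "a", "a", "b", "b"]
cell_types = ["T", "T", "T", "B", "T", "B"]
proportions = cell_type_proportions(samples, cell_types)
```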

toc - best practices 2.0 book

  0. Preamble

  1. Intro
    1.0 Prior art (bioconductor book, etc.) (LH)
    1.1 Experimental data collection (LH)
    1.2 Raw data processing (AS)
    1.3 AnnData & Scanpy intro for newbies (LH)
    1.4 Data infrastructure (maybe as separate chapter) (AS)
    1.5 Interoperability (anndata2ri, zellkonverter, Seurat functions, AnndataR, https://github.com/cellgeni/sceasy) (AS)

X. Datasets

  2. Preprocessing and visualization
    2.1 QC
    2.2 Normalization
    2.3 Feature selection
    2.4 Dimensionality reduction
    2.5 Advanced: Ambient RNA
    2.6 Advanced: single-cell vs single nuclei
    2.7 Advanced: Interactive visualization

  3. Dealing with batch effects
    3.1 Batch-aware feature selection
    3.2 Data integration
    3.3 Evaluation

  4. Interpreting cellular structure
    4.1 Clustering
    4.2 Annotation
    4.2.1 Feature selection and Manual annotation
    4.2.2 Automated cell type annotation
    4.2.3 Reference mapping
    4.3 Trajectory inference
    4.4 RNA velocity
    4.5 Fate mapping

  5. Dealing with conditions
    5.1 DE testing
    5.2 Compositional analysis
    5.3 Perturbation modeling
    5.4 Pathway analysis

  6. Modeling mechanisms
    6.1 GRNs
    6.2 Cell-cell communication
    6.3 Variant calling (?)

  7. Deconvolution
    7.1 Bulk deconvolution
    7.2 Donor deconvolution

  8. Spatial omics

  9. Multi-omics

  10. Outlook

Spatial omics

Optional topic.

Refer to reviews of spatial omics methods and give an overview of the analysis pipeline. Potentially, discuss benchmarking of spot deconvolution approaches in Visium data and segmentation algorithms. Depending on scope, we might want to discuss the different spatial transcriptomics technologies.

Potential starting points:
Tangram (for reference-based deconvolution)
Cellpose (segmentation)
StarDist (segmentation)

Theislab experts:
Giovanni, Hannah, Leon

Spatial omics content

  • Introduction - scope 0.5-1 page
    • look into reviews
    • broad overview into the field
    • mention Nature Methods "method of the year 2020"
    • define axis: targeted/high-resolution vs unbiased/cell-aggregates

  • Technologies (depends heavily on what we show in the tutorial) - scope 0.5 page per method
    • ST, Visium
    • MERFISH
    • IMC (Imaging Mass Cytometry)

  • tools (plus tutorial) - check current reviews - 1 page overall
    • squidpy
    • spectreMAP
    • giotto
    • ??

  • spot deconvolution (show two methods and tutorials for this) - 0.5 page
    • RCTD
    • SpotLight
    • Stereoscope
    • DestVI
    • cell2location

  • Identification of cells from mRNA spots - 0.5 page
    • Baysor
    • SSAM
    • Sparcle
    • ClusterMap

  • other (check where to put those, if even include)
    • tangram
    • cell type annotation
    • imputation
    • downstream tools
    • ncem

Cell Hashing

This issue tracks all things related to cell hashing. This will be an advanced topic.

Content

Dataset

Experts

New tutorial for data integration & automated cell type annotation

Data integration & automated cell type annotation
Data integration methods have now been tested extensively and there is (as always) no clear winner. However, some methods are better at certain tasks or problem sizes, and we should emphasize and discuss this.
Dataset: Haber et al 2018 + new mouse dataset from Malte

Cell type annotation can be conducted in either an automated or an expert-driven fashion. Automated cell type annotation is improving over time as the number of annotated datasets and marker genes grows. In particular, mapping against reference atlas datasets is becoming increasingly relevant.

Data infrastructure

Answers questions like

  1. How to get data
  2. Where to get data from
  3. Sfaira
  4. ....

Quality control including Ambient RNA & doublet detection

QC
Current best practices suggest cell QC based on three QC covariates: count depth, number of genes per barcode, and fraction of counts from mitochondrial genes per barcode. We want to provide an update on whether these QC measures are still state of the art or whether there are QC tools that now serve as best practices.
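
The three QC covariates can be computed directly from a raw count matrix. The sketch below is a dependency-free illustration, assuming a dense matrix given as a list of per-barcode rows and mitochondrial genes identifiable by an `MT-` name prefix (matched case-insensitively); the function name `qc_covariates` and the toy data are hypothetical.

```python
def qc_covariates(counts, gene_names, mito_prefix="MT-"):
    """Compute count depth, genes per barcode and mitochondrial fraction.

    counts: list of rows, one row per barcode and one column per gene.
    gene_names: gene symbols, one per column.
    mito_prefix: assumed naming convention for mitochondrial genes.
    """
    mito_cols = [
        j for j, gene in enumerate(gene_names)
        if gene.upper().startswith(mito_prefix)
    ]
    metrics = []
    for row in counts:
        count_depth = sum(row)
        n_genes = sum(1 for c in row if c > 0)
        mito_counts = sum(row[j] for j in mito_cols)
        # Empty barcodes get a fraction of 0 to avoid dividing by zero.
        mito_fraction = mito_counts / count_depth if count_depth > 0 else 0.0
        metrics.append({
            "count_depth": count_depth,
            "n_genes": n_genes,
            "mito_fraction": mito_fraction,
        })
    return metrics


# Toy data: three barcodes, four genes (two of them mitochondrial).
gene_names = ["MT-CO1", "ACTB", "CD3E", "MT-ND1"]
counts = [
    [2, 10, 5, 3],
    [40, 1, 0, 9],
    [0, 0, 0, 0],
]
metrics = qc_covariates(counts, gene_names)
```

In the toy data the second barcode has almost all of its counts in mitochondrial genes, which is the kind of barcode these covariates are meant to flag. Which thresholds to apply is left open here, since that is exactly the question this section wants to revisit.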

Potential starting points:
goals and approaches
miQC
benchmarks QC, normalisation

Doublet detection & ambient RNA
Benchmark doublet detection

Cell cell communication

Cell-cell communication
Define current best practices in cell-cell communication, covering both communication between cells and the prediction of ligand–target links between interacting cells. (Potential link to spatial omics.)

Potential starting points:
NicheNet
Cellchat
cellphoneDB
omnipath

Normalization

Normalization
BP1.0 suggested scran for normalization of non-full-length datasets. Scaling to zero mean and unit variance was not preferred. Normalized data should be log(x+1) transformed to obtain approximately normally distributed data. (Side note: scVI etc. use raw counts for integration; mention this here as well.)
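
The scale-then-log(x+1) step can be sketched as follows. Note that this is not scran: scran estimates size factors by pooling cells, whereas the sketch swaps in plain library-size factors to stay short. The function name `log1p_normalize` and the default target (mean count depth of non-empty cells) are assumptions made for illustration.

```python
import math


def log1p_normalize(counts, target_sum=None):
    """Scale each cell by a library-size factor, then apply log(x + 1).

    counts: list of rows, one row per cell and one column per gene.
    target_sum: count depth every cell is scaled to (assumed default:
    mean count depth of all non-empty cells).
    """
    totals = [sum(row) for row in counts]
    nonzero = [t for t in totals if t > 0]
    if target_sum is None:
        target_sum = sum(nonzero) / len(nonzero) if nonzero else 1.0
    normalized = []
    for row, total in zip(counts, totals):
        # Empty cells keep a size factor of 1 to avoid dividing by zero.
        size_factor = total / target_sum if total > 0 else 1.0
        normalized.append([math.log1p(c / size_factor) for c in row])
    return normalized


# Two cells with the same expression profile but a 10x depth difference.
counts = [[1, 3], [10, 30]]
normalized = log1p_normalize(counts)
```

After normalization both toy cells have (up to floating-point error) identical values, since they differ only in sequencing depth.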

Newer approaches recommend using GLM-PCA for non-normal distributions, Pearson residuals from regularized negative binomial regression, and analytic Pearson residuals. Their authors claim that these methods outperform other normalization methods in downstream analysis tasks such as identifying biologically variable genes or dimensionality reduction. The goal is to provide an overview of these methods and check for potential independent benchmarks.
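
To make the analytic Pearson residuals idea concrete, here is a minimal, dependency-free sketch. It assumes a null model in which the expected count of gene j in cell i is (total of cell i × total of gene j) / grand total, with a negative binomial variance of mu + mu²/theta. The fixed overdispersion `theta=100` and the clipping of residuals to ±sqrt(number of cells) are assumptions of this sketch, as is the function name.

```python
import math


def analytic_pearson_residuals(counts, theta=100.0):
    """Pearson residuals of raw counts under a negative binomial null model.

    counts: list of rows, one row per cell and one column per gene.
    theta: assumed fixed overdispersion shared by all genes.
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[j] for row in counts) for j in range(n_genes)]
    grand_total = sum(cell_totals)
    if grand_total == 0:
        return [[0.0] * n_genes for _ in range(n_cells)]
    clip = math.sqrt(n_cells)  # assumed clipping threshold
    residuals = []
    for i, row in enumerate(counts):
        res_row = []
        for j, x in enumerate(row):
            # Expected count if the gene's share were the same in every cell.
            mu = cell_totals[i] * gene_totals[j] / grand_total
            if mu == 0:
                res_row.append(0.0)
                continue
            z = (x - mu) / math.sqrt(mu + mu * mu / theta)
            res_row.append(max(-clip, min(clip, z)))
        residuals.append(res_row)
    return residuals


# Two cells with opposite expression of two genes.
residuals = analytic_pearson_residuals([[0, 4], [4, 0]])
```

Genes whose counts are fully explained by sequencing depth get residuals near zero, while genes that deviate from the null model get large positive or negative residuals, which is why such residuals are also used to pick biologically variable genes.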

Potential starting points:
GLM-PCA
scTransform
statistical error models for scRNA-seq
Pearson residuals

Design Chapter content

We will split up the current notebook into several chapters and add more chapters later. However, for this to become easily readable we should add a few optional introductory chapters, like those of https://www.singlecellcourse.org/.
We should also emphasize that, contrary to their book, our content is NOT biased by a lab, but guided by best practices.

All chapters should have an overview.
All subchapters should have something like:

  1. Overview/Introduction
  2. Content
  3. Quiz
  4. Solutions
  5. Main takeaways

This is STC

Multiomics

  • The multi-omics paper by Fabi & Carlos is the paper to cite
  • Multi-omics best practices come later, in phase 2
