theislab / single-cell-best-practices

Home Page: https://www.sc-best-practices.org

License: Other

book rna-seq single-cell tutorial

single-cell-best-practices's Introduction

Single-cell best practices

The most recent version of the book is rendered at https://www.sc-best-practices.org.

Looking for help and maintainers

We are looking for help in maintaining the book. There are lots of open tasks for both newcomers and seasoned analysts. Please contact us if you are interested in helping out!

Accompanying expert recommendation and citation

This book builds upon our expert recommendation "Best practices for single-cell analysis across modalities": https://www.nature.com/articles/s41576-023-00586-w. If you found the expert recommendation or this book helpful for your research article, please cite it as:

Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet (2023). https://doi.org/10.1038/s41576-023-00586-w

Mission

We are writing a book on the current single-cell analysis best-practices with extensive tutorials and code examples.

Contributing

We would like to invite the community to further improve the tutorial and the teaching material. Please read the contributing guide for further instructions.

In case of questions or problems, please get in touch by posting an issue in this repository.

Adapting the notebooks to other datasets

All notebooks for the various steps can be found in the jupyter book folder together with minimal Conda environments. Alternatively, the notebooks can be downloaded directly from the rendered version.

Acknowledgements

This tutorial would not be possible without the input of all Theislab members and the countless benchmarks and reviews of various single-cell tools by the community.

single-cell-best-practices's Issues

Donor deconvolution

Donor inference

Vireo
Souporcell
scsplit
Genetic demultiplexing of pooled single-cell RNA-sequencing samples in cancer facilitates effective experimental design + code: https://github.com/lmweber/snp-dmx-cancer
Can mention demuxlet, but it requires additional information

Dataset: Maybe the Souporcell test dataset mentioned in their README? Would restrict our notebook to Souporcell because Souporcell has a different input than e.g. Vireo.

Advanced: Interactive Visualization

Advanced: Interactive visualization

host our preprocessed example dataset on the cellxgene data portal
add the Chan Zuckerberg lab to the acknowledgements in the paper
embed a link to the cellxgene data portal in the tutorial

Experimental validation

TODO

GRNs

GRNs
Explore concepts in ATAC-RNA matching (and potentially link to multimodal data). Define current best practices in gene regulation and regulatory patterns.

Potential starting points:
SCENIC
cellOracle

Datasets:
GRN specific dataset, check for dimensions

Trajectory inference

PAGA and much much more

Investigate how to generate a single PDF from the book

https://jupyterbook.org/advanced/pdf.html

Solely for fun

Preprocessing & visualization introduction

Provide an updated version of current best-practices in Quality Control, data normalization, …

Technical Jupyter book requirements

(Technical) requirements engineering:

Fate Mapping

cellrank
Palantir
trajectoryNet
Schiebinger et al optimal transport

Perturbation Analysis

Perturbations

Tools: https://www.scrna-tools.org/tools?sort=name&cats=Perturbations
Quantifying the effect of experimental perturbations at single-cell resolution
scGen
scMAGeCK links genotypes with multiple phenotypes in single-cell CRISPR screens <- really unsure about this one
Conditional out-of-distribution generation for unpaired data using transfer VAE
CPA
Comprehensive benchmarking of single cell RNA sequencing technologies for characterizing cellular perturbation <- this is from an experimental point of view!
Machine learning for perturbational single-cell omics

Discern many perturbations (>=5 maybe)
Differentiate between biological and confounded effects
Detect perturbed and non-perturbed cells -> how to remove cells that escaped the perturbations
Visualizing similarities and differences across different perturbations
Applications of tools like ScGen to predict the effect of unseen perturbations

Feature Selection

Feature selection for marker genes
Provide updated best practices for feature selection for marker genes and introduce feature selection methods for multi-sample scRNA-seq data.

Newer methods for DE gene detection combat p-value inflation by learning latent space, clustering, and learning marker genes in one go.

Potential starting points:
(maybe) Jan Hasenauer's Wilcoxon test method (no separate publication for this method, check/discuss if worth adding)
Plus: marker gene methods which aren’t based on differential expression (check for benchmarks)

https://www.biorxiv.org/content/10.1101/2022.05.09.490241v1

New tutorial for transcription factor and pathway analysis

A few recent benchmarks on transcription factor and pathway analysis came out. It might be time to update what we mentioned in v1.

Potential starting points:

Preamble chapter

We should write a preamble chapter.

Possible content:

What this book is.
Who the book is aimed at and who is not the target audience
People and experts who contributed to this book
Links to both papers
How to use this book -> the "traffic light" system and more

We need to ensure that it does not overlap with the introduction chapter.

Prior art

References to other existing single-cell books and tutorials
References to best-practice papers and books (aka our earlier work)
[...]

Examine quantecon caching solution

https://github.com/QuantEcon

Pointed out by Isaac. May help to make this efficient.

Differential gene expression

Differential Gene Expression

Potential topics for updated best practices are DE test on conditions and mixture/covariate models

Potential starting points:

Muscat
MAST
limma
Comparison of methods to detect differentially expressed genes between single-cell populations
Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data
aggregateBioVar
Confronting false discoveries in single-cell differential expression
A practical solution to pseudoreplication bias in single-cell studies
Reproducibility of Methods to Detect Differentially Expressed Genes from Single-Cell RNA Sequencing
A Comprehensive Survey of Statistical Approaches for Differential Expression Analysis in Single-Cell RNA Sequencing Studies
The paper which contests many of the results: https://www.biorxiv.org/content/10.1101/2022.02.16.480662v1
A rant about why pseudobulk methods are good and Zimmermann et al. is wrong: https://www.biorxiv.org/content/10.1101/2022.02.16.480517v1

Experts

Mark Robinson
John Marioni

Enable Binder

Host own Binder service https://binderhub.readthedocs.io/en/latest/ and then point Jupyter Book to that BinderHub's URL

Pathway analysis

TODO

Experts

Julio Saez-Rodriguez

Dealing with conditions Introduction

Dealing with conditions

Dataset

https://singlecell.broadinstitute.org/single_cell/study/SCP548/an-immune-cell-signature-of-bacterial-sepsis-patient-pbmcs?cluster=All%20Cells&spatialGroups=--&annotation=Cohort--group--study&subsample=100000#study-visualize

https://www.nature.com/articles/s41591-020-0752-4

UTI = urinary-tract infection
Leuk-UTI = UTI with leukocytes, but no organ dysfunction
Int-URO = UTI with mild or transient organ dysfunction
URO = UTI with clear or persistent organ dysfunction
Bac-SEP = bacteremic individuals with sepsis in hospitals

Figure

From top to bottom with arrows
Start with dataset with several perturbations/conditions
Split into the two major schools of thought -> DE (volcano plot) and perturbation analysis (something like this? [image not preserved])
Split each of the two schools of thought up again:

DE: pseudobulk vs cell-wise
Perturbation analysis: scGen aka DNN (VAE) approaches vs something else (Mixscape style)?
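
For the pseudobulk branch of the figure, the aggregation step can be sketched roughly as below. This is only a minimal illustration and not code from the book: the function name `pseudobulk`, the nested-list count matrix and the choice to group by (donor, cell type) are assumptions made for this example. The resulting summed profiles would then be passed to a bulk DE method; that step is not shown.

```python
def pseudobulk(counts, donors, cell_types):
    """Sum raw counts over all cells that share a (donor, cell type) pair.

    counts     -- list of per-cell lists of raw gene counts (hypothetical layout)
    donors     -- donor label for each cell
    cell_types -- cell type label for each cell
    Returns a dict mapping (donor, cell_type) to a summed count vector.
    """
    sums = {}
    for row, donor, cell_type in zip(counts, donors, cell_types):
        key = (donor, cell_type)
        if key not in sums:
            sums[key] = [0] * len(row)
        # Element-wise addition of this cell's counts to the group total.
        sums[key] = [a + b for a, b in zip(sums[key], row)]
    return sums


example = pseudobulk(
    counts=[[1, 2], [3, 4], [5, 6]],
    donors=["d1", "d1", "d2"],
    cell_types=["T", "T", "T"],
)
```

Each donor then contributes one profile per cell type, so the donor rather than the individual cell becomes the unit of replication.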

Experts

Yuge

Interoperability

Discusses how to transition from the R to the Python ecosystem and vice versa.
Introduces packages like anndata2ri, zellkonverter, Seurat functions, AnndataR

Outlook

Living book -> will change
Open problems will benchmark stuff and we will update
-(End-to-end pipelines)
Map onto reference which include low quality stuff as well
Filter low quality stuff out on the fly
Embed based on references
Technical method outlook
Longitudinal data modelling

Clustering

Clustering
BP1.0 included Louvain clustering as best practice. New best practices suggest using Leiden as the clustering method for scRNA-seq data. We want to keep this short.

Potential starting points:
Leiden
Leiden issue on Scanpy
Benchmark (2019-12)
Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data

New tutorial for RNA velocity modelling & fate mapping

Interactive visualization with cellxgene

Adding this issue as a reminder for the interactive visualization section, which requires the following steps:

finalization of preprocessing section #10
host our preprocessed example dataset on the cellxgene data portal
add the Chan Zuckerberg lab to the acknowledgements in the paper
embed a link to the cellxgene data portal in the tutorial

Computational validation

TODO

Introduction chapter

1.0 Prior art (bioconductor book, etc.)
1.1 Experimental data collection
1.2 Raw data processing
1.3 Data infrastructure (maybe as separate chapter)
1.4 Interoperability (anndata2ri, zellkonverter, Seurat functions, AnndataR)

Details to follow.

Environment Party

As discussed with @le-ander

Based on @le-ander's existing container, but our own container for the book. Eventually copy the container source into our repository
Python 3.8
Scanpy 1.8.2
R 4.1
Maybe also pin Pandas to 1.2.5. Not sure anymore, feel free to ignore for now.
The rest of the packages that you already have in the existing container
jupyter-book
https://jupyterlab-code-formatter.readthedocs.io/en/latest/ <- configured with black (if we can pre configure this)
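
One way to verify that a built container actually matches the Python-side pins listed above is sketched below. This is an illustrative helper and not part of the repository: the function name `check_pins` and the PyPI distribution names are assumptions, and the pandas pin is included even though the list above marks it as tentative.

```python
from importlib import metadata

# Pins copied from the list above; distribution names are assumed to match PyPI.
PINS = {
    "scanpy": "1.8.2",
    "pandas": "1.2.5",  # marked as optional in the list above
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package that deviates from its pin.

    `found` is None when the package is not installed at all.
    `get_version` is injectable so the check can be tested without installing anything.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `check_pins(PINS)` inside the container returns an empty dict when every pinned package is installed at the expected version.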

@Zethson will provide a book to start from scratch with in 2 weeks.

Dimensionality reduction

Dimensionality reduction
TODO

Variant calling

Variant calling & SNP analysis

Maybe a bit tangential, but interesting nevertheless: Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action
Good eQTL intro that has no relevance here, but helps @Zethson: https://elifesciences.org/articles/52155
Optimizing expression quantitative trait locus mapping workflows for single-cell studies
Red panda: a novel method for detecting variants in single-cell RNA sequencing <- a method that focuses on VC using scRNA seq data. Compares to the "usual" WGS and WES VC

Modelling mechanisms introduction

Write introduction TODO

Write new Docker container

Ideally we could use one environment per notebook, but this may not be feasible. I shall think of approaches to do this and evaluate them briefly. This would involve partially executing the scripts.

We should start out by developing a new R >= 4 based container (look at @le-ander's work) and trying to run and integrate @LuckyMD's old notebook first. Ideally I could also test the notebook-to-script conversion, and vice versa, here. While doing so I shall ensure that as many dependencies as possible are defined in the Mamba environment file.

Cell type annotation

How to best annotate cells.
How to get good markers
Tips on color scale when plotting UMAPs (red on white instead of blue on green)
Automated cell type annotation
Dendrograms to QC cell type annotations
[...]

Includes:

Feature selection and Manual annotation
Automated cell type annotation
Reference mapping

Compositional analysis

Cell type composition analysis is difficult to write best-practices for because there are no reviews or benchmarks yet for it. Therefore, we should focus on which tools are available, 1-2 comments on strengths and weaknesses and what should be kept in mind when conducting cell type composition analysis.
Potential starting points:

Experts: Maren

toc - best practices 2.0 book

Preamble
Intro
1.0 Prior art (bioconductor book, etc.) (LH)
1.1 Experimental data collection (LH)
1.2 Raw data processing (AS)
1.3 AnnData & Scanpy intro for newbies (LH)
1.4 Data infrastructure (maybe as separate chapter) (AS)
1.5 Interoperability (anndata2ri, zellkonverter, Seurat functions, AnndataR, https://github.com/cellgeni/sceasy) (AS)

X. Datasets

Preprocessing and visualization
2.1 QC
2.2 Normalization
2.3 Feature selection
2.4 Dimensionality reduction
2.5 Advanced: Ambient RNA
2.6 Advanced: single-cell vs single nuclei
2.7 Advanced: Interactive visualization
Dealing with batch effects
3.1 Batch-aware feature selection
3.2 Data integration
3.3 Evaluation
Interpreting cellular structure
4.1 Clustering
4.2 Annotation
4.2.1 Feature selection and Manual annotation
4.2.2 Automated cell type annotation
4.2.3 Reference mapping
4.3 Trajectory inference
4.4 RNA velocity
4.5 Fate mapping
Dealing with conditions
5.1 DE testing
5.2 Compositional analysis
5.3 Perturbation modeling
5.4 Pathway analysis
Modeling mechanisms
6.1 GRNs
6.2 Cell-cell communication
6.3 Variant calling (?)
Deconvolution
7.1 Bulk deconvolution
7.2 Donor deconvolution
Spatial omics
Multi-omics
Outlook

Spatial omics

Optional topic.

Refer to reviews of spatial omics methods and give an overview of the analysis pipeline. Potentially, discuss benchmarking of spot deconvolution approaches in Visium data and segmentation algorithms. Depending on scope, we might want to discuss the different spatial transcriptomics technologies.

Potential starting points:
Tangram (for reference-based deconvolution)
Cellpose (segmentation)
StarDist (segmentation)

Theislab experts:
Giovanni, Hannah, Leon

bulk deconvolution

Bulk deconvolution is still popular to extract as much knowledge as possible from bulk data. Many or even too many methods exist, but there have been some benchmarks and evaluations discussing them. Let us give some pointers.
Potential starting points:

Experts: Amit & Hana

Spatial omics content

Introduction - scope 0.5-1 page
look into reviews
broad overview into the field
mention Nature Methods "Method of the Year 2020"
define axis: targeted/high-resolution vs unbiased/cell-aggregates
Technologies (depends heavily on what we show in the tutorial) - scope 0.5 page per method
ST, visium
MERFISH
IMC (Imaging Mass Cytometry)
tools (plus tutorial) - check current reviews - 1 page overall
squidpy
spectreMAP
giotto
??
spot deconvolution (show two methods and tutorials for this) - 0.5 page
RCTD
SpotLight
Stereoscope
DestVI
cell2location
Identification of Cells from mRNA spot - 0.5 page
Baysor
SSAM
Sparcle
ClusterMap
other (check where to put those, if even include)
tangram
cell type annotation
imputation
downstream tools
ncem

Raw data processing

Should explain how to get from base calling signals up to count tables.

Cell Hashing

This issue tracks all things related to cell hashing. This will be an advanced topic.

Content

Dataset

Experts

Dealing with batch effects

Content of the dealing with batch effects chapter.

Please edit this issue if you want to @LuckyMD @lazappi

Would also be good if you could add a few details to your chapter here: #9

New tutorial for data integration & automated cell type annotation

Data integration & automated cell type annotation
Data integration methods have now been tested extensively and there is (as always) no clear winner. However, some methods are better at certain tasks or problem sizes, and we should emphasize and discuss that.
Dataset: Haber et al 2018 + new mouse dataset from Malte

Cell type annotation can be conducted in either an automated or an expert-driven fashion. Automated cell type annotation is improving over time as the number of annotated datasets and marker genes grows. In particular, mapping against reference atlas datasets is becoming more and more interesting.

A comparison of automatic cell identification methods for single-cell RNA sequencing data
Automated methods for cell type annotation on scRNA-seq data <- overview of available methods
Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods
Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data <- R only eww
Mapping single-cell data to reference atlases by transfer learning <- not sure about this one here; LH will revisit
Lisa's atlas paper?

Data infrastructure

Answers questions like

How to get data
Where to get data from
Sfaira
....

Quality control including Ambient RNA & doublet detection

QC
Current best practices suggest cell QC based on three QC covariates: count depth, number of genes per barcode, and fraction of counts from mitochondrial genes per barcode. We want to provide an update on whether these QC measures are still state of the art or whether there are QC tools that now serve as best practice.
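
As a rough illustration of the three covariates named above, the following minimal sketch computes them from a plain count matrix. It is not taken from the book: the function name `qc_metrics`, the nested-list input and the `MT-` gene-symbol prefix used to recognise mitochondrial genes are assumptions made for this example.

```python
def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Compute per-barcode QC covariates from raw counts.

    counts     -- list of per-barcode lists of raw gene counts (hypothetical layout)
    gene_names -- gene symbol for each column of `counts`
    mito_prefix -- assumed naming convention for mitochondrial genes
    """
    # Compare in upper case so that e.g. mouse-style "mt-" symbols also match.
    mito_idx = [
        i for i, gene in enumerate(gene_names)
        if gene.upper().startswith(mito_prefix.upper())
    ]
    metrics = []
    for row in counts:
        count_depth = sum(row)
        n_genes = sum(1 for c in row if c > 0)
        mito_counts = sum(row[i] for i in mito_idx)
        # Empty barcodes get a mitochondrial fraction of 0 instead of dividing by zero.
        mito_fraction = mito_counts / count_depth if count_depth else 0.0
        metrics.append(
            {
                "count_depth": count_depth,
                "n_genes": n_genes,
                "mito_fraction": mito_fraction,
            }
        )
    return metrics


example_qc = qc_metrics(
    counts=[[5, 0, 5], [0, 0, 0]],
    gene_names=["MT-CO1", "ACTB", "GAPDH"],
)
```

Thresholds on these covariates are deliberately left out, since choosing them is exactly the question this issue raises.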

Potential starting points:
goals and approaches
miQC
benchmarks QC, normalisation

Doublet detection & ambient RNA
Benchmark doublet detection

Cell cell communication

Cell-cell communication
Define current best practices in cell-cell communication in terms of communication between cells and prediction of ligand–target links between interacting cells. (Potential link to spatial omics.)

Potential starting points:
NicheNet
Cellchat
cellphoneDB
omnipath

Normalization

Normalization
BP1.0 suggested scran for normalization of non-full-length datasets. Scaling to a zero mean and unit variance was not preferred. Normalized data should be log(x+1) transformed to obtain normally distributed data. (Side note: scVI etc. uses raw counts for integration, mention this here as well.)
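
To make the log(x+1) step concrete, here is a minimal sketch. Note that it uses simple library-size scaling, not scran's pooling-based size factors, so it only illustrates the shape of the transformation; the function name `shifted_log_normalize` and the nested-list input are assumptions for illustration.

```python
import math


def shifted_log_normalize(counts):
    """Scale each cell by a library-size factor, then apply log(x + 1).

    counts -- list of per-cell lists of raw gene counts (hypothetical layout)
    The size factor of a cell is its total count divided by the mean total
    count over all non-empty cells (a stand-in for scran size factors).
    """
    totals = [sum(row) for row in counts]
    nonzero_totals = [t for t in totals if t > 0]
    mean_total = sum(nonzero_totals) / len(nonzero_totals) if nonzero_totals else 1.0
    normalized = []
    for row, total in zip(counts, totals):
        # Empty cells keep a size factor of 1 to avoid division by zero.
        size_factor = total / mean_total if total > 0 else 1.0
        normalized.append([math.log1p(c / size_factor) for c in row])
    return normalized


example_norm = shifted_log_normalize([[1, 1], [3, 3]])
```

In the example both cells have the same relative composition, so they end up with identical normalized values despite a threefold difference in count depth.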

Newer approaches recommend using GLM-PCA for non-normal distributions, Pearson residuals from regularized negative binomial regression, and analytic Pearson residuals. These methods claim to outperform other normalization methods in downstream analysis tasks like identifying biologically variable genes or dimensionality reduction. The goal is to provide an overview of these methods and check for potential independent benchmarks.
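
The analytic Pearson residuals mentioned above can be sketched as follows, assuming a negative binomial null model with one fixed overdispersion parameter theta shared by all genes. The default theta = 100, the clipping at the square root of the number of cells, the function name and the nested-list input are all assumptions of this sketch and should be checked against the Pearson residuals reference before use.

```python
import math


def analytic_pearson_residuals(counts, theta=100.0):
    """Pearson residuals of raw counts under a negative binomial null model.

    counts -- list of per-cell lists of raw gene counts (hypothetical layout)
    theta  -- assumed overdispersion parameter shared by all genes
    Expected count: mu = (cell total * gene total) / grand total.
    Residual: (x - mu) / sqrt(mu + mu^2 / theta), clipped to +/- sqrt(n_cells).
    """
    n_cells = len(counts)
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    if grand_total == 0:
        # No counts at all: every residual is defined as zero here.
        return [[0.0 for _ in row] for row in counts]
    clip = math.sqrt(n_cells)
    residuals = []
    for row, cell_total in zip(counts, cell_totals):
        res_row = []
        for x, gene_total in zip(row, gene_totals):
            mu = cell_total * gene_total / grand_total
            if mu == 0:
                res_row.append(0.0)
                continue
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            res_row.append(max(-clip, min(clip, r)))
        residuals.append(res_row)
    return residuals


example_res = analytic_pearson_residuals([[4, 0], [0, 4]])
```

Genes whose counts match the expectation from sequencing depth alone get residuals near zero, while genes that deviate strongly in a few cells get large residuals.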

Potential starting points:
GLM-PCA
scTransform
statistical error models for scRNA-seq
Pearson residuals

Design Chapter content

We will split up the current notebook into several chapters and add more chapters later. However, for this to become easily readable we should add a few optional introductory chapters like https://www.singlecellcourse.org/.
Should also emphasize that contrary to their book our content is NOT biased by their lab, but by best-practices

All chapters should have an overview.
All subchapters should have something like:

Overview/Introduction
Content
Quiz
Solutions
Main takeaways

This is STC

Experimental data collection

Should explain the complete wetlab workflow up to raw base calling.

A couple of figures will help.

RNA velocity

Potential starting points:
General
Scvelo

Datasets:
Pancreas

Theislab experts:
Philipp, Marius, (Volker)

Interpreting cellular structure Introduction

Interpreting cellular structure

Introduction for this chapter TODO.

Multiomics

Multi omics paper by Fabi & Carlos is the paper to cite
multi omics best practices comes in phase 2 later
