berenslab / umi-normalization

Companion repository to Lause, Berens & Kobak (2021): "Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data", Genome Biology

Home Page: https://doi.org/10.1186/s13059-021-02451-7

License: GNU Affero General Public License v3.0

Jupyter Notebook 99.88% Python 0.12%
scrna single-cell-rna-seq single-cell-analysis umi-count umi-count-matrix normalization glm-pca negative-binomial-regression negative-binomial-model negative-binomial

umi-normalization's Introduction

Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

Jan Lause, Philipp Berens & Dmitry Kobak

How to use this repository

Version 3.0 of this repository contains the code to reproduce the analysis presented in our Genome Biology paper on UMI data normalization (Lause, Berens & Kobak, 2021) and the corresponding preprint (v3). The code used for versions v1 and v2 of the paper is available under the tags 1.0 and 2.0 in this repository.

To start, follow these steps:

  • install the required software listed below
  • clone this repository to your system
  • go to tools.py and adapt the three import paths as needed
  • follow the dataset download instructions below

Then, you can step through our full analysis by simply following the sequence of the notebooks. If you want to reproduce only parts of our analysis, there are six independent analysis pipelines that you can run individually:

  • Reproduction of the NB regression model by Hafemeister & Satija (2019) and investigation of alternative models (Notebooks 01 & 02, producing Figure 1 from our paper)
  • Estimation of technical overdispersion from negative control datasets (Notebooks 01 & 03, producing Figure S1)
  • Benchmarking normalization by analytic Pearson residuals vs. GLM-PCA vs. standard methods:
    • on the 33k PBMC dataset (Notebooks 01, 041, 042, 05, producing Figures 2, S2, S4, S5, and additional figures)
    • on different retinal datasets (Notebooks 06, 07, 081, producing Figures 3, S3, and additional figures)
    • on the ground-truth dataset created from FACS-sorted PBMCs (Notebooks 101 and 102, producing Figures 5 and S7)
  • Analysis of the 2-million cell mouse organogenesis dataset (Notebook 091, producing Figures 4 and S6, and additional figures)
  • Comparison to Sanity (Notebooks 06, 07, 081 and 082 for retina datasets, 091 and 092 for the organogenesis dataset and 101 and 103 for the benchmarking FACS-sorted PBMCs, producing additional figures). These pipelines will require you to run Sanity from the command line; see notebooks 082 and 092 for instructions.

Note that 041 and 101 are R notebooks; the remaining ones are Python notebooks.

Each analysis first preprocesses and filters the datasets. Next, the computationally expensive tasks (NB regression fits, GLM-PCA, t-SNE, simulations of negative control data, etc.) are run and their results are saved to files. For some analyses, this is done in separate notebooks. Finally, the result files are loaded for plotting (again in separate notebooks for some analyses).
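The split between expensive computation and plotting can be pictured as a simple compute-once, load-later pattern. The following is only an illustrative sketch: the helper name `load_or_compute`, the pickle format and the file layout are assumptions made for this example and are not taken from the notebooks, which may store their results differently.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the result stored at `path`; if absent, run `compute()` and store it.

    Hypothetical helper for illustration only (not part of this repository).
    """
    path = Path(path)
    if path.exists():
        # Result file already exists: skip the expensive step.
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    # Create the results folder on first use, then save for later plotting.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A plotting notebook could then call the same helper with the same path and would only trigger the computation if the result file is missing.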

We recommend running the code on a powerful machine with at least 250 GB RAM.

For questions or feedback, feel free to use the issue system or email us.

Pre-requisites

We used the following software environments:

Python
R

The full R environment used was

attached base packages:
parallel  stats4    stats     graphics  grDevices utils     datasets     methods   base     

other attached packages:
MASS_7.3-53.1               sctransform_0.3.2          SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1   
DelayedArray_0.12.3         BiocParallel_1.20.1        matrixStats_0.58.0          Biobase_2.46.0             
GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        IRanges_2.20.2              S4Vectors_0.24.4           
BiocGenerics_0.32.0         glmpca_0.2.0               

loaded via a namespace (and not attached):
tidyselect_1.1.0       listenv_0.8.0          purrr_0.3.4           reshape2_1.4.4         lattice_0.20-41        colorspace_2.0-0      
vctrs_0.3.7            generics_0.1.0         utf8_1.2.1            rlang_0.4.10           pillar_1.6.0           glue_1.4.2            
DBI_1.1.1              GenomeInfoDbData_1.2.2 lifecycle_1.0.0       plyr_1.8.6             stringr_1.4.0          zlibbioc_1.32.0       
munsell_0.5.0          gtable_0.3.0           future_1.21.0         codetools_0.2-18       fansi_0.4.2            Rcpp_1.0.6            
scales_1.1.1           XVector_0.26.0         parallelly_1.24.0     gridExtra_2.3          ggplot2_3.3.3          digest_0.6.27         
stringi_1.5.3          dplyr_1.0.5            grid_3.6.3            tools_3.6.3            bitops_1.0-6           magrittr_2.0.1        
RCurl_1.98-1.3         tibble_3.1.1           crayon_1.4.1          future.apply_1.7.0     pkgconfig_2.0.3        ellipsis_0.3.1        
Matrix_1.3-2           assertthat_0.2.1       R6_2.5.0              globals_0.14.0         compiler_3.6.3  

Download instructions for presented datasets

All accession numbers can also be found in Table S2 of our paper.

33k PBMC dataset
Counts & Annotations
  • visit https://support.10xgenomics.com/single-cell-gene-expression/datasets
  • look for '33k PBMC from a healthy donor' under "Chromium Demonstration (v1 Chemistry)"
  • provide contact details to proceed to downloads
  • download 'Gene / cell matrix (filtered)' (79.23 MB)
  • extract files genes.tsv and matrix.mtx to umi-normalization/datasets/33k_pbmc/
  • download 'Clustering analysis' (23.81 MB) from the same website
  • extract folder analysis to umi-normalization/datasets/33k_pbmc/ as well
10x control / Svensson et al. 2017
inDrop control / Klein et al. 2015
  • visit https://www.ncbi.nlm.nih.gov/geo/
  • search for GSE65525
  • download the *.csv.bz2 file for the sample GSM1599501 (human K562 pure RNA control, 953 samples, 5.1 MB)
  • extract file GSM1599501_K562_pure_RNA.csv to umi-normalization/datasets/indrop/
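If you prefer to script the extraction steps rather than use a desktop archive tool, the following sketch decompresses single-file `.bz2` and `.gz` downloads with the Python standard library. The helper name `decompress` is an assumption made for this example and is not part of the repository; the same function would also cover the `.gz` files of the retina datasets further down.

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map each supported archive suffix to the stdlib opener that reads it.
_OPENERS = {".bz2": bz2.open, ".gz": gzip.open}


def decompress(archive, out_dir=None):
    """Decompress a single-file .bz2 or .gz archive and return the output path.

    Hypothetical helper for illustration only (not part of this repository).
    """
    archive = Path(archive)
    opener = _OPENERS.get(archive.suffix)
    if opener is None:
        raise ValueError(f"unsupported archive type: {archive.name}")
    out_dir = Path(out_dir) if out_dir is not None else archive.parent
    out_dir.mkdir(parents=True, exist_ok=True)
    # Dropping the last suffix gives the target name, e.g. foo.csv.bz2 -> foo.csv
    target = out_dir / archive.stem
    with opener(archive, "rb") as src, target.open("wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, no full load into RAM
    return target
```

For the inDrop control, and assuming the download is named `GSM1599501_K562_pure_RNA.csv.bz2`, a call such as `decompress("GSM1599501_K562_pure_RNA.csv.bz2", "umi-normalization/datasets/indrop/")` would write `GSM1599501_K562_pure_RNA.csv` to the expected folder.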
MicrowellSeq control / Han et al. 2018
  • visit https://www.ncbi.nlm.nih.gov/geo/
  • search for GSE108097
  • search for sample GSM2906413 and download GSM2906413_EmbryonicStemCell_dge.txt.gz (EmbryonicStemCell.E14, 7.9 MB)
  • save to umi-normalization/datasets/microwellseq/
Retina: All cell classes / Macosko et al. 2015
Counts
  • visit https://www.ncbi.nlm.nih.gov/geo/
  • search for GSE63472
  • download GSE63472_P14Retina_merged_digital_expression.txt.gz (50.7 MB)
  • extract to GSE63472_P14Retina_merged_digital_expression.txt
  • save to umi-normalization/datasets/retina/macosko_all
Cluster annotations
Retina: Bipolar cells / Shekhar et al. 2016
Counts
  • visit https://www.ncbi.nlm.nih.gov/geo/
  • search for GSE81904
  • download GSE81904_BipolarUMICounts_Cell2016.txt.gz (42.9 MB)
  • save to umi-normalization/datasets/retina/shekhar_bipolar/
Cluster annotations
Retina: Ganglion cells / Tran et al. 2019
Raw counts
  • visit https://www.ncbi.nlm.nih.gov/geo/
  • search for GSE133382
  • download GSE133382_AtlasRGCs_CountMatrix.csv.gz (129.3 MB)
  • extract to GSE133382_AtlasRGCs_CountMatrix.csv
  • save to umi-normalization/datasets/retina/tran_ganglion/
Annotations and original gene selection
2-million cells: Mouse Organogenesis / Cao et al. 2019
Raw counts and annotations
FACS-sorted PBMC cells / Zheng et al. (2017) and Duò et al. (2018)

umi-normalization's Issues

Integration

Hello,

Thanks for your great work. I've been following the development of your method and its implementation in Scanpy and can't wait to start using it in my future analyses.

I would like to know, however, whether you have any recommendations on how this method could be used in the integration step of single-cell RNA-seq analysis. Seurat's SCTransform offers an integration workflow that uses SCT-normalized datasets.

Thanks.

Handling batch effects

Hey, how would you recommend handling multiple batches using this normalization scheme? In the past, I've used scTransform, which puts the batch variable into the model itself, so I guess it kind of regresses out the batch effect. It's worked very well in the past for me. This normalization scheme here is simpler and doesn't account for batch effects in the model, so I'm wondering how you recommend dealing with them.

In your paper, in the Cao figure, I noticed you identified batch-specific genes, and just removed them from the dataset. I'm unsure of this, isn't it possible these genes might be biologically relevant? I also noticed that in your PR to scanpy comment you mention applying this normalization to each batch separately, then just concatenating the results. Wouldn't this also be problematic? For instance, if I have two batches of different cell populations and one gene is never expressed in one batch, the residuals will always be zero, since it never deviates. In the other batch, for instance the gene is always expressed. In this case the residuals will also always be zero, since the gene is always expressed, and the model mean can fit this.

I'd love to get your feedback regarding this.

add normalization method to scanpy?

hi!
recently read the paper and found the method really convincing and effective!
would you be interested in submitting a PR to scanpy? Had a look at the code and it seems pretty straightforward to reimplement there. This is essentially the function needed, right?

def pearson_residuals(counts, theta):

If you are interested and willing, I would suggest loosely following sc.pp.normalize_total for the implementation
https://github.com/theislab/scanpy/blob/5533b644e796379fd146bf8e659fd49f92f718cd/scanpy/preprocessing/_normalization.py#L28-L202
you can also have a look at docs
would be very happy to assist/help out in case you are interested!

Thank you!
Giovanni
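For readers who want to see what the function signature quoted in this issue computes, here is a minimal NumPy sketch of analytic Pearson residuals as described in the paper: the expected count is `mu = (cell total × gene total) / grand total`, the residual is `(counts - mu) / sqrt(mu + mu² / theta)`, and residuals are clipped to `±sqrt(n_cells)`. This is a sketch under stated assumptions, not the repository's exact code: it assumes a dense matrix with cells in rows and genes in columns, no all-zero cells or genes, and the `clipping` argument is added here for illustration.

```python
import numpy as np


def pearson_residuals(counts, theta, clipping=True):
    """Analytic Pearson residuals for a cells x genes UMI count matrix.

    Sketch only: assumes a dense array without all-zero rows or columns.
    """
    counts = np.asarray(counts, dtype=float)
    cell_sums = counts.sum(axis=1, keepdims=True)  # shape (n_cells, 1)
    gene_sums = counts.sum(axis=0, keepdims=True)  # shape (1, n_genes)
    total = counts.sum()

    # Expected counts under the offset-only model: outer product / grand total.
    mu = cell_sums @ gene_sums / total

    # Negative binomial variance with fixed overdispersion theta.
    z = (counts - mu) / np.sqrt(mu + mu**2 / theta)

    if clipping:
        # Clip extreme residuals to +/- sqrt(number of cells).
        bound = np.sqrt(counts.shape[0])
        z = np.clip(z, -bound, bound)
    return z


if __name__ == "__main__":
    toy = np.array([[3, 0, 1], [0, 2, 4], [1, 1, 0]])
    print(pearson_residuals(toy, theta=100).round(2))
```

The paper uses `theta = 100` as its default; any other value passed here is the caller's choice.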
