'Best Practices for Spatial Transcriptomics Analysis with Bioconductor' online book
Home Page: https://lmweber.org/BestPracticesST/
Consider changing to logcounts using standard library size normalization instead of normalization by deconvolution, for (i) simplicity and (ii) easier extension to multi-sample datasets (where normalization by deconvolution is difficult to apply in the context of SRT data, since clustering becomes sample-specific)
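A minimal sketch of this change, assuming a `SpatialExperiment` object `spe` with a `counts` assay (the `scater` call is standard, but the object name is a placeholder):

```r
library(scater)

# library size normalization + log transformation ("logcounts" assay),
# in place of normalization by deconvolution
spe <- logNormCounts(spe)
```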
QC chapter should include a short paragraph on filtering to retain only protein-coding genes.
This is a useful standard filtering step in most datasets, although in some contexts / depending on the biological questions it may make sense to also keep others, e.g. lncRNAs.
Depending on the reference used, it may also not be necessary, e.g. if the reference only contains protein-coding genes. But the standard reference genome/transcriptome includes lncRNAs, pseudogenes, etc., so in this case filtering to remove them is useful.
E.g. simply as follows:

```r
# keep only protein-coding genes
sce <- sce[rowData(sce)$gene_type == "protein_coding", ]
```
While compiling, I got the error

```
Error in plotQCspots(spe, discard = "discard") :
  "barcode_id" %in% colnames(colData(spe)) is not TRUE
```

which refers to the following code (lines 91-96) in the human_DLFC.Rmd file:
```{r QC_check, fig.height=4, message=FALSE}
library(spatzli)
# check spatial pattern of combined set of discarded spots
plotQCspots(spe, discard = "discard")
```
I think this may be due to the earlier line (44) in your code, where the loaded data (`SingleCellExperiment`) does not have `barcode_id`, but only `barcode` instead.
I saw that you refactored your `spatzli` `plotQCspots` code to be based on the VisiumExperiment object, so maybe it's just the loaded data that has to be rearranged, or am I missing something? :)
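If that is the cause, a possible workaround (hypothetical, assuming the column only needs renaming) could be:

```r
# rename the 'barcode' column to the 'barcode_id' name expected by plotQCspots()
colnames(colData(spe))[colnames(colData(spe)) == "barcode"] <- "barcode_id"
```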
Dario
Add a new workflow chapter using the spatialLIBD dataset/package.
This should include a description of the dataset, followed by a complete workflow, starting with loading the data as a SpatialExperiment and followed by the individual analysis steps. The structure should be adapted as necessary, depending on the specific dataset and platform. Documentation does not need to be extensive, since the analysis steps are described in more detail in previous chapters, but it should be enough that readers can follow along and see how the steps connect.
Based on Heena's @heenadivecha feedback, we think that it'll be useful to:
Set up chapters in Workflows and comparisons part to compare performance of alternative methods for specific analysis steps.
In these chapters we will set up a mini-benchmark using a given dataset and evaluation metrics to compare the methods. For now, there will only be a few methods, but we will set this up so that new methods can easily be added in the future as these become available via the Bioconductor system.
The reverse dependencies table for packages using SpatialExperiment (in the Contributors chapter) is giving an error on the GitHub Actions build. Commenting it out for now so we can build an updated version for the course tomorrow.
The error seems to be related to changes in the latest version of the Bioconductor docker builds.
EBImage is a Bioconductor package for image processing, which has come up in discussions as a possible alternative tool to use for the initial image processing steps.
We would like to try this out and see how well it works on our existing Visium datasets, especially for identifying cells and counting the number of cells per spot.
In addition, it may be possible to use EBImage to calculate additional morphological spatial features on the identified cells (either individually or summarized per spot?), which is a type of information we currently do not have.
@edward130603 @msto @raphg mentioned they have some experience with EBImage, so we would welcome ideas and suggestions. Thanks!
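As a rough sketch of what this could look like (the file name and threshold choice are placeholders, not a tested pipeline):

```r
library(EBImage)

# read histology image and convert to grayscale
img <- readImage("tissue_hires_image.png")
gray <- channel(img, "gray")

# segment dark nuclei with an Otsu threshold, then label connected components
mask <- gray < otsu(gray)
labels <- bwlabel(mask)

# per-object morphological features (area, perimeter, radius, etc.)
feats <- computeFeatures.shape(labels)
```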
Integrate discussion from @boyiguo1 on batch effects in multiple-sample datasets, possibly in a new section on analyses for multiple-sample datasets
Possibly add a section on preprocessing and loading data into R using alevin-fry
instead of Space Ranger.
See tutorial here: https://combine-lab.github.io/alevin-fry-tutorials/2021/af-spatial/
Questions: does this create all the input files we need, e.g. downsampled image files and scale factors, or only aligned sequencing reads? Are there plans to add image functionality to alevin-fry in the future?
Deviance residuals from the scry package for normalization (instead of logcounts)
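A minimal sketch of this approach, assuming the `nullResiduals()` interface in current versions of `scry` (argument names may differ across versions):

```r
library(scry)

# deviance residuals from a binomial null model, stored as a new assay,
# as an alternative to logcounts for downstream dimension reduction
spe <- nullResiduals(spe, assay = "counts", fam = "binomial", type = "deviance")
```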
Include a paragraph or section on how to store / handle datasets with multiple parts / pieces per Visium capture area
Following on from discussion in #42 @estellad
In this issue we are adding a workflow chapter using the Xenium dataset adapted from @estellad's BC2 workflow materials. This will also be a great addition, since we currently do not have any examples with a Xenium dataset.
As discussed in #42, we prefer to avoid using Seurat objects, so we could try using MoleculeExperiment instead to load the data. The MoleculeExperiment package includes a data loading function readXenium(). I haven't worked with MoleculeExperiment much yet myself though, so please let me know how this goes. If this works, we could also add the MoleculeExperiment object to STexampleData (or a separate similar package) to streamline loading in the examples.
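A possible loading sketch (the directory path is a placeholder, and the exact arguments of `readXenium()` should be checked against the package documentation):

```r
library(MoleculeExperiment)

# load Xenium output at the molecule level into a MoleculeExperiment object
me <- readXenium("path/to/xenium_output", keepCols = "essential")
```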
Also tagging @stephaniehicks @raphg
Link to Chapter: https://lmweber.org/OSTA-book/quality-control.html
Suggestions:
`cell_count` in the `SpatialExperiment` is a product of VistoSeg instead of `spaceranger count`, to avoid confusion between counting the cells and counting the UMIs. For example, we can expand the phrase "(which is available for this dataset)" on line 101 with:
To recall, `cell_count` is created using VistoSeg (see [OSTA Chapter 5.6](https://lmweber.org/OSTA-book/image-segmentation-visium.html#identify-number-of-cells-per-spot)). `spaceranger` creates count information with the `count` pipeline. But it creates expression counts and similarly the library size of each spot (`sum` in the following figure), which will be used to create the `assays` of a `SpatialExperiment` object.
For example, we can add the following paragraphs to the paragraph at
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L101
after the sentence "We also plot the library sizes against the number of cells per spot (which is available for this dataset). "
The blue curve describes the non-linear relationship between the library size (`sum`) of each spot and the number of cells (`cell_count`) in the corresponding spot. Ideally, the blue curve should be a monotone increasing function, such that as there are more cells in each spot, the average library size of the spot grows larger. This is based on the assumption that the more cells in the spot, the more UMIs are expressed. In practice, we expect the blue curve to plateau or slightly decrease at certain values of `cell_count`, i.e. the number of cells per spot. However, the decrease should not be too pronounced.

Moreover, we could also threshold based on the number of cells per spot. We expect that the number of cells per spot should not be too large to be biologically reasonable. In other words, when the number of cells per spot exceeds a certain threshold (particularly with a small library size), it is not biologically reasonable to believe the spot can accommodate that many cells. Hence, we believe the spot is not of good quality.

The histograms on the top and the left of the graph depict the frequency of dots in the scatter plot. For example, from the histogram on top of the figure, we can see there are roughly 600 spots that accommodate roughly 4 cells. Anecdotally, the blank space in the histogram on top is an artifact due to inadequate break size.
Some histograms describing the marginal distribution of `cell_count` have artifact blank space. We need to adjust the break size of the histogram. Specifically, the figure generated on lines 103-108:
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L103
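For example, a finer break size could be set explicitly (a sketch, assuming `cell_count` is stored in `colData(spe)`):

```r
# one break per integer value of cell_count avoids artifact gaps in the histogram
hist(colData(spe)$cell_count,
     breaks = seq(0, max(colData(spe)$cell_count, na.rm = TRUE) + 1, by = 1),
     xlab = "cell_count", main = "Number of cells per spot")
```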
The sentence "This is to check that we are not inadvertently removing a biologically meaningful group of spots." on line 101 appears too early in the text.
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L101
IMO, this creates confusion that one can use plotQC to tell if the filtered spots are within a biologically meaningful group. The sentence seems to point to the spatial plot, where one can display the filtered spots and see if the pattern matches a certain biological structure, e.g. the laminar organization of the brain. Hence, I suggest moving this sentence to later in the text, perhaps right before/after the spatial plot on line 142:
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L142
Add a new workflow chapter using a publicly available Cartana dataset.
This should include a description of the dataset, followed by a complete workflow, starting with loading the data as a SpatialExperiment and followed by the individual analysis steps. The structure should be adapted as necessary, depending on the specific dataset and platform. Documentation does not need to be extensive, since the analysis steps are described in more detail in previous chapters, but it should be enough that readers can follow along and see how the steps connect.
An empty RMarkdown file to use for the workflow chapter is here.
Also, we should add a short description of the Cartana platform/technology in the Spatially resolved transcriptomics introductory chapter.
Add chapters on preprocessing steps required in Loupe Browser and Space Ranger to prepare data for loading into R.
The content and amount of detail is flexible, but should include enough detail that someone who is fairly new to this can follow along, and possibly pointing them to additional details on the 10x Genomics website. For example, we could also include some screenshots from Loupe Browser, and examples of code to run Space Ranger.
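For the Space Ranger chapter, an example invocation could look like this (all paths and slide/area IDs are placeholders; the exact flags should be checked against the 10x Genomics documentation):

```sh
spaceranger count \
  --id=sample01 \
  --transcriptome=/path/to/refdata-gex-GRCh38-2020-A \
  --fastqs=/path/to/fastqs \
  --image=/path/to/brightfield_image.tif \
  --slide=V19J01-123 \
  --area=A1
```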
I have included two draft outline chapters in the book structure for now (Loupe Browser and Space Ranger), so you can see where it all fits in.
I believe it would be helpful if you include somewhere, for instance in "Load data", a code example for building an SPE object from the individual data elements. At the moment, all SPE objects are ready-made and loaded from data packages.
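Something along these lines could work (a sketch: `counts`, `col_data`, and `xy` are assumed to already exist as a gene-by-spot matrix, a per-spot DataFrame, and a numeric matrix of coordinates):

```r
library(SpatialExperiment)

# build a SpatialExperiment from its individual components
spe <- SpatialExperiment(
    assays = list(counts = counts),
    colData = col_data,
    spatialCoords = xy
)
```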
Add a workflow chapter on a cancer dataset, e.g. one of the publicly available human breast cancer datasets from 10x Genomics.
Cancer datasets require some adaptations to our existing analysis pipelines due to the different data characteristics. For example:
I wanted to note an error that I received when trying to run the code chunks in Chapter 12.3. The PCA commands needed to be edited to give the indices of the HVGs instead of the gene_ids. I used this code and it gave me the same results that were produced in the book:
```r
top_hvgs_indices <- which(rowData(spe)$gene_id %in% top_hvgs)
spe <- runPCA(spe, subset_row = top_hvgs_indices)
```
Add an introductory chapter with some additional details on scientific background / introductory material on the scientific context of SRT analyses. This will be helpful for new users / analysts and students.
Possibly linking to some review papers, e.g. https://www.nature.com/articles/s41587-022-01448-2
Suggested by Mike Love and Shila Ghazanfar in spatial Bioc Slack channel.
Hi,
Recently, I have been going through the book and some tools in the field (Seurat, STutility, STexampleData, ggspavis).
I was wondering if I missed the part where you explain how to plot data from the ST platform (the one from Stahl 2016).
It seems SpatialExperiment accepts only Visium images for overlaying expression on the slide.
Is there a way to do that?
The easier/more general method would be to plot a raster image without trying to have the slide in the background, no?
Thanks.
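One simple approach along those lines, sketched with ggplot2 (the data frame `df`, holding per-spot coordinates and an expression value, is a placeholder):

```r
library(ggplot2)

# plot expression at spot coordinates without a background slide image
ggplot(df, aes(x = x, y = y, color = expr)) +
    geom_point(size = 2) +
    coord_fixed() +
    theme_void()
```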
This is not really on anyone other than 10x themselves, but the image in Chapter 2 showing how UMIs and spatial barcodes are arranged on the chip confusingly says "Partial Read 1" on the opposite end of the oligo from the poly-dT. It totally got past me that what they mean is the "partial read 1 adapter sequence" (as opposed to some component of the cDNA) until I did some digging to find additional illustrations.
Include some info on number of cells per spot for human brain / mouse brain tissue for sequencing-based / spot-based platforms
Clarify QC on mitochondrial proportion (whether to use)
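For example, the mitochondrial proportion per spot can be computed as follows (a sketch assuming human gene symbols stored in `rowData(spe)$gene_name`):

```r
library(scater)

# flag mitochondrial genes by symbol and add per-spot QC metrics,
# including the mitochondrial percentage per spot
is_mito <- grepl("^MT-", rowData(spe)$gene_name)
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
```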
A few prerequisites:
```r
devtools::install_github("LTLA/rebook")
devtools::install_github("lmweber/STdata")
BiocManager::install("scater")
devtools::install_github("lmweber/spatzli")
BiocManager::install("scran")
```
Still, the compilation breaks at:
```
Quitting from lines 26-44 (normalization.Rmd)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Calls: local ... librarySizeFactors -> .local -> colSums -> colSums -> <Anonymous>
Execution halted
Error in Rscript_render(f, render_args, render_meta, add1, add2) :
  Failed to compile normalization.Rmd
```
Thanks, @lmweber, looking forward to working through the book.
Updates to individual analysis chapters, in particular:
Hi @estellad, thank you for getting in touch regarding adapting your workflow materials from the BC2 conference for inclusion in the book.
As we discussed by email, I'm opening some GitHub issues so we can keep track of the additions.
In this issue, we would like to add a workflow chapter using the Visium breast cancer dataset. This would be a useful addition since currently we do not have a workflow chapter using a cancer dataset.
The book is now reformatted in Quarto format, and new chapters can be added as new .qmd files.
The main constraint is that we would like to include only Bioconductor or CRAN packages in the runnable examples, since this will help with long-term maintenance, and make it possible to submit the book to Bioconductor and use the Bioconductor build systems. We would also like to exclude Seurat due to its customized object formats and dependencies.
Specifically, this means we can't use GitHub-only packages (e.g. RCTD) within the examples. For these sections, I would suggest discussing these tools in the text instead, and we could consider adding examples at a later time if we find some Bioconductor tools for these steps that perform well.
For the datasets, we could save them in an ExperimentHub package so they can be loaded within the examples, similar to STexampleData. (We could also add the objects to STexampleData if you prefer.)
Also tagging @stephaniehicks @raphg
Currently in the clustering chapter, we only have a short example using non-spatial clustering, and then list some alternatives for spatial clustering algorithms. @estellad
We have had good experiences with BayesSpace in our collaborations, and it is available in Bioconductor, so we could include an example using BayesSpace in the clustering chapter.
This would be in addition to using BayesSpace in the new workflow chapters from @estellad.
Here I believe the main constraint may be runtime. For the main examples in the chapters, we want to keep runtime as fast as possible, so that the whole book builds in a reasonable amount of time. Depending on how fast the example is, we could use a sub-region of a full sample or consider subsampling spots if we need to speed up runtime.
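Subsampling spots could be as simple as the following sketch (the target number of spots is arbitrary):

```r
# randomly subsample spots to reduce runtime, with a fixed seed for reproducibility
set.seed(123)
n_keep <- 2000
spe_sub <- spe[, sample(ncol(spe), min(n_keep, ncol(spe)))]
```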
Also tagging @stephaniehicks @raphg