'Best Practices for Spatial Transcriptomics Analysis with Bioconductor' online book
Home Page: https://lmweber.org/BestPracticesST/
Consider changing to logcounts using standard library size normalization instead of normalization by deconvolution, for (i) simplicity and (ii) easier extension to multi-sample datasets (where normalization by deconvolution is difficult to apply in the context of SRT data, since clustering becomes sample-specific)
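A minimal sketch of this change, assuming a `SpatialExperiment` object `spe` with a `counts` assay (the `scater` call is standard, but the object name is a placeholder):

```r
library(scater)

# library size normalization + log transformation ("logcounts" assay),
# in place of normalization by deconvolution
spe <- logNormCounts(spe)
```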
QC chapter should include a short paragraph on filtering to retain only protein-coding genes.
This is a useful standard filtering step in most datasets, although in some contexts / depending on the biological questions it may make sense to also keep others, e.g. lncRNAs.
Depending on the reference used, it may also not be necessary, e.g. if the reference only contains protein-coding genes. But the standard reference genome/transcriptome includes lncRNAs, pseudogenes, etc., so in this case filtering to remove them is useful.
E.g. simply as follows:

```r
# keep only protein-coding genes
sce <- sce[rowData(sce)$gene_type == "protein_coding", ]
```
While compiling, I got the error

```
Error in plotQCspots(spe, discard = "discard") :
  "barcode_id" %in% colnames(colData(spe)) is not TRUE
```

which refers to the following code (lines 91-96) in the human_DLFC.Rmd file:
```{r QC_check, fig.height=4, message=FALSE}
library(spatzli)
# check spatial pattern of combined set of discarded spots
plotQCspots(spe, discard = "discard")
```
I think this may be due to the earlier line (44) in your code, where the loaded data (`SingleCellExperiment`) does not have `barcode_id`, but only `barcode` instead.
I saw that you refactored your `spatzli` `plotQCspots` code to be based on the VisiumExperiment object, so maybe it's just the loaded data that has to be rearranged, or am I missing something? :)
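If that is the cause, a possible workaround (hypothetical, assuming the column only needs renaming) could be:

```r
# rename the 'barcode' column to the 'barcode_id' name expected by plotQCspots()
colnames(colData(spe))[colnames(colData(spe)) == "barcode"] <- "barcode_id"
```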
Dario
Add a new workflow chapter using the spatialLIBD dataset/package.
This should include a description of the dataset, followed by a complete workflow, starting with loading the data as a SpatialExperiment and followed by the individual analysis steps. The structure should be adapted as necessary, depending on the specific dataset and platform. Documentation does not need to be extensive, since the analysis steps are described in more detail in previous chapters, but it should be enough that readers can follow along and see how the steps connect.
Based on Heena's @heenadivecha feedback, we think that it'll be useful to:
Set up chapters in Workflows and comparisons part to compare performance of alternative methods for specific analysis steps.
In these chapters we will set up a mini-benchmark using a given dataset and evaluation metrics to compare the methods. For now, there will only be a few methods, but we will set this up so that new methods can easily be added in the future as these become available via the Bioconductor system.
The reverse dependencies table for packages using SpatialExperiment (in the Contributors chapter) is giving an error on the GitHub Actions build. Commenting it out for now so we can build an updated version for the course tomorrow.
The error seems to be related to changes in the latest version of the Bioconductor docker builds.
EBImage is a Bioconductor package for image processing, which has come up in discussions as a possible alternative tool to use for the initial image processing steps.
We would like to try this out and see how well it works on our existing Visium datasets, especially for identifying cells and counting the number of cells per spot.
In addition, it may be possible to use EBImage to calculate additional morphological spatial features on the identified cells (either individually or summarized per spot?), which is a type of information we currently do not have.
@edward130603 @msto @raphg mentioned they have some experience with EBImage, so we would welcome ideas and suggestions. Thanks!
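As a rough sketch of what this could look like (the file name and threshold choice are placeholders, not a tested pipeline):

```r
library(EBImage)

# read histology image and convert to grayscale
img <- readImage("tissue_hires_image.png")
gray <- channel(img, "gray")

# segment dark nuclei with an Otsu threshold, then label connected components
mask <- gray < otsu(gray)
labels <- bwlabel(mask)

# per-object morphological features (area, perimeter, radius, etc.)
feats <- computeFeatures.shape(labels)
```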
Integrate discussion from @boyiguo1 on batch effects in multiple-sample datasets, possibly in a new section on analyses for multiple-sample datasets
Possibly add a section on preprocessing and loading data into R using alevin-fry
instead of Space Ranger.
See tutorial here: https://combine-lab.github.io/alevin-fry-tutorials/2021/af-spatial/
Questions: does this create all the input files we need, e.g. downsampled image files and scale factors, or only aligned sequencing reads? Are there plans to add image functionality to alevin-fry in the future?
Deviance residuals from the scry package for normalization (instead of logcounts)
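A minimal sketch of this approach, assuming the `nullResiduals()` interface in current versions of `scry` (argument names may differ across versions):

```r
library(scry)

# deviance residuals from a binomial null model, stored as a new assay,
# as an alternative to logcounts for downstream dimension reduction
spe <- nullResiduals(spe, assay = "counts", fam = "binomial", type = "deviance")
```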
Include a paragraph or section on how to store / handle datasets with multiple parts / pieces per Visium capture area
Following on from discussion in #42 @estellad
In this issue we are adding a workflow chapter using the Xenium dataset adapted from @estellad's BC2 workflow materials. This will also be a great addition, since we currently do not have any examples with a Xenium dataset.
As discussed in #42, we prefer to avoid using Seurat objects, so we could try using MoleculeExperiment instead to load the data. The MoleculeExperiment package includes a data loading function readXenium(). I haven't worked with MoleculeExperiment much yet myself though, so please let me know how this goes. If this works, we could also add the MoleculeExperiment object to STexampleData (or a separate similar package) to streamline loading in the examples.
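A possible loading sketch (the directory path is a placeholder, and the exact arguments of `readXenium()` should be checked against the package documentation):

```r
library(MoleculeExperiment)

# load Xenium output at the molecule level into a MoleculeExperiment object
me <- readXenium("path/to/xenium_output", keepCols = "essential")
```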
Also tagging @stephaniehicks @raphg
Link to Chapter: https://lmweber.org/OSTA-book/quality-control.html
Suggestions:
`cell_count` in the `SpatialExperiment` is a product of VistoSeg instead of `spaceranger count`, to avoid confusion between counting the cells and counting the UMIs. For example, we can expand the phrase "(which is available for this dataset)" on line 101 with:
To recall, `cell_count` is created using VistoSeg (see [OSTA Chapter 5.6](https://lmweber.org/OSTA-book/image-segmentation-visium.html#identify-number-of-cells-per-spot)). `spaceranger` creates count information with the `count` pipeline. But it creates expression counts and similarly the library size of each spot (`sum` in the following figure), which will be used to create the `assays` of a `SpatialExperiment` object.
For example, we can add the following paragraphs to the paragraph at
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L101
after the sentence "We also plot the library sizes against the number of cells per spot (which is available for this dataset). "
The blue curve describes the non-linear relationship between the library size (`sum`) of each spot and the number of cells (`cell_count`) in the corresponding spot. Ideally, the blue curve should be a monotone increasing function, such that as there are more cells in each spot, the average library size of the spot grows larger. This is based on the assumption that the more cells in the spot, the more UMIs are expressed. In practice, we expect the blue curve to plateau or slightly decrease at certain values of `cell_count`, i.e. the number of cells per spot. However, the decrease should not be too pronounced.

Moreover, we could also threshold based on the number of cells per spot. We expect that the number of cells per spot should not be too large to be biologically reasonable. In other words, when the number of cells per spot exceeds a certain threshold (particularly with a small library size), it is not biologically reasonable to believe the spot can accommodate that many cells. Hence, we believe the spot is not of good quality.

The histograms on the top and the left of the graph depict the frequency of dots in the scatter plot. For example, from the histogram on top of the figure, we can see there are roughly 600 spots that accommodate roughly 4 cells. Anecdotally, the blank space in the histogram on top is an artifact due to inadequate break size.
Some histograms describing the marginal distribution of `cell_count` have artifact blank space. We need to adjust the break size of the histogram. Specifically, the figure generated on lines 103-108:
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L103
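For example, a finer break size could be set explicitly (a sketch, assuming `cell_count` is stored in `colData(spe)`):

```r
# one break per integer value of cell_count avoids artifact gaps in the histogram
hist(colData(spe)$cell_count,
     breaks = seq(0, max(colData(spe)$cell_count, na.rm = TRUE) + 1, by = 1),
     xlab = "cell_count", main = "Number of cells per spot")
```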
The sentence "This is to check that we are not inadvertently removing a biologically meaningful group of spots." on line 101 appears too early in the text.
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L101
IMO, this creates confusion that one can use plotQC to tell if the filtered spots are within a biologically meaningful group. The sentence seems to point to the spatial plot, where one can display the filtered spots and see if the pattern matches a certain biological structure, e.g. the laminar organization of the brain. Hence, I suggest moving this sentence to later in the text, perhaps right before/after the spatial plot on line 142:
https://github.com/lmweber/OSTA-book/blob/6557b14a22043be9b35b975b3f17e37e6ad485ce/chapters/09-quality_control.Rmd#L142
Add a new workflow chapter using a publicly available Cartana dataset.
This should include a description of the dataset, followed by a complete workflow, starting with loading the data as a SpatialExperiment and followed by the individual analysis steps. The structure should be adapted as necessary, depending on the specific dataset and platform. Documentation does not need to be extensive, since the analysis steps are described in more detail in previous chapters, but it should be enough that readers can follow along and see how the steps connect.
An empty RMarkdown file to use for the workflow chapter is here.
Also, we should add a short description of the Cartana platform/technology in the Spatially resolved transcriptomics introductory chapter.
Add chapters on preprocessing steps required in Loupe Browser and Space Ranger to prepare data for loading into R.
The content and amount of detail is flexible, but should include enough detail that someone who is fairly new to this can follow along, and possibly pointing them to additional details on the 10x Genomics website. For example, we could also include some screenshots from Loupe Browser, and examples of code to run Space Ranger.
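For the Space Ranger chapter, an example invocation could look like this (all paths and slide/area IDs are placeholders; the exact flags should be checked against the 10x Genomics documentation):

```sh
spaceranger count \
  --id=sample01 \
  --transcriptome=/path/to/refdata-gex-GRCh38-2020-A \
  --fastqs=/path/to/fastqs \
  --image=/path/to/brightfield_image.tif \
  --slide=V19J01-123 \
  --area=A1
```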
I have included two draft outline chapters in the book structure for now (Loupe Browser and Space Ranger), so you can see where it all fits in.
I believe it would be helpful if you include somewhere, for instance in "Load data", a code example for building an SPE object from the individual data elements. At the moment, all SPE objects are ready-made and loaded from data packages.
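Something along these lines could work (a sketch: `counts`, `col_data`, and `xy` are assumed to already exist as a gene-by-spot matrix, a per-spot DataFrame, and a numeric matrix of coordinates):

```r
library(SpatialExperiment)

# build a SpatialExperiment from its individual components
spe <- SpatialExperiment(
    assays = list(counts = counts),
    colData = col_data,
    spatialCoords = xy
)
```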
Add a workflow chapter on a cancer dataset, e.g. one of the publicly available human breast cancer datasets from 10x Genomics.
Cancer datasets require some adaptations to our existing analysis pipelines due to the different data characteristics. For example:
I wanted to note an error that I received when trying to run the code chunks in Chapter 12.3. The PCA commands needed to be edited to give the indices of the HVGs instead of the gene_ids. I used this code and it gave me the same results that were produced in the book:
```r
top_hvgs_indices <- which(rowData(spe)$gene_id %in% top_hvgs)
spe <- runPCA(spe, subset_row = top_hvgs_indices)
```
Add an introductory chapter with some additional details on scientific background / introductory material on the scientific context of SRT analyses. This will be helpful for new users / analysts and students.
Possibly linking to some review papers, e.g. https://www.nature.com/articles/s41587-022-01448-2
Suggested by Mike Love and Shila Ghazanfar in spatial Bioc Slack channel.
Hi,
Recently, I have been going through the book and some tools in the field (Seurat, STutility, STexampleData, ggspavis).
I was wondering if I missed the part where you explain how to plot data from the ST platform (the one from Stahl 2016).
It seems SpatialExperiment accepts only Visium images for overlaying expression on the slide.
Is there a way to do that?
The easier/more general method would be to plot a raster image without trying to have the slide in the background, no?
Thanks.
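One simple approach along those lines, sketched with ggplot2 (the data frame `df`, holding per-spot coordinates and an expression value, is a placeholder):

```r
library(ggplot2)

# plot expression at spot coordinates without a background slide image
ggplot(df, aes(x = x, y = y, color = expr)) +
    geom_point(size = 2) +
    coord_fixed() +
    theme_void()
```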
This is not really on anyone other than 10x themselves, but the image in Chapter 2 showing how UMIs and spatial barcodes are arranged on the chip confusingly says "Partial Read 1" on the opposite end of the oligo from the poly-dT. It totally got past me that what they mean is the "partial read 1 adapter sequence" (as opposed to some component of the cDNA) until I did some digging to find additional illustrations.
Include some info on number of cells per spot for human brain / mouse brain tissue for sequencing-based / spot-based platforms
Clarify QC on mitochondrial proportion (whether to use)
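For example, the mitochondrial proportion per spot can be computed as follows (a sketch assuming human gene symbols stored in `rowData(spe)$gene_name`):

```r
library(scater)

# flag mitochondrial genes by symbol and add per-spot QC metrics,
# including the mitochondrial percentage per spot
is_mito <- grepl("^MT-", rowData(spe)$gene_name)
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
```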
A few prerequisites:
```r
devtools::install_github("LTLA/rebook")
devtools::install_github("lmweber/STdata")
BiocManager::install("scater")
devtools::install_github("lmweber/spatzli")
BiocManager::install("scran")
```
Still, the compilation breaks at:
```
Quitting from lines 26-44 (normalization.Rmd)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Calls: local ... librarySizeFactors -> .local -> colSums -> colSums -> <Anonymous>
Execution halted
Error in Rscript_render(f, render_args, render_meta, add1, add2) :
  Failed to compile normalization.Rmd
```
Thanks, @lmweber, looking forward to working through the book.
Updates to individual analysis chapters, in particular:
Hi @estellad, thank you for getting in touch regarding adapting your workflow materials from the BC2 conference for inclusion in the book.
As we discussed by email, I'm opening some GitHub issues so we can keep track of the additions.
In this issue, we would like to add a workflow chapter using the Visium breast cancer dataset. This would be a useful addition since currently we do not have a workflow chapter using a cancer dataset.
The book is now reformatted in Quarto format, and new chapters can be added as new .qmd files.
The main constraint is that we would like to include only Bioconductor or CRAN packages in the runnable examples, since this will help with long-term maintenance, and make it possible to submit the book to Bioconductor and use the Bioconductor build systems. We would also like to exclude Seurat due to its customized object formats and dependencies.
Specifically, this means we can't use GitHub-only packages (e.g. RCTD) within the examples. For these sections, I would suggest discussing these tools in the text instead, and we could consider adding examples at a later time if we find some Bioconductor tools for these steps that perform well.
For the datasets, we could save them in an ExperimentHub package so they can be loaded within the examples, similar to STexampleData. (We could also add the objects to STexampleData if you prefer.)
Also tagging @stephaniehicks @raphg
Currently in the clustering chapter, we only have a short example using non-spatial clustering, and then list some alternatives for spatial clustering algorithms. @estellad
We have had good experiences with BayesSpace in our collaborations, and it is available in Bioconductor, so we could include an example using BayesSpace in the clustering chapter.
This would be in addition to using BayesSpace in the new workflow chapters from @estellad.
Here I believe the main constraint may be runtime. For the main examples in the chapters, we want to keep runtime as fast as possible, so that the whole book builds in a reasonable amount of time. Depending on how fast the example is, we could use a sub-region of a full sample or consider subsampling spots if we need to speed up runtime.
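Subsampling spots could be as simple as the following sketch (the target number of spots is arbitrary):

```r
# randomly subsample spots to reduce runtime, with a fixed seed for reproducibility
set.seed(123)
n_keep <- 2000
spe_sub <- spe[, sample(ncol(spe), min(n_keep, ncol(spe)))]
```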
Also tagging @stephaniehicks @raphg