stemangiola / cellsig Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 1.0 377.06 MB

License: GNU General Public License v3.0

R 96.97% Stan 1.34% C++ 0.01% Shell 1.68%

cellsig's People

Contributors

Stargazers

Watchers

Forkers

kamran-khan96

cellsig's Issues

Use posterior probabilities of estimated transcript abundance to rank the genes.

rank quantiles

First explorative tasks

We have 4 parsed Seurat datasets with the columns (if you use tidyseurat you can see this)

.cell, sample, dataset, cell_type

Get a table of total cell number per sample per dataset (print out a tibble)
Get unique cell type lists for each dataset (print out a tibble, seurat_obj |> distinct(dataset, sample, cell_type))
Create pseudobulk for each sample and visualise all samples in a principal component plot (what's the variability within dataset compared to across dataset)
Create pseudobulk for each sample/cell_type and visualise all samples in a principal component plot (what's the variability within dataset compared to across dataset and cell types)
For each dataset separately, integrate the samples (if needed) and draw UMAP plots (dataset heterogeneity at the single-cell level)

Goals for final figures

Hi @stemangiola,

These are goals we currently have-

Ways to reach the goals for the final figure:

Heatmap with the markers to see if cellular hierarchy has been addressed ( will try to do by 05-08-22)

Generate the signature with hierarchy
construct the signature matrix
create heatmap with the hierarchy-based markers, and compare with non-hierarchy markers

Benchmark with signatures of different sizes (will try to do by 12-08-22)

create the signatures with different marker sizes (like the approach previously done for non-hierarchy)
benchmark by deconvolution and create boxplots of error values

Benchmark with gene perturbation routine (will try to do by 19-08-22)

create signatures after randomly perturbing a number of markers
do the benchmark and create boxplot of error values

Measure the difference between variability of Bayes and naive methods.

Hello @XpelC ,

the other initial task (beside harmonizing single-cell data) is to replicate Jian plots. The same analyses will be done on the single-cell-derived data.

The hypothesis is that the Bayes method we developed better models the noise derived by multi-dataset-derived signature. Difference in variability estimation can be caused by underestimation from excessive imputation (of NA non 0s). Other bises can be caused by unbalance datasets size (if a dataset has many similar samples, compared to a dataset with 2 samples, the estimation will be biased toward the first dataset eventhough does not express enough variability).

The goal

We have 10% 90% quantiles for each gene and cell type , from the Bayes dataset, and the for the naive dataset we can calculate 10% 90% quantile interval and compare them with a scatter plot, colouring by whether a gene is imputed (I will explain why the next time we meet)
Produce the same plot for comparing the (log) means.
From (1) and (2) plot boxplot of most distinct genes

Jian Wu already did a lot of work on this, please read the issue linked below and see you understand the goal

#36

Here the two datasets to compare

Bayes, slightly imputed as works with missing information

https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/EYGdKSDTAWxKvYKScXwE2KIBkC3oHQvMLR-jWk94y3uoWA?e=bNlC81

Naive, highly imputed

https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/EXtWtb3KlmdCoG6qN0cFXK4B1iebhq5CiLHyMi3yUfUFCw?e=94dXg1

For each dataset, we want to test whether integration is needed across samples, or it is NOT needed.

We get SCT from each dataset, and plot UMAP of all samples in a dataset. If colouring by sample we see that all samples overlap, this means that we don't need integration across those samples.

The result is:

for which of those 5 datasets samples do not need to be integrated.

Paste below 5 UMAPs coloured by sample

Develop cell type-specific signatures. For example, I want to deconvolute NK cells, so I want sufficient markers (and not more) to achieve separation from NK cells to the others.

I want to select
epithelial, immune cells
t cell bcell nkcell
activated nk resting nk

Defining the clusters

Clean the dataset

Eliminate the doublet.

cell type name formatting

Two columns: one column is original cell type names, the other column is the formatted names.
These name should be lower case, with no space, singular (eliminate inconsistency)
Left join the table with the data.

Check the dataset:

To see which sample is from cancer patients, which is not (print a table).
Check the scRNA sequencing method (10X, SMARTseq2)
Check if the count from all dataset are raw count (plot the log + 1 counts densitied ghroup by sample, color by dataset (ggplot))
Check which counts are "raw" (integers) and which have decimals, we should avoid to transform the data twice.
Color the umap by cancer and normal sample.
FeaturePlot: Use the original umap (split by cell marker, and sample).
Use features and the markers:
"CD14", "FCGR3A", "CD79A", "CD3G" for monocyte, monocyte (NK) , b cell, t cell
EPCAM for epithelial
VIM for fibroblasts
CD31 for endothelial
CD68 for monocyte

Cluster the cells separately [NOT NEEDED ANYMORE]

Divide the dataset by epithelial cells and non-epithelial cells. Cluster the two groups separately and try to define the cell type of each cluster.
Filter epithelial out from the immune-cell dataset; filter immune cells out from epithelial dataset.

cell type decision

decide the cell type for each cluster
Another stream: Use DCA to see if the cells can be separate better.
Check the cell marker of clusters with the following command
pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

Sanity check

double check:

Run another integration (let seurat automatically choose reference)

For your user-oriented pipeline, produce an evaluation function that, given

the original user-provided data set and
your signature output data set

provides a faceted boxplot, of the transcript abundance, for each gene and cell_type.

For example, if you have 4 genes and 4 cell types the plot should include 16 facets and four-gene elements on the x axis for each facet. You can fill by the cell type each marker is for, for 16 facets should be filled with four-colour, and each facet should only have one colour.

Additional feature:
Tell the user how confounded are cell-type pairs for deconvolution. For example, can you distinguish really between CD8 effector memory and central memory?
The method would be to build a mixture of one cells type (0-100%), deconvolute you measure the relative from pure 100%.

If you cannot distinguish between two cells types you would get 50%-50%.

After the signature generation, even after using non-hierarchical approach, the generated signatures show a hierarchical arrangement of the cells.

Maybe the use of a hierarchically arranged tree is causing the trouble? (causing imputation based on the hierarchical arrangement of the cell types because of the tree organization?)

Maybe there are worst cell types which might be resulting in some extreme marker selection? - (probable solution would be to add some new code in the rank bayes module - where instead of taking the mean upper quantile, we would take the worst upper quantile?)

*Maybe using .contrast_method = pairwise_contrast ((rather than mean contrast, with worse-quantile-interval-difference ranking policy) is contributing for the apparent hierarchical marker selection.

Build figure 1 and draft figure 3.

Try to cluster the datasets with 10 different resolution

Find cluster numbers

Run the clustering with 10 different thresholds, then plot UMAP coloured by cluster producing 10 total UMAP, then attach these multipanelled figure into the Github issue project. Use patrchwork to build a faceted plot, with 10 UMAPS.

Trouble shooting

Try to set the number of cores to 1
Replace mcapply with lapply
Select 1 sample and run it with SCT
If the three lines above not work, contact the authors in github.

Find alternative algorithm

Look the names of some other algorithms, and just record them currently.

Vignette for Bayes

Hello @Kamran-Khan96,

some feedback on the signature selection for Bayes. I think that since Bayes input is quite different from the raw counts input, the main function (or its subfunctions) should be called in the backend, and we should have another cleaner dedicate interface (another function, see below)

In summary, rather than this code in the vignette

library(here)

# Load the functions
source(here("dev/jian_R_files/cellsig_for_kamran.R"), local = knitr::knit_global())

# Load the count data and tree file
count = readRDS(here("dev/jian_R_files/test_count_data.rds"))

# Load tree
tree = yaml.load_file(here("dev/jian_R_files/test_tree.yaml")) %>% 
  as.Node

# Load the bayes count file
bayes = readRDS(here("dev/jian_R_files/test_count_bayes.rds"))  %>%

  # add sample column to bayes file
  inner_join(readRDS(here("dev/jian_R_files/bayes_add_sample.rds")))

signature_bayes = count %>%
  main(.sample = sample, .symbol = symbol, .count = count, .cell_type = cell_type,
       .is_hierarchy=FALSE,
       .tree = tree,
       .contrast_method = pairwise_contrast, .ranking_method = rank_bayes, .bayes= bayes,
       .selection_method = "silhouette", .reduction_method = "PCA", .dims=2, .discard_number = 10,
       .optimisation_method = "penalty",
       .is_complete = TRUE)

We should have

marker_genes = 
  bayes_quantiles %>%
  get_markers_from_transcription_abundance_quantiles(
       .cell_type = cell_type,
       .symbol = symbol, 
       .lower_quantile = `10%`,
       .upper_quantile = `90%`,

      # Secondary arguments
       tree = NULL, # If tree is NULL we know that hierarchy is FALSE, so we simplify the interface
       contrast_method = pairwise_contrast, 
       selection_method = "silhouette", 
       reduction_method = "PCA", 
       dims=2, 
       discard_number = 10,
       optimisation_method = "penalty",
       is_complete = TRUE
  )

Let me know what you think and if you can implement it.

modify the cell_type column in the data to adapt to the subset tree.

Generate at least 100 mixtures for each method and calculated the average MSE for all methods; compare all methods in terms of average MSE from 100 mixtures;

Choose more PC for the silhouette score, and run the deconvolution benchmark and see if it improves, out methods compared to cibersortX

Produce bar plot based on deconvolution accuracy rather than siluette score

Implement shiny app in the WEHI servers

Test shiny app locally with your data
test ap locally and improve speed and stability
organise the transfer of the app into WEHI server (directory in WEHI machine)
test app remotely, and improve performance if needed

comment on commit ID and attach the benchmark plot for the pipeline

all *_join should have the by= argument

Hello @jianwu1 ,

please when you have time all *_join should have the by= argument. I am spending many hours with @ZijieGA and this aspect is causing the code to be fragile, and not robust, causing many errors very hard to debug.

Thesis Figures

Figures and other files for the manuscript

Optimise signature extraction memory consumption.

Harmonise the datasets

we need to

integrate the raw counts to
cluster the data and
use the existing annotation to label each cluster.

I will provide you with this pipeline so you can try to execute it on the WEHI servers. This pipeline takes care of dead cells and doublets and does SCT normalisation with no factors
Take each Seurat file, run clustering and plot ribosomal score and mitochondrial score (they will be already ion the datasets), and see if you get cluster based on those scores, then you re normalise based on those factors
you will use the output being SCT normalised dataset to integrate at the sample level

Cell annotation

Re-annotate the Seurat cluster (resolution 0.6).
Look at each single cluster and look at cell types. If the cell types in one cluster do not agree with each other; increase the resolution (discuss with Stefano).

scSHC

Pick max 500 cells from each sample, run findCluster with the integrated assay.

Note: A possible alternative to ellipse plotting or ellipse-based calculation

Poligon outer tracing

I found it on a random page that does not tell how to make it

https://ggplot2.tidyverse.org/articles/extending-ggplot2.html#creating-a-new-geom-1

but I'm sure this can be found already done

calculate the ratio for each gene whose counts are imputed for a cell type and filter for genes whose imputation ratios are below 0.2.

move to-do list into gihub projects. I left instructions for you.

Hello, @jianwu1 we can try to move the to-do list into GitHub projects (top menu). I left instructions for you.

polish cellsigDB (the minimum standard)

todos:

Execute as sanity check this file https://github.com/stemangiola/cellsig/blob/Kamran/dev/process_counts.R and make sure you can reconstruct the raw caounts
If you did add more raw data, create a new R script called parse_fourth_databse.R
Add the call to parse_fourth_databse.R to process_counts.R and re execute process counts to see that you get the raw data back
polish any of the parse_*_databse.R when you find mistakes (JUST THE BIG ONES)

generalise the ranking function with edgeR to arrange of a different column

rank_function_edgeR(.., column_name){
column_name = enquo(column_name)

... %>%
arrange(!!column_name)
}

rank_function_edgeR(..., PValue)
rank_function_edgeR(..., LogFC)

reproduce the bar plots with new PCA parameters.(top=Inf, scale=FALSE).

Test tidybulk again as_SummarizedExperiment

Try sparse matrix from Matrix of delayedArray for storing expression matrix for SummarizedExpriment object

Download single-cell immune cell data

Define the batch

Create a new column of the data (csv) with the batch correspondance to each sample. ex: If one dataset does not need internal integration, all samples within that dataset remaining in the same batch. Otherwise, if samples from dataset need integration, they will have their own batch.
Divide the datasets by batch and run the pipeline (save the results in another directory).

Evaluation plots of the effect of Bayesian estimated gene expression distribution vs empirical gene expression from samples of various sizes

1- Pick manually signature size for every node (be your own optimisation method), and test CD4 PCA, does it look good.

Pick manually signature size for every node (be your own optimisation method), and test CD4 PCA, does it look good.

If this is true, we try to build a programmatic method that achieves the same selection.

First, check if increasing penalty achieves the knee identification for optimisation

IF NOT

Change optimisation method

kernel smooth (choose the right kernel)
and identify 2nd derivatives

what does justify dataset difference

If we do pseudobulk analyses of the samples and cell types we see some clustering of datasets but also some splitting within dataset, can we understand what is the biological category that most causes these differences?

Use/compare new method for signature selection

Hello @Kamran-Khan96 , I completed the Bayes dataset, which should be better generalizable to unknown new query datasets.

https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/EYGdKSDTAWxKvYKScXwE2KIBkC3oHQvMLR-jWk94y3uoWA?e=bNlC81

Jian wrote a testing/feature selection module that instead of using test_differential_abundance simply compares the credible intervals of genes between cell types, and ranks the gene markers based on the greatest distance between credible intervals. I am pretty sure you should be able to just replace the input dataset with the above one, and get an alternative feature list.

Please let me know if anything is not clear.

Gene expression visualisation in leaf cell types at each level.

Figure generation

This one should indicate that the cellsig generated signatures are on par or maybe even better compared to the cibersortx

two panels:

The benchmark error plots

1.1. Do the benchmark plot with some randomly selected markers from the cibersortx signature to balance the size between this and our signature gene list

For each cell type specific genes, for each gene values will be centered and scaled, overlap rest of the genes.

x-axis should have method, y-axis have centered and scaled log(transcript_abundance),

-can be some Density plot/histogram

In the plot, centered scaled counts of each marker-cell types should be in y-axis, and the method should be in x-axis > just geom_beeswarm

Then figure out if cellsig signature genes have more conservative genes compared to the cibersortx genes.

Figure 2

We show that, if we generate the cellsig signatures based on a hierarchical input, we can fix the "no-cell hierarchical/lineage" related signatures of cibersortx method

Cross-validation_without_tree

After generating the markers with non-hierarchical robust approach of "Cellsig", we selected 7 particular cell-types which shows the highest deviation between the samples for the selected markers of that cell-types. So, in order to, see whether the Bayesian inferred markers can reduce the sample-to-sample variations for the markers, we opted for the cross-validation approach with these 7 cell-types data.

In that process, we splited the data into reference and query (3 different sets), where the reference data was used for the bayesian quantiles and the query was used for generating the mixtures.

set_1

#set_2

#set_3

Modularisation

Gene marker identification
modularise the functions into pipelines. Modules should take same type of input and produce same type of output.
Shiny app

user interface

Hello @Kamran-Khan96,

I have a Bayes quantiles as such

# A tibble: 1,449,323 × 10
   symbol      cell_type .feature_idx variable  lower_quantile  `50%` upper_quantile log_mean log_sd file                                            
   <fct>       <fct>            <int> <chr>              <dbl>  <dbl>          <dbl>    <dbl>  <dbl> <chr>                                           
 1 ABCC1       b_memory             1 Y_gen[1]          1038.   2511          5677.    7.81    0.704 dev/modeling_results//level_1_cell_type_b_memor…
 2 ABCD1P5     b_memory             2 Y_gen[2]             0       0             0     0.0462  0.293 dev/modeling_results//level_1_cell_type_b_memor…
 3 ABI1        b_memory             3 Y_gen[3]          5614.  10174.        17182.    9.22    0.561 dev/modeling_results//level_1_cell_type_b_memor…
 4 ABLIM3      b_memory             4 Y_gen[4]             0       2            28     1.27    1.30  dev/modeling_results//level_1_cell_type_b_memor…
 5 AC004866.3  b_memory             5 Y_gen[5]             0       0             0     0.0237  0.195 dev/modeling_results//level_1_cell_type_b_memor…
 6 AC005537.2  b_memory             6 Y_gen[6]            26.9   134           527.    4.84    1.54  dev/modeling_results//level_1_cell_type_b_memor…
 7 AC005682.5  b_memory             7 Y_gen[7]            71.9   317          1645.    5.76    1.34  dev/modeling_results//level_1_cell_type_b_memor…
 8 AC005703.2  b_memory             8 Y_gen[8]             0       0             0     0.0379  0.194 dev/modeling_results//level_1_cell_type_b_memor…
 9 AC005789.9  b_memory             9 Y_gen[9]             0       0             1     0.216   0.603 dev/modeling_results//level_1_cell_type_b_memor…
10 AC006335.11 b_memory            10 Y_gen[10]            0       4            39.1   1.75    1.48  dev/modeling_results//level_1_cell_type_b_memor…
# … with 1,449,313 more rows

But I get this error

sample not found in .data. quantiles should not be forced to have sample column, could you please solve this in the backend.

This is the code that generated the error

 .level <- "root"
    
    tibble(level = .level) %>%
      
      mutate(tt = map(level, ~ .imputed_counts %>%
                        
                        # non-hierarchical methods should only compare leaf cell types
                        filter(!!.cell_type %in% as.phylo(.tree)$tip.label) %>% 
                        
                        # create a root column for pre(.level)
                        mutate(!!as.symbol(.level) := !!.cell_type) %>%
                        mutate(!!as.symbol(pre(.level)) := .level) %>%
                        
                        create_hierarchy_and_calculate_imputation_ratio(.level=.x, .sample=!!.sample, .symbol=!!.symbol)))

I would like this to work

marker_genes = 
  
  count %>%
  
  markers_from_transcription_abundance_quantiles(
    
    
    .sample = sample, 
    .symbol = symbol, 
    .count= count,
    .cell_type = cell_type,
    
    
    quantile_dataset = bayes_quantiles,
    lower_quantile = "10%",
    upper_quantile = "90%",
    
    
    contrast_method = pairwise_contrast,
    selection_method = "silhouette", 
    reduction_method = "PCA", 
    dims=2, 
    discard_number = 10,
    optimisation_method = "penalty",
    is_complete = TRUE
  )

Signature benchmark

show that ratiop_imputed_samples for the bayes and non_bayes datasets is the same for c(cell_type, symbol) pairs.

For benchmark do not filter imputation ratio
for user pipeline do filter for imputation ratio

3- After the full pipeline will be ready, please provide the signature for the attached differentiation tree and your new signature database.

Here the file

tree.zip

See what utilities you might need to use an external data.tree/dendrogram file. Most likely I have functions ready for whatever you need. So as soon as you analyse the problem and the needs (if any), let meet.

2- Design the algorithm such that it takes the input of a transcript abundance data frame(symbol, sample, count, cell_type) and a specified hierarchy. (Find the file in cellsig) output nodal signatures.

In the github README, there is an example of the code of how to do this. I have done this for you.

before that, you should

install dev branch of tidybulk from github
pull the update cellsig from github (do you commits first)
reinstall cellsig (branch data-exploration)

Compare prostate cancer features with unspecific B1000 data

The goal: test the hypothesis whether it is worth to detect signature from disease specific data rather than large non-specific data. To test is a prothesis you can compare the features select it as markers as well as their transcriptional abundance.
cellsig + cibersortX
for B1000 and for prosate cancer pseudobulk
and test 1) feature overlap, 2) reature transcriptional abundance, 3) deconvolution consistency