Giter VIP home page Giter VIP logo

cellsig's People

Contributors

jianwu1 avatar kamran-khan96 avatar stemangiola avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Forkers

kamran-khan96

cellsig's Issues

First explorative tasks

We have 4 parsed Seurat datasets with the columns (if you use tidyseurat you can see this)

.cell, sample, dataset, cell_type

  • Get a table of total cell number per sample per dataset (print out a tibble)
  • Get unique cell type lists for each dataset (print out a tibble, seurat_obj |> distinct(dataset, sample, cell_type))
  • Create pseudobulk for each sample and visualise all samples in a principal component plot (what's the variability within dataset compared to across dataset)
  • Create pseudobulk for each sample/cell_type and visualise all samples in a principal component plot (what's the variability within dataset compared to across dataset and cell types)
  • For each dataset separately, integrate the samples (if needed) and draw UMAP plots (dataset heterogeneity at the single-cell level)

Goals for final figures

Hi @stemangiola,

These are goals we currently have-

Ways to reach the goals for the final figure:

  1. Heatmap with the markers to see if cellular hierarchy has been addressed ( will try to do by 05-08-22)
  • Generate the signature with hierarchy
  • construct the signature matrix
  • create heatmap with the hierarchy-based markers, and compare with non-hierarchy markers
  1. Benchmark with signatures of different sizes (will try to do by 12-08-22)
  • create the signatures with different marker sizes (like the approach previously done for non-hierarchy)
  • benchmark by deconvolution and create boxplots of error values
  1. Benchmark with gene perturbation routine (will try to do by 19-08-22)
  • create signatures after randomly perturbing a number of markers
  • do the benchmark and create boxplot of error values

Measure the difference between variability of Bayes and naive methods.

Hello @XpelC ,

the other initial task (beside harmonizing single-cell data) is to replicate Jian plots. The same analyses will be done on the single-cell-derived data.

The hypothesis is that the Bayes method we developed better models the noise derived by multi-dataset-derived signature. Difference in variability estimation can be caused by underestimation from excessive imputation (of NA non 0s). Other bises can be caused by unbalance datasets size (if a dataset has many similar samples, compared to a dataset with 2 samples, the estimation will be biased toward the first dataset eventhough does not express enough variability).

The goal

  1. We have 10% 90% quantiles for each gene and cell type , from the Bayes dataset, and the for the naive dataset we can calculate 10% 90% quantile interval and compare them with a scatter plot, colouring by whether a gene is imputed (I will explain why the next time we meet)

  2. Produce the same plot for comparing the (log) means.

  3. From (1) and (2) plot boxplot of most distinct genes

Jian Wu already did a lot of work on this, please read the issue linked below and see you understand the goal

#36

Here the two datasets to compare

Bayes, slightly imputed as works with missing information

https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/EYGdKSDTAWxKvYKScXwE2KIBkC3oHQvMLR-jWk94y3uoWA?e=bNlC81

Naive, highly imputed

https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/EXtWtb3KlmdCoG6qN0cFXK4B1iebhq5CiLHyMi3yUfUFCw?e=94dXg1

Defining the clusters

Clean the dataset

  • Eliminate the doublet.

cell type name formatting

  • Two columns: one column is original cell type names, the other column is the formatted names.

  • These name should be lower case, with no space, singular (eliminate inconsistency)

  • Left join the table with the data.

Check the dataset:

  • To see which sample is from cancer patients, which is not (print a table).
  • Check the scRNA sequencing method (10X, SMARTseq2)
  • Check if the count from all dataset are raw count (plot the log + 1 counts densitied ghroup by sample, color by dataset (ggplot))
  • Check which counts are "raw" (integers) and which have decimals, we should avoid to transform the data twice.
  • Color the umap by cancer and normal sample.
  • FeaturePlot: Use the original umap (split by cell marker, and sample).
  • Use features and the markers:
  • "CD14", "FCGR3A", "CD79A", "CD3G" for monocyte, monocyte (NK) , b cell, t cell
  • EPCAM for epithelial
  • VIM for fibroblasts
  • CD31 for endothelial
  • CD68 for monocyte

Cluster the cells separately [NOT NEEDED ANYMORE]

  • Divide the dataset by epithelial cells and non-epithelial cells. Cluster the two groups separately and try to define the cell type of each cluster.
  • Filter epithelial out from the immune-cell dataset; filter immune cells out from epithelial dataset.

cell type decision

  • decide the cell type for each cluster
  • Another stream: Use DCA to see if the cells can be separate better.
  • Check the cell marker of clusters with the following command
    pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
    DoHeatmap(pbmc, features = top10$gene) + NoLegend()

https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

Sanity check

  • color UMAP by cell cycle
  • color UMAP by cell mitochondrial
  • by sample
  • 10x vs smart-seq
  • total RNA

double check:

  • Run another integration (let seurat automatically choose reference)

For your user-oriented pipeline, produce an evaluation function that, given

  1. the original user-provided data set and
  2. your signature output data set

provides a faceted boxplot, of the transcript abundance, for each gene and cell_type.

For example, if you have 4 genes and 4 cell types the plot should include 16 facets and four-gene elements on the x axis for each facet. You can fill by the cell type each marker is for, for 16 facets should be filled with four-colour, and each facet should only have one colour.

Additional feature:
Tell the user how confounded are cell-type pairs for deconvolution. For example, can you distinguish really between CD8 effector memory and central memory?
The method would be to build a mixture of one cells type (0-100%), deconvolute you measure the relative from pure 100%.

If you cannot distinguish between two cells types you would get 50%-50%.

After the signature generation, even after using non-hierarchical approach, the generated signatures show a hierarchical arrangement of the cells.

Maybe the use of a hierarchically arranged tree is causing the trouble? (causing imputation based on the hierarchical arrangement of the cell types because of the tree organization?)

Maybe there are worst cell types which might be resulting in some extreme marker selection? - (probable solution would be to add some new code in the rank bayes module - where instead of taking the mean upper quantile, we would take the worst upper quantile?)

*Maybe using .contrast_method = pairwise_contrast ((rather than mean contrast, with worse-quantile-interval-difference ranking policy) is contributing for the apparent hierarchical marker selection.

image

Try to cluster the datasets with 10 different resolution

Find cluster numbers

  • Run the clustering with 10 different thresholds, then plot UMAP coloured by cluster producing 10 total UMAP, then attach these multipanelled figure into the Github issue project. Use patrchwork to build a faceted plot, with 10 UMAPS.

Trouble shooting

  • Try to set the number of cores to 1
  • Replace mcapply with lapply
  • Select 1 sample and run it with SCT
  • If the three lines above not work, contact the authors in github.

Find alternative algorithm

  • Look the names of some other algorithms, and just record them currently.

Vignette for Bayes

Hello @Kamran-Khan96,

some feedback on the signature selection for Bayes. I think that since Bayes input is quite different from the raw counts input, the main function (or its subfunctions) should be called in the backend, and we should have another cleaner dedicate interface (another function, see below)

In summary, rather than this code in the vignette

library(here)

# Load the functions
source(here("dev/jian_R_files/cellsig_for_kamran.R"), local = knitr::knit_global())

# Load the count data and tree file
count = readRDS(here("dev/jian_R_files/test_count_data.rds"))

# Load tree
tree = yaml.load_file(here("dev/jian_R_files/test_tree.yaml")) %>% 
  as.Node

# Load the bayes count file
bayes = readRDS(here("dev/jian_R_files/test_count_bayes.rds"))  %>%

  # add sample column to bayes file
  inner_join(readRDS(here("dev/jian_R_files/bayes_add_sample.rds")))

signature_bayes = count %>%
  main(.sample = sample, .symbol = symbol, .count = count, .cell_type = cell_type,
       .is_hierarchy=FALSE,
       .tree = tree,
       .contrast_method = pairwise_contrast, .ranking_method = rank_bayes, .bayes= bayes,
       .selection_method = "silhouette", .reduction_method = "PCA", .dims=2, .discard_number = 10,
       .optimisation_method = "penalty",
       .is_complete = TRUE)

We should have

marker_genes = 
  bayes_quantiles %>%
  get_markers_from_transcription_abundance_quantiles(
       .cell_type = cell_type,
       .symbol = symbol, 
       .lower_quantile = `10%`,
       .upper_quantile = `90%`,

      # Secondary arguments
       tree = NULL, # If tree is NULL we know that hierarchy is FALSE, so we simplify the interface
       contrast_method = pairwise_contrast, 
       selection_method = "silhouette", 
       reduction_method = "PCA", 
       dims=2, 
       discard_number = 10,
       optimisation_method = "penalty",
       is_complete = TRUE
  )

Let me know what you think and if you can implement it.

Implement shiny app in the WEHI servers

  • Test shiny app locally with your data
  • test ap locally and improve speed and stability
  • organise the transfer of the app into WEHI server (directory in WEHI machine)
  • test app remotely, and improve performance if needed

Harmonise the datasets

we need to

  1. integrate the raw counts to
  2. cluster the data and
  3. use the existing annotation to label each cluster.
  • I will provide you with this pipeline so you can try to execute it on the WEHI servers. This pipeline takes care of dead cells and doublets and does SCT normalisation with no factors
  • Take each Seurat file, run clustering and plot ribosomal score and mitochondrial score (they will be already ion the datasets), and see if you get cluster based on those scores, then you re normalise based on those factors
  • you will use the output being SCT normalised dataset to integrate at the sample level

Cell annotation

  • Re-annotate the Seurat cluster (resolution 0.6).

  • Look at each single cluster and look at cell types. If the cell types in one cluster do not agree with each other; increase the resolution (discuss with Stefano).

scSHC

  • Pick max 500 cells from each sample, run findCluster with the integrated assay.

Define the batch

  1. Create a new column of the data (csv) with the batch correspondance to each sample. ex: If one dataset does not need internal integration, all samples within that dataset remaining in the same batch. Otherwise, if samples from dataset need integration, they will have their own batch.
  2. Divide the datasets by batch and run the pipeline (save the results in another directory).

1- Pick manually signature size for every node (be your own optimisation method), and test CD4 PCA, does it look good.

Pick manually signature size for every node (be your own optimisation method), and test CD4 PCA, does it look good.

If this is true, we try to build a programmatic method that achieves the same selection.

First, check if increasing penalty achieves the knee identification for optimisation

IF NOT

Change optimisation method

  • kernel smooth (choose the right kernel)
  • and identify 2nd derivatives

what does justify dataset difference

If we do pseudobulk analyses of the samples and cell types we see some clustering of datasets but also some splitting within dataset, can we understand what is the biological category that most causes these differences?

Use/compare new method for signature selection

Hello @Kamran-Khan96 , I completed the Bayes dataset, which should be better generalizable to unknown new query datasets.

https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/EYGdKSDTAWxKvYKScXwE2KIBkC3oHQvMLR-jWk94y3uoWA?e=bNlC81

Jian wrote a testing/feature selection module that instead of using test_differential_abundance simply compares the credible intervals of genes between cell types, and ranks the gene markers based on the greatest distance between credible intervals. I am pretty sure you should be able to just replace the input dataset with the above one, and get an alternative feature list.

Please let me know if anything is not clear.

Figure generation

This one should indicate that the cellsig generated signatures are on par or maybe even better compared to the cibersortx

two panels:

  1. The benchmark error plots

1.1. Do the benchmark plot with some randomly selected markers from the cibersortx signature to balance the size between this and our signature gene list

  1. For each cell type specific genes, for each gene values will be centered and scaled, overlap rest of the genes.

x-axis should have method, y-axis have centered and scaled log(transcript_abundance),

-can be some Density plot/histogram

image

In the plot, centered scaled counts of each marker-cell types should be in y-axis, and the method should be in x-axis > just geom_beeswarm

Then figure out if cellsig signature genes have more conservative genes compared to the cibersortx genes.

Figure 2

We show that, if we generate the cellsig signatures based on a hierarchical input, we can fix the "no-cell hierarchical/lineage" related signatures of cibersortx method

Cross-validation_without_tree

After generating the markers with non-hierarchical robust approach of "Cellsig", we selected 7 particular cell-types which shows the highest deviation between the samples for the selected markers of that cell-types. So, in order to, see whether the Bayesian inferred markers can reduce the sample-to-sample variations for the markers, we opted for the cross-validation approach with these 7 cell-types data.

In that process, we splited the data into reference and query (3 different sets), where the reference data was used for the bayesian quantiles and the query was used for generating the mixtures.

set_1

image

#set_2
image

#set_3
image

Modularisation

Gene marker identification
modularise the functions into pipelines. Modules should take same type of input and produce same type of output.
Shiny app

user interface

Hello @Kamran-Khan96,

I have a Bayes quantiles as such

# A tibble: 1,449,323 × 10
   symbol      cell_type .feature_idx variable  lower_quantile  `50%` upper_quantile log_mean log_sd file                                            
   <fct>       <fct>            <int> <chr>              <dbl>  <dbl>          <dbl>    <dbl>  <dbl> <chr>                                           
 1 ABCC1       b_memory             1 Y_gen[1]          1038.   2511          5677.    7.81    0.704 dev/modeling_results//level_1_cell_type_b_memor2 ABCD1P5     b_memory             2 Y_gen[2]             0       0             0     0.0462  0.293 dev/modeling_results//level_1_cell_type_b_memor3 ABI1        b_memory             3 Y_gen[3]          5614.  10174.        17182.    9.22    0.561 dev/modeling_results//level_1_cell_type_b_memor4 ABLIM3      b_memory             4 Y_gen[4]             0       2            28     1.27    1.30  dev/modeling_results//level_1_cell_type_b_memor5 AC004866.3  b_memory             5 Y_gen[5]             0       0             0     0.0237  0.195 dev/modeling_results//level_1_cell_type_b_memor6 AC005537.2  b_memory             6 Y_gen[6]            26.9   134           527.    4.84    1.54  dev/modeling_results//level_1_cell_type_b_memor7 AC005682.5  b_memory             7 Y_gen[7]            71.9   317          1645.    5.76    1.34  dev/modeling_results//level_1_cell_type_b_memor8 AC005703.2  b_memory             8 Y_gen[8]             0       0             0     0.0379  0.194 dev/modeling_results//level_1_cell_type_b_memor9 AC005789.9  b_memory             9 Y_gen[9]             0       0             1     0.216   0.603 dev/modeling_results//level_1_cell_type_b_memor10 AC006335.11 b_memory            10 Y_gen[10]            0       4            39.1   1.75    1.48  dev/modeling_results//level_1_cell_type_b_memor# … with 1,449,313 more rows

But I get this error

sample not found in .data. quantiles should not be forced to have sample column, could you please solve this in the backend.

This is the code that generated the error

 .level <- "root"
    
    tibble(level = .level) %>%
      
      mutate(tt = map(level, ~ .imputed_counts %>%
                        
                        # non-hierarchical methods should only compare leaf cell types
                        filter(!!.cell_type %in% as.phylo(.tree)$tip.label) %>% 
                        
                        # create a root column for pre(.level)
                        mutate(!!as.symbol(.level) := !!.cell_type) %>%
                        mutate(!!as.symbol(pre(.level)) := .level) %>%
                        
                        create_hierarchy_and_calculate_imputation_ratio(.level=.x, .sample=!!.sample, .symbol=!!.symbol))) 

I would like this to work

marker_genes = 
  
  count %>%
  
  markers_from_transcription_abundance_quantiles(
    
    
    .sample = sample, 
    .symbol = symbol, 
    .count= count,
    .cell_type = cell_type,
    
    
    quantile_dataset = bayes_quantiles,
    lower_quantile = "10%",
    upper_quantile = "90%",
    
    
    contrast_method = pairwise_contrast,
    selection_method = "silhouette", 
    reduction_method = "PCA", 
    dims=2, 
    discard_number = 10,
    optimisation_method = "penalty",
    is_complete = TRUE
  )

Compare prostate cancer features with unspecific B1000 data

  1. The goal: test the hypothesis whether it is worth to detect signature from disease specific data rather than large non-specific data. To test is a prothesis you can compare the features select it as markers as well as their transcriptional abundance.
  2. cellsig + cibersortX
    for B1000 and for prosate cancer pseudobulk
    and test 1) feature overlap, 2) reature transcriptional abundance, 3) deconvolution consistency

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.