esebesty / splicingfactory

Splicing Diversity Analysis for Transcriptome Data

Home Page: https://www.bioconductor.org/packages/release/bioc/html/SplicingFactory.html

License: GNU General Public License v3.0

R 100.00%

rna-seq transcriptomics splicing shannon-entropy gini-index simpson-index

splicingfactory's Introduction

SplicingFactory R package

The SplicingFactory R package uses transcript-level expression values to analyze splicing diversity based on various statistical measures, like Shannon entropy or the Gini index. These measures can quantify transcript isoform diversity within samples or between conditions. Additionally, the package analyzes the isoform diversity data, looking for significant changes between conditions.

Installation

You can install SplicingFactory using Bioconductor with:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SplicingFactory")

Alternatively, you can install the latest development version from GitHub with:

install.packages("devtools")
devtools::install_github("esebesty/SplicingFactory")


splicingfactory's Issues

Update example dataset in documentation

Update the example dataset to a more recent one with a larger number of genes and TPM values, where the pre-selected genes showing differential diversity are chosen based on mean difference.

User adjustable pseudocount for Laplace (Dirichlet) entropy

Laplace adds a pseudocount of 1 to all categories. If we allow the user to set the pseudocount, the function becomes a more general Dirichlet entropy calculation, where pseudocount = 1 corresponds to Laplace, pseudocount = 1/2 to Jeffreys, and so on.
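
A minimal sketch of what a user-adjustable pseudocount could look like is given here. The function name dirichlet_entropy and its arguments are hypothetical and not part of the SplicingFactory API; the sketch assumes a plug-in entropy computed on smoothed proportions and a normalization by log2 of the number of transcripts.

```r
# Hypothetical helper, not part of the SplicingFactory API.
# x: non-negative transcript-level expression values (or counts) for one gene.
# pseudocount: value added to every transcript before computing proportions
#   (1 = Laplace, 1/2 = Jeffreys, 0 = no smoothing).
dirichlet_entropy <- function(x, pseudocount = 1, normalize = TRUE) {
  stopifnot(is.numeric(x), all(x >= 0), pseudocount >= 0)
  smoothed <- x + pseudocount
  if (sum(smoothed) == 0) return(NA_real_)  # nothing expressed, no smoothing
  p <- smoothed / sum(smoothed)
  p <- p[p > 0]                             # 0 * log2(0) is treated as 0
  entropy <- -sum(p * log2(p))
  if (normalize && length(x) > 1) {
    entropy <- entropy / log2(length(x))    # scale to the [0, 1] range
  }
  entropy
}

counts <- c(10, 0, 0, 5)
dirichlet_entropy(counts, pseudocount = 0)    # no smoothing
dirichlet_entropy(counts, pseudocount = 1)    # Laplace
dirichlet_entropy(counts, pseudocount = 0.5)  # Jeffreys
```

With a single pseudocount argument, the existing Laplace behaviour would simply be the default value of 1.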

Package info updates

Update package URLs, emails, citation.

Build fails due to example data error

The package build fails because the vignette build fails. Currently only 4 samples are present in the example data instead of 40.

Use SummarizedExperiment rather than ExpressionSet

Switch from ExpressionSet to SummarizedExperiment, as required by the Bioconductor developers.

Add runnable examples to diversity calculation functions

Based on the Bioconductor build reports.

Label shuffling slow

Label shuffling for the significance calculation is slow and needs to be sped up.

Negative Gini index values

If a gene has only two transcripts with the same low expression value (between 0 and 1), we get a negative Gini index: -4.44089209850063e-16. This value appears to be -2 * .Machine$double.eps, which points to floating-point rounding rather than a genuinely negative index.
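
The relation to machine epsilon can be checked directly, and one possible guard is to set values within a small tolerance of zero to exactly zero. This is only a sketch: clamp_near_zero and its default tolerance are assumptions for illustration, not part of the SplicingFactory API.

```r
# The reported value matches -2 * machine epsilon up to printing precision.
reported <- -4.44089209850063e-16
abs(reported / (-2 * .Machine$double.eps) - 1) < 1e-12

# Hypothetical guard, not part of the SplicingFactory API: values whose
# absolute size is below a small tolerance are set to exactly 0, everything
# else (including NA) is left untouched.
clamp_near_zero <- function(x, tol = 8 * .Machine$double.eps) {
  x[!is.na(x) & abs(x) < tol] <- 0
  x
}

clamp_near_zero(c(reported, 0.25, NA, 1))
```

A tolerance-based guard keeps genuinely negative results visible, which a plain pmax(x, 0) would silently hide.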

Implement IHW or similar for controlling FDR

Implement IHW (independent hypothesis weighting) or a similar method to control the FDR, using the average gene expression across samples or the number of transcripts as informative covariates.

Use more recent example dataset

Use data from curatedTCGAData or GenomicDataCommons for the vignette and example dataset, instead of the legacy TCGA data.

Calculate difference - label shuffling error

I encountered the following error when running calculate_difference() with label shuffling:
Error in ecdf(shuffled[i, ]) : 'x' must have 1 or more non-missing values

It turned out that rows (genes) containing only 0 values caused the issue.
It would be good to have a solution or recommendation for handling this problem, for example dropping genes with only 0s before the difference calculation.

A related issue: after calculating diversity, some rows contain a few very small values (below .Machine$double.eps) that are treated as 0s, so these rows also cause errors:

Error in if (ecdf(shuffled[i, ])(log2_fc[i]) >= 0.5) { : 
  missing value where TRUE/FALSE needed
Calls: calculate_difference -> label_shuffling
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I avoided this issue with some pre-filtering:

library(dplyr)

# Set values below machine epsilon to 0:
diversity_data[diversity_data < .Machine$double.eps] <- 0

# Drop rows (genes) whose values sum to 0 across all samples; the sample
# columns are assumed to start with "dataset":
diversity_data_filtered <- diversity_data %>%
  mutate(rowsum = rowSums(dplyr::select(., starts_with("dataset")))) %>%
  filter(rowsum != 0) %>%
  dplyr::select(-rowsum)

SummarizedExperiment for user-facing functions

Ideally, calculate_diversity (and calculate_method) should return a SummarizedExperiment, which then serves as the input for calculate_difference. This way, gene/transcript annotation can be stored in a DataFrame accessible through rowData(), instead of in additional columns of the data.frame that contains the expression/diversity values.

Besides SummarizedExperiment, should calculate_difference also accept a matrix and data.frame as input or do we ask users to create a SummarizedExperiment object?

calculate_difference should return a data.frame, not a SummarizedExperiment object.

Additional significance calculations

Additional ways to calculate significance later:

Bootstrap resampling, if we have enough samples. Do we? We can calculate the bootstrapped confidence interval for the log2 fold change of the category means or medians.
Jackknife for the log2 fold changes.
Bootstrap using the kallisto/salmon/sailfish bootstraps, but here we also need to aggregate the bootstrap values across samples. In the previous method, bootstrap refers to drawing random samples, while here random sets of reads are (roughly) drawn for each sample. The bootstrap values can be used to calculate a 95% CI for the differential diversity results.
Beta regression with a likelihood ratio or Wald test on the normalized entropy values. Something like entropy ~ condition + tpm | tpm vs. entropy ~ tpm | tpm, if we want to take into account the effect of gene expression on entropy.
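
As a rough illustration of the first option listed above, a percentile bootstrap confidence interval for the log2 fold change of one gene could look like the sketch below. The function bootstrap_log2fc_ci, its arguments, and the small pseudo value added to avoid division by zero are all assumptions for illustration, not part of the SplicingFactory API, and the input data are simulated.

```r
# Hypothetical helper, not part of the SplicingFactory API.
# x, y: diversity values of one gene in the two conditions.
# stat: summary function for each condition (mean or median).
bootstrap_log2fc_ci <- function(x, y, n_boot = 1000, conf = 0.95,
                                stat = mean, pseudo = 1e-6) {
  log2_fc <- function(a, b) log2((stat(b) + pseudo) / (stat(a) + pseudo))
  # Resample the samples of each condition with replacement.
  resample <- function(v) v[sample.int(length(v), replace = TRUE)]
  boot <- replicate(n_boot, log2_fc(resample(x), resample(y)))
  alpha <- 1 - conf
  c(estimate = log2_fc(x, y),
    lower = unname(quantile(boot, alpha / 2)),
    upper = unname(quantile(boot, 1 - alpha / 2)))
}

# Simulated diversity values for 20 samples per condition.
set.seed(1)
normal <- runif(20, 0.2, 0.4)
tumor  <- runif(20, 0.5, 0.7)
bootstrap_log2fc_ci(normal, tumor)
```

Passing stat = median gives the median-based variant, and an interval that excludes 0 would suggest a change in diversity between the conditions.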