
splicingfactory's Introduction

SplicingFactory R package

The SplicingFactory R package uses transcript-level expression values to analyze splicing diversity based on various statistical measures, like Shannon entropy or the Gini index. These measures can quantify transcript isoform diversity within samples or between conditions. Additionally, the package analyzes the isoform diversity data, looking for significant changes between conditions.

Installation

You can install SplicingFactory using Bioconductor with:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SplicingFactory")

Alternatively, you can install the latest development version from GitHub with:

install.packages("devtools")
devtools::install_github("esebesty/SplicingFactory")

splicingfactory's People

Contributors

esebesty, hpages, nturaga, peterszikora, portomi


splicingfactory's Issues

Negative Gini index values

If a gene has only two transcripts with the same low expression value (in the 0-1 range), the Gini index comes out slightly negative: -4.44089209850063e-16. This value seems to be .Machine$double.eps * -2, which suggests floating-point rounding rather than a genuinely negative index.
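A minimal sketch of one way to handle this, assuming a rank-based Gini formula (a common formulation, not necessarily the one SplicingFactory uses). The helper zap_tiny() and its tolerance are hypothetical names chosen for illustration; it sets results that are within a few machine epsilons of zero to exactly 0.

```r
# Rank-based Gini index (one common formulation; an assumption, not
# necessarily the implementation used by the package).
gini_index <- function(x) {
  x <- sort(x)
  n <- length(x)
  (2 * sum(x * seq_len(n)) / sum(x) - (n + 1)) / n
}

# Hypothetical helper: set values within `tol` of zero to exactly 0.
# which() skips NA/NaN entries, so they are left untouched.
zap_tiny <- function(g, tol = 8 * .Machine$double.eps) {
  g[which(abs(g) < tol)] <- 0
  g
}

gini_raw <- gini_index(c(0.3, 0.3))  # may differ from 0 by rounding error
gini_clean <- zap_tiny(gini_raw)     # rounding noise removed
```

Clamping with a tolerance keeps genuinely small positive values intact as long as they are larger than the chosen threshold, so the tolerance should stay close to machine epsilon.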

User adjustable pseudocount for Laplace (Dirichlet) entropy

Laplace adds a pseudocount of 1 to all categories. If we allow the user to set the pseudocount, the function becomes a more general Dirichlet entropy calculation, where pseudocount = 1 gives the Laplace estimate, pseudocount = 1/2 gives the Jeffreys estimate, and so on.
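The idea could be sketched as follows, assuming a plug-in Shannon entropy computed on the smoothed frequencies. The function name dirichlet_entropy() and its arguments are hypothetical and only illustrate the proposal; this is not the package's current code.

```r
# Plug-in Shannon entropy (in bits) after adding `pseudocount` to every
# category. pseudocount = 1 corresponds to Laplace, 1/2 to Jeffreys,
# and 0 to the plain maximum-likelihood estimate.
dirichlet_entropy <- function(counts, pseudocount = 1) {
  smoothed <- counts + pseudocount
  p <- smoothed / sum(smoothed)
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

laplace  <- dirichlet_entropy(c(10, 0), pseudocount = 1)
jeffreys <- dirichlet_entropy(c(10, 0), pseudocount = 0.5)
ml       <- dirichlet_entropy(c(10, 0), pseudocount = 0)
```

With a single expressed transcript, the unsmoothed estimate is 0, while larger pseudocounts pull the estimate towards the uniform distribution.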

Calculate difference - label shuffling error

I encountered the following error when running calculate_difference() with label shuffling:
Error in ecdf(shuffled[i, ]) : 'x' must have 1 or more non-missing values

It turned out that some rows (genes) with only 0 values caused the issue.
It would be nice to find a solution or recommendation for handling this problem (e.g. simply dropping genes with only 0s before the difference calculation, or something else).
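The first error can be reproduced in isolation. The sketch below assumes that the shuffled fold changes of an all-zero gene end up as 0 / 0 = NaN (an assumption about the internals, not confirmed here). The variable names are made up for illustration.

```r
# A gene with zero diversity in every sample gives 0 / 0 = NaN fold changes.
shuffled_row <- log2(c(0, 0, 0) / c(0, 0, 0))

# ecdf() drops missing values, finds nothing left, and stops with an error.
ecdf_message <- tryCatch(
  {
    ecdf(shuffled_row)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Hypothetical guard: only build the ECDF when finite values exist.
has_values <- any(is.finite(shuffled_row))
```

A check like has_values could be used either to skip such genes or to report NA for them instead of stopping the whole run.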

A related issue: after calculating diversity, some rows contain a few very low values (< .Machine$double.eps) that are handled as 0s, so these rows also cause errors:

Error in if (ecdf(shuffled[i, ])(log2_fc[i]) >= 0.5) { : 
  missing value where TRUE/FALSE needed
Calls: calculate_difference -> label_shuffling
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I avoided this issue with some pre-filtering:

library(dplyr)

# Convert values below machine epsilon to 0s:
diversity_data[.Machine$double.eps > diversity_data] <- 0

# Filter out genes (rows) with only zeros in the "dataset*" columns:
diversity_data_filtered <- diversity_data %>%
  mutate(rowsum = rowSums(dplyr::select(., starts_with("dataset")))) %>%
  filter(rowsum != 0) %>%
  dplyr::select(-rowsum)

Label shuffling slow

Label shuffling for the significance calculation is slow. Its speed needs to be improved.
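One possible direction, sketched under the assumption that the cost comes from looping over genes one at a time (the issue does not confirm this): shuffle the labels once per permutation and compute the fold changes for all genes together with rowMeans(). The function name, the pseudocount, and the example matrix are all made up for illustration.

```r
set.seed(1)

# Hypothetical sketch: log2 fold changes of group means for every gene,
# computed on whole matrices for each label permutation.
shuffle_log2fc <- function(x, labels, n_perm = 100, pseudo = 1e-6) {
  stopifnot(ncol(x) == length(labels), length(unique(labels)) == 2)
  groups <- unique(labels)
  vapply(seq_len(n_perm), function(i) {
    perm <- sample(labels)  # permute sample labels
    log2((rowMeans(x[, perm == groups[2], drop = FALSE]) + pseudo) /
         (rowMeans(x[, perm == groups[1], drop = FALSE]) + pseudo))
  }, numeric(nrow(x)))
}

# Made-up diversity values: 3 genes (rows) by 4 samples (columns).
x <- matrix(c(1, 2, 3, 4,
              2, 2, 2, 2,
              5, 1, 5, 1), nrow = 3, byrow = TRUE)
labels <- c("A", "A", "B", "B")
shuffled <- shuffle_log2fc(x, labels)  # genes in rows, permutations in columns
```

The small pseudocount is only there to keep the ratio finite for all-zero rows. Whether that is appropriate for diversity values is a separate decision.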

Update example dataset in documentation

Update the example dataset to a more recent one with a larger number of genes. The pre-selected genes showing differential diversity should be selected based on mean difference, and the expression values should be in TPM.

Use more recent example dataset

Use data from curatedTCGAData or GenomicDataCommons for the vignette and example dataset, instead of the legacy TCGA data.

Additional significance calculations

Additional ways to calculate significance later:

  • Bootstrap resampling, if we have enough samples. Do we? We can calculate the bootstrapped confidence interval for the log2 fold change of the category means or medians.
  • Jackknife for the log2 fold changes.
  • Bootstrap using kallisto/salmon/sailfish bootstraps, but here we also need to aggregate the bootstrap values across samples. In the previous method, the bootstrap refers to drawing random samples, while here we draw random sets of reads (kind of) for each sample. Use bootstrap values to calculate a 95% CI for the differential diversity results.
  • Beta regression with a likelihood ratio or Wald test using the normalized entropy values. Something like entropy ~ condition + tpm | tpm vs entropy ~ tpm | tpm, if we want to take into account the effect of gene expression on entropy.
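The first option in the list above could be sketched as a percentile bootstrap for a single gene, resampling samples within each condition. This is only an illustration under stated assumptions: the function name, the number of resamples, and the diversity values are all made up.

```r
set.seed(42)

# Percentile bootstrap CI for the log2 fold change of group means (one gene).
# Samples are drawn with replacement within each condition.
bootstrap_log2fc_ci <- function(control, treated, n_boot = 1000, level = 0.95) {
  boot_fc <- replicate(n_boot, {
    log2(mean(sample(treated, replace = TRUE)) /
         mean(sample(control, replace = TRUE)))
  })
  alpha <- 1 - level
  quantile(boot_fc, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Made-up diversity values for five samples per condition.
control <- c(0.40, 0.45, 0.42, 0.50, 0.38)
treated <- c(0.80, 0.85, 0.78, 0.90, 0.82)
ci <- bootstrap_log2fc_ci(control, treated)  # lower and upper bound
```

An interval that excludes 0 would point to a difference between conditions, but with only a handful of samples per group the percentile bootstrap can be unreliable, which is the concern raised in the first bullet.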

SummarizedExperiment for user-facing functions

Ideally, calculate_diversity (and calculate_method) should return a SummarizedExperiment that serves as the input for calculate_difference. This way, we can store gene/transcript annotation in a DataFrame accessible through rowData(), instead of in additional columns of the data.frame that contains the expression/diversity values.

Besides SummarizedExperiment, should calculate_difference also accept a matrix and data.frame as input or do we ask users to create a SummarizedExperiment object?

calculate_difference should return a data.frame, not a SummarizedExperiment object.
