matrixtests's Introduction

Matrix Tests

A package dedicated to running multiple statistical hypothesis tests on rows and columns of matrices.

Goals

  1. Fast execution via vectorization.
  2. Convenient and detailed output format.
  3. Compatibility with tests implemented in base R.
  4. Careful handling of missing values and edge cases.
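
To make the first and third goals concrete, here is a minimal base R sketch of a vectorized Welch t-test on rows, checked against t.test() applied one row at a time. The helper name welch_rows is hypothetical and this is not the package's implementation; it assumes complete data with no missing values.

```r
# Hypothetical helper, NOT part of the package: Welch t-test on every row,
# computed with whole-matrix operations instead of a loop over rows.
welch_rows <- function(x, y) {
  nx <- ncol(x)
  ny <- ncol(y)
  mx <- rowMeans(x)
  my <- rowMeans(y)
  # x - mx recycles mx down the columns, so each row is centred by its own mean
  vx <- rowSums((x - mx)^2) / (nx - 1)
  vy <- rowSums((y - my)^2) / (ny - 1)
  se2 <- vx / nx + vy / ny
  stat <- (mx - my) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((vx / nx)^2 / (nx - 1) + (vy / ny)^2 / (ny - 1))
  data.frame(statistic = stat, df = df, pvalue = 2 * pt(-abs(stat), df))
}

set.seed(1)
x <- matrix(rnorm(50), ncol = 10)
y <- matrix(rnorm(50), ncol = 10)
res <- welch_rows(x, y)

# Reference: base R t.test (Welch by default) applied to one row at a time
ref <- sapply(seq_len(nrow(x)), function(i) t.test(x[i, ], y[i, ])$p.value)
stopifnot(isTRUE(all.equal(res$pvalue, ref)))
```

The point of the sketch is only that row summaries such as rowMeans() and rowSums() replace the per-row loop while giving the same p-values as base R.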

Examples

1. Bartlett's test on columns

Bartlett's test on every column of the iris dataset, using Species as groups:

col_bartlett(iris[,-5], iris$Species)
             obs.tot obs.groups var.pooled df statistic                pvalue
Sepal.Length     150          3 0.26500816  2 16.005702 0.0003345076070163084
Sepal.Width      150          3 0.11538776  2  2.091075 0.3515028004158132768
Petal.Length     150          3 0.18518776  2 55.422503 0.0000000000009229038
Petal.Width      150          3 0.04188163  2 39.213114 0.0000000030547839322
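
For comparison, the same quantities can be obtained from base R by looping over columns with bartlett.test(). This is only a cross-check sketch using base R; the layout of the result matrix is an illustrative choice and not the package's output format.

```r
# Cross-check with base R: one bartlett.test() call per column of iris
res <- t(sapply(iris[, -5], function(col) {
  bt <- bartlett.test(col, iris$Species)
  c(df = unname(bt$parameter),
    statistic = unname(bt$statistic),
    pvalue = bt$p.value)
}))
res
```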

2. Welch t-test on rows

Welch t-test performed on each row of two large (million-row) matrices:

X <- matrix(rnorm(10000000), ncol = 10)
Y <- matrix(rnorm(10000000), ncol = 10)

row_t_welch(X, Y)  # running time: 2.4 seconds

Confidence interval computations can be turned off for a further increase in speed:

row_t_welch(X, Y, conf.level = NA)  # running time: 1 second

Available Tests

Variant                     Name                             Function
Location tests (1 group)    Single sample Student's t.test   row_t_onesample
                            Single sample Wilcoxon's test    row_wilcoxon_onesample
Location tests (2 groups)   Equal variance Student's t.test  row_t_equalvar
                            Welch adjusted Student's t.test  row_t_welch
                            Two sample Wilcoxon's test       row_wilcoxon_twosample
Location tests (paired)     Paired Student's t.test          row_t_paired
                            Paired Wilcoxon's test           row_wilcoxon_paired
Location tests (2+ groups)  Equal variance oneway anova      row_oneway_equalvar
                            Welch's oneway anova             row_oneway_welch
                            Kruskal-Wallis test              row_kruskalwallis
                            van der Waerden's test           row_waerden
Scale tests (2 groups)      F variance test                  row_f_var
Scale tests (2+ groups)     Bartlett's test                  row_bartlett
                            Fligner-Killeen test             row_flignerkilleen
                            Levene's test                    row_levene
                            Brown-Forsythe test              row_brownforsythe
Association tests           Pearson's correlation test       row_cor_pearson
Periodicity tests           Cosinor                          row_cosinor
Distribution tests          Kolmogorov-Smirnov test          row_kolmogorovsmirnov_twosample
Normality tests             Jarque-Bera test                 row_jarquebera
                            Anderson-Darling test            row_andersondarling

Further Information

For more information please refer to the Wiki page:

  1. Installation Instructions
  2. Design Decisions
  3. Speed Benchmarks
  4. Bug Fixes and Improvements to Base R

See Also

Literature

Computing thousands of test statistics simultaneously in R, Holger Schwender, Tina Müller.
Statistical Computing & Graphics. Volume 18, No 1, June 2007.

Packages

CRAN:

  1. ttests() in the Rfast package.
  2. row.ttest.stat() in the metaMA package.
  3. MultiTtest() in the ClassComparison package.
  4. bartlettTests() in the heplots package.
  5. harmonic.regression() in the HarmonicRegression package.

BioConductor:

  1. lmFit() in the limma package.
  2. rowttests() in the genefilter package.
  3. mt.teststat() in the multtest package.
  4. row.T.test() in the HybridMTest package.
  5. rowTtest() in the viper package.
  6. lmPerGene() in the GSEAlm package.

GitHub:

  1. rowWilcoxonTests() in the sanssouci package.
  2. matrix.t.test() in the pi0 package.
  3. wilcoxauc() in the presto package.

matrixtests's People

Contributors

karoliskoncevicius

matrixtests's Issues

Investigate potential performance improvements by making the main functions column-wise

Right now all functions work on rows by default, and the column versions transpose the input data and call the corresponding row function.

However, in R column-wise functions should be about 2x faster because of how matrices are stored in memory. It is therefore worth investigating whether making the test functions work on columns instead of rows could give an additional boost to performance.
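
A small base R sketch of the storage argument: R keeps matrices in column-major order, so the elements of one column sit next to each other in memory. The timing calls are included only so the comparison can be tried locally; no timings are claimed here because they depend on the machine.

```r
# Matrices are stored column by column
x <- matrix(1:6, nrow = 2)
as.vector(x)  # column 1, then column 2, then column 3

# Row-wise and column-wise summaries agree once the input is transposed
set.seed(1)
m <- matrix(rnorm(2e6), ncol = 10)
mt <- t(m)
same <- isTRUE(all.equal(rowMeans(m), colMeans(mt)))

# Machine-dependent timings, shown for local experimentation only
system.time(rowMeans(m))
system.time(colMeans(mt))
```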

Make the behaviour of infinite values consistent

Double-check and synchronize the way different tests handle Inf values.

Right now some of the non-parametric tests treat Inf/-Inf values as the highest/lowest ranks. However, the behaviour of parametric tests is not fully defined, in particular when both Inf and -Inf values are present within the same group.

  • t_equalvar
  • t_paired
  • t_onesample
  • t_welch
  • cor_pearson
  • f_var
  • oneway_equalvar
  • oneway_welch
  • kruskalwallis
  • waerden
  • bartlett
  • jarquebera
  • flignerkilleen
  • levene
  • brownforsythe
  • wilcoxon_onesample
  • wilcoxon_paired
  • wilcoxon_twosample
  • cosinor
  • andersondarling
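
The difference between rank-based and moment-based handling of infinite values can be seen with plain base R calls. This sketch only shows the base R behaviour that motivates the issue; it does not describe how any of the listed functions currently behave.

```r
x <- c(-Inf, 1, 2, Inf)

# Rank-based statistics have a natural answer: infinities take the extreme ranks
r <- rank(x)

# Moment-based statistics do not: Inf and -Inf in the same group give NaN
m_both <- mean(x)

# A single infinity gives an infinite mean and an undefined variance
m_one <- mean(c(1, 2, Inf))
v_one <- var(c(1, 2, Inf))
```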

Make conf.level=NA suppress confidence interval computation

Consider changing the behaviour of conf.level=NA from throwing an error to skipping the confidence interval computation and returning NA for the interval bounds. In most cases this should provide a considerable improvement in speed.

  • t_equalvar
  • t_paired
  • t_onesample
  • t_welch
  • cor_pearson
  • f_var
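
One possible shape for this behaviour, sketched in base R for the one-sample t-test. The function name t_onesample_sketch and its output columns are hypothetical and chosen for illustration; the sketch assumes complete data and is not the package's code.

```r
# Hypothetical sketch: skip the confidence interval when conf.level is NA
t_onesample_sketch <- function(x, null = 0, conf.level = 0.95) {
  n <- ncol(x)
  m <- rowMeans(x)
  se <- sqrt(rowSums((x - m)^2) / (n - 1) / n)
  stat <- (m - null) / se
  res <- data.frame(mean = m, statistic = stat,
                    pvalue = 2 * pt(-abs(stat), n - 1),
                    conf.low = NA_real_, conf.high = NA_real_)
  if (!is.na(conf.level)) {
    # Only pay for the quantile computation when an interval is requested
    half <- qt(1 - (1 - conf.level) / 2, n - 1) * se
    res$conf.low <- m - half
    res$conf.high <- m + half
  }
  res
}

set.seed(1)
x <- matrix(rnorm(40), nrow = 4)
with_ci <- t_onesample_sketch(x)
no_ci <- t_onesample_sketch(x, conf.level = NA)
ref_ci <- as.numeric(t.test(x[1, ])$conf.int)
```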

Reach 100% coverage

Unit test coverage is now 97%. Only the column-wise function variants seem to be missing.

Add Kolmogorov-Smirnov test

Add a KS test to compare distribution of two samples (ks.test() in base R stats package).

For now focus on two sample variant and name it row_kolmogorovsmirnov_twosample().

Later, if necessary, a version that compares a single sample against a named distribution can be added.
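
As a reference point for the two-sample variant, the base R result can be produced row by row with ks.test(). This is a slow loop intended only as something a vectorized version could be compared against; the output layout is an illustrative assumption.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 4)
y <- matrix(rnorm(40, mean = 1), nrow = 4)

# One ks.test() call per row; results collected as a 4 x 2 matrix
ks_rows <- t(sapply(seq_len(nrow(x)), function(i) {
  res <- ks.test(x[i, ], y[i, ])
  c(statistic = unname(res$statistic), pvalue = res$p.value)
}))
ks_rows
```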

Make row_ievora self-contained

Currently row_ievora calls row_bartlett and row_t_welch. Both of these functions do some preprocessing on their own.

Rewriting row_ievora so that it contains all the code needed to run the tests would improve the speed and the handling of warnings.

Adjust wilcoxon test warning messages when all the values are constant

Currently row_wilcoxon_twosample does not produce a warning when all the values are constant and equal to the null value:

row_wilcoxon_twosample(c(1,1,1), c(2,2,2), exact=FALSE, null=1)
# NaN p-value but no warning

In a similar situation row_wilcoxon_onesample() can end up producing three warnings for the same row:

row_wilcoxon_onesample(c(0,0,0))
# 3 separate warnings are produced

The same issue is also present in the paired version.

Solution: add the warning for the "twosample" case and adjust the warnings so that only one is produced at a time.

Make warning messages more consistent

Right now some of the warning messages are a bit different between tests. For example:

row_oneway_equalvar: 1 of the rows had essentially constant values.\nFirst occurrence at row 1
row_kruskalwallis: 1 of the rows were essentially constant.\nFirst occurrence at row 1
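
One way to keep the wording uniform would be to build every message through a single formatter. The helper below is a hypothetical sketch: its name and arguments are assumptions, and the message layout simply follows the two examples above.

```r
# Hypothetical helper: every test would build its warning text through this
format_row_warning <- function(fname, rows, reason) {
  sprintf("%s: %d of the rows %s.\nFirst occurrence at row %d",
          fname, length(rows), reason, rows[1])
}

msg <- format_row_warning("row_kruskalwallis", 1L, "were essentially constant")
cat(msg, "\n")
```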

Integer overflow in case of big matrices

This is a bug reported by @Close-your-eyes via email.

m1 <- matrix(rnorm(100000), nrow=4)
m2 <- matrix(rnorm(1000000), nrow=4)

row_wilcoxon_twosample(m1, m2)

  obs.x  obs.y obs.tot  statistic pvalue location.null alternative exact corrected
1 25000 250000  275000 3130434539     NA             0   two.sided FALSE      TRUE
2 25000 250000  275000 3118218180     NA             0   two.sided FALSE      TRUE
3 25000 250000  275000 3141448608     NA             0   two.sided FALSE      TRUE
4 25000 250000  275000 3117295315     NA             0   two.sided FALSE      TRUE

Warning messages:
1: In nx * ny : NAs produced by integer overflow
2: In nx * ny : NAs produced by integer overflow

This is caused by storing the matrix dimensions as integers. The Wilcoxon test multiplies the numbers of observations in the two samples (nx * ny), and for inputs of this size the product exceeds the largest representable integer, which leads to integer overflow. The solution is to store those values as numeric.
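
The overflow can be reproduced in base R with the observation counts from the report. The variable names are illustrative; only the arithmetic is the point.

```r
nx <- 25000L
ny <- 250000L

# Integer arithmetic: 6.25e9 exceeds .Machine$integer.max, so R returns NA
prod_int <- suppressWarnings(nx * ny)

# Converting one operand to numeric (double) avoids the overflow
prod_num <- as.numeric(nx) * ny
```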

Add Fisher's exact G test

Fisher's exact G test is used to test the null hypothesis of Gaussian white noise against the alternative of an added deterministic periodic component in a time series.

In R the test is implemented by the fisher.g.test function in the GeneTS package. It should be useful for detecting hidden periodicities.

Add support for sparse matrices

Thanks for this amazing package! I have used it in so many projects already!

I just wanted to check if there are plans to support sparse matrices as well any time soon?

Add Cucconi test

"Cucconi test" is an interesting test for both location and scale.

It seems that the null distribution is based on permutations, but that shouldn't be an issue.

Add Van der Waerden test

A test for location difference between multiple groups. Seems to be somewhat in the middle between ANOVA and Kruskal-Wallis. Wiki.

Add F test for linear model comparison

Need to come up with a name and an interface. Current proposal:

  • Name: row_lm_f()
  • Arguments: 1) data matrix 2) alternative model matrix 3) null model matrix

Have to do a lot of testing in order to make sure it will produce the same output as lm(). Keep track of pivoting.
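
A rough base R sketch of how the proposed interface could be computed for all rows at once, using lm.fit() with a matrix response and checking one row against anova() on two lm() fits. The function name row_lm_f_sketch and its output columns are hypothetical, and the sketch assumes nested, full-rank model matrices and complete data, so it sidesteps the pivoting concerns mentioned above.

```r
# Hypothetical sketch: F test comparing a null and an alternative design
# for every row of `data` (rows are tests, columns are observations)
row_lm_f_sketch <- function(data, alt, null) {
  fit1 <- lm.fit(alt, t(data))   # one response column per row of data
  fit0 <- lm.fit(null, t(data))
  rss1 <- colSums(fit1$residuals^2)
  rss0 <- colSums(fit0$residuals^2)
  df1 <- fit1$rank - fit0$rank
  df2 <- ncol(data) - fit1$rank
  stat <- ((rss0 - rss1) / df1) / (rss1 / df2)
  data.frame(df.model = df1, df.residual = df2,
             statistic = unname(stat),
             pvalue = unname(pf(stat, df1, df2, lower.tail = FALSE)))
}

set.seed(1)
data <- matrix(rnorm(60), nrow = 3)   # 3 rows, 20 observations each
g <- gl(2, 10)
age <- rnorm(20)
alt <- model.matrix(~ age + g)
null <- model.matrix(~ age)
res <- row_lm_f_sketch(data, alt, null)

# Reference for the first row using lm() and anova()
ref <- anova(lm(data[1, ] ~ age), lm(data[1, ] ~ age + g))
ref_p <- ref[["Pr(>F)"]][2]
```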

Allow multiple optional parameters

Right now parameters are expanded only in a few special cases. For example, when x has 10 rows and y has 1 row, y will be repeated 10 times to match x.

We can add this for other parameters as well, such as conf.level. This would allow running the same test on the same input with different confidence levels in parallel.
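
The kind of expansion described here can be sketched with base R indexing. The variable names are illustrative assumptions and the snippet does not reflect how the package expands its arguments internally.

```r
set.seed(1)
x <- matrix(rnorm(50), nrow = 10)  # 10 rows
y <- matrix(rnorm(5), nrow = 1)    # 1 row

# Repeat the single row of y so that it matches the 10 rows of x
y_full <- y[rep(1, nrow(x)), , drop = FALSE]

# The same idea for a parameter: one input row, three confidence levels
conf.level <- c(0.90, 0.95, 0.99)
x_rep <- x[rep(1, length(conf.level)), , drop = FALSE]
# qt() is vectorized, so one call gives a critical value per confidence level
crit <- qt(1 - (1 - conf.level) / 2, df = ncol(x_rep) - 1)
```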

Add Spearman's correlation

Hi,
First thanks a lot for developing such an awesome package! It gives me a lot of help!
I noticed that only pearson correlation is allowed now, is it possible for you to add spearman correlation in cor_test?

Thanks!

Yang

Add confidence intervals for wilcoxon tests

Currently row_wilcoxon_* tests do not return the pseudo-median estimates nor confidence intervals. The tests in base R can return confidence intervals when asked: wilcox.test(..., conf.int=TRUE).

The problem is that it is hard to speed up the confidence interval calculations for these tests, so for now they are not implemented. They would be a nice addition in the future.
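
For reference, this is what the base R output looks like when computed one row at a time. It is the slow per-row loop that a vectorized version would need to match; the output column names are illustrative assumptions.

```r
set.seed(1)
x <- matrix(rnorm(30), nrow = 3)

# One wilcox.test() call per row, keeping the estimate and interval bounds
ci <- t(apply(x, 1, function(r) {
  res <- wilcox.test(r, conf.int = TRUE)
  c(pseudomedian = unname(res$estimate),
    conf.low = res$conf.int[1],
    conf.high = res$conf.int[2])
}))
ci
```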

Add benchmarks

It would be nice to try improving the speed further. However, this cannot be done reliably without having benchmarks first.

Add Fisher's exact test

Implement Fisher's exact test for a 2x2 case.

Input can be a pair of matrices. For a row-wise case each row of both matrices will have 2 unique levels (logical, factor, character or even numeric).
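
A per-row reference for this input format can be written with base R's table() and fisher.test(). This is only a sketch of the expected results for logical inputs; fixing the factor levels is an assumption made here so that every row yields a 2x2 table even when a level is absent.

```r
x <- rbind(c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
           c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
y <- rbind(c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE),
           c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE))

lv <- c(FALSE, TRUE)  # fixed levels keep the table 2x2
pvals <- sapply(seq_len(nrow(x)), function(i) {
  tab <- table(factor(x[i, ], levels = lv), factor(y[i, ], levels = lv))
  fisher.test(tab)$p.value
})
pvals
```

In the second row x and y always disagree, so the table has zeros on the diagonal and the two-sided p-value is 2 / choose(8, 4).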

Add Cuzick test

Add Cuzick test, possibly using PMCMRplus::cuzickTest implementation as reference.

Implement proper column versions

Currently the col_ versions of all tests transpose the inputs and call row_ functions.

This behaviour is undesirable for at least three reasons:

  1. Transposing inputs takes time.
  2. Column-wise functions (like colMeans) are typically faster than their row equivalents.
  3. When inputs have different lengths, the functions might throw a warning that mentions rows.

  • row_t_onesample(x)
  • row_t_welch(x, y)
  • row_t_equalvar(x, y)
  • row_t_paired(x, y)
  • row_wilcoxon_onesample(x)
  • row_wilcoxon_twosample(x, y)
  • row_wilcoxon_paired(x, y)
  • row_cor_pearson(x, y)
  • row_oneway_welch(x, g)
  • row_oneway_equalvar(x, g)
  • row_kruskalwallis(x, g)
  • row_f_var(x, y)
  • row_bartlett(x, g)
  • row_flignerkilleen(x, g)
  • row_jarquebera(x)

Turn all statistic-related outputs to NA in case of warning

Currently, when a test for a specific row cannot be executed because of special corner cases such as constant values, only the test statistic, p-value, and confidence intervals are turned to NA. As a result, in some situations the degrees of freedom can still be returned as 0 or even -1, and the standard error is often returned as NaN.

A better approach is to turn all outputs computed by the test to NA, except for descriptive statistics such as the mean and the number of observations.
