karoliskoncevicius / matrixtests

R package for computing multiple hypothesis tests on rows/columns of a matrix or a data.frame

Home Page: https://cran.r-project.org/web/packages/matrixTests/index.html

R 100.00%

r package matrix rows fast hypothesis-testing t-test anova wilcoxon-test

Matrix Tests

A package dedicated to running multiple statistical hypothesis tests on rows and columns of matrices.

Goals

Fast execution via vectorization.
Convenient and detailed output format.
Compatibility with tests implemented in base R.
Careful handling of missing values and edge cases.

Examples

1. Bartlett's test on columns

Bartlett's test on every column of the iris dataset, using Species as groups:

col_bartlett(iris[,-5], iris$Species)

             obs.tot obs.groups var.pooled df statistic                pvalue
Sepal.Length     150          3 0.26500816  2 16.005702 0.0003345076070163084
Sepal.Width      150          3 0.11538776  2  2.091075 0.3515028004158132768
Petal.Length     150          3 0.18518776  2 55.422503 0.0000000000009229038
Petal.Width      150          3 0.04188163  2 39.213114 0.0000000030547839322
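
As a cross-check, the first row of the output above can be reproduced for a single column with base R's bartlett.test(). This is only a sketch of the base R equivalent and does not call matrixTests itself:

```r
# Bartlett's test for one column of iris, using base R only
res <- bartlett.test(iris$Sepal.Length, iris$Species)

res$statistic  # Bartlett's K-squared, about 16.0057
res$parameter  # degrees of freedom, 2
res$p.value    # about 0.0003345
```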

2. Welch t-test on rows

Welch t-test performed on each row of 2 large (million row) matrices:

X <- matrix(rnorm(10000000), ncol = 10)
Y <- matrix(rnorm(10000000), ncol = 10)

row_t_welch(X, Y)  # running time: 2.4 seconds

Confidence interval computations can be turned off for a further increase in speed:

row_t_welch(X, Y, conf.level = NA)  # running time: 1 second
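
To illustrate what "vectorization" means here, the following is a minimal base R sketch of a Welch t-test computed for all rows at once. The helper name welch_rows_sketch is hypothetical, it is not the package's implementation, and it assumes no missing values and at least two observations per row:

```r
# Vectorized Welch t statistic for every row of two matrices (base R only).
welch_rows_sketch <- function(x, y) {
  nx <- ncol(x); ny <- ncol(y)
  mx <- rowMeans(x); my <- rowMeans(y)
  # Row variances from centred values, no per-row loop needed
  vx <- rowSums((x - mx)^2) / (nx - 1)
  vy <- rowSums((y - my)^2) / (ny - 1)
  se2 <- vx / nx + vy / ny
  stat <- (mx - my) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((vx / nx)^2 / (nx - 1) + (vy / ny)^2 / (ny - 1))
  data.frame(statistic = stat, df = df, pvalue = 2 * pt(-abs(stat), df))
}

set.seed(1)
X <- matrix(rnorm(50), ncol = 10)
Y <- matrix(rnorm(50), ncol = 10)
res <- welch_rows_sketch(X, Y)

# Cross-check the first row against base R
ref <- t.test(X[1, ], Y[1, ])
all.equal(res$statistic[1], unname(ref$statistic))
all.equal(res$pvalue[1], ref$p.value)
```

Every step operates on whole matrices or vectors, which is why the cost per row stays small compared with calling t.test() in a loop.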

Available Tests

Variant	Name	Function
Location tests (1 group)	Single sample Student's t.test	`row_t_onesample`
	Single sample Wilcoxon's test	`row_wilcoxon_onesample`
Location tests (2 groups)	Equal variance Student's t.test	`row_t_equalvar`
	Welch adjusted Student's t.test	`row_t_welch`
	Two sample Wilcoxon's test	`row_wilcoxon_twosample`
Location tests (paired)	Paired Student's t.test	`row_t_paired`
	Paired Wilcoxon's test	`row_wilcoxon_paired`
Location tests (2+ groups)	Equal variance oneway anova	`row_oneway_equalvar`
	Welch's oneway anova	`row_oneway_welch`
	Kruskal-Wallis test	`row_kruskalwallis`
	van der Waerden's test	`row_waerden`
Scale tests (2 groups)	F variance test	`row_f_var`
Scale tests (2+ groups)	Bartlett's test	`row_bartlett`
	Fligner-Killeen test	`row_flignerkilleen`
	Levene's test	`row_levene`
	Brown-Forsythe test	`row_brownforsythe`
Association tests	Pearson's correlation test	`row_cor_pearson`
Periodicity tests	Cosinor	`row_cosinor`
Distribution tests	Kolmogorov-Smirnov test	`row_kolmogorovsmirnov_twosample`
Normality tests	Jarque-Bera test	`row_jarquebera`
	Anderson-Darling test	`row_andersondarling`

Further Information

For more information please refer to the project's Wiki page.

See Also

Literature

Computing thousands of test statistics simultaneously in R, Holger Schwender, Tina Müller.
Statistical Computing & Graphics. Volume 18, No 1, June 2007.

Packages

CRAN:

ttests() in the Rfast package.
row.ttest.stat() in the metaMA package.
MultiTtest() in the ClassComparison package.
bartlettTests() in the heplots package.
harmonic.regression() in the HarmonicRegression package.

BioConductor:

lmFit() in the limma package.
rowttests() in the genefilter package.
mt.teststat() in the multtest package.
row.T.test() in the HybridMTest package.
rowTtest() in the viper package.
lmPerGene() in the GSEAlm package.

GitHub:

rowWilcoxonTests() in the sanssouci package.
matrix.t.test() in the pi0 package.
wilcoxauc() in the presto package.

matrixtests's Issues

Investigate potential performance improvements by turning main functions to be column-wise

Right now all functions work on rows by default and column versions transpose the input data and call the corresponding row function.

However, in R column-wise functions should be about 2x faster because matrices are stored in memory in column-major order. It is therefore worth investigating whether making the test functions work on columns instead of rows would give an additional performance boost.
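
A small sketch of the two points above: how R lays out a matrix in memory, and the transpose-then-call-row pattern. The functions row_mean_sketch and col_mean_sketch are hypothetical stand-ins, not package functions:

```r
# R stores matrices in column-major order: elements of a column are adjacent.
m <- matrix(1:6, nrow = 2)
as.vector(m)  # 1 2 3 4 5 6, i.e. column 1, then column 2, then column 3

# The current col_ approach: transpose the input, then work on rows.
# row_mean_sketch stands in for any row_ test function.
row_mean_sketch <- function(x) rowMeans(x)
col_mean_sketch <- function(x) row_mean_sketch(t(x))

x <- matrix(rnorm(20), nrow = 4)
all.equal(col_mean_sketch(x), colMeans(x))
```

The transposed call gives the same answer as a native column-wise function, but pays for the extra copy made by t().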

multiple test correction options?

Is there any way to include a padj column for multiple test correction using the Bonferroni correction or the Benjamini-Hochberg procedure?
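
One possible approach, independent of the package, is to pass the returned p-values to p.adjust() from the stats package. In the sketch below the p-values are computed with base R's bartlett.test() and merely stand in for the pvalue column of a test result; the padj column names are made up for illustration:

```r
# Per-column Bartlett p-values from base R, standing in for the
# pvalue column of a row_/col_ test result.
res <- data.frame(
  pvalue = sapply(iris[, -5], function(col) bartlett.test(col, iris$Species)$p.value)
)

# Adjust afterwards with p.adjust()
res$padj.bonferroni <- p.adjust(res$pvalue, method = "bonferroni")
res$padj.bh <- p.adjust(res$pvalue, method = "BH")
res
```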

Make the behaviour of infinite values consistent

Double-check and synchronize the way different tests handle Inf values.

Right now some of the non-parametric tests treat Inf/-Inf values as the highest/lowest ranks. However, the behaviour for parametric tests is not fully defined, in particular when both Inf and -Inf values are present within the same group.
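
The difference between the two families of tests can be seen with base R alone. This is only an illustration of the underlying arithmetic, not of what any particular test function returns:

```r
x <- c(-Inf, 1.5, 2, Inf)

# Rank-based view: infinite values simply take the lowest and highest ranks
rank(x)  # 1 2 3 4

# Moment-based view: opposite infinities in one group leave the mean undefined
mean(x)  # NaN
```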

Make conf.level=NA suppress confidence interval computation

Consider changing the behaviour of conf.level=NA from throwing an error to suppressing the result of confidence interval. In most cases this should provide a considerable improvement in speed.

Change the name of null argument to be the same across tests

Right now the name of this argument is copied from base R, so it can change from one test to another. As an example: it is called mu in t.test() but ratio in var.test().

Probably should always be named null.
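
For reference, this is how the same concept is spelled in the two base R functions mentioned above:

```r
set.seed(1)
x <- rnorm(10)
y <- rnorm(10)

# The null value is passed as `mu` here...
t.test(x, mu = 1)$null.value
# ...but as `ratio` here
var.test(x, y, ratio = 2)$null.value
```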

Reach 100% coverage

Unit test coverage now is 97%. Only column-wise function variants seem to be missing.

Add Kolmogorov-Smirnov test

Add a KS test to compare distribution of two samples (ks.test() in base R stats package).

For now focus on two sample variant and name it row_kolmogorovsmirnov_twosample().

Later, if necessary, a version that compares a single sample against a named distribution can be added.
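
A slow reference loop such as the one below could serve as a baseline when checking a vectorized version. The helper name ks_rows_reference is hypothetical; only ks.test() and ecdf() from base R are used:

```r
set.seed(1)
X <- matrix(rnorm(30), nrow = 3)
Y <- matrix(rnorm(45, mean = 1), nrow = 3)

# Reference loop: one ks.test() call per pair of rows
ks_rows_reference <- function(x, y) {
  out <- lapply(seq_len(nrow(x)), function(i) {
    res <- ks.test(x[i, ], y[i, ])
    data.frame(statistic = unname(res$statistic), pvalue = res$p.value)
  })
  do.call(rbind, out)
}
ref <- ks_rows_reference(X, Y)
ref

# The statistic itself is the largest gap between the two empirical CDFs
pooled <- c(X[1, ], Y[1, ])
d1 <- max(abs(ecdf(X[1, ])(pooled) - ecdf(Y[1, ])(pooled)))
all.equal(d1, ref$statistic[1])
```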

Make row_ievora self-contained

Currently row_ievora calls row_bartlett and row_t_welch. Both these functions do some preprocessing on their own.

Rewriting row_ievora so that it contains all the code needed to run the tests would improve both the speed and the handling of warnings.

Adjust wilcoxon test warning messages when all the values are constant

Currently row_wilcoxon_twosample does not produce a warning when all the values are constant and equal to the NULL:

row_wilcoxon_twosample(c(1,1,1), c(2,2,2), exact=FALSE, null=1)
# NaN p-value but no warning

In a similar situation row_wilcoxon_onesample() can end up producing three warnings for the same row:

row_wilcoxon_onesample(c(0,0,0))
# 3 separate warnings are produced

And the above issue is also present in a paired version.

Solution: add the warning for the "twosample" case and adjust the warnings so that only one is produced at a time.

Sync the Inf value behaviour of wilcox test with base after R 4.0.0

After R 4.0.0, wilcox.test() will change its behaviour with infinite values. Once R 4.0.0 is out we need to make sure row_wilcox_test() handles Inf values in the same way wilcox.test() does.

Move from Travis to GitHub actions

Travis has been failing for a while now. Instead of spending time fixing it, we can take this opportunity to try out GitHub Actions.

Make warning messages more consistent

Right now some of the warning messages are a bit different between tests. For example:

row_oneway_equalvar: 1 of the rows had essentially constant values.\nFirst occurrence at row 1
row_kruskalwallis: 1 of the rows were essentially constant.\nFirst occurrence at row 1

Integer overflow in case of big matrices

This is a bug reported by @Close-your-eyes via email.

m1 <- matrix(rnorm(100000), nrow=4)
m2 <- matrix(rnorm(1000000), nrow=4)

row_wilcoxon_twosample(m1, m2)

obs.x  obs.y obs.tot  statistic pvalue location.null alternative exact corrected
1 25000 250000  275000 3130434539     NA             0   two.sided FALSE      TRUE
2 25000 250000  275000 3118218180     NA             0   two.sided FALSE      TRUE
3 25000 250000  275000 3141448608     NA             0   two.sided FALSE      TRUE
4 25000 250000  275000 3117295315     NA             0   two.sided FALSE      TRUE
   
Warning messages:
1: In nx * ny : NAs produced by integer overflow
2: In nx * ny : NAs produced by integer overflow

This is caused by storing matrix dimensions as integers - then wilcox test multiplies number of observations from both samples together which leads to integer overflow. Solution is to store those values as numeric.
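
The overflow and the proposed fix can be reproduced in isolation with the sample sizes from the report. The variable names here are illustrative only:

```r
nx <- 25000L   # integer, as returned by ncol()
ny <- 250000L

.Machine$integer.max                   # 2147483647
prod_int <- suppressWarnings(nx * ny)  # NA: 6.25e9 does not fit in an integer
prod_num <- as.numeric(nx) * ny        # 6.25e9, stored as a double

is.na(prod_int)
prod_num
```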

Add Fisher's exact G test

Fisher's exact G test is used to test the null hypothesis of Gaussian white noise against the alternative of an added deterministic periodic component in a timeseries.

In R the test is implemented by the fisher.g.test function in the GeneTS package. It should be useful for detecting hidden periodicities.
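
A minimal single-series sketch of the statistic, assuming the usual definition: g is the largest periodogram ordinate divided by the sum of all ordinates at the Fourier frequencies, with Fisher's exact alternating-sum formula for the p-value. The function name fisher_g_sketch is hypothetical and the code has not been checked against GeneTS; the alternating sum may also become numerically unstable for long series:

```r
# Sketch of Fisher's g statistic and its exact p-value for one series.
fisher_g_sketch <- function(x) {
  n <- length(x)
  m <- floor((n - 1) / 2)                  # number of Fourier frequencies used
  pgram <- Mod(fft(x - mean(x)))[2:(m + 1)]^2 / n
  g <- max(pgram) / sum(pgram)             # share of the largest peak
  j <- seq_len(min(m, floor(1 / g)))
  p <- sum((-1)^(j - 1) * choose(m, j) * (1 - j * g)^(m - 1))
  c(statistic = g, pvalue = min(1, max(0, p)))
}

tt <- 1:50
set.seed(1)
fisher_g_sketch(sin(2 * pi * tt / 10) + rnorm(50, sd = 0.5))  # periodic signal
fisher_g_sketch(rnorm(50))                                     # white noise
```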

Add Boschloo's test (a more powerful alternative for exact Fisher's test)

https://en.wikipedia.org/wiki/Boschloo%27s_test

Add support for sparse matrices

Thanks for this amazing package! I have used it in so many projects already!

I just wanted to check if there are plans to support sparse matrices as well any time soon?

Add Cucconi test

"Cucconi test" is an interesting test for both location and scale.

It seems the null distribution is based on permutations, but that shouldn't be an issue.

Simplify and automate unit testing

Remove testthat dependency
~~Try to automate the parameter testing based on input argument types~~

Add Van der Waerden test

A test for location difference between multiple groups. Seems to be somewhat in the middle between ANOVA and Kruskal-Wallis. Wiki.

Add F test for linear model comparison

Need to come up with a name and an interface. Current proposal:

Name: row_lm_f()
Arguments: 1) data matrix 2) alternative model matrix 3) null model matrix

Have to do a lot of testing in order to make sure it will produce the same output as lm(). Keep track of pivoting.
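
A rough sketch of how the proposed interface could look, following the argument order above. The name row_lm_f_sketch is hypothetical, and the code assumes nested, full-rank design matrices and no missing values, so it sidesteps the pivoting question entirely:

```r
# F test comparing a null and an alternative linear model for every row of y.
row_lm_f_sketch <- function(y, alt, null) {
  fit <- function(design) {
    q <- qr(design)
    # Residuals for all rows at once: observations are the rows of t(y)
    list(rss = colSums(qr.resid(q, t(y))^2), df = ncol(y) - q$rank)
  }
  a <- fit(alt)
  n <- fit(null)
  df1 <- n$df - a$df
  stat <- ((n$rss - a$rss) / df1) / (a$rss / a$df)
  data.frame(df.model = df1, df.residual = a$df, statistic = stat,
             pvalue = pf(stat, df1, a$df, lower.tail = FALSE))
}

set.seed(1)
Y <- matrix(rnorm(60), nrow = 3)          # 3 rows, 20 observations each
covar <- rnorm(20)
alt  <- cbind(1, covar)                   # intercept + covariate
null <- matrix(1, nrow = 20, ncol = 1)    # intercept only
res <- row_lm_f_sketch(Y, alt, null)

# Cross-check the first row against lm() and anova()
ref <- anova(lm(Y[1, ] ~ 1), lm(Y[1, ] ~ covar))
all.equal(res$statistic[1], ref$F[2])
```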

Add Anderson-Darling test for normality

For testing, the results could be compared against ad.test() from the nortest package.

Allow multiple optional parameters

Right now parameters are expanded only in a few special cases. For example when x has 10 rows and y has 1 row - y will be repeated 10 times to match x.

We can add this for other parameters as well, like conf.level. This would allow running same test on the same input with different confidence levels in parallel.
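
The kind of expansion described here can be sketched with plain indexing and rep_len(). The variable names are illustrative and this is not the package's internal code:

```r
x <- matrix(rnorm(50), nrow = 10)
y <- matrix(rnorm(5), nrow = 1)

# Expanding a single-row y to match the 10 rows of x
y_expanded <- y[rep(1, nrow(x)), , drop = FALSE]
dim(y_expanded)

# The same idea applied to a short parameter vector such as conf.level
conf_level <- rep_len(c(0.90, 0.95), nrow(x))
length(conf_level)
```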

Add Spearman's correlation

Hi,
First thanks a lot for developing such an awesome package! It gives me a lot of help!
I noticed that only Pearson correlation is allowed now. Is it possible for you to add Spearman correlation in cor_test?

Thanks!

Yang
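
Since Spearman's coefficient is Pearson's correlation computed on ranks, the estimate can already be obtained by ranking each row first. The sketch below uses base R only and covers the estimate, not the p-value, which cor.test() computes differently for Spearman:

```r
set.seed(1)
X <- matrix(rnorm(30), nrow = 3)
Y <- matrix(rnorm(30), nrow = 3)

# Rank each row; apply() returns the ranks in columns, hence the t()
RX <- t(apply(X, 1, rank))
RY <- t(apply(Y, 1, rank))

# Row-wise Pearson correlation of the ranks, vectorized
cx <- RX - rowMeans(RX)
cy <- RY - rowMeans(RY)
rho <- rowSums(cx * cy) / sqrt(rowSums(cx^2) * rowSums(cy^2))

all.equal(rho[1], cor(X[1, ], Y[1, ], method = "spearman"))
```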

Add Barnard's test (a more powerful alternative for exact Fisher's test)

https://en.wikipedia.org/wiki/Barnard%27s_test

Add Brunner–Munzel test

For now just a place holder as a reminder that such a test might be useful in this package.

Add O'Brien's test

O'Brien's test is yet another test for homogeneity of variances.

Add confidence intervals for wilcoxon tests

Currently row_wilcoxon_* tests do not return the pseudo-median estimates nor confidence intervals. The tests in base R can return confidence intervals when asked: wilcox.test(..., conf.int=TRUE).

The problem is that it's hard to speed up confidence interval calculations for these tests, so for now they are not implemented. They would be a nice addition in the future.
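
For a single sample, the base R behaviour looks as follows. The second half shows, for the exact case without ties, that the reported estimate is the median of all pairwise (Walsh) averages, which hints at why this is expensive to compute for many rows:

```r
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)

res <- wilcox.test(x, conf.int = TRUE, conf.level = 0.95)
res$estimate  # (pseudo)median
res$conf.int

# n * (n + 1) / 2 Walsh averages are needed for one sample of size n
w <- outer(x, x, "+") / 2
walsh <- w[lower.tri(w, diag = TRUE)]
all.equal(unname(res$estimate), median(walsh))
```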

Add benchmarks

It would be nice to try improving the speed further. However, this cannot be done reliably without having benchmarks first.

When calling the functions with matrixTests:: prefix - the warning gets repeated multiple times

To reproduce:

mat <- matrix(c(0,0,1,1,0,0,1,1),nrow=2)
grps <- c(0,1,0,1)
res <- row_oneway_welch(mat,grps)
res <- matrixTests::row_oneway_welch(mat,grps)

Thanks to @matthewcarlucci for pointing this out.

Add Fisher's exact test

Implement Fisher's exact test for a 2x2 case.

Input can be a pair of matrices. For a row-wise case each row of both matrices will have 2 unique levels (logical, factor, character or even numeric).
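
A slow reference sketch of the row-wise 2x2 case for logical inputs, looping over rows and calling base R's fisher.test(). The helper name fisher_rows_sketch is hypothetical and only logical matrices are handled:

```r
# One Fisher's exact test per pair of rows; returns two-sided p-values.
fisher_rows_sketch <- function(x, y) {
  sapply(seq_len(nrow(x)), function(i) {
    # Fixing the levels keeps the table 2x2 even when a row is constant
    tab <- table(factor(x[i, ], levels = c(FALSE, TRUE)),
                 factor(y[i, ], levels = c(FALSE, TRUE)))
    fisher.test(tab)$p.value
  })
}

x <- rbind(c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE))
y <- rbind(c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE))
fisher_rows_sketch(x, y)  # 34/70, about 0.486
```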

Add Hartigan's dip test

Implement dip test. The original R implementation is available in the diptest library.

Add Cuzick test

Add Cuzick test, possibly using PMCMRplus::cuzickTest implementation as reference.

Implement proper column versions

Currently the col_ versions of all tests transpose the inputs and call row_ functions.

This behaviour is undesirable for at least three reasons:

Transposing inputs takes time.
Column-wise functions (like colMeans) are typically faster than their row equivalents.
When inputs have different lengths the functions might throw a warning about rows.

Turn all statistic-related outputs to NA in case of warning

Currently, when a test for a specific row cannot be executed because of corner cases like constant values, only the test statistic, p-value, and confidence intervals are turned to NA. As a result, in some situations, degrees of freedom can still be returned as 0 or even -1, and the standard error is often returned as NaN.

A better approach: turn all outputs computed by the test to NA, except for descriptive statistics like the mean, number of observations, etc.
