Hello,
First off: thanks for the great paper, talks and package. The concept of computing p-values on clusters has been very useful to me.
When using large datasets I sometimes notice that the computation of p-values is very time demanding, however. I therefore profiled where I lose the most time, and found stats::dist
to be the main offender when using
test_hier_clusters_exact(X=X, link="average")
So I played around with alternative ways to yield the same result as stats::dist
but faster. I stumbled across this post by Florian Privé who deserves full credit for the following implementation, which I adapted to return the exact same data as stats::dist
for consistency.
prive_dist = function(X){
# Credit: Florian Privé (https://stackoverflow.com/a/59687204)
suppressWarnings(as.dist(sqrt(outer(rowSums(X^2), rowSums(X^2), '+') - tcrossprod(X, 2 * X))))
}
For large dataset, Privé's implementation is much faster, but performs worse for smaller datasets. See the following:
library(microbenchmark) # used for benchmarking
# which dataset sizes to test for distance computation
N = 10^(2:6)
# wrapper to benchmark a dataset of a given size
times = function(N){
print(N)
X = matrix(runif(N), nrow=sqrt(N))
timings = microbenchmark(stats::dist(X),
prive_dist(X),
times=10)
print(timings, unit="relative")
}
sapply(N, times)
To which the output looks like this
[1] 100
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 1.000000 1.000000 1.000000 1.000000 1.00000 1.00000 10 a
prive_dist(X) 6.718533 6.702482 6.758787 5.977713 5.77667 8.49686 10 b
[1] 1000
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 10 a
prive_dist(X) 1.366609 1.36663 1.411547 1.397399 1.460971 1.445506 10 b
[1] 10000
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 3.707272 7.008658 4.558213 6.926236 6.633612 1.34467 10 b
prive_dist(X) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10 a
[1] 1e+05
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 16.49259 15.07916 16.4001 13.62932 19.25519 14.8057 10 b
prive_dist(X) 1.00000 1.00000 1.0000 1.00000 1.00000 1.0000 10 a
[1] 1e+06
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 73.23724 55.5484 54.38444 54.05382 50.83549 46.74125 10 b
prive_dist(X) 1.00000 1.0000 1.00000 1.00000 1.00000 1.00000 10 a
Arguably, the worse performance in small datasets is negligible given that it would still run fast in absolute terms, whereas the performance gain in larger datasets is massive.
One might want to think about implementing it in a way that you always get best performance. For example like this:
custom_dist = function(X){
if(length(X) < 1000){
return(stats::dist(X))
} else {
return(prive_dist(X))
}
}
If you think this is a good idea I could submit a PR, probably next week.
Thanks again for the package and al your work! Best wishes,
Jesko