TGuillerme / dispRity
Measuring disparity with R
License: GNU General Public License v3.0
Come up with a multidimensional version of bhatt.coeff.
time.subsamples and custom.subsamples should generate the $call$dimensions element on the dispRity object.
Cryptic unit test warnings to sort:
coercing argument of type 'double' to logical
the condition has length > 1 and only the first element will be used
This can work by adding more boot.type options to boot.matrix based on an input matrix with elements in rows and probabilities in columns (e.g. boot.matrix(matrix, type = personalised_matrix)).
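A minimal sketch of what a probability-weighted bootstrap could look like in base R. The weighted_bootstrap helper and the probabilities vector are hypothetical names used for illustration and are not part of boot.matrix; the sketch only shows resampling rows with per-element probabilities through sample(..., prob = ).

```r
# Hypothetical helper: resample the rows of a matrix with replacement,
# drawing each element (row) according to its own sampling probability.
weighted_bootstrap <- function(data, probabilities) {
  stopifnot(nrow(data) == length(probabilities))
  rows <- sample(seq_len(nrow(data)), size = nrow(data),
                 replace = TRUE, prob = probabilities)
  data[rows, , drop = FALSE]
}

set.seed(1)
# Toy ordinated space: 5 elements, 3 dimensions
space <- matrix(rnorm(15), nrow = 5, dimnames = list(letters[1:5], NULL))
# One probability per element; sample() rescales them to sum to 1
probabilities <- c(0.4, 0.3, 0.1, 0.1, 0.1)
boot_space <- weighted_bootstrap(space, probabilities)
dim(boot_space)
```

The bootstrapped matrix keeps the original dimensions; only the row composition changes.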
I've made a start at this, but it'll take a while as it's quite long and complicated! I think the vignettes need an overhaul, some details below and on the document after #NC: (I know these aren't proper comments in Rmd but I was trying to be faster!). I reckon maybe rather than doing this all in one hit I might need to take a function a day or something, as it's making my head hurt. But I can work through it all slowly.
I got as far as check_morpho_fun today
Some comments below.
README - vignette info, also short description of disparity.
Note that DESCRIPTION is a little more specific to disparity - maybe echo this in the README, but add that it can be used for ecological stuff too? Might make uses more obvious?
patch notes - I haven't edited as these are technical and personal anyway.
disparity_object
| | ---[[...]] = class:"list" (the following rarefactions)
| | | |
| | ---[[...]] = class:"numeric" (the bootstraps)
| |
| ---[[...]] = class:"list" (the following series)
| |
| ---$elements* = class:"matrix" (a one column matrix containing the elements within this series)
| |
| ---[[...]] = class:"matrix" (the rarefactions)
This bit gets confusing - maybe delete and just mention that of course this continues to the end of the list of series.
\---$disparity
|
\---[[2]] = class:"list" (the first series)
??? Also this is repeated for the second iteration, I think this is an error.
also what is disparity[[1]]?
Need to complete this for get.dispRity or extract.dispRity
I'll put more general comments as an issue...
Allow the function to run without a phylogeny but with dates (will ignore the models and the nodes arguments).
data(BeckLee_mat50)
data(BeckLee_tree)
crown_stem <- crown.stem(BeckLee_tree, inc.nodes = FALSE)
subset_crown_stem <- custom.subsets(BeckLee_mat50, group = crown_stem)
disp_crown_stem <- dispRity(subset_crown_stem, metric = c(median, centroids))
plot(disp_crown_stem)
The options don't make sense to me
If it's a number > 0, it keeps that number of dimensions, and if < 0 it removes the last dimensions by that %? This is very odd. Please explain more clearly
It's worth thinking about the purpose of a vignette. I've always thought they were there to help people run analyses, and understand the method. Yours seems very focussed in places on how the programming was done and how things work on a technical level, with less scientific focus. Check out the caper vignette for what I think is a nice example. I'd perhaps think about what people really need to know, and maybe put the programmatic stuff into a separate document, while bringing the separate examples into one main manual - this would reduce unnecessary repetition. Also remember the vignette goes with the package help files so doesn't need to repeat information too much.
Maybe you could start with a page of general background to disparity, and the overlap across disciplines. I'm thinking this would be almost the same as the intro to the paper, so easy to write. You can then move on to some specific examples using your datasets, then into the detail of the metrics and making custom metrics, then decide what details are still needed to make it possible to use the package from the rest of the vignette? This is mostly rearranging stuff, and a bit of writing that can be used for the paper too so not time wasted.
It will be long, but that's not a bad thing for a vignette.
See the vignette for more detailed comments at least towards the start
Modify the dispRity function to keep distance matrices if elements are missing using a keep.distance = TRUE option.
With this option, the row selection information from boot.matrix, custom.subsets or chrono.subsets is also passed for the columns (dimensions is ignored).
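To illustrate the idea of passing the row selection to the columns as well, here is a rough base R sketch. It does not use any dispRity internals; the selected vector stands in for whatever rows a subset or bootstrap would pick.

```r
set.seed(2)
# Toy data: 4 elements in 3 dimensions, turned into a square distance matrix
traits <- matrix(rnorm(12), nrow = 4,
                 dimnames = list(paste0("t", 1:4), NULL))
dist_mat <- as.matrix(dist(traits))

# Hypothetical row selection (e.g. elements present in one subset)
selected <- c("t1", "t3", "t4")

# Apply the same selection to rows AND columns so the result
# is still a valid (square, symmetric) distance matrix
sub_dist <- dist_mat[selected, selected, drop = FALSE]
dim(sub_dist)
```

Selecting rows only would leave a non-square matrix that is no longer a distance matrix, which is why the column selection has to follow the rows.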
Error when installing caper?
devtools::install_github("TGuillerme/dispRity", ref = "release")
Downloading GitHub repo TGuillerme/dispRity@release
from URL https://api.github.com/repos/TGuillerme/dispRity/zipball/release
Installing dispRity
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/caper_0.5.2.tgz'
Content type 'application/x-gzip' length 1274260 bytes (1.2 MB)
==================================================
downloaded 1.2 MB
Installing caper
Error in FUN(X[[i]], ...) :
Invalid comparison operator in dependency: >=
- dispRity does not work with wrong metrics combinations (c(var, mean))
- centroids does not work as stand alone with optional centroid as a vector (centroid = c(0,0,0))
- sort.dispRity bugged (see desiderata)
I've made this comment in the manual too - but aov is not how you do an ANOVA unless you have a balanced design. You basically never do, so always use lm instead. ANOVA is just a type of linear model
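As a quick illustration of the lm route suggested here, this sketch uses the PlantGrowth dataset that ships with R and drops a few rows to make the design unbalanced (the choice of rows is arbitrary and only for illustration).

```r
# Make the built-in PlantGrowth data unbalanced (8, 9 and 10 plants per group)
pg <- PlantGrowth[-c(1, 2, 11), ]
table(pg$group)

# Fit the one-way model as a linear model
fit <- lm(weight ~ group, data = pg)

# ANOVA table for the fitted linear model
anova_table <- anova(fit)
anova_table

# Coefficients (group contrasts against the first level)
summary(fit)
```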
Include the metric_list from dispRity in the dispRity object when available.
Bit puzzled by the offspring/descendant terminology - I'd assume these were the same thing. By descendant do you mean ancestor? I'm confused.
model.test.sim
Mark:
model.test.sim(..., cent.tend = median).
We need to add a tutorial in inst/gitbook/03_specific-tutorials.Rmd after the section "### null morphospace testing with null.test".
This section should cover:
print
summary
plot
Make sure that model.test and model.test.sim have full test coverage:
Thomas check for:
Mark check for:
A # 3D array of Procrustes coordinates
pxk <- dim(A)[1] * dim(A)[2]
n <- dim(A)[3]
tmp <- aperm(A, c(3, 2, 1))
dim(tmp) <- c(n, pxk)
rownames(tmp) <- dimnames(A)[[3]] # tmp is a 2D matrix
X <- prcomp(tmp)$x # ordination matrix
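The same reshaping can be checked on a small simulated array. This sketch assumes the usual landmarks x dimensions x specimens layout; the array here is random data, not real Procrustes coordinates.

```r
set.seed(4)
p <- 4  # landmarks
k <- 2  # dimensions per landmark
n <- 3  # specimens

# Simulated stand-in for a 3D array of Procrustes coordinates
A <- array(rnorm(p * k * n), dim = c(p, k, n),
           dimnames = list(NULL, c("x", "y"), paste0("specimen", 1:n)))

# Put specimens first, then flatten landmarks and dimensions into columns
tmp <- aperm(A, c(3, 2, 1))
dim(tmp) <- c(n, p * k)
rownames(tmp) <- dimnames(A)[[3]]

# The first k columns of a row are the coordinates of landmark 1
first_landmark_ok <- isTRUE(all.equal(tmp[1, 1:k], unname(A[1, , 1])))
first_landmark_ok

# Ordination of the flattened matrix
X <- prcomp(tmp)$x
dim(X)
```

Each row of tmp is one specimen, with columns ordered x1, y1, x2, y2 and so on.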
In get.contrast.matrix you say you set the diagonal to 0 but the code sets it to 1, as far as I can see - maybe check this?
I hate this phrase - sounds like you're cleaning a toilet. I'd normally use data cleaning or something. Keep it if you want but I needed to voice my revulsion 🚽
Remove matrix size restrictions in dispRity family functions.
test.metric
reduce.space
multi.ace
multi.ace testing with the new version of castor (bugged?)
dispRity output option for multi.ace
fit.multi.ace
multi.ace
)make.test
utilities_old
zzz
Rather than add this to the document you're working on, I'll paste it here. This is a rough outline. Fill in the details with what you've already written!
Biological data are complex; understanding the ecology and evolution of species often requires that we analyse multiple variables that covary with each other, and through space and time.
Multivariate analyses aim to capture and incorporate this multidimensional complexity, while providing outputs that are interpretable without needing to visualise n-dimensions simultaneously.
Key multidimensional features of species that have important roles in ecology and evolution include morphology, functional traits, etc.
Many of these can be represented as matrices, which can be used in our package.
In the interests of clarity, we here focus on just one kind of data: morphological diversity.
In evolutionary biology and palaeontology, morphological diversity is often referred to as disparity.
DESCRIBE COOL USES OF DISPARITY ANALYSES. WHY IS DISPARITY AWESOME.
Although disparity is commonly studied, its definition varies widely across authors due to the myriad ways it can be measured.
LIST WAYS IT IS MEASURED.
This is further complicated by confusion over whether disparity refers to the metric describing or summarising one or more aspects of this space, or to the multidimensional space itself, which is a specific mathematical object.
This multidimensional space can also be defined in multiple ways.
LIST ways
In theory, this multitude of ways to measure/define disparity is not an issue.
The ideal solution is to choose either the most appropriate method for your question or data, or to apply several definitions and compare across these.
Unfortunately, in practice this is hampered by existing software implementations.
Package maintainers/software developers choose their preferred definition of disparity and method for measuring it, and then enforce this on users by allowing zero flexibility.
This can lead to a chain of inappropriate analyses led by everyone just using the package that exists and the method within it, and in the worst case, using the defaults of the package. Oh no!
The dispRity package will solve this problem by providing a completely flexible framework for studying disparity.
It implements all commonly used metrics and definitions of disparity, as well as providing a simple interface for users to implement their own disparity functions.
The package is described here for use with morphological diversity data, but the functions are equally applicable to ecological contexts such as Diaz paper, Deirdre's paper etc.
clean.data$dropped.rows should be proper NA class when no dropped rows.
Not sure if you want to change level to dimension-level in these functions?
As used in Brusatte et al. (2014, Current Biology, 24, 2386-):
we used permutation tests to perform pairwise comparisons of group means in R and PAST. These comparisons tested the equality of multivariate means (based on the 125 recovered PCO axes, which comprise 90% of total variance) of two designated groups (for example, avialans versus deinonychosaurs). The Mahalanobis distance between the two group means was calculated and compared with a null distribution of between-group distances obtained by random permutation of the group labels.
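A rough sketch of the kind of permutation test described in the quote, using simulated scores rather than real PCO axes. The group_distance helper is a hypothetical name, and using the covariance of all scores for the Mahalanobis distance is an assumption made here for simplicity; the original study may have done this differently.

```r
set.seed(42)
# Simulated ordination scores (assumption: 3 axes, two groups of 15 elements)
scores <- rbind(matrix(rnorm(45, mean = 0), ncol = 3),
                matrix(rnorm(45, mean = 1), ncol = 3))
groups <- rep(c("a", "b"), each = 15)

# Hypothetical helper: squared Mahalanobis distance between two group means
group_distance <- function(scores, groups) {
  mean_a <- colMeans(scores[groups == "a", , drop = FALSE])
  mean_b <- colMeans(scores[groups == "b", , drop = FALSE])
  # Assumption: covariance estimated from all scores together
  mahalanobis(mean_a, center = mean_b, cov = cov(scores))
}

observed <- group_distance(scores, groups)

# Null distribution: shuffle the group labels and recompute the distance
permuted <- replicate(999, group_distance(scores, sample(groups)))

# Proportion of permuted distances at least as large as the observed one
p_value <- (sum(permuted >= observed) + 1) / (length(permuted) + 1)
p_value
```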
Test the actual function times using
Rprof()
...
Rprof(NULL)
summaryRprof()
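For reference, a self-contained version of that profiling pattern. The workload in the loop is an arbitrary placeholder that runs for about half a second so the profiler records some samples; replace it with the function to test.

```r
# Write profiling samples to a temporary file
profile_file <- tempfile(fileext = ".out")
Rprof(profile_file)

# Placeholder workload: repeat a small computation for about 0.5 seconds
start <- Sys.time()
while (as.numeric(difftime(Sys.time(), start, units = "secs")) < 0.5) {
  invisible(dist(matrix(rnorm(2000), nrow = 100)))
}

# Stop profiling and summarise the time spent per function
Rprof(NULL)
profile_summary <- summaryRprof(profile_file)
head(profile_summary$by.self)
```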
Points to improve:
ifelse:
bla <- ifelse(any(something == -1), FALSE, TRUE)
bla <- !any(something == -1)
which selector:
bla <- bob[which(bib != -1)]
bla <- bob[bib != -1]
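A small check of both simplifications on toy vectors (the variable names are placeholders). One caveat worth keeping in mind: dropping which() changes the result when the tested vector contains NA, because logical indexing with NA returns NA instead of skipping the element.

```r
something <- c(1, -1, 3)

# Both forms give the same single logical value
with_ifelse <- ifelse(any(something == -1), FALSE, TRUE)
without_ifelse <- !any(something == -1)
same_logical <- identical(with_ifelse, without_ifelse)

# Without NA, both selectors return the same elements
bob <- c("a", "b", "c", "d")
bib <- c(1, -1, 2, 3)
same_selection <- identical(bob[which(bib != -1)], bob[bib != -1])

# With NA, which() drops the NA position but logical indexing keeps it as NA
bib_na <- c(1, -1, NA, 3)
with_which <- bob[which(bib_na != -1)]
without_which <- bob[bib_na != -1]
```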
pairwise.dist
adonis.dispRity
Methods from vegan::vegdist:
"manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup", "binomial", "chao", "cao" or "mahalanobis".
TODO:
char.diff branch merged
time.slice on master branch
char.diff castor dependency functions
char.diff
We were unable to install dispRity, whether from CRAN or github:
> install.packages("dispRity")
Installing package into ‘/home/josephwb/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘dispRity’ is not available (for R version 3.6.3)
This was surprising, since no other phylo packages (that I am aware of) require such recent versions of R. For example:
package R_version
1 phytools 3.5.0
2 phangorn 3.2.0
3 geiger 2.15
4 rotl 3.1.1
5 ape 3.2.0
6 paleotree 3.0.0
7 caper 2.10
8 BAMMtools 2.10
Changing the R version in the DESCRIPTION file to R (>= 3.5) revealed an odd issue with Rv4. During error-checking, dispRity checks the class of data. Weirdly, Rv4 has a different class for a matrix than does Rv3.*:
# generate a matrix
matrix(1:9, nrow=3, ncol=3) -> a;
b <- a;
# Rv4 gives the class above as:
# class(a);
# [1] "matrix" "array"
# while R v3.* gives:
# class(a);
# [1] "matrix"
Let's construct an example to demonstrate the problem:
# set the class attributes independent of R version:
class(a) <- "matrix";
class(b) <- c("matrix", "array");
The dispRity error check is essentially the following (where it is clear why things fail when there is only a single class attribute):
# the issue with Rv3.* was the following check:
all(class(a) == c("matrix", "array"))
# [1] FALSE
# while in Rv4:
all(class(b) == c("matrix", "array"))
# [1] TRUE
However, we can make a general check that works for both R versions:
# we can make this work for both versions of R with the following:
all(class(a) %in% c("matrix", "array"))
# [1] TRUE
all(class(b) %in% c("matrix", "array"))
# [1] TRUE
Yay! This is currently implemented in my fork. However, will submit a PR as it is a bit of a pain to update R itself just to install one package (^_-)≡☆
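As a possible alternative to comparing class vectors, base R's is.matrix() and inherits() avoid depending on the length of class(). This is only a sketch of the idea, not what dispRity or the fork actually implements. Note that the %in% check also accepts a 3D array, which may or may not be wanted.

```r
a <- matrix(1:9, nrow = 3, ncol = 3)

# Both checks return a single TRUE/FALSE whatever the length of class(a)
matrix_by_is <- is.matrix(a)
matrix_by_inherits <- inherits(a, "matrix")

# A 3D array has class "array" only
arr <- array(1:8, dim = c(2, 2, 2))

# The %in% check lets the 3D array through...
array_passes_in <- all(class(arr) %in% c("matrix", "array"))
# ...whereas is.matrix() rejects it
array_passes_is <- is.matrix(arr)
```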
Improve the speed in bootstrap.lapply with http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html (pipe)
Try adding the functions as metric = c(f1,f2,f3) -> f1(f2(f3(matrix)))
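One way the nesting could be done in base R is with Reduce(), applying the last function first. The compose_metrics helper is a hypothetical name for illustration and is not part of dispRity.

```r
# Hypothetical helper: turn c(f1, f2, f3) into function(m) f1(f2(f3(m)))
compose_metrics <- function(metrics) {
  function(matrix) {
    # rev() so the last function in the list is applied first
    Reduce(function(value, fun) fun(value), rev(metrics), matrix)
  }
}

set.seed(5)
space <- matrix(rnorm(20), nrow = 5)

# c() on functions gives a list of functions
metric <- c(median, rowSums)
composed <- compose_metrics(metric)

# Same as median(rowSums(space))
composed_value <- composed(space)
nested_value <- median(rowSums(space))
```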
Add a serial mode for calculating disparity where the calculation is slower (for loop) but can depend on previously calculated values (e.g. for ancestral.dist) or can be applied between subsets (e.g. for measuring the distance between groups).
TODO:
"serial"
"serial" class option for metrics (input matrix and groups)
make.metric to test for the "serial" class option
"serial" option for dispRity: this argument can be logical (default is FALSE; TRUE = sequential comparison for chrono.subset objects/pairwise comparisons for custom.subset objects) or a list of pairs of comparisons (like for test.dispRity's comparisons argument)
ancestral.dist as a "serial" metric
min.dist as a "serial" metric
Things can be sped up by directly passing a list of comparisons (pairwise or sequential or user defined) that will be passed as a lapply_loop argument (or similar).
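A sketch of how sequential and pairwise comparison lists could be built and looped over in base R. The subset names, the toy space and the centroid_distance helper are all made up for illustration; the between-group metric here is just the Euclidean distance between centroids.

```r
set.seed(3)
# Toy space: 20 elements in 2 dimensions, split into 4 named subsets
space <- matrix(rnorm(40), nrow = 20)
subset_names <- c("90", "60", "30", "0")
subsets <- split(seq_len(nrow(space)), rep(subset_names, each = 5))

# Sequential comparisons: each subset against the next one
sequential <- lapply(seq_len(length(subset_names) - 1),
                     function(i) subset_names[c(i, i + 1)])

# Pairwise comparisons: every pair of subsets
pairwise <- combn(subset_names, 2, simplify = FALSE)

# Hypothetical between-group metric: distance between the two centroids
centroid_distance <- function(pair) {
  centroid_1 <- colMeans(space[subsets[[pair[1]]], , drop = FALSE])
  centroid_2 <- colMeans(space[subsets[[pair[2]]], , drop = FALSE])
  sqrt(sum((centroid_1 - centroid_2)^2))
}

# The list of comparisons drives the loop
sequential_distances <- sapply(sequential, centroid_distance)
pairwise_distances <- sapply(pairwise, centroid_distance)
```

A user-defined list of pairs can be passed to the same sapply() call, which is the main appeal of treating the comparisons as a plain list.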
Hi,
I created a phylogenetic tree, and downstream analysis requires all zero-length branches to be removed from the tree. So I ran remove.zero.brlen(), but it seems it is not working. Here are my code and outputs:
t <- read.tree("311coreGen.tre")
t
is.binary.tree(t)
tre <- remove.zero.brlen(t, 100,verbose = FALSE) # package dispRity
is.binary.tree(tre)
tre
Phylogenetic tree with 311 tips and 309 internal nodes.
Tip labels:
Sta003, Tul001, SJ002, Sta002, LMG215, Tul004, ...
Node labels:
, 0.000, 0.000, 0.000, 0.000, 1.000, ...
Unrooted; includes branch lengths.
[1] TRUE
longer object length is not a multiple of shorter object length
the condition has length > 1 and only the first element will be used
[1] TRUE
Phylogenetic tree with 311 tips and 309 internal nodes.
Tip labels:
S003, T001, S002, S002, L215, T004, ...
Node labels:
, 0.000, 0.000, 0.000, 0.000, 1.000, ...
Unrooted; includes branch lengths.
Can anyone please explain to me what is wrong with my code? Or how should I remove zero-length branches?
Best,
Gets an automatic interface for handling I/O from treats in chrono.subsets, custom.subsets and dispRity
Hi Thomas,
Here are the two feature requests on GitHub that I sent you before. It would be great if you could incorporate these at some point in the package.
# Now let's calculate the mean estimated ancestral trait values across all trees.
# Note that this function will not provide the correct output if the input trees
# differ in topology (but the input trees can differ in branch length).
calc.mean.anc <- function(list.of.reconstructions) {
  # list.of.reconstructions: list of ace output (1 list element per tree)
  # Extract lik.anc for all 100 trees
  summary.lik.anc <- lapply(list.of.reconstructions, function(x) x$lik.anc)
  # Calculate sum of all list elements
  summary.lik.anc <- Reduce('+', summary.lik.anc)
  # Divide by number of list elements (in this case 100, since there are 100 trees)
  summary.lik.anc <- summary.lik.anc / length(list.of.reconstructions)
  # Round the calculated mean estimates
  summary.lik.anc <- round(summary.lik.anc, 3)
  return(summary.lik.anc)
}
Thank you very much!
Reduce functions processing time by:
disparity.through.time, the metrics are wrongly passed to the dispRity object (under generic metric name).
In terms of what needs to be written - wrapper functions might be nice, but not necessary. Also maybe a function for the grouping factors step which currently looks super complicated. Basically any part of the vignette examples that looks complicated would be worth thinking about writing a function for.
time.subsamples/custom.subsamples to generate empty groups
boot.matrix and dispRity to skip the empty groups
summary/plot/test to skip the empty groups
I still hate this terminology - have changed to dimension-level for clarity, see what you think... In my mind levels refers to factors, or to nested hierarchies. This is just 1D, 2D or 3D.
Not that important - but decide if error messages start with capital letter or small letter, and whether or not they end with a full stop...
🎲
I wonder if changing the word "series" to "subsample" might be sensible? Series really only makes sense for time slice analyses. If you truly want this to be a multidisciplinary package, then maybe subsample is more accurate. And actually, even within morpho stuff, you might want to split into two groups and compare them, so series is a really odd word choice. It might be worth thinking about other function names etc you might want to change so you can go through and replace everything at once!
Check for places this is an issue.
I just noticed at "Measuring disparity as a distribution" you say let's use previously created object boot_time_slices and do something to it. This object was made way back, so it's likely the person has forgotten what it is.
I think when these prelim steps have been done a long time earlier, and/or are not built into the package, you probably want to include the code to get boot_time_slices here too. Most people won't read the manual cover to cover in one go.
This may be true in other places, so have a check and sort this out. It shouldn't add much code, but will add clarity for anyone trying out just one bit of the manual.
Initialise vectors and matrices to their final size throughout the code (C attribution style) but mainly for the bottlenecks in:
dispRity.metric
dispRity
reduce.space
multi.ace
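A generic illustration of the preallocation idea, not taken from the dispRity code: the first loop grows the vector at each iteration (which copies it every time), the second fills a vector created at its final size.

```r
n <- 1000

# Growing the result: the vector is reallocated at every iteration
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the vector is created once at its final size, then filled
preallocated <- numeric(n)
for (i in seq_len(n)) {
  preallocated[i] <- i^2
}

# Same values either way; only the memory handling differs
same_result <- identical(grown, preallocated)

# Same idea for a matrix: create it at full size, then fill row by row
results <- matrix(NA_real_, nrow = 10, ncol = 3)
for (i in seq_len(nrow(results))) {
  results[i, ] <- c(i, i^2, i^3)
}
```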