tguillerme / disprity

Measuring disparity with R

License: GNU General Public License v3.0

R 11.65% TeX 3.91% C 0.15% Shell 0.04% HTML 81.42% CSS 0.33% JavaScript 2.49%

disparity multidimensionality r palaeobiology ecology

disprity's People

Contributors

Stargazers

Watchers

Forkers

andrewljackson puttickmacroevolution dwbapst jarioksa hattie415 graemetlloyd paulesantos nhcooper123 aegit jhhatfield leibniz2207

disprity's Issues

The use of the term series...

I wonder if changing the word "series" to "subsample" might be sensible? Series really only makes sense for time slice analyses. If you truly want this to be a multidisciplinary package, then maybe subsample is more accurate. And actually, even within morpho stuff, you might want to split into two groups and compare them, so series is a really odd word choice. It might be worth thinking about other function names etc you might want to change so you can go through and replace everything at once!

Sanitizing

I hate this phrase - sounds like you're cleaning a toilet. I'd normally use data cleaning or something. Keep it if you want but I needed to voice my revulsion 🚽

Time slice terminology

Bit puzzled by the offspring/descendant terminology - I'd assume these were the same thing. By descendant do you mean ancestor? I'm confused.

`treats` interface

Gets an automatic interface for handling I/O from treats in chrono.subsets, custom.subsets and dispRity

dispRity_fun.R

Not sure if you want to change level to dimension-level in these functions?

More complex rarefaction algorithms

Implement "fair" rather than equal rarefaction algorithms (Kotric and Knoll 2015, Paleobiology, 41, 68–.)
Implement "personalized" rarefaction algorithm that can use external data sources such as the number of occurrences or geographic regions.

This can work by adding more boot.type options to boot.matrix based on an input matrix with elements in rows and probabilities in columns (e.g. boot.matrix(matrix, type = personalised_matrix)).
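As a rough sketch of the idea only (this is not how boot.matrix is implemented; `weighted.bootstrap`, `space` and `prob_matrix` are hypothetical names made up for illustration), base R's `sample()` already accepts a `prob` vector, so a probability-weighted bootstrap replicate could look something like this:

```r
set.seed(42)

# Hypothetical data: 5 elements (rows) in a 2-dimensional space
space <- matrix(rnorm(10), nrow = 5,
                dimnames = list(paste0("sp", 1:5), NULL))

# Hypothetical input: elements in rows, sampling probabilities in one column
# (e.g. scaled number of occurrences)
prob_matrix <- matrix(c(0.4, 0.3, 0.1, 0.1, 0.1), ncol = 1,
                      dimnames = list(paste0("sp", 1:5), "prob"))

# One weighted bootstrap replicate: resample the rows with replacement,
# each element being drawn according to its probability
weighted.bootstrap <- function(space, prob_matrix) {
    rows <- sample(rownames(space), size = nrow(space), replace = TRUE,
                   prob = prob_matrix[rownames(space), 1])
    space[rows, , drop = FALSE]
}

# 100 replicates, stored as a list of matrices
replicates <- replicate(100, weighted.bootstrap(space, prob_matrix),
                        simplify = FALSE)
```

Elements with higher probabilities end up in more replicates, which is the behaviour a "personalised" rarefaction would presumably need.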

Towards 1.5

Add serial type metrics
Add test.metric
Add reduce.space
Add multi.ace
fix multi.ace testing with the new version of castor (bugged?)
Add dispRity output option for multi.ace
Add fit.multi.ace
Add ancestor matrix category (from multi.ace)

Consistency of error messages

Not that important - but decide if error messages start with capital letter or small letter, and whether or not they end with a full stop...

🎲

matrix restriction in dispRity

Remove matrix size restrictions in dispRity family functions.

Allow `dispRity` object to deal with NAs

Allow time.subsamples/custom.subsamples to generate empty groups
Allow boot.matrix and dispRity to skip the empty groups
Fix summary/plot/test to skip the empty groups

Vignette edits

It's worth thinking about the purpose of a vignette. I've always thought they were there to help people run analyses, and understand the method. Yours seems very focussed in places on how the programming was done and how things work on a technical level, with less scientific focus. Check out the caper vignette for what I think is a nice example. I'd perhaps think about what people really need to know, and maybe put the programmatic stuff into a separate document, while bringing the separate examples into one main manual - this would reduce unnecessary repetition? Also remember the vignette goes with the package help files so doesn't need to repeat information too much.

Maybe you could start with a page of general background to disparity, and the overlap across disciplines. I'm thinking this would be almost the same as the intro to the paper, so easy to write. You can then move on to some specific examples using your datasets, then into the detail of the metrics and making custom metrics, then decide what details are still needed to make it possible to use the package from the rest of the vignette? This is mostly rearranging stuff, and a bit of writing that can be used for the paper too so not time wasted.

It will be long, but that's not a bad thing for a vignette.

See the vignette for more detailed comments at least towards the start

Functions algorithm optimality

Reduce functions processing time by:

Declaring variables ahead (C style)
When possible do negative comparisons

Fix `tree.age` issues

Tree ages with fossils only
Updated precision

Minor bugs

-dispRity does not work with wrong metrics combinations (c(var, mean))
-centroids does not work as stand alone with optional centroid as a vector (centroid = c(0,0,0))
-sort.dispRity bugged (see desiderata)

Converting shape data arrays

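# A is a 3D array of Procrustes coordinates (landmarks x dimensions x specimens)
pxk <- dim(A)[1] * dim(A)[2] # number of coordinates per specimen
n <- dim(A)[3] # number of specimens
tmp <- aperm(A, c(3, 2, 1)) # put the specimens on the first dimension
dim(tmp) <- c(n, pxk) # flatten to one row per specimen
rownames(tmp) <- dimnames(A)[[3]] # tmp is a 2D matrix
X <- prcomp(tmp)$x # ordination matrix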

intro for the paper

Rather than add this to the document you're working on, I'll paste it here. This is a rough outline. Fill in the details with what you've already written!

Biological data are complex; understanding the ecology and evolution of species often requires that we analyse multiple variables that covary with each other, and through space and time.
Multivariate analyses aim to capture and incorporate this multidimensional complexity, while providing outputs that are interpretable without needing to visualise n-dimensions simultaneously.
Key multidimensional features of species that have important roles in ecology and evolution include morphology, functional traits. etc. etc.
Many of these can be represented as matrices, which can be used in our package.
In the interests of clarity, we here focus on just one kind of data: morphological diversity.

In evolutionary biology and palaeontology, morphological diversity is often referred to as disparity.
DESCRIBE COOL USES OF DISPARITY ANALYSES. WHY IS DISPARITY AWESOME.
Although disparity is commonly studied, its definition varies widely across authors due to the myriad ways it can be measured.
LIST WAYS IT IS MEASURED.
This is further complicated by confusion over whether disparity refers to the metric describing or summarising one or more aspects of this space, or to the multidimensional space which is a specific mathematical object.
This multidimensional space can also be defined in multiple ways.
LIST ways

In theory, this multitude of ways to measure/define disparity is not an issue.
The ideal solution is to choose either the most appropriate method for your question or data, or to apply several definitions and compare across these.
Unfortunately, in practice this is hampered by existing software implementations.
Package maintainers/software developers choose their preferred definition of disparity and method for measuring it, and then enforce this on users by allowing zero flexibility.
This can lead to a chain of inappropriate analyses led by everyone just using the package that exists and the method within it, and in the worst case, using the defaults of the package. Oh no!
The dispRity package will solve this problem by providing a completely flexible framework for studying disparity.
It implements all commonly used metrics and definitions of disparity, as well as providing a simple interface for users to implement their own disparity functions.
The package is described here for use with morphological diversity data, but the functions are equally applicable to ecological contexts such as Diaz paper, Deirdre's paper etc.

post 0.4 suggestions

time.subsamples and custom.subsamples should generate the $call$dimensions element on the dispRity object
Add confidence interval plot when plotting a non bootstrapped dimensions-level 2 metric

Feature requests for multi.ace

Hi Thomas,

here the two feature requests on GitHub that I sent you before. Would be great, if you could incorporate these at some point in the package.

Could you incorporate the log-likelihood and the AIC value in the multi.ace() output?
This one might be a bit more work: Could you implement a summary function for the ancestral estimates (i. e. summarize ancestral states across a set of trees)? Maybe the mean of the ancestral estimates across the analyses set of trees (for a fixed topology it is relatively straightforward, but for trees that differ in topology it might be a bit more work)? I have also attached a little function, that is meant to produce the mean of ancestral state estimates based on the ace() output (unfortunately I only saw your multi.ace() function later on, but I reckon this could be easily modified accordingly).

# Now let's calculate the mean estimated ancestral trait values across all trees.
# Note that this function will not provide the correct output if the input trees differ in topology
# (but the input trees can differ in branch length).
calc.mean.anc <- function(list.of.reconstructions){ # list.of.reconstructions: list of ace output (1 list element per tree)
  summary.lik.anc <- lapply(list.of.reconstructions, function(x){x$lik.anc}) # Extract lik.anc for all 100 trees
  summary.lik.anc <- Reduce('+', summary.lik.anc) # Calculate sum of all list elements
  summary.lik.anc <- summary.lik.anc/length(list.of.reconstructions) # Divide by number of list elements (in this case 100, since there are 100 trees)
  summary.lik.anc <- round(summary.lik.anc,3) # Round the calculated mean estimates
  return(summary.lik.anc)
}

Thank you very much!

Update distances methods to `vegdist` once

pairwise.dist
adonis.dispRity
others?

Methods from vegan::vegdist:
"manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup" , "binomial", "chao", "cao" or "mahalanobis".

Speed improvements

Initialise vectors and matrices to their final size throughout the code (C attribution style) but mainly for the bottlenecks in:

dispRity.metric
dispRity
reduce.space
multi.ace
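To make the point about initialising objects to their final size concrete, here is a small, generic sketch (the function names `grow` and `prealloc` are made up for illustration and are not taken from the package). Both functions return the same vector, but the first one reallocates it at every iteration:

```r
# Growing a vector inside the loop: a new copy is made at each iteration
grow <- function(n) {
    out <- numeric(0)
    for (i in seq_len(n)) {
        out <- c(out, i^2)
    }
    out
}

# Initialising the vector to its final size, then filling it in place
prealloc <- function(n) {
    out <- numeric(n)
    for (i in seq_len(n)) {
        out[i] <- i^2
    }
    out
}

# Same result either way; only the run time differs
same_result <- identical(grow(1000), prealloc(1000))

# Compare the timings on a larger input
system.time(grow(2e4))
system.time(prealloc(2e4))
```

The difference usually only matters for long loops, so it is probably worth profiling first and restricting the change to the bottlenecks listed above.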

Examples dependent on earlier examples

Check for places this is an issue.

I just noticed at "Measuring disparity as a distribution" you say let's use previously created object boot_time_slices and do something to it. This object was made way back, so it is likely the person has forgotten what it is.

I think when these prelim steps have been done a long time earlier, and/or are not built into the package, you probably want to include the code to get boot_time_slices here too. Most people won't read the manual cover to cover in one go.

This may be true in other places, so have a check and sort this out. It shouldn't add much code, but will add clarity for anyone trying out just one bit of the manual.

functions I have not edited

make.test
utilities_old
zzz

levels of functions

I still hate this terminology - have changed to dimension-level for clarity, see what you think... In my mind levels refers to factors, or to nested hierarchies. This is just 1D, 2D or 3D.

Issues with R < 4.0

We were unable to install dispRity, whether from CRAN or github:

> install.packages("dispRity")
Installing package into ‘/home/josephwb/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘dispRity’ is not available (for R version 3.6.3)

This was surprising, since no other phylo packages (that I am aware of) require such recent versions of R. For example:

    package R_version
1  phytools     3.5.0
2  phangorn     3.2.0
3    geiger      2.15
4      rotl     3.1.1
5       ape     3.2.0
6 paleotree     3.0.0
7     caper      2.10
8 BAMMtools      2.10

Changing the R version in the DESCRIPTION file to R (>= 3.5) revealed an odd issue with Rv4. During error-checking, dispRity checks the class of data. Weirdly, Rv4 has a different class for a matrix than does Rv3.*:

# generate a matrix
matrix(1:9, nrow=3, ncol=3) -> a;
b <- a;

# Rv4 gives the class above as:
# class(a);
# [1] "matrix" "array"
# while R v3.* gives:
# class(a);
# [1] "matrix"

Let's construct an example to demonstrate the problem:

# set the class attributes independent of R version:
class(a) <- "matrix";
class(b) <- c("matrix", "array");

The dispRity error check is essentially the following (where it is clear why things fail when there is only a single class attribute):

# the issue with Rv3.* was the following check:
all(class(a) == c("matrix", "array"))
# [1] FALSE
# while in Rv4:
all(class(b) == c("matrix", "array"))
# [1] TRUE

However, we can make a general check that works for both R versions:

# we can make this work for both versions of R with the following:
all(class(a) %in% c("matrix", "array"))
# [1] TRUE
all(class(b) %in% c("matrix", "array"))
# [1] TRUE

Yay! This is currently implemented in my fork. However, will submit a PR as it is a bit of a pain to update R itself just to install one package (^_-)≡☆
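Another option that should not depend on what `class()` returns is `is.matrix()`, which only looks at whether the object has a two-dimensional `dim` attribute. The following is a sketch of what such a check could look like; `check.matrix` is a hypothetical helper written for this example, not the function used inside dispRity:

```r
# generate a matrix
a <- matrix(1:9, nrow = 3, ncol = 3)

# is.matrix() tests for a dim attribute of length 2,
# independently of the class vector
is.matrix(a)

# hypothetical input check built on is.matrix()
check.matrix <- function(data) {
    if (!is.matrix(data)) {
        stop("data must be a matrix.")
    }
    invisible(TRUE)
}

check.matrix(a)

# a plain vector is rejected
not_a_matrix <- try(check.matrix(1:3), silent = TRUE)
```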

ANOVA aov and lm

I've made this comment in the manual too - but aov is not how you do an ANOVA unless you have a balanced design. You basically never do, so always use lm instead. ANOVA is just a type of linear model
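For illustration, a minimal sketch of the lm route on made-up, unbalanced data (the data frame and its column names are hypothetical and not taken from the manual):

```r
set.seed(1)

# Hypothetical unbalanced design: 8, 5 and 12 observations per group
disparity_data <- data.frame(
    disparity = c(rnorm(8, mean = 1), rnorm(5, mean = 2), rnorm(12, mean = 1.5)),
    group = factor(rep(c("A", "B", "C"), times = c(8, 5, 12)))
)

# Fit the linear model, then get the ANOVA table from it
model <- lm(disparity ~ group, data = disparity_data)
anova_table <- anova(model)
anova_table

# Coefficients (differences from the first group) and overall fit
summary(model)
```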

Installation problem with caper?

Error when installing caper?

devtools::install_github("TGuillerme/dispRity", ref = "release")
Downloading GitHub repo TGuillerme/dispRity@release
from URL https://api.github.com/repos/TGuillerme/dispRity/zipball/release
Installing dispRity
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/caper_0.5.2.tgz'
Content type 'application/x-gzip' length 1274260 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

Installing caper
Error in FUN(X[[i]], ...) : 
  Invalid comparison operator in dependency: >=

Wrapper functions etc.

In terms of what needs to be written - wrapper functions might be nice, but not necessary. Also maybe a function for the grouping factors step which currently looks super complicated. Basically any part of the vignette examples that looks complicated would be worth thinking about writing a function for.

fail to remove zero-length branches,

Hi,
I created a phylogenetic tree, and the downstream analysis requires all zero-length branches to be removed, so I ran remove.zero.brlen(), but it does not seem to be working. Here is my code and output:

t <- read.tree("311coreGen.tre")
t
is.binary.tree(t)

tre <- remove.zero.brlen(t, 100,verbose = FALSE) # package dispRity
is.binary.tree(tre)
tre

Phylogenetic tree with 311 tips and 309 internal nodes.

Tip labels:
Sta003, Tul001, SJ002, Sta002, LMG215, Tul004, ...
Node labels:
, 0.000, 0.000, 0.000, 0.000, 1.000, ...

Unrooted; includes branch lengths.
[1] TRUE
longer object length is not a multiple of shorter object length
the condition has length > 1 and only the first element will be used
[1] TRUE

Phylogenetic tree with 311 tips and 309 internal nodes.

Tip labels:
S003, T001, S002, S002, L215, T004, ...
Node labels:
, 0.000, 0.000, 0.000, 0.000, 1.000, ...

Unrooted; includes branch lengths.

Can anyone please explain to me what is wrong with my code? Or how should I remove zero-length branches?

Best,

Display improvements

in disparity.through.time, the metrics are wrongly passed to the dispRity object (under generic metric name).

Warning messages in unit testing:

Cryptic unit test warnings to sort:

coercing argument of type 'double' to logical

the condition has length > 1 and only the first element will be used

Logical improvements

Test the actual function times using

Rprof()
...
Rprof(NULL)
summaryRprof()

Points to improve:

logical ifelse:

bla <-  ifelse(any(something == -1), FALSE, TRUE)  
bla <- !any(something == -1)

which selector:

bla <- bob[which(bib != -1)]
bla <- bob[bib != -1]
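One caveat before swapping these systematically: the two selectors are only equivalent when there are no NAs in the vector being tested. A small sketch with made-up values (`bob` and `bib` are the placeholder names from above):

```r
bob <- c(10, 20, 30, 40)
bib <- c(1, -1, NA, 2)

# which() silently drops the positions where the comparison is NA
with_which <- bob[which(bib != -1)]

# logical indexing keeps them and returns NA for those positions
with_logical <- bob[bib != -1]

with_which
with_logical

# Without NAs both selectors give the same result
bib_complete <- c(1, -1, 3, 2)
same_without_na <- identical(bob[which(bib_complete != -1)],
                             bob[bib_complete != -1])
```

So the replacement is safe wherever the input has already been checked for NAs, and changes the behaviour elsewhere.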

Add `PAST` permutation tests.

As used in Brusatte et al. (2014, Current Biology, 24, 2386-):

we used permutation tests to perform pairwise comparisons of group means in R and PAST. These comparisons tested the equality of multivariate means (based on the 125 recovered PCO axes, which comprise 90% of total variance) of two designated groups (for example, avialans versus deinonychosaurs). The Mahalanobis distance between the two group means was calculated and compared with a null distribution of between-group distances obtained by random permutation of the group labels.
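A rough sketch of what such a test could look like in base R, under some assumptions: the scores are simulated, the Mahalanobis distance uses the pooled within-group covariance matrix (the quoted text does not say which covariance was used), and `group.dist` is a hypothetical helper, not a dispRity or PAST function.

```r
set.seed(1)

# Simulated ordination scores: 20 elements on 3 axes, two groups of 10
scores <- matrix(rnorm(60), nrow = 20, ncol = 3)
scores[1:10, 1] <- scores[1:10, 1] + 1.5
groups <- rep(c("A", "B"), each = 10)

# Mahalanobis distance between the two group means,
# using the pooled within-group covariance matrix (an assumption)
group.dist <- function(scores, groups) {
    labels <- unique(groups)
    g1 <- scores[groups == labels[1], , drop = FALSE]
    g2 <- scores[groups == labels[2], , drop = FALSE]
    pooled <- ((nrow(g1) - 1) * cov(g1) + (nrow(g2) - 1) * cov(g2)) /
              (nrow(g1) + nrow(g2) - 2)
    # stats::mahalanobis() returns the squared distance
    sqrt(mahalanobis(colMeans(g1), center = colMeans(g2), cov = pooled))
}

# Observed distance between the group means
observed <- group.dist(scores, groups)

# Null distribution: same distance after randomly permuting the group labels
null_dist <- replicate(999, group.dist(scores, sample(groups)))

# Proportion of permuted distances at least as large as the observed one
p_value <- (sum(null_dist >= observed) + 1) / (length(null_dist) + 1)
```

Adding one to both the numerator and the denominator counts the observed value as one of the permutations, so the p-value can never be exactly zero.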

boot.matrix dimensions argument

The options don't make sense to me

If it's a number > 0, it keeps that number of dimensions, and if < 0 it removes the last dimensions by that %? This is very odd. Please explain more clearly

Expand node/tip probability structure

For time/group subsetting

Allow a probability table to be passed similarly to a FADLAD argument containing probabilities for each point in time. Probabilities can include one element (second element is set to NA), two (like in equal.split/gradual.split) or more.

For rarefaction

Implement "fair" rather than equal rarefaction algorithms (Kotric and Knoll 2015, Paleobiology, 41, 68–.)
Implement "personalized" rarefaction algorithm that can use external data sources such as the number of occurrences or geographic regions.

This can work by adding more boot.type options to boot.matrix based on an input matrix with elements in rows and probabilities in columns (e.g. boot.matrix(matrix, type = personalised_matrix)).

multidimensional `bhatt.coeff`

Come up with a multidimensional version of the bhatt.coeff:

implementation
test
manual

From Natalie's pull request

I've made a start at this, but it'll take a while as it's quite long and complicated! I think the vignettes need an overhaul, some details below and on the document after #NC: (I know this isn't proper comments in Rmd but I was trying to be faster!). I reckon maybe rather than doing this all in one hit I might need to take a function a day or something, as it's making my head hurt. But I can work through it all slowly.

I got as far as check_morpho_fun today

Some comments below.

README - vignette info, also short description of disparity.
Note that DESCRIPTION is a little more specific to disparity - maybe echo this in the README, but add that it can be used for ecological stuff too? Might make uses more obvious?
patch notes - I haven't edited as these are technical and personal anyway.
disparity_object

| | ---[[...]] = class:"list" (the following rarefactions)
| | | |
| | ---[[...]] = class:"numeric" (the bootstraps)
| |
| ---[[...]] = class:"list" (the following series)
| |
| ---$elements* = class:"matrix" (a one column matrix containing the elements within this series)
| |
| ---[[...]] = class:"matrix" (the rarefactions)

This bit gets confusing - maybe delete and just mention that of course this continues to the end of the list of series.

\---$disparity
	|
	\---[[2]] = class:"list" (the first series)

??? Also this is repeated for the second iteration, I think this is an error.

also what is disparity[[1]]?

Need to complete this for get.dispRity or extract.dispRity

I'll put more general comments as an issue...

Add serial mode to metrics

Add a serial mode for calculating disparity where the calculation is slower (for loop) but can depend on previously calculated values (e.g. for ancestral.dist) or can be applied between subsets (e.g. for measuring the distance between groups).

TODO:

find a more general name than "serial"
create a new "serial" class option for metrics (input matrix and groups)
allow make.metric to test for the "serial" class option
add a "serial" option for dispRity: this argument can be logical (default is FALSE; TRUE = sequential comparison for chrono.subset objects/pairwise comparisons for custom.subset objects) or a list of pairs of comparisons (like for test.dispRity's comparisons argument)
implement ancestral.dist as a "serial" metric
implement min.dist as a "serial" metric

Things can be sped up by directly passing a list of comparisons (pairwise or sequential or user defined) that will be passed as a lapply_loop argument (or similar).

Include `metric` in the `dispRity` object

Include the metric_list from dispRity in the dispRity object when available.

Improve `dispRity` speed

Improve the speed in bootstrap.lapply with http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html (pipe)

Try adding the functions as metric = c(f1,f2,f3) -> f1(f2(f3(matrix)))
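One possible way to do the nesting, sketched with base R's `Reduce()` (`compose.metrics` is a hypothetical helper for illustration, not something that exists in the package): the functions are applied from last to first, so that c(f1, f2, f3) becomes f1(f2(f3(matrix))).

```r
# Turn a list of functions c(f1, f2, f3) into a single function
# that computes f1(f2(f3(data)))
compose.metrics <- function(metrics) {
    function(data) {
        # rev() so that the last function in the list is applied first
        Reduce(function(value, fun) fun(value), rev(metrics), data)
    }
}

# Example: median of the row sums of a small matrix
example_matrix <- matrix(1:6, nrow = 2)
composed <- compose.metrics(c(median, rowSums))

composed(example_matrix)
median(rowSums(example_matrix))
```

Whether this is actually faster than the current approach would still need to be checked with Rprof or similar.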

get.contrast.matrix - in morpho.utilities

In get.contrast.matrix you say you set the diagonal to 0 but the code sets it to 1, as far as I can see - maybe check this?

Allow `plot.dispRity` to plot single disparity values

data(BeckLee_mat50)
data(BeckLee_tree)
crown_stem <- crown.stem(BeckLee_tree, inc.nodes = FALSE)
subset_crown_stem <- custom.subsets(BeckLee_mat50, group = crown_stem)
disp_crown_stem <- dispRity(subset_crown_stem, metric = c(median, centroids))
plot(disp_crown_stem)

Finish model.test implementation

Add functionality for `model.test.sim`

Mark:

Allow the user to specify the function for the central tendency calculation (default should be model.test.sim(..., cent.tend = median)).

Checking manuals

Adding references
Checking for typos

Making a page in the dispRity manual

We need to add a tutorial in inst/gitbook/03_specific-tutorials.Rmd after the section "### null morphospace testing with null.test".
This section should cover:

a bit of background on the test
how the test works (implemented)
how to summarize/interpret the test results

Finalizing the S3 methods

print
summary
plot

Test coverage

Make sure that model.test and model.test.sim have full test coverage:

Thomas check for:

Mark check for:

Add distance matrix conservation

Modify the dispRity function to keep distance matrices if elements are missing using a keep.distance = TRUE option.
With this option, the row selection information from boot.matrix, custom.subsets or chrono.subsets is also passed for the columns (dimensions is ignored).
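The underlying operation can be sketched in base R as follows. This only illustrates selecting the same elements on the rows and the columns; the object names are made up and the keep.distance option itself is not implemented here:

```r
set.seed(1)

# Hypothetical space: 5 elements in 3 dimensions
space <- matrix(rnorm(15), nrow = 5,
                dimnames = list(paste0("sp", 1:5), NULL))

# Full pairwise distance matrix between the elements
distances <- as.matrix(dist(space))

# Elements selected by a subset or a bootstrap replicate
subset_rows <- c("sp1", "sp3", "sp4")

# Applying the selection to both rows and columns keeps
# the result a valid (square, symmetric) distance matrix
subset_distances <- distances[subset_rows, subset_rows]
subset_distances
```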

More tests!

Dissimilarity analyses: ANOVA using dissimilarities, ANOSIM, MRPP, BIOENV, Mantel and partial Mantel tests.
Ordination and environment: vector fitting, centroid fitting and smooth surface fitting, adding species scores as weighted averages, adding convex hull, SD ellipses, arrows etc. to ordination.
Ordination: support and meta functions for NMDS, redundancy analysis, constrained correspondence analysis, constrained analysis of proximities (all three with partial analysis),