ropensci / rnexml

Implementing semantically rich NeXML I/O in R

Home Page: https://docs.ropensci.org/RNeXML

License: Other

R 65.68% XSLT 8.26% Makefile 0.02% TeX 25.90% Dockerfile 0.15%

r rstats nexml phylogenetics r-package

rnexml's Introduction

RNeXML

Maintainer: Carl Boettiger
Authors: Carl Boettiger, Scott Chamberlain, Hilmar Lapp, Kseniia Shumelchyk, Rutger Vos
License: BSD-3
Issues: Bug reports, feature requests, and development discussion.

An extensive and rapidly growing collection of richly annotated phylogenetics data is now available in the NeXML format. NeXML relies on state-of-the-art data exchange technology to provide a format that can be both validated and extended, which gives it data quality assurance and adaptability to future needs that other formats lack. See Vos et al. (2012) for further details on the NeXML format.

How to cite

RNeXML has been published in the following article:

Boettiger C, Chamberlain S, Vos R and Lapp H (2016). “RNeXML: A Package for Reading and Writing Richly Annotated Phylogenetic, Character, and Trait Data in R.” Methods in Ecology and Evolution, 7, pp. 352-357. doi:10.1111/2041-210X.12469

Although the published version of the article is pay-walled, the source of the manuscript, and a much better rendered PDF, are included in this package (in the manuscripts folder). You can also find it freely available on arXiv.

Installation

The latest stable release of RNeXML is on CRAN, and can be installed with the usual install.packages("RNeXML") command. Some of the more specialized functionality described in the Vignettes (such as RDF manipulation) requires additional packages which can be installed using:

install.packages("RNeXML", dependencies = TRUE)

The development version can be installed using:

remotes::install_github("ropensci/RNeXML")

Getting Started

See the vignettes below for both a general quick start and an overview of more specialized features.

rnexml's Issues

Testing

@schamberlain Hey Scott, sorry about the bugs this morning. I think all the tests should be working now. I think this counts as our first real milestone!

Like you say, they could use a lot of expect_that tests still. It's not necessary to go too crazy with the expect_that tests though, thanks to the validator (xmlSchemaValidate). This is the really nice thing about the schema-based data structure: it comes with a rigorous check that we've done things right.

Checking the schema is valid doesn't check that we have accurately read in a tree, of course. Rather than the more "unit" unit tests of checking each line, we'd probably get some really solid tests just by going back and forth between formats and then seeing if the objects were identical or not; starting both with a bunch of ape trees and with a bunch of NeXML trees (which you can grab from the treebase github repo: https://github.com/rvosa/supertreebase/tree/master/data/treebase )
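A round-trip check of this kind can be sketched generically. The helper below is hypothetical and not part of RNeXML: `write_fn` and `read_fn` are placeholders for whatever writer/reader pair is under test, and base R's saveRDS/readRDS stand in here only so that the sketch runs without extra packages.

```r
# Hypothetical helper: write an object out, read it back, and compare.
round_trip_ok <- function(obj, write_fn, read_fn, ext = ".tmp") {
  path <- tempfile(fileext = ext)
  on.exit(unlink(path), add = TRUE)
  write_fn(obj, path)
  back <- read_fn(path)
  isTRUE(all.equal(obj, back))
}

# Stand-in object shaped loosely like an ape "phylo" list (an assumption,
# used only to have something to serialize).
toy_tree <- list(edge = matrix(c(3L, 3L, 1L, 2L), ncol = 2),
                 tip.label = c("A", "B"), Nnode = 1L)

ok <- round_trip_ok(toy_tree, saveRDS, readRDS, ext = ".rds")

# A lossy writer (drops everything but the edge matrix) should be caught.
lossy <- round_trip_ok(toy_tree,
                       function(x, path) saveRDS(x["edge"], path),
                       readRDS, ext = ".rds")
```

Running the same helper over a directory of ape trees and a directory of NeXML files would give the two starting points described above.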

Of course RNeXML doesn't completely implement all of NeXML (character matrices, networks, and other things still missing), so these are silently dropped when being read in. There are also a few things the schema validation does not check, like consistent use of ids (see #8 (comment) )

Post bugs as you hit them; that should help us flush out special cases. There are a few I can anticipate already (e.g. #9 ), which one of us should get around to.

add TSNs from species names using taxize

Goal 2 from Kseniia's list, mentioned in #21 (comment)

Will be a good example for the manuscript, once gov't shutdown ends and the taxize servers are back up...

Parsing citation information

In addition to adding useful metadata, it would be helpful to be aware of the ontologies most often used in existing NeXML, such that we can attempt to parse and serve this data to the R user in native R formats without assuming knowledge of the database. Doing so is sometimes challenging.

For instance, we might want to extract citation information, author, title, etc, and wrap it up as an R bibentry object, which existing R tools can then generate a bibtex file for, or format in various ways (such as text or html). Unfortunately, this is harder as a parsing activity than it is to serialize an R bibentry into NeXML, since we don't have a way to tell which dc:creators are authors of some other cited resource and which are just creators of the nexml file. (For instance, consider how we might annotate the bird.orders data set in ape -- the data is from Sibley, but compiled into ape by Paradis and written into nexml by us).

Currently, the get_citation function sidesteps this by just returning the content of dcterms:bibliographicCitation, since this data isn't otherwise hierarchical (though it could be).
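For reference, this is roughly what the target R object looks like. The sketch uses only `bibentry`, `person`, `toBibtex` and `format` from the utils package, with the field values taken from the "How to cite" entry in the README; how those fields would be recovered from NeXML meta elements is exactly the open question and is not shown.

```r
library(utils)

# Citation fields copied from the README's "How to cite" section.
ref <- bibentry(
  bibtype = "Article",
  title   = paste("RNeXML: A Package for Reading and Writing Richly Annotated",
                  "Phylogenetic, Character, and Trait Data in R"),
  author  = c(person("Carl", "Boettiger"), person("Scott", "Chamberlain"),
              person("Rutger", "Vos"), person("Hilmar", "Lapp")),
  journal = "Methods in Ecology and Evolution",
  year    = "2016",
  volume  = "7",
  pages   = "352--357",
  doi     = "10.1111/2041-210X.12469"
)

bib_lines <- toBibtex(ref)            # BibTeX rendering
txt <- format(ref, style = "text")    # plain-text rendering
```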

Attaching nexml metadata to phylo objects

Crazy idea: when reading into a phylo, should we create a new RNeXML environment, store the full nexml tree in there, and add a new slot to the phylo object storing an unevaluated get("<unique_tree_id>", envir=RNeXML) that methods could use to access the full NeXML??

This would let us do something like:

tr <- nexml_read("tree.xml")
metadata(tr)

instead of

tr <- nexml_read("tree.xml", type="nexml")
metadata(tr)

That is, reading in as the default (ape) type, and still calling functions that need the full nexml metadata, while also having an ape tree object that can still be passed around to the usual R packages.

Or maybe that's stupid and asking for trouble, and we should be explicit about what type of object we want.
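A minimal sketch of the idea, using base R only. All names here (`nexml_registry`, `attach_nexml`, `full_nexml`) are hypothetical, and the sketch stores a plain lookup key in an attribute rather than an unevaluated call in a slot, which is a simplification of what is proposed above.

```r
# Hypothetical registry: a private environment holding the full parsed NeXML.
nexml_registry <- new.env(parent = emptyenv())

# Stash the full record under `id` and tag the phylo-like object with that id.
attach_nexml <- function(phy, record, id) {
  assign(id, record, envir = nexml_registry)
  attr(phy, "nexml_id") <- id
  phy
}

# Accessor that a metadata() method could build on.
full_nexml <- function(phy) {
  id <- attr(phy, "nexml_id")
  if (is.null(id) || !exists(id, envir = nexml_registry, inherits = FALSE)) {
    stop("no NeXML record is attached to this object")
  }
  get(id, envir = nexml_registry, inherits = FALSE)
}

# Toy objects standing in for a real tree and a real parsed NeXML document.
tr <- structure(list(tip.label = c("A", "B")), class = "phylo")
tr <- attach_nexml(tr, list(meta = list(title = "example tree")), id = "tree1")
title <- full_nexml(tr)$meta$title
```

One risk with this design is that functions which rebuild the tree object may drop the extra attribute, at which point the link to the stored NeXML is silently lost.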

Where can I help?

@cboettig Is there anything I can help with?

Include `about` property to any node containing metadata annotation

Necessary so that extracted RDF nodes still refer to the appropriate point in the DOM. Currently we are not writing this, e.g. https://github.com/ropensci/RNeXML/blob/c04435b8a9759fdd3e21aa2527a7f454b3a166d7/inst/examples/ncbii.xml

Maybe we should just include an about attribute whenever we write an id attribute? Alternatively we have to check if an about attribute is present whenever we add a meta annotation to a node.
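The first option could look like the sketch below. `with_about` is a hypothetical helper operating on a named character vector of attributes, and the `#id` form of the `about` value is an assumption about the desired convention.

```r
# Hypothetical helper: add an `about` attribute pointing at the element's own id.
with_about <- function(attrs) {
  if (!"id" %in% names(attrs) || "about" %in% names(attrs)) {
    return(attrs)  # nothing to point at, or already annotated
  }
  c(attrs, about = paste0("#", attrs[["id"]]))  # "#id" form is an assumption
}

annotated <- with_about(c(id = "tree1", label = "example"))
unchanged <- with_about(c(label = "no id here"))
```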

hierarchical metadata?

Can a <meta> element have child meta elements? Do we have support for typeof and resource?

For example, one might want to declare a citation of a paper together with the data set in some really verbose SPAR RDFa as:

<li prefix="
        fabio: http://purl.org/spar/fabio/
        datacite: http://purl.org/spar/datacite/
        dcterms: http://purl.org/dc/terms/
        foaf: http://xmlns.com/foaf/0.1/
        prism: http://prismstandard.org/namespaces/basic/2.0/
        frbr: http://purl.org/vocab/frbr/core#
        ex: http://www.carlboettiger.info/example/" 
    about="ex:work-early-warning-signals" typeof="fabio:ResearchArticle">
    <span rel="dcterms:creator">
        <span about="http://www.carlboettiger.info#me" typeof="foaf:Person">
            <span property="foaf:givenName">Carl</span> 
            <span property="foaf:familyName">Boettiger</span>
        </span>, 
        <span about="http://two.ucdavis.edu/~me" typeof="foaf:Person">
            <span property="foaf:givenName">Alan</span>
            <span property="foaf:familyName">Hastings</span>
        </span>
    </span>
    <span rel="frbr:realization" resource="ex:expr-early-warning-signals"  typeof="fabio:JournalArticle">
        (<span property="fabio:hasPublicationYear">2012</span>).
        <span property="dcterms:title" rel="frbr:partOf" resource="ex:RSB-279-1748">Early Warning Signals and the Prosecutor's Fallacy</span> 
    </span>
    <span about="ex:RSB-279" typeof="fabio:JournalVolume" 
        property="prism:volume" rel="frbr:partOf" resource="ex:RSB">279</span>
    (<span about="ex:RSB-279-1748" typeof="fabio:JournalIssue" 
        property="prism:issueIdentifier" rel="frbr:partOf" resource="ex:RSB-279">1748</span>) 
    <span about="ex:man-early-warning-signals" typeof="fabio:PrintObject">
        <span property="prism:startingPage">4734</span>-
        <span property="prism:endingPage">4739</span>
    </span>. 
    <em><span about="ex:RSB" typeof="fabio:Journal" property="dcterms:title">Proceedings of the Royal Society B</span> </em>
    <a href="http://dx.doi.org/10.1098/rspb.2012.2085" typeof="fabio:WebPage">
        <span rel="frbr:embodimentOf">
            <span about="ex:expr-doi-redirect-metadata" 
                typeof="fabio:MetadataDocument" rel="frbr:realizationOf">
                    <span about="ex:work-doi-redirect-metadata" 
                        type="fabio:MetadataEntity" rel="frbr:subject" 
                        resource="ex:work-early-warning-signals">doi:</span>
                    <span about="ex:expr-early-warning-signals" property="prism:doi">10.1098/rspb.2012.2085</span>
            </span>
        </span>
    </a>
    (<a href="http://dx.doi.org/10.5061/dryad.2k462" typeof="fabio:Manifestation">
        <span rel="frbr:embodimentOf">
            <span about="ex:expr-data" 
                typeof="fabio:DataFile" rel="frbr:realizationOf">
                    <span about="ex:work-data" 
                        type="fabio:Dataset" rel="datacite:hasDescription" 
                        resource="ex:expr-early-warning-signals">dryad doi:10.5061/dryad.2k462</span>
            </span>
        </span>
    </a>)
</li>

Generate otus element when coercing from tree to nexml

Validator says we need meta element or otus element before a trees element. (see inst/tests/test_serializing.R).

When coercing from ape::phylo to S4 tree, we put tip.label (e.g. taxonomic names) in the otu attribute of the node labels. This can be extracted to generate the otus element (possibly just a coercion from ListOfnode class into otus, unless defining such a method is convoluted...).
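As a rough illustration of deriving otu records from tip labels (base R only; `make_otus` and the `ou1`, `ou2`, ... id scheme are assumptions for the sketch, not the package's behaviour):

```r
# Hypothetical helper: one otu record per unique tip label.
make_otus <- function(tip_labels) {
  labels <- unique(tip_labels)
  data.frame(id = paste0("ou", seq_along(labels)),  # id scheme is an assumption
             label = labels,
             stringsAsFactors = FALSE)
}

tips <- c("Homo sapiens", "Pan troglodytes", "Homo sapiens")
otus <- make_otus(tips)

# Map each tip back to its otu id, as the node's otu attribute would need.
tip_otu <- otus$id[match(tips, otus$label)]
```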

Need to think about user workflow in (optionally/automatically) providing additional annotation for these nodes (e.g. could query the names against taxize and add this data to the otus element).

Need also to think about strategy for post-hoc extending this annotation.

@rvosa It appears that annotation of species information could occur at the node level in the tree or at the otus level. Presumably most generic annotation about the taxonomic unit should be at the otus level?

add show and summary methods for nexml class

When to create a new otus block?

If I understand the schema correctly, nexml permits multiple otus blocks, though I haven't seen this used.

In writing a function to add characters data to an existing nexml object, we run into the following cases. Only the 3rd seems to involve a non-trivial case:

1. No otus node has yet been generated, so we generate one corresponding to the rownames in the matrix.
2. All rownames in the matrix already match otu elements in an existing otus block. Then we do nothing.
3. If only some match, we add the unmatched rownames from the character matrix as new otu entries in the (first) existing otus node (assuming the matches are found in the first otus node).

@rvosa It appears the schema supports multiple otus blocks, though I have only ever seen one used. Not sure why one would have multiple otus blocks, but it does raise a few questions about handling this third case:

So, I just wanted to make sure that we were handling this third case appropriately. We could instead add a new otus block in this case with all the otus for the characters, which would mean duplicate entries of certain otus. I'm not sure just how problematic that would be?

Because a characters node must refer to a single otus node for reference (I think), there's no point in checking for matches across multiple otus sets, or writing only the unmatched otu labels into a separate otus node. I assume it's no trouble to have more otu nodes in an otus block than any one trees or characters block actually needs?
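The three cases can be sketched as a single decision function. `update_otus` is hypothetical: it works on plain character vectors of labels rather than on the S4 otus objects, and it treats "no rownames match at all" the same way as the third case.

```r
# `otus`: labels in the (first) existing otus block, or NULL if none exists.
# `taxa`: rownames of the character matrix being added.
update_otus <- function(otus, taxa) {
  if (is.null(otus) || length(otus) == 0) {
    return(list(case = 1L, otus = unique(taxa)))   # create a new block
  }
  unmatched <- setdiff(taxa, otus)
  if (length(unmatched) == 0) {
    return(list(case = 2L, otus = otus))           # nothing to add
  }
  list(case = 3L, otus = c(otus, unmatched))       # append unmatched taxa
}

r1 <- update_otus(NULL, c("A", "B"))
r2 <- update_otus(c("A", "B", "C"), c("A", "B"))
r3 <- update_otus(c("A", "B"), c("B", "D"))
```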

Provide a concatenate method for nexml class

One can generate a nexml file with multiple characters or multiple trees just by passing in a list of these things, e.g. add_characters(list_of_characters, nexml), add_trees(list_of_trees, nexml). But sometimes it might be useful to concatenate directly. R defines the c method to concatenate objects, so perhaps we could provide a method that concatenates nexml objects (e.g. appends trees and characters). It would raise an error if the id values were not all unique.
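A minimal sketch of such a method on a toy S4 class. `ToyNexml` is a stand-in invented for this example (trees held in a list named by id); the real nexml class has a much richer structure, so this only shows the dispatch and the uniqueness check.

```r
library(methods)

# Toy stand-in for the nexml class: trees are kept in a list named by id.
setClass("ToyNexml", slots = c(trees = "list"))

setMethod("c", "ToyNexml", function(x, ...) {
  parts <- list(x, ...)
  # Concatenate the per-object tree lists, keeping their id names.
  trees <- do.call(base::c, lapply(parts, function(p) p@trees))
  ids <- names(trees)
  if (anyDuplicated(ids) > 0) {
    stop("duplicate ids: ",
         paste(unique(ids[duplicated(ids)]), collapse = ", "))
  }
  new("ToyNexml", trees = trees)
})

a <- new("ToyNexml", trees = list(t1 = "(A,B);"))
b <- new("ToyNexml", trees = list(t2 = "(C,D);"))
ab <- c(a, b)
```

An alternative to erroring on duplicate ids would be to regenerate ids on concatenation, at the cost of breaking any external references to them.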

Conversions for DNA character data into/from ape::DNAbin class

Probably not a high priority; might be worth discovering users of the DNAbin class to get feedback first.

(When to) substitute labels for ids? (character matrices)

In reading nexml to phylo objects, we take the liberty of converting tip node otu ids (the otu attribute on the nodes) to the taxonomic labels (the label attribute on the otu node with the matching id).

This runs the risk of assuming the otu has a label attribute. If it doesn't, we use the id instead. (I believe "label" is always an optional attribute while "id" is always required when present -- at least for otu elements?)

It's unclear just how far to go with this logic, particularly given that labels are optional. For instance, in extracting the character matrix, it makes sense to replace otu id numbers (used as row names) with the actual species names (label on the matching otu node).

Likewise, I've applied this logic for characters. Characters end up as columns in the character matrix. When we first do the extraction, columns are named by the character id values (just as rows are labeled by otu ids to begin with). It seems to make sense to convert these to the label provided in the matching char element (in format), if available. Does that make sense? Or should we be doing a different mapping?

Lastly, we can apply this logic to states. States are the values/cells of the matrix. For continuous characters, it seems there is no states element in the format, and the numeric values are already what we want. But for discrete states, perhaps we should be replacing the ids (e.g. s1, s2 in your example) with the symbol attributes in the state elements? Or maybe not?

Automating these mappings makes sense, since it matches the matrix users expect to get out. While the id data is still available in the S4 object (and in the parsed XML of course), I'm not sure whether we should be discarding the ids in this extraction process.
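The id-to-label substitution with fallback can be sketched in base R as follows. `ids_to_labels` is a hypothetical helper, and the toy matrix and lookup vectors stand in for values that would come from the otu and char elements.

```r
# Hypothetical helper: swap ids for labels, keeping the id where no label exists.
ids_to_labels <- function(ids, lookup_ids, lookup_labels) {
  labels <- lookup_labels[match(ids, lookup_ids)]
  ifelse(is.na(labels) | labels == "", ids, labels)
}

m <- matrix(c(0, 1, 1, 0), nrow = 2,
            dimnames = list(c("ou1", "ou2"), c("c1", "c2")))

# Rows: ou2 has no label, so its id is kept.
rownames(m) <- ids_to_labels(rownames(m), c("ou1", "ou2"), c("Homo sapiens", NA))
# Columns: c2 has no matching char element, so its id is kept.
colnames(m) <- ids_to_labels(colnames(m), "c1", "reef-dwelling")
```

Since labels are not guaranteed to be unique, a real implementation would probably also need to decide what to do when two ids map to the same label.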

Handle alternative namespaces on attribute values

Question for @duncantl

The XML tools respect namespace definitions on XML-level content, but view attribute values as strings. Consequently, when parsing an element such as

    <meta xsi:type = "nex:LiteralMeta" id = "m1"/>

What should we be doing such that we can successfully identify this as a meta element of type 'http://www.nexml.org/2009/LiteralMeta' regardless of whether the prefix is nex = http://www.nexml.org/2009 or something else?

It appears XML tools do respect the xsi prefix as just a prefix and not a string. newXMLNode warns if the prefix is not defined, as below:

m = newXMLNode("meta", attrs=c("xsi:type" = "nex:LiteralMeta", id = "m1"), namespaceDefinitions=c(xsi="http://www.w3.org/2001/XMLSchema-instance"))

and if we setClass with a slot named without the prefix, the default xmlToS4 handles it just fine:

setClass("meta", slots=c(type="character"))
xmlToS4(m)

returning:

An object of class "meta"
Slot "type":
[1] "nex:LiteralMeta"

Now it would be nice if we had a way to resolve the prefix appropriately before checking the value of the string. Currently RNeXML assumes the prefix is nex, e.g.

if(meta@type == "nex:LiteralMeta")
...

@duncantl How would we modify this statement to use whatever namespace is defined for nex, if possible? Should we just manually be splitting all attribute strings on the : symbol and querying against the namespaces list or is there something more clever?

(Note that these namespaces assume the role of proper XML namespaces when we do RDFa extraction to get an RDF file).
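The manual approach (split on the colon, look the prefix up in the namespace definitions) might look like the sketch below. `expand_qname` is hypothetical and uses base R only; the namespace map is passed in as a named character vector rather than read from a parsed document.

```r
# Hypothetical helper: expand "prefix:local" using a prefix -> URI map.
expand_qname <- function(value, namespaces) {
  parts <- strsplit(value, ":", fixed = TRUE)[[1]]
  if (length(parts) != 2 || !parts[1] %in% names(namespaces)) {
    return(NA_character_)  # no prefix, or prefix not declared
  }
  uri <- namespaces[[parts[1]]]
  # Join with "/" unless the namespace URI already ends in "/" or "#".
  sep <- if (grepl("[/#]$", uri)) "" else "/"
  paste0(uri, sep, parts[2])
}

ns_a <- c(nex = "http://www.nexml.org/2009")
ns_b <- c(n = "http://www.nexml.org/2009")   # same namespace, different prefix
target <- "http://www.nexml.org/2009/LiteralMeta"

same_a <- expand_qname("nex:LiteralMeta", ns_a) == target
same_b <- expand_qname("n:LiteralMeta", ns_b) == target
```

The type check would then compare expanded URIs instead of the literal string "nex:LiteralMeta".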

nexmlTree class when as(obj, "phylo") is a list

nexml can contain a single <tree> node inside the <trees> node, multiple <tree> nodes in a single <trees> node, or even multiple <trees> nodes. The first two cases map naturally onto a "phylo" object and a "multiPhylo" object, defined as classes in ape. In order to preserve the associations, we map the third case (multiple <trees> nodes) to a list of multiPhylo objects, which isn't something immediately recognizable as a phylogenetic tree class.

Consequently, we make this mapping automatically when asked to coerce a nexml file into a phylo, even though this means technically not returning the requested class (ask for phylo and get multiPhylo). We should probably handle that differently.

For instance, this currently creates a problem when coercing a nexml file with multiple <trees> nodes into a nexmlTree class (because this includes a conversion to phylo). Because nexmlTree is the default read-in format now, this can cause reading in of such nexml files to fail.

Obviously it would be nice for a user to convert nexml to the ape classes without having to know how many trees are in the nexml file. Need to figure out the best way to do this.
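The current three-way mapping can be written down as a small sketch. `flatten_trees` is hypothetical; it takes a list of trees blocks (each a list of tree objects) and only sets class attributes, without requiring ape.

```r
# `blocks`: a list of <trees> blocks, each itself a list of tree objects.
flatten_trees <- function(blocks) {
  as_multi <- function(trees) structure(trees, class = "multiPhylo")
  if (length(blocks) == 1) {
    trees <- blocks[[1]]
    if (length(trees) == 1) return(trees[[1]])   # single tree: phylo
    return(as_multi(trees))                      # one block: multiPhylo
  }
  lapply(blocks, as_multi)                       # many blocks: list of multiPhylo
}

t1 <- structure(list(tip.label = c("A", "B")), class = "phylo")  # toy tree
one    <- flatten_trees(list(list(t1)))
many   <- flatten_trees(list(list(t1, t1)))
nested <- flatten_trees(list(list(t1), list(t1, t1)))
```

Writing it this way makes the problem visible: the return class depends on the file's contents, so a caller cannot know in advance which of three types comes back.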

Fleshing out the class slots

Lots of my class definitions are missing slots for optional attributes. This doesn't break anything, we simply don't read that information out of the XML and into the S4 when we don't find a slot for it (we = xmlToS4 method). Ideally we'll want to support the full schema, so all classes should be expanded. (The cool thing about the S4 approach is this is relatively easy to do without breaking anything).

Generate MIAPA checklist-compliant nexml

RNeXML should optionally be able to include all the basic metadata listed on the MIAPA checklist, hopefully guiding users that are unfamiliar with the process and being able to provide reasonable automated suggestions when possible (e.g. suggesting external identifiers based on OTU labels, #24). A function might be provided that could check (and perhaps summarize/return) MIAPA compliance.

I've reproduced the checklist below with notes added on how we're doing in RNeXML.

For each item, I've either made a note on if/how we handle it in NeXML, or a question when I'm unsure how to handle it. For instance, I can sometimes find a corresponding block in the example files in the miapa repo, but they are in OWL and the translation to NeXML's meta/RDFa isn't clear to me. An example nexml file that satisfies all these requirements would be super helpful to me.

Topology

The topology itself, possibly as an identifier of a database (such as TreeBASE) record. Included in the nexml tree node.

Is this a gene tree or species tree? Do we use the treebase namespace to define this, or is there a better alternative?

<meta content="Species Tree" datatype="xsd:string" id="meta24059" property="tb:kind.tree" xsi:type="nex:LiteralMeta"/>
<meta content="21" datatype="xsd:integer" id="meta24062" property="tb:ntax.tree" xsi:type="nex:LiteralMeta"/>
<meta content="Unrated" datatype="xsd:string" id="meta24061" property="tb:quality.tree" xsi:type="nex:LiteralMeta"/>

Is it a tree or a network? nexml defines this by using <tree> or <network>.
Is the topology rooted or not? In nexml, defined by an attribute root="true" on a member node. Should we consider declaring this in metadata too?
The type of consensus if this is a consensus topology (that summarizes the topology inference in some way, rather than being directly provided by the inference method).

Do we use the treebase namespace for this as well? e.g.
```
<meta content="Consensus" datatype="xsd:string" id="meta24060" property="tb:type.tree" xsi:type="nex:LiteralMeta"/>
```
The topology should be "well described", as applicable to the inference method being used. For example, a likelihood for maximum likelihood analysis. For Bayesian analyses this should also include the burn-in period excluded, and the convergence of the chain(s). This may also include more than one topology, for example a sample from the posterior probability distribution for Bayesian, or equally scoring topologies for a maximum parsimony analysis. Examples?

OTUs:

All terminal nodes should be appropriately labelled and referenced in one of the following ways. Internal nodes need not be.

A meaningful external identifier (a combination of database or resource and identifier/accession within that database).
We generate with taxize, #24
For specimens, museum, collection (if applicable), and specimen identifier. Alternatively, if a specimen is not in a museum collection, use the laboratory, laboratory collection, and accession within that collection.
Precise (GPS) georeferences for specimens are highly desirable (but not always available).
Branch lengths: Some measure of branch length is required unless it is not applicable to the analysis method. Further semantics of the measure should be implied by the tree inference method. The length attribute in nexml is sufficient.
Branch support: Some value of branch support should be provided, for example posterior probability, or bootstrap value, unless it is not applicable to the analysis method. meta annotation of edge node. example?

Character matrix:

I note that this description refers entirely to the character matrix as the data from which the tree was derived. It appears that the MIAPA standard doesn't refer to comparative trait data. Further, it may not always be desirable to include a copy of the character matrix in the data file; perhaps stating where that alignment can be found in a separate file would suffice?

aligned data matrix that is the basis for the tree (by having been the input for the tree inference method)

MIAPA shows an example of how to state that the tree wasDerivedFrom the alignment; I'm not sure what the corresponding RDFa in the nexml would look like.

 <owl:NamedIndividual rdf:about="&Peters2011hymenoptera;tree0000001">
        <rdf:type rdf:resource="&obo;CDAO_0000012"/>
        <rdf:type rdf:resource="&obo;CDAO_0000073"/>
        <prov:wasGeneratedBy rdf:resource="&annot;InferenceOfPetersTree"/>
        <prov:wasDerivedFrom rdf:resource="&annot;PetersAlignment"/>
    </owl:NamedIndividual>

Data type must be provided, for example DNA, RNA, protein, morphology, etc.
For molecular matrices, the accession numbers (and respective database(s) if different from Genbank) of the sequences used for each row must be provided.
a mapping that relates each row identifier to a tip of the topology (otu attribute present on row)
a mapping that relates each accession number or specimen identifier to a row label (inverse of the above map)

Alignment method

name of software used, version of program

MIAPA defines that the alignment wasGeneratedBy some software.

    <owl:NamedIndividual rdf:about="&annot;PetersMUSCLEAlignmentActivity">
        <rdf:type rdf:resource="&edamontology;operation_2928"/>
        <rdf:type rdf:resource="&obo;MIAPA_0000003"/>
        <prov:wasAssociatedWith rdf:resource="&annot;Muscle"/>
        <prov:used rdf:resource="&obo;MIAPA_0000013"/>
    </owl:NamedIndividual>

parameters used (or default if default values were used).
whether alignment was manually corrected or edited

Character trait data

This is not part of the draft MIAPA standard, but merely my own suggestions/brainstorm list, based on the required metadata for EML description of character traits

character trait name (Or trait label/definition pair)
possible states a discrete trait can have
units (for continuous traits)
methodological description of how the trait was measured

Tree inference method

name of software used, version of program

    <owl:NamedIndividual rdf:about="&annot;RaXML_7.2.8">
        <rdf:type rdf:resource="&obo;MIAPA_0000016"/>
        <rdfs:label>RAxML_7.2.8</rdfs:label>
        <swo2:SWO_0000740 rdf:resource="&annot;UseMaximumLikelihood"/>
        <swo:SWO_0004000 rdf:resource="&obo;MIAPA_0000017"/>
    </owl:NamedIndividual>

parameters used, including model of evolution, and optimality criterion

 <owl:NamedIndividual rdf:about="&annot;UseMaximumLikelihood">
        <rdf:type rdf:resource="&obo;MIAPA_0000015"/>
        <rdfs:label>Maximum Likelihood algorithm</rdfs:label>
        <dc:description>The inference algorithm uses maximum likelihood as an optimality criterion. </dc:description>
    </owl:NamedIndividual>

character weights if (normally then morphological) characters were weighted.

Compatibility with read.nexus.data and write.nexus.data

Might imagine a user would want to extract character data that comes in nexus format (used by http://www.morphobank.org/, for instance) into NeXML.

Unfortunately, though ape provides read.nexus.data, the function is extraordinarily limited. For instance, it cannot handle spaces in species names or sequences, and appears to return only a list of elements named by taxa without any metadata about the traits themselves, at least in reading in my hand-edited nexus matrix from morpho. read.nexus.data also fails to read in (after a minute of cpu effort) even the example nexus file produced by ape's own write.nexus.data ...

Guessing these ape functions are not widely used. Perhaps there's a better alternative pipeline for getting the nexus data into NeXML that we can wrap?

Better merging of character data frames

@schamberlain Thought I might tap some of your plyr/reshape2 awesomeness here:

get_characters_list extracts a list of data.frames. Each data frame has taxon names as the rownames and character traits as colnames, and may have one or more columns depending on how many traits are in the dataset. The list has multiple data frames if either they correspond to different sets of taxa, or if they are different kinds of traits (continuous vs discrete). For example:

f <- system.file("examples", "comp_analysis.xml", package="RNeXML")
nex <- read.nexml(f)
char_list <- get_characters_list(nex)

returns a list of length 2, with a data frame for continuous trait, and a data.frame for a discrete trait (factor).

> char_list
$cs15
         log snout-vent length
taxon_8             -3.2777799
taxon_9              2.0959433
taxon_10             3.1373971
taxon_1              4.7532824
taxon_2             -2.7624146
taxon_3              2.1049413
taxon_4             -4.9504770
taxon_5              1.2714718
taxon_6              6.2593966
taxon_7              0.9099634

$cs31
         reef-dwelling
taxon_8              0
taxon_9              1
taxon_10             0
taxon_1              1
taxon_2              0
taxon_3              0
taxon_4              0
taxon_5              1
taxon_6              1
taxon_7              1

How best to merge these into a single data.frame:

in the general case of N list entries, and without assuming the rownames are in the same order?

(If the rownames are in the same order, we can do do.call(cbind, char_list).)

The rownames could correspond to different taxa entirely. How then would we bind them into a single data.frame, such that each unique taxon gets a new row (and perhaps any taxon found in both lists would be combined)?

I tried hacking this with ldply, see

RNeXML/R/characters.R

Line 38 in 2f07e2f

get_characters <- function(nexml){

but that doesn't collapse the rows when the columns match, and has that ugly for loop too.

Thanks!
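One base R option for the general case is an outer join on row names, folded over the list with `Reduce`. This is only a sketch under stated assumptions: `merge_by_rownames` is a hypothetical name, the toy data frames use made-up values and syntactic column names, and columns that share a name across list entries are not collapsed (merge will suffix them with .x and .y instead).

```r
# Outer-join a list of data frames on their row names.
merge_by_rownames <- function(dfs) {
  Reduce(function(a, b) {
    m <- merge(a, b, by = "row.names", all = TRUE)  # keep taxa from both sides
    rownames(m) <- as.character(m$Row.names)        # restore taxa as row names
    m$Row.names <- NULL
    m
  }, dfs)
}

# Toy inputs: different taxa, different order, one taxon in common.
cont <- data.frame(log_svl = c(1.5, 2.5), row.names = c("taxon_1", "taxon_2"))
disc <- data.frame(reef_dwelling = c(1L, 0L), row.names = c("taxon_2", "taxon_3"))

merged <- merge_by_rownames(list(cont, disc))
```

Taxa missing from one of the inputs get NA in that input's columns, and row order follows merge's sorting of the row names rather than the original order.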

more metadata use cases

Many R-based tools need ultrametric / time-calibrated phylogenies. R also provides several tools to do this. A good use case for metadata reading and writing might be to work out what metadata we might add if we: read in an uncalibrated phylogeny, use a given function (and parameter choice potentially) in a given software to perform the time-calibration, and then write out the time-calibrated tree. For instance, we might annotate:

statement that tree is ultrametric
software used to calibrate the tree
function call used (with parameters, maybe CDATA the r code??)
reference to source file on which calibration was based
... what else?

Tackling "special cases"

Warn or error on special cases (e.g. some valid nexml trees probably cannot be converted to completely valid phylo objects; for instance, a network with horizontal transfer events).

Navigating ontologies

Thanks everyone for suggesting various ontologies we can use to start making our R-generated nexml more expressive (partly for my own record, I've compiled a list of those suggested in the issues tracker and replies to my nexml-discuss query below). However, lacking experience in this area, I haven't been very successful at finding the terms I need in these. Often I don't know where to start looking, and clearly not all of these are populated.

For starters, it would be good to have a list of ontologies most commonly used in nexml files, as it would make sense to parse these for R users in cases where their interpretation might not be obvious. I've started with those used by TreeBase (dc, prism, etc), e.g. #25.

Still looking for lots of particular terms, e.g. the open check boxes from issue #21 .

Any suggestions on how to go about discovering a term that I might need?

For example, I can try to skim something like: http://cdao.cvs.sourceforge.net/viewvc/cdao/cdao/OWL/cdao.owl?revision=1.34 for a useful term. e.g., it looks like I might declare the tree to be time calibrated with the term: http://www.evolutionaryontology.org/cdao/1.0/cdao.owl#TimeCalibratedLengthType, but I'm not sure quite how to do that. Adding a meta element to every edge element stating the same thing doesn't seem ideal... And scanning the owl file by hand for a term doesn't seem ideal either...

Ontologies mentioned so far:

Ones we include by default (e.g. from TreeBase), which I think I mostly understand pretty well (except for cdao)

"nex"   = "http://www.nexml.org/2009",
"xsi"   = "http://www.w3.org/2001/XMLSchema-instance",
"xml"   = "http://www.w3.org/XML/1998/namespace",
"cdao"  = "http://www.evolutionaryontology.org/cdao/1.0/cdao.owl#",
"xsd"   = "http://www.w3.org/2001/XMLSchema#",
"dc"    = "http://purl.org/dc/elements/1.1/",
"dcterms" = "http://purl.org/dc/terms/",
"prism" = "http://prismstandard.org/namespaces/1.2/basic/",
"cc"    = "http://creativecommons.org/ns#",

Additional ones recently mentioned, which I don't have a good grasp for exactly what kind of terms they provide or how to characterize their intersection...

https://github.com/phylotastic/ontologies

which also points to:

Darwin core,

Karen suggests: http://opentree.wikispaces.com/NexSON
Rutger mentioned: http://edamontology.org/EDAM.owl

(will continue to update this as my running list)

Conversion between EML and NeXML (at least for character data)

Given a csv (or excel?) file containing (phenotypic) character data and an associated EML metadata file describing it, can we completely serialize this csv file and associated metadata into RNeXML characters nodes?

Slots vs. representation?

The documentation for setClass states

S3methods, representation, access, version: All these arguments are deprecated from version 3.0.0 of R and should be avoided.

I noticed that you are using representation whereas in spocc I have been using slots. I don't know which is right, but it looks like combo of slots and contains is recommended

representation is an argument inherited from S that included both slots and contains, but the use of the latter two arguments is clearer and recommended.
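For comparison, the recommended form looks like this. The class names are made up for the example; only `setClass`, `new` and `is` from the methods package are used.

```r
library(methods)

# Declare slots and inheritance with separate arguments
# instead of a single representation() call.
setClass("ToyBase", slots = c(id = "character"))
setClass("ToyLabelled", contains = "ToyBase", slots = c(label = "character"))

x <- new("ToyLabelled", id = "n1", label = "node one")
inherits_base <- is(x, "ToyBase")
```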

character matrix strategies

NeXML, like nexus, can contain a lot of character matrix (e.g. sequence) data. Current approach, like the read.nexus functions, simply ignores this.

We will want to be able to read and manipulate R objects without having to carry around the weight of the character data. This can probably be controlled through the read_nexml top-level api, e.g. nexml_read("file.nexml", type="phylo") vs type = character_matrix.

We need to figure out what R object we want to coerce character matrices to, if any. Not familiar with many R functions that use sequence data, so learning what functions exist and what formats they expect would be a first step. Meanwhile, we will presumably just read it into our S4 object equivalent. Methods can always be added later.

Comparative methods, which dominate the phylogenetics R tools, have the notion of character data as well, but usually as phenotypic data that is not meant to be informative of the tree inference, has no notion of alignment, etc. It would seem strange to represent this data in the NeXML in the same way. @rvosa What is the best way to go about this? Presumably this is related to the phenoscape project, but I haven't looked at that. Advice / strategies welcome.

Write a knitr readme / vignette showing basic reading and writing

write.nexml methods for multiPhylo objects

We can read in multiPhylo, but don't convert multiPhylo to single nexml:

f <- system.file("examples", "trees.xml", package="RNeXML")
trees <- read.nexml(f, "phylo") # reads in multiPhylo of 2 trees
write.nexml(trees) # errors

blog post?

Hi guys,

I am quite keen to write a little blog post on http://biophylo.blogspot.com about your work. Would that be OK? Do you feel comfortable exposing this to my vast, vast readership?

Rutger

ape conversion should handle case of some missing branch lengths

Some trees (e.g. Treebase "S10327.xml", "S10334.xml", ...) have branch lengths defined for only some edges, leading to the errors seen in #15. RNeXML should simply write these as NAs when converting to the phylo type. Currently this generates an error.

Does the current conversion to ape lose element id information?

Does the current conversion to ape lose element id information? Can we program around this while still maintaining valid phylo objects?

E.g. once I have coerced a nexml file to a phylo tree and identified a node of interest in the phylo object, can I unambiguously query for metadata on that node id, or the corresponding otu?

Related: trying to wrap my head around the use of having both label and id attributes on most elements...

phylobase strategies

We'll want to support coercion into phylobase trees (phylo4, etc).

On one hand, since these trees are S4 objects, are more carefully defined than ape::phylo trees, richer, and more extensible, we could build our entire strategy around going from RNeXML::nexml => phylobase::phylo4 => ape::phylo.

On the other hand, from speaking with Ben Bolker at ESA, phylobase uptake is low (though it does have reverse dependencies), and more troublingly, NCL (Nexus Class Library) is kind of a weak point that has made maintenance challenging.

I suspect the ideal solution given this issue is to support direct coercion into ape::phylo, and include phylobase as "suggests" only, with separate methods to go from nexml => ape::phylo and nexml => phylobase. Package could then be used without installing phylobase, which would only be needed for the phylobase methods.

Obviously we will want to write a phylobase coercion method in any event. Can reuse many of the same elements, but will also have slots for additional data which could make our life easier (or harder -- in the end it might be better to support adding annotations directly to the RNeXML S4 objects, which map 1:1 to the schema, rather than first figuring out how they map to phylobase objects...)

Thoughts?

Implement R support for standard metadata elements

There's a variety of meta elements we might want to attach to the top of most nexml documents we write

Creator, with contact information
(Not providing contact info in default configuration due to spam concerns. Contact information can be added as additional metadata, including things like foaf:homepage or foaf:account = "https://github.com/cboettig")
Title
Description
License declaration (e.g. CC0)
Timestamp
Information about where the file is released / published (e.g. Github, Dryad, etc)
Journal citation information of an associated article

This will give us some good practice writing RDFa style meta elements before we tackle more serious annotation. @rvosa @hlapp might you point to some good model NeXML files we can template off of for these? (E.g. the treebase nexmls are a good example of journal citation info, and a couple of others. Not sure if I've seen a license example.) Clearly this involves identifying the ontology for most terms here (though Dublin Core may cover most of it).

Some of these (license, timestamp) we might consider adding by default(?).

A good implementation will allow the user to pass R's native objects where relevant (e.g. person, citation) to be included.
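A rough sketch of what passing native objects could look like, using only utils::person and utils::bibentry. The mapping to Dublin Core style property names below is an assumption for illustration, not a settled design, and nothing here calls RNeXML.

```r
# Native R objects for the creator and the associated article.
creator <- person(given = "Carl", family = "Boettiger",
                  role = c("aut", "cre"))

paper <- bibentry(
  bibtype = "Article",
  title   = paste("RNeXML: A Package for Reading and Writing Richly",
                  "Annotated Phylogenetic, Character, and Trait Data in R"),
  author  = c(person("Carl", "Boettiger"), person("Scott", "Chamberlain"),
              person("Rutger", "Vos"), person("Hilmar", "Lapp")),
  journal = "Methods in Ecology and Evolution",
  year    = "2016",
  volume  = "7",
  pages   = "352--357",
  doi     = "10.1111/2041-210X.12469"
)

# Hypothetical flattening into property/content pairs that a
# meta-writing function could consume. Property names are assumed.
meta_terms <- list(
  "dc:creator" = format(creator, include = c("given", "family")),
  "dc:title"   = paper$title,
  "dc:date"    = paper$year
)
```

The point is that the user supplies person and bibentry objects and the package, not the user, decides which ontology terms they map to.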

import "phylo" and "multiPhylo" classes

I think we need an importFrom(ape, "phylo") etc. in the NAMESPACE, but I'm not quite sure if that's correct or what the roxygen would look like.

Resolution of this issue should deal with the warnings on document and install (on clean R session without ape already loaded) about these classes not being defined.

Coercion between S4 and S3 classes feels a bit funny, but it seems such methods are provided in phylobase...
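One approach I'm aware of for the S4/S3 mismatch is methods::setOldClass, which registers an S3 class name so it can appear in S4 signatures and in setAs. A minimal sketch, assuming a made-up S4 class tree_sketch; the list it returns is not a valid ape phylo object, it only carries the class attribute to show the dispatch.

```r
library(methods)

# Register the S3 class names so S4 machinery knows about them.
setOldClass("phylo")
setOldClass("multiPhylo")

# Hypothetical S4 class standing in for the nexml tree class.
setClass("tree_sketch", slots = c(tips = "character"))

# Coercion from the S4 class to the registered S3 class.
setAs("tree_sketch", "phylo", function(from) {
  # Placeholder structure only; a real method would build
  # edge, Nnode, tip.label, etc. as ape expects.
  structure(list(tip.label = from@tips), class = "phylo")
})

p <- as(new("tree_sketch", tips = c("t1", "t2")), "phylo")
```

Whether this alone silences the warnings seen on document and install would need checking in a clean session.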

Interpreting comparative data in nexml files

I'm intrigued by @rvosa's suggestion of showing how RNeXML can be used to document both the character data and phylogenetic data used in comparative phylogenetics, which accounts for many of the R package use cases.

However, there may be a bit of a cultural "stereotype" to overcome in pitching this use case. From my own interactions I'm under the impression that most researchers assume that any character data in a NeXML file is that which is coded for and used for phylogenetic inference of the tree below it. I am afraid researchers might be hesitant to write comparative trait data to a nexml file for fear of making it look like their beautiful tree was inferred from some tiny character data set, instead of from some big file of sequence data.

I'll ask some fellow practitioners about this, but perhaps there is something we can add metadata-wise to indicate that the phylogeny was/wasn't inferred from the character data provided?

Any thoughts on this?

Understanding set and referencing

@rvosa @hlapp I'm not really clear what attributes belong to Base. I've stuck xsi:type there because it seemed to make sense....

I'm still confused about set. I don't see why I'd want to use this; maybe I need a more fleshed-out example?

I'm also not exactly sure how I tell if something is "referenced" anyway: I realize that <node otu = "t1"/> "references" <otu id="t1"/>, so that the otu element must appear first; correct?

But what is it that makes this a reference? I can think of two possible differing definitions:

The fact that node has both an attribute matching the name of an element and the attribute has a value matching the attribute of an element of that name?
Or is it sufficient just to have any attribute whose value matches the id attribute of another element (e.g. does <foo bar="t1"/> reference the <otu id ="t1"/> element?)

rooted vs unrooted trees

Rooted trees can become unrooted in converting from ape to nexml and back. Make sure to handle this more carefully!

How would we express a simmap tree in NeXML?

Tree "paintings" indicating different evolutionary regimes/modes along different parts of a phylogeny are a particularly common use case for comparative phylogenetics in R.

The current 'standard' seems to be simmap's representation, e.g. see Liam's description: http://blog.phytools.org/2010/12/reading-simmap-trees-into-r.html

Presumably this would be done with meta annotations on each edge indicating the state and length of time in that state along each branch?
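To make the question concrete, here is a sketch of the information that would need to be serialized. It assumes the simmap convention, as I understand it, of a maps list with one named numeric vector per edge giving the time spent in each state in order along the branch. The values are invented, and the long table is only one hypothetical shape the per-edge annotations could take.

```r
# Toy "maps" list: one entry per edge, names are states,
# values are durations in that state along the branch.
maps <- list(
  c(A = 0.3, B = 0.7),  # edge 1 switches from A to B
  c(B = 1.0),           # edge 2 stays in B
  c(A = 0.5)            # edge 3 stays in A
)

# Flatten to one row per (edge, segment): the kind of record a
# meta annotation on each edge would have to carry.
segments <- do.call(rbind, lapply(seq_along(maps), function(i) {
  data.frame(edge     = i,
             order    = seq_along(maps[[i]]),
             state    = names(maps[[i]]),
             duration = unname(maps[[i]]),
             stringsAsFactors = FALSE)
}))
```

Note that segment order matters as well as state and duration, so a flat set of unordered annotations per edge would lose information.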

Any ideas or existing examples?

Don't error if we cannot convert only some <tree> elements in a list of <trees> to a multiPhylo

As documented in #15, we should probably only provide a warning and just drop the trees that cannot parse rather than failing to convert the whole collection.
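A minimal sketch of the warn-and-drop behaviour using tryCatch. The names convert_trees and toy_convert are hypothetical stand-ins, and the toy trees are plain lists rather than nexml objects.

```r
# Apply a converter to each tree; on error, warn and drop that tree
# instead of failing the whole collection.
convert_trees <- function(trees, convert_one) {
  converted <- lapply(seq_along(trees), function(i) {
    tryCatch(
      convert_one(trees[[i]]),
      error = function(e) {
        warning("dropping tree ", i, ": ", conditionMessage(e),
                call. = FALSE)
        NULL
      }
    )
  })
  # Keep only the trees that converted successfully.
  Filter(Negate(is.null), converted)
}

# Stand-in converter that fails on a tree with no edges,
# mimicking a metadata-only <tree> element.
toy_convert <- function(tree) {
  if (length(tree$edges) == 0) stop("tree has no nodes or edges")
  tree$edges
}

trees <- list(list(edges = c(1, 2)),
              list(edges = numeric(0)),
              list(edges = 3))
kept <- suppressWarnings(convert_trees(trees, toy_convert))
```

Here the second toy tree is dropped and the other two are kept.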

This happens, for instance, when some of the trees contain only metadata (no nodes or edges).

Fleshing out the remaining classes

Lots of classes / XML node types are not yet written. It would be good to write these for some practice (@rvosa might be able to suggest which ones to prioritize? Or just look at the sample nexml files in inst/examples/ to see any nodes we haven't defined ...) In principle, once Duncan has the XMLSchema package robustly working, we can generate these classes, along with coercion methods to write them back to XML etc, automatically, which will save a lot of effort. But since it's quite mechanical, writing them by hand doesn't take very long.

online validator tests?

@rvosa Is the validator tool on the nexml website, http://www.nexml.org/nexml/phylows/validator, doing more checks than the standard XML schema validation? If so, perhaps we could wrap this as an R function.

For example, this is true of the EML validator, http://knb.ecoinformatics.org/emlparser/, which checks that any "references" node matches some corresponding "id", etc. For that reason we wrapped the validator in R using RHTMLForms, though a programmatic interface to the validator with structured output would be preferable.

On the other hand, if it's just validating against the NeXML schema we can do that internally with R XML tools.

Implement metadata extraction/parsing tools

We will want either/both of:

A function to extract all the metadata nodes from the S4 object and summarize this data neatly (more obvious for top-level metadata, less clear how to present the annotations of individual nodes)
RDFa based tool for extraction/reasoning on the ontological terms?

Additionally, might allow some automatic calculation of metadata such as the number of trees in a file, number of taxa, names of taxa, etc (e.g. along the lines of the TreeBase metadata) / PhyloWS terms. Perhaps we should add this summary data in meta nodes or is that asking for trouble?

Really need to enumerate the use-cases for leveraging this metadata (may involve thinking more about additional metadata we want to add).

concatenate method for meta elements

Should be able to concat meta elements instead of new("ListOfmeta", list(meta1, meta2))
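A sketch of how a c method could do this. The classes meta_sketch and ListOfmeta_sketch are simplified stand-ins defined here so the example is self-contained; the real meta class has more slots.

```r
library(methods)

# Simplified stand-ins for the meta and ListOfmeta classes.
setClass("meta_sketch",
         slots = c(property = "character", content = "character"))
setClass("ListOfmeta_sketch", contains = "list")

# c() on a meta element collects all arguments into the list class.
setMethod("c", signature("meta_sketch"),
          function(x, ..., recursive = FALSE) {
            new("ListOfmeta_sketch", c(list(x), list(...)))
          })

meta1 <- new("meta_sketch", property = "dc:title", content = "My tree")
meta2 <- new("meta_sketch", property = "dc:creator", content = "A. Author")

metas <- c(meta1, meta2)
```

A fuller version would also need a method for c(ListOfmeta, meta), so that elements can be appended to an existing list.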

Reach out to R and NeXML lists

Now that we can do the basics (substantial testing, #15, aside) it might be good to reach out to the r-sig-phylo and nexml-discuss mailing lists for input, critique and feature requests? What exactly do we want to write that would be concise and specific enough to perhaps get some reply? Do we need any preamble about the motivation for the package (particularly about nexml to the R list)? Should that just appear in the package README?

phylo to tree methods handling when phylo doesn't have edge lengths

Pretty sure current setAs("phylo", "tree") will be unhappy if phylo doesn't have edge.length defined. Fix and write unit test showing this works.
Likewise, write unit test for tree to phylo when we don't have length data in the nexml.
nexml probably supports missing data better, e.g. lengths on only some edges. RNeXML needs to do something intelligent (e.g. NAs and warning?) on coercing this case to phylo (with unit test).
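A sketch of the "NAs and warning" option for partially missing lengths. The function name edge_lengths and the input (a character vector of raw length attributes, NA where the attribute is absent) are assumptions for illustration.

```r
# Convert raw length attributes to numeric, keeping NA for edges
# with no length and warning when only some edges are missing.
edge_lengths <- function(raw) {
  len <- as.numeric(raw)
  n_missing <- sum(is.na(len))
  if (n_missing > 0 && n_missing < length(len)) {
    warning(n_missing, " of ", length(len),
            " edges have no length; using NA", call. = FALSE)
  }
  len
}

# Two of four edges lack a length attribute.
raw <- c("0.5", NA, "1.2", NA)
len <- suppressWarnings(edge_lengths(raw))
```

If every length is missing, the function returns all NAs without a warning, and the caller could then choose to leave edge.length undefined on the phylo object instead.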

Extraction and writing of mixed-type character data frames

If we extract a list of character matrices for the same OTU set, it might make sense to combine them into the same data.frame (it would have to be a data.frame and not a matrix, since some columns might be continuous traits and some might be discrete traits). Currently, extracting the character matrix from a file like comp_analysis.xml returns a list with two data frames, one of the continuous traits and one of the discrete traits. While this maps cleanly to the nexml, it feels a bit cumbersome on the R end. @rvosa how would you feel about collapsing these into a single data.frame whenever the characters blocks reference the same otus block?

Likewise we have the reverse issue, where a user may have a data.frame with both continuous and discrete characters for the same set of taxa. We would have to break this into two separate characters nodes in serializing to nexml, right?
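Both directions are simple in base R. A sketch with invented trait names and values, assuming taxa are stored as row names:

```r
# A mixed data.frame: one continuous and one discrete trait.
traits <- data.frame(
  log_mass = c(1.2, 3.4, 2.2),
  habitat  = factor(c("forest", "marine", "forest")),
  row.names = c("t1", "t2", "t3")
)

# Writing: split columns by type, one block per characters node.
is_continuous <- vapply(traits, is.numeric, logical(1))
continuous <- traits[, is_continuous, drop = FALSE]
discrete   <- traits[, !is_continuous, drop = FALSE]

# Reading: recombine two blocks that share the same taxa,
# matching rows by taxon name rather than by position.
combined <- cbind(continuous,
                  discrete[rownames(continuous), , drop = FALSE])
```

Matching on row names means the recombination does not depend on the two blocks listing taxa in the same order.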

Lastly, there is the same issue we had with multiple trees blocks. A call to characters(nexml) will return a list of data.frames (or matrices), usually of length 1 unless there are multiple otus blocks. It seems sure to be annoying to get a length 1 list when a user might rightfully expect a data.frame or a matrix.

(I'm leaning to returning a data.frame rather than a matrix to represent the character matrix, as this seems most consistent with the phylogenetics R use, though many functions, like geiger::fitContinuous take either structure.)

handle polymorphic states appropriately

We parse polymorphic states into the S4 version fine, but do not convert them into a matrix.

I think StandardCells types with polymorphisms could cause get_characters to error. Not clear what the return data-type should be for the polymorphic/uncertain character state anyhow.

Not sure that this is a high priority, since I don't know how common this is in character data (e.g. vs just having a single explicit uncertain state).