emld's Introduction

emld

The goal of emld is to provide a way to work with EML metadata in the JSON-LD format. At its heart, the package is simply a way to translate an EML XML document into JSON-LD and to reverse this, so that any semantically equivalent JSON-LD file can be serialized into EML-schema valid XML. The package has only three core functions:

  • as_emld() Convert EML’s xml files (or the json version created by this package) into a native R object (an S3 class called emld, essentially just a list).
  • as_xml() Convert the native R format, emld, back into XML-schema valid EML.
  • as_json() Convert the native R format, emld, into json(LD).

Installation

You can install emld from GitHub with:

# install.packages("devtools")
devtools::install_github("ropensci/emld")

Motivation

In contrast to the existing EML package, this package aims to be a very lightweight implementation that provides an intuitive data format and makes maximum use of existing technology to work with that format. In particular, this package emphasizes tools for working with linked data through the JSON-LD format. This package is not meant to replace EML, as it does not support the more complex operations found in that package. Rather, it provides a minimalist but powerful way of working with EML documents that can be used by itself or as a backend for those complex operations. Version 2.0 of the EML R package uses emld under the hood.

Note that the JSON-LD format is considerably less rigid than the EML schema. This means that there are many valid, semantically equivalent representations on the JSON-LD side that must all map into the same or nearly the same XML format. At the extreme end, the JSON-LD format can be serialized into RDF, where everything is a flat set of triples (i.e., essentially a tabular representation), which we can query directly with semantic tools like SPARQL, and also automatically coerce back into the rigid nesting and ordering structure required by EML. This ability to “flatten” EML files can be particularly convenient for applications consuming and parsing large numbers of EML files. This package may also make it easier for other developers to build on EML, since the S3/list and JSON formats used here have proven more appealing to many R developers than S4 and XML serializations.

library(emld)
library(jsonlite)
library(magrittr) # for pipes
library(jqr)      # for JQ examples only
library(rdflib)   # for RDF examples only

Reading EML

The EML package can get particularly cumbersome when it comes to extracting and manipulating existing metadata in highly nested EML files. The emld approach can leverage a rich array of tools for reading, extracting, and manipulating existing EML files.

We can parse a simple example and manipulate it as a familiar list object (S3 object):

f <- system.file("extdata/example.xml", package="emld")
eml <- as_emld(f)
eml$dataset$title
#> [1] "Data from Cedar Creek LTER on productivity and species richness\n  for use in a workshop titled \"An Analysis of the Relationship between\n  Productivity and Diversity using Experimental Results from the Long-Term\n  Ecological Research Network\" held at NCEAS in September 1996."

Writing EML

Because emld objects are just nested lists, we can create EML just by writing lists:

me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))

eml <- list(
  dataset = list(
    title = "dataset title",
    contact = me,
    creator = me),
  system = "doi",
  packageId = "10.xxx")

ex.xml <- tempfile("ex", fileext = ".xml") # use your preferred file path

as_xml(eml, ex.xml)
#> NULL
eml_validate(ex.xml)
#> [1] TRUE
#> attr(,"errors")
#> character(0)

Note that we don’t have to worry about the order of the elements here: as_xml will reorder them if necessary to validate. (For instance, in valid EML the creator is listed before contact.) Of course, this is a very low-level interface that does not help the user know what an EML document looks like. Creating EML from scratch without knowledge of the schema is a job for the EML package and beyond the scope of the lightweight emld.

Working with EML as JSON-LD

For many applications, it is useful to merely treat EML as a list object, as seen above, allowing the R user to leverage standard tools and intuition in working with these files. However, emld also opens the door to new possible directions by thinking of EML data in terms of a JSON-LD serialization rather than an XML serialization. First, owing to its comparative simplicity and native data typing (e.g. of Boolean/string/numeric data), JSON is often easier for many developers to work with than EML’s native XML format.

As JSON: Query with JQ

For example, JSON can be queried with JQ, a simple and powerful query language that also gives us a lot of flexibility over the return structure of our results. JQ syntax is both intuitive and well documented, and often easier than the typical munging of JSON/list data using purrr. Here’s an example query that turns EML to JSON and then extracts the north and south bounding coordinates:

hf205 <- system.file("extdata/hf205.xml", package="emld")

as_emld(hf205) %>% 
  as_json() %>% 
  jq('.dataset.coverage.geographicCoverage.boundingCoordinates | 
       { northLat: .northBoundingCoordinate, 
         southLat: .southBoundingCoordinate }') %>%
  fromJSON()
#> $northLat
#> [1] "+42.55"
#> 
#> $southLat
#> [1] "+42.42"

Nice features of JQ include the ability to do recursive descent (common to XPATH but not possible in purrr) and specify the shape of the return object. Some prototype examples of how we can use this to translate between EML and https://schema.org/Dataset representations of the same metadata can be found in https://github.com/ropensci/emld/blob/master/notebook/jq_maps.md

As semantic data: SPARQL queries

Another side-effect of the JSON-LD representation is that we can treat EML as “semantic” data. This can provide a way to integrate EML records with other data sources, and means we can query the EML using semantic SPARQL queries. One nice thing about SPARQL queries is that, in contrast to XPATH, JQ, or other graph queries, SPARQL always returns a data.frame which is a particularly convenient format. SPARQL queries look like SQL queries in that we name the columns we want with a SELECT command. Unlike SQL, these names act as variables. We then use a WHERE block to define how these variables relate to each other.

f <- system.file("extdata/hf205.xml", package="emld")
hf205.json <- tempfile("hf205", fileext = ".json") # Use your preferred filepath

as_emld(f) %>%
  as_json(hf205.json)

prefix <- paste0("PREFIX eml: <eml://ecoinformatics.org/", eml_version(), "/>\n")
sparql <- paste0(prefix, '

  SELECT ?genus ?species ?northLat ?southLat ?eastLong ?westLong 

  WHERE { 
    ?y eml:taxonRankName "genus" .
    ?y eml:taxonRankValue ?genus .
    ?y eml:taxonomicClassification ?s .
    ?s eml:taxonRankName "species" .
    ?s eml:taxonRankValue ?species .
    ?x eml:northBoundingCoordinate ?northLat .
    ?x eml:southBoundingCoordinate ?southLat .
    ?x eml:eastBoundingCoordinate ?eastLong .
    ?x eml:westBoundingCoordinate ?westLong .
  }
')
  
rdf <- rdf_parse(hf205.json, "jsonld")
df <- rdf_query(rdf, sparql)
df
#> # A tibble: 0 x 0

Please note that the emld project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

emld's People

Contributors

amoeba, cboettig, jeanetteclark, jeroen, mbjones

emld's Issues

additional validation errors

I found additional validation errors in addition to #46.

  1. The packageId is not used as an identifier for checking uniqueness with other id values and for resolving references.
  2. Errors are thrown for annotation elements in additionalMetadata because their parent doesn't contain an id. This should be ok, because in additionalMetadata the annotation subject is determined via the describes fields.
  3. The annotation/@references attribute values are not checked during validation to be sure they resolve to an id -- only the //references elements are checked.

I've begun implementing fixes for these in my fork here: https://github.com/mbjones/emld/tree/46-validation-errors

<references> nodes should be turned into @id

Currently the package generates JSON-LD such as:

"creator": {
        "@id": "clarence.lehman",
        "individualName": {
          "givenName": "Clarence",
          "surName": "Lehman"
        }
      },
      "contact": {
        "references": "clarence.lehman"
      }

This should really be turned into:

"creator": {
        "@id": "clarence.lehman",
        "individualName": {
          "givenName": "Clarence",
          "surName": "Lehman"
        }
      },
      "contact": {
        "@id": "clarence.lehman"
      }

Or perhaps more formally:

      "contact": {
        "@id": "clarence.lehman",
        "@type": "@id"
      }

Going in reverse, some care will need to be taken to put the <references> tag back in only when the @type is @id or the @id is the only property.
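
The forward direction of this rewrite can be sketched on a plain nested list, which is how emld holds the document in R. The helper name `references_to_id` is hypothetical and not part of emld; the sketch only handles nodes whose sole entry is `references`, and ignores any attributes the `<references>` element might carry.

```r
# Hypothetical helper (not part of emld): recursively rewrite nodes of the
# form list(references = "x") as list(`@id` = "x").
references_to_id <- function(x) {
  if (!is.list(x)) return(x)
  # A node consisting solely of a `references` entry becomes an @id node
  if (identical(names(x), "references")) {
    return(list(`@id` = x[["references"]]))
  }
  # Otherwise recurse into every child, keeping names intact
  lapply(x, references_to_id)
}

doc <- list(
  creator = list(
    `@id` = "clarence.lehman",
    individualName = list(givenName = "Clarence", surName = "Lehman")
  ),
  contact = list(references = "clarence.lehman")
)

out <- references_to_id(doc)
str(out$contact)
```

The reverse direction would need the extra care described above: only a node whose `@id` is its single property (or whose `@type` is `@id`) should be turned back into a `<references>` element.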

rOpenSci review: Kelly

The main issues I noticed were in the documentation, especially to correct the SPARQL query in the README/vignette so that a blank dataframe is not returned.

Thanks! README example hadn't been updated when we switched the default from 2.2.1 version to the 2.2.0 version of the EML spec (as you already figured out).

Documentation

  1. Example needed for as_emld().

added!

  1. The example for eml_locate_schema() requires having the EML package installed. Consider adding this package to Suggests in package documentation.

Fixed, this example can run using a file from the emld package instead. Also, I think there's not a good use case to justify exporting this function; so I've made it a non-exported file.

  1. I was unable to locate a CONTRIBUTING file.

added.

  1. I was unable to locate the maintainer in the DESCRIPTION file.

This is in Authors@R, indicated by the "cre" role

Suggestions for improving package documentation:

  1. Typo in package description: "faciliate" should be "facilitate" on line 18 of DESCRIPTION.

thanks, fixed!

Suggestions for improving function documentation:

  1. For as_emld, consider providing information about the return value of the function.

done.

  1. The Description could be improved for as_json, as_xml, and template documentation.

done

  1. For template, consider providing more options for the object argument.
  2. For template, the Value text refers to usage in vignettes however I was unable to locate that.

I'm kind of on the fence as to whether template is really a useful function. It used to be in the README but I dropped it to focus on the three as_* functions as really the only core functions in the package. Curious for feedback on this. Generally, I think it is trying to fill a role that is almost always better served by a more feature-rich package like EML.

Suggestions for improving README:

  1. Typo in SPARQL queries section - I believe "names are act as variables" should be "names act as variables"

thanks, done!

  1. Consider indenting as_json("hf205.json") line after pipe in SPARQL queries section for readability
    done
  1. The SPARQL query in the example returns nothing as written. I believe the eml PREFIX should be updated to <eml://ecoinformatics.org/eml-2.2.0/>.
    yes, done!
  2. I was able to run everything in the vignette without separately loading the library magrittr. Is this required?
    I believe it is needed for the examples that use %>%

Functionality

  1. The naming scheme is slightly different than the object_verb format suggested in rOpenSci packaging guidelines, however it is consistent with other conversion-based functions. My only concern is that while the function names are not in conflict, they are similar to non-eml specific conversion functions jsonlite::toJSON and xml2::as_xml_document. The emld::as_xml details clarify that the function is specific to EML and emld objects -- perhaps it would be useful to clarify this in the as_json documentation as well?

Yeah, this is a great question. As you note, conversion / coercion methods and file IO (read/write) functions tend to reverse this (using verb_object, like read_xml / write_xml). So I think the as_emld etc pattern makes sense.

Good point about as_xml / as_json being ambiguous, but these are defined as S3 generics (like print and plot) providing methods for the emld type specifically, so this isn't really a collision.

  1. Although it is rather intuitive, please add the type of return object to the documentation for the functions as_emld, as_json, and as_xml to keep with rOpenSci packaging guidelines (1.6).

Nice, done now!

  1. Consider adding top-level documentation to return from ?emld in keeping with rOpenSci packaging guidelines (1.6).

Good catch! Done, thanks.

  1. Are functions in as_eml_document.R, validate_units.R, etc. internal functions that should have #' @noRd as per rOpenSci packaging guidelines (1.6)?

I think this only makes sense if the internal function actually has roxygen documentation notes, to prevent the confusion of the function becoming documented in the package manual, etc. Having an internal function with only a @noRd roxygen tag and no documentation doesn't produce any different behavior than omitting it, and could just be confusing.

  • On devtools::check() I encountered the following notes:
❯ checking package dependencies ... NOTE
  Package suggested but not available for checking: ‘spelling’

❯ checking top-level files ... NOTE
  Non-standard file/directory found at top level:
    ‘LICENSE.md’

LICENSE.md now added to .Rbuildignore. Using install_github("cboettig/emld", dependencies=TRUE) should install all packages listed in Suggests, including spelling.

  • the output of goodpractice::gp() similarly suggested that spelling was not available, however this message disappeared after I installed the spelling package.

yup. I wish goodpractice::gp() would direct users to use devtools::install(dependencies=TRUE) so users know they need the Suggested packages installed as well as the Imported packages in order to run tests etc.

Consider changing default `xsi:schemaLocation` value to point to a remote copy of the eml.xsd

As discussed in ropensci/EML#292, the current default behavior of emld is to set the xsi:schemaLocation to values like https://ecoinformatics.org/eml-2.2.0/ eml.xsd which matches the examples in https://github.com/NCEAS/eml but isn't useful for validation tools that follow remote URIs to find a copy of the schema. @scelmendorf had to set a custom value to get validation to work which I'd argue isn't desirable.

We should discuss whether it makes sense to set a value like https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd instead of just eml.xsd so, by default and with no tweaking from the user, validation tools like Oxygen or any XML-aware text editors can do validation of documents EML/emld produces.
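
For concreteness, here is a small sketch of how such a default could be assembled. The function name `schema_location` is hypothetical, and the base URL pattern is an assumption taken from the value proposed above; it only applies to EML 2.2.0, since older versions use a different namespace scheme.

```r
# Hypothetical helper: build an xsi:schemaLocation value of the form
# "<namespace> <resolvable URL of eml.xsd>".
# Assumption: namespace and hosted schema share the same base URL (EML 2.2.0).
schema_location <- function(version = "eml-2.2.0",
                            base = "https://eml.ecoinformatics.org") {
  ns <- paste(base, version, sep = "/")
  paste(ns, paste(ns, "eml.xsd", sep = "/"))
}

schema_location()
#> [1] "https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
```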

What do others think about this?

naming in the function API

@amoeba Any suggestions on function names for the main API?

I'm currently thinking that the trio:

  • as_emld()
  • as_xml()
  • as_json()

form the core API, with as_emld() working on json, xml_document or external files with .json, .xml extensions and returning our S3 object; while the other two functions reverse that accordingly.

This would replace the existing duo API, xml_to_json and json_to_xml, which basically skipped over having a user-facing R list object. (and where as_emld is basically a generalization of parse_eml()).

I'm hoping this should emphasize that there are really 3 relevant object types (the R list, xml, and json).

The idea would then be to build the user friendly tooling around whatever type or combination of types is best suited for the task.

Avoid # and @ on attribute names in list objects

The emld list objects should not prefix attribute names with #. There's no need for this, and it's cumbersome (particularly since it means named arguments need to be quoted).

Likewise, they should not prefix id as @id -- though the default frame and context should define "id": "@id" so the JSON-LD functions interpret this correctly, as is already done in schema.org.
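
A short base R illustration of why the prefixes are cumbersome. The element names and values here are made up for the example; only the quoting behavior is the point.

```r
# With prefixed names, construction and access both need backticks or quotes
prefixed <- list(`@id` = "clarence.lehman", `#system` = "knb")
prefixed$`@id`

# Without prefixes, ordinary named arguments and `$` access just work
plain <- list(id = "clarence.lehman", system = "knb")
plain$id

# The aliasing then lives in the JSON-LD context instead, e.g. "id": "@id"
context <- list(id = "@id")
```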

Thanks @amoeba for making me realize this.

Should `validate_units` be reworked to use the right version of the unitDictionary?

@jeanetteclark ran into a situation earlier where she was working with an EML 2.1.1 doc but had her emld_db option set as eml-2.2.0.

As you can see below,

unit_valid <- validate_units(doc, encoding = encoding)

emld/R/validate_units.R

Lines 15 to 19 in 4e6db62

standard <- xml2::read_xml(system.file("tests",
  getOption("emld_db"),
  "eml-unitDictionary.xml",
  package = "emld"
))

With the new machinery in place to automatically detect the appropriate EML schema to validate with from the input doc, this causes her 2.1.1 doc to have its units validated with the EML 2.2.0 unitDictionary.

I consider this just an oversight on my part and would be happy to change this behavior so that the version of the unitDictionary is dependent on the input document and not the global.

What do others think?

PS: eml_additional_validation is also not version-specific but I think the rules are backwards compatible so that's fine enough for now I think.
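
One way the version could be derived from the document rather than the global option, sketched in base R. Both helper names are hypothetical, and the approach assumes the EML version appears as `eml-x.y.z` in the document's namespace URI, as it does for the 2.1.1 and 2.2.0 namespaces.

```r
# Hypothetical helper: pull "eml-x.y.z" out of a namespace URI
version_from_namespace <- function(ns_uri) {
  m <- regmatches(ns_uri, regexpr("eml-[0-9]+\\.[0-9]+\\.[0-9]+", ns_uri))
  if (length(m) == 0) NA_character_ else m
}

# Hypothetical helper: relative path of the matching unit dictionary, mirroring
# the system.file("tests", <version>, "eml-unitDictionary.xml") layout above
unit_dictionary_path <- function(version) {
  file.path("tests", version, "eml-unitDictionary.xml")
}

v <- version_from_namespace("eml://ecoinformatics.org/eml-2.1.1")
unit_dictionary_path(v)
#> [1] "tests/eml-2.1.1/eml-unitDictionary.xml"
```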

Better name for package

Original name: emljson

Consider:

  • emon: Ecological Metadata Object Notation (already taken on CRAN)
  • emlon: Ecological Metadata Language Object Notation
  • emonld: Ecological Metadata Object Notation for Linked Data
  • emlildon: ecological metadata language in linked data object notation
  • ...

Consider template() function to generate empty named lists for objects

Just added a prototype of this method. Returns a list object which I preview here using the yaml layout.

> template("contact", recursive = TRUE) %>% as.yaml() %>% cat()
individualName:
  salutation: {}
  givenName: {}
  surName: {}
organizationName: {}
positionName: {}
address:
  deliveryPoint: {}
  city: {}
  administrativeArea: {}
  postalCode: {}
  country: {}
phone: {}
electronicMailAddress: {}
onlineUrl: {}
userId: {}

So far this function is just a simple wrapper around the extracted eml_db database of the slots for all classes from EML package.

Users could fill the R list directly, or maybe serialize it out to a JSON or YAML file and fill that out and then read it back in. Maybe that is too cumbersome though.

Define Frame & Context

Currently context is just a default vocabulary

  • Context should be expanded to include any xmlns prefix definitions.

  • We should frame before serializing JSON-LD into XML to make sure it is properly nested (i.e. not flattened, etc).

(Though in this case compacting may be sufficient since it is possible there are not many tree representations possible. <references> elements should become explicit as @ids. Though in EML both the referenced and embedded versions are valid.)

Use case: Constructing EML from a triplestore of existing EML components

Illustrate how we can compose new EML metadata from reusable elements in existing EML records (coverage of field sites, personnel, etc) by extracting those bits from RDF using SPARQL and recasting into JSON-LD objects that we can serialize into new valid EML. Also illustrate tabular representations of the JSON-LD

EML -> RDF -> SPARQL -> EML

Take another look at new `eml_version` behavior

I found a situation that's making me rethink the new behavior we added to eml_version that prompts when interactive.

This code, from the EML package README:

me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
my_eml <- list(dataset = list(
              title = "A Minimal Valid EML Dataset",
              creator = me,
              contact = me)
            )


write_eml(my_eml, "ex.xml")

triggers the prompt which I think is highly undesirable. emld uses calls like eml_version() all over the place and in places that will get called indirectly when interactive.

I think that having a package helper like eml_version() around is very useful and that we might want to remove the interactive prompt portion and make the calling semantics:

  1. eml_version() returns the current emld_db value, always
  2. eml_version(foo) sets the emld_db value to foo. Can error if foo is not set correctly
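
A minimal sketch of those two calling semantics. `eml_version_sketch` is a hypothetical stand-in, not the package's eml_version(); the fallback default of "eml-2.2.0" is an assumption, and the check on the argument is deliberately minimal.

```r
# Hypothetical stand-in for eml_version(): never prompts, even when interactive
eml_version_sketch <- function(version = NULL) {
  if (is.null(version)) {
    # Getter: always report the current option (fallback default is assumed)
    return(getOption("emld_db", default = "eml-2.2.0"))
  }
  # Setter: error early on anything that is not a single string
  if (!is.character(version) || length(version) != 1) {
    stop("`version` must be a single string such as 'eml-2.1.1'")
  }
  options(emld_db = version)
  invisible(version)
}

eml_version_sketch("eml-2.1.1")  # sets the option, prints nothing
eml_version_sketch()             # returns "eml-2.1.1"
```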

rOpenSci review: Peter

In the Motivations section of README.md the purpose and method of "extending EML with other semantic vocabularies" isn't clear (to me). Consider adding a section such as "Extending EML" with an example, if this is within the scope of the package.

Good point, I've removed this now since it is merely confusing. What I had in mind here would be contingent on EML 2.2.0's support for arbitrary RDF annotations in EML. In principle, a user could just add another term, say, schema:editor onto an eml:dataset object, and emld would translate it into the appropriate (and potentially less intuitive to construct) EML annotation element. But this isn't implemented yet (see issue #2), and would probably be more trouble than help.

The help text for the as_xml Description contains only the text as_xml and does not describe what the function does. Also, the return type is not specified. The same is true for both as_emld and as_json, template.

fixed.

No documentation index is available for the package.

not actually sure to what this refers? complete pdf & pkgdown-based versions of the package manual / docs should now be in place.

There is no documentation for ?emld or ?emld-package.

Thanks, fixed!

This may be outside the scope of the package, but no functions, strategies or examples are presented in the documentation for simple editing of EML data, i.e. (reading, simple edit, write out). This would be very useful, but if this functionality presents too much overlap with the EML package, please disregard this comment.

Right, this is really the scope of the EML package. As the README shows, it is possible to use a list-based approach to create a simple (schema-valid) EML file, or make a minor edit, but this is unlikely to scale well without richer functions in EML.

Functionality

The as_json, as_emld, as_xml functions have a clear purpose and work as expected.

In addition to testing using the provided code samples, the following checks for a complex EML document were performed for a couple of EML documents such as https://goa.nceas.ucsb.edu/#view/urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171:

  • verified that as_json produced valid JSON-LD
  • verified that the following commands produced the original EML file (round trip):
x <- as_emld('metadata.xml') # using
as_xml(x, "new.xml")

yay, thanks for checking this!

R Source

The following potentially unresolved items exist, as shown by a quick scan of the source code:

$ grep FIXME *
as_emld.R:    ## FIXME technically this assumes only our context
as_xml.R:## FIXME drop NAs too?
eml_validate.R:    ##  FIXME shouldn't have to write to tempfile,
eml_validate.R:  # FIXME technically must be unique inside parent system only...
emld-methods.R:## FIXME: Print method should drop context

Good call, these have now been resolved and the notes removed. (some had already been resolved, or weren't quite accurate)

Misc

devtools::check() reports non-standard file LICENSE.md

fixed (added to .Rbuildignore)

`as_eml_document` doesn't account for multiple types of `coverage` element

Over in EML we got a bug report that the EML package wasn't recognizing valid EML as being valid.

The problem stemmed from the element:

<sampling>
    <spatialSamplingUnits>
          <coverage>
            <geographicDescription>
...

After digging into the eml-2.x.x.json file, which is the backbone of the serialization process within as_eml_document, it seems that there is only one flavor of coverage included, of the eml-dataset:coverage variety, which requires a geographicCoverage prior to the geographicDescription element.

In the above example, however, the coverage shown above is eml-methods:coverage, which does not require the geographicCoverage child, since it has a geographicCoverage type.

This is definitely a bug, since the document was indeed valid, though I'm unsure as to how to proceed with fixing it. Any input from @cboettig or @mbjones would be most welcome

Add raw support for as_emld

Since we often get EML from dataone/etc as raw, it would be great to support that functionality with read_eml/as_emld. We typically read it in this way (eml <- read_eml(rawToChar(dataone::getObject(mn, "doi:10.18739/A2K86G")))), though I'm now realizing that the rawToChar part isn't necessary (anymore?).

It seems that read_eml/as_emld could be expanded with:

  • this addition to as_emld.character: if (grepl("\\.xml$", x) | grepl("<|>", x)) (since read_xml already handles it)
  • an as_emld.raw method, which could also piggyback off of read_xml:
as_emld.raw <- function(x){
        x <- xml2::read_xml(x)
        as_emld.xml_document(x)
}

Happy to PR it if you think this makes sense

Feedback on using emld to ease creation of helper methods

Hey @cboettig I carved out some time to look at re-writing some EML helpers in emld to get a sense of how the two packages compare. The main thing I'm coming away with so far is that my helpers nearly melt away. I think this is evidence that it will be easier to write a full suite of generic helpers by hand, and also any special case helpers the community needs.

I started out with re-writing the helpers related to parties in arcticdatautils. Three major differences emerged:

  • I no longer have to do the NULL checking because emld handles the propagation of NULLs gracefully
  • I'm no longer wrapping things in ListOfs. This is always a pain for new users.
  • Sub-elements of parties, like email addresses, are way easier to create

I think these were all benefits you hoped we'd see.

Take a helper for creating a simple contact. Its only real purpose is to provide the user some autocompletion and roxygen docs:

set_contact <- function(givenNames = NULL, surName, email = NULL) {
  list(individualName = list(givenName = givenNames,
                             surName = surName),
       electronicMailAddress = email)
}

Adding support for another (simple non-nested) attribute of the party like phone number is an easy change:

set_contact <- function(givenNames = NULL, surName, email = NULL, phone = NULL) {
  list(individualName = list(givenName = givenNames,
                             surName = surName),
       electronicMailAddress = email,
       phone = phone)
}

Good start so far.

Add warning/error when the user sets an invalid EML version string

Ran into this one today and I'll admit it got me too. To toggle between EML versions, we provide the API:

emld::eml_version("eml-2.1.1")

If you forgot this exact pattern (eml- followed by the version), you might try

emld::eml_version("2.1.1")

which will cause you grief. I think a warning is in order. Maybe an error?

e.g.,

> eml_version("2.1.1")
[1] "2.1.1"
Warning message:
In eml_version("2.1.1") :
  Your provided version of '2.1.1' does not look like a valid version string. Be sure it starts with 'eml-' and ends with the schema version. e.g., for EML 2.1.1, use 'eml-2.1.1'.

Do others think this should be a warning, or maybe think it should be an error instead? Now that I'm asking this, I'm leaning towards error because you get errors in functions like eml_ns when the version string is invalid.

> doc_info <- read_eml(getObject(mn, "foo"))
Error in if (parts[1] <= 2 && parts[2] < 2) { : 
  missing value where TRUE/FALSE needed
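
A sketch of the error variant, so the failure surfaces at the point where the version is set instead of later inside eml_ns. `check_eml_version` is a hypothetical helper, and the three-part `eml-x.y.z` pattern is an assumption about what counts as valid.

```r
# Hypothetical validator: stop on anything that is not "eml-<x>.<y>.<z>"
check_eml_version <- function(version) {
  ok <- is.character(version) && length(version) == 1 &&
    grepl("^eml-[0-9]+\\.[0-9]+\\.[0-9]+$", version)
  if (!ok) {
    stop("'", version, "' does not look like a valid version string. ",
         "Be sure it starts with 'eml-' and ends with the schema version, ",
         "e.g. 'eml-2.1.1'.", call. = FALSE)
  }
  invisible(version)
}

check_eml_version("eml-2.1.1")   # passes silently

# Capture the error message instead of halting the script
msg <- tryCatch(check_eml_version("2.1.1"),
                error = function(e) conditionMessage(e))
msg
```

Swapping `stop()` for `warning()` gives the softer behavior shown in the mock-up above.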

Additional validation

EML validation imposes additional constraints beyond the XML schema validation. I believe these can be summarized as follows:

  • combination of id+system attributes on any element should be unique.
  • an element cannot have both child element <references>/<describes> and an attribute called id.
  • semantic annotation needs a subject: either from the id of its parent or as a child <references> of the annotation, but not both!

eml_validate() should enforce all of these conditions as well. (Should be possible as simple xpath?)
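
A rough sketch of the first two checks as XPath queries with xml2. `check_ids` is a hypothetical helper, not emld's implementation; for brevity it compares id values alone rather than the id+system combination, and it skips the annotation rule entirely.

```r
library(xml2)

# Hypothetical helper: return a character vector of problems (empty if none)
check_ids <- function(doc) {
  problems <- character(0)

  # 1. id values should be unique (the system attribute is ignored here)
  ids <- xml_attr(xml_find_all(doc, "//*[@id]"), "id")
  dup <- unique(ids[duplicated(ids)])
  if (length(dup) > 0) {
    problems <- c(problems, paste("duplicate id:", dup))
  }

  # 2. an element must not carry both an id attribute and a <references> child
  both <- xml_find_all(doc, "//*[@id][references]")
  if (length(both) > 0) {
    problems <- c(problems,
                  paste("id and <references> on same element:", xml_name(both)))
  }
  problems
}

doc <- read_xml('<eml>
  <creator id="a"><individualName><surName>Smith</surName></individualName></creator>
  <contact id="a"><references>a</references></contact>
</eml>')

check_ids(doc)
```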

Migrated from ropensci/EML#244.

New release?

This is just a follow-up on PR #40 to discuss issues for a new release.

  • Need to think what the right version bump ought to be and update DESCRIPTION & NEWS.md (Maybe 0.2.1 or 0.3.0?, but really the namespace is a breaking change, and changing the default to 2.2.0 would be a significant shift as well...)

  • Need to update NEWS.md to reflect PR #40 change in NAMESPACE

  • @jeanetteclark notes: Before cutting the new release, we might want to switch back to setting the default EML version to 2.2.0 now that this is officially out. I created an issue (#31) requesting the switch to 2.1.1 until 2.2.0 got released.

fix support for remote context, xml-serialization of external names

{
  "@context": "https://raw.githubusercontent.com/cboettig/emld/master/inst/context/eml-context.json",
  "@type": "EML",
  "title": "Sample Dataset Description",
  "creator": {
    "id": "23445",
    "scope": "document",
    "individualName": {
      "surName": "Smith"
    },
    "https://schema.org/birthDate": {"@value": "1980-02-02", "@type": "xsi:Date"}
  },
  "contact": [
    {
      "individualName": {
        "surName": "Johnson"
      }
    },
    {
      "@id": "23445"
    }
  ]
}

Should render into EML without issues

Consider as_emld taking an explicit type

S3 class detection should not be the only way to determine class type. This strategy is fine for distinguishing xml_document from json, but really should take and use an explicit argument whenever possible. This is particularly relevant for working with input streams (file path, character literal, url, raw vectors).
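
One possible shape for this, sketched in base R. The function name `detect_input_type` and the `type` argument are hypothetical; the guessing rules are simple heuristics on a single string or raw vector, and an explicit `type` always wins over them.

```r
# Hypothetical helper: decide how to parse an input stream.
# An explicit `type` takes precedence; "guess" falls back to heuristics.
detect_input_type <- function(x, type = c("guess", "xml", "json")) {
  type <- match.arg(type)
  if (type != "guess") return(type)

  if (is.raw(x)) x <- rawToChar(x)
  trimmed <- trimws(x)

  if (grepl("\\.json$", x) || startsWith(trimmed, "{") || startsWith(trimmed, "[")) {
    "json"
  } else if (grepl("\\.xml$", x) || startsWith(trimmed, "<")) {
    "xml"
  } else {
    stop("Cannot guess input type; please supply `type`")
  }
}

detect_input_type("metadata.xml")                 # "xml"
detect_input_type('{"@type": "EML"}')             # "json"
detect_input_type("metadata.txt", type = "xml")   # "xml", no guessing needed
```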

Don't depend on `schemaLocation` to find the schema file to validate with

As discussed in #45, a better approach to finding the schema file to validate a document with is to look at the QName on the root and defined namespaces on the document to come up with the namespace on the root. The current approach hopes xsi:schemaLocation is set on the root and we go to there to find the schema. However, xsi:schemaLocation isn't a required element so we can't really trust it'll be there.

I'll rewrite the logic in eml_validate to find the schema as described above.
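
A sketch of the lookup using xml2. `root_namespace` is a hypothetical helper; it only resolves the namespace URI bound to the root element's prefix, and mapping that URI to a bundled schema file is left out.

```r
library(xml2)

# Hypothetical helper: resolve the namespace URI bound to the root element
root_namespace <- function(doc) {
  ns <- xml_ns(doc)
  # With `ns` supplied, xml_name() returns the qualified name, e.g. "eml:eml"
  qname <- xml_name(xml_root(doc), ns = ns)
  parts <- strsplit(qname, ":", fixed = TRUE)[[1]]
  if (length(parts) < 2) return(NA_character_)  # root is not namespaced
  ns[[parts[1]]]
}

doc <- read_xml(
  '<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"><dataset/></eml:eml>'
)
root_namespace(doc)
```

Because this never reads xsi:schemaLocation, it works the same whether or not that attribute is present on the root.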

Change return type or behavior of as_xml?

I'm so used to this that I forget that it might be confusing to new users. EML::write_xml returns NULL when you save to disk because emld's as_xml method returns the raw result from xml2::write_xml, which returns invisible(). One of our team caught this over on NCEAS/datateam-training#229.

The relevant code is:

## Serialize to file if desired
if(!is.null(file)){
  xml2::write_xml(xml, file)
} else {
  xml
}

I wasn't sure what the best return type is here. The readr functions, which are common to most users, seem to return the input (to support piping). This method currently returns either an xml_document or NULL. Maybe the best thing is just to wrap the write_xml in invisible() and avoid making any big changes to the return type.
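
To compare the options, here is a sketch of the readr-style variant, which hands the input back invisibly after writing. `serialize_sketch` is a hypothetical stand-in for the tail of as_xml, not the package's code.

```r
library(xml2)

# Hypothetical stand-in for the end of as_xml()
serialize_sketch <- function(xml, file = NULL) {
  if (!is.null(file)) {
    xml2::write_xml(xml, file)
    invisible(xml)  # return the input invisibly, as readr's writers do
  } else {
    xml
  }
}

doc <- read_xml("<eml><dataset/></eml>")
out <- tempfile(fileext = ".xml")

res <- serialize_sketch(doc, out)  # writes the file, prints nothing
class(res)
```

The smaller change mentioned above, wrapping the existing write_xml call in invisible(), would keep the NULL return value but stop it from printing at the console.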

Thoughts @cboettig?

EML attributes dropped from tags silently

New to EML 2.2.0 is support for ids in taxonomicClassification

I want to be able to set the attributes of tags in order to include <taxonId> in taxonomic coverage. The attribute is dropped silently when I use emld::as_xml(). Is there a way to do this?

x <- read_xml("<taxonId>12345</taxonId>")
# {xml_document}
# <taxonId>
x

xml2::xml_set_attr(x, "provider", "ITIS")
# {xml_document}
# <taxonId provider="ITIS">
x

xml2::as_list(x) %>% 
  emld::as_xml()
# {xml_document}
# <eml xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2">
# [1] <taxonId>12345</taxonId>

Either `as_emld` or `as_xml` not handling docs with submodule at root

Round-tripping EML docs (e.g., <eml:eml...) works correctly, but round-tripping fails for documents with elements from submodules at their root (e.g., <dataset:dataset...). I think the reason this doesn't break the round-tripping tests is that they don't check at this fine a level of detail (same QName on root before and after).

An example of the correct behavior, exhibited when round-tripping a document with eml:eml at the root:

outpath <- tempfile(fileext = ".xml")
inpath <- "inst/tests/eml-2.2.0/eml-sample.xml"
doc <- xml2::read_xml(inpath)
> doc
{xml_document}
<eml packageId="doi:10.xxxx/eml.1.1" system="knb" schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 xsd/eml.xsd" xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1">

emld <- as_emld(doc)
as_xml(emld, outpath)
> readLines(outpath)[[2]]
[1] "<eml:eml xmlns:eml=\"https://eml.ecoinformatics.org/eml-2.2.0\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:stmml=\"http://www.xml-cml.org/schema/stmml-1.2\" packageId=\"doi:10.xxxx/eml.1.1\" xsi:schemaLocation=\"https://eml.ecoinformatics.org/eml-2.2.0 xsd/eml.xsd\" system=\"knb\">"

An example of the incorrect behavior with a ds:dataset element at the root:

outpath <- tempfile(fileext = ".xml")
inpath <- "inst/tests/eml-2.2.0/eml-dataset.xml"
doc <- xml2::read_xml(inpath)
doc
{xml_document}
<dataset system="KNB" schemaLocation="https://eml.ecoinformatics.org/dataset-2.2.0          eml-dataset.xsd" xmlns:ds="https://eml.ecoinformatics.org/dataset-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

emld <- emld::as_emld(doc)
emld::as_xml(emld, outpath)
readLines(outpath)[[2]]
[1] "<eml:eml xmlns:eml=\"https://eml.ecoinformatics.org/eml-2.2.0\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:stmml=\"http://www.xml-cml.org/schema/stmml-1.2\" xmlns:ds=\"https://eml.ecoinformatics.org/dataset-2.2.0\" xsi:schemaLocation=\"https://eml.ecoinformatics.org/dataset-2.2.0          eml-dataset.xsd\" system=\"KNB\">"

See how the root element's name changes? I think I know what's going on, and I think the patch for now is to bring some of my helpers from #53 over into as_xml.emld.

Look into making use of schema catalogs to validate

Stems from #53:

The way eml_validate works now is that it chooses a root XML schema to validate the input document against based upon the root of the document. This results in the input being validated by a single XSD (and any of its locally-available imports, e.g., eml and eml-dataset, etc.) but isn't really full schema validation.

What we're doing now is good enough for 99% of use cases but we really oughta be taking a schema catalog approach to validate inputs dynamically depending on how they're structured by letting libxml2 figure it out instead of explicitly setting the schema.

As of yet, it's unclear if libxml2 or the xml2 package supports schema catalogs but we should look into this in the future to tidy up validation.

See @mbjones's detailed comment at #53 (comment) for more info.

Release 0.5.0

0.5.0 includes some nice QOL fixes but also a pretty important update in #60 to make EML 2.2.0 docs validate using the correct unit list. I think @cboettig had asked if this was a good time for a release and I think so. I've rcmdcheck'd my way around and I think things are good for a release whenever you can sit down to do it, @cboettig. Please see my PR in #64 before releasing and then I think this is good to go.

Create lookup table for ordering of elements

Eliminate the dependency on the EML package by creating a list object that lists the ordered slot names for each object defined in the EML package. (Should be easy to generate and save as .rda given a list of all object names.)
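
A minimal base-R sketch of such a lookup table and how it might be used to put children in schema order. The object `eml_order`, the helper `sort_children`, and the two partial entries are illustrative assumptions, not the table the package would actually ship.

```r
# Hypothetical lookup table: element name -> child names in schema order.
# The entries are illustrative and deliberately incomplete.
eml_order <- list(
  individualName = c("salutation", "givenName", "surName"),
  dataset = c("title", "creator", "contact")
)

# Reorder a list's elements to follow the lookup table. Names the table does
# not know about keep their relative order and are placed last.
sort_children <- function(x, element, order_table = eml_order) {
  known <- order_table[[element]]
  if (is.null(known)) {
    return(x)
  }
  rank <- match(names(x), known, nomatch = length(known) + 1L)
  x[order(rank)]
}

me <- list(surName = "Boettiger", givenName = "Carl")
names(sort_children(me, "individualName"))
```

A table like this could be written once with `save(eml_order, file = "eml_order.rda")` and loaded by the package, which is what removes the runtime dependency.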

Issue with textType serialization

@srearl popped in to NCEAS EML Slack today with some weird emld behavior:

writeLines(
  as.character(
    emld::as_xml(
      list(additionalInfo = list(
        section = list(para = "some para"))))))

produces

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1/ eml.xsd">
  <additionalInfo>
    <section>
      <section>some para</section>
    </section>
  </additionalInfo>
</eml:eml>

Instead of the intended

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1/ eml.xsd">
  <additionalInfo>
    <section>
      <para>some para</para>
    </section>
  </additionalInfo>
</eml:eml>

I poked around at the source a bit and didn't quite see what's up but wanted to file an issue. I can look again later on this week I bet.

Looking for feedback

@maelle

Hey Maëlle,

When you have a moment, I wanted to get your feedback on this little project. My goal is to create something for EML that is a bit more friendly/less clumsy than the S4 construction in the original EML R package: no more new, ListOf*, [[1]]@stuff[[1]]@@@argh! nonsense, objects are just lists. For instance, here's creating a fully valid EML file:

me <- list(individualName = list(
                 givenName = "Carl", 
                 surName = "Boettiger"))
eml <- list(dataset = list(
                  title = "dataset title",
                  contact = me,
                  creator = me),
                   "#system" = "doi",
                   "#packageId" = "10.xxx")
as_xml(eml, "ex.xml")

I think there's still a need for higher-level constructor functions (the set_* methods we have in EML) for creating common patterns quickly, but I think these will be somewhat easier to write without the S4 overhead. I've started drafting a few of these in eml2, which should be a drop-in replacement for the higher-level tools in EML.

Meanwhile, this emld package just focuses on the fundamental mechanics of going between list and XML formats. The list representation is designed to correspond to valid JSON-LD: I'm finding that JSON is indeed a lot more developer-friendly than XML, and the -LD part can be particularly handy thanks to the clever JSON-LD algorithms (which we now have in R thanks to @jeroen's jsonld package). I've put a few examples in the README but they are pretty sketchy so far.

I think this kind of functionality should make it a lot easier to actually do interesting data-mining things with a large number of EML files without having to inspect and extract particular data from each one. (For instance, you could just render all the EML files into a triplestore and query that with SPARQL, which isn't nearly as bad as it sounds! Basically it just rectangularizes EML so you don't spend so much effort dealing with nesting structure issues and can just read the data).

Anyway, would love to hear any initial reactions you have. If this looks promising, it might be fun for us (pulling in @amoeba as well) to work out some use cases & write up as a short paper.

validation produces spurious errors

Validating the eml-sample.xml file that ships with EML 2.2.0 produces spurious errors.

> eml_validate("eml-sample.xml", schema = NULL)
[1] FALSE
attr(,"errors")
[1] "parent of any annotation must have id unless annotation contains a references attribute"
Warning message:
In eml_additional_validation(doc, encoding = encoding) :
  Document is invalid. Found the following errors:
 parent of any annotation must have id unless annotation contains a references attribute

`as_emld` bug?

The bug seems to have something to do with the context part of as_emld. I think this code used to work, but it doesn't anymore. Within eml_get, it looks good up until the last step, when robj is passed back into as_emld.

eml_path <- system.file("example-dataset/broodTable_metadata.xml", package = "dataspice")
eml <- eml2::read_eml(eml_path)
eml2::eml_get(eml, "physical")
#> {}

validate_units edge case

One of our EML tools generated some seemingly valid (though unusual) EML but validate_units didn't think so. I think the document is technically valid but validation fails:

> emld:::eml_validate("~/Downloads/science_metadata (1).xml")
[1] FALSE
attr(,"errors")
[1] "not all 'custom units are defined."
Warning message:
In validate_units(doc, encoding = encoding) :
  Document is invalid. Found the following errors:
 not all 'custom units are defined.

But the XML has:

...
<metadata>
  <stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1">
    <unit id="micromolesPerGram" abbreviation="umol/g" name="micromolesPerGram" parentSI="molesPerGram" unitType="amountOfSubstanceWeight" multiplierToSI="1e-06">
...

Which looks fine, but you can see the mixing of namespaces there. The unitList element uses a local namespace and the unit element uses the global namespace. The XPath that validate_units uses to extract the custom units doesn't like this situation:

> c(
+     xml2::xml_attr(xml2::xml_find_all(
+         doc,
+         "//stmml:unitList/stmml:unit"
+     ), "id"),
+     xml2::xml_attr(
+         xml2::xml_find_all(doc, "//unitList/unit"),
+         "id"
+     )
+ )
character(0)

Two alternatives that would work in this case are:

xml2::xml_find_all( doc, "//stmml:unitList/unit")
# or
xml2::xml_find_all(doc, "//*[local-name()='unitList']/*[local-name()='unit']")
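
To make the difference between these expressions concrete, here is a small self-contained check against a made-up document that mirrors the structure above (a prefixed unitList with an unprefixed unit child). The `ids` helper is introduced only for this illustration.

```r
library(xml2)

# Made-up document: the parent is in the stmml namespace, the child is not.
doc <- read_xml(
  '<root xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1">
     <stmml:unitList>
       <unit id="micromolesPerGram"/>
     </stmml:unitList>
   </root>'
)

# Illustrative helper: ids of all nodes matched by an XPath expression.
ids <- function(xpath) xml_attr(xml_find_all(doc, xpath), "id")

ids("//stmml:unitList/stmml:unit")  # both prefixed: no match
ids("//unitList/unit")              # neither prefixed: no match
ids("//stmml:unitList/unit")        # mixed, as in the document: matches
ids("//*[local-name()='unitList']/*[local-name()='unit']")  # ignores namespaces: matches
```

The local-name() form is the most permissive, since it would also accept fully prefixed and fully unprefixed unit lists.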

Questions I have and could use some feedback on are:

  • What's the most XML-ish way to handle this?

  • Is it worth modifying the routine to accommodate this situation?

  • Should eml_validate be enforcing schema validity as per the schema? The metadata element says:

    Any well-formed XML-formatted metadata may be inserted at this location in the EML document. If an element inserted here contains a reference to its namespace, and if there is an association between that namespace and an XML Schema that can be located by the processing application, then the processing application must validate the metadata element. If these conditions are not met, then validation need not occur.

    I think this is a bit of an odd rule but I think the intent was to be helpful to clients that want to be schema valid which makes sense.

Example EML (science_metadata (1).xml above)
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1" packageId="somepackage" system="somesystem" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset>
    <title>title</title>
    <creator>
      <individualName>
        <surName>creator</surName>
      </individualName>
    </creator>
    <contact>
      <individualName>
        <surName>contact</surName>
      </individualName>
    </contact>
    <dataTable id="urn:uuid:5571f67b-6d0f-4d24-8349-3d8d70e008d5">
      <entityName>someentity</entityName>
      <attributeList>
        <attribute>
          <attributeName>...</attributeName>
          <attributeDefinition>...</attributeDefinition>
          <measurementScale>
            <ratio>
              <unit>
                <customUnit>micromolesPerGram</customUnit>
              </unit>
              <numericDomain>
                <numberType>real</numberType>
              </numericDomain>
            </ratio>
          </measurementScale>
        </attribute>
      </attributeList>
    </dataTable>
  </dataset>
  <additionalMetadata>
    <metadata>
      <stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1">
        <unit id="micromolesPerGram" abbreviation="umol/g" name="micromolesPerGram" parentSI="molesPerGram" unitType="amountOfSubstanceWeight" multiplierToSI="1e-06">
          <description>micromoles per gram</description>
        </unit>
      </stmml:unitList>
    </metadata>
  </additionalMetadata>
</eml:eml>
devtools::session_info()
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       macOS Mojave 10.14.4        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Juneau              
 date     2019-03-04                  

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
 backports     1.1.3   2018-12-14 [1] CRAN (R 3.5.0)
 callr         3.1.1   2018-12-21 [1] CRAN (R 3.5.0)
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
 curl          3.3     2019-01-10 [1] CRAN (R 3.5.2)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
 devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.2)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
 emld        * 0.2.0   2019-02-27 [1] local         
 fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
 glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
 jsonld        2.1     2019-02-05 [1] CRAN (R 3.5.2)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
 pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
 processx      3.2.1   2018-12-05 [1] CRAN (R 3.5.0)
 ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.0)
 R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
 remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
 rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
 rstudioapi    0.9.0   2019-01-09 [1] CRAN (R 3.5.2)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
 testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.0)
 usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
 V8            2.0     2019-02-07 [1] CRAN (R 3.5.2)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
 xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.0)
 yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)

[1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Naturalize semantic annotations

Sounds like the next EML release should include extensions through semantic annotations, potentially something like:

<attribute id = "att.12">
       ...
        <annotation>
            <propertyURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Characteristic</propertyURI>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
        </annotation>

We would want a native JSON-LD translation to embed these directly in the natural linked data format. For example the above would simply be:

"attribute": {
  "@id": "att.12",
  "oboe:Characteristic": "oboe:Mass"
}

with "oboe": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#" added to the JSON-LD @context (along with a context for native EML terms). Should probably get a @type element too.

In the case of oboe, looks like this may require some finessing of the namespace, since that's clearly just oboe-characteristics and not all of oboe at that address. Ideally this information will be in the EML, but maybe would need to be resolved by dereferencing the URIs.

A better JSON-LD entry would type the value appropriately as a UID instead of a text string:

"attribute": {
  "@id": "att.12",
  "oboe:Characteristic": {"@id": "oboe:Mass"},
  ...
}

and this could all be made more concise by bumping it up to the context file:

"@context": {
  ...,
  "oboe-characteristics": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#",
  "Characteristic": { "@id": "oboe-characteristics:Characteristic"},
  "Mass": { "@id": "oboe-characteristics:Mass"}
},

"attribute": {
  "@id": "att.12",
  "Characteristic": "Mass"
}

In principle, we should be able to round-trip any of this back to EML by JSON-LD framing using the EML context alone and then converting any terms that are not fully compacted (e.g. not EML) into semantic <annotation> elements.

cc @amoeba

change default EML version to 2.1.1 until 2.2.0 is officially released

I know there are workarounds for this, but I would rather not have to set global options every time I spin up an R session (or change my .Rprofile). For those of us (which may just be me!) testing this package regularly, it would be nice to have the default version be 2.1.1 until 2.2.0 is officially out and supported by DataONE.

Dealing with node values that can take optional attributes

If attributes become properties, node content needs a property (content?).

i.e. <url name="A">http://... becomes: "url": {"name": "A", "content": "http://..."}

but <url>http://... becomes: "url": "http://..."

The use of @value would be closer to the second case, but in JSON-LD a node cannot have an @value and any other property (other than a data type).
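
The two cases can be sketched on the R list side as follows. The helper `node_to_property` and the property name `content` are hypothetical, following the tentative naming above, and the URL is a placeholder.

```r
# Hypothetical helper: fold a node's attributes and text content into one
# list. With no attributes, the node collapses to a bare string.
node_to_property <- function(content, attrs = list()) {
  if (length(attrs) == 0) {
    return(content)
  }
  c(attrs, list(content = content))
}

# <url name="A">...</url>  ->  list(name = "A", content = "...")
with_attr <- node_to_property("http://example.org", list(name = "A"))

# <url>...</url>  ->  "..."
without_attr <- node_to_property("http://example.org")

str(with_attr)
str(without_attr)
```

The cost of this convention is that consumers must handle both shapes for the same element, a string or a list with a `content` entry.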

Edge case: TextType

EML's TextType nodes will be super tricky to round-trip if they are using inline XML markup such as:

this sentence has a <strong>bold</strong> word.

Obviously it makes no sense to try to split this into different key-value pairs. The "right" thing is probably to merely preserve the entire string, markup and all, as a text-valued string.
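
One possible way to keep the markup as a single string with xml2 is to serialize the child nodes and paste them back together. This is only a sketch of the idea under the assumption that the text sits in a `para` element; it is not how emld handles TextType.

```r
library(xml2)

para <- read_xml("<para>this sentence has a <strong>bold</strong> word.</para>")

# Serialize each child node (text and element alike) and glue the pieces back
# together, so the inline markup survives as part of one string value.
text <- paste(as.character(xml_contents(para)), collapse = "")
text
```

Going the other way would require parsing that string again as an XML fragment rather than escaping it, which is where the round trip gets tricky.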

Incidentally, I think this is a good example of where the complexity/richness of XML is actually making it harder to work with as a metadata description than JSON-LD. The flexibility to use nodes in the middle of related content like this, as a markup language and not just an object language, really overloads the concepts. Even with the richness of XML tooling, these patterns have proven hard for developers to work with, as evidenced by the general lack of full support for textType markup.
