ropensci-archive / nbar Goto Github PK

View Code? Open in Web Editor NEW

3.0 12.0 2.0 6.14 MB

:warning: ARCHIVED :warning: R client library for the Netherlands Biodiversity Api (NBA)

License: Other

R 92.50% Shell 0.41% Java 1.32% Mustache 5.71% TeX 0.05%

r biodiversity-data api-client

nbar's Introduction

nbaR

This package has been archived. The former README is now in README-not.

nbar's People

Contributors

Stargazers

Watchers

Forkers

mbjoseph kevcaz

nbar's Issues

Schematic?

It might be nice in the getting started vignette to provide a schematic of the query steps to highlight the commonalities (e.g. $new, $resultset … etc.)


# specify data service
client <- DATASERVICE$new()
# create query conditions
condition1 <- QueryCondition$new()
condition2 <- QueryCondition$new()
...
# specify query
query <- QuerySpec$new(conditions(list(condition1, condition2 ...))
# search
res <- res$query(querySpec = query)

The code has a lot of repetition, and is not particularly human readable presumably because it is generated automatically from the API. I'm not sure how to evaluate this part of the package: on the one hand, it is awesome that the code is programmatically generated to deal with future changes in the API, but on the other hand, there are repetitive blocks of code like the following, which itself is repeated twice in the definition of the Specimen class:

self[["sourceSystemId"]] <-
  SpecimenList[["sourceSystemId"]]
self[["recordURI"]] <-
  SpecimenList[["recordURI"]]
self[["id"]] <-
  SpecimenList[["id"]]
self[["unitID"]] <-
  SpecimenList[["unitID"]]
self[["unitGUID"]] <-
  SpecimenList[["unitGUID"]]
self[["collectorsFieldNumber"]] <-
  SpecimenList[["collectorsFieldNumber"]]
self[["assemblageID"]] <-
  SpecimenList[["assemblageID"]]
self[["sourceInstitutionID"]] <-
  SpecimenList[["sourceInstitutionID"]]
self[["sourceID"]] <-
  SpecimenList[["sourceID"]]
self[["previousSourceID"]] <-
  SpecimenList[["previousSourceID"]]
self[["owner"]] <-
  SpecimenList[["owner"]]
self[["licenseType"]] <-
  SpecimenList[["licenseType"]]
self[["license"]] <-
  SpecimenList[["license"]]
self[["recordBasis"]] <-
  SpecimenList[["recordBasis"]]
self[["kindOfUnit"]] <-
  SpecimenList[["kindOfUnit"]]
self[["collectionType"]] <-
  SpecimenList[["collectionType"]]
self[["sex"]] <-
  SpecimenList[["sex"]]
self[["phaseOrStage"]] <-
  SpecimenList[["phaseOrStage"]]
self[["title"]] <-
  SpecimenList[["title"]]
self[["notes"]] <-
  SpecimenList[["notes"]]
self[["preparationType"]] <-
  SpecimenList[["preparationType"]]
self[["previousUnitsText"]] <-
  SpecimenList[["previousUnitsText"]]
self[["numberOfSpecimen"]] <-
  SpecimenList[["numberOfSpecimen"]]
self[["fromCaptivity"]] <-
  SpecimenList[["fromCaptivity"]]
self[["objectPublic"]] <-
  SpecimenList[["objectPublic"]]
self[["multiMediaPublic"]] <-
  SpecimenList[["multiMediaPublic"]]

One reason to prefer DRY code is to avoid making the same change over and over, but if a machine is tasked with making those changes, maybe DRY matters less? (As an aside: I was somewhat unsure in reviewing whether my suggestions relate more to swagger or nbaR!) The structure of the code might make it difficult for users to glean much information by reading the source code, relative to some other packages. From my perspective this could be fine, as long as there is good documentation (for humans) that gives users the information they need.

Pkgdown issues

Warning: In '_pkgdown.yml', topic must be a valid R expression.
Problem topic: `\`Element\``
Warning: In '_pkgdown.yml', topic must be a valid R expression.
Problem topic: `\`GeoAreaClient\``
Warning: In '_pkgdown.yml', topic must be a valid R expression.
Problem topic: `\`Element\``
Warning: In '_pkgdown.yml', topic must be a valid R expression.
Problem topic: `\`GeoAreaClient\``
Warning: Topics missing from index: chronos_calib, geo_age
These above errors can be fixed by updating the pkgdown yaml. (Unless you are using tic to auto-generate the yaml?)

Additionally, it might be nice to organise the functions by utility to a user (e.g. SpecimentClient, TaxonClient should take precedence over Agent). In packages I have written in the past I have organised the functions/classes by public and private categories.

Also, Changelog leads to a 404.

Specimen class docs

Further fields include (among others) information about Finding place, identification, and multimedia content.

It seems like "Finding" should not be capitalized (but maybe this is a swagger issue?)

DwCA download services

It would be worth suppressing the output of the download to improve readability.

'application note' vignette

Draft a vignette (±2-3 pages) that outlines the main functionality as relevant to a scientific end user: what data can I get and how do I get it?

Package vignette

Make package vignette describing workflows currently possible

Travis CI integration

Add travis CI integration

"advanced" data munging examples

Seeing that the data structures that one gets back are somewhat complex, it would be nice to see a few examples to show how to extract useful data from queries. For example: how to make a data frame with useful information coming out of a series of queries (like, a data frame with fossil calibration points for primate taxa).

Named list

Named lists in query conditions do not work?

library(nbaR)
sc <- SpecimenClient$new()
field <- 'identifications.defaultClassification.genus'
qc <- QueryCondition$new(field = field, operator = 'EQUALS', value = 'Solanum')
# success
qs <- QuerySpec$new(conditions = list(qc))
res <- sc$query(querySpec = qs)
res$content$totalSize
# fail
qc_list <- list('solanum' = qc)
qs <- QuerySpec$new(conditions = qc_list)
res <- sc$query(querySpec = qs)
res$content$totalSize

Rd documentation for classes

Preferably with swagger/roxygen

Test results

When I first downloaded the repo, I got no errors when testing the package. But since 8 Nov 2018, I started getting the error and warnings below. This may just be due to the NBA API itself, but is it possible for the package be able to note the difference between API failure and internal failure?

test-multimediaClient-metadata.R:36: warning: Settings work
Status code:500
Internal Server Error
Exception: Invalid NBA setting: 50000
Exception type: java.lang.IllegalArgumentException
Full stack trace stored in response object
test-specimenClient-find.R:57: warning: Test error handling in find functions
404 (NOT FOUND)
No Specimen exists with ID XXX
test-specimenClient-misc.R:90: failure: getDistinctValuesPerGroup works
res$content has length 3, not length 2.
test-specimenClient-query.R:149: warning: Errors and warnings work
Status code:500
Internal Server Error
Exception: Field associatedMultiMediaUris.accessUri cannot be queried: field is not indexed
Exception type: nl.naturalis.nba.api.InvalidConditionException
Full stack trace stored in response object
test-taxonClient-find.R:35: warning: Test error handling in find functions
404 (NOT FOUND)
No Specimen exists with ID XXX
test-taxonClient-query.R:80: warning: Errors and warnings work
Status code:500
Internal Server Error
Exception: Invalid element "somefield" in path "somefield"
Exception type: nl.naturalis.nba.api.InvalidConditionException
Full stack trace stored in response object
══ Results ════════════════════════════════════════
Duration: 85.4 s

OK:       1420
Failed:   1
Warnings: 8
Skipped:  2

Function wrappers

I have a thought that for the most basic of queries you could simply have wrapper functions, e.g. specimen_search or taxon_search, which would take just names/ids and return R data structures that the majority of R users would understand. These wrappers would likely lead to much less coding on the part of the end-user and, although less flexible, they should make the package much more accessible.

E.g. this below wrapper let's a user run of the examples from "getting started".

specimen_search <- function(value, field) {
  qcs <- vector(mode = 'list', length = length(value))
  for (i in seq_along(value)) {
    qcs[[i]] <- QueryCondition$new(field = field[[i]], operator = 'EQUALS',
                                   value = value[[i]])
  }
  qs <- QuerySpec$new(conditions = qcs)
  sc <- SpecimenClient$new()
  res <- sc$query(querySpec = qs, size = 100)
  lapply(X = res$content$resultSet, FUN = function(x) {
    x$item$toList()
  })
}

res <- specimen_search(value = c('female', 'species', "Equidae"),
                       field = c('sex', "identifications.taxonRank",
                                 "identifications.defaultClassification.family"))

Rd documentation for endpoint methods

At the moment, class methods are documented within the documentation of a class; e.g. ?Specimen would give all fields in that class and list all objects. It would be good to have separate documentation about a single methods, such as ?Specimen$query. Documentation is generated with roxygen, and at the moment, this does not seem to be possible for R6 classes...

Split main vignette

There is a lot of documentation already, and it was really critical for me to be able to start making more of my own queries. I think though I would have liked a little more organisation in the “getting started” article and more real-world examples. I think the shark vignette is great because it provides a real world use for the package, additional articles need not be so long though. In particular I think it would be great if the “Getting started” article were split up (e.g. into real basics and then independent articles for each data domain). I think a visual schematic showing the basic class structures would really help as well. Finally, I was interested in the operators (equals, not equals, contains….) but found these were not sufficiently explained. More examples would help.

Typos in the swagger README

(https://github.com/naturalis/nbaR/blob/master/other/swagger/README.md)
"NAB" should be "NBA"
"overwiev"
"contatining"

vignette client variable

ook in het vignette, er staat eerst client <- SpecimenClient$new() en res <- client$query(), maar verderop gaat het verder met res <- sc$query(querySpec=qs) (dus sc ipv client). is denk ik handiger als dat consequent is.

Typos shark vignette

"Baeysian"
"Maximum-Likelihood" can be lowercase: "maximum likelihood" or "maximum-likelihood"
"For many specimen," should be "specimens"
"levels; If ...", capitalization of "If" is not necessary with the semicolon
"...there are sufficient specimen determined", should be plural "specimens"

Add classes necessary for Specimen objects

documented unit tests

The best examples are working code samples, so it would be great if the unit tests could function in that way, i.e. they really exercise the code in realistic cases, with ample comments to talk the innocents through the code.

Unit tests for query endpoint

get_distinct_values_per_group and count_distinct_values_per_group

Both of these return complex nested lists, that a typical user might not immediately know how to use. Would it be possible to return a data frame here instead?

Explaining warnings

https://github.com/naturalis/nbaR/blob/master/vignettes/sharks.Rmd#L190

This command raised two warnings:

times <- geo_age(unique(c(as.character(data$youngChronoName), as.character(data$oldChronoName))))
Warning messages:
1: In FUN(X[[i]], ...) :
Could not retreive values for geo unit "Mioceen" from earthlifeconsortium.org
2: In FUN(X[[i]], ...) :
Could not retreive values for geo unit "Unspecified age" from earthlifeconsortium.org
It would be nice to have some text here for users to explain what these warnings are, and whether they matter.

Spell check

I highlight a few spelling errors (en.US):

Baeysian                          sharks.Rmd:22
conventience                      sharks.Rmd:160
digitised                         nbaR.Rd:12
                                         description:1
donw                              tomato.Rmd:156
errenous                          sharks.Rmd:247
exaclty                           tomato.Rmd:62
expleore                          sharks.Rmd:147
objetcs                           sharks.Rmd:171
respose                           tomato.Rmd:47
retreive                          geo_age.Rd:19
                                  tomato.Rmd:28
whch                              tomato.Rmd:62
worning                           sharks.Rmd:236

ServiceAccessPoints not parsed in fromJSONString

When reading a multimedia record from JSON, it seems to skip the serviceAccessPoints. Testfile:

{"totalSize":1,"resultSet":[{"score":12.3626375,"item":{"sourceSystem":{"code":"OBS","name":"Observation.org - Nature observations"},"sourceSystemId":"18926816","recordURI":"https://observation.org/photo/18926816/","id":"18926816_img@OBS","sourceInstitutionID":"Observation International","sourceID":"Observation.org","owner":"https://observation.org/user/90734/","licenseType":"Copyright","license":"CC BY-NC","unitID":"OBS.18926816_img","collectionType":"Observations","serviceAccessPoints":[{"accessUri":"https://observation.org/photo/18926816.jpg","format":"image/jpeg","variant":"Best Quality"}],"type":"StillImage","taxonCount":0,"associatedSpecimenReference":"164118412@OBS","phasesOrStages":["unknown"],"sexes":["undetermined"],"gatheringEvents":[{"continent":"Europe","country":"The Netherlands","iso3166Code":"NL","provinceState":"Zuid-Holland","locality":"Leiden - Bioscience Park Leiden","dateTimeBegin":"2018-10-31T00:00:00+0000","dateTimeEnd":"2018-10-31T23:59:59+0000","siteCoordinates":[{"longitudeDecimal":4.45,"latitudeDecimal":52.15,"coordinateErrorDistanceInMeters":5000,"spatialDatum":"WGS84","geoShape":{"type":"Point","coordinates":[4.45,52.15]}}]}],"identifications":[{"taxonRank":"multispecies","scientificName":{"fullScientificName":"Geometridae indet."},"defaultClassification":{"kingdom":"Animalia","family":"Geometridae"},"vernacularNames":[{"name":"Unidentified Geometer moth","language":"en","preferred":true}]}]}}]}

Final notes

[Done] In the en tibi vignette, the first paragraph is repeated.
[Done] Shark vignette: get_age is not a function. Should be geo_age?
[Done] Shark vignette: broken images?
[Done] Might be better to save the shark phylogeny as an R object (see save and use xz compression for CRAN ). That way the file can be compressed and loading is easier for a user: data(shark_phylogeney). See: http://r-pkgs.had.co.nz/data.html
[Done] It is good practice to spell out logicals: T and F should be TRUE and FALSE. (T and F are not protected)
[Done] Keep 80 column width in vignette examples please.

Pretty printing?

I wonder whether it would be better to have some more human readable representation when a client query result is printed? The current output is:


> # specify two query conditions
> l <- list(identifications.typeStatus="holotype", sex="female")
> # run query
> res <- client$query(queryParams=l)
> res
<Response>
  Public:
    clone: function (deep = FALSE)
    content: QueryResult, R6
    initialize: function (content, response)
    response: response

Users might expect something that would tell how many results were returned by the query, and maybe some information about the first couple of results. A public print method that provides some of this information would be nice. Note that this might not be necessary to add to these classes if there are wrapper functions that users are calling instead.

data frames

One potential solution that retains robustness to changes in the API would be to include wrapper functions that ingest the R6 objects that are currently created, and from those objects, create more commonly used objects (e.g., data frames, lists, and/or vectors). In particular, functionality to make data frames from query results might be nice, as data frames will generally be a preferred format for people wanting to analyze data. One expectation for new users might be that each record is a row in a data frame. The nested nature of the data does not immediately admit simply generating a table of records, but list columns might help to make this data structure (one row per record) more manageable for users to understand. Adding some functionality like this could make the package more accessible, and would probably be my biggest suggestion for improvement.

error messages

volgens mij hadden we het er al over gehad, maar de errors zijn generiek ("API server error") en soms is voor de leek bijna niet na te gaan wat er aan de hand is, zoals deze:
q1 <- QueryCondition$new(field = "associatedMultiMediaUris.accessUri", operator="EQUALS", value="some value")

geeft een error omdat de URI's niet geïndexeerd zijn. uiteindelijk vind je dat denk ik wel terug in de documentatie, maar dat vereist wel diep graven.

sc?

At this point in the text ….
Note: The function getFieldInfo on a certain field lists which operators are allowed for that field (e.g. sc$get_field_info()$content$unitID$allowedOperators).
I got confused as to what sc was. (Also is it getFieldInfo or get_field_info?) I think this section could be improved. Perhaps a better example showing how a user might look up all the available query options for the specimen service.

allowed_operators <- sc$get_field_info()$content
names(allowed_operators)

Or, perhaps a wrapper function?

Documentation

I found the default R help files (e.g., ?ChronoStratigraphy) to be not super helpful in trying to understand how to use the package. In particular, the "Usage" section usually contains the name of the class, so that users might need to read through the fields and methods to understand how to use the class. This could be hard for typical R users, particularly if they are not familiar with OOP in R6, and it also means that the help pages do not illustrate how the classes fit together. So, one option would be to hand-edit these autogenerated "Usage" sections, but that doesn't seem ideal if the API is going to change frequently. The current solution of leaning more heavily on the vignettes to demonstrate how all of these classes fit together makes sense from the perspective of automation, and it might even be worth linking users to relevant vignettes from the help files. If (human-authored) wrapper functions are added to the package, then those help files could have more detailed "Usage" sections.

On to the vignettes! As the package is currently written, these vignettes are a key resource for users. I really appreciated that the vignettes seemed complete in the sense of covering major functionality, and both of the case study vignettes were fun to work through and will be useful for users who are trying to solve those types of problems. That said, there may be some better ways to structure the vignettes. The "Get started" vignette currently is long, which might deter users who are looking for something like a "Quickstart" guide. Would it make sense to break the current vignette into several smaller vignettes on the basis of which *Client class they use? For example, vignettes for SpecimenClient, TaxonClient, MultimediaClient, GeoClient, and MetadataClient, with names that reflect common use cases that users will recognize, e.g., "Querying specimen records", "Searching by taxon", "Finding and accessing multimedia", etc.?

done

Goodpractice results

By removing the vignettes from the R package itself, you may also fix these issues goodpractice raised:

  ✖ write unit tests for all functions, and all package code in general.
    93% of code lines are covered by test cases.

    R/Feature.r:86:NA
    R/Feature.r:111:NA
    R/Feature.r:112:NA
    R/GeoArea.r:122:NA
    R/GeoArea.r:173:NA
    ... and 628 more lines

  ✖ fix this R CMD check WARNING: LaTeX errors when creating PDF version.
    This typically indicates Rd problems.
  ✖ fix this R CMD check ERROR: Re-running with no redirection of
    stdout/stderr. Hmm ... looks like a package You may want to clean up by 'rm -Rf
    /var/folders/ps/g89999v12490dmp0jnsfmykm0043m3/T//Rtmpn15kYK/Rd2pdf11fbd7d461aa9'
  ✖ fix this R CMD check NOTE: installed size is 6.3Mb sub-directories of
    1Mb or more: R 1.8Mb doc 4.3Mb

Size

I may have missed something here, but for me the size argument didn’t work -- the results were always stuck at 10.

library(nbaR)
sc <- SpecimenClient$new()
qc <- QueryCondition$new(field = 'identifications.defaultClassification.genus',
                         operator = 'EQUALS', value = 'Solanum')
qc2 <- QueryCondition$new(field = 'identifications.defaultClassification.specificEpithet',
                          operator = 'EQUALS', value = 'lycopersicum')
qs <- QuerySpec$new(conditions = list(qc, qc2))
res <- sc$query(querySpec = qs, size = 1000)
res$content$totalSize == length(res$content$resultSet)

find_by_ids

https://github.com/naturalis/nbaR/blob/master/vignettes/nbaR.Rmd#L333-L339

This method could be more idiomatic if it took a vector of ids rather than one string of ids separated by a comma.

Shark vignette: Explaining the size argument for new users

https://github.com/naturalis/nbaR/blob/master/vignettes/sharks.Rmd#L89-L92

The comment here suggests that size is important - it could be nice to add some text explaining why this is necessary in this case, and how users can determine whether it is necessary in their own applications.

Map of tomatoes

There is code, some of which was commented out, placed at the bottom of the En Tibi vignette without any accompanying text. Running this code I was able to generate a map of collected S. lycopersicum specimens. Is there a reason for it to be excluded? I am a little confused with the query for ‘gatheringEvent.siteCoordinates’ -- they are set to ‘NOT_EQUALS’ yet return only specimens which contain values in these slots.

Advanced queries

It would be great if users could use more idiomatic R code for multiple queries, e.g., something like r 'sex == female & identifications.taxonRank == species' rather than needing to create multiple QueryCondition objects and combine them in a QuerySpec object. I'm not sure how easy this would be to implement, but it could increase usability.

Element.R

A file Element.R gets generated by swagger codegen but should not.

QuerySpec size parameter

waarschijnlijk heb je het gewoon nog niet geïmplementeerd, maar in het vignette staat als voorbeeld al ```
qs <- QuerySpec$new(conditions=list(qc), size=1000), maar size doet het niet (het zijn altijd 10 resultaten):
q1 <- QueryCondition$new(field = "sex", operator="EQUALS", value="female")
qs <- QuerySpec$new(conditions=list(q1))
res <- sc$query(querySpec=qs, size=1000)
res$content$resultSet[[11]]$item
Error in res$content$resultSet[[11]] : subscript out of bounds

get_distinct_values

https://github.com/naturalis/nbaR/blob/master/vignettes/nbaR.Rmd#L353

There is a typo here: "thy data"

Check results

Warning: @nord [/Users/djb208/Coding/nbaR/R/Utils.r#27]: has no parameters
Warning: @nord [/Users/djb208/Coding/nbaR/R/Utils.r#205]: has no parameters

checking installed package size ... NOTE
installed size is 5.9Mb
sub-directories of 1Mb or more:
R 1.4Mb
doc 4.3Mb
checking top-level files ... NOTE
Non-standard file/directory found at top level:
‘tic.R’
If the authors wish to submit to CRAN they will likely need to ensure that the package size is less than 5MB. The excessive size of the package is likely due to the images in the vignettes. They could prevent this by halting the building of vignettes in the R package itself or removing the images. I prefer the former option as I feel they could, in fact do with more use-cases (as explained above and below), and they can always refer users to the online documentation in the locally available R help files. If they choose the former, they should refer to this guide: https://pkgdown.r-lib.org/articles/pkgdown.html#articles

Help files

In one of the vignettes, it was suggested I “Have a look at ?Response to see what this object contains.” -- yet the help file had little to no information. I think the level of information found in this help file is not too different from that of others.

I imagine that much of the documentation will have been generated with the code generator but i think it may be necessary to provide more information particularly for the more commonly used functions/classes. In particular, there are very few examples provided for the majority of the help files. I think it is required that all public functions/classes come with examples for any ROpenSci package.

With roxygen, examples can be stored in an examples/ folder and called in using roxygen syntax.

Character length in example code (vignettes)

Some of these code chunks have very long lines (over 80 char). It would be great to have shorter lines so that users don't need to scroll right.

Add function for query endpoint

Differentiating among clients

https://github.com/naturalis/nbaR/blob/master/vignettes/nbaR.Rmd#L474

It would be great to have some text here to explain how the taxonomic data services and TaxonClient use cases differ from the SpecimenClient use cases described earlier in the vignette. (This would be helpful regardless of whether the "Get started" vignette stays as one big document, or is split up.) Same comment for the geographic data services: https://github.com/naturalis/nbaR/blob/master/vignettes/nbaR.Rmd#L616

associated multimedia objects

en een ding dat ik niet helemaal begrijp: in specimen-resultaten met multimedia krijg ik naast associatedMultiMediaUris ook associatedMultiMediaObjects terug (zonder iets er in). wat zijn dat?

Empty query example*

https://github.com/naturalis/nbaR/blob/master/vignettes/nbaR.Rmd#L103-L104

Starting with an empty query might confuse new users (I was confused when I submitted an empty query and got back some results). Would it make more sense to begin with a common query operation?

Unit tests for classes

More information on operators

It would be great if the different operators could be demonstrated. For example, I wanted to download the taxonomic records for a given list of genera. Initially I was trying to compile multiple query conditions using ‘OR’. But I since discovered that it could be easily done using the ‘IN’ operator with value as a vector of genera.

I noticed there are lots of operators available… would it be too much to ask for more examples that involve other operators other than just EQUALS?

Query/filter specimen that have value in certain field

One aspect I wasn't able to do was call a specimen given that it had data within a certain field. E.g If i want to call specimens of a particular species given that "$item$gatheringEvent$siteCoordinates" has a value.
How would I go about doing that?