
taxadb's Introduction

taxadb


The goal of taxadb is to provide fast, consistent access to taxonomic data, supporting common tasks such as resolving taxonomic names to identifiers, looking up higher classification ranks of given species, or returning a list of all species below a given rank. These tasks are particularly common when synthesizing data across large species assemblies, such as combining occurrence records with trait records.

Existing approaches to these problems typically rely on web APIs, which can make them impractical for work with large numbers of species or in more complex pipelines. Queries and returned formats also differ across the different taxonomic authorities, making tasks that query multiple authorities particularly complex. taxadb creates a local database of most readily available taxonomic authorities, each of which is transformed into consistent, standard, and researcher-friendly tabular formats.

Install and initial setup

To get started, install from CRAN:

install.packages("taxadb")

or install the development version directly from GitHub:

devtools::install_github("ropensci/taxadb")
library(taxadb)
library(dplyr) # Used to illustrate how a typical workflow combines nicely with `dplyr`

Create a local copy of the (current) Catalogue of Life database:

td_create("col")

Read in the species list used by the Breeding Bird Survey:

bbs_species_list <- system.file("extdata/bbs.tsv", package="taxadb")
bbs <- read.delim(bbs_species_list)

Getting names and ids

Two core functions are get_ids() and get_names(). These functions take a vector of names or ids (respectively), and return a vector of ids or names (respectively). For instance, we can use this to attempt to resolve all the bird names in the Breeding Bird Survey against the Catalogue of Life:

birds <- bbs %>% 
  select(species) %>% 
  mutate(id = get_ids(species, "col"))
#> Joining with `by = join_by(scientificName)`

head(birds, 10)
#>                          species        id
#> 1         Dendrocygna autumnalis COL:34Q2Z
#> 2            Dendrocygna bicolor COL:34Q32
#> 3                Anser canagicus      <NA>
#> 4             Anser caerulescens      <NA>
#> 5  Chen caerulescens (blue form)      <NA>
#> 6                   Anser rossii      <NA>
#> 7                Anser albifrons COL:679WV
#> 8                Branta bernicla  COL:N749
#> 9      Branta bernicla nigricans      <NA>
#> 10             Branta hutchinsii  COL:N74B

Note that some names cannot be resolved to an identifier. This can occur because of misspellings, non-standard formatting, or the use of a synonym not recognized by the naming provider. Names that cannot be uniquely resolved because they are known synonyms of multiple different species will also return NA. The filter_name() function can help us resolve this last case (see below).

get_ids() returns the IDs of accepted names, that is, dwc:acceptedNameUsageID. We can resolve the IDs into accepted names:

birds %>% 
  mutate(accepted_name = get_names(id, "col")) %>% 
  head()
#>                         species        id          accepted_name
#> 1        Dendrocygna autumnalis COL:34Q2Z Dendrocygna autumnalis
#> 2           Dendrocygna bicolor COL:34Q32    Dendrocygna bicolor
#> 3               Anser canagicus      <NA>                   <NA>
#> 4            Anser caerulescens      <NA>                   <NA>
#> 5 Chen caerulescens (blue form)      <NA>                   <NA>
#> 6                  Anser rossii      <NA>                   <NA>

This illustrates that some of our names, e.g. Dendrocygna bicolor, are accepted in the Catalogue of Life, while others, e.g. Anser canagicus, are known synonyms of a different accepted name: Chen canagica. Resolving synonyms and accepted names to identifiers helps us avoid the possible mismatches we could have when the same species is known by two different names.

Taxonomic Data Tables

Local access to taxonomic data tables lets us do much more than look up names and ids. A family of filter_* functions in taxadb helps us work directly with subsets of the taxonomic data. As we noted above, this can be useful in resolving certain ambiguous names.

For instance, Agrostis caespitosa does not resolve to an identifier in ITIS:

get_ids("Agrostis caespitosa", "itis") 
#> Joining with `by = join_by(scientificName)`
#> Warning:   Found 5 possible identifiers for Agrostis caespitosa.
#>   Returning NA. Try filter_name('Agrostis caespitosa', '') to resolve manually.
#> [1] NA

Using filter_name(), we find this is because the name resolves not to zero matches, but is a known synonym of more than one accepted name (as indicated by the acceptedNameUsageID):

filter_name('Agrostis caespitosa', 'itis')
#> # A tibble: 6 × 15
#>   taxonID     scien…¹ taxon…² accep…³ taxon…⁴ updat…⁵ kingdom phylum class order
#>   <chr>       <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  <chr> <chr>
#> 1 ITIS:785430 Agrost… species ITIS:5… synonym 2010-1… Plantae <NA>   Magn… Poal…
#> 2 ITIS:785431 Agrost… species ITIS:4… synonym 2010-1… Plantae <NA>   Magn… Poal…
#> 3 ITIS:785432 Agrost… species ITIS:4… synonym 2010-1… Plantae <NA>   Magn… Poal…
#> 4 ITIS:785433 Agrost… species ITIS:7… synonym 2010-1… Plantae <NA>   Magn… Poal…
#> 5 ITIS:785434 Agrost… species ITIS:5… synonym 2010-1… Plantae <NA>   Magn… Poal…
#> 6 ITIS:785435 Agrost… species ITIS:7… synonym 2010-1… Plantae <NA>   Magn… Poal…
#> # … with 5 more variables: family <chr>, genus <chr>, specificEpithet <chr>,
#> #   infraspecificEpithet <chr>, vernacularName <chr>, and abbreviated variable
#> #   names ¹​scientificName, ²​taxonRank, ³​acceptedNameUsageID, ⁴​taxonomicStatus,
#> #   ⁵​update_date

We can resolve the scientific name to the acceptedNameUsage by calling get_names() on the accepted IDs. (These also correspond to the genus and specificEpithet columns, since the classification is always based only on the acceptedNameUsageID.)

filter_name("Agrostis caespitosa")  %>%
  mutate(acceptedNameUsage = get_names(acceptedNameUsageID)) %>% 
  select(scientificName, taxonomicStatus, acceptedNameUsage, acceptedNameUsageID)
#> # A tibble: 6 × 4
#>   scientificName      taxonomicStatus acceptedNameUsage          acceptedNameU…¹
#>   <chr>               <chr>           <chr>                      <chr>          
#> 1 Agrostis caespitosa synonym         Deschampsia cespitosa      ITIS:502001    
#> 2 Agrostis caespitosa synonym         Agrostis stolonifera       ITIS:40400     
#> 3 Agrostis caespitosa synonym         Agrostis stolonifera       ITIS:40400     
#> 4 Agrostis caespitosa synonym         Calamagrostis preslii      ITIS:782718    
#> 5 Agrostis caespitosa synonym         Muhlenbergia torreyi       ITIS:503886    
#> 6 Agrostis caespitosa synonym         Muhlenbergia quadridentata ITIS:783883    
#> # … with abbreviated variable name ¹​acceptedNameUsageID

The similar functions filter_id(), filter_rank(), and filter_common() take IDs, scientific ranks, or common names, respectively. Here, we can get taxonomic data on all bird names in the Catalogue of Life:

filter_rank(name = "Aves", rank = "class", provider = "col")
#> # A tibble: 10,598 × 25
#>    taxonID   accepte…¹ scien…² taxon…³ taxon…⁴ kingdom phylum class order family
#>    <chr>     <chr>     <chr>   <chr>   <chr>   <chr>   <chr>  <chr> <chr> <chr> 
#>  1 COL:59ZVZ COL:59ZVZ Tyto l… accept… species Animal… Chord… Aves  Stri… Tyton…
#>  2 COL:64X3G COL:64X3G Aegoli… accept… species Animal… Chord… Aves  Stri… Strig…
#>  3 COL:5XGW7 COL:5XGW7 Celeus… accept… species Animal… Chord… Aves  Pici… Picid…
#>  4 COL:4NGTM COL:4NGTM Psepho… accept… species Animal… Chord… Aves  Psit… Psitt…
#>  5 COL:3C9TF COL:3C9TF Eulamp… accept… species Animal… Chord… Aves  Apod… Troch…
#>  6 COL:6CZN3 COL:6CZN3 Discos… accept… species Animal… Chord… Aves  Apod… Troch…
#>  7 COL:4S4FP COL:4S4FP Rhaphi… accept… species Animal… Chord… Aves  Apod… Apodi…
#>  8 COL:7TCSL COL:7TCSL Dryoba… accept… species Animal… Chord… Aves  Pici… Picid…
#>  9 COL:3HT3X COL:3HT3X Gymnop… accept… species Animal… Chord… Aves  Colu… Colum…
#> 10 COL:3Z72J COL:3Z72J Melane… accept… species Animal… Chord… Aves  Pici… Picid…
#> # … with 10,588 more rows, 15 more variables: genus <chr>,
#> #   specificEpithet <chr>, infraspecificEpithet <chr>, cultivarEpithet <chr>,
#> #   datasetID <chr>, namePublishedIn <chr>, nameAccordingTo <chr>,
#> #   taxonRemarks <chr>, nomenclaturalStatus <chr>, nomenclaturalCode <chr>,
#> #   parentNameUsageID <chr>, originalNameUsageID <chr>,
#> #   `dcterms:references` <chr>, language <chr>, vernacularName <chr>, and
#> #   abbreviated variable names ¹​acceptedNameUsageID, ²​scientificName, …

Combining these with dplyr functions can make it easy to explore this data: for instance, which families have the most species?

filter_rank(name = "Aves", rank = "class", provider = "col") %>%
  filter(taxonomicStatus == "accepted", taxonRank=="species") %>% 
  group_by(family) %>%
  count(sort = TRUE) %>% 
  head()
#> # A tibble: 6 × 2
#> # Groups:   family [6]
#>   family           n
#>   <chr>        <int>
#> 1 Tyrannidae     401
#> 2 Thraupidae     374
#> 3 Psittacidae    370
#> 4 Trochilidae    361
#> 5 Columbidae     344
#> 6 Muscicapidae   314
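
The related filter_id() and filter_common() functions follow the same pattern. A minimal sketch (the identifier comes from the table above; the common name is only illustrative, and results depend on the provider's vernacular data):

# Look up a record by its identifier, and look up taxa by a common name
filter_id("COL:59ZVZ", provider = "col")
filter_common("Snowy Owl", provider = "col")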

Using the database connection directly

filter_* functions by default return in-memory data frames. Because they are filtering functions, they return the subset of the full data that matches a given query (names, ids, ranks, etc.), so the returned data.frames are smaller than the full record of a naming provider. Working directly with the underlying database connection (a local DuckDB database by default) gives us access to all the data. The taxa_tbl() function provides this connection:

taxa_tbl("col")
#> # Source:   table<v22.12_dwc_col> [?? x 25]
#> # Database: DuckDB 0.7.0 [unknown@Linux 5.17.15-76051715-generic:R 4.2.2/:memory:]
#>    taxonID   accepte…¹ scien…² taxon…³ taxon…⁴ kingdom phylum class order family
#>    <chr>     <chr>     <chr>   <chr>   <chr>   <chr>   <chr>  <chr> <chr> <chr> 
#>  1 COL:3L3RS COL:3L3RS Hersil… accept… species Animal… Arthr… <NA>  Aran… Hersi…
#>  2 COL:6MTNS COL:6MTNS Idiotr… accept… species Animal… Arthr… Inse… Hemi… Helot…
#>  3 COL:39VC7 COL:39VC7 Enitha… accept… species Animal… Arthr… Inse… Hemi… Noton…
#>  4 COL:6LHWM COL:6LHWM Heleoc… accept… species Animal… Arthr… Inse… Hemi… Nauco…
#>  5 COL:38PQV COL:38PQV Ectemn… accept… species Animal… Arthr… Inse… Hemi… Corix…
#>  6 COL:73VN6 COL:73VN6 Neomac… accept… species Animal… Arthr… Inse… Hemi… Nauco…
#>  7 COL:6MKPW COL:6MKPW Hydrot… accept… species Animal… Arthr… Inse… Hemi… Helot…
#>  8 COL:5FMGW COL:5FMGW rotumai accept… subspe… <NA>    <NA>   <NA>  <NA>  <NA>  
#>  9 COL:3L5C7 COL:3L5C7 Hesper… accept… species Animal… Arthr… Inse… Hemi… Corix…
#> 10 COL:SVTT  COL:SVTT  Cercot… accept… species Animal… Arthr… Inse… Hemi… Nepid…
#> # … with more rows, 15 more variables: genus <chr>, specificEpithet <chr>,
#> #   infraspecificEpithet <chr>, cultivarEpithet <chr>, datasetID <chr>,
#> #   namePublishedIn <chr>, nameAccordingTo <chr>, taxonRemarks <chr>,
#> #   nomenclaturalStatus <chr>, nomenclaturalCode <chr>,
#> #   parentNameUsageID <chr>, originalNameUsageID <chr>,
#> #   `dcterms:references` <chr>, language <chr>, vernacularName <chr>, and
#> #   abbreviated variable names ¹​acceptedNameUsageID, ²​scientificName, …

We can still use most of the familiar dplyr verbs to perform common tasks. For instance: which species has the most known synonyms?

taxa_tbl("itis") %>% 
  count(acceptedNameUsageID, sort=TRUE)
#> # Source:     SQL [?? x 2]
#> # Database:   DuckDB 0.7.0 [unknown@Linux 5.17.15-76051715-generic:R 4.2.2/:memory:]
#> # Ordered by: desc(n)
#>    acceptedNameUsageID     n
#>    <chr>               <dbl>
#>  1 ITIS:50               462
#>  2 ITIS:983681           303
#>  3 ITIS:983691           286
#>  4 ITIS:983714           237
#>  5 ITIS:983710           231
#>  6 ITIS:798259           145
#>  7 ITIS:24921            144
#>  8 ITIS:527684           134
#>  9 ITIS:505191           126
#> 10 ITIS:504874           123
#> # … with more rows

However, unlike the filter_* functions, which return convenient in-memory tables, this is still a remote connection. This means that direct access using the taxa_tbl() function (or the database connection itself via td_connect()) is lower-level and requires greater care. For instance, we cannot simply add %>% mutate(acceptedNameUsage = get_names(acceptedNameUsageID)) to the above, because get_names() does not work on a remote collection. Instead, we would first need to call collect() to pull the summary table into memory. Users familiar with remote databases in dplyr will find working with taxa_tbl() directly to be convenient and fast, while other users may find the filter_* approach more intuitive.
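
For example, to attach accepted names to the synonym counts above, we could first collect() the top rows and only then apply get_names(). A sketch, assuming the ITIS database has been created with td_create("itis"):

taxa_tbl("itis") %>%
  count(acceptedNameUsageID, sort = TRUE) %>%
  head(10) %>%
  collect() %>%  # pull the small summary table into memory
  mutate(acceptedNameUsage = get_names(acceptedNameUsageID, "itis"))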

Learn more

  • See richer examples in the package Tutorial.

  • Learn about the underlying data sources and formats in Data Sources.

  • Get better performance by selecting an alternative database backend engine.


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


taxadb's People

Contributors

annakrystalli, cboettig, jasono6359, karinorman, kguidonimartins, maelle, mattiaghilardi, mgirlich


taxadb's Issues

Link to the published paper and CITATION file

Hello taxadb maintainers (mainly @cboettig I suppose)! 👋
I really like the approach proposed by taxadb and realized the paper published in Methods in Ecology and Evolution is not cited anywhere in the package (even though the actual paper is part of the package) and there is no explicit CITATION file.

What are your takes on this?

I'll be happy to submit a PR that adds an explicit CITATION file with both the package and the paper mentioned, adds the paper to the README, and mentions it in some vignettes.

Thanks again for the amazing tool 👏

MonetDBLite removed from CRAN

Hi Carl and collaborators, I was watching the recent rOpenSci community call and was very excited to give taxadb a whirl.

It seems there is a dependency on MonetDBLite and it is not available for R 3.5.3 (or 3.6 I guess). The CRAN status on the package reveals that it has been removed from CRAN and archived in April 2019 at the request of the maintainer.

I see now that you have spotted this here: MonetDB/MonetDBLite-R#38 and mentioned the development of duckdb as an alternative.

Leaving this here for others who will have this question. taxadb looks great from Kari's presentation and I'll look forward to using taxadb when there has been some time to address the dependency issue.

All the best,

Paul

Initial reviewer response

Hi @lindsayplatt & @mcsiple, thanks for taking the time to review and for your helpful comments! A couple general responses below about issues you both ran into with more detailed and individualized responses to come!

Vignettes

Thanks @ldecicco-USGS for pointing out the vignette location. They are currently in the articles subdirectory to prevent timing out on CRAN (pkgdown still detects and builds vignettes in subdirectories). Prebuilt vignettes are linked at the bottom of the README, but we could also put a lighter-weight version in vignettes/ so they're accessible by browseVignettes(package="taxadb").

Backend Documentation

It's clear from both of your experiences with the package that an upfront description of the package backends (i.e. database hosts) is necessary. We will expand on that in the schema.rmd vignette, but also wanted to point to the article draft in paper/manuscript.rmd, which we would also love feedback on! It lays out the backends in a lot more detail and should lend some clarity.

@lindsayplatt, backend problems are likely the source of your performance issues. td_create("col") only installs the database for Catalogue of Life, whereas many of the examples reference other providers (such as ITIS, which is the default provider). If you try to use a provider that has not been installed, the default is to install that database on the fly, which takes a while! Running td_create("all") will install all databases ahead of time. This is obviously an issue of clarity on our part in the README, but just wanted to explain what was likely going on there.

Also, the freezing behavior Lindsay observed is almost surely due to the computer running low on memory. We suspect this is because RSQLite or MonetDBLite was not installed? We made RSQLite an optional dependency and allowed taxadb to work in-memory if it was not found, but this example demonstrates that was a poor choice, since users can easily run out of memory with these giant data tables! We're making RSQLite a required dependency now, and adding a vignette to explain how to choose between these. (Basically, RSQLite is easier to install but not as fast as MonetDBLite!) You can check sessionInfo() or just run taxadb::td_connect() to see what database connection is being used.
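
A quick way to run the checks described above, as a sketch (note that td_create("all") is a large download):

library(taxadb)
td_create("all")  # install every provider database ahead of time
td_connect()      # inspect the active database connection (e.g. duckdb or RSQLite)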

by_common should search all known common names

Currently by_common just uses the first available common name that we've grabbed. When multiple common names are available, this doesn't always give the best choice, so it would be better to check all common names. (For instance, this makes ITIS use "man" as the vernacular for Homo sapiens, and "rock cod" instead of "Atlantic cod" for Gadus morhua.)

Consider a "clean" name column for matching against

@karinorman & @sckott

Our current strategy matches requested names directly against the scientificName field, which is whatever name we get from the original names provider. I'm thinking we should create a new column of 'cleaned' names that would potentially not be part of the actual schema, but might be easier to match against in JOINS. The trick is to do this in a way that is still true to the original database and doesn't implicitly introduce some hidden assumptions.

For example:

In the lowercase branch, I've added a step which first does a mutate(input = tolower(scientificName)) on the stored database, simply to create a lowercase version of all the scientificNames (at any rank), so that we can do case-insensitive joins by also lowercasing the input query before running the join. Currently this is done when ids() is called, so it doesn't involve altering the Darwin Core records from data-raw. That adds unnecessary computational overhead, though only a second or so, since mutate with tolower is actually pretty fast here: when the data is in an external database, dplyr translates this into the SQL tolower function rather than applying the base R method.

This makes joins case-insensitive, but we still have other cases where we want a join to succeed and it doesn't. For instance, OTT has the synonym Chondria tenuissima (Withering) C.Agardh, 1817, which means that Chondria tenuissima fails to return a match. However, just hacking off anything after the first two words might implicitly introduce taxonomic assumptions that are not guaranteed (compared to being case-insensitive, which seems like a safer assumption).

The function clean_names() is intended to be applied to input names, and it can optionally do things like binomial-ize names to make this matching easier, but to actually get matches it should probably be applied to the stored species names as well (as an additional column). Opening this issue for us to think more about it.
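
A minimal sketch of the case-insensitive join described above, assuming the ITIS database has been created (the query names here are only illustrative):

library(dplyr)
library(taxadb)
query <- tibble::tibble(input = tolower(c("HOMO SAPIENS", "gadus morhua")))
taxa_tbl("itis") %>%
  mutate(input = tolower(scientificName)) %>%  # dbplyr translates tolower() to SQL
  inner_join(query, by = "input", copy = TRUE) %>%
  select(taxonID, scientificName, taxonomicStatus) %>%
  collect()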

Support queries across all authorities simultaneously

A user should be able to

  • resolve a list of taxonIDs drawn from all of the authorities into species names
  • crosswalk a list of mixed identifiers into identifiers of a given authority (introducing NAs where no match is found).
  • query a species list to return all possible matching identifiers. Possibly as a table of species name and a column for each authority, which should make it easy to subsequently determine the best combination of coverage. Ideally could also provide a collapsed version, with a single ID per species, choosing one authority when multiple are available in a manner that reflects a preference list or minimizes the number of different authorities used?
  • ...
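
A sketch of the crosswalk idea using existing functions, with one ID column per authority and NA where no match is found (the species and provider choices are only illustrative, and each provider database must already exist via td_create()):

library(taxadb)
species   <- c("Homo sapiens", "Gadus morhua")
providers <- c("itis", "col", "gbif")
crosswalk <- data.frame(
  scientificName = species,
  sapply(providers, function(p) get_ids(species, p))
)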

duckdb version

Is there a problem with duckdb versions?
If I say,

library(taxadb)
a <- taxadb::get_ids (names=c("homo sapiens"))

I get

Error in initialize(value, ...) : 
  duckdb_startup_R: Failed to open database: IO Error: Trying to read a database file with version number 21, but we can only read version 25.
The database file was created with an older version of DuckDB.

Of course, one can Sys.setenv(TAXADB_DRIVER="RSQLite") to avoid the issue.

create_db should permit opt-in selection of authorities

Currently create_db pulls in all data (though it has an argument to specify the authorities). Need to add a filename filter so it actually selects only the requested authorities. Documentation might want to provide some guidance on which are the largest databases too.

If we narrow down the focal set of table schema this might be faster anyway. E.g. maybe the long table can always be omitted; it should just be the union of the taxonid and heirarchy_long tables...

Restructure taxonid table

@karinorman as discussed today for taxonid tables:

  • id column should refer to the unique id of the accepted name or synonym. (it's possible some databases do not have synonym ids?)

  • We should add a column, accepted_id corresponding to the mapped id (e.g. same as id if the name is accepted, otherwise, the id of the accepted name and not that of the synonym).

  • ids() function should then be updated to return the accepted_id by default.

(Users will only see the true id then if they set pull=FALSE, since then they'd get back the table with both id and accepted_id.) I'm still on the fence about the whole pull setup; maybe ids() should always return a table. For compatibility with taxize, I've added a separate get_ids function which always returns a vector, though that vector can be in a variety of formats, using a bare id, a prefixed id, or a URI. Having separate functions to get back data.frame vs vector objects might ultimately be cleaner than the current pull=TRUE behavior -- generally it's best if functions have consistent return types... any thoughts on this?

fn names, defaults, and data locations, for database-backed packages

Current behavior: calling the function create_db() with no arguments results in data being installed to a default location of ~/.taxald. This default can be changed with an environment variable, TAXALD_HOME, or specified directly by providing a location to the dbdir argument of the function.

i.e. create_db() has the argument dbdir default to:

Sys.getenv("TAXALD_HOME", file.path(path.expand("~"), ".taxald"))

many questions:

  • is create_db() a poor function name? (e.g. other packages might use this same pattern of creating a local database, could lead to name conflict).

  • Is an env var preferable to an option? (Env vars are often used to set home locations for databases when software installs, often defaulting to user's home directory. But R packages tend to use options for user settings other than generic things like credentials. Technically this database could be accessed by non-R use, which is maybe a case for using env vars).

  • In this setup, having a default value for TAXALD_HOME is technically a CRAN violation. Should we just do away with the default and have the function error if users don't set an env var first? It does make the workflow more cumbersome.

  • If so, is it worth having a package function to set the env var? I mean, calling Sys.setenv(TAXALD_HOME = path.expand("~/.taxald")) is easy enough for most users, but it's not a documented function task that way. Maybe taxald_home("~/.taxald") would be preferable? (See the sketch below.)

  • Additional thoughts on naming things? is TAXALD_HOME a good name?

  • ... other related potential issues?

Note that this same pattern could be followed by packages that don't need a db but memoise calls that download and parse data files (like rfishbase 3.0).
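
For what it's worth, a hypothetical helper along the lines floated above (taxald_home() is not an actual package function, just a sketch of the idea):

taxald_home <- function(path = file.path(path.expand("~"), ".taxald")) {
  # set the environment variable that create_db() reads for its default location
  Sys.setenv(TAXALD_HOME = path)
  invisible(path)
}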

Collapse synonym table into taxonid

For instance:

all_itis <- 
  taxa_tbl("itis", "synonyms") %>% 
  select(name, id, rank, update_date) %>% 
  mutate(type = "synonym") %>% 
  union(taxa_tbl("itis", "taxonid") %>% 
          select(name, id, rank, update_date) %>% 
          mutate(type = "accepted"))

Then to get the accepted names, we need another function or another option on ids().

If df is the result of joining all_itis with my species list, then:

all_itis %>% 
  filter(type == "accepted") %>% 
  select(id, accepted_name = name) %>% 
  right_join(df)

Testing ID conditions / assumptions across providers

We should assemble a list of conditions / assumptions we might want to assert across provider DBs. These are the kinds of assumptions one might implicitly make, but that don't always hold -- sometimes for good reasons, sometimes not.

For instance, one might think that scientificName-taxonID pairs would be unique (for non-missing taxonID's).

But some DBs have multiple accepted names:

dups <- taxa_tbl("col") %>%
  select(taxonID, scientificName) %>%
  distinct() %>%
  count(taxonID, sort = TRUE) %>%
  filter(n > 1)

semi_join(taxa_tbl("col") %>% select(taxonID, scientificName), dups) %>%
  arrange(taxonID)

In this case, this is because COL recognizes both the form with and the form without the subsp. as accepted names (while it would have been more reasonable to denote one of these forms as a synonym):

 1 COL:10750775 Adelocephala (Oiticicia) purpurascens subsp. intensiva
 2 COL:10750775 Adelocephala (Oiticicia) purpurascens intensiva       
 3 COL:10750776 Adelocephala (Oiticicia) jucundoides subsp. eolus     
 4 COL:10750776 Adelocephala (Oiticicia) jucundoides eolus        

synonyms() should return a column of the inputs

Currently the output from synonyms() does not explicitly show which of the input names a given row corresponds to. The input name could be either an acceptedName or a synonym, or not preserved at all. For example, synonyms(c("Eutamias dorsalis", "Saxicola torquata"), "ott") returns:

  acceptedNameUsage synonym            taxonRank acceptedNameUsageID
1 Tamias dorsalis   Eutamias dorsalis  species   OTT:428073
2 Tamias dorsalis   Neotamias dorsalis species   OTT:428073
3 Tamias dorsalis   Eutamias canescens species   OTT:428073
4 Saxicola torquata Saxicola torquatus species   OTT:825641

The first input name explicitly matches the first row and implicitly matches the second and third (by having the same acceptedNameUsage as the first row), while the second input name matches the fourth row via its acceptedNameUsage. An input column that links each row to the corresponding input name would resolve that ambiguity.

NCBI taxonomy - missing records

Hi,

I'm playing around with taxadb and noticed that only 300K of the 2 million NCBI taxa are in the taxadb NCBI database. Any idea why only 15% of the NCBI taxa are included?

scientific names, canonical names, species names ...

Currently, most tables represent species names as canonical/scientific names, Genus species, e.g. Homo sapiens rather than just sapiens. Does this make sense? What about epithets?

I've tried to standardize this across the datasets (e.g. for the species column of the hierarchy tables). GBIF drops the rank 'species' altogether in preference for specificEpithet.

The same question arises in the taxonid tables: e.g. at species rank (which some but not all taxonid tables are restricted to anyway), should we use Genus species or Genus species epithet? Or something else? What about authorities that define multiple epithets?

Handling multiple matching synonyms

It would be nice for the id table to return precisely one row for each name in the input query vector. Unfortunately, some recognized synonyms are synonyms to two accepted names, and thus cannot be resolved automatically. For instance, in ITIS, 'Trochalopteron henrici gucenense' is a synonym for both 'Trochalopteron elliotii' and also for 'Trochalopteron henrici'.

What should the function do in this case? Clearly user input is needed to ultimately resolve these names, and the user should be notified, but it is still unclear what the best return structure should be, in a way that best favors automatic pipelines and reasoning -- e.g. not an interactive prompt, and not just a warning that has to be parsed; better to capture all possible cases in the return data structure natively. Perhaps one or more additional columns indicating the multiple matches?
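
One option along these lines: filter_name() already returns one row per candidate match, so the competing accepted names can be surfaced directly in the return structure. A sketch, assuming the ITIS database exists:

library(dplyr)
library(taxadb)
filter_name("Trochalopteron henrici gucenense", "itis") %>%
  mutate(acceptedNameUsage = get_names(acceptedNameUsageID, "itis")) %>%
  select(scientificName, taxonomicStatus, acceptedNameUsage, acceptedNameUsageID)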

Case-sensitive "input" column for by_ functions

Currently, by_ functions using the ignore_case = TRUE argument return an input column that is not case sensitive. They should return the same column that was given as input, whether or not case-sensitive matching is used.

tpl database

Hi,

A small detail in the tpl database: the space between the genus and specific epithet is missing in the "scientificName" column.


fuzzy match names?

Ability to fuzzy match names?

  • Should support "contains": filter(name %like% "%string%")

  • and "starts_with": filter(name %like% "string%")

  • Should handle at least some name parsing? Possibly better as pre-query cleaning functions. e.g. reduce to a canonical form, at least for sp. suffix, reducing to binomial names, etc. See Rees algorithm and https://doi.org/10.1016/j.tree.2010.09.004

Consider strategies to subset first, since filters on the full db are much slower than joins: e.g. strsplit species names to genus, filter-join all matching genera from the hierarchy table, filter-join all matching ids on the taxonid table, then filter.

(no luck on experiments with fuzzyjoin)
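
For reference, the "contains" and "starts_with" patterns above can already be expressed against the remote table via dbplyr's LIKE translation. A sketch, assuming the COL database exists (the genus used is only illustrative):

library(dplyr)
library(taxadb)
taxa_tbl("col") %>%
  filter(scientificName %like% "Gadus%") %>%  # translated to SQL LIKE
  select(taxonID, scientificName) %>%
  head()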

Error message when calling filter_by

Dear @cboettig and other taxadb developers 👋

I've started getting an error message when using the filter_by function. The following is a reproducible example:

get_gbif <- as.data.frame(taxadb::filter_by(spp_BioT, by= "scientificName", provider = "gbif"))
#> Error in initialize(value, ...): duckdb_startup_R: Failed to open database

Created on 2021-03-04 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       Europe/Berlin               
#>  date     2021-03-04                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source                          
#>  arkdb         0.0.8   2020-11-04 [1] CRAN (R 4.0.3)                  
#>  askpass       1.1     2019-01-13 [1] CRAN (R 4.0.3)                  
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.3)                  
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.0.3)                  
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.0.3)                  
#>  blob          1.2.1   2020-01-20 [1] CRAN (R 4.0.3)                  
#>  cachem        1.0.4   2021-02-13 [1] CRAN (R 4.0.4)                  
#>  cli           2.3.1   2021-02-23 [1] CRAN (R 4.0.4)                  
#>  contentid     0.0.9   2021-01-15 [1] CRAN (R 4.0.4)                  
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.4)                  
#>  curl          4.3     2019-12-02 [1] CRAN (R 4.0.3)                  
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.4)                  
#>  dbplyr        2.1.0   2021-02-03 [1] CRAN (R 4.0.4)                  
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)                  
#>  dplyr         1.0.4   2021-02-02 [1] CRAN (R 4.0.4)                  
#>  duckdb        0.2.4   2021-02-02 [1] CRAN (R 4.0.4)                  
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.3)                  
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.3)                  
#>  fansi         0.4.2   2021-01-15 [1] Github (cran/fansi@39c8fbb)     
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.0.4)                  
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.3)                  
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.3)                  
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.3)                  
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.3)                  
#>  hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.4)                  
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.4)                  
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.3)                  
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.3)                  
#>  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.4)                  
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.4)                  
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)                  
#>  memoise       2.0.0   2021-01-26 [1] CRAN (R 4.0.4)                  
#>  openssl       1.4.3   2020-09-18 [1] CRAN (R 4.0.3)                  
#>  pillar        1.5.0   2021-02-22 [1] CRAN (R 4.0.4)                  
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)                  
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.3)                  
#>  progress      1.2.2   2019-05-16 [1] CRAN (R 4.0.3)                  
#>  ps            1.6.0   2021-02-28 [1] CRAN (R 4.0.4)                  
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.3)                  
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.0.3)                  
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.0.3)                  
#>  R.utils       2.10.1  2020-08-26 [1] CRAN (R 4.0.3)                  
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)                  
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.0.4)                  
#>  Rcpp          1.0.6   2021-01-15 [1] CRAN (R 4.0.4)                  
#>  readr         1.4.0   2020-10-05 [1] CRAN (R 4.0.3)                  
#>  reprex        1.0.0   2021-01-27 [1] CRAN (R 4.0.4)                  
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)                  
#>  rmarkdown     2.7     2021-02-19 [1] CRAN (R 4.0.4)                  
#>  RSQLite       2.2.3   2021-01-24 [1] CRAN (R 4.0.4)                  
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.3)                  
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.3)                  
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.3)                  
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.3)                  
#>  taxadb        0.1.2   2021-03-04 [1] Github (ropensci/taxadb@b3991e9)
#>  tibble        3.1.0   2021-02-25 [1] CRAN (R 4.0.3)                  
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.3)                  
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 4.0.3)                  
#>  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.3)                  
#>  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.4)                  
#>  xfun          0.21    2021-02-10 [1] CRAN (R 4.0.4)                  
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.3)                  
#> 
#> [1] C:/Users/ro03deda/Documents/R/win-library/4.0
#> [2] C:/Users/ro03deda/Documents/R/R-4.0.4/library


generate citations to original references by identifier

Some (most?) of the database providers include a references table connecting taxonIDs to the original references. Parsing some of those into valid citations might be a nightmare, but if we have some well-formed references, exporting them from tabular format to BibTeX might be nice.

In COL, looks like the references table is just URLs to the COL entries (though those do display the original citation information, we just don't seem to have it). GBIF seems similar. ITIS and NCBI may have more fully referenced data...

Anyway, just putting this here to hold the thought.

differences in case prevent matching

Consider option / strategy to make all joins independent of case.

  • Could add tolower as an option in clean_names() and fuzzy_names(), but should also be done for joins.

  • Should Darwin Core records (database tables) enforce tolower() on any/all names? Probably not; better to represent the data as it came and handle this in the join method.

Consider quickstart DB option

It would be nice to have the ability to quickly deploy the package without importing all datasets up front. A very light-weight version could consider just pulling in a single table directly into memory (like maybe the itis_hierarchy table, for instance, via read_tsv) and having all functions run off that? Would be useful for the unit tests and simple applications.

Make data_raw scripts into first-class citizens

I've discovered that all of the original sources for the databases can actually be accessed in formats other than Postgres or MySQL dumps, which means that we can migrate the import/cleaning scripts from data-raw into proper first-class package functions without introducing a dependency on having an external MariaDB and Postgres server handy (and the Postgres and MariaDB client applications installed on the host; all of which would be significant show-stoppers for many users, imho).

I've taken a look at doing this for GBIF and COL, since Scott showed me how to get COL dumps for every year that are already in Darwin Core tabular format, and thus require pretty minimal cleaning. Others (ITIS and NCBI in particular?) will need a lot of work, and some (like redlist) will still not be functions most users would ever actually want to run, because looping over the API takes ~1.5 days solid with a good network connection. So I think we'd still provide caches and have create_td() work off the cached tidy tables, but the whole workflow would be more transparent.

IUCN database

Hello,
I am trying to work on the IUCN database with taxadb, but it seems like the latest available version of the IUCN database is 2019.

> td_create("iucn")
could not find 2020_dwc_iucn, 2020_common_iucn 
  checking for older versions.
2020_dwc_iucn not available; 2020_common_iucn not available
> td_create("iucn",version = 2019)
Importing C:/Users/Cheng/AppData/Roaming/R/data/R/contentid/data/d9/1b/d91b51013b669a31fd268743cf2db866b0f3e7a7f1af78e60271fa5f137bd21e in 100000 line chunks:
[-] chunk 2	...Done! (in 10.51822 secs)
Importing C:/Users/Cheng/AppData/Roaming/R/data/R/contentid/data/30/51/30516362af0a394a7a78677ae129a95101cb852bda20ad872ae042ced43c463c in 100000 line chunks:
	...Done! (in 0.0776782 secs)
Warning messages:
1: In overwrite_db(con, tablename) : overwriting 2019_dwc_iucn
2: In read_chunked(con, lines, encoding) :
  connection has already been completely read

The real issue is that when I check the IUCN table, there are no terrestrial species in it. The acceptedNameUsageID values start with SLB:. Isn't that another database (SeaLifeBase)?

> taxa_tbl("iucn", version = 2019)
# Source:   table<2019_dwc_iucn> [?? x 14]
# Database: duckdb_connection
   taxonID scientificName           taxonRank    acceptedNameUsageID taxonomicStatus kingdom   phylum    class          order      family       genus       specificEpithet vernacularName infraspecificEpit~
   <chr>   <chr>                    <chr>        <chr>               <chr>           <chr>     <chr>     <chr>          <chr>      <chr>        <chr>       <chr>           <chr>          <chr>             
 1 NA      Aaptos aaptos var. nigra variety      SLB:130062          synonym         Animalia  Mollusca  Gastropoda     Neogastro~ Conidae      Conus       nimbosus        NA             NA                
 2 NA      Aaptos adriatica         species      SLB:51720           synonym         Plantae   Rhodophy~ Florideophyce~ Gigartina~ Areschougia~ Erythroclo~ angustatum      NA             NA                
 3 NA      Aaptos chromis           species      SLB:130062          synonym         Animalia  Mollusca  Gastropoda     Neogastro~ Conidae      Conus       nobilis         NA             NA                
 4 NA      Aaptos lithophaga        species      SLB:130708          synonym         NA        NA        NA             NA         NA           NA          NA              NA             NA                
 5 NA      Abacola holothuriae      species      SLB:30411           synonym         Chromista Ochrophy~ Coscinodiscop~ Fragilari~ Fragilariac~ Fragilaria  constricta      NA             NA                
 6 NA      Abanericola affinis afr~ subspecies   SLB:142030          synonym         NA        NA        NA             NA         NA           NA          NA              NA             NA                
 7 NA      Abanericola claparedi    species      SLB:38944           synonym         NA        NA        NA             NA         NA           NA          NA              NA             NA                
 8 NA      Abarenicola affinis aff~ nominotypic~ SLB:142030          accepted name   NA        NA        NA             NA         NA           NA          NA              NA             NA                
 9 NA      Abarenicola affinis afr~ subspecies   SLB:142030          accepted name   NA        NA        NA             NA         NA           NA          NA              NA             NA                
10 NA      Abatus nimrodi           species      SLB:152043          synonym         NA        NA        NA             NA         NA           NA          NA              NA             NA                
# ... with more rows

Not sure if you were able to reproduce this issue, but here is my session info:

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.5  taxadb_0.1.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.5.1      compiler_4.0.5    dbplyr_2.1.1      prettyunits_1.1.1 tools_4.0.5      
 [7] progress_1.2.2    contentid_0.0.9   bit_4.0.4         memoise_2.0.0     jsonlite_1.7.2    RSQLite_2.2.5    
[13] lifecycle_1.0.0   tibble_3.1.0      pkgconfig_2.0.3   rlang_0.4.10      cli_2.4.0         DBI_1.1.1        
[19] curl_4.3          fastmap_1.1.0     duckdb_0.2.5      arkdb_0.0.12      httr_1.4.2        generics_0.1.0   
[25] fs_1.5.0          vctrs_0.3.7       askpass_1.1       hms_1.0.0         rappdirs_0.3.3    bit64_4.0.5      
[31] tidyselect_1.1.0  glue_1.4.2        R6_2.5.0          fansi_0.4.2       purrr_0.3.4       readr_1.4.0      
[37] blob_1.2.1        magrittr_2.0.1    ellipsis_0.3.1    assertthat_0.2.1  utf8_1.2.1        stringi_1.5.3    
[43] openssl_1.4.3     cachem_1.0.4      crayon_1.4.1

Thank you very much for the help.

Reviewer to do

General

  • deal with vignettes, either by checking that existing vignettes will pass CRAN, or creating another version with a toy ITIS that will run

  • add contribution guidelines (w/ URL, BugReports, and Maintainer)

Reviewer 1

  • examples for mutate_db and taxa_tbl, update examples for td_connect and td_create

  • backend descriptions in schema.rmd, and briefly in README? include info on how often database will be updated and general DWC column descriptions

  • add provider descriptions to README w/citations

  • add tests for mutate_db(), quick_db(), and has_tbl()

  • change by_* naming convention to filter_*

  • suppress warnings from td_create

  • delete code for null_tibble()

  • option for manually closing database connection (this probably already exists but is also probably different depending on provider?)

Reviewer 2

  • explicitly state location of database in README? (or just point them to td_create documentation)

  • test again on windows, can we replicate the issues? (install and td_create errors)

  • add clean common names example to mutate_db documentation

error in create_taxadb

create_taxadb()
#>   |==== ... many progress bars =| 100%
#> Error: [ENOENT] Failed to search directory 'data': no such file or directory

traceback()
#>
#> 5: (function (..., call. = TRUE, domain = NULL)
#>    {
#>     ...
#> 4: dir_map_(old, fun, all, sum(directory_entry_types[type]), recursive)
#> 3: dir_map(old, identity, all, recursive, type)
#> 2: fs::dir_ls("data/", glob = "*.tsv.bz2")
#> 1: create_taxadb()

by_common for providers w/out common names

by_common should return a meaningful error for providers that don't have common names. Providers with common names should be listed in the function documentation (itis, col, gbif, ncbi, fb, slb, iucn).
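
A sketch of the kind of guard suggested above (the provider list is copied from this issue; by_common_checked() is a hypothetical name, not a package function):

common_providers <- c("itis", "col", "gbif", "ncbi", "fb", "slb", "iucn")
by_common_checked <- function(name, provider = "itis") {
  if (!provider %in% common_providers) {
    # fail early with an informative message instead of an empty result
    stop("provider '", provider, "' does not supply common names", call. = FALSE)
  }
  taxadb::filter_common(name, provider)
}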

IO Error: Trying to read a database file with version number 11, but we can only read version 15.

While running EML::set_taxonomicCoverage with expand=TRUE, I ran into an error which can be reproduced from taxadb directly:

> library(taxadb)
> get_ids("Trochalopteron henrici gucenense")
Error in initialize(value, ...) :
  duckdb_startup_R: Failed to open database: IO Error: Trying to read a database file with version number 11, but we can only read version 15.
The database file was created with an older version of DuckDB.

The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
The storage will be stabilized when version 1.0 releases.

For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

I did a bit of looking and it seems like there's some infrastructure I don't entirely understand that generates a database dump that might just need to get re-run? Any ideas @cboettig?

session_info() output
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.0 (2021-05-18)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Juneau
 date     2021-06-02

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 arkdb         0.0.12  2021-04-05 [1] CRAN (R 4.1.0)
 askpass       1.1     2019-01-13 [1] CRAN (R 4.1.0)
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
 blob          1.2.1   2020-01-20 [1] CRAN (R 4.1.0)
 cachem        1.0.5   2021-05-15 [1] CRAN (R 4.1.0)
 callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.0)
 cli           2.5.0   2021-04-26 [1] CRAN (R 4.1.0)
 contentid     0.0.10  2021-04-27 [1] CRAN (R 4.1.0)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
 curl          4.3.1   2021-04-30 [1] CRAN (R 4.1.0)
 DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
 dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
 desc          1.3.0   2021-03-05 [1] CRAN (R 4.1.0)
 devtools      2.4.1   2021-05-05 [1] CRAN (R 4.1.0)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
 dplyr         1.0.6   2021-05-05 [1] CRAN (R 4.1.0)
 duckdb        0.2.6   2021-05-09 [1] CRAN (R 4.1.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 EML         * 2.0.5   2021-02-27 [1] CRAN (R 4.1.0)
 emld          0.5.1   2020-09-27 [1] CRAN (R 4.1.0)
 evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
 fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
 fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
 generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
 hms           1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
 htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
 httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
 jqr           1.2.1   2021-05-06 [1] CRAN (R 4.1.0)
 jsonld        2.2     2020-05-27 [1] CRAN (R 4.1.0)
 jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
 knitr         1.33    2021-04-24 [1] CRAN (R 4.1.0)
 lazyeval      0.2.2   2019-03-15 [1] CRAN (R 4.1.0)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
 memoise       2.0.0   2021-01-26 [1] CRAN (R 4.1.0)
 openssl       1.4.4   2021-04-30 [1] CRAN (R 4.1.0)
 pillar        1.6.1   2021-05-16 [1] CRAN (R 4.1.0)
 pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.1.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 pkgload       1.2.1   2021-04-06 [1] CRAN (R 4.1.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
 processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.0)
 progress      1.2.2   2019-05-16 [1] CRAN (R 4.1.0)
 ps            1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.1.0)
 rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.1.0)
 Rcpp          1.0.6   2021-01-15 [1] CRAN (R 4.1.0)
 readr         1.4.0   2020-10-05 [1] CRAN (R 4.1.0)
 remotes       2.4.0   2021-06-02 [1] CRAN (R 4.1.0)
 rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
 rmarkdown     2.8     2021-05-07 [1] CRAN (R 4.1.0)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
 RSQLite       2.2.7   2021-04-22 [1] CRAN (R 4.1.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
 stringi       1.6.2   2021-05-17 [1] CRAN (R 4.1.0)
 taxadb      * 0.1.3   2021-04-27 [1] CRAN (R 4.1.0)
 testthat      3.0.2   2021-02-14 [1] CRAN (R 4.1.0)
 tibble        3.1.2   2021-05-16 [1] CRAN (R 4.1.0)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 usethis       2.0.1   2021-02-10 [1] CRAN (R 4.1.0)
 utf8          1.2.1   2021-03-12 [1] CRAN (R 4.1.0)
 uuid          0.1-4   2020-02-26 [1] CRAN (R 4.1.0)
 V8            3.4.2   2021-05-01 [1] CRAN (R 4.1.0)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
 xfun          0.23    2021-05-15 [1] CRAN (R 4.1.0)
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)

[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

Broken provenance entries

Is it possible that the provenance file is a bit broken?
If I do,

  library(taxadb)
  a <- taxadb::get_ids (names=c("homo sapiens"))

I get,

Error in switch(compression, gzip = gzfile(path, ...), bz2 = bzfile(path, :
EXPR must be a length 1 vector
In addition: Warning message:
In FUN(X[[i]], ...) :
No sources found for hash://sha256/ae98e3de1cadd69c064aa7aeb26b89251b49926207d41f8185af28d5f7a8853d

I think there are two broken entries,
   2019_dwc_ncbi
   2021_common_itis

Does this then lead to the switch error, which seems to come from dbplyr falling over?

Add function for handling synonyms

  • Create synonyms table for all databases that provide them:
    • COL
    • NCBI
    • ITIS
    • FB (done in rfishbase synonyms() / validate_names(), needs porting here)
    • ....

Join all synonyms tables together for a more comprehensive list(?). Will require de-duplication.

Create a function that will query the name against the synonym list and return the correct match. Note: unlike a generic 'fuzzy match' or 'spell check', most synonym tables provide specific matches to accepted IDs/names. This is nice, as it permits automation where we can auto-correct names without having to request user input, but a fuzzy-match behavior might be a useful further addition.
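
A sketch of what that lookup could reduce to with the current tools (resolve_synonym() is a hypothetical name; assumes the relevant provider database exists):

resolve_synonym <- function(name, provider = "itis") {
  matches <- taxadb::filter_name(name, provider)
  # map each candidate match onto its accepted name
  unique(taxadb::get_names(matches$acceptedNameUsageID, provider))
}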

get_names Error: no such function: top_n_rank

Hi, I'm running the following command:

found_name <-taxadb::filter_name("Leporinus reinhardti", provider = 'fb')
taxadb::get_names(found_name$acceptedNameUsageID, 'fb')

But I get the error: no such function: top_n_rank

sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base

other attached packages:
[1] dbplyr_1.4.4 Hmisc_4.4-0
[3] Formula_1.2-3 survival_3.1-12
[5] lattice_0.20-41 flora_0.3.4
[7] data.table_1.12.8 rgnparser_0.1.0.91
[9] here_0.1 vroom_1.2.1
[11] forcats_0.5.0 stringr_1.4.0
[13] dplyr_1.0.0 purrr_0.3.4
[15] readr_1.3.1 tidyr_1.1.0
[17] tibble_3.0.1 ggplot2_3.3.1
[19] tidyverse_1.3.0 taxadb_0.1.0

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 ellipsis_0.3.1
[3] class_7.3-16 rprojroot_1.3-2
[5] htmlTable_1.13.3 base64enc_0.1-3
[7] fs_1.4.1 rstudioapi_0.11
[9] remotes_2.1.1 bit64_0.9-7
[11] fansi_0.4.1 lubridate_1.7.8
[13] xml2_1.3.2 codetools_0.2-16
[15] splines_4.0.0 knitr_1.28
[17] pkgload_1.1.0 jsonlite_1.6.1
[19] broom_0.5.6 cluster_2.1.0
[21] png_0.1-7 compiler_4.0.0
[23] httr_1.4.1 backports_1.1.7
[25] assertthat_0.2.1 Matrix_1.2-18
[27] cli_2.0.2 acepack_1.4.1
[29] htmltools_0.4.0 prettyunits_1.1.1
[31] tools_4.0.0 gtable_0.3.0
[33] glue_1.4.1 rappdirs_0.3.1
[35] Rcpp_1.0.4.6 cellranger_1.1.0
[37] raster_3.1-5 vctrs_0.3.1
[39] countrycode_1.2.0 nlme_3.1-147
[41] xfun_0.14 ps_1.3.3
[43] testthat_2.3.2 rvest_0.3.5
[45] lifecycle_0.2.0 sys_3.3
[47] devtools_2.3.0 stringdist_0.9.5.5
[49] scales_1.1.1 hms_0.5.3
[51] parallel_4.0.0 RColorBrewer_1.1-2
[53] curl_4.3 memoise_1.1.0
[55] gridExtra_2.3 rpart_4.1-15
[57] latticeExtra_0.6-29 stringi_1.4.6
[59] RSQLite_2.2.0 desc_1.2.0
[61] e1071_1.7-3 checkmate_2.0.0
[63] pkgbuild_1.0.8 rlang_0.4.6
[65] pkgconfig_2.0.3 sf_0.9-3
[67] htmlwidgets_1.5.1 bit_1.1-15.2
[69] tidyselect_1.1.0 processx_3.4.2
[71] magrittr_1.5 R6_2.4.1
[73] generics_0.0.2 DBI_1.1.0
[75] arkdb_0.0.5 pillar_1.4.4
[77] haven_2.3.1 foreign_0.8-78
[79] withr_2.2.0 units_0.6-6
[81] sp_1.4-2 nnet_7.3-13
[83] modelr_0.1.8 crayon_1.3.4
[85] KernSmooth_2.23-16 utf8_1.1.4
[87] usethis_1.6.1 jpeg_0.1-8.1
[89] progress_1.2.2 rnaturalearth_0.1.0
[91] grid_4.0.0 readxl_1.3.1
[93] blob_1.2.1 callr_3.4.3
[95] reprex_0.3.0 digest_0.6.25
[97] classInt_0.4-3 munsell_0.5.0
[99] sessioninfo_1.1.1

Thank you.

Add function for searching / returning common names

Not all authorities provide common names, and some authorities provide many common names for a given taxonID (even within a given language).

Rather nicely, and I think uniquely, NCBI includes common names associated with higher ranks, like "Fish".

Should construct a separate common-names table with columns of common name, language, scientific species name, and taxonID. Then create helper functions to query this table; ideally with a way of addressing the multiple-matches issues.
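
A sketch of the proposed common-names table layout (the rows are only illustrative; the "man" vernacular is the ITIS example from the by_common issue above):

library(tibble)
common_names <- tribble(
  ~vernacularName, ~language, ~scientificName, ~taxonID,
  "human",         "en",      "Homo sapiens",  "ITIS:180092",
  "man",           "en",      "Homo sapiens",  "ITIS:180092"
)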

scientific name authors?

The data-sources vignette mentions

While DWC encourages the use of authorship citations, these are intentionally omitted in most tables as inconsistency in abbreviations and formatting make names with authors much harder to resolve. When available, this information is provided in the additional optional columns using the corresponding Darwin Core terms.

However, although most of the data sources supported by taxadb do have scientific name author data, it does not seem to be provided in all of the taxadb databases. I have been able to verify this in ITIS and NCBI at a minimum. Furthermore, it is not clear, if authorship were available, what field it would show up in.

Although the vignette cites the presence of author names making name resolution more difficult as the reason not to include them, the opposite is also true. The author of a scientific name can be very important for resolving names, particularly in the case of ambiguous synonyms: names that are synonyms (thus pointing to different names) and have identical genus and specific epithet, but different authors. There is no way to distinguish these without author. And what is worse, code that matches on identical scientific names could lead one to completely different entities.

Would it be possible to add scientificNameAuthorship? That way there would be a standardized way to provide authorship data without polluting scientificName.

(related to #11)
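To illustrate why authorship matters for disambiguation, here is a small hypothetical example (the names, authors, and IDs below are invented placeholders): matching on scientificName alone is ambiguous, while adding a scientificNameAuthorship column resolves it.

library(dplyr)

# two ambiguous synonyms: identical genus + specific epithet, different authors,
# pointing to different accepted names (all values are placeholders)
syn <- tibble::tribble(
  ~scientificName,   ~scientificNameAuthorship, ~acceptedNameUsageID,
  "Exemplus dubius", "Smith, 1850",             "EX:1",
  "Exemplus dubius", "Jones, 1901",             "EX:2"
)

# matching on the name alone is ambiguous: two conflicting accepted IDs
syn %>% filter(scientificName == "Exemplus dubius")

# adding authorship makes the match unambiguous
syn %>% filter(scientificName == "Exemplus dubius",
               scientificNameAuthorship == "Smith, 1850")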

get_names result order

Hello,

My issue is quite simple: the get_names function gives me the correct names when I feed it identifiers (here from "itis", but it is the same with "col"), but the result comes back in the wrong order. For example, "ITIS:715228", which corresponds to the species Megapodius decollatus, appears as the first element in the second request below, although it should be second. This problem does not occur with the get_ids function, which returns results in the right order.

library(tidyverse)
library(taxadb)
td_create("itis")
get_names("ITIS:715228")
[1] "Megapodius decollatus"
get_names(c("ITIS:553896", "ITIS:715228", NA))
[1] "Megapodius decollatus" "Falcipennis canadensis" NA

Thank you for your help with this issue.

For reference, here is my sessionInfo() output:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1 tibble_3.0.3
[8] ggplot2_3.3.2 tidyverse_1.3.0 taxadb_0.1.0

loaded via a namespace (and not attached):
[1] progress_1.2.2 tidyselect_1.1.0 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2
[7] yaml_2.2.1 blob_1.2.1 rlang_0.4.7 pillar_1.4.6 glue_1.4.1 withr_2.2.0
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.2 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
[19] lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] curl_4.3 fansi_0.4.1 broom_0.7.0 arkdb_0.0.5 Rcpp_1.0.5 backports_1.1.8
[31] scales_1.1.1 jsonlite_1.7.0 fs_1.5.0 bit_4.0.4 hms_0.5.3 digest_0.6.25
[37] stringi_1.4.6 duckdb_0.2.1 grid_4.0.2 cli_2.0.2 tools_4.0.2 magrittr_1.5
[43] RSQLite_2.2.0 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 prettyunits_1.1.1
[49] reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11 R6_2.4.1
[55] compiler_4.0.2

error: could not resolve host hash-archive.org

Hello!

I am currently having an issue with taxadb version 0.1.2. Here's a minimal reproducible example from the documentation:

library(taxadb)
get_ids("Trochalopteron henrici gucenense") 

which errors out with Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: hash-archive.org

Session info:

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] taxadb_0.1.2   devtools_2.4.0 usethis_2.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.6.0      compiler_4.0.5    dbplyr_2.1.1
 [5] progress_1.2.2    prettyunits_1.1.1 remotes_2.3.0     tools_4.0.5
 [9] bit_1.1-15.2      testthat_3.0.2    contentid_0.0.9   pkgbuild_1.2.0
[13] pkgload_1.2.1     RSQLite_2.2.0     jsonlite_1.7.2    tibble_3.1.1
[17] memoise_2.0.0     lifecycle_1.0.0   pkgconfig_2.0.3   rlang_0.4.10
[21] cli_2.5.0         DBI_1.1.0         curl_4.3          fastmap_1.0.1
[25] duckdb_0.2.5      arkdb_0.0.12      withr_2.4.2       httr_1.4.2
[29] dplyr_1.0.5       rappdirs_0.3.1    hms_1.0.0         desc_1.3.0
[33] generics_0.0.2    fs_1.5.0          vctrs_0.3.7       askpass_1.1
[37] bit64_0.9-7       tidyselect_1.1.0  rprojroot_1.3-2   glue_1.4.1
[41] R6_2.4.1          processx_3.5.1    fansi_0.4.1       sessioninfo_1.1.1
[45] blob_1.2.1        readr_1.4.0       callr_3.7.0       purrr_0.3.4
[49] magrittr_2.0.1    backports_1.1.8   ps_1.6.0          ellipsis_0.3.1
[53] assertthat_0.2.1  utf8_1.1.4        stringi_1.4.6     openssl_1.4.3
[57] cachem_1.0.4      crayon_1.4.1

Fix ITIS postgres export

The data-raw/taxizedb ITIS export may have been run with an older (<= 0.4) version of arkdb, which dropped the headers.

safe_right_join consuming memory

I am using the filter_name function, but all of my memory (7 GB) is being consumed by the dplyr::left_join call inside the safe_right_join function. I am using the following example:

taxadb::filter_name(c("Ocotea odorifera", "Ocotea odorifere"), "gbif")

Is there a way to limit the memory usage? I tried options(duckdb_memory_limit = 1), assuming this is a duckdb issue, but it did not work.
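A speculative, unverified workaround sketch: if the local database is backed by duckdb, duckdb's PRAGMA memory_limit setting might cap memory use. This assumes td_connect() returns the underlying DBI connection to the duckdb backend; on other backends the PRAGMA would have no effect or could error.

library(DBI)
library(taxadb)

con <- td_connect()                           # connection to the local taxadb database
dbExecute(con, "PRAGMA memory_limit='1GB'")   # duckdb-specific setting

filter_name(c("Ocotea odorifera", "Ocotea odorifere"), "gbif")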

Regards,

rOpenSci onboarding to do

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I'm getting the rOpenSci folks the info to get a "taxadb" repo set up there. You'll be made admin once that's set up and you accept an invitation to join the taxadb team.
  • Add the rOpenSci footer ("ropensci_footer") to the bottom of your README.
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • If you already had a pkgdown website, fix its URL to point to https://docs.ropensci.org/package_name and deactivate the automatic deployment you might have set up, since it will not be built centrally like for all rOpenSci packages, see http://devdevguide.netlify.com/#docsropensci. In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
  • Add a mention of the review in DESCRIPTION via rodev::add_ro_desc().
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer AppVeyor projects to the ropensci AppVeyor account, so after transferring your repo to rOpenSci's "ropensci" GitHub organization the badge should be "AppVeyor Build Status".
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative; see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package. After installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Use DWC terms only in defining schemas?

Currently schemas use an essentially arbitrary set of names for columns. All column names could be defined instead as properties of a Taxon in Darwin Core (dwc: http://rs.tdwg.org/dwc/terms/):

  • id -> taxonID
  • name_type -> taxonomicStatus
  • name -> scientificName
  • species -> specificEpithet
  • accepted_id -> acceptedNameUsageID
  • rank -> taxonRank
  • common -> vernacularName

More significantly, the database could be reduced to a single flat table rather than separate hierarchy, taxonid, and synonyms tables, potentially without requiring too many convoluted transformations to be useful to the user. For instance, the most common ranks could be given explicitly as columns; alternately, we could use parentID and a pipe-delimited higherClassification.
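As a rough sketch of the mapping above, an existing table in the current schema could be renamed to the corresponding Darwin Core terms with dplyr::rename(); the taxa_df tibble below is just a placeholder standing in for such a table.

library(dplyr)

# placeholder table using the current (non-DWC) column names
taxa_df <- tibble::tibble(
  id = "EX:1", name_type = "accepted", name = "Exemplus primus",
  species = "primus", accepted_id = "EX:1", rank = "species", common = "example"
)

# rename to the corresponding Darwin Core terms
dwc <- taxa_df %>%
  rename(taxonID             = id,
         taxonomicStatus     = name_type,
         scientificName      = name,
         specificEpithet     = species,
         acceptedNameUsageID = accepted_id,
         taxonRank           = rank,
         vernacularName      = common)
dwc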

Warning message when calling get_ids

Hi,

I've started getting a warning when using the get_ids function (or other "get" functions). The following is a reproducible example:

taxadb::get_ids("Toxostoma rufum", db = "itis")

which produces the following output:

[1] "ITIS:178627" Warning message: ORDER BY is ignored in subqueries without LIMIT i Do you need to move arrange() later in the pipeline or use window_order() instead?

I don't think this is actually impacting the results, which look fine.

Session info:
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rgdal_1.5-19 sp_1.4-5 taxadb_0.1.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.3
[7] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3 tidyverse_1.3.0

Function interface -- better function names etc

Time to revisit and solidify the user interface for the package.

  • get_ids() works ~ as it does in taxize, returning a vector of IDs of the same length as the vector of names it is given. This works nicely with mutate() to add an ids column. Unlike taxize, the return format is consistent over all providers. Non-uniquely-matched names as well as unmatched names return as NA.

  • ids() Currently a function for doing a filtering join of scientificNames against the ids. Needs a better name that sounds like a verb or indicates that it returns a table.

  • get_names() the inverse function to get_ids(), doesn't exist yet but should.

  • synonyms() A helper utility to return all the synonyms of a queried name, using repeated rows for each new synonym of a given accepted name, like so:

input                  acceptedNameUsage      synonym                acceptedNameUsageID
spatula cyanoptera     Spatula cyanoptera     Anas cyanoptera        IUCN:22680233
antrostomus vociferus  Antrostomus vociferus  Caprimulgus vociferus  IUCN:22736393
antrostomus arizonae   Antrostomus arizonae   Caprimulgus vociferus  IUCN:22736398
antrostomus arizonae   Antrostomus arizonae   Caprimulgus arizonae   IUCN:22736398
trochilid              NA                     NA                     NA

It's not entirely clear to me that this wider format is a better format than simply filtering the original dwc-format table on the accepted IDs, e.g. (a sketch of this filtering approach follows the list below):

scientificName         taxonomicStatus  acceptedNameUsageID  taxonRank  taxonID
Spatula cyanoptera     accepted         IUCN:22680233        species    IUCN:22680233
Antigone canadensis    accepted         IUCN:22692078        species    IUCN:22692078
Antrostomus vociferus  accepted         IUCN:22736393        species    IUCN:22736393
Antrostomus arizonae   accepted         IUCN:22736398        species    IUCN:22736398
Anas cyanoptera        synonym          IUCN:22680233        species    NA
Grus canadensis        synonym          IUCN:22692078        species    NA
Caprimulgus vociferus  synonym          IUCN:22736393        species    NA
Caprimulgus vociferus  synonym          IUCN:22736398        species    NA
Caprimulgus arizonae   synonym          IUCN:22736398        species    NA
  • descendants() Currently like ids(), but takes name + rank (e.g. class = "Aves") as the filter. It is also currently overloaded so it can filter on a list of IDs, but that role should be taken by a separate function. (Also, it currently assumes taxonID, while it sometimes makes more sense to match on acceptedNameUsageID instead.)

  • common_names() not implemented

  • clean_names() operates on a vector of names and performs a series of transformations by default. The more aggressive of these (reducing names to the binomial) should probably be opt-in instead of opt-out. It may also need more methods to drop abbreviations, digits, non-alphabetic characters, or anything given in () or [], etc.
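Here is a sketch of the filtering approach mentioned under synonyms() above, assuming taxa_tbl() exposes a provider's full Darwin Core table as a dplyr-compatible tbl; the IUCN IDs are taken from the example tables.

library(dplyr)
library(taxadb)

# requires the IUCN tables to be present locally, e.g. via td_create("iucn")
ids <- c("IUCN:22680233", "IUCN:22736393", "IUCN:22736398")

taxa_tbl("iucn") %>%
  filter(acceptedNameUsageID %in% ids) %>%
  select(scientificName, taxonomicStatus, acceptedNameUsageID, taxonRank, taxonID) %>%
  collect()

This returns both the accepted names and their synonyms in the standard long format, which may be easier to reason about than the wide synonyms table.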

Additional Authorities?

  • IUCN Red List, see rredlist. Includes vernacular names and synonyms.
  • WoRMS, see the worrms package. Includes vernacular names and synonyms.
  • BOLD?
  • EOL
  • wikidata: we currently have taxon IDs, but not synonyms, vernacular names, or sameAs mappings.

See iNaturalist's work on taxon frameworks.
