taxizedb's Introduction

taxizedb

taxizedb - Tools for Working with Taxonomic Databases

Docs: https://docs.ropensci.org/taxizedb/

taxizedb is an R package for interacting with taxonomic databases. Its functionality can be divided into two parts:

  1. You can download the databases to your platform.
  2. You can query the downloaded databases to retrieve taxonomic information.

This two-step approach is different from tools which interact with web services for each query, and has a number of advantages:

  • Once you download a database, you can work with it offline
  • Once you download a database, querying it is super fast
  • As long as you store your database files, all the queries in your analysis will be fully reproducible

Data sources

When you download a database with taxizedb it will automatically convert it to SQLite and then all query functions will interact with this SQLite database. However, not all taxonomic databases are publicly available, or can be converted to SQLite. The following databases are supported:

Get in touch in the issues with any ideas on new data sources.

Package API

For each data source, this package performs the following tasks:

  • Download taxonomic databases - db_download_*
  • Create dplyr SQL backend via dbplyr::src_dbi - src_*
  • Query and get data back into a data.frame - sql_collect
  • Manage cached database files - tdb_cache
  • Retrieve immediate descendants of a taxon - children
  • Retrieve the taxonomic hierarchies from local database - classification
  • Retrieve all taxa descending from a vector of taxa - downstream
  • Convert species names to taxon IDs - name2taxid
  • Convert taxon IDs to species names - taxid2name
  • Convert taxon IDs to ranks - taxid2rank

You can use the src connections with dplyr, etc. to do operations downstream. Or use the database connection to do raw SQL queries.
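
A minimal sketch of the two-step workflow for the NCBI database, assuming taxizedb is installed and a one-time download is acceptable. The wrapper name `ncbi_lineage` is hypothetical; it only chains the download step with two of the query functions listed above.

```r
# Hypothetical wrapper: download NCBI once, then resolve names locally.
ncbi_lineage <- function(taxon_names) {
  if (!requireNamespace("taxizedb", quietly = TRUE)) {
    stop("install.packages('taxizedb') first")
  }
  # Step 1: download the database and convert it to SQLite
  taxizedb::db_download_ncbi()
  # Step 2: query the local database instead of a web service
  ids <- taxizedb::name2taxid(taxon_names, db = "ncbi")
  taxizedb::classification(ids, db = "ncbi")
}

# Not run here because it triggers a large download:
# ncbi_lineage("Arabidopsis thaliana")
```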

Installation

CRAN version

install.packages("taxizedb")

dev version

remotes::install_github("ropensci/taxizedb")

Citation

To cite taxizedb in publications use:

taxizedb's People

Contributors

arendsee, cboettig, gpli, maelle, rekyt, sckott, stitam, tdjames1

taxizedb's Issues

Docker image for taxize bundled with SQL Dbs

perhaps we can provide two docker images,

  • just SQL DBs for all those available (e.g., NCBI, ITIS, COL, theplantlist)
  • an image depending on above + rocker/ropensci or similar for the R environment all set up as well (w/ ropensci pkgs as needed, dplyr, DBI, sql connector pkgs)

Requires user to be familiar with docker, but if they are, then they don't have to deal with downloading/loading up each SQL DB

moved from ropensci/taxize#408

Is there a `parents` function?

I am in need of accessing the name of a parent at a specific rank and I was wondering if there is a built-in or a better way to achieve this. What I'm currently doing is the following:

tax_ids <- c(186803, 541000, 216572, 186804, 31979,  186806)
taxonomy <- classification(tax_ids)

get_rank_name <- function(hierarchy, rank_name) {
  if (is.atomic(hierarchy)) {
    return(NA_character_)
  }
  result <- hierarchy %>% dplyr::filter(
    rank == rank_name
    ) %>% dplyr::pull(name) %>% unique()
  if (length(result) == 0) {
    return(NA_character_)
  }
  return(result)
}

rank_names <- purrr::map_chr(taxonomy, get_rank_name, rank_name = "order")

         186803          541000          216572          186804           31979          186806 
"Clostridiales" "Clostridiales" "Clostridiales" "Clostridiales" "Clostridiales" "Clostridiales"

The above kind of works, but it would be neat to have this built into taxize/taxizedb. Also, if the requested rank does not exist exactly, it would be nice to get an in-between rank, for example superorder or suborder when requesting order.
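
The fallback idea could be sketched in base R as follows. This is not a taxizedb function: `name_at_rank` is a hypothetical helper, the "super"/"sub" prefix rule is an assumption about how neighbouring ranks are named, and `lineage` is a hand-written toy data frame shaped like one element of classification() output.

```r
# Return the taxon name at `rank_name`, or at a neighbouring
# "super"/"sub" rank when the exact rank is missing.
name_at_rank <- function(hierarchy, rank_name) {
  # classification() can return NA for IDs it cannot resolve
  if (!is.data.frame(hierarchy)) return(NA_character_)
  candidates <- c(rank_name, paste0("super", rank_name), paste0("sub", rank_name))
  for (r in candidates) {
    hit <- unique(hierarchy$name[hierarchy$rank == r])
    if (length(hit) > 0) return(hit[1])
  }
  NA_character_
}

# Toy lineage (hand-written, not actual database output)
lineage <- data.frame(
  name = c("Bacteria", "Firmicutes", "Clostridia", "Clostridiales"),
  rank = c("superkingdom", "phylum", "class", "order"),
  stringsAsFactors = FALSE
)

name_at_rank(lineage, "order")    # exact match
name_at_rank(lineage, "kingdom")  # falls back to superkingdom
name_at_rank(lineage, "family")   # no match at any candidate rank
```

Applied to a whole classification() result, this would be `vapply(taxonomy, name_at_rank, character(1), rank_name = "order")`.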

Downstream intermediates argument?

Hi all, first off thanks for the tremendous work on both taxize and taxizedb. This is not technically an issue but an ask about an important function.

I was wondering whether the 'intermediates' argument could be incorporated into the downstream function in taxizedb, or whether I can point my taxize queries at the local SQLite database. This function is very useful, but quite slow on NCBI once I have to add some kind of Sys.sleep() delay (I still need this despite my ENTREZ key).

I am working on trying to assign all genera to intermediate clades for a few large families, and am having a heck of a time trying to create a nice dataframe of relationships.
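
To illustrate what keeping intermediates means, here is a sketch on a toy parent/child table. It does not use taxizedb: the `nodes` table, its column names, and the `descend` helper are all hypothetical, and a real table would come from the local database.

```r
# Toy node table: one family, one subfamily, one tribe, two genera
nodes <- data.frame(
  id     = c("F", "SF1", "T1", "G1", "G2"),
  parent = c(NA, "F", "SF1", "T1", "SF1"),
  rank   = c("family", "subfamily", "tribe", "genus", "genus"),
  stringsAsFactors = FALSE
)

# Walk down from `root`, keeping every taxon visited (the intermediates)
# and not descending below taxa that already have rank `downto`.
descend <- function(nodes, root, downto) {
  out <- nodes[0, ]
  frontier <- root
  while (length(frontier) > 0) {
    kids <- nodes[!is.na(nodes$parent) & nodes$parent %in% frontier, ]
    out <- rbind(out, kids)
    frontier <- kids$id[kids$rank != downto]
  }
  out
}

res <- descend(nodes, "F", "genus")
genera <- res[res$rank == "genus", c("id", "parent")]  # genus-to-clade pairs
```

Each row of `genera` pairs a genus with the intermediate clade directly above it, which is the kind of relationship table described above.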

db_load_* functions don't take host address

It would be good to add a ... argument to pass additional arguments to DBI::dbConnect, in particular to support passing a host address.

pd <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), 
                     host = "postgres",
                     user = "postgres",
                     password = "password"
)

Looks like you make some calls to system, which would also need to support these options. Or maybe we can avoid the system call and do everything over the DBI connection?
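
A sketch of the `...` forwarding being requested, under the assumption that the loader opens its own connection. `db_connect_sketch` is a hypothetical name, not a taxizedb function; the `connect` argument exists only so the forwarding can be demonstrated without a running database.

```r
# Hypothetical loader: extra arguments such as `host` or `port` are forwarded
# untouched to the connection function (DBI::dbConnect by default).
db_connect_sketch <- function(drv, user, password, ..., connect = DBI::dbConnect) {
  connect(drv, user = user, password = password, ...)
}

# Stand-in connector that just reports the arguments it received
fake_connect <- function(drv, ...) list(...)
args <- db_connect_sketch("driver", "postgres", "password",
                          host = "postgres", connect = fake_connect)
args$host  # the host value reached the connector
```

With a real driver the call would look like `db_connect_sketch(RPostgreSQL::PostgreSQL(), "postgres", "password", host = "postgres")`.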

404 error when downloading COL

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] taxizedb_0.1.4    R.utils_2.9.0     R.oo_1.22.0       R.methodsS3_1.7.1 jsonlite_1.6     
 [6] lemon_0.4.3       openxlsx_4.1.0.1  timeDate_3043.102 askpass_1.1       lubridate_1.7.4  
[11] RODBC_1.3-15      scales_1.0.0      ggplot2_3.2.0     dplyr_0.8.3      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        ape_5.3           lattice_0.20-38   xlsxjars_0.6.1    zoo_1.8-6        
 [6] assertthat_0.2.1  zeallot_0.1.0     digest_0.6.20     foreach_1.4.7     R6_2.4.0         
[11] plyr_1.8.4        backports_1.1.4   RSQLite_2.1.2     pillar_1.4.2      rlang_0.4.0      
[16] lazyeval_0.2.2    RPostgreSQL_0.6-2 curl_4.0          rstudioapi_0.10   data.table_1.12.2
[21] blob_1.2.0        labeling_0.3      RMySQL_0.10.17    stringr_1.4.0     bit_1.1-14       
[26] munsell_0.5.0     compiler_3.6.0    xfun_0.8          pkgconfig_2.0.2   tidyselect_0.2.5 
[31] tibble_2.1.3      gridExtra_2.3     httpcode_0.2.0    codetools_0.2-16  reshape_0.8.8    
[36] hoardr_0.5.2      dbplyr_1.4.2      crayon_1.3.4      withr_2.1.2       rappdirs_0.3.1   
[41] crul_0.8.0        grid_3.6.0        nlme_3.1-140      gtable_0.3.0      DBI_1.0.0        
[46] magrittr_1.5      zip_2.0.3         cli_1.1.0         stringi_1.4.3     reshape2_1.4.3   
[51] xml2_1.2.0        vctrs_0.2.0       iterators_1.0.12  tools_3.6.0       bold_0.9.0       
[56] bit64_0.9-7       glue_1.3.1        purrr_0.3.2       parallel_3.6.0    colorspace_1.4-1 
[61] memoise_1.1.0     rJava_0.9-11      knitr_1.23 

When I attempt to download Catalog of Life using the 'db_download_col()' function I get an error message that reads:

Error in curl::curl_download(db_url, db_path, quiet = TRUE) : HTTP error 404

Query works with taxize but doesn't work with taxizedb

Hi,

This function from the taxize package works:

taxize::classification("podoviridae", "ncbi")
#> No ENTREZ API key provided
#>  Get one via taxize::use_entrez()
#> See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
#> ══  1 queries  ═══════════════
#> 
#> Retrieving data for taxon 'podoviridae'
#> ✓  Found:  podoviridae
#> ══  Results  ═════════════════
#> 
#> • Total: 1 
#> • Found: 1 
#> • Not Found: 0
#> No ENTREZ API key provided
#>  Get one via taxize::use_entrez()
#> See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
#> $podoviridae
#>             name         rank      id
#> 1        Viruses superkingdom   10239
#> 2  Duplodnaviria        clade 2731341
#> 3 Heunggongvirae      kingdom 2731360
#> 4    Uroviricota       phylum 2731618
#> 5 Caudoviricetes        class 2731619
#> 6   Caudovirales        order   28883
#> 7    Podoviridae       family   10744
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"

The equivalent from taxizedb does not:

taxizedb::classification("podoviridae")
#> Error: Problem with `summarise()` column `taxids`.
#> ℹ `taxids = paste(.data$tax_id, collapse = "|")`.
#> x Column `tax_id` not found in `.data`
#> ℹ The error occurred in group 1: name = "podoviridae".

session_info:

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                         
#>  version  R version 4.1.2 (2021-11-01)  
#>  os       Debian GNU/Linux 11 (bullseye)
#>  system   x86_64, linux-gnu             
#>  ui       X11                           
#>  language en_GB:en                      
#>  collate  en_GB.UTF-8                   
#>  ctype    en_GB.UTF-8                   
#>  tz       Europe/Budapest               
#>  date     2021-11-10                    
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.0)
#>  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
#>  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.1)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.1)
#>  desc          1.2.0   2018-05-01 [3] CRAN (R 4.0.0)
#>  devtools      2.3.2   2020-09-18 [3] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.1)
#>  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.0)
#>  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
#>  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.1)
#>  highr         0.8     2019-03-20 [3] CRAN (R 4.0.0)
#>  htmltools     0.5.1.1 2021-01-22 [3] CRAN (R 4.0.3)
#>  knitr         1.31    2021-01-27 [3] CRAN (R 4.0.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.1)
#>  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
#>  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
#>  pkgload       1.1.0   2020-05-29 [3] CRAN (R 4.0.1)
#>  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.0)
#>  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
#>  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  remotes       2.2.0   2020-07-21 [3] CRAN (R 4.0.2)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
#>  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
#>  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
#>  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.0)
#>  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.0)
#>  testthat      3.0.1   2020-12-17 [3] CRAN (R 4.0.3)
#>  usethis       2.0.0   2020-12-10 [3] CRAN (R 4.0.3)
#>  withr         2.4.1   2021-01-26 [3] CRAN (R 4.0.3)
#>  xfun          0.26    2021-09-14 [1] CRAN (R 4.1.1)
#>  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.0)
#> 
#> [1] /home/tamas/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
packageVersion("taxize")
#> [1] '0.9.99'
packageVersion("taxizedb")
#> [1] '0.3.0'

Other taxa seem to be working. Any ideas what may be causing this strange behavior? Many thanks.

possibly use cached `taxize` objects for test suite?

This might be a bit tricky, but I guess whenever there's a new version of taxize on cran, then we can re-create the cached objects from taxize - then the test suite here won't fail due to failures in taxize (a lot of failures happen in taxize due to one or more of the many web services being down)

thoughts @arendsee ?
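
One way to sketch the caching idea, assuming fixtures are stored as RDS files. The helper name `cached` and the fixture file name are hypothetical; in a real test suite `compute` would wrap a taxize call, which is replaced here by a stub so the example runs offline.

```r
# Return the cached object if present, otherwise compute it once
# and store it for later test runs.
cached <- function(path, compute) {
  if (file.exists(path)) return(readRDS(path))
  value <- compute()
  saveRDS(value, path)
  value
}

fixture <- file.path(tempdir(), "children_3701.rds")
unlink(fixture)  # start from a clean state for the demonstration

calls <- 0
fetch <- function() {
  calls <<- calls + 1  # count how often the expensive call is made
  data.frame(id = "3702", name = "Arabidopsis thaliana", stringsAsFactors = FALSE)
}

first  <- cached(fixture, fetch)  # computes and writes the fixture
second <- cached(fixture, fetch)  # reads from disk; fetch() is not called again
```

Regenerating fixtures after a new taxize release would then amount to deleting the RDS files and rerunning the tests once.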

How to unify the list generated after classification?

Hi!

I am struggling to put together the output of classification().

I have a list of IDs (ids) that I wanted to get the different taxonomic levels from, so I ran the code:

taxa <- classification(ids , rank = "genus", db= "ncbi")

This worked completely fine, but it generated a list of data frames. One dataframe per ID.

I would like to put together all the dataframes, and obtain a table that has these columns: ID, phylum, order, class, family and genus but I do not know how to merge them.

Thanks for the suggestions!
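
A base R sketch of one way to flatten such a list into a single table, assuming each element is a data frame with `name` and `rank` columns. The `taxa` list below is hand-written toy data standing in for classification() output, and `lineage_row` is a hypothetical helper.

```r
# Toy stand-in for classification() output: a named list of data frames
taxa <- list(
  "10744" = data.frame(
    name = c("Uroviricota", "Caudoviricetes", "Caudovirales", "Podoviridae"),
    rank = c("phylum", "class", "order", "family"),
    stringsAsFactors = FALSE
  ),
  "3701" = data.frame(
    name = c("Streptophyta", "Magnoliopsida", "Brassicales", "Brassicaceae", "Arabidopsis"),
    rank = c("phylum", "class", "order", "family", "genus"),
    stringsAsFactors = FALSE
  )
)

ranks <- c("phylum", "class", "order", "family", "genus")

# One value per requested rank; ranks missing from a lineage become NA
lineage_row <- function(h) {
  if (!is.data.frame(h)) return(rep(NA_character_, length(ranks)))
  h$name[match(ranks, h$rank)]
}

mat <- t(vapply(taxa, lineage_row, character(length(ranks))))
colnames(mat) <- ranks
tax_table <- data.frame(ID = names(taxa), mat, stringsAsFactors = FALSE)
```

`tax_table` has one row per ID and one column per rank; the first toy entry is a family, so its genus column is NA.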

name2taxid issue with Catalogue of Life

Hi, I am trying to use name2taxid with Catalogue of Life as the database, but I get the error "Error: no such table: taxa". I get this even for simple queries like this one:

name2taxid("Bombus",db="col")

Any ideas?

  • Session info -------------------------------------------------------------------------
    setting value
    version R version 4.0.2 (2020-06-22)
    os Windows 10 x64
    system x86_64, mingw32
    ui RStudio
    language (EN)
    collate English_Germany.1252
    ctype English_Germany.1252
    tz Europe/Berlin
    date 2021-03-18

How to get a taxonomy table from taxizedb?

Hello,

In January I ran into a problem with the taxize API because of the number of bacterial taxa for which I want to retrieve taxonomy (10k+). (I posted about my problem here: ropensci/taxize#907.)

People advised me to use taxizedb, since it works offline and should fix my problem. However, when I run a simple command such as:

test = classification(name2taxid(c(taxa$specie_ID)))

taxa is a data frame with only one column named specie_ID, as follows:

> head(taxa$specie_ID)
[1] "Staphylococcus sp."   "Acinetobacter sp."    "Cutibacterium sp."    "Sphingomonas sp."     "Paenarthrobacter sp."
[6] "Paracoccus sp."

However, I receive an error:

> test = classification(name2taxid(c(taxa$specie_ID)))
Error in name2taxid(c(taxa$specie_ID)) :
  Some of the input names are ambiguous, try setting out_type to 'summary'

When I set out_type to 'summary', I get this:

> test = classification(name2taxid(c(taxa$specie_ID), out_type="summary"))
Error in `dplyr::summarize()`:
ℹ In argument: `taxids = paste(.data$tax_id, collapse = "|")`.
ℹ In group 1: `name = "Morganella sp."`.
Caused by error in `.data$tax_id`:
! Column `tax_id` not found in `.data`.
Backtrace:
  1. taxizedb::classification(name2taxid(c(taxa$specie_ID), out_type = "summary"))
  2. rlang:::abort_data_pronoun(x, call = y)

Apparently Morganella sp. is not recognized by taxizedb. I'm not particularly familiar with dplyr or with taxize, so I would just like to know how I could retrieve the taxonomy for each of my bacterial species, preferably as a table with columns like this:

Specie_ID Kingdom Phylum Class Order Family Genus
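
One possible workaround, assuming genus-level taxonomy is acceptable for entries of the form "Genus sp.", is to strip the " sp." suffix before the lookup. This is only a sketch of the name-cleaning step; whether the cleaned names then resolve unambiguously is not something it can guarantee.

```r
# Sample values taken from the output shown above
species_ids <- c("Staphylococcus sp.", "Acinetobacter sp.", "Morganella sp.")

# Drop a trailing " sp." so that only the genus name remains
genera <- unique(sub("\\s+sp\\.$", "", species_ids))

# The cleaned names could then be passed on, for example:
# ids <- taxizedb::name2taxid(genera, db = "ncbi")
```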

downstream possible bug

via ropensci/taxize#727

x <- downstream("Bacteria", db = "ncbi", downto="species")
#> Error in name2taxid(x[is_named], db = "ncbi") :
#>   Some of the input names are ambiguous, try setting out_type to 'summary'

but also

x <- downstream("Bacteria", db = "ncbi", downto="species", out_type = "summary")
#> Error in name2taxid(x[is_named], db = "ncbi") :
#>   Some of the input names are ambiguous, try setting out_type to 'summary'

not sure what problem is yet

load_gbif incorrectly claims 'sqlite3 not available'

I can use SQLite through the usual DBI / dplyr methods, e.g.

sqlite <- DBI::dbConnect(RSQLite::SQLite(), path = "ex.sqlite")
sqlite_src <- dbplyr::src_dbi(sqlite)

but this errors:

 gbif <- db_load_gbif()
checking if SQLite installed...
Error in db_installed("sqlite3") : 
sqlite3 not found on your computer
Install the missing tool(s) and try again

sessionInfo:

devtools::session_info()
Session info ---------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.442)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       UTC                         
 date     2018-03-26                  

Packages -------------------------------------------------------------------------------------------
 package     * version date       source        
 assertthat    0.2.0   2017-04-11 CRAN (R 3.4.3)
 base        * 3.4.3   2018-03-13 local         
 bindr         0.1.1   2018-03-13 CRAN (R 3.4.3)
 bindrcpp      0.2     2017-06-17 CRAN (R 3.4.3)
 bit           1.1-12  2014-04-09 CRAN (R 3.4.3)
 bit64         0.9-7   2017-05-08 CRAN (R 3.4.3)
 blob          1.1.0   2017-06-17 CRAN (R 3.4.3)
 compiler      3.4.3   2018-03-13 local         
 curl          3.1     2017-12-12 CRAN (R 3.4.3)
 datasets    * 3.4.3   2018-03-13 local         
 DBI           0.8     2018-03-02 CRAN (R 3.4.3)
 dbplyr      * 1.2.1   2018-02-19 CRAN (R 3.4.3)
 devtools      1.13.5  2018-02-18 CRAN (R 3.4.3)
 digest        0.6.15  2018-01-28 CRAN (R 3.4.3)
 dplyr       * 0.7.4   2017-09-28 CRAN (R 3.4.3)
 glue          1.2.0   2017-10-29 CRAN (R 3.4.3)
 graphics    * 3.4.3   2018-03-13 local         
 grDevices   * 3.4.3   2018-03-13 local         
 hoardr        0.2.0   2017-05-10 CRAN (R 3.4.3)
 magrittr      1.5     2014-11-22 CRAN (R 3.4.3)
 memoise       1.1.0   2017-04-21 CRAN (R 3.4.3)
 methods     * 3.4.3   2018-03-13 local         
 pillar        1.2.1   2018-02-27 CRAN (R 3.4.3)
 pkgconfig     2.0.1   2017-03-21 CRAN (R 3.4.3)
 R6            2.2.2   2017-06-17 CRAN (R 3.4.3)
 rappdirs      0.3.1   2016-03-28 CRAN (R 3.4.3)
 Rcpp          0.12.16 2018-03-13 CRAN (R 3.4.3)
 rlang         0.2.0   2018-02-20 CRAN (R 3.4.3)
 RMySQL        0.10.14 2018-02-26 CRAN (R 3.4.3)
 RPostgreSQL   0.6-2   2017-06-24 CRAN (R 3.4.3)
 RSQLite       2.0     2017-06-19 CRAN (R 3.4.3)
 rstudioapi    0.7     2017-09-07 CRAN (R 3.4.3)
 stats       * 3.4.3   2018-03-13 local         
 taxizedb    * 0.1.4   2017-06-20 CRAN (R 3.4.3)
 tibble        1.4.2   2018-01-22 CRAN (R 3.4.3)
 tools         3.4.3   2018-03-13 local         
 utils       * 3.4.3   2018-03-13 local         
 withr         2.1.2   2018-03-15 CRAN (R 3.4.3)
 yaml          2.1.18  2018-03-08 CRAN (R 3.4.3)

Can't download ITIS db

Session Info
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bdc_1.1.1       lubridate_1.8.0 ggspatial_1.1.5 forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
 [8] readr_2.1.2     tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6   tidyverse_1.3.1 sf_1.0-7        raster_3.5-15  
[15] sp_1.4-7       

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3         ellipsis_0.3.2           class_7.3-20             rgdal_1.5-30            
  [5] rprojroot_2.0.3          snakecase_0.11.0         fs_1.5.2                 rstudioapi_0.13         
  [9] proxy_0.4-26             scico_1.3.0              bit64_4.0.5              DT_0.23                 
 [13] fansi_1.0.3              rgnparser_0.2.5.91       xml2_1.3.3               codetools_0.2-18        
 [17] doParallel_1.0.17        contentid_0.0.15         cachem_1.0.6             knitr_1.39              
 [21] jsonlite_1.8.0           broom_0.8.0              dbplyr_2.2.0             rgeos_0.5-9             
 [25] oai_0.3.2                hoardr_0.5.2             compiler_4.2.0           httr_1.4.3              
 [29] backports_1.4.1          assertthat_0.2.1         fastmap_1.1.0            lazyeval_0.2.2          
 [33] cli_3.3.0                duckdb_0.3.2-2           htmltools_0.5.2          prettyunits_1.1.1       
 [37] tools_4.2.0              gtable_0.3.0             glue_1.6.2               rappdirs_0.3.3          
 [41] Rcpp_1.0.8.3             cellranger_1.1.0         CoordinateCleaner_2.0-20 vctrs_0.4.1             
 [45] conditionz_0.1.0         iterators_1.0.14         xfun_0.31                rvest_1.0.2             
 [49] lifecycle_1.0.1          sys_3.4                  terra_1.5-21             scales_1.2.0            
 [53] vroom_1.5.7              hms_1.1.1                parallel_4.2.0           taxizedb_0.3.0          
 [57] qs_0.25.3                yaml_2.3.5               curl_4.3.2               memoise_2.0.1           
 [61] geosphere_1.5-14         taxadb_0.1.5             stringi_1.7.6            RSQLite_2.2.14          
 [65] foreach_1.5.2            e1071_1.7-9              rgbif_3.7.2              measurements_1.4.0      
 [69] rlang_1.0.2              pkgconfig_2.0.3          evaluate_0.15            lattice_0.20-45         
 [73] htmlwidgets_1.5.4        bit_4.0.4                tidyselect_1.1.2         here_1.0.1              
 [77] plyr_1.8.7               magrittr_2.0.3           R6_2.5.1                 generics_0.1.2          
 [81] DBI_1.1.2                arkdb_0.0.15             pillar_1.7.0             haven_2.5.0             
 [85] whisker_0.4              withr_2.5.0              units_0.8-0              janitor_2.1.0           
 [89] modelr_0.1.8             crayon_1.5.1             uuid_1.1-0               KernSmooth_2.23-20      
 [93] utf8_1.2.2               RApiSerialize_0.1.0      rmarkdown_2.14           tzdb_0.3.0              
 [97] progress_1.2.2           rnaturalearth_0.1.0      grid_4.2.0               readxl_1.4.0            
[101] data.table_1.14.2        blob_1.2.3               reprex_2.0.1             digest_0.6.29           
[105] classInt_0.4-3           openssl_2.0.2            RcppParallel_5.1.5       munsell_0.5.0           
[109] stringfish_0.15.7        askpass_1.1            

Hi - I can't seem to download the ITIS db using taxizedb::db_download_itis(). I get the following error message:

Error in curl::curl_download(db_url, db_path, quiet = TRUE) : transfer closed with 115534120 bytes remaining to read

It seems like the error is with curl, but I figured I'd post this here in case anyone has run into the same issue.

I tried downloading the SQLite file directly from ITIS, but then I get this error when I try to use it with taxizedb:

Error: database disk image is malformed

Any help would be much appreciated.

Thank you!

Feedback

hey @Alectoria @Edild

moving SQL stuff here to this new pkg - because it's just too hard to integrate SQL stuff with such a big package as taxize

This pkg at least right now won't try to replicate the API based calls in taxize, but rather helps the user download data, load into a SQL database, then create an src class that can be plugged directly into dplyr for easy manipulation

let me know what you think

  • We do have some SQL queries for ITIS so we can replicate some of the API methods they provide, but not all
  • Theplantlist doesn't have a SQL db, I just made one from flat files
  • COL hasn't given me their SQL queries
  • NCBI isn't in here yet, their DB setup is painful

Extend new NCBI functionality to the other databases

There are a lot of new functions for the NCBI database (children, classification, downstream, name2taxid, taxid2name, taxid2rank). However, none of these are implemented for the other supported databases:

  • ITIS
  • COL
  • Theplantlist
  • GBIF

Using taxize and taxizedb

Dear Scott,

I would like to know whether it is possible to use the taxonomic name resolver functions of taxize (e.g., gnr_database) with taxonomic databases downloaded using the db_download_* functions from the taxizedb package.

Best regards,
Bruno

db_load_itis() fails after successful install with "psql not found on your computer"

Presumably this is related to this statement in the current readme.md: "Remember to start your PostgreSQL database for ITIS". It is unclear, however, what steps need to be taken to perform this task. For example, installation of taxizedb's dependencies doesn't create the C:\Program Files\PostgreSQL directory referred to here. It's also unclear if the RPostgreSQL package would be helpful or whether use of taxizedb requires compilation of PostgreSQL as described in its installation instructions.

My use cases for taxizedb are mainly checking ITIS records at off-internet field sites and infrequent bulk operations to check if any of a couple thousand binomial names are no longer accepted ITIS names. #14 appears adjacent to this issue.

library('dplyr')
library('dbplyr')
library('taxizedb')
itisPath = db_download_itis()
db_load_itis(itisPath)
checking if path exists...
checking if Postgres installed...
Error in db_installed("psql") :
psql not found on your computer
Install the missing tool(s) and try again

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.13.6 dbplyr_1.2.1 dplyr_0.7.6 taxizedb_0.1.4

loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 magrittr_1.5 bindr_0.1.1 rappdirs_0.3.1
[5] tidyselect_0.2.4 munsell_0.4.3 bit_1.1-14 colorspace_1.3-2
[9] R6_2.2.2 rlang_0.2.0 hoardr_0.2.0 blob_1.1.1
[13] plyr_1.8.4 tools_3.5.1 grid_3.5.1 gtable_0.2.0
[17] DBI_1.0.0 withr_2.1.2 bit64_0.9-7 RMySQL_0.10.15
[21] lazyeval_0.2.1 RPostgreSQL_0.6-2 digest_0.6.15 assertthat_0.2.0
[25] tibble_1.4.2 bindrcpp_0.2.2 purrr_0.2.5 ggplot2_2.2.1
[29] curl_3.2 glue_1.2.0 memoise_1.1.0 RSQLite_2.1.1
[33] compiler_3.5.1 pillar_1.2.3 scales_0.5.0 pkgconfig_2.0.1

possibly have sqlite as an option

would be least cumbersome solution - I think RSQLite embeds sqlite - so shouldn't need to actually download sqlite

though not sure if the dumps for mysql/postgres will work loading into sqlite

from #4

Possible bug: using 'downstream' function with World Flora Online

I'm getting an error trying to use the 'downstream' function with World Flora Online. The example from the reference manual is included below with the resulting error message; I have had the same result with all other taxa I have tried. Thanks for any help you can provide!

#example from reference manual:
> id <- name2taxid('Pinaceae', db = "wfo"); downstream(id, db = "wfo", downto = "species")
Error in strsplit(z, split = ",") : non-character argument
> id; class(id)
[1] "wfo-7000000470"
[1] "character"
Session Info
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] taxizedb_0.2.2 dplyr_1.0.2    stringr_1.4.0  rgbif_3.4.2

downloading error in `db_download_*` functions (due to missing directory .taxize_local)

When I try to follow the README, none of the db_download_* functions work

> x <- db_download_tpl()
downloading...
Error in curl::curl_download(db_url, db_path, quiet = TRUE) : 
  Failed to open file /Users/dlebauer/.taxize_local/plantlist.zip.

This worked after I created the directory $HOME/.taxize_local.

dir.create("~/.taxize_local")

I'd submit a pull request, but would suspect that this could either be created on loading the library or on first running any of the download functions.
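
A sketch of the second option, creating the directory on first use. `ensure_cache_dir` is a hypothetical helper, and the demonstration uses a temporary directory rather than $HOME/.taxize_local.

```r
# Create the cache directory if it is missing; safe to call repeatedly.
ensure_cache_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dir.exists(path))
}

cache_dir <- file.path(tempdir(), ".taxize_local_demo")
ensure_cache_dir(cache_dir)
```

Calling such a helper at the top of each download function would avoid the "Failed to open file" error without touching the file system when the package is merely loaded.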

`name2taxid` includes some questionable material

This happens:

> name2taxid("s2")
"164330"
 > name2taxid("s2") %>% taxid2name
"Thauera aminoaromatica"

S2 is a strain name for this bacterium. See here.

The problem is that I allow matches against any name_class. Here are all the name classes in the database:

name_class           count(name_class)
acronym                 1167
anamorph                 302
authority             410075
blast name               229
common name            14204
equivalent name        25058
genbank acronym          486
genbank anamorph         107
genbank common name    28182
genbank synonym         2958
in-part                  628
includes               36595
misnomer                1386
misspelling            35975
scientific name      1689025
synonym               168033
teleomorph               179
type material          11449

So the question is, which of these should we include?

Most of them seem pretty reasonable. The problematic ones are type material and acronym. Perhaps we should allow the user to select which name classes to allow?
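
A sketch of what a user-selectable filter could look like, on a hand-written toy table. The column names, the `match_name` helper, and its default set of classes are assumptions for illustration, not the package's actual schema or API.

```r
# Toy names table (hand-written rows, not actual database content)
names_tbl <- data.frame(
  tax_id     = c("164330", "164330"),
  name_txt   = c("Thauera aminoaromatica", "S2"),
  name_class = c("scientific name", "type material"),
  stringsAsFactors = FALSE
)

# Match a name case-insensitively, but only within the allowed name classes
match_name <- function(tbl, query,
                       classes = c("scientific name", "synonym", "common name")) {
  keep <- tolower(tbl$name_txt) == tolower(query) & tbl$name_class %in% classes
  unique(tbl$tax_id[keep])
}

match_name(names_tbl, "s2")                             # no match by default
match_name(names_tbl, "s2", classes = "type material")  # opt in explicitly
```

With a restrictive default, strain labels like "s2" no longer match unless the user asks for that name class.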

move away from RMySQL

IMPORTANT: the RMySQL package is being phased out and replaced by the RMariaDB package. All development is now focused on RMariaDB; the RMySQL package will only receive minimal maintenance.

Please consider migrating to RMariaDB and open new bugs and feature requests in the RMariaDB repository.

I get an error while installing the RMySQL package, and I probably can not get support for that package any more.

r-dbi/RMySQL#221

Data sources

  • ITIS: mostly sorted, just need to finish off the SQL versions of each function
  • NCBI: got data, just need to get sorted in the pkg
  • COL: big dataset, using MySQL for now, asking if they can provide PostgreSQL downloads, still working on download scripts, asked if they can share SQL query code for their API methods to duplicate local queries...waiting
  • plantlist: have code for download data, create database, need to put those together - also, there's no API calls really to match, so just do a simple search I guess, or maybe try to copy SQL queries in some ITIS functions
  • GBIF - sqlite db from darwin core archive, hosting on S3

from ropensci/taxize#400

children test failing

the test at https://github.com/ropensci/taxizedb/blob/master/tests/testthat/test-children.R#L10-L13 is failing

I see

taxize::children(3701, db='ncbi')
#> $`3701`
#>    childtaxa_id                                                     childtaxa_name childtaxa_rank
#> 1       1837063                         Arabidopsis thaliana x Arabidopsis halleri        species
#> 2       1746102                                             Arabidopsis sp. hda9-2        species
#> 3       1547873                                           Arabidopsis sp. NH-2014a        species
#> 4       1547872                                              Arabidopsis umezawana        species
#> 5       1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica        species
#> 6       1240361                         Arabidopsis thaliana x Arabidopsis arenosa        species
#> 7        869751        Arabidopsis thaliana x Arabidopsis halleri subsp. gemmifera        species
#> 8        869750                          Arabidopsis thaliana x Arabidopsis lyrata        species
#> 9        412662                                            Arabidopsis pedemontana        species
#> 10       378006                         Arabidopsis arenosa x Arabidopsis thaliana        species
#> 11       347883                                              Arabidopsis arenicola        species
#> 12       302551                                              Arabidopsis petrogena        species
#> 13        97980                                               Arabidopsis croatica        species
#> 14        97979                                            Arabidopsis cebennensis        species
#> 15        81970                                                Arabidopsis halleri        species
#> 16        59690                                             Arabidopsis kamchatica        species
#> 17        59689                                                 Arabidopsis lyrata        species
#> 18        45251                                               Arabidopsis neglecta        species
#> 19        45249                                                Arabidopsis suecica        species
#> 20        38785                                                Arabidopsis arenosa        species
#> 21        29726                                                    Arabidopsis sp.        species
#> 22         3702                                               Arabidopsis thaliana        species
#> 
#> attr(,"class")
#> [1] "children"
#> attr(,"db")
#> [1] "ncbi"

taxizedb::children(3701, db='ncbi')
#> $`3701`
#>    childtaxa_id                                                     childtaxa_name childtaxa_rank
#> 1       1837063                         Arabidopsis thaliana x Arabidopsis halleri        species
#> 2       1547872                                              Arabidopsis umezawana        species
#> 3       1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica        species
#> 4       1240361                         Arabidopsis thaliana x Arabidopsis arenosa        species
#> 5        869750                          Arabidopsis thaliana x Arabidopsis lyrata        species
#> 6        412662                                            Arabidopsis pedemontana        species
#> 7        378006                         Arabidopsis arenosa x Arabidopsis thaliana        species
#> 8        347883                                              Arabidopsis arenicola        species
#> 9        302551                                              Arabidopsis petrogena        species
#> 10        97980                                               Arabidopsis croatica        species
#> 11        97979                                            Arabidopsis cebennensis        species
#> 12        81970                                                Arabidopsis halleri        species
#> 13        59690                                             Arabidopsis kamchatica        species
#> 14        59689                                                 Arabidopsis lyrata        species
#> 15        45251                                               Arabidopsis neglecta        species
#> 16        45249                                                Arabidopsis suecica        species
#> 17        38785                                                Arabidopsis arenosa        species
#> 18         3702                                               Arabidopsis thaliana        species
#> 
#> attr(,"class")
#> [1] "children"
#> attr(,"db")
#> [1] "ncbi"
> packageVersion('taxize')
[1] ‘0.9.4.9914’
> packageVersion('taxizedb')
[1] ‘0.1.7.9602’
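
To see exactly which rows differ between the two results, the child taxon IDs can be compared directly. The following is a minimal base-R sketch, with the IDs copied by hand from the two outputs above; the variable names are illustrative only and are not part of taxize or taxizedb.

```r
# Child taxon IDs copied from the two outputs above (NCBI taxon 3701).
taxize_ids <- c(1837063, 1746102, 1547873, 1547872, 1328956, 1240361,
                869751, 869750, 412662, 378006, 347883, 302551, 97980,
                97979, 81970, 59690, 59689, 45251, 45249, 38785, 29726, 3702)
taxizedb_ids <- c(1837063, 1547872, 1328956, 1240361, 869750, 412662,
                  378006, 347883, 302551, 97980, 97979, 81970, 59690,
                  59689, 45251, 45249, 38785, 3702)

# IDs returned by taxize but absent from the taxizedb result
missing_from_taxizedb <- setdiff(taxize_ids, taxizedb_ids)

# IDs returned by taxizedb but absent from the taxize result
extra_in_taxizedb <- setdiff(taxizedb_ids, taxize_ids)

print(missing_from_taxizedb)
print(length(extra_in_taxizedb))
```

For the outputs shown, this leaves four IDs that only taxize returns: 1746102 (Arabidopsis sp. hda9-2), 1547873 (Arabidopsis sp. NH-2014a), 869751 (Arabidopsis thaliana x Arabidopsis halleri subsp. gemmifera) and 29726 (Arabidopsis sp.). taxizedb returns nothing that taxize does not.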

Release taxizedb 0.3.1

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

GBIF database empty

I used db_download_gbif() to download the GBIF database, but when I try to run name2taxid(<any string>, db = "gbif") it always returns NA. Following the file path the database was downloaded to, I find gbif.sqlite appears to be only 24KB. Downloading the file directly from the link it is hosted at (https://taxize-dbs.s3-us-west-2.amazonaws.com/gbif.zip) yields the same result.

(Let me know if I should open an issue over at https://github.com/sckott/gbif-backbone-sql instead.)
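
A quick way to catch this kind of problem before querying is to check the size of the downloaded file. The sketch below uses only base R; `db_file_looks_empty()` and its `min_bytes` threshold are hypothetical helpers made up for illustration, not part of taxizedb, and the temporary file merely stands in for the 24KB `gbif.sqlite` described above.

```r
# Hypothetical helper: flag a database file that is missing or suspiciously small.
# min_bytes is an arbitrary illustrative threshold, not a taxizedb setting.
db_file_looks_empty <- function(path, min_bytes = 1024^2) {
  size <- file.size(path)  # NA when the file does not exist
  is.na(size) || size < min_bytes
}

# Stand-in for the reported file: a temporary file of 24 KB of zero bytes.
tmp <- tempfile(fileext = ".sqlite")
writeBin(raw(24 * 1024), tmp)

small_file_flagged <- db_file_looks_empty(tmp)
missing_file_flagged <- db_file_looks_empty(
  file.path(tempdir(), "no-such-file.sqlite")
)

unlink(tmp)  # clean up the temporary file
```

In practice `path` would be wherever the download step placed `gbif.sqlite`. The size check only indicates that something is likely wrong; it does not confirm which tables, if any, the file contains.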
