
bdc's Introduction

bdc

A toolkit for standardizing, integrating, and cleaning biodiversity data

CRAN status downloads

R-CMD-check Codecov test coverage DOI License

Overview

Handling biodiversity data from several different sources is not an easy task. Here we present Biodiversity Data Cleaning (bdc), an R package for addressing quality issues and improving the fitness-for-use of biodiversity datasets. bdc contains functions to harmonize and integrate data from different sources following common standards and protocols, and implements various tests and tools to flag, document, clean, and correct taxonomic, spatial, and temporal data.

Compared to other available R packages, the main strength of the bdc package is that it brings together available tools – and a series of new ones – for assessing the quality of different dimensions of biodiversity data in a single, flexible toolkit. The functions can be applied to a multitude of taxonomic groups and datasets (including regional or local repositories), to individual countries, or worldwide.

Structure of bdc

The bdc toolkit is organized in thematic modules related to different biodiversity dimensions.


⚠️ The modules illustrated, and functions within, were linked to form a proposed reproducible workflow (see vignettes). However, all functions can also be executed independently.



Standardization and integration of different datasets into a standard database.

  • bdc_standardize_datasets() Standardization and integration of different datasets into a new dataset with column names following Darwin Core terminology

Flagging and removal of invalid or non-interpretable information, followed by data amendments (e.g., correct transposed coordinates and standardize country names).

  • bdc_scientificName_empty() Identification of records lacking names or with names not interpretable
  • bdc_coordinates_empty() Identification of records lacking information on latitude or longitude
  • bdc_coordinates_outOfRange() Identification of records with out-of-range coordinates (latitude > 90 or -90; longitude >180 or -180)
  • bdc_basisOfRecords_notStandard() Identification of records from doubtful sources (e.g., fossils or machine observations) that are impossible to interpret and not compatible with the Darwin Core recommended vocabulary
  • bdc_country_from_coordinates() Derivation of country names from valid geographic coordinates
  • bdc_country_standardized() Standardization of country names and retrieval of country codes
  • bdc_coordinates_transposed() Identification of records with potentially transposed latitude and longitude
  • bdc_coordinates_country_inconsistent() Identification of coordinates in other countries or beyond a specified distance from the coast of a reference country (i.e., in the ocean)
  • bdc_coordinates_from_locality() Identification of records lacking coordinates but with a detailed locality description from which coordinates can be derived
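As a sketch of how the pre-filter tests chain together, consider a hypothetical occurrence table `occ` with Darwin Core column names. The argument names follow calls quoted elsewhere on this page; verify them against the package documentation before use:

```r
# Hypothetical occurrence table using Darwin Core column names
occ <- data.frame(
  scientificName   = c("Puma concolor", NA),
  decimalLatitude  = c(-15.78, 200),    # the second latitude is out of range
  decimalLongitude = c(-47.93, -47.93)
)

# Each test appends a logical flag column (columns starting with ".";
# FALSE marks a record that failed the test)
occ <- bdc::bdc_scientificName_empty(occ, sci_names = "scientificName")
occ <- bdc::bdc_coordinates_empty(occ, lat = "decimalLatitude", lon = "decimalLongitude")
occ <- bdc::bdc_coordinates_outOfRange(occ, lat = "decimalLatitude", lon = "decimalLongitude")
```

Because each function takes and returns a data frame, the tests can be run in any order or independently, as noted above.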

Cleaning, parsing, and harmonization of scientific names against multiple taxonomic references.

  • bdc_clean_names() Name-checking routines to clean and split a taxonomic name into its binomial and authority components
  • bdc_query_names_taxadb() Harmonization of scientific names by correcting spelling errors and converting nomenclatural synonyms to currently accepted names.
  • bdc_filter_out_names() Filtering of records according to their taxonomic status in the column “notes”; for example, keeping only valid names categorized as “accepted”
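A minimal sketch of the taxonomy module (argument names are taken from calls quoted in the issues below; the `names_clean` output column is an assumption to check against the documentation):

```r
raw_names <- c("Puma concolor (Linnaeus, 1771)", "Cebus apela")

# Parse and clean the raw name strings (requires gnparser; see Installation)
cleaned <- bdc::bdc_clean_names(sci_names = raw_names, save_outputs = FALSE)

# Harmonize the cleaned names against a taxonomic authority
harmonized <- bdc::bdc_query_names_taxadb(
  sci_name = cleaned$names_clean,   # assumed output column of bdc_clean_names
  db       = "gbif",
  suggestion_distance = 0.9
)
```

The first call to bdc_query_names_taxadb() downloads and caches the chosen taxonomic database, so it can take a while.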

Flagging of erroneous, suspicious, and low-precision geographic coordinates.

  • bdc_coordinates_precision() Identification of records with a coordinate precision below a specified number of decimal places
  • clean_coordinates() (From the CoordinateCleaner package and part of the data-cleaning workflow). Identification of potentially problematic geographic coordinates based on geographic gazetteers and metadata. Includes tests for flagging records around country capitals or country or province centroids, duplicated records, records with equal coordinates, records around biodiversity institutions or within urban areas, plain zeros in the coordinates, and suspect geographic outliers
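For instance, flagging low-precision coordinates might look like the sketch below (the `ndec` argument name is an assumption; verify against the function's help page):

```r
occ <- data.frame(
  decimalLatitude  = c(-15.7801, -15.8),   # the second record has 1 decimal place
  decimalLongitude = c(-47.9292, -47.9)
)

# Flag records whose coordinates have fewer than 2 decimal places
occ <- bdc::bdc_coordinates_precision(
  data = occ,
  lat  = "decimalLatitude",
  lon  = "decimalLongitude",
  ndec = 2
)
```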

Time

Flagging and, whenever possible, correction of inconsistent collection dates.

  • bdc_eventDate_empty() Identification of records lacking information on event date (i.e., when a record was collected or observed)
  • bdc_year_outOfRange() Identification of records with illegitimate or potentially imprecise collecting years; the year provided can be out of range (e.g., in the future) or earlier than a year specified by the user (e.g., 1900)
  • bdc_year_from_eventDate() Extraction of a four-digit year from unambiguously interpretable collecting dates
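A sketch of the temporal tests (argument names are inferred from the function descriptions above; treat them as assumptions):

```r
occ <- data.frame(eventDate = c("2010-05-17", "1854-01-01", NA))

# Flag records with no event date
occ <- bdc::bdc_eventDate_empty(occ, eventDate = "eventDate")

# Extract a four-digit "year" column from interpretable dates
occ <- bdc::bdc_year_from_eventDate(occ, eventDate = "eventDate")

# Flag years in the future or before a user-supplied threshold
occ <- bdc::bdc_year_outOfRange(occ, eventDate = "eventDate", year_threshold = 1900)
```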

To facilitate the documentation, visualization, and interpretation of data-quality test results, the package contains functions for documenting the outcomes of the data-cleaning tests, including functions for saving i) records needing further inspection, ii) figures, and iii) data-quality reports.

  • bdc_create_report() Creation of data-quality reports documenting the results of data-quality tests and the taxonomic harmonization process
  • bdc_create_figures() Creation of figures (i.e., bar plots and maps) reporting the results of data-quality tests
  • bdc_filter_out_flags() Removal of columns containing the results of data quality tests (i.e., column starting with “.”) or other columns specified
  • bdc_quickmap() Creation of a map of points using ggplot2. Helpful in inspecting the results of data-cleaning tests
  • bdc_summary_col() This function creates or updates the column summarizing the results of data quality tests (i.e., the column “.summary”)
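Putting it together, a post-processing sketch (the `.summary` column name comes from the description above; `col_to_remove = "all"` is an assumption to verify against the docs):

```r
# A table that has been through some tests (one illustrative flag column)
occ <- data.frame(scientificName = c("Puma concolor", NA),
                  .scientificName_empty = c(TRUE, FALSE),
                  check.names = FALSE)

# Summarize all flag columns (those starting with ".") into ".summary":
# a record passes only if it passed every test
occ <- bdc::bdc_summary_col(data = occ)

# Keep passing records, then drop the flag columns
occ_clean <- dplyr::filter(occ, .summary == TRUE)
occ_clean <- bdc::bdc_filter_out_flags(data = occ_clean, col_to_remove = "all")
```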

Installation

Gnparser installation

Before installing bdc, it is necessary to install gnparser. First, download the gnparser binary for your operating system. For example, download the file using R as follows:

download.file(url = "file_link", 
              destfile = "destination_path")

The downloaded file will have a .zip or .tar.gz extension.

Mac OS

Extract the gnparser binary from the .zip or .tar.gz file and move it to the folder ~/Library/Application Support/. Move the file manually or using R:

# Extract the gnparser file (adjust the name to the file you downloaded)
untar("~/Downloads/gnparser-v1.9.1-mac.tar.gz")

# Move it to the expected path
file.copy("./gnparser", "~/Library/Application Support/")

Linux

Extract the gnparser binary from the .zip or .tar.gz file and move it to the folder ~/bin. Move the file manually or using R:

# Extract gnparser file
untar("~/Downloads/gnparser-v1.9.1-linux.tar.gz")

# Move to the path
file.copy("./gnparser",  "~/bin")

Windows

In Windows, extract the gnparser binary from the .zip file, then move it to the AppData folder. To find the AppData path and copy the file, run this in R:

# Unzip the downloaded file
unzip("gnparser.zip", exdir = "destination_path/gnparser")

# Find the AppData path
AppData_path <- Sys.getenv("AppData")

# Copy gnparser to AppData
file.copy("destination_path/gnparser", AppData_path, recursive = TRUE)

bdc installation

After installing gnparser, you can install bdc from CRAN:

install.packages("bdc")

or the development version from GitHub using:

install.packages("remotes")
remotes::install_github("brunobrr/bdc")

Load the package with:

library(bdc)

Package website

See the bdc package website (https://brunobrr.github.io/bdc/) for a detailed explanation of each module.

Getting help

If you encounter a clear bug, please file an issue here. For questions or suggestions, please send us an email ([email protected]).

Citation

Ribeiro, BR; Velazco, SJE; Guidoni-Martins, K; Tessarolo, G; Jardim, L; Bachman, SP; Loyola, R (2022). bdc: A toolkit for standardizing, integrating, and cleaning biodiversity data. Methods in Ecology and Evolution. doi.org/10.1111/2041-210X.13868

bdc's People

Contributors

black-snow, brunobrr, geiziane, kguidonimartins, lucas-jardim, sjevelazco, zanderaugusto


bdc's Issues

Problems to be solved in the package

  • Add a table (or list) of the functions to the website's home page (Bruno)
  • Create an FAQ (Bruno)
  • Check the examples that are not running: "devtools::run_examples(run_dontrun = TRUE)" (Lucas)
  • Update the logo
  • Update the website
  • Add directory creation to all functions that produce output, e.g., bdc_clean_names
  • Remove the workflow_step argument from the report and figures functions

Versioning issues with DuckDB in bdc_query_names_taxadb()

It seems this may be an issue similar to others mentioned here, but when running bdc_query_names_taxadb() I get a long error message about database version numbers:


resolvedspplistitis <- bdc_query_names_taxadb(sci_name=specieslist, db = "itis", suggestion_distance = 0.9)

Error: rapi_startup: Failed to open database: IO Error: Trying to read a database file with version number 31, but we can only read version 33.
The database file was created with an older version of DuckDB.

The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
The storage will be stabilized when version 1.0 releases.

For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

My session info:

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] job_0.3.0         rgnparser_0.2.0   doParallel_1.0.17 iterators_1.0.14  foreach_1.5.2     bdc_1.1.2         ggbeeswarm_0.6.0 
 [8] magrittr_2.0.3    taxize_0.9.100    forcats_0.5.1     stringr_1.4.0     readr_2.1.2       tidyr_1.2.0       tibble_3.1.8     
[15] ggplot2_3.3.6     tidyverse_1.3.2   purrr_0.3.4       dplyr_1.0.9       readxl_1.4.0     

loaded via a namespace (and not attached):
  [1] googledrive_2.0.0        colorspace_2.0-3         ellipsis_0.3.2           class_7.3-20             rgdal_1.5-30            
  [6] rprojroot_2.0.3          fs_1.5.2                 httpcode_0.3.0           rstudioapi_0.13          proxy_0.4-27            
 [11] DT_0.23                  fansi_1.0.3              lubridate_1.8.0          xml2_1.3.3               codetools_0.2-18        
 [16] contentid_0.0.15         bold_1.2.0               knitr_1.39               jsonlite_1.8.0           broom_1.0.0             
 [21] dbplyr_2.2.1             rgeos_0.5-9              oai_0.3.2                compiler_4.2.1           httr_1.4.3              
 [26] backports_1.4.1          assertthat_0.2.1         fastmap_1.1.0            lazyeval_0.2.2           gargle_1.2.0            
 [31] cli_3.3.0                duckdb_0.4.0             prettyunits_1.1.1        htmltools_0.5.3          tools_4.2.1             
 [36] gtable_0.3.0             glue_1.6.2               rappdirs_0.3.3           Rcpp_1.0.9               cellranger_1.1.0        
 [41] CoordinateCleaner_2.0-20 raster_3.5-21            vctrs_0.4.1              crul_1.2.0               ape_5.6-2               
 [46] nlme_3.1-158             conditionz_0.1.0         xfun_0.31                rvest_1.0.2              lifecycle_1.0.1         
 [51] sys_3.4                  googlesheets4_1.0.0      terra_1.5-21             zoo_1.8-10               scales_1.2.0            
 [56] hms_1.1.1                qs_0.25.3                curl_4.3.2               geosphere_1.5-14         taxadb_0.1.5            
 [61] reshape_0.8.9            stringi_1.7.8            e1071_1.7-11             rgbif_3.7.2              rlang_1.0.4             
 [66] pkgconfig_2.0.3          evaluate_0.15            lattice_0.20-45          sf_1.0-7                 htmlwidgets_1.5.4       
 [71] tidyselect_1.1.2         here_1.0.1               plyr_1.8.7               R6_2.5.1                 generics_0.1.3          
 [76] DBI_1.1.3                arkdb_0.0.15             pillar_1.8.0             haven_2.5.0              whisker_0.4             
 [81] withr_2.5.0              units_0.8-0              sp_1.5-0                 modelr_0.1.8             crayon_1.5.1            
 [86] uuid_1.1-0               KernSmooth_2.23-20       utf8_1.2.2               RApiSerialize_0.1.0      tzdb_0.3.0              
 [91] progress_1.2.2           rnaturalearth_0.1.0      grid_4.2.1               data.table_1.14.2        reprex_2.0.1            
 [96] digest_0.6.29            classInt_0.4-7           openssl_2.0.2            RcppParallel_5.1.5       munsell_0.5.0           
[101] stringfish_0.15.7        beeswarm_0.4.0           vipor_0.4.5              askpass_1.1 

Any help would be appreciated!

Standardization of coordinates

Hi everyone, since this package is developed by Brazilians, I will write in PT-BR to explain better.
This is not an error, just a suggestion for a function to standardize (or harmonize) geographic coordinates.
While developing another package that I should publish soon, I noticed that many coordinates end up using different character variants (for example ` and '), along with several other specific cases.
After this standardization, it would be possible to convert DMS coordinates to decimal degrees and vice versa.
I can help with this work, since I was already using it inside a function of my own package.
Thanks!

bdc_query_names_taxadb() Error in object[[name, exact = TRUE]] : subscript out of bounds

Hello,

I am experiencing issues when applying bdc_query_names_taxadb() to a standard list of species names (keeping all other arguments at their defaults), e.g.:

names_harmonization <- bdc_query_names_taxadb( species_NA$scientific_name )

I get the same error when running the standard example too. I have installed the latest version of bdc from github.

The full error I am getting is below:

Error in object[[name, exact = TRUE]] : subscript out of bounds
In addition: Warning messages:
1: In assert_deprecated(dbdir, driver, read_only) :
deprecated arguments will be removed from future releases, see function docs
2: In assert_deprecated(overwrite, lines) :
deprecated arguments will be removed from future releases, see function docs

Function to fix transposed xy

Hi @sjevelazco, these are the comments I mentioned to you in the meeting:

  1. Is the filter on line 136 (filt) working? Apparently it is not filtering only for Brazil
  2. Line 148: iso2 BR also selected Bolivia, Bulgaria, and British Guiana
  3. Export a table with id (our id), original xy, transposed xy, country, state, city, and locality.
  4. Save the dataset with the corrected coordinates

doParallel shouldn't be just a suggestion or parallel should default to false

I was wondering why I had to load doParallel even though I don't use it in my code. The reason is that bdc_suggest_names_taxadb() has an argument parallel that defaults to TRUE, and when parallel == TRUE bdc uses doParallel:

doParallel::registerDoParallel(cl)

I think the default for parallel should either be FALSE or you should move doParallel from suggestions up to imports.

Removing dependencies

Folks, we have a relatively long list of packages as dependencies. We need to reduce this list as much as possible.

Please check the box for the packages you think are necessary for our package to work. Those left unchecked will be moved to a Suggests list, and the function that depends on an unchecked package will inform the user about it.

  • here
  • dplyr
  • purrr
  • readr
  • fs
  • rlang
  • doParallel
  • foreach
  • sf
  • data.table
  • flora
  • geobr
  • lubridate
  • plyr
  • qs
  • rvest
  • sp
  • stringr
  • vroom
  • xml2
  • CoordinateCleaner
  • Hmisc
  • countrycode
  • rangeBuilder
  • tidyselect
  • raster
  • rnaturalearth
  • stringdist
  • stringi
  • taxadb
  • cowplot
  • ggplot2
  • kableExtra
  • knitr
  • scales
  • tibble
  • rgnparser
  • rworldmap
  • magrittr

unable to run bdc_clean_names

Hello. I am trying to run bdc_clean_names following the example in the Taxonomy vignette but using my own data: https://brunobrr.github.io/bdc/articles/taxonomy.html

bdc_clean_names(sci_names=mydata$scientificName, save_outputs = FALSE)

However, when I try to run it, I get an error saying that I am unable to install gnparser:

 `install_gnparser()` was deprecated in rgnparser 0.3.0 and is now defunct.
i Please see help page for deprecation reason and solution.
Run `rlang::last_error()` to see where the error occurred.

Is there a way to work around this? I was able to run gnparser using the Windows command line but I am unsure of how to get it to work with bdc. Thank you.

About the tests

TL;DR: Notes on the tests. Nothing important for now.

Tests for each function can be created with the help of the usethis package. For example, to create a test for the function bdc_suggest_names_taxadb, use:

usethis::use_test("bdc_suggest_names_taxadb")

This automatically creates the file tests/testthat/test-bdc_suggest_names_taxadb.R. If you are using RStudio, the file opens automatically. Edit the file (an example test is already filled in) and run the tests in the console with:

devtools::test()

The nice thing about this approach is that we can have GitHub itself run these tests on every new push to master or other branches. If something fails, we are notified by email. Ideally, pull-request changes should only be merged if their tests pass.

PR #57 adds tests for two functions.

Note: We don't need to worry about this right now. It can be done after the paper is submitted.

Error in the bdc_country_from_coordinates function

I am trying to use the function but I get this error:

Error in `dplyr::filter()`:
ℹ In argument: `!is.na(data_no_country[[lat]])`.
Caused by error in `.subset2()`:
! invalid negative subscript in get1index <real>

Any idea what might be causing it?

Issue with bdc_query_names_taxadb

Hello. It's me again. After assembling more data I decided to run bdc_query_names_taxadb() again and now get the following message:

It looks like you tried to incorrectly use a table in a schema as source.
ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
ℹ If your table actually contains "." in the name use `check_from = FALSE` to silence this message.

I tried typing in the suggested commands in the RStudio console but nothing seems to happen. Do you know what I should do about this message?

Test functions for cleaning names

Hi @Geiziane, as we discussed, we need to test which function is best for cleaning names before standardizing the taxonomy.
We have 3 options: the function we developed, the one from the WFO package, and the one from the taxadb package. The idea is to see which of them leaves the names as clean as possible (removing authors, punctuation, family names, signs of taxonomic uncertainty, etc.).

Create a table for standardizing databases

@kguidonimartins, here are details that still need to be added:

  1. Add warning messages for users. I thought of the following cases:
    • when a required field is missing;
    • when there is a dataset name in the "databaseInfo" spreadsheet but the path to the dataset does not exist or is wrong;
    • when the name provided by the user cannot be found because it is misspelled

Create a function for users to correct species taxonomy

After taxonomic validation and correction, a csv file will be generated reporting which names were or were not resolved. In this table, the user will be able to modify the names. Such modifications will be incorporated into the database before moving on to the next step (geographic cleaning).

The table released to users may contain the following fields:

ID | Name_verbatim | Name_verified | Resolved (T or F)

ID = unique ID created by us to identify each record;
Name_verbatim = species name exactly as it appeared in the original database;
Name_verified = name validated against a taxonomic reference;
Resolved = flag indicating whether the name was resolved (T) or not (F).

@lucas-jardim, we can create this function after the taxonomic verification.

Future implementations

Hi everyone, here are some features that could perhaps be implemented in the future.

Final sprint - task list

Hi @Geiziane @kguidonimartins @lucas-jardim @sjevelazco, below are some small tasks (quick things) to be completed. Let's try to have everything ready by Friday so we can check that everything is ok during our last meeting, at 9 am.

I have added the first letter of the name of the person responsible after each task.

Specific tasks:

  • Add arguments (data, lon, lat, and id) to the xy_transposed function [S]

  • Finish the script for the taxonomic check [L]

  • Fix the function for removing records outside a country [B]

  • Standardize and insert the script to generate and save the figures for each workflow step [G]

  • Check the documentation and tests of the functions [K]

I have also split our functions among us so that everyone can test them (see issue #58).
Please check that all functions have arguments; if not, please add them where necessary. To standardize, all arguments must be entered by users using quotes. The arguments are:

  • database: data
  • longitude: long
  • latitude: lat
  • species name: sci_name

Tests for each function can be placed in an .R file named after the function in the tests\testthat folder. For questions about how to write the tests, see #58 or ask our specialist @kguidonimartins on WhatsApp, lol.

  • bdc_aux_functions.R to bdc_filter_out_flags.R [S]
  • bdc_flag_missing_names.R to bdc_flag_xy_provenace.R [G]
  • bdc_geographic_outlier.R to bdc_query_wfo.R [K]
  • bdc_quickmap.R to bdc_return_names.R [B]
  • bdc_round_dec.R to bdc_suggest_names_taxadb.R [L]

bdc_country_from_coordinates() identifies correct countries but assigns them to incorrect rows

I am using bdc_country_from_coordinates() to find the appropriate countries for a long list of coordinates that do not have country information assigned to them. I found that the function does find the correct country for each coordinate, but in the process of outputting them it assigns the wrong country to the points.

Example that reproduces issue:

(point1 = Finland, point2 = Sweden, point3 = Norway, point4 = Estonia)

decimalLatitude = c(62.587273591263624, 66.62443625769812, 60.91266175537055, 59.166132649248496)
decimalLongitude = c(30.81351622904529, 21.219645421093123, 10.65224213789756, 25.883878594365648)
country = c("","","","")
testframe <- data.frame(decimalLatitude, decimalLongitude, country)

bdc_country_from_coordinates(testframe)

bdc_country_from_coordinates:
Country names were added to 4 records.

  decimalLatitude decimalLongitude country
1        62.58727         30.81352 Estonia
2        66.62444         21.21965  Norway
3        60.91266         10.65224  Sweden
4        59.16613         25.88388 Finland

The output order is: Estonia, Norway, Sweden & Finland. The correct output order for these points should be Finland, Sweden, Norway & Estonia.

some functions create Output dir

Apparently, some functions create an (empty) Output dir and there's no way to stop them.

bdc_clean_names seems fine with save_outputs = FALSE but bdc_query_names_taxadb seems to create the directories even when the export arg is false.

I didn't look at the code (yet) so I didn't pinpoint it. It would be nice if no method created any files or directories "without permission" (i.e., all of them had a parameter for that with a sensible default off).

Cheers!

Error on bdc_coordinates_empty: invalid multibyte string

Hello

I'm having trouble running the bdc_coordinates_empty command on a dataset with special characters (loaded with encoding = "Latin-1"). Apparently it is related to the call to dplyr::mutate_all(as.numeric) in the code of bdc_coordinates_empty. Below is an example of the problem, including my system settings.

I thought about skipping this step or making small changes to the bdc_coordinates_empty code, but I was wondering whether equivalent issues would arise later on, so it seemed better to make a change early in the workflow.

I would appreciate any help.

> data<-fread("data.csv",encoding="Latin-1")
> data<-data[33,c(3,4,6,10)]
> data
     scientificName latitude longitude                                              locality
1: Achirus lineatus -6785496 -34955091 Área De Proteção Ambiental Da Barra Do Rio Mamanguape
> data<-bdc_coordinates_empty(data = data,lat = "latitude",lon = "longitude")
Error in `mutate()`:
! Problem while computing `locality = .Primitive("as.double")(locality)`.
Caused by error in `mask$eval_all_mutate()`:
! invalid multibyte string at '<c1>rea D<65> Prote<e7><e3>o Ambiental Da Barra Do Rio Mamanguape'
Run `rlang::last_error()` to see where the error occurred.

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C               LC_TIME=pt_BR.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=pt_BR.UTF-8    LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rnaturalearthdata_0.2.0  rnaturalearth_0.1.0      lubridate_1.8.0          sf_1.0-8                
 [5] cowplot_1.1.1            remotes_2.4.2            forcats_0.5.1            stringr_1.4.0           
 [9] dplyr_1.0.9              purrr_0.3.4              readr_2.1.2              tidyr_1.2.0             
[13] tibble_3.1.7             ggplot2_3.3.6            tidyverse_1.3.2          vegan_2.6-2             
[17] lattice_0.20-45          permute_0.9-7            bdc_1.1.1                taxadb_0.1.5            
[21] data.table_1.14.2        CoordinateCleaner_2.0-20

loaded via a namespace (and not attached):
 [1] googledrive_2.0.0   colorspace_2.0-3    ellipsis_0.3.2      class_7.3-20        rgdal_1.5-32        rprojroot_2.0.3    
 [7] fs_1.5.2            rstudioapi_0.13     proxy_0.4-27        DT_0.23             fansi_1.0.3         rgnparser_0.2.0    
[13] xml2_1.3.3          codetools_0.2-18    splines_4.2.1       contentid_0.0.15    jsonlite_1.8.0      broom_1.0.0        
[19] cluster_2.1.3       dbplyr_2.2.1        rgeos_0.5-9         oai_0.3.2           compiler_4.2.1      httr_1.4.3         
[25] backports_1.4.1     assertthat_0.2.1    Matrix_1.4-1        fastmap_1.1.0       lazyeval_0.2.2      gargle_1.2.0       
[31] cli_3.3.0           htmltools_0.5.3     prettyunits_1.1.1   tools_4.2.1         gtable_0.3.0        glue_1.6.2         
[37] Rcpp_1.0.9          cellranger_1.1.0    raster_3.5-21       vctrs_0.4.1         nlme_3.1-158        conditionz_0.1.0   
[43] iterators_1.0.14    rvest_1.0.2         lifecycle_1.0.1     sys_3.4             googlesheets4_1.0.0 terra_1.5-34       
[49] MASS_7.3-58         scales_1.2.0        hms_1.1.1           qs_0.25.3           curl_4.3.2          geosphere_1.5-14   
[55] stringi_1.7.8       foreach_1.5.2       e1071_1.7-11        rgbif_3.7.2         rlang_1.0.4         pkgconfig_2.0.3    
[61] htmlwidgets_1.5.4   tidyselect_1.1.2    here_1.0.1          plyr_1.8.7          magrittr_2.0.3      R6_2.5.1           
[67] generics_0.1.3      DBI_1.1.3           arkdb_0.0.15        pillar_1.8.0        haven_2.5.0         whisker_0.4        
[73] withr_2.5.0         mgcv_1.8-40         units_0.8-0         sp_1.5-0            modelr_0.1.8        crayon_1.5.1       
[79] uuid_1.1-0          KernSmooth_2.23-20  utf8_1.2.2          RApiSerialize_0.1.0 tzdb_0.3.0          progress_1.2.2     
[85] grid_4.2.1          readxl_1.4.0        reprex_2.0.1        digest_0.6.29       classInt_0.4-7      openssl_2.0.2      
[91] RcppParallel_5.1.5  munsell_0.5.0       stringfish_0.15.7   askpass_1.1    

taxa db versioning issues within bdc_query_names_taxadb

The `bdc_query_names_taxadb()` function hardcodes database versions here:

# Currently available databases and versions
switch(
  EXPR = db,
  itis = {
    db_version <- 2022
  },
  ncbi = {
    db_version <- 2022
  },
  col = {
    db_version <- 2022
  },
  gbif = {
    db_version <- 2022
  },
  iucn = {
    db_version <- 2019 ### TODO: Change to 2022
  },
  ott = {
    db_version <- 2021
  },
  fb = {
    db_version <- 2019
  },
  tpl = {
    db_version <- 2019
  }
)

However, downstream this function calls taxadb::taxa_tbl(), which has the following default arguments:

taxa_tbl(
  provider = getOption("taxadb_default_provider", "itis"),
  schema = c("dwc", "common"),
  version = latest_version(),
  db = td_connect()
)

When latest_version() != db_version, bdc_query_names_taxadb() reports that none of the species names were found in the DB. This can be illustrated with the iucn database (hardcoded to 2019, although the 2022 version is available).

test <- bdc::bdc_query_names_taxadb(
  sci_name = "Sus scrofa",
  db = "iucn"
)

test

 data.frame(test)
  original_search suggested_name distance    notes taxonID
1      Sus scrofa             NA       NA notFound    <NA>
  scientificName taxonRank acceptedNameUsageID taxonomicStatus
1           <NA>      <NA>                <NA>            <NA>
  kingdom phylum class order family genus specificEpithet
1    <NA>   <NA>  <NA>  <NA>   <NA>  <NA>            <NA>
  vernacularName infraspecificEpithet
1           <NA>                 <NA>

I'm working off the GitHub installation of bdc

Folder organization

Hi @Geiziane @kguidonimartins @lucas-jardim @sjevelazco,

We could organize the workflow outputs as follows. What do you think?

Output
Intermediate: (to save intermediate files)
Check: (to save files that need to/can be checked by users)
Report: (to save the report of each step)

We can prefix each name with the number of the corresponding step:
00 = merged_databases
01 = pre-filter
02 = taxonomy
03 = space
04 = temporal

For example,

01_report.csv (report of the pre-filter step)
02_unresolved_names.csv (names not found in the taxonomy step)
01_correct_xy.csv (xy corrected in the pre-filter step)
03_report (report of the number of records flagged in the coordinate-checking step).

Changes to the README

  1. Improve in category (line 13): "if additional, the field contains information that can be provided by users." In what way would this additional information be important?

  2. The Type column seems unnecessary: the tibble will automatically identify whether a column is character or numeric.

  3. Fix the column order of the database (DatabaseInfo.csv) to match the order in the README.

  4. In occurrenceID, make the unique identifier explicit. Example: in GBIF there are the fields Key, taxonID, taxonKey, identificationKey, and catalogKey. Which one should the user provide?

  5. In event_date, explain what an event is.

  6. Specify in the general explanation how to proceed when a column is not filled in. Example: if I do not fill in event_date, an NA should be used.

  7. In identified_by, change the example and show what a concatenated input would look like.

  8. From the coordinate_uncertainty_in_meters field onward (the last three fields), it is not explicit whether the fields are inputs defined directly by the user or refer to one of the columns of the databases. I changed the explanation to refer to the database columns.

  9. In the description of the event_date field (line 33), state the date format, if it is standardized.

FAQ

This is a pinned issue listing the most common errors reported in previous issues.

1. Error in utils::download.file... after running bdc_clean_names()

Complete error message below:

Error in utils::download.file(grep(os, urls, value = TRUE), file, mode = "wb") : 
  lengths of 'url' and 'destfile' must match

On Windows systems, you can resolve this by running options(download.file.method = "wininet") before calling bdc_clean_names().

For example:

options(download.file.method = "wininet")
my_clean_names <- bdc_clean_names(my_species_names)

See: #230 (comment)

2. Error: rapi_startup: Failed to open database... after running bdc_query_names_taxadb()

Complete error message below:

Error: rapi_startup: Failed to open database: IO Error: Trying to read a database file with version number 31, but we can only read version 33.
The database file was created with an older version of DuckDB.

The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
The storage will be stabilized when version 1.0 releases.

For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

You can solve this by deleting the previous database files with fs::dir_delete(taxadb:::taxadb_dir()) and running your query again.

For example:

fs::dir_delete(taxadb:::taxadb_dir())
my_query_names <- bdc_query_names_taxadb(sci_name = my_species_names, db = "itis", suggestion_distance = 0.9)

See: #233 (comment)

Function documentation

Example documentation for standardizing the functions (suggestions always welcome!). See the example in bdc_flag_missing_names.R.

#' Title: no final period [The word "Title" itself is not needed]
#' Description: with a final period. [The word "Description" itself is not needed]
#'
#' Fields: [argument][type.] [Description...] [Default = "", if any] [Final period]
#' @param data data.frame. Containing... [Final period]
#' @param sci_names character string. The column with... Default = "name" [Final period].
#' @param lat numeric. The column with... Default = "name" [Final period].
#' @param lon numeric. The column with... Default = "name" [Final period].
#'
#' @details This test flags... [Final period]
#' @return A data.frame with...
#' @examples
#' \dontrun{
#' x <- data.frame(scientificName = c("Ocotea odorifera", NA, "Panthera onca", ""))
#' bdc_flag_missing_names(x)
#' }

The function review was divided as follows:

Karlo

  • bdc_standardize_datasets
  • bdc_creates_dir
  • bdc_missing_xy
  • bdc_missing_names
  • bdc_invalid_xy
  • bdc_invald_basis_of_records
  • bdc_xy_out_country (not found)
  • bdc_country_xy_mismatch

Santiago

  • bdc_transposed_xy
  • bdc_get_wiki_country
  • bdc_standardize_country
  • bdc_correct_coordinates
  • bdc_coord_trans
  • bdc_xy_precision

Geizi

  • bdc_summary_col
  • bdc_create_report
  • bdc_xy_from_locality
  • bdc_create_figures
  • bdc_filter_out_flags (improve the function's example)
  • bdc_query_names_taxadb

Lucas

  • bdc_query_names_taxadb
  • bdc_filter_out_names
  • bdc_clean_names
  • bdc_suggest_names_taxadb
  • bdc_clean_duplicates
  • bdc_parse_date

Loop 0 is not valid: Edge 1028 has duplicate vertex with edge 1033

I am trying to run step 5 (Getting country names from valid coordinates) of the Pre-filter and received the following error message:

Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, :
Loop 0 is not valid: Edge 1028 has duplicate vertex with edge 1033

Any help is welcome :-)

Thanks!
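For what it's worth, a commonly reported workaround for this family of s2 polygon-validity errors (an assumption on my part, not a confirmed bdc fix) is to switch sf from the spherical s2 engine back to planar GEOS before the country extraction:

```r
# Disable the s2 spherical geometry engine (workaround suggestion); sf then
# falls back to planar GEOS operations, which usually tolerate polygons
# that s2 rejects with "Loop ... is not valid" errors.
library(sf)
sf_use_s2(FALSE)

# ... re-run step 5 (Getting country names from valid coordinates) here ...

sf_use_s2(TRUE)  # re-enable afterwards if desired
```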

Please remove dependencies on **rgdal**, **rgeos**, and/or **maptools**

You will be aware, for example from:
https://r-spatial.org/r/2022/04/12/evolution.html,
https://r-spatial.org/r/2022/12/14/evolution2.html,
https://r-spatial.org/r/2023/04/10/evolution3.html and
https://rsbivand.github.io/csds_jan23/bivand_csds_ssg_230117.pdf and
perhaps view https://www.youtube.com/watch?v=TlpjIqTPMCA&list=PLzREt6r1NenmWEidssmLm-VO_YmAh4pq9&index=1
that rgdal, rgeos and maptools will be retired this year, in October 2023.

This package fails CMD check under sp evolution status 2 (substituting sf for rgdal for projection/transformation/CRS, with the retiring packages absent from the library path) because of its strong dependence on CoordinateCleaner. There has been no response yet to the issue raised there last year: ropensci/CoordinateCleaner#78. Please help the maintainer of that package to protect your package and workflows.

Testing `{rgnparser}`

Folks, here is how {rgnparser} works.

if (!require("tidyverse")) install.packages("tidyverse")
#> Loading required package: tidyverse
if (!require("vroom")) install.packages("vroom")
#> Loading required package: vroom
if (!require("remotes")) install.packages("remotes")
#> Loading required package: remotes
if (!require("rgnparser")) remotes::install_github("ropensci/rgnparser")
#> Loading required package: rgnparser

# only once
# rgnparser::install_gnparser()

temp <- tempfile(fileext = ".xz")

download.file(
  url = "https://github.com/brunobrr/risk_assessment_flora_Brazil_I/raw/master/data/temp/standard_database.xz",
  destfile = temp
)

risk_flora <- vroom::vroom(temp)
#> Rows: 100,000
#> Columns: 17
#> Delimiter: "\t"
#> chr [14]: database_id, occurrenceID, scientificName, eventDate, family, country, stateProv...
#> dbl [ 2]: decimalLatitude, decimalLongitude
#> lgl [ 1]: coordinateUncertaintyInMeters
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message

spp_test <- c("Quadrella steyermarkii (Standl.) Iltis & Cornejo", 
              "Parus major Linnaeus, 1788", 
              "Helianthus annuus var. texanus")

rgnparser::gn_parse_tidy(spp_test)
#> # A tibble: 3 x 9
#>   id    verbatim cardinality canonicalfull canonicalsimple canonicalstem
#>   <chr> <chr>          <dbl> <chr>         <chr>           <chr>        
#> 1 fbd1… Quadrel…           2 Quadrella st… Quadrella stey… Quadrella st…
#> 2 e4e1… Parus m…           2 Parus major   Parus major     Parus maior  
#> 3 e571… Heliant…           3 Helianthus a… Helianthus ann… Helianthus a…
#> # … with 3 more variables: authorship <chr>, year <dbl>, quality <dbl>

raw data

flora_spp_test <-
  risk_flora %>%
  filter(str_detect(scientificName, "cf"))

flora_spp_test
#> # A tibble: 177 x 17
#>    database_id occurrenceID scientificName decimalLatitude decimalLongitude
#>    <chr>       <chr>        <chr>                    <dbl>            <dbl>
#>  1 DRYFLOR_75… <NA>         Annona montan…           -17.9            -57.5
#>  2 GBIF_4211   416977058    Macfadyena de…            NA               NA  
#>  3 GBIF_4756   2644264743   Annona montan…           -24.0            -46.3
#>  4 GBIF_5804   1258572768   Macfadyena un…            NA               NA  
#>  5 GBIF_9070   1324590446   Annona montan…            NA               NA  
#>  6 GBIF_9174   1258242816   Macfadyena un…            NA               NA  
#>  7 ICMBIO_60   RB00171850   LEGUMINOSAE C…           -14.4            -48.4
#>  8 ICMBIO_239  RB00425911   POLYGALACEAE …           -17.2            -51.8
#>  9 ICMBIO_329  RB01112673   MELIACEAE cf.…           -22.3            -42.7
#> 10 ICMBIO_380  RB00241798   MELASTOMATACE…           -10.4            -41.3
#> # … with 167 more rows, and 12 more variables: eventDate <chr>, family <chr>,
#> #   country <chr>, stateProvince <chr>, county <chr>, locality <chr>,
#> #   coordinatePrecision <chr>, basisOfRecord <chr>, taxonRank <chr>,
#> #   identifiedBy <chr>, coordinateUncertaintyInMeters <lgl>, recordedBy <chr>

flora_spp_parsed <-
  flora_spp_test %>%
  pull(scientificName) %>%
  rgnparser::gn_parse_tidy()

flora_spp_parsed
#> # A tibble: 177 x 9
#>    id    verbatim cardinality canonicalfull canonicalsimple canonicalstem
#>    <chr> <chr>          <dbl> <chr>         <chr>           <chr>        
#>  1 81db… Annona …           2 Annona monta… Annona montana  Annona montan
#>  2 0494… Annona …           2 Annona monta… Annona montana  Annona montan
#>  3 e294… Macfady…           2 Macfadyena d… Macfadyena den… Macfadyena d…
#>  4 c0c1… Macfady…           2 Macfadyena u… Macfadyena ung… Macfadyena u…
#>  5 fa2d… POLYGAL…           0 <NA>          <NA>            <NA>         
#>  6 1dfb… MELIACE…           0 <NA>          <NA>            <NA>         
#>  7 e5ee… LEGUMIN…           0 <NA>          <NA>            <NA>         
#>  8 76cb… Macfady…           2 Macfadyena u… Macfadyena unc… Macfadyena u…
#>  9 0494… Annona …           2 Annona monta… Annona montana  Annona montan
#> 10 7d10… MELASTO…           0 <NA>          <NA>            <NA>         
#> # … with 167 more rows, and 3 more variables: authorship <chr>, year <lgl>,
#> #   quality <dbl>

full_join(flora_spp_test, flora_spp_parsed, by = c("scientificName" = "verbatim"), keep = TRUE) %>%
  select(scientificName, any_of(names(flora_spp_parsed)), -id)
#> # A tibble: 187 x 9
#>    scientificName verbatim cardinality canonicalfull canonicalsimple
#>    <chr>          <chr>          <dbl> <chr>         <chr>          
#>  1 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  2 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  3 Macfadyena de… Macfady…           2 Macfadyena d… Macfadyena den…
#>  4 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  5 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  6 Macfadyena un… Macfady…           2 Macfadyena u… Macfadyena ung…
#>  7 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  8 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  9 Macfadyena un… Macfady…           2 Macfadyena u… Macfadyena unc…
#> 10 LEGUMINOSAE C… LEGUMIN…           0 <NA>          <NA>           
#> # … with 177 more rows, and 4 more variables: canonicalstem <chr>,
#> #   authorship <chr>, year <lgl>, quality <dbl>

preprocessed data

flora_spp_test_pre <-
  risk_flora %>%
  filter(str_detect(scientificName, "cf")) %>%
  mutate(scientificName = str_remove(scientificName, "\\b[A-Z]+\\s+"))

flora_spp_test_pre
#> # A tibble: 177 x 17
#>    database_id occurrenceID scientificName decimalLatitude decimalLongitude
#>    <chr>       <chr>        <chr>                    <dbl>            <dbl>
#>  1 DRYFLOR_75… <NA>         Annona montan…           -17.9            -57.5
#>  2 GBIF_4211   416977058    Macfadyena de…            NA               NA  
#>  3 GBIF_4756   2644264743   Annona montan…           -24.0            -46.3
#>  4 GBIF_5804   1258572768   Macfadyena un…            NA               NA  
#>  5 GBIF_9070   1324590446   Annona montan…            NA               NA  
#>  6 GBIF_9174   1258242816   Macfadyena un…            NA               NA  
#>  7 ICMBIO_60   RB00171850   Clitoria cf. …           -14.4            -48.4
#>  8 ICMBIO_239  RB00425911   Polygala cf. …           -17.2            -51.8
#>  9 ICMBIO_329  RB01112673   cf. Trichilia            -22.3            -42.7
#> 10 ICMBIO_380  RB00241798   Tibouchina cf…           -10.4            -41.3
#> # … with 167 more rows, and 12 more variables: eventDate <chr>, family <chr>,
#> #   country <chr>, stateProvince <chr>, county <chr>, locality <chr>,
#> #   coordinatePrecision <chr>, basisOfRecord <chr>, taxonRank <chr>,
#> #   identifiedBy <chr>, coordinateUncertaintyInMeters <lgl>, recordedBy <chr>

flora_spp_parsed_pre <-
  flora_spp_test_pre %>%
  pull(scientificName) %>%
  rgnparser::gn_parse_tidy()

flora_spp_parsed_pre
#> # A tibble: 177 x 9
#>    id    verbatim cardinality canonicalfull canonicalsimple canonicalstem
#>    <chr> <chr>          <dbl> <chr>         <chr>           <chr>        
#>  1 81db… Annona …           2 Annona monta… Annona montana  Annona montan
#>  2 0494… Annona …           2 Annona monta… Annona montana  Annona montan
#>  3 c0c1… Macfady…           2 Macfadyena u… Macfadyena ung… Macfadyena u…
#>  4 e294… Macfady…           2 Macfadyena d… Macfadyena den… Macfadyena d…
#>  5 76cb… Macfady…           2 Macfadyena u… Macfadyena unc… Macfadyena u…
#>  6 2434… cf. Tri…           0 <NA>          <NA>            <NA>         
#>  7 0494… Annona …           2 Annona monta… Annona montana  Annona montan
#>  8 5adb… Tibouch…           2 Tibouchina v… Tibouchina vel… Tibouchina u…
#>  9 e39d… Polygal…           2 Polygala pse… Polygala pseud… Polygala pse…
#> 10 3ab7… Ormosia…           2 Ormosia fast… Ormosia fastig… Ormosia fast…
#> # … with 167 more rows, and 3 more variables: authorship <chr>, year <lgl>,
#> #   quality <dbl>

full_join(flora_spp_test_pre, flora_spp_parsed_pre, by = c("scientificName" = "verbatim"), keep = TRUE) %>%
  select(scientificName, any_of(names(flora_spp_parsed_pre)), -id)
#> # A tibble: 187 x 9
#>    scientificName verbatim cardinality canonicalfull canonicalsimple
#>    <chr>          <chr>          <dbl> <chr>         <chr>          
#>  1 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  2 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  3 Macfadyena de… Macfady…           2 Macfadyena d… Macfadyena den…
#>  4 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  5 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  6 Macfadyena un… Macfady…           2 Macfadyena u… Macfadyena ung…
#>  7 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  8 Annona montan… Annona …           2 Annona monta… Annona montana 
#>  9 Macfadyena un… Macfady…           2 Macfadyena u… Macfadyena unc…
#> 10 Clitoria cf. … Clitori…           2 Clitoria gui… Clitoria guian…
#> # … with 177 more rows, and 4 more variables: canonicalstem <chr>,
#> #   authorship <chr>, year <lgl>, quality <dbl>

Created on 2020-10-30 by the reprex package (v0.3.0)
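The pre-processing step above hinges on a single regex, str_remove(scientificName, "\\b[A-Z]+\\s+"), which strips an all-caps family prefix before parsing. The same idea in base R (illustrative only):

```r
# Strip a leading all-caps token (e.g. a family name such as "MELIACEAE ")
# so gnparser can find the genus; names without such a prefix are untouched.
names_raw <- c("MELIACEAE cf. Trichilia", "Annona montana Macfad.")
sub("\\b[A-Z]+\\s+", "", names_raw, perl = TRUE)
```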

Transposed coordinates

@sjevelazco, the bdc_transposed_coordinates function has a distance argument, so only points up to that distance from the country are corrected; points beyond it are left unchanged. Was this distance chosen for some particular purpose?

Missing file

The following file cannot be read:

here::here("data", "temp", "standard_database.xz") %>%

I tried to find in which script and on which line an .xz file is saved, but found no mention of it. There are only .qs files in the first script.

In addition, fix #105, since it conflicts with this script with respect to the order of steps.

'Check_5' in '01_Prefilter'

data_pf5 <-

Error:

Error in countrycode::countrycode(cntr_db$cntr_suggested, origin = "country.name.en",: sourcevar must be a character or numeric vector. This error often arises when users pass a tibble (e.g., from dplyr) instead of a column vector from a data.frame (i.e., my_tbl[, 2] vs. my_df[, 2] vs. my_tbl[[2]])

The error points specifically to this call:

cntr_db$cntr_iso2c <-

The input and the databases used are available at:

https://github.com/zanderaugusto/bdc/tree/master/Config
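The countrycode error above usually boils down to how the column is extracted. A base-R illustration (hypothetical data; note that a tibble's `[` behaves like `drop = FALSE` here, which is exactly what triggers the error):

```r
# countrycode::countrycode() wants a plain vector, not a one-column table.
cntr_db <- data.frame(cntr_suggested = c("Brazil", "Bolivia"))

cntr_db[, "cntr_suggested", drop = FALSE]  # still a data.frame (as a tibble would be)
cntr_db[["cntr_suggested"]]                # a character vector: what countrycode expects
```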

Pre-filters

@kguidonimartins, these are the filters we use in this step. I added the old scripts as a starting point, but feel free to modify them and create new ones if necessary.

TODO

  • 1. Use the function @sjevelazco created to correct swapped xy
  • 2. Remove records with invalid coordinates
  • 3. Remove records missing lat or long
  • 4. Records without a name or with an empty name
  • 5. Records outside the selected country
  • 6. Remove fossil records or records of dubious origin
  • 7. Save a table of names lacking coordinates (x or y) but containing locality information

Description

  1. Use the function @sjevelazco created to correct swapped xy: this function directly replaces the swapped xy with the corrected values in the input table and returns a table with the original and corrected xy. That table can be saved to Output/Check/01_correct_xy.csv (see issue #32 for more details on folder organization).

The following filters can be reported as T or F in a results table. For example, something like...

res$correct_xy <- correct_xy 
res$rem_fossil_records <- rem_fossil_records 

I suggest this so that the entire database is always assessed by every one of the filters.
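A toy version of that flag-table idea (hypothetical column names; the real bdc functions add richer columns):

```r
# Each test writes a TRUE/FALSE column; no record is dropped, so every
# filter sees the full database, and a final .summary combines the flags.
occ <- data.frame(
  latitude  = c(-10.5,  95.0,    NA),
  longitude = c(-48.2, -46.3, -50.1)
)

res <- data.frame(
  .coordinates_empty      = !is.na(occ$latitude) & !is.na(occ$longitude),
  .coordinates_outOfRange = !is.na(occ$latitude) & !is.na(occ$longitude) &
    abs(occ$latitude) <= 90 & abs(occ$longitude) <= 180
)
res$.summary <- res$.coordinates_empty & res$.coordinates_outOfRange
```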

  2. Remove records with invalid coordinates:

valid_coords <- data_01 %>%
  cc_val(
    lon = "longitude",
    lat = "latitude",
    verbose = TRUE,
    value = "flagged"
  )
  3. Remove records missing lat or long

  4. Records without a name or with an empty name

  5. Records outside the selected country

It might be better to wrap the script below in a function so that the user can supply x, y, the country name, and the buffer distance:

m <- rworldmap::getMap() # rworldmap
brazil <- m[which(m$NAME == "Brazil"), ]
brazil@data <- brazil@data %>% dplyr::select(ADMIN)
brazil@data$fill <- 1

OCC <- data_01 %>% dplyr::select(longitude, latitude)
selec_points <- raster::extract(brazil, OCC, buffer = 2000)
data_01$.inBrazil <- selec_points$fill
  6. Remove fossil records or records of dubious origin

data_01 <- data_01 %>%
  mutate(.provenance = if_else(basisOfRecord %in%
                                 c("Amost", "DrawingOrPhotograph",
                                   "Dupli", "EX", "E", "Extra",
                                   "F", "FOSSIL_SPECIMEN",
                                   "FossilSpecimen", "HS", "HUCP",
                                   "MACHINE_OBSERVATION",
                                   "MachineObservation", "MultimediaObject",
                                   "QQQQQ", "REPET", "RON", "V",
                                   "X", "XS"), FALSE, TRUE))
  7. Save a table of names lacking coordinates (x or y) but containing locality information (e.g. Outputs/Check/01_xy_from_locality.csv)

[ERROR]: Column names defined in the metadata do not match column names in the file

I'm having an issue with the bdc_standardize_datasets() function. Of 14 datasets, just one isn't reading properly; I get the error mentioned in the title. I've checked that I have the correct file path and compared the two column-name lists multiple times, but they are a complete match. I don't suppose there's a way for the error message to be slightly more specific?

For reference, below are the column names with their corresponding Darwin Core variables. The first row is what I have in the metadata file and the second is the dataset.

datasetName,fileName,bibliographicCitation,Identification,scientificName,fieldNumber,verbatimEventDate,verbatimLocality,verbatimDepth,verbatimLatitude,verbatimLongitude,decimalLatitude,decimalLongitude

Data set,CRYPTIC CLADE,organism,field number,Date Collected ,location identifier,depth (m),Lat reported,Long reported,Dec lat,Dec long


suggested name not found

Hello. Thank you so much for helping me get the bdc_clean_names function to run. I now have another question. After running it and then merging the names as shown in the Taxonomy vignette I then ran:
query_names <- bdc_query_names_taxadb(
  sci_name = CaliforniaBeeCoordinates_Clean$names_clean,
  replace_synonyms = TRUE,    # replace synonyms by accepted names?
  suggest_names = TRUE,       # try to find a candidate name for misspelled names?
  suggestion_distance = 0.9,  # distance between the searched and suggested names
  db = "gbif",                # taxonomic database
  parallel = FALSE,           # should parallel processing be used?
  ncores = 2,                 # number of cores used for parallel processing
  export_accepted = FALSE     # save names linked to multiple accepted names
)

This function also ran perfectly, but I have a question about the output. After merging the results, I noticed that some bee species with spelling errors were not fixed. For example, my original data had an entry labelled "Prostomia rubiflorus". The correct spelling is "Protosmia rubifloris", but after harmonizing the names and merging the results, the notes column said a correct spelling was not found, even though the name exists in GBIF: https://www.gbif.org/species/1334783
Do you know why this might happen? Thank you.
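One plausible explanation (an assumption about the metric; bdc/taxadb may compute name similarity differently) is that the misspelling is just far enough from the correct name that a Levenshtein-style similarity falls below the suggestion_distance = 0.9 cutoff:

```r
# Edit distance between the misspelled and the correct name (illustrative;
# bdc's internal suggestion metric may differ from this simple ratio)
a <- "Prostomia rubiflorus"
b <- "Protosmia rubifloris"
d <- utils::adist(a, b)                          # generalized Levenshtein distance
sim <- 1 - as.numeric(d) / max(nchar(a), nchar(b))
sim  # below 0.9, so no suggestion is returned under this cutoff
```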

Improve the runtime of `bdc_query_names_taxadb`

Hey it's me again ;)

We use bdc a lot, and at the moment I'm investigating why our tests take so long. I found that bdc_query_names_taxadb takes 15-16 seconds on my (decent) machine when it yields no result; successful queries are much faster.

Example:

bdc::bdc_query_names_taxadb(sci_name='Triops granitica', rank_name="Animalia", rank="kingdom")

I haven't drilled down into what actually happens there (yet). I guess that an exhaustive search is just much slower and that the code exits early on a hit? Is there some way to speed things up in the no-result case?

I also noticed that there's an unused parameter, parallel, defaulting to FALSE (I was hoping I could just flip it and reap crazy speed-ups).

Injecting a mock in my tests might be a good idea anyway, but since I'm retrofitting the tests, an (obvious) way to speed things up would come in handy.

bdc_clean_names error

I am trying to run the bdc_clean_names() function and I get the following error:

bdc_clean_names(spatialnetdf$genus_species, save_outputs = FALSE)
Error in utils::download.file(grep(os, urls, value = TRUE), file, mode = "wb") : 
  lengths of 'url' and 'destfile' must match

Any idea how to resolve this? (R version 4.2.2 and bdc 1.1.2)

bdc_clean_names

Hi folks,
I've been trying to use bdc_clean_names and I am getting the following error message:

head(splist$species_original)
[1] "Erithacus rubecula" "Regulus regulus"    "Turdus viscivorus"  "Parus major"        "Parus cristatus"   
[6] "Dendrocopos major" 
clean_names <- bdc_clean_names(splist$species_original)
The latest gnparser version is v1.6.5
Error in utils::download.file(grep(os, urls, value = TRUE), file, mode = "wb") : 
  lengths of 'url' and 'destfile' must match

I appreciate any guidance.
