globalnamesarchitecture / dwca_hunter

Downloads biodiversity resources from the internet and converts them to DarwinCore Archive files

License: MIT License

Ruby 100.00%

dwca_hunter's People

pdevries gaurav mdoering gdower gerverska

dwca_hunter's Issues

As a User I want the Mammals of the World data to be cleaner

There are HTML entities instead of characters, and long English annotations instead of synonyms.

(itis) Database artifacts should be ignored

Database artifacts (any taxon name with the word 'artifact' in the unacceptability_reason field) are errors in ITIS, and I think they should be ignored when creating the DwC-A file.
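A minimal sketch of the filter described above, assuming each ITIS row has already been parsed into a hash. Only the unacceptability_reason field name comes from the issue; the row layout, the method names and the sample names are hypothetical and are not taken from dwca_hunter.

```ruby
# Hypothetical row layout: only :unacceptability_reason is named in the issue.
# A row counts as a database artifact when that field mentions "artifact".
def database_artifact?(row)
  row[:unacceptability_reason].to_s.downcase.include?("artifact")
end

# Drop artifact rows before they reach the DwC-A writer.
def reject_artifacts(rows)
  rows.reject { |row| database_artifact?(row) }
end

# Made-up sample rows for illustration only.
rows = [
  { tsn: 1, name: "Aus bus", unacceptability_reason: nil },
  { tsn: 2, name: "Aus cus", unacceptability_reason: "database artifact" },
  { tsn: 3, name: "Aus dus", unacceptability_reason: "junior synonym" }
]

puts reject_artifacts(rows).map { |row| row[:tsn] }.inspect
```

The match is case-insensitive and treats a missing reason as "not an artifact", which seems the safer default when the field is empty.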

suggest to index Plazi's treatment bank

hi @dimus et al.

Are you still maintaining globalnames.org ?

If so, I was hoping you can consider the following:

Plazi https://plazi.org keeps an extensive list of taxonomic literature and associated taxonomic names.

Plazi exports these taxonomic literature <> name links as DwC-A and registers them with GBIF.

You can find their publications at https://www.gbif.org/occurrence/search?dataset_key=6384b520-7e9f-4874-a414-76c2e9b01d74&type_status=TYPE .

As a user (GloBI), I would like to be able to use the Global Names resolvers to find taxonomic treatments in Plazi.

The taxonomic treatments can be located by linking the TaxonId fields that are available in the Plazi publications.

For example, when I look up Rhinolophus denti, I expect to find a Plazi name with id 885887A2FFC88A21F8B1FA48FB92DD65.taxon (see also https://www.gbif.org/occurrence/2597533915). This identifier can then be translated into a link to the related taxonomic treatment via http://treatment.plazi.org/id/885887A2FFC88A21F8B1FA48FB92DD65 .
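The translation from a Plazi taxon id to a treatment link could be sketched as follows. The URL prefix and the ".taxon" suffix are taken from the example in this issue; the method and constant names are hypothetical, and the rule is inferred from that single example rather than from Plazi documentation.

```ruby
# Base URL as given in the issue text.
PLAZI_TREATMENT_BASE = "http://treatment.plazi.org/id/"

# Strip the trailing ".taxon" (if present) and prepend the treatment base URL.
def plazi_treatment_url(taxon_id)
  treatment_id = taxon_id.sub(/\.taxon\z/, "")
  "#{PLAZI_TREATMENT_BASE}#{treatment_id}"
end

puts plazi_treatment_url("885887A2FFC88A21F8B1FA48FB92DD65.taxon")
```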

Import IOC World Bird List

@ccicero, @dustymc, I have started new imports and will notify you on every ticket, so that it is easier for you to track their status.
IOC World Bird List - https://www.worldbirdnames.org/ioc-lists/master-list-2/

Add MycoBank data-source

Requested at gnames/gnverifier#99

https://www.mycobank.org/images/MBList.zip

Import AOS checklist

FYI @dustymc, @ccicero
AOS Checklist - http://checklist.americanornithology.org/taxa/

Update EOL

Import Clements Checklist

FYI @dustymc @ccicero
Clements Checklist - https://www.birds.cornell.edu/clementschecklist/download/

import Arctos from 2018

convert IPNI names data to DWCA

As a User I want to see the ASM Mammalian Diversity Database appear as a data source

As a User I want to see the ASM Mammalian Diversity Database appear as a data source in https://resolver.globalnames.org and related services .

It appears that the https://mammaldiversity.org resource is being used among mammal researchers.

I was able to download the attached dump mammal.json.gz using curl 'https://mammaldiversity.org/species-account/api.php?q=*' -H 'User-Agent:' | gzip > mammal.json.gz .

Note that the User-Agent was somehow needed.

See globalbioticinteractions/globalbioticinteractions#446 .

FYI @tigerhawkvok @n8upham

mammal.json.gz

Update Arctos

There is a new version of Arctos available; we need to harvest it and give @dustymc feedback. See
ArctosDB/arctos#3205

(itis) A single TSN may have more than one accepted_tsn

In the synonym_links file, a single TSN might be linked to multiple TSNs (see TSN 103337 for an example). It's not clear what the best DwC-A representation of this might be.
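Before deciding on a DwC-A representation, it may help to list the affected TSNs. The sketch below assumes the synonym_links rows have been read into [tsn, accepted_tsn] pairs; the method name is hypothetical and the accepted TSN values in the sample are invented for illustration (only TSN 103337 comes from the issue).

```ruby
# links: array of [tsn, accepted_tsn] pairs (assumed layout of synonym_links).
# Returns a hash of tsn => accepted TSNs, keeping only TSNs with more than one.
def ambiguous_synonyms(links)
  links
    .group_by { |tsn, _accepted| tsn }
    .transform_values { |pairs| pairs.map { |_tsn, accepted| accepted }.uniq }
    .select { |_tsn, accepted| accepted.size > 1 }
end

# Invented accepted TSNs; real values would come from the ITIS dump.
links = [
  [103337, 900001],
  [103337, 900002],
  [555555, 900003]
]

puts ambiguous_synonyms(links).inspect
```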

taxon ids of unknown provenance in Mammal Species of the World v3

When inspecting the JSON results for matches of Enhydra lutris against Mammal Species of the World (data source id 174), taxon_id 28576 and internal_id 14001090 are found (see below, where the latter appears as local_id). However, internal_id 14001090 appears to be the identifier that Mammal Species of the World exposes to link to its taxon pages (e.g., https://www.departments.bucknell.edu/biology/resources/msw3/browse.asp?id=14001090). For some reason, taxon_id 28576 cannot be found in any of the Mammal Species of the World data products.

This suggests that the (current) internal_id values are suitable for use as taxon ids (including in the path ids hierarchy).


data_source_id	174
data_source_title	"The Mammal Species of The World"
gni_uuid	"3096feea-1216-5f59-ab70-fcff3492cef6"
name_string	"Enhydra lutris Linnaeus 1758"
canonical_form	"Enhydra lutris"
classification_path	"Mammalia|Carnivora|Caniformia|Mustelidae|Lutrinae|Enhydra|Enhydra lutris"
classification_path_ranks	"class|order|suborder|family|subfamily|genus|species"
classification_path_ids	"1|25367|27246|28538|28539|28575|28576"
taxon_id	"28576"
local_id	"14001090"
edit_distance	0
imported_at	"2018-08-04T20:50:18Z"
match_type	2
match_value	"Exact match by canonical form"
prescore	"3|0|0"
score	0.988
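To see how the taxon_id relates to the classification path, the three pipe-delimited fields of the record above can be zipped together. This is a sketch only: the hash mirrors the fields shown in the issue, the fields are assumed to be delimited by plain pipes, and the method name is hypothetical.

```ruby
# Split the three parallel pipe-delimited fields and combine them per level.
def classification_nodes(record)
  names = record.fetch("classification_path").split("|")
  ranks = record.fetch("classification_path_ranks").split("|")
  ids   = record.fetch("classification_path_ids").split("|")
  names.zip(ranks, ids).map { |name, rank, id| { name: name, rank: rank, id: id } }
end

# Field values copied from the record quoted in this issue.
record = {
  "classification_path"       => "Mammalia|Carnivora|Caniformia|Mustelidae|Lutrinae|Enhydra|Enhydra lutris",
  "classification_path_ranks" => "class|order|suborder|family|subfamily|genus|species",
  "classification_path_ids"   => "1|25367|27246|28538|28539|28575|28576",
  "taxon_id"                  => "28576",
  "local_id"                  => "14001090"
}

leaf = classification_nodes(record).last
puts leaf.inspect
puts leaf[:id] == record["taxon_id"]
```

In this record the last path id equals the taxon_id, so switching taxon ids to the source's own identifiers would presumably also require rewriting classification_path_ids.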

Update ION

(resource_itis.rb) get_ranks should use both rank_id and kingdom_id

At the moment, get_ranks uses the rank_id to store rank names. However, ITIS uses duplicate rank_ids to identify differently named ranks between kingdoms (e.g. "phylum" has kingdom_id=1, rank_id=30, while "division" has kingdom_id=3, rank_id=30). Using only rank_ids therefore forces everybody onto the ranks defined for kingdom Chromista (kingdom_id=6).

This should be a straightforward fix: change every piece of code that uses @Ranks to use both rank_id and kingdom_id when looking up the appropriate term. @Ranks is only used on three lines.
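One way to express the proposed lookup is a hash keyed by the [kingdom_id, rank_id] pair. This is a sketch under assumptions, not the code in resource_itis.rb: the method names and the row layout are hypothetical, while the two sample rows use the values quoted in this issue.

```ruby
# rows: array of [kingdom_id, rank_id, rank_name] (assumed layout).
# Keying on the pair keeps ranks from different kingdoms from overwriting each other.
def build_rank_lookup(rows)
  rows.each_with_object({}) do |(kingdom_id, rank_id, rank_name), lookup|
    lookup[[kingdom_id, rank_id]] = rank_name.downcase
  end
end

def rank_name(lookup, kingdom_id, rank_id)
  lookup[[kingdom_id, rank_id]]
end

# Values from the issue: rank_id 30 means different things per kingdom.
ranks = build_rank_lookup([
  [1, 30, "Phylum"],
  [3, 30, "Division"]
])

puts rank_name(ranks, 1, 30)
puts rank_name(ranks, 3, 30)
```

With a lookup keyed on rank_id alone, the second row would silently replace the first, which is the behaviour the issue describes.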

Update Index Fungorum

Paul Kirk sent a new dump of Index Fungorum; it has to be converted to DwC-A and imported.

Import ASM Mammal diversity DB

The import source has changed; it is now a CSV file.

Import Myriatrix dataset

@Archilegt wrote:

The data is downloadable in DwC. See the copy in GBIF for reference: http://www.gbif-uat.org/dataset/61e2d02a-34f7-4705-8840-c1ee49dfd951

@Archilegt, do you know if all your names get ingested into the GBIF taxonomic backbone? If so, I already get them through the GBIF Darwin Core file.

As a User I do not want to see data sources that do not work anymore

dwcahunter list should return only resources that do convert

Add Leipzig Catalog of Vascular Plants

from gnames/gnverifier#62 by @abubelinha

Dataset: https://github.com/idiv-biodiversity/LCVP

Article https://doi.org/10.1038/s41597-020-00702-z

Import Howard and Moore Checklist

FYI @ccicero, @dustymc
Howard and Moore Checklist - https://www.howardandmoore.org/howard-and-moore-database/

Update NCBI

import EOL

Update PaleoBioDB

Adding The Parasite Tracker TCN Taxonomy to Global Names

@dimus
The Arctos community would like to explore the possibility of adding ectoparasite taxonomy in use by institutions participating in the Parasite Tracker TCN to Global Names. The TPT is working to assemble taxonomy reference files for major groups of ectoparasites, with the names and classifications in csv format. Please let me know what steps would be involved with integrating these with Global Names.