globalnamesarchitecture / gnindex
License: MIT License
Add a column for parsing quality, and include a score penalty of -1 for parsing quality 2 and -2 for parsing quality 3.
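The penalty rule above can be sketched as a small lookup. This is only an illustration: the function name, the constant name, and the assumption that parsing quality 1 carries no penalty are not from the issue.

```python
# Hypothetical sketch: penalties per parsing quality, as described in the issue.
# Quality 1 is assumed to carry no penalty (assumption, not stated in the issue).
PARSE_QUALITY_PENALTY = {1: 0, 2: -1, 3: -2}


def apply_parse_quality_penalty(base_score: float, parse_quality: int) -> float:
    """Return base_score adjusted by the penalty for the given parsing quality."""
    # Unknown quality values are left unpenalized in this sketch.
    return base_score + PARSE_QUALITY_PENALTY.get(parse_quality, 0)
```

For example, a base score of 5 with parsing quality 3 becomes 3 under these assumptions.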
Update the gnindex database with recent changes in MySQL.
We need the output fields to semantically cover all the fields in resolver.globalnames.org.
Currently it returns no match
We supply data source IDs in our queries. To use them effectively, users need to know
which data source IDs to choose. For that they need to be able to get a list of all data sources in one query,
and also to supply data source IDs and get back descriptions of those data sources.
A single machine is no longer enough to compute the data for #16.
The following names are currently not recognized by fuzzy match:
d1f122ca-f72d-5c73-b134-051c9ce54a3c: [{:verbatim=>"Aphrodina", :input=>"Aphrodina", :match=>"Aphrosina", :ed=>"1", :accepted=>nil, :score=>"0.5"}]
a5db531c-f5df-5e76-8c4c-1ad04dd86fbf: [{:verbatim=>"Asteridae", :input=>"Asteridae", :match=>"Asteriidae", :ed=>"1", :accepted=>nil, :score=>"0.5"}]
b25e9447-8720-5157-a71b-6a754efdf13e: [{:verbatim=>"Caricella", :input=>"Caricella", :match=>"Carinella", :ed=>"1", :accepted=>nil, :score=>"0.5"}]
6c7aac26-f4a7-517b-a7ec-267012587d21: [{:verbatim=>"Chriocentrus dorab", :input=>"Chriocentrus dorab", :match=>"Chirocentrus dorab (non Swainson, 1775)", :ed=>"1", :accepted=>"Chirocentrus nudus Swainson, 1839", :score=>"0.75"}]
e0a2e08f-35a9-55ce-8173-51a922a9b6fb: [{:verbatim=>"Cyphoderis", :input=>"Cyphoderis", :match=>"Cyphomeris", :ed=>"1", :accepted=>nil, :score=>"0.5"}]
b8eeab4c-29ed-51e0-a4f1-863b2eb45696: [{:verbatim=>"Paranomia", :input=>"Paranomia", :match=>"Paranosia", :ed=>"1", :accepted=>nil, :score=>"0.5"}]
7dfa5ed8-122d-5873-8c7d-558ecacc57e5: [{:verbatim=>"Pleurotoma", :input=>"Pleurotoma", :match=>"Pleurostoma", :ed=>"1", :accepted=>nil, :score=>"0.5"}]
b2f04a6e-1490-56fa-9cba-f64358c70e4b: [{:verbatim=>"Rhinella castaenotica", :input=>"Rhinella castaenotica", :match=>"Rhinella castaneotica (Caldwell, 1991)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]
23f85c50-f1d6-5928-9701-cfd0d2a1e2ce: [{:verbatim=>"Solenosteria fusiformis", :input=>"Solenosteria fusiformis", :match=>"Solenosteira fusiformis (Blainville, 1832)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]
e05d7c46-605b-55f5-9f64-5afd46e308a6: [{:verbatim=>"Tripnuestes gratilla", :input=>"Tripnuestes gratilla", :match=>"Tripneustes gratilla (Linnaeus, 1758)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]
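Several of the pairs above differ by a swap of two adjacent letters (for example "Chriocentrus" and "Chirocentrus"), yet are listed with an edit distance of 1. That is consistent with a distance that counts transpositions as a single edit. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein as an assumption; the issue does not say which algorithm the reference matcher uses, and the function name is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute,
    and transposition of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ri" <-> "ir".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

The comparison here is between the input and the canonical form of the match, without authorship.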
Related to #5
2527 ms for 540 records is too slow. Explore EXPLAIN in PostgreSQL. Try assembling a canonical names table.
It is going to be released soon, so we need to start working on the README file.
In this case we combine our results as follows:
match: exact and exact canonical (ExactNameMatchByUUID, ExactCanonicalNameMatchByUUID -> Match)
fuzzy: full fuzzy match (FuzzyCanonicalMatch -> FuzzyMatch)
no match: partial match, partial fuzzy match, genus match, no match, errors (FuzzyPartialMatch, ExactMatchPartialByGenus, ExactPartialMatch -> NoMatch)
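The grouping above amounts to a lookup from detailed match kinds to three simplified categories. The following is a minimal sketch of that idea; the match kind names come from the issue, while the enum, the dictionary, and the function name are hypothetical, and kinds not listed in the issue (including errors) are assumed to fall through to NoMatch.

```python
from enum import Enum


class SimplifiedMatch(Enum):
    MATCH = "Match"
    FUZZY_MATCH = "FuzzyMatch"
    NO_MATCH = "NoMatch"


# Detailed match kinds named in the issue, grouped into simplified categories.
SIMPLIFIED_BY_KIND = {
    "ExactNameMatchByUUID": SimplifiedMatch.MATCH,
    "ExactCanonicalNameMatchByUUID": SimplifiedMatch.MATCH,
    "FuzzyCanonicalMatch": SimplifiedMatch.FUZZY_MATCH,
    "FuzzyPartialMatch": SimplifiedMatch.NO_MATCH,
    "ExactMatchPartialByGenus": SimplifiedMatch.NO_MATCH,
    "ExactPartialMatch": SimplifiedMatch.NO_MATCH,
}


def simplify_match_kind(kind: str) -> SimplifiedMatch:
    """Map a detailed match kind to its simplified category.

    Anything unrecognized, including errors, is treated as NoMatch.
    """
    return SIMPLIFIED_BY_KIND.get(kind, SimplifiedMatch.NO_MATCH)
```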
The following cases should work with stems:
938d3659-1e85-5dee-92d4-bb763fbdf8bb: [{:verbatim=>"Anticorbula sinuosum", :input=>"Anticorbula sinuosum", :match=>"Anticorbula sinuosa (Morrison, 1943)", :ed=>"2", :accepted=>nil, :score=>"0.75"}]
351fa4ea-69e8-528f-981d-b15d81867dd3: [{:verbatim=>"Bos grunniensis", :input=>"Bos grunniensis", :match=>"Bos grunniens Linnaeus, 1766", :ed=>"2", :accepted=>nil, :score=>"0.75"}]
5a474555-187f-597b-b979-e863f31ec6b2: [{:verbatim=>"Brachycera", :input=>"Brachycera", :match=>"Brachycerus", :ed=>"2", :accepted=>nil, :score=>"0.5"}]
767ed8b1-d474-572b-a35d-5687d91ac546: [{:verbatim=>"Cranopsis coniferus", :input=>"Cranopsis coniferus", :match=>"Cranopsis conifera (Cope, 1862)", :ed=>"2", :accepted=>"Incilius coniferus (Cope, 1862)", :score=>"0.75"}]
Acaulon mediterraneum,1
Acroporium suzukii,2
Very often it is enough for people to know that something did get matched, and they need only the 'best' result. We need an option in the query to request this.
We need to track that performance doesn't degrade when new changes are applied.
If a uninomial name is provided, we should do a fuzzy match (even if it is a genus). The edit distance should be 1 or less.
Related to #5
Add heuristic rules for matches: no fuzzy match for uninomials shorter than 6 letters, and no more than 1 edit (ed) per 5 letters in each word.
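One possible reading of these rules is sketched below. It assumes that "1 ed per 5 letters per word" means each word may differ by at most len(word) // 5 edits, and that both names have the same number of words. The function names are hypothetical, and plain Levenshtein distance is used only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fuzzy_match_allowed(name: str, candidate: str) -> bool:
    """Apply the heuristic rules to an input name and a fuzzy candidate."""
    words, cand_words = name.split(), candidate.split()
    # Assumption: only compare names with the same number of words.
    if len(words) != len(cand_words):
        return False
    # Rule 1: no fuzzy match for uninomials shorter than 6 letters.
    if len(words) == 1 and len(words[0]) < 6:
        return False
    # Rule 2: at most 1 edit per 5 letters, checked word by word.
    for word, cand in zip(words, cand_words):
        if levenshtein(word, cand) > len(word) // 5:
            return False
    return True
```

Under this reading, a three-letter genus such as "Bos" allows no edits at all, while an eleven-letter epithet allows up to two.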
Currently we capitalize the first word. Some databases have the genus in all capitals, so a better approach is:
Convert the first word to "first letter capitalized, all others lowercase".
Add a score penalty (-1) if the normalized first word differs from the original.
Example:
homo sapiens -> Homo sapiens
HOMO sapiens -> Homo sapiens
Homo sapiens -> Homo sapiens
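The normalization and penalty can be sketched as follows. The function name and the returned tuple shape are hypothetical; the rule itself (capitalize the first letter, lowercase the rest of the first word, penalty of -1 when anything changed) follows the description above.

```python
def normalize_first_word(name: str) -> tuple[str, int]:
    """Return the name with its first word normalized, plus a score penalty.

    The penalty is -1 when normalization changed the first word, else 0.
    """
    first, sep, rest = name.partition(" ")
    normalized = first[:1].upper() + first[1:].lower()
    penalty = -1 if normalized != first else 0
    return normalized + sep + rest, penalty
```

Only the first word is touched, so the case of the specific epithet and any authorship is preserved as supplied.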
Some features of the database are not compatible with the Ruby schema format of ActiveRecord, so we need to keep the schema in SQL to get all functionality. To create the ./db_migrations/db/structure.sql file, run rake db:migration
Macrobiotus harmsworthi subsp. obscurus Dastych, 1985 shows as a synonym in CoL 2017, but it is not.
This is a tough one, as the quality of data sources is not certain. Until we have a good algorithm for figuring out their quality, we can simply assign a 'subjective' quality value to each of them. Maybe we need some
kind of structure for that, with fields like:
isCurated: if a database has a significant manual quality control.
isAutoCurated: if a database has reasonably sophisticated algorithms.
We probably need another score for this that we should use when we sort results.
A DB migration is required. Column names: data_quality, record_count.
Quality order: isCurated > isAutoCurated.
If two names have the same score in two curated databases, order them by record_count (lower is better), and then by the lowest DB ID.
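The ordering described above can be expressed as a composite sort key. This is a sketch under assumptions: the Result class and its field names are hypothetical, and treating sources that are neither curated nor auto-curated as the lowest quality tier is an assumption not stated in the issue.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record shape, for illustration only.
    name: str
    score: float
    is_curated: bool
    is_auto_curated: bool
    record_count: int
    data_source_id: int


def quality_rank(result: Result) -> int:
    """Lower is better: curated, then auto-curated, then everything else."""
    if result.is_curated:
        return 0
    if result.is_auto_curated:
        return 1
    return 2


def sort_results(results: list[Result]) -> list[Result]:
    """Sort by score (highest first), then quality, record_count, and DB ID."""
    return sorted(
        results,
        key=lambda r: (-r.score, quality_rank(r), r.record_count, r.data_source_id),
    )
```

Because tuples compare element by element, each later field only matters when all earlier fields tie.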
parent: #3
Related to #5
Add fields:
preferredResults
bestMatchOnly
When I match Adarys robusulus to Catalogue of Life, it returns "Partial canonical form fuzzy match" as Abaris with EditDistance 1.
Instead, it should not attempt a fuzzy match on the genus and should try a genus match only.
val instead of def