gnindex's People

Contributors

alexander-myltsev, dimus

gnindex's Issues

As a User I want to be able to query data sources

Our queries take data source IDs. To use them effectively, users need to know
which data source IDs are available. They should therefore be able to list all data sources in one query,
and also to supply data source IDs and get back descriptions of those data sources.

As a User I want fuzzy match to work for all relevant names with edit distance 1

Fuzzy match currently does not recognize the following names:

d1f122ca-f72d-5c73-b134-051c9ce54a3c: [{:verbatim=>"Aphrodina", :input=>"Aphrodina", :match=>"Aphrosina", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

a5db531c-f5df-5e76-8c4c-1ad04dd86fbf: [{:verbatim=>"Asteridae", :input=>"Asteridae", :match=>"Asteriidae", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

b25e9447-8720-5157-a71b-6a754efdf13e: [{:verbatim=>"Caricella", :input=>"Caricella", :match=>"Carinella", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

6c7aac26-f4a7-517b-a7ec-267012587d21: [{:verbatim=>"Chriocentrus dorab", :input=>"Chriocentrus dorab", :match=>"Chirocentrus dorab (non Swainson, 1775)", :ed=>"1", :accepted=>"Chirocentrus nudus Swainson, 1839", :score=>"0.75"}]

e0a2e08f-35a9-55ce-8173-51a922a9b6fb: [{:verbatim=>"Cyphoderis", :input=>"Cyphoderis", :match=>"Cyphomeris", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

b8eeab4c-29ed-51e0-a4f1-863b2eb45696: [{:verbatim=>"Paranomia", :input=>"Paranomia", :match=>"Paranosia", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

7dfa5ed8-122d-5873-8c7d-558ecacc57e5: [{:verbatim=>"Pleurotoma", :input=>"Pleurotoma", :match=>"Pleurostoma", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

b2f04a6e-1490-56fa-9cba-f64358c70e4b: [{:verbatim=>"Rhinella castaenotica", :input=>"Rhinella castaenotica", :match=>"Rhinella castaneotica (Caldwell, 1991)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]

23f85c50-f1d6-5928-9701-cfd0d2a1e2ce: [{:verbatim=>"Solenosteria fusiformis", :input=>"Solenosteria fusiformis", :match=>"Solenosteira fusiformis (Blainville, 1832)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]

e05d7c46-605b-55f5-9f64-5afd46e308a6: [{:verbatim=>"Tripnuestes gratilla", :input=>"Tripnuestes gratilla", :match=>"Tripneustes gratilla (Linnaeus, 1758)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]
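Several of the pairs above differ by two swapped adjacent letters (for example Chriocentrus / Chirocentrus and Tripnuestes / Tripneustes). Plain Levenshtein distance scores such a swap as 2, so an edit distance of 1 for these cases implies a metric that counts a transposition as a single edit. The issue does not say which algorithm gnindex uses. The following is only a minimal sketch of one such metric (optimal string alignment), with a hypothetical function name, comparing the input against the canonical form without authorship:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance.

    Insertion, deletion, substitution, and swapping two adjacent
    characters each cost 1. Illustrative sketch, not gnindex code.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ri" <-> "ir"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


pairs = [
    ("Aphrodina", "Aphrosina"),                        # substitution
    ("Asteridae", "Asteriidae"),                       # insertion
    ("Chriocentrus dorab", "Chirocentrus dorab"),      # transposition
    ("Tripnuestes gratilla", "Tripneustes gratilla"),  # transposition
]
for query, canonical in pairs:
    print(query, "->", canonical, osa_distance(query, canonical))
```

Each of the four pairs printed here is at distance 1 under this metric.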

As a User I want by default to have only 3 states: exact match, fuzzy match and no match

In this case we combine our results as follows:

match: exact and exact canonical (ExactNameMatchByUUID, ExactCanonicalNameMatchByUUID -> Match)
fuzzy match: full fuzzy match (FuzzyCanonicalMatch -> FuzzyMatch)
no match: partial match, partial fuzzy match, genus match, no match, errors (FuzzyPartialMatch, ExactMatchPartialByGenus, ExactPartialMatch -> NoMatch)
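The grouping above can be sketched as a lookup table. The detailed kind names are taken from the issue. The `SimpleMatch` enum and the `simplify` function are hypothetical names used only for illustration. Kinds that are not listed, including errors, fall through to NoMatch.

```python
from enum import Enum


class SimpleMatch(Enum):
    """The three default states proposed in the issue."""
    MATCH = "Match"
    FUZZY_MATCH = "FuzzyMatch"
    NO_MATCH = "NoMatch"


# Detailed match kinds (names from the issue) mapped to the simplified states.
DETAILED_TO_SIMPLE = {
    "ExactNameMatchByUUID": SimpleMatch.MATCH,
    "ExactCanonicalNameMatchByUUID": SimpleMatch.MATCH,
    "FuzzyCanonicalMatch": SimpleMatch.FUZZY_MATCH,
    "FuzzyPartialMatch": SimpleMatch.NO_MATCH,
    "ExactMatchPartialByGenus": SimpleMatch.NO_MATCH,
    "ExactPartialMatch": SimpleMatch.NO_MATCH,
}


def simplify(kind: str) -> SimpleMatch:
    """Collapse a detailed match kind into one of the three states.

    Unlisted kinds (errors, plain no-match) are treated as NoMatch.
    """
    return DETAILED_TO_SIMPLE.get(kind, SimpleMatch.NO_MATCH)


print(simplify("ExactCanonicalNameMatchByUUID").value)
print(simplify("FuzzyPartialMatch").value)
```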

As a User I want to get fuzzy matches when difference is in stems.

The following cases should work with stems

938d3659-1e85-5dee-92d4-bb763fbdf8bb: [{:verbatim=>"Anticorbula sinuosum", :input=>"Anticorbula sinuosum", :match=>"Anticorbula sinuosa (Morrison, 1943)", :ed=>"2", :accepted=>nil, :score=>"0.75"}]

351fa4ea-69e8-528f-981d-b15d81867dd3: [{:verbatim=>"Bos grunniensis", :input=>"Bos grunniensis", :match=>"Bos grunniens Linnaeus, 1766", :ed=>"2", :accepted=>nil, :score=>"0.75"}]

5a474555-187f-597b-b979-e863f31ec6b2: [{:verbatim=>"Brachycera", :input=>"Brachycera", :match=>"Brachycerus", :ed=>"2", :accepted=>nil, :score=>"0.5"}]

767ed8b1-d474-572b-a35d-5687d91ac546: [{:verbatim=>"Cranopsis coniferus", :input=>"Cranopsis coniferus", :match=>"Cranopsis conifera (Cope, 1862)", :ed=>"2", :accepted=>"Incilius coniferus (Cope, 1862)", :score=>"0.75"}]
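In all four cases the two names differ only in a Latin ending (-um / -a, -is / none, -a / -us, -us / -a), so they become equal once the endings are removed. The following is a rough sketch of that idea. The suffix list, the minimum stem length, and the function names are assumptions chosen to cover these examples; this is not a complete Latin stemmer and not the gnindex implementation.

```python
# Hypothetical suffix list, longest first so "ae" is tried before "a" or "e".
SUFFIXES = ("ae", "is", "um", "us", "a", "e", "i")
MIN_STEM_LENGTH = 3  # assumption: never strip a word below 3 characters


def stem_word(word: str) -> str:
    """Lowercase a word and strip one known ending, if the stem stays long enough."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= MIN_STEM_LENGTH:
            return lower[: -len(suffix)]
    return lower


def stem_name(name: str) -> str:
    """Stem every word of a canonical name (no authorship)."""
    return " ".join(stem_word(word) for word in name.split())


def same_stems(a: str, b: str) -> bool:
    """True when two canonical names are identical after stemming."""
    return stem_name(a) == stem_name(b)


cases = [
    ("Anticorbula sinuosum", "Anticorbula sinuosa"),
    ("Bos grunniensis", "Bos grunniens"),
    ("Brachycera", "Brachycerus"),
    ("Cranopsis coniferus", "Cranopsis conifera"),
]
for query, canonical in cases:
    print(query, "~", canonical, same_stems(query, canonical))
```

A stem comparison like this could run before or alongside the edit-distance check, so that an ending-only difference is not rejected for having edit distance 2.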

As a User I want to be able to resolve lowercase and UPPERCASE genera and uninomials

Currently we capitalize the first word. Some databases have the genus in all capitals, so a better approach is:

  • Convert the first word into "first letter capitalized, all others lowercase"

  • Add a score penalty (-1) if the normalized first word is different from the original.

Example:

homo sapiens -> Homo sapiens
HOMO sapiens -> Homo sapiens
Homo sapiens -> Homo sapiens
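A minimal sketch of this normalization, assuming the input is already trimmed and words are separated by single spaces. The function name and the (name, penalty) return shape are illustrative, not part of gnindex:

```python
def normalize_first_word(name: str) -> tuple[str, int]:
    """Return the name with its first word normalized, plus a score penalty.

    The first word becomes "first letter capitalized, all others lowercase".
    The penalty is -1 when that changed the word, otherwise 0.
    """
    first, separator, rest = name.partition(" ")
    normalized = first[:1].upper() + first[1:].lower()
    penalty = -1 if normalized != first else 0
    return normalized + separator + rest, penalty


for raw in ("homo sapiens", "HOMO sapiens", "Homo sapiens"):
    print(raw, "->", normalize_first_word(raw))
```

With this sketch the first two inputs are normalized to Homo sapiens with a penalty of -1, and the third is returned unchanged with a penalty of 0.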

As a User I want the score of results to depend on the 'quality' of a data source

This is a tough one, as the quality of data sources is not certain. Until we have a good algorithm for figuring out their quality, we can simply assign a 'subjective' quality value to each of them. Maybe we need some
kind of structure for that, with fields like

isCurated: if a database has a significant manual quality control.
isAutoCurated: if a database has reasonably sophisticated algorithms.

We probably need another score for this that we should use when we sort results.

A DB migration is required. Column names: data_quality, record_count.

Quality order: isCurated > isAutoCurated.

If two names have the same score in 2 curated databases, order them by record_count (least is better), and then by DB ID (least first).
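One possible reading of this ordering is a composite sort key: match score first, then data source quality, then record_count, then DB ID. The sketch below assumes that reading. The `Result` class and its field names are hypothetical and only mirror the isCurated / isAutoCurated flags and the record_count column mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical result row, used only for illustration."""
    name: str
    score: float
    data_source_id: int
    is_curated: bool
    is_auto_curated: bool
    record_count: int


def quality_rank(result: Result) -> int:
    """Lower is better: isCurated > isAutoCurated > neither."""
    if result.is_curated:
        return 0
    if result.is_auto_curated:
        return 1
    return 2


def sort_key(result: Result):
    # Higher score first, then better quality, then fewer records,
    # then the smaller data source ID as the final tie-breaker.
    return (-result.score, quality_rank(result), result.record_count, result.data_source_id)


def sort_results(results):
    return sorted(results, key=sort_key)


results = [
    Result("Homo sapiens", 0.75, 11, False, True, 500),
    Result("Homo sapiens", 0.75, 4, True, False, 900),
    Result("Homo sapiens", 0.75, 1, True, False, 900),
    Result("Homo sapiens", 0.75, 9, True, False, 100),
]
for r in sort_results(results):
    print(r.data_source_id, r.record_count)
```

Here the curated source with the fewest records comes first, the two curated sources with equal record counts are ordered by ID, and the auto-curated source comes last.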

Partial fuzzy match happens on a Genus level

When I match Adarys robusulus against Catalogue of Life, it returns

Partial canonical form fuzzy match as Abaris with EditDistance 1

Instead, it should not try to do a fuzzy match on the genus, and should try a genus match only.
