globalnamesarchitecture / gnindex

License: MIT License

Scala 84.34% Thrift 2.85% Ruby 8.37% Shell 0.95% Dockerfile 3.49%

gnindex's Issues

As a User I need `nameFilter` to work independent of capitalisation

As a User I need speed comparison between `advancedResolve` and `simpleResolve`

As a Developer I need to find out why filters are so slow

As a User I want to distinguish name strings that were parsed well, from name strings that had parsing problems

Add a column for parsing quality, and include a penalty in the score: -1 for parsing quality 2 and -2 for parsing quality 3.
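
A minimal sketch of the penalty described above, in Python. The names `PARSE_QUALITY_PENALTY` and `adjust_score` are hypothetical, and treating parsing quality 1 as "no penalty" is an assumption; only the -1 and -2 values come from the issue text.

```python
# Hypothetical penalty table: parsing quality -> score adjustment.
# The values for qualities 2 and 3 come from the issue text;
# quality 1 is assumed to mean "parsed cleanly" and carries no penalty.
PARSE_QUALITY_PENALTY = {1: 0, 2: -1, 3: -2}


def adjust_score(base_score: int, parse_quality: int) -> int:
    """Return base_score lowered by the penalty for the given parsing quality."""
    # Unknown quality values are left unpenalized in this sketch.
    return base_score + PARSE_QUALITY_PENALTY.get(parse_quality, 0)
```

Keeping the penalties in one table makes it easy to show them later to users who ask how a score was calculated.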

As a User I'd like to be sure that the quality of the new API is not worse than that of the old one

As a User I'd like to work with the most up-to-date database

Update the gnindex database with recent changes in MySQL.

As a User I'd like acceptedName to be the same as the returned name when the name is not a synonym

As a Developer I need a stable testing database

As a User I'd like to understand details behind calculated score

As a User I want API output fields to be congruent with the output of resolver.globalnames.org

Our output fields need to cover, semantically, all the fields returned by resolver.globalnames.org.

"Diloma arida" should have a partial match in CoL 2017

Currently it returns no match.

As a User I want to be able to query data sources

We supply data source IDs in our queries. To use them effectively, users need to know
which data source IDs to choose. For that they need to be able to get a list of all data sources in one query,
and also to supply data source IDs and get back descriptions of those data sources.

As a User I'd like to figure out the difference in result between `gnindex` and `resolver`

As a Developer I need a Spark cluster

A single machine is no longer enough to compute the data for #16.

As a User I want fuzzy match to work for all relevant names with edit distance 1

The following names are currently not recognized by fuzzy match:

d1f122ca-f72d-5c73-b134-051c9ce54a3c: [{:verbatim=>"Aphrodina", :input=>"Aphrodina", :match=>"Aphrosina", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

a5db531c-f5df-5e76-8c4c-1ad04dd86fbf: [{:verbatim=>"Asteridae", :input=>"Asteridae", :match=>"Asteriidae", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

b25e9447-8720-5157-a71b-6a754efdf13e: [{:verbatim=>"Caricella", :input=>"Caricella", :match=>"Carinella", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

6c7aac26-f4a7-517b-a7ec-267012587d21: [{:verbatim=>"Chriocentrus dorab", :input=>"Chriocentrus dorab", :match=>"Chirocentrus dorab (non Swainson, 1775)", :ed=>"1", :accepted=>"Chirocentrus nudus Swainson, 1839", :score=>"0.75"}]

e0a2e08f-35a9-55ce-8173-51a922a9b6fb: [{:verbatim=>"Cyphoderis", :input=>"Cyphoderis", :match=>"Cyphomeris", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

b8eeab4c-29ed-51e0-a4f1-863b2eb45696: [{:verbatim=>"Paranomia", :input=>"Paranomia", :match=>"Paranosia", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

7dfa5ed8-122d-5873-8c7d-558ecacc57e5: [{:verbatim=>"Pleurotoma", :input=>"Pleurotoma", :match=>"Pleurostoma", :ed=>"1", :accepted=>nil, :score=>"0.5"}]

b2f04a6e-1490-56fa-9cba-f64358c70e4b: [{:verbatim=>"Rhinella castaenotica", :input=>"Rhinella castaenotica", :match=>"Rhinella castaneotica (Caldwell, 1991)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]

23f85c50-f1d6-5928-9701-cfd0d2a1e2ce: [{:verbatim=>"Solenosteria fusiformis", :input=>"Solenosteria fusiformis", :match=>"Solenosteira fusiformis (Blainville, 1832)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]

e05d7c46-605b-55f5-9f64-5afd46e308a6: [{:verbatim=>"Tripnuestes gratilla", :input=>"Tripnuestes gratilla", :match=>"Tripneustes gratilla (Linnaeus, 1758)", :ed=>"1", :accepted=>nil, :score=>"0.75"}]
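
Several of the pairs above (for example "Chriocentrus" / "Chirocentrus" and "Tripnuestes" / "Tripneustes") differ by two swapped adjacent letters. Such pairs have edit distance 1 only when an adjacent transposition counts as a single edit; plain Levenshtein distance scores them as 2. The sketch below computes the optimal string alignment distance, which counts transpositions as one edit. It is an illustration only, not a description of how gnindex's matcher measures distance, and the function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance.

    Insertions, deletions, substitutions and swaps of two adjacent
    characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ri" <-> "ir".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

Running it over the canonical forms of the pairs listed in this issue is a quick way to check which of them need transposition support to qualify as edit distance 1.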

As a User I'd like faster access time: for that optimise access time to the database

Related to #5

2527 ms for 540 records is too slow. Explore EXPLAIN for PostgreSQL. Try to assemble a canonical names table.

As a User I want to understand what this project is about and how to use it by looking at README

It is going to be released soon, so we need to start working on the README file

As a User I'd like to have precomputed ranked canonical names

As a User I want by default to have only 3 states: exact match, fuzzy match and no match

In this case we combine our results as follows:

match: exact and exact canonical (ExactNameMatchByUUID, ExactCanonicalNameMatchByUUID -> Match)
fuzzy match: full fuzzy match (FuzzyCanonicalMatch -> FuzzyMatch)
no match: partial match, partial fuzzy match, genus match, no match, errors (FuzzyPartialMatch, ExactMatchPartialByGenus, ExactPartialMatch -> NoMatch)
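
The grouping above can be sketched as a lookup table. The detailed match type names are taken from the issue text; the `SimpleMatch` enum and the `collapse` function are hypothetical, and defaulting every unlisted type (including errors) to no match is an assumption based on the last line of the list.

```python
from enum import Enum


class SimpleMatch(Enum):
    MATCH = "Match"
    FUZZY_MATCH = "FuzzyMatch"
    NO_MATCH = "NoMatch"


# Only types that map to Match or FuzzyMatch are listed explicitly.
_COLLAPSE = {
    "ExactNameMatchByUUID": SimpleMatch.MATCH,
    "ExactCanonicalNameMatchByUUID": SimpleMatch.MATCH,
    "FuzzyCanonicalMatch": SimpleMatch.FUZZY_MATCH,
}


def collapse(match_type: str) -> SimpleMatch:
    """Collapse a detailed match type into one of the three default states."""
    # Partial matches, genus matches, errors and unknown types become NoMatch.
    return _COLLAPSE.get(match_type, SimpleMatch.NO_MATCH)
```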

As a User I want to get fuzzy matches when difference is in stems.

The following cases should work with stems:

938d3659-1e85-5dee-92d4-bb763fbdf8bb: [{:verbatim=>"Anticorbula sinuosum", :input=>"Anticorbula sinuosum", :match=>"Anticorbula sinuosa (Morrison, 1943)", :ed=>"2", :accepted=>nil, :score=>"0.75"}]

351fa4ea-69e8-528f-981d-b15d81867dd3: [{:verbatim=>"Bos grunniensis", :input=>"Bos grunniensis", :match=>"Bos grunniens Linnaeus, 1766", :ed=>"2", :accepted=>nil, :score=>"0.75"}]

5a474555-187f-597b-b979-e863f31ec6b2: [{:verbatim=>"Brachycera", :input=>"Brachycera", :match=>"Brachycerus", :ed=>"2", :accepted=>nil, :score=>"0.5"}]

767ed8b1-d474-572b-a35d-5687d91ac546: [{:verbatim=>"Cranopsis coniferus", :input=>"Cranopsis coniferus", :match=>"Cranopsis conifera (Cope, 1862)", :ed=>"2", :accepted=>"Incilius coniferus (Cope, 1862)", :score=>"0.75"}]
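
All four cases above differ only in a Latin ending (-um / -a, -is / none, -a / -us, -us / -a). A rough sketch of stem comparison follows. The ending list is deliberately tiny and chosen just to cover these examples, so it is an assumption rather than a real Latin stemmer; `LATIN_ENDINGS`, `stem` and `same_stems` are hypothetical names. The inputs are canonical forms, without authorship.

```python
# Tiny, illustrative ending list; a real stemmer would need many more rules.
LATIN_ENDINGS = ("um", "us", "is", "a")


def stem(word: str) -> str:
    """Strip one known ending, keeping at least three characters of stem."""
    for ending in LATIN_ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def same_stems(name_a: str, name_b: str) -> bool:
    """True when both canonical names have the same word count and equal stems."""
    words_a, words_b = name_a.split(), name_b.split()
    if len(words_a) != len(words_b):
        return False
    return all(stem(x) == stem(y) for x, y in zip(words_a, words_b))
```

With this list, `same_stems("Bos grunniensis", "Bos grunniens")` holds because "grunniensis" loses its "-is" while "grunniens" is left unchanged.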

Rename `bestMatch` to `bestMatchOnly`

Find out why empty results are returned when they should not be

Acaulon mediterraneum,1
Acroporium suzukii,2

As a User I want an option to get only one "best" match result for my query

Very often it is enough for people to know that something did get matched, and they need only the 'best' result. We need an option in the query to request this.

As a sysadmin I'd like to set up GraphQL web interface on the public server

Check the licence of Sangria GraphiQL.

As a User I'd like faster access time: for that limit fuzzy matched data from Matcher

As a Developer I need a tool for performance benchmark

We need to track that performance doesn't degrade when new changes are applied.

As a User I'd like to know parsing error details in my query

As a User I'd like faster access time: for that optimise Thrift data from Matcher

As a User I'd like to have fuzzy match to uninomials

If a uninomial name is provided, then we should do a fuzzy match (even if it's a genus). Edit distance should be 1 or less.

As a Developer I need to cover every known feature with tests

As a User I'd like faster access time: for that optimise creation of inner structures from Slick response (even exact match is too slow)

Related to #5

As a User I do not want to get fuzzy matches for short words

Add heuristic rules for matches: no fuzzy match for uninomials shorter than 6 letters, and no more than 1 edit distance (ed) per 5 letters per word.
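
One possible reading of these rules, sketched in Python. Interpreting "1 ed per 5 letters" as `len(word) // 5` allowed edits per word is an assumption, and `max_edit_distance` and `fuzzy_match_allowed` are hypothetical names. The caller is expected to supply the edit distance already measured for each word.

```python
from typing import Sequence


def max_edit_distance(word: str) -> int:
    """Allowed edits for one word: 1 per 5 full letters (assumed interpretation)."""
    return len(word) // 5


def fuzzy_match_allowed(name: str, edits_per_word: Sequence[int]) -> bool:
    """Decide whether a fuzzy match should be accepted under the heuristics."""
    words = name.split()
    if len(words) != len(edits_per_word):
        raise ValueError("need exactly one edit distance per word")
    # Rule 1: no fuzzy match for uninomials shorter than 6 letters.
    if len(words) == 1 and len(words[0]) < 6:
        return False
    # Rule 2: each word stays within its own edit budget.
    return all(ed <= max_edit_distance(w) for w, ed in zip(words, edits_per_word))
```

Under this reading a 4-letter genus such as "Homo" has a budget of 0, so any edit in the genus of "Homo sapiens" rejects the match, while one edit in "sapiens" (7 letters) is accepted.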

As a User I want to be able to resolve lowercase and UPPERCASE genera and uninomials

Currently we capitalize the first word. Some databases have the genus capitalized all the way, so a better approach is:

Convert the first word into "first letter capitalized, all others lowercase".
Add a score penalty (-1) if the normalized first word is different from the original.

Example:

homo sapiens -> Homo sapiens
HOMO sapiens -> Homo sapiens
Homo sapiens -> Homo sapiens
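
The normalization and penalty described above can be sketched as follows. The function name `normalize_first_word` and the `(name, penalty)` return shape are hypothetical; the -1 penalty comes from the issue text.

```python
from typing import Tuple


def normalize_first_word(name: str) -> Tuple[str, int]:
    """Return (normalized name, score penalty).

    The first word gets an uppercase first letter and lowercase remainder.
    The penalty is -1 when that changed the input, otherwise 0.
    """
    head, sep, rest = name.partition(" ")
    normalized_head = head[:1].upper() + head[1:].lower()
    penalty = -1 if normalized_head != head else 0
    return normalized_head + sep + rest, penalty
```

Only the first word is touched, so the case of the specific epithet is passed through unchanged.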

As a User I'd like to be sure that new DB is consistent

As a Developer I need to profile entire application to figure out execution time of different code parts

As a Developer I'd like to launch whole project with docker compose

As a Developer I want to have a database schema dump in SQL format

Some features of the database are not compatible with the Ruby schema format of ActiveRecord, so we need to keep the schema in SQL to retain all functionality. To create the ./db_migrations/db/structure.sql file, run `rake db:migration`.

Macrobiotus harmsworthi subsp. obscurus Dastych, 1985 should not be a synonym for CoL

Macrobiotus harmsworthi subsp. obscurus Dastych, 1985 shows as a synonym in CoL 2017, but it is not

As a User I'd like faster access time: for that check UUIDs existence in RAM instead of DB request

As a User I want the score of results to depend on the 'quality' of a data source

This is a tough one, as the quality of data sources is not certain. Until we have a good algorithm for figuring out their quality, we can simply assign a 'subjective' quality value to each of them. Maybe we need some kind of structure for that, with fields like

isCurated: if a database has a significant manual quality control.
isAutoCurated: if a database has reasonably sophisticated algorithms.

We probably need another score for this that we should use when we sort results.

DB migration is required. Column names: data_quality, record_count.

Quality order: isCurated > isAutoCurated.

If two names have the same score in two curated databases, order them by record_count (fewer is better), and then by the lowest DB ID.
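
The ordering above can be expressed as a single sort key. This is a sketch under assumptions: the `Result` record and its field names are hypothetical, a higher score is assumed to be better, and sources that are neither curated nor auto-curated are assumed to rank last.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    name: str
    score: float
    data_source_id: int
    is_curated: bool
    is_auto_curated: bool
    record_count: int


def quality_rank(result: Result) -> int:
    """Lower is better: isCurated > isAutoCurated > everything else."""
    if result.is_curated:
        return 0
    if result.is_auto_curated:
        return 1
    return 2


def sort_results(results: List[Result]) -> List[Result]:
    """Order by score (descending), quality, record_count, then data source ID."""
    return sorted(
        results,
        key=lambda r: (-r.score, quality_rank(r), r.record_count, r.data_source_id),
    )
```

Because the key is a tuple, each later field only breaks ties left by the earlier ones, which matches the "then by" wording of the rule.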

preferredResults
- name { id value }
- localId
- url
bestMatchOnly

Partial fuzzy match happens on a Genus level

When I match Adarys robusulus to Catalogue of Life, it returns

Partial canonical form fuzzy match as Abaris with EditDistance 1

Instead, it should not attempt a fuzzy match on the genus; it should try a genus match only.

Compute canonical and canonicalUuid in gnparser once per parsed result (val instead of def)
Transfer UUIDs via Thrift as bits instead of Strings
Statically compile Slick queries
Deal with partial canonicals as a whole string instead of string chunks
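
To illustrate the second item in the list above: a UUID in its usual text form takes 36 characters, while its raw value is 16 bytes. The sketch below uses Python's standard `uuid` module to show the round trip. It says nothing about how the Thrift schema would carry those bytes, and the helper names are hypothetical.

```python
import uuid


def uuid_to_bytes(text: str) -> bytes:
    """Pack a textual UUID into its 16-byte binary form."""
    return uuid.UUID(text).bytes


def bytes_to_uuid(raw: bytes) -> str:
    """Restore the canonical lowercase textual form from 16 raw bytes."""
    return str(uuid.UUID(bytes=raw))
```

For an ID such as `d1f122ca-f72d-5c73-b134-051c9ce54a3c` this cuts the payload from 36 characters to 16 bytes per UUID.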
