globalnamesarchitecture / gnparser
Split scientific names to meaningful elements with meta information
Home Page: https://parser.globalnames.org/
License: MIT License
Paddy suggested having 2 versions of the canonical form: one identical to the canonical form we have now, and another that contains everything from the 'normalized' version except authors. Practically speaking, canonical_complete is the same as canonical_with ranks plus the subgenus.
Rana aurora Baird and Girard, 1852; H.B. Shaffer et al., 2004
is currently parsed as the uninomial Rana.
It should be parsed up to Rana aurora Baird and Girard, 1852
Aus bus var.|f.|ssp.|etc.
names need to be moved from preprocessing to AST to get better warnings
Some entries from test_data.txt
are currently commented out because we do not have working rules for them. This issue is finished when all such tests pass; they do not have to be complete or have all fields.
Duplicate of #164
Parser_run is a legacy field in the output which is not needed anymore
Depends on PR #24
It is supposed to pick up names like
Acontias lineatus WAGLER 1830: 196
This PR requires subtasks to be completed:
unescapeHtml
that would map char positions from the unescaped string to the escaped one
Part of #41
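The position mapping described in this subtask can be sketched as follows. This is a minimal illustration in Python, not the project's code: the function name `unescape_with_map` and the shape of the returned map are assumptions made for the example.

```python
import html
import re

# Matches named, decimal, and hexadecimal character references.
ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_with_map(escaped):
    """Return (unescaped, pos_map).

    pos_map[i] is the index in `escaped` where the i-th character of
    `unescaped` starts. (Hypothetical helper, for illustration only.)
    """
    chars = []
    pos_map = []
    cursor = 0

    def copy_verbatim(start, end):
        # Plain text maps one-to-one.
        for j in range(start, end):
            chars.append(escaped[j])
            pos_map.append(j)

    for match in ENTITY.finditer(escaped):
        copy_verbatim(cursor, match.start())
        decoded = html.unescape(match.group(0))
        if decoded == match.group(0):
            # Unknown entity: html.unescape left it untouched.
            copy_verbatim(match.start(), match.end())
        else:
            # Every decoded character points at the start of its entity.
            for ch in decoded:
                chars.append(ch)
                pos_map.append(match.start())
        cursor = match.end()
    copy_verbatim(cursor, len(escaped))
    return "".join(chars), pos_map
```

With such a map, a position reported by the parser on the unescaped string can be translated back to the verbatim input by a single lookup, `pos_map[position]`.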
Latrodectus 13-guttatus Thorell, 1875
is not allowed anymore; the number has to be spelled out, in this case tridecimguttatus
. I suspect that the spelling of a number will change depending on the rest of the epithet. I am afraid this means our best shot is to build a dictionary that converts the numeric version to the spelled-out version for all such epithets (probably 200-300 of them).
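A dictionary-based conversion along these lines might look like the sketch below. It is written in Python for brevity and is not the project's implementation. Only the `13-guttatus` entry comes from this issue; the other entries are well-known ladybird epithets added as illustrative assumptions, and the names `NUMERIC_EPITHETS` and `spell_out_epithets` are hypothetical.

```python
# Whole-epithet lookup: because the spelled-out form may depend on the rest
# of the epithet, the table maps complete epithets rather than bare numbers.
NUMERIC_EPITHETS = {
    "13-guttatus": "tridecimguttatus",       # from the issue text
    "2-punctata": "bipunctata",              # illustrative assumption
    "7-punctata": "septempunctata",          # illustrative assumption
    "14-punctata": "quatuordecimpunctata",   # illustrative assumption
}


def spell_out_epithets(name):
    """Replace known numeric epithets with their spelled-out forms.

    Words that are not in the dictionary are left untouched.
    """
    return " ".join(NUMERIC_EPITHETS.get(word, word) for word in name.split(" "))
```

For example, `spell_out_epithets("Latrodectus 13-guttatus Thorell, 1875")` gives `Latrodectus tridecimguttatus Thorell, 1875`, while a numeric epithet missing from the dictionary passes through unchanged and could be reported with a warning.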
Quality indicates whether a parsed name is well formed or has problems. There are 4 levels of quality:
The last, fifth
level is parsed=false
where no canonical form is generated and the quality field does not exist in the output
All names with approximations and comparisons in the test section #species and infraspecies without epithets, comparisons
are supposed to be marked as surrogates
We assign f. as the normalized token for both filius and forma. In most cases this is fine. However, in a name like Aus bus L. forma bus it is obvious from the context that we are dealing with the rank forma. When we convert it to the normalized version, Aus bus L. f. bus, the meaning of f. becomes ambiguous: it might mean either filius or forma. In the current implementation, if we parse the normalized version again, the forma rank gets "converted" into a "son of" postfix belonging to the author.
Caulerpa cupressoides forma nuda
The canonical form should not contain forma (line 245 in the test). The same is true for the other names in the same test group.
Leave authors array only
We have rules for that, but they do not work right now
Some names from #ignoring sensu sec
are 'garbage collected' instead of preprocessed
s.l., s.s., etc.
Some biologists, especially botanists, pay close attention to infraspecies ranks and want the ranks to be shown in the canonical form. Others don't care.
Example: Aus bus 1889
We break our test data into groups, and the names of the groups are given in the comments preceding them. We need to tidy them up a bit.
Coleoptera Bold:AAV0432
should not be normalized to Coleoptera Bold
":" is considered as end of the word with sortSpace?
Currently we have a problem with the Warnings collection: a mutable collection is shared between all rules and all pb2 parser phases, and it has no rollback mechanism.
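One common way to get rollback for free is to carry warnings in an immutable parser state, so that a failed alternative simply discards the state that held its warnings. The sketch below shows the idea with a toy combinator set in Python. It is not the project's parser, and `State`, `literal`, `seq` and `first_of` are hypothetical names introduced for the example.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class State(NamedTuple):
    pos: int
    warnings: Tuple[str, ...] = ()  # immutable, so nothing leaks on failure


Rule = Callable[[str, State], Optional[State]]


def literal(text: str, warning: Optional[str] = None) -> Rule:
    """Match `text`; on success optionally add a warning to the new state."""
    def rule(src: str, st: State) -> Optional[State]:
        if not src.startswith(text, st.pos):
            return None
        extra = (warning,) if warning else ()
        return State(st.pos + len(text), st.warnings + extra)
    return rule


def seq(*rules: Rule) -> Rule:
    """Run rules in order; fail as a whole if any of them fails."""
    def rule(src: str, st: State) -> Optional[State]:
        current: Optional[State] = st
        for r in rules:
            current = r(src, current)
            if current is None:
                return None
        return current
    return rule


def first_of(*rules: Rule) -> Rule:
    """Ordered choice; each alternative starts from the same input state."""
    def rule(src: str, st: State) -> Optional[State]:
        for r in rules:
            out = r(src, st)
            if out is not None:
                return out
        return None
    return rule


grammar = first_of(
    seq(literal("Aus ", warning="emitted by a branch that later fails"), literal("cus")),
    seq(literal("Aus "), literal("bus")),
)
result = grammar("Aus bus", State(0))
```

Here the first alternative emits a warning and then fails on `cus`; because the warning only lived in the state returned by that branch, `result.warnings` is empty after the second alternative succeeds.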
Arthopyrenia hyalospora x
Basionym authorship has two alternative spellings when there is no combination authorship: the basionym authors may or may not be surrounded by parentheses. Currently we do not record in the AST how the authorship was written in the verbatim name, although we do have rules that distinguish the two cases. We need to add a field to Authorship that reflects the difference, so that the normalized form uses the same parentheses (present or absent) as the original.
If verbatim was Homo sapiens Linnaeus 1758
normalized should be Homo sapiens Linnaeus 1758
If verbatim was Homo sapiens (Linnaeus 1758)
normalized should be Homo sapiens (Linnaeus 1758)
The description should classify both cases as basionym authorship (maybe with an 'explicit/implicit' flag?)
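The proposed field could be modelled roughly as follows. This is a Python sketch under stated assumptions, not the project's AST: the class layout and the names `basionym_in_parens` and `normalize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Authorship:
    basionym: str                       # e.g. "Linnaeus 1758"
    combination: Optional[str] = None   # combination authors, if any
    # Hypothetical flag: True when the verbatim name had the basionym
    # authors in parentheses ("explicit"), False otherwise ("implicit").
    basionym_in_parens: bool = False


def normalize(canonical: str, authorship: Authorship) -> str:
    """Render a normalized name, keeping the original parenthesis style."""
    if authorship.combination is not None:
        # With combination authors the basionym is always parenthesized.
        return f"{canonical} ({authorship.basionym}) {authorship.combination}"
    if authorship.basionym_in_parens:
        return f"{canonical} ({authorship.basionym})"
    return f"{canonical} {authorship.basionym}"
```

Both variants are still classified as basionym authorship; the flag only records how the verbatim string was written, so the two examples above round-trip unchanged.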
The problem with the current code is that AST nodes can be applied to any string input.
There are ways to try to solve this:
Currently Aus x bus
normalizes to Aus x Aus bus
which is not correct. We need to change the rules so that it normalizes to Aus x bus
Names like Aus x bus
repeat genus info in details
Currently the parser runs from the command line with java -jar name_of_parser.jar...
We need wrapper scripts to simplify that.
GNParser uses org.apache.commons.id.uuid.UUID.nameUUIDFromString
from the littleshoot-commons-id library, and it has a bug: it uses the default encoding when it gets bytes from the string. If the default encoding is UTF8 (as it is on *nix), all is fine. On Windows it might be Cp1251, which breaks everything.
The workaround is to launch sbt with opts: JAVA_OPTS="-Dfile.encoding=UTF8" sbt
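The effect of the bug can be reproduced outside the JVM. The Python sketch below builds a name-based UUID from bytes obtained with different encodings. It assumes an RFC 4122 version 5 (SHA-1) scheme and the DNS namespace purely for illustration; the library's actual hashing and namespace are not stated in this issue. It also assumes that unmappable characters are replaced with `?`, which is what the JVM typically does.

```python
import hashlib
import uuid


def name_uuid(name, encoding, namespace=uuid.NAMESPACE_DNS):
    """Name-based UUID computed from `name` encoded with `encoding`.

    Hypothetical helper: mimics "get bytes with some charset, then hash".
    """
    # errors="replace" substitutes "?" for characters the charset lacks.
    raw = name.encode(encoding, errors="replace")
    digest = hashlib.sha1(namespace.bytes + raw).digest()
    return uuid.UUID(bytes=digest[:16], version=5)


ascii_name = "Homo sapiens"
non_ascii_name = "D\u00f6ringina"  # contains "ö"

same_for_ascii = name_uuid(ascii_name, "utf-8") == name_uuid(ascii_name, "cp1251")
same_for_non_ascii = name_uuid(non_ascii_name, "utf-8") == name_uuid(non_ascii_name, "cp1251")
```

Pure ASCII names hash identically under both encodings, so the problem only shows up for names with non-ASCII characters, where the byte sequences and therefore the UUIDs differ. Forcing a fixed encoding, as the workaround does, removes the platform dependence.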
Abryna- regis
Abryna regis- Paiva, 1860
Abryna -petri Paiva, 1860
should be left unparsed
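A pre-check for such names could be as simple as looking for a word that starts or ends with a hyphen. The Python sketch below is only an illustration of that check; the function name `has_dangling_hyphen` is hypothetical, and a real rule would probably need to be restricted to the name part rather than the authorship.

```python
def has_dangling_hyphen(name):
    """True if any whitespace-separated word starts or ends with a hyphen."""
    return any(
        len(word) > 1 and (word.startswith("-") or word.endswith("-"))
        for word in name.split()
    )
```

All three names above are flagged by this check, while a regular hyphenated epithet such as `Abryna regis-petri` is not.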
Depends on #30