Comments (9)
"I'm planning on writing code to correct the author string"
You mean do a lookup to check whether the string is Aus bus (M., 1870) or Aus bus M., 1870, then correct the string? FYI, I've just audited a dataset which had the following malformed author strings:
1844
Cr.)
Dahlborn)
([Den. & Sch.])
Den. & Schiff.)
[Den. & Schiff.]
Dewitz)
Gr. & Rob.)
(Grote
((Herr.-Sch.)
Hew.)
)Holland)
Klug)
micr(Chaudoir)
Olliff)
([Schiff.])
tetra(Gray)
)Wlkr.)
from gnparser.
I'm fixing the malformed author strings before submitting them to gnparser.
I'm harvesting this dataset with a web crawler, so there's additional HTML formatting that I can use to accurately isolate the author string. I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis.
Thanks for the heads-up. So far the web crawler hasn't hit any brackets or nested parentheses in this dataset, which would definitely be more challenging to correct automatically. I'm also logging a warning and will manually review that the author strings were corrected properly.
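A minimal sketch of the check described above, in Python. The function name `balance_parentheses` and the logger setup are hypothetical, not part of gnparser or of the crawler discussed here. It assumes the author string has already been isolated, and it only handles the simple case of a single unmatched parenthesis; strings with brackets or repeated parentheses are left untouched and logged for manual review.

```python
import logging

logger = logging.getLogger(__name__)


def balance_parentheses(author: str) -> str:
    """Add one missing parenthesis to an isolated author string.

    Hypothetical helper: anything beyond a single unmatched
    parenthesis is returned unchanged and flagged for review.
    """
    opens = author.count("(")
    closes = author.count(")")

    # Brackets or repeated parentheses are too ambiguous to fix automatically.
    if "[" in author or "]" in author or opens > 1 or closes > 1:
        logger.warning("Not auto-correcting %r; review manually", author)
        return author

    if opens == 1 and closes == 0:
        fixed = author + ")"   # e.g. "(Grote" becomes "(Grote)"
    elif opens == 0 and closes == 1:
        fixed = "(" + author   # e.g. "Klug)" becomes "(Klug)"
    else:
        return author          # balanced, or no parentheses at all

    logger.warning("Corrected %r to %r", author, fixed)
    return fixed
```

Note that adding the parenthesis is itself an assumption: the correct form may have no parentheses at all, so every logged correction still needs to be checked against a trusted source.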
from gnparser.
Ta.
"I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis."
...or removing the single parenthesis, if the correct form (say, in GBIF's backbone) is no parentheses?
from gnparser.
I found 2 other issues related to author string parsing:
- If the author name includes an apostrophe (e.g., O’Donnell) some software editors replace the apostrophe with a curly apostrophe, which breaks author parsing:
With curly apostrophe:
https://parser.globalnames.org/?q=Ambaeolothrips+pampeanus+Mound%2C+Cavalleri%2C+O%E2%80%99Donnell%2C+Infante%2C+Ortiz+%26+Goldarazena%2C+2016
With regular apostrophe:
https://parser.globalnames.org/?q=Ambaeolothrips+pampeanus+Mound%2C+Cavalleri%2C+O%27Donnell%2C+Infante%2C+Ortiz+%26+Goldarazena%2C+2016
- Author first name initials that include hyphens break authorship parsing:
Xie Y-H, Yuan S-Y, Li Z-Y & Zhang H-R, 2013
Hyphenated:
https://parser.globalnames.org/?q=Ctenothrips+yangi+Xie+Y-H%2C+Yuan+S-Y%2C+Li+Z-Y+%26+Zhang+H-R%2C+2013
Hyphens removed:
https://parser.globalnames.org/?q=Ctenothrips+yangi+Xie+Y+H%2C+Yuan+S+Y%2C+Li+Z+Y+%26+Zhang+H+R%2C+2013
Removing the hyphens likely is not the proper way of formatting these author strings--I just removed the hyphens to show that the parser isn't handling hyphenated first names correctly.
from gnparser.
- Missing open parenthesis: I would say this opens a can of worms that I am afraid to deal with.
- Curly apostrophe sounds like a safe addition, +1 for adding it to the pre-processing stage.
- A hyphen without a period is something I haven't met before. Do you have many names like this, @gdower?
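Until the parser handles it, the curly apostrophe can be normalized on the caller's side before a name is submitted. The following Python sketch is an illustration only, not gnparser's own pre-processing; the function name `normalize_apostrophes` is hypothetical. It maps U+2019 (the character in the failing example) to a plain apostrophe, and also U+2018, which is an assumption about what else word processors may substitute.

```python
# Map typographic single quotes to the ASCII apostrophe (U+0027).
_APOSTROPHES = str.maketrans({
    "\u2019": "'",  # right single quotation mark, as in the O’Donnell example
    "\u2018": "'",  # left single quotation mark (assumed variant)
})


def normalize_apostrophes(name: str) -> str:
    """Return the name string with curly apostrophes replaced by "'"."""
    return name.translate(_APOSTROPHES)


name = "Ambaeolothrips pampeanus Mound, Cavalleri, O\u2019Donnell, 2016"
print(normalize_apostrophes(name))
```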
from gnparser.
Ta.
"I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis."
...or removing the single parenthesis, if the correct form (say, in GBIF's backbone) is no parentheses?
Authors in parentheses are the original ones, so I guess that if the opening parenthesis is missing, it is safe to assume that everything up to the start of the authorship belongs to the original authors. A missing closing parenthesis, however, is more dangerous to guess at.
from gnparser.
@dimus, of the author strings in CoL with hyphenated author given name initials, only around 2% don't include periods in the initials. Variants include:
Last A-B.
Last A-B
Last A.-B
A-B. Last
A-B Last
A.-B Last
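The variants above can be detected with a single regular expression. This is a rough Python sketch with hypothetical names (`HYPHENATED_INITIALS`, `find_hyphenated_initials`, `add_missing_periods`). It only covers ASCII uppercase initials, and rewriting to the fully punctuated form `A.-B.` is an assumption about what the parser accepts, not something confirmed in this thread.

```python
import re

# Initial, optional period, hyphen, initial, optional period, and no word
# character right after, so hyphenated surnames such as "Boza-Oviedo"
# are not matched. Only ASCII uppercase initials are covered.
HYPHENATED_INITIALS = re.compile(r"\b([A-Z])(\.?)-([A-Z])(\.?)(?!\w)")


def find_hyphenated_initials(author: str) -> list[str]:
    """Return every hyphenated pair of initials found in the string."""
    return [m.group(0) for m in HYPHENATED_INITIALS.finditer(author)]


def add_missing_periods(author: str) -> str:
    """Rewrite A-B, A-B. and A.-B as A.-B. (assumed target form)."""
    return HYPHENATED_INITIALS.sub(r"\1.-\3.", author)


authors = "Xie Y-H, Yuan S-Y, Li Z-Y & Zhang H-R, 2013"
print(find_hyphenated_initials(authors))
print(add_missing_periods(authors))
```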
I notified the data provider of the parentheses typos, and he corrected them.
Thanks for your updates!
from gnparser.
I added the missing-parenthesis cases to the Go parser: https://gitlab.com/gogna/gnparser/issues/40
It is part of v0.7.1.
from gnparser.
Closing it here; the issues are resolved in https://gitlab.com/gogna/gnparser/issues/28 and https://gitlab.com/gogna/gnparser/issues/40
from gnparser.