globalnamesarchitecture / gn_crossmap
Ruby Gem which crossmaps a list of scientific names to names from a data source in GN Index
License: MIT License
Probably an issue for GNresolver in general, but matching the name "Acanthurus chirurgus" gives a sensible result, with 2 possible names of equal score (0.988):
1,Acanthurus chirurgus,Canonical form exact match,Acanthurus chirurgus,"Acanthurus chirurgus (Bloch, 1787)",Acanthurus chirurgus,species,species,,,0,0.988,26846886
1,Acanthurus chirurgus,Canonical form exact match,Acanthurus chirurgus,"Acanthurus chirurgus (non Bloch, 1787)",Acanthurus chirurgus,species,species,synonym,"Acanthurus bahianus Castelnau, 1855",0,0.988,26831504
whereas matching the same name with scientificNameAuthorship = "(Linnaeus)" (which shouldn't match either hit) gives priority to the name that is a synonym (scores of 0.988 vs 0.75):
1,Acanthurus chirurgus,(Linnaeus),Canonical form exact match,Acanthurus chirurgus (Linnaeus),"Acanthurus chirurgus (non Bloch, 1787)",Acanthurus chirurgus,species,species,synonym,"Acanthurus bahianus Castelnau, 1855",0,0.988,26831504
1,Acanthurus chirurgus,(Linnaeus),Canonical form exact match,Acanthurus chirurgus (Linnaeus),"Acanthurus chirurgus (Bloch, 1787)",Acanthurus chirurgus,species,species,,,0,0.75,26846886
Since the return values from crossmap often include commas, the output fields are often quoted strings with embedded commas, making it rather tedious to parse the output without using a proper CSV parser. If it were possible to get tab-separated output, this would presumably be much easier, since none of the return fields are likely to contain a tab.
So could we have a switch (or perhaps even the default) to output with tabs as separators? Or perhaps you could simply use the same separator for output as was used in the input, which would reduce the number of switches needed.
On this topic, I don't know if the absence of tabs in the return fields is guaranteed (e.g. no tabs in the matchedName), but perhaps it should be, as I can't see any use for them, and it seems sensible to replace all runs of whitespace characters with a single, normal space.
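Until such a switch exists, the conversion can be done as a post-processing step. The following is only a sketch using Ruby's standard csv library; the helper name csv_to_tsv is hypothetical and is not part of gn_crossmap. It also collapses runs of whitespace inside each field, so a tab can only ever appear as a separator.

```ruby
require "csv"

# Hypothetical helper (not part of gn_crossmap): convert comma-separated
# output to tab-separated text. Runs of whitespace inside a field, tabs
# included, are collapsed to a single space.
def csv_to_tsv(csv_text)
  CSV.parse(csv_text).map do |row|
    # Empty unquoted fields are parsed as nil, hence to_s.
    row.map { |field| field.to_s.gsub(/\s+/, " ").strip }.join("\t")
  end.join("\n")
end

line = '1,Acanthurus chirurgus,Canonical form exact match,' \
       'Acanthurus chirurgus,"Acanthurus chirurgus (Bloch, 1787)",' \
       'Acanthurus chirurgus,species,species,,,0,0.988,26846886'
puts csv_to_tsv(line)
```

After this step each line can be split on tabs without a CSV parser, because the quoted comma in the matched name is no longer a separator.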
This is not supposed to happen if the header is generated by a computer program. But if the header is written by a human, it might happen often, as people tend to put a space after a comma.
During the ingestion phase we try to infer rank. If the rank has not been created, the program exits, as it tries to call an array method on nil.
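One way to tolerate such headers is to trim each header name before matching it. This is a minimal sketch with Ruby's standard csv library; normalize_headers is a hypothetical helper, not the gem's actual ingestion code.

```ruby
require "csv"

# Hypothetical helper (not the gem's actual code): split a header line and
# trim the whitespace people tend to leave after each separator.
def normalize_headers(header_line, col_sep = ",")
  CSV.parse_line(header_line, col_sep: col_sep).map { |h| h.to_s.strip }
end

p normalize_headers("taxonID, scientificName, scientificNameAuthorship")
```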
Connected to GlobalNamesArchitecture/gnlist-resolver-gui#77
If the headers contain none of the expected words, there is no error message explaining how to fix the situation.
This is not important, but for simple use it might be nice if the command-line crossmap script accepted stdin as an input file (by convention this is usually done as e.g. -i - rather than -i filename.csv). For example, I have a Dublin Core file with species names as one of the lines, and I was thinking I would pass it through a simple Perl script to pick out the names and feed them into crossmap. It's slightly more elegant if I can do this in a single pipeline, rather than having to save to intermediate files.
I only mention it as this is usually allowed by default by most command-line parsing libraries (e.g. argparse in Python; I'm not familiar with Ruby). So I'm surprised that it doesn't just come "for free" in crossmap.
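The "-" convention is easy to express in Ruby. The sketch below is an assumption about how it could be done, not crossmap's actual implementation; open_input is a hypothetical helper, and the stdin: keyword exists only so the example can run without blocking on a terminal.

```ruby
require "stringio"

# Hypothetical helper (not crossmap's actual code): treat "-" as standard
# input and anything else as a file path. Returns the block's value.
def open_input(path, stdin: $stdin)
  return yield(stdin) if path == "-"

  File.open(path, "r:utf-8") { |io| yield(io) }
end

# Demonstration with an in-memory stream standing in for real stdin.
fake_stdin = StringIO.new("taxonID,scientificName\n1,Abramis bjoerkna\n")
lines = open_input("-", stdin: fake_stdin) { |io| io.readlines(chomp: true) }
puts lines.length
```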
I will try an exponential moving average: https://stackoverflow.com/a/936720/23080
alpha = 0.1 # smoothing factor
...
speed = (speed * (1 - alpha)) + (currentSpeed * alpha)
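In Ruby the same update rule could look roughly like this. The class name SpeedEstimator is hypothetical, and seeding the estimate with the first sample is an assumption; the pseudocode above does not say how speed is initialised.

```ruby
# Hypothetical sketch of the smoothing rule above; not the gem's actual code.
class SpeedEstimator
  attr_reader :speed

  def initialize(alpha = 0.1)
    @alpha = alpha # smoothing factor, 0 < alpha <= 1
    @speed = nil   # no estimate until the first sample arrives
  end

  # Blend the newest measurement into the running estimate.
  def update(current_speed)
    @speed = if @speed.nil?
               current_speed.to_f # assumption: seed with the first sample
             else
               (@speed * (1 - @alpha)) + (current_speed * @alpha)
             end
  end
end

estimator = SpeedEstimator.new(0.1)
[120, 80, 100].each { |names_per_sec| estimator.update(names_per_sec) }
puts estimator.speed.round(2)
```

A small alpha makes the reported speed react slowly to a single fast or slow batch, which is the point of smoothing a progress estimate.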
Make it possible to get intermediate status for resolution and data preparation as well.
If ingest has many files to process, nothing appears in the logs for minutes or hours. The user should know that the program is working, and working as expected.
It would be possible with the preparsed workflow, but not when a scientific name is a single name-string.
At the moment, if I run the following input into crossmap -d 1, I get a 'canonical name exact match' rather than an 'exact match', because the author name isn't in braces.
taxonID,scientificNameAuthorship,scientificName
0,"Linnaeus, 1758",Abramis bjoerkna
So I have to do
taxonID,scientificNameAuthorship,scientificName
0,"(Linnaeus, 1758)",Abramis bjoerkna
Can you remove the requirement for the braces in the scientificNameAuthorship field? It would be nice if this could also match against "L., 1758", and maybe even "L. 1758"
The web GUI has functionality that allows users to change the headers supplied with a file to new headers compatible with gn_crossmap. These modified headers have to be used instead of the ones supplied with the file.
Scope of work
Connected to GlobalNamesArchitecture/gnlist-resolver-gui#78
I have been trying to chain crossmap instances together, to look up synonyms using CoL, then look up the returned (accepted) name from another provider (e.g. EoL / OpenTree / whatever). I do this using the acceptedName from the first call as the scientificName for the second.
However, this requires changing the column names between instances, so that acceptedName of one instance is treated as the scientificName of another (and also to change the output names so that the input fields in the first command aren't taken as input for the second command).
At the moment I'm switching the names using a simple perl regexp substitution (and blatting the scientificName fields in the initial input):
cat names.csv | crossmap -i - -o - | perl -pe 's/scientificName/input/g; s/acceptedName/scientificName/' | crossmap -i - -o - -d 179
but perhaps if this is a use case that might see more general usage, it might be worth having a switch which allows this sort of relabelling automatically?
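As an illustration of what such relabelling might do, here is a sketch using Ruby's standard csv library. The helper relabel_headers is hypothetical, and the header names are taken from this issue's text; they may not match crossmap's real output headers. Unlike a regexp substitution over the whole stream, it renames only exact header names in the first row, so data rows and longer names such as scientificNameAuthorship are left alone.

```ruby
require "csv"

# Hypothetical helper (not part of gn_crossmap): rename header fields.
# `mapping` maps old header names to new ones; unknown headers are kept.
def relabel_headers(csv_text, mapping)
  rows = CSV.parse(csv_text)
  return "" if rows.empty?

  # All renames are looked up against the original names, so swapping
  # two names does not cascade.
  rows[0] = rows[0].map { |header| mapping.fetch(header, header) }
  rows.map(&:to_csv).join
end

input = "taxonID,scientificName,acceptedName\n" \
        "1,Acanthurus chirurgus,\"Acanthurus bahianus Castelnau, 1855\"\n"
puts relabel_headers(input, "scientificName" => "input",
                            "acceptedName" => "scientificName")
```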
When people create data in Excel they leave comments, graphs, etc. in the 'margins', which makes such files invalid from a CSV standpoint. We still need to deal with such files if they contain information we require.
When users get back the result of crossmapping with all the fields they had originally, it helps them work with the new information: they can see it in the context of the fields of their file, and they can sort, exclude, include, etc. using these fields.
When used as a library -- collect and show the following results
for example
echo -e "taxonID,scientificName\n1,Acanthurus chirurgus" | crossmap -i - -o - -d 1
gives
1,Acanthurus chirurgus,Canonical form exact match,Acanthurus chirurgus,"Acanthurus chirurgus (Bloch, 1787)",Acanthurus chirurgus,species,species,,,0,0.988,26846886
1,Acanthurus chirurgus,Canonical form exact match,Acanthurus chirurgus,"Acanthurus chirurgus (non Bloch, 1787)",Acanthurus chirurgus,species,species,synonym,"Acanthurus bahianus Castelnau, 1855
So the first line is missing an acceptedName field. In this case I would like to default to the first line, since it is not a synonym. But I can't simply do that by filtering out lines with 'synonym' and looking for the "acceptedName" field, which I would like to do.
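As a client-side workaround, the missing value can be filled in by falling back to the matched name. The sketch below uses Ruby's standard csv library; the column positions are assumptions read off the complete sample rows quoted earlier on this page (no header row is shown), and both helper names are hypothetical.

```ruby
require "csv"

# Column positions are assumptions inferred from the sample rows,
# not documented indexes.
MATCHED_NAME   = 4
SYNONYM_STATUS = 8
ACCEPTED_NAME  = 9

# Use the accepted name when present, otherwise fall back to the matched name.
def accepted_name(row)
  accepted = row[ACCEPTED_NAME]
  (accepted.nil? || accepted.empty?) ? row[MATCHED_NAME] : accepted
end

# Prefer the first row that is not flagged as a synonym.
def preferred_row(rows)
  rows.find { |row| row[SYNONYM_STATUS].to_s.empty? } || rows.first
end

output = <<~CSV
  1,Acanthurus chirurgus,Canonical form exact match,Acanthurus chirurgus,"Acanthurus chirurgus (Bloch, 1787)",Acanthurus chirurgus,species,species,,,0,0.988,26846886
  1,Acanthurus chirurgus,Canonical form exact match,Acanthurus chirurgus,"Acanthurus chirurgus (non Bloch, 1787)",Acanthurus chirurgus,species,species,synonym,"Acanthurus bahianus Castelnau, 1855",0,0.988,26831504
CSV

rows = CSV.parse(output)
puts accepted_name(preferred_row(rows))
```

If crossmap itself filled acceptedName for non-synonyms, the fallback in accepted_name would become unnecessary and a plain filter on the synonym column would suffice.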
Tab creates far fewer artificial "columns" when people paste a list consisting of just name-strings and nothing else.