poisotlab / ncbitaxonomy.jl
Wrapper around the NCBI taxonomy files
Home Page: https://poisotlab.github.io/NCBITaxonomy.jl/stable/
License: MIT License
The _materialize_data function seems to be missing a case, as the unique names do not end up in the names.arrow file. This is going to become a problem quickly.
Phages are PHG, env are ENV
Hi! This is gonna be a long one.
Here are three viruses that all have exact matches in the NCBI taxonomy:
Adeno-associated virus - 3
Adeno-associated virus 3B
Adenovirus predict_adv-20
They're an interesting case study for what's going horribly wrong here. In theory, they should all be retrieved as exact matches. Two are, in fact, the same "species". For example, here is the same NCBI API call through taxize:
> classification(get_uid("Adeno-associated virus - 3"), db = "ncbi")
== 1 queries ===============
Retrieving data for taxon 'Adeno-associated virus - 3'
√ Found: Adeno-associated+virus+-+3
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`46350`
   name                                 rank         id
1  Viruses                              superkingdom 10239
2  Monodnaviria                         clade        2731342
3  Shotokuvirae                         kingdom      2732092
4  Cossaviricota                        phylum       2732415
5  Quintoviricetes                      class        2732422
6  Piccovirales                         order        2732534
7  Parvoviridae                         family       10780
8  Parvovirinae                         subfamily    40119
9  Dependoparvovirus                    genus        10803
10 Adeno-associated dependoparvovirus A species      1511891
11 Adeno-associated virus - 3           no rank      46350
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adeno-associated virus 3B"), db = "ncbi")
== 1 queries ===============
Retrieving data for taxon 'Adeno-associated virus 3B'
√ Found: Adeno-associated+virus+3B
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`68742`
   name                                 rank         id
1  Viruses                              superkingdom 10239
2  Monodnaviria                         clade        2731342
3  Shotokuvirae                         kingdom      2732092
4  Cossaviricota                        phylum       2732415
5  Quintoviricetes                      class        2732422
6  Piccovirales                         order        2732534
7  Parvoviridae                         family       10780
8  Parvovirinae                         subfamily    40119
9  Dependoparvovirus                    genus        10803
10 Adeno-associated dependoparvovirus A species      1511891
11 Adeno-associated virus - 3           no rank      46350
12 Adeno-associated virus 3B            no rank      68742
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adenovirus predict_adv-20"), db = "ncbi")
== 1 queries ===============
Retrieving data for taxon 'Adenovirus predict_adv-20'
√ Found: Adenovirus+predict_adv-20
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`2710954`
  name                      rank         id
1 Viruses                   superkingdom 10239
2 Varidnaviria              clade        2732004
3 Bamfordvirae              kingdom      2732005
4 Preplasmiviricota         phylum       2732008
5 Tectiliviricetes          class        2732529
6 Rowavirales               order        2732559
7 Adenoviridae              family       10508
8 unclassified Adenoviridae no rank      189831
9 Adenovirus PREDICT_AdV-20 species      2710954
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
Everything I'm going to describe is being run through an R script called jncbi() which is included below for convenience:
# Requires readr (write_csv, read_csv) and stringr (str_to_sentence).
jncbi <- function(spnames, type = 'host') {
  # Write the names to a temporary CSV that the Julia script reads.
  raw <- data.frame(Name = spnames)
  write_csv(raw, '~/Github/virion/Code_Dev/TaxonomyTempIn.csv', eol = "\n")
  # Run the Julia cleaning script that corresponds to the requested type.
  if(type == 'host') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/host.jl")}
  if(type == 'virus') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/virus.jl")}
  if(type == 'pathogen') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/pathogen.jl")}
  # Read the cleaned result back in and delete both temporary files.
  clean <- read_csv("~/Github/virion/Code_Dev/TaxonomyTempOut.csv")
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempIn.csv')
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempOut.csv')
  # Convert both name columns to sentence case before returning.
  clean$Name <- stringr::str_to_sentence(clean$Name)
  clean$match <- stringr::str_to_sentence(clean$match)
  return(clean)
}
It doesn't really change anything about the attributes. It just writes the names out to a file for Julia to clean and reads the result back in.
Here are some contrasting results of virus.jl on different kinds of input.
When I pass 8,632 viruses through jncbi, 4,968 come back NA (no match) and 273 come back fuzzy matches (3,419 exact matches). (A file to reproduce this is attached. I'm only including these stats because I think they're probably relevant to our understanding of how big this bug is.) The results are concerning:
Name                       matched match                      taxid
adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
adeno-associated virus 3B  NA      NA                         NA
adenovirus PREDICT_AdV-20  NA      NA                         NA
> jncbi(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20"), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:01
-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
Name = col_character(),
matched = col_logical(),
match = col_character(),
taxid = col_double()
)
# A tibble: 3 x 4
  Name                       matched match                      taxid
  <chr>                      <lgl>   <chr>                      <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
2 Adeno-associated virus 3b  NA      NA                         NA
3 Adenovirus predict_adv-20  NA      NA                         NA
2B. THOSE THREE VALUES (LOWERCASE)
> jncbi(str_to_lower(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20")), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:02
-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
Name = col_character(),
matched = col_logical(),
match = col_character(),
taxid = col_double()
)
# A tibble: 3 x 4
  Name                       matched match                      taxid
  <chr>                      <lgl>   <chr>                      <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
2 Adeno-associated virus 3b  FALSE   Adeno-associated virus 3a  1406223
3 Adenovirus predict_adv-20  FALSE   Adenovirus predict_adv-20  2710954
==========================
I haven't included the output, but if you str_to_lower the virus names for the entire list before passing them in, the number of no-matches drops significantly and the runtime goes from about 5 minutes to about 30 minutes. That confirms capitalization is part (but not all) of the issue.
So there are two separate problems that need to be debugged.
1. Capitalization appears to be making everything wonky. I don't want to solve this on the R end, given that there are capitalization changes in pathogen.jl. I think you can probably solve this by revisiting that script. (When you do, please do not turn it back into a generic script for both hosts and viruses.)
2. Adeno-associated virus 3B and Adenovirus PREDICT_AdV-20 should have exact (matched = TRUE) matches in the NCBI taxonomy. Both instead fall through to fuzzy matching. The first fuzzy match is wrong, while the second fuzzy match is actually the correct exact match, and the strings returned are identical (no differences in spacing, as far as I can tell).
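To make the first problem concrete, here is a minimal sketch of a case-insensitive exact lookup. It is written in Python purely for illustration and is not how virus.jl or NCBITaxonomy.jl work internally. The names and taxids come from the taxize output quoted above; FOLDED_INDEX and exact_match are hypothetical names.

```python
# Names and taxids copied from the taxize output quoted in this issue.
NCBI_NAMES = {
    "Adeno-associated virus - 3": 46350,
    "Adeno-associated virus 3B": 68742,
    "Adenovirus PREDICT_AdV-20": 2710954,
}

# Hypothetical index: case-folded name -> canonical NCBI spelling, built once.
FOLDED_INDEX = {name.casefold(): name for name in NCBI_NAMES}


def exact_match(query):
    """Return (canonical name, taxid) for a case-insensitive exact match, else None."""
    canonical = FOLDED_INDEX.get(query.strip().casefold())
    if canonical is None:
        return None
    return canonical, NCBI_NAMES[canonical]


if __name__ == "__main__":
    for query in ("adeno-associated virus 3b", "Adenovirus predict_adv-20"):
        print(query, "->", exact_match(query))
```

With an index like this, a query would only need to fall through to fuzzy matching when the folded lookup fails. One caveat: two names that differ only in case would collide in the folded index, so a real implementation would have to decide how to handle that.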
Hey NCBITaxonomy.jl maintainers 👋 (@tpoisot mostly, I suppose? :))
With a group of people we're currently working on reviewing tools for taxonomic name harmonization for ecologists.
We're mostly focused on R packages, as R is the programming language most used by ecologists.
However, we would like to have a section about tools that exist in other languages. As such, we'd like to mention NCBITaxonomy.jl, but I'm not familiar with citation practices for Julia modules.
How should I cite the module? Do you know any other modules that may be relevant for taxonomic name harmonization?
Thanks :)
Is your feature request related to a problem? Please describe.
Currently, the build step is storing data in the package repo - https://github.com/JuliaPackaging/Scratch.jl might be a preferable alternative.
Additional context
This will limit compatibility to 1.5, which is not really an issue.
Need to change the build and load steps.
Is your feature request related to a problem? Please describe.
It would be good to allow all distance functions from StringDistances
Describe the solution you'd like
Additional keyword argument to taxid (and namefinder)
Ideally, this can take the form of code you would like to write:
using NCBITaxonomy
taxid("Box turus"; fuzzy=true, d=StringDistances.JaroWinkler)
Additional context
Work ongoing on main.
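As a rough illustration of what a swappable distance function buys you, here is a minimal Python sketch. It is not the StringDistances or NCBITaxonomy.jl API: levenshtein, ratio_distance, and closest_name are hypothetical names, and the candidate names are simply species mentioned elsewhere on this page.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Classic edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ratio_distance(a, b):
    """1 - difflib similarity ratio, so 0.0 means the strings are identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def closest_name(query, names, distance=levenshtein):
    """Return the candidate that minimises the chosen distance to the query."""
    return min(names, key=lambda name: distance(query, name))


if __name__ == "__main__":
    candidates = ["Bos taurus", "Mus musculus", "Homo sapiens"]
    print(closest_name("Box turus", candidates))
    print(closest_name("Box turus", candidates, distance=ratio_distance))
```

The point of the design is that the matching loop never needs to know which metric it is using; the caller passes any function that takes two strings and returns a number.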
A "moonshot" idea I had for this library would be implementing rudimentary natural-language-processing (NLP) methods for processing taxon identifiers.
As an example, if the input contains ["A. p. aciculatus", "ponderosa pine", "Agelaius phoeniceus", "A. phoeniceus californicus", "red winged blackbird", "Agelaius xanthomus", "Pinus ponderosa", "P. ponderosa"], we would want a cleaning function to return the NCBI ids associated with the coarsest resolution id, e.g. ["Agelaius", "Pinus ponderosa"].
Clearly a false positive here could be analysis-breaking, so reporting some degree of confidence in each resolved species label would also be necessary.
Just something to ruminate on
What to do?
Make namefinders for the main divisions.
Why?
This is going to be what users do anyways.
Any ideas how?
Uh, yeah.
taxon("gorilla")
- add a keyword to prefer scientific names
Describe the bug
julia> ncbi"Reptilia"
Testudines (ncbi:8459)
Expected behavior
An array of names? A warning? Both?
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
What to do?
When the package is loaded, check the date of the taxonomy
Why?
No need to download the taxonomy every few days for minor changes
Any ideas how?
Do something on load and rebuild if needed
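One way to read "do something on load" is a simple age check on the local taxonomy file. The sketch below is in Python for illustration only; the 30-day threshold and the needs_rebuild name are assumptions, since the issue does not say how old is too old or how the package stores its files.

```python
import os
import time

MAX_AGE_DAYS = 30  # assumed threshold; the issue does not specify one


def needs_rebuild(path, max_age_days=MAX_AGE_DAYS, now=None):
    """Return True when the file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400  # seconds per day
    return age_days > max_age_days


if __name__ == "__main__":
    # A path that does not exist always triggers a rebuild.
    print(needs_rebuild("no-such-taxonomy-file.arrow"))
```

A check like this runs once at load time and only triggers a download when it returns True, which avoids fetching the taxonomy every few days for minor changes.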
Is your feature request related to a problem? Please describe.
It would save a lot of time to build a custom namefinder when we know what we are likely to find.
Describe the solution you'd like
using NCBITaxonomy
# ... ids are a list of taxa or IDs
namefinder(ids)("Speciosus sp.")
Additional context
All of the internals are in place, all that is needed is to write a wrapper.
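The wrapper idea can be sketched as a closure over a restricted pool of names. This is an illustrative Python sketch, not the package's internals: make_namefinder is a hypothetical name, the fuzzy step uses difflib rather than whatever the package uses, and the example names and taxids are taken from the taxize output quoted elsewhere on this page.

```python
from difflib import get_close_matches


def make_namefinder(names):
    """Build a finder that only searches the given name -> taxid mapping."""
    pool = dict(names)

    def finder(query, cutoff=0.6):
        # Try an exact match first, then fall back to the closest name in the pool.
        if query in pool:
            return query, pool[query]
        close = get_close_matches(query, list(pool), n=1, cutoff=cutoff)
        if close:
            return close[0], pool[close[0]]
        return None

    return finder


if __name__ == "__main__":
    finder = make_namefinder({
        "Adeno-associated virus - 3": 46350,
        "Adeno-associated virus 3B": 68742,
    })
    print(finder("Adeno-asociated virus 3B"))
```

Because the pool is fixed when the finder is created, each lookup only scans the names you expect to see, which is where the time saving would come from.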
Is your feature request related to a problem? Please describe.
There is no way to know the rank of a taxon, which is problematic when building namefinders.
Describe the solution you'd like
using NCBITaxonomy
rank(ncbi"Vulpes vulpes")
Describe alternatives you've considered
None - the only alternative is to do a lookup in the nodes_table manually, which is what the function would do.
Hi!
First, thank you very much for this package! It has been handy for me.
I think that it would be great to have an integration with Phylo.jl to easily get the common taxonomic tree for a set of taxa as in the NCBI site. That could be really useful for visualization and exploration.
Ideally, this can be a single function, similar to lineage and most probably depending on it, that also accepts stop_at but takes a list/set of NCBITaxons. For example:
using NCBITaxonomy
common_tree([ncbi"Bos taurus", ncbi"Mus musculus", ncbi"Homo sapiens"], stop_at=ncbi"Mammalia")
Best regards,
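For what it's worth, the merging step behind such a function is small once the lineages are available. Below is a minimal Python sketch under that assumption; it does not use NCBITaxonomy.jl or Phylo.jl. The common_tree implementation and the nested-dict representation are illustrative choices, and the lineages are abbreviated from the taxize output quoted elsewhere on this page (several ranks are left out for brevity).

```python
def common_tree(lineages, stop_at=None):
    """Merge root-to-tip lineages into one nested dict, optionally rooted at stop_at."""
    tree = {}
    for lineage in lineages:
        if stop_at is not None:
            if stop_at not in lineage:
                raise ValueError(f"{stop_at!r} is not in lineage {lineage!r}")
            # Drop everything above the requested root.
            lineage = lineage[lineage.index(stop_at):]
        node = tree
        for name in lineage:
            # Reuse the child if another lineage already created it.
            node = node.setdefault(name, {})
    return tree


# Abbreviated lineages (some intermediate ranks omitted).
AAV3 = ["Viruses", "Parvoviridae", "Dependoparvovirus",
        "Adeno-associated dependoparvovirus A", "Adeno-associated virus - 3"]
AAV3B = AAV3 + ["Adeno-associated virus 3B"]

if __name__ == "__main__":
    print(common_tree([AAV3, AAV3B], stop_at="Dependoparvovirus"))
```

Shared ancestors collapse into a single branch because setdefault returns the existing child when two lineages pass through the same name. Converting the nested dict into a Phylo.jl tree would be a separate step.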
A lot of virus names are not the same as their matched names! For example:
> classification(get_uid("adeno-associated virus 3b"))
== 1 queries ===============
Retrieving data for taxon 'adeno-associated virus 3b'
√ Found: adeno-associated+virus+3b
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`68742`
   name                                 rank         id
1  Viruses                              superkingdom 10239
2  Monodnaviria                         clade        2731342
3  Shotokuvirae                         kingdom      2732092
4  Cossaviricota                        phylum       2732415
5  Quintoviricetes                      class        2732422
6  Piccovirales                         order        2732534
7  Parvoviridae                         family       10780
8  Parvovirinae                         subfamily    40119
9  Dependoparvovirus                    genus        10803
10 Adeno-associated dependoparvovirus A species      1511891
11 Adeno-associated virus - 3           no rank      46350
12 Adeno-associated virus 3B            no rank      68742
Technically this concerns the VIRION workflow and not NCBITaxonomy.jl, but when you have a second, could you also expand pathogen.jl to return the species separately from the matched name?
In some cases, it might be a good idea to allow regex search - this is pretty easy to do
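A minimal sketch of what a regex search over a list of names could look like, in Python for illustration. regex_search is a hypothetical name and the candidate names are the three viruses quoted elsewhere on this page; nothing here reflects how the package stores or searches its names.

```python
import re


def regex_search(pattern, names, ignore_case=True):
    """Return every name in which the pattern matches somewhere."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [name for name in names if compiled.search(name)]


VIRUS_NAMES = [
    "Adeno-associated virus - 3",
    "Adeno-associated virus 3B",
    "Adenovirus PREDICT_AdV-20",
]

if __name__ == "__main__":
    print(regex_search(r"^adeno-associated", VIRUS_NAMES))
```

Unlike exact or fuzzy matching, this naturally returns several hits, so the caller has to decide what to do with more than one result.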
Is your feature request related to a problem? Please describe.
It would be cool to get the closest names rather than a single match.
Describe the solution you'd like
using NCBITaxonomy
similar_names("Box taurus"; ...)
Add a way to get an array of NCBITaxon with identical names in response to a multiple-match exception.