poisotlab / ncbitaxonomy.jl

Wrapper around the NCBI taxonomy files

Home Page: https://poisotlab.github.io/NCBITaxonomy.jl/stable/

License: MIT License

Julia 100.00%

ncbi-taxonomy ncbi biodiversity biodiversity-data taxonomy verena ivado

ncbitaxonomy.jl's Issues

The unique name title is not populated when reading to the names.arrow file

The _materialize_data seems to be missing a case, as the unique names do not
end up in the names.arrow file. This is going to be a problem rapidly.

New divisions

Phages are PHG, env are ENV

> 50% of queries are returning NA; and, separately, exact matches aren't returning where appropriate

Hi! This is gonna be a long one.

Here are three viruses that all have exact matches in the NCBI taxonomy:
Adeno-associated virus - 3
Adeno-associated virus 3B
Adenovirus predict_adv-20

They're an interesting case study for what's going horribly wrong here. In theory, they should all be retrieved as exact matches. Two are, in fact, the same "species". For example, the same NCBI API call through taxize:

> classification(get_uid("Adeno-associated virus - 3"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adeno-associated virus - 3'

√  Found:  Adeno-associated+virus+-+3
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`46350`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adeno-associated virus 3B"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adeno-associated virus 3B'

√  Found:  Adeno-associated+virus+3B
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`68742`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350
12            Adeno-associated virus 3B      no rank   68742

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

> classification(get_uid("Adenovirus predict_adv-20"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adenovirus predict_adv-20'

√  Found:  Adenovirus+predict_adv-20
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`2710954`
                       name         rank      id
1                   Viruses superkingdom   10239
2              Varidnaviria        clade 2732004
3              Bamfordvirae      kingdom 2732005
4         Preplasmiviricota       phylum 2732008
5          Tectiliviricetes        class 2732529
6               Rowavirales        order 2732559
7              Adenoviridae       family   10508
8 unclassified Adenoviridae      no rank  189831
9 Adenovirus PREDICT_AdV-20      species 2710954

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

Everything I'm going to describe is being run through an R script called jncbi() which is included below for convenience:

library(readr)  # provides write_csv() and read_csv()

jncbi <- function(spnames, type = 'host') {
  # Write the names to a temporary CSV that the Julia scripts read
  raw <- data.frame(Name = spnames)
  write_csv(raw, '~/Github/virion/Code_Dev/TaxonomyTempIn.csv', eol = "\n")
  
  # Run the Julia script for the requested kind of taxon
  if(type == 'host') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/host.jl")}
  if(type == 'virus') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/virus.jl")}
  if(type == 'pathogen') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/pathogen.jl")}
  
  # Read the matches back in, then remove both temporary files
  clean <- read_csv("~/Github/virion/Code_Dev/TaxonomyTempOut.csv")
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempIn.csv')
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempOut.csv')
  
  # Sentence-case both name columns (lowercases everything after the first letter)
  clean$Name <- stringr::str_to_sentence(clean$Name)
  clean$match <- stringr::str_to_sentence(clean$match)
  return(clean)
}

The wrapper doesn't really change anything about the attributes. It just writes the names out to a file for Julia to clean and reads the result back in.

Here are some contrasting results of virus.jl on different kinds of input.

1. A BIG LIST

When I pass 8,632 viruses through jncbi, 4,968 come back NA (no match), 273 come back as fuzzy matches, and 3,419 come back as exact matches. (A file to reproduce this is attached. I'm only including these stats because they're probably relevant to understanding how big this bug is.) The results for the three example viruses are concerning:

Name                        matched  match                       taxid
adeno-associated virus - 3  TRUE     Adeno-associated virus - 3  46350
adeno-associated virus 3B   NA       NA                          NA
adenovirus PREDICT_AdV-20   NA       NA                          NA

2A. JUST THOSE THREE VALUES

> jncbi(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20"), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:01

-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  matched = col_logical(),
  match = col_character(),
  taxid = col_double()
)

# A tibble: 3 x 4
  Name                       matched match                      taxid
  <chr>                      <lgl>   <chr>                      <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
2 Adeno-associated virus 3b  NA      NA                            NA
3 Adenovirus predict_adv-20  NA      NA                            NA

2B. THOSE THREE VALUES (LOWERCASE)

> jncbi(str_to_lower(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20")), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:02

-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  matched = col_logical(),
  match = col_character(),
  taxid = col_double()
)

# A tibble: 3 x 4
  Name                       matched match                        taxid
  <chr>                      <lgl>   <chr>                        <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3   46350
2 Adeno-associated virus 3b  FALSE   Adeno-associated virus 3a  1406223
3 Adenovirus predict_adv-20  FALSE   Adenovirus predict_adv-20  2710954

==========================
I haven't included the output, but if you str_to_lower the virus names before passing the entire list, the number of non-matches drops significantly and the runtime goes from about 5 minutes to about 30 minutes. That confirms capitalization is part (but not all) of the issue.

So there are two separate problems that need to be debugged.

1. Capitalization appears to be making everything wonky. I don't want to fix this on the R end, given that there are capitalization changes in pathogen.jl; I think you can probably solve this by revisiting that script. (When you do, please do not turn it back into a generic script for both hosts and viruses.)
2. Adeno-associated virus 3B and Adenovirus PREDICT_AdV-20 should have exact (matched = TRUE) matches in the NCBI taxonomy. Both instead fall through to fuzzy matching. The first fuzzy match is wrong, while the second fuzzy match is actually the correct exact match, and the strings returned are identical (no differences in spacing, as far as I can tell).
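One way to rule capitalization out before falling back to fuzzy matching is to compare names case-insensitively. The sketch below is a minimal illustration in plain Julia, not the actual logic of NCBITaxonomy.jl or pathogen.jl: the `NAMES` table, `LOWERCASE_INDEX`, and `exact_match` are hypothetical, and the taxonomic IDs are the ones taxize reported above.

```julia
# Hypothetical in-memory names table (name => taxid), using the IDs quoted above.
const NAMES = Dict(
    "Adeno-associated virus - 3" => 46350,
    "Adeno-associated virus 3B" => 68742,
    "Adenovirus PREDICT_AdV-20" => 2710954,
)

# Index the table by lowercased name so that lookups ignore capitalization,
# while keeping the original spelling and the ID as the value.
const LOWERCASE_INDEX = Dict(lowercase(name) => (name, id) for (name, id) in NAMES)

# Return (matched name, taxid) for a case-insensitive exact match, or `nothing`.
exact_match(query::AbstractString) =
    get(LOWERCASE_INDEX, lowercase(strip(query)), nothing)

exact_match("adeno-associated virus 3b")
exact_match("Adenovirus predict_adv-20")
```

With an index like this, both problem names resolve to their own IDs regardless of case, so only names that are genuinely absent would need to reach the fuzzy matcher.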

Citing NCBITaxonomy.jl

Hey NCBITaxonomy.jl maintainers 👋 (@tpoisot mostly, I suppose? :))
With a group of people, we're currently reviewing tools for taxonomic name harmonization for ecologists.
We're mostly focused on R packages, as R is the most used programming language among ecologists.

However, we would like to have a section about tools that exist in other languages. As such, we'd like to mention NCBITaxonomy.jl, but I'm not familiar with citation practices for Julia modules.
How should I cite the module? Do you know of any other modules that may be relevant for taxonomic name harmonization?

Thanks :)

Use Scratch.jl

Is your feature request related to a problem? Please describe.
Currently, the build step is storing data in the package repo - https://github.com/JuliaPackaging/Scratch.jl might be a preferable alternative.

Additional context
This will limit compatibility to 1.5, which is not really an issue.

Documentation update for the next release

Check docstrings
Examples

Use a central location for the taxonomy and default to PKG if not set

Need to change the build and load steps.

Allow arbitrary distance function

Is your feature request related to a problem? Please describe.
It would be good to allow all distance functions from StringDistances

Describe the solution you'd like
Additional keyword argument to taxid (and namefinder)

Ideally, this can take the form of code you would like to write:

using NCBITaxonomy
using StringDistances
taxid("Box turus"; fuzzy=true, d=StringDistances.JaroWinkler)

Additional context
Work ongoing on main.
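A minimal sketch of what a pluggable distance could look like, assuming nothing about the package internals: `closest_name` and its `distance` keyword are hypothetical, and a hand-written Levenshtein distance stands in for the metrics that StringDistances would provide.

```julia
# Levenshtein edit distance between two strings (row-by-row dynamic programming).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    previous = collect(0:length(t))          # distances from the empty prefix of `a`
    for i in eachindex(s)
        current = [i; zeros(Int, length(t))]
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            current[j + 1] = min(previous[j + 1] + 1,   # deletion
                                 current[j] + 1,        # insertion
                                 previous[j] + cost)    # substitution
        end
        previous = current
    end
    return previous[end]
end

# Hypothetical matcher: any function of two strings that returns a number
# can be passed through the `distance` keyword.
function closest_name(query, candidates; distance=levenshtein)
    scores = [distance(lowercase(query), lowercase(name)) for name in candidates]
    return candidates[argmin(scores)]
end

closest_name("Box turus", ["Bos taurus", "Mus musculus", "Homo sapiens"])
```

Taking the distance as a keyword keeps the default behavior unchanged while letting callers swap in another metric without touching the matcher.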

[out there] processing taxon identifiers via NLP

A "moonshot" idea I had for this library would be implementing rudimentary natural-language-processing (NLP) methods for processing taxon identifiers.

As an example, if the input contains ["A. p. aciculatus", "ponderosa pine", "Agelaius phoeniceus", "A. phoeniceus californicus", "red winged blackbird", "Agelaius xanthomus", "Pinus ponderosa", "P. ponderosa"] we would want a cleaning function to return ids in NCBI associated with the coarsest resolution id, e.g. ["Agelaius", "Pinus ponderosa"]

Clearly, a false positive here could be analysis-breaking, so reporting some degree of confidence in each resolved species label would also be necessary.

Just something to ruminate on

Use Preferences.jl to store a single taxo db

Standard namefinders

What to do?

Make namefinders for the main divisions.

Why?

This is going to be what users do anyways.

Any ideas how?

Uh, yeah.

Issue with lowercase search conflicting with vernaculars

taxon("gorilla") - add a keyword to prefer scientific names

In-part names only return the first element

Describe the bug

julia> ncbi"Reptilia"
Testudines (ncbi:8459)

Expected behavior

An array of names? A warning? Both?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Add a freshness parameter

What to do?
When the package is loaded, check the date of the taxonomy

Why?
No need to download the taxonomy every few days for minor changes

Any ideas how?
Do something on load and rebuild if needed
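The freshness check could be as small as comparing a file's modification time against a maximum age. This is a sketch under assumptions, not the package's build logic: `is_stale`, `max_age_days`, and `current_time` are hypothetical names, and the path to the taxonomy file is left to the caller.

```julia
# Return true when the file is missing or older than `max_age_days`.
# `current_time` is a keyword only so the check can be exercised without waiting.
function is_stale(path::AbstractString; max_age_days::Real=30, current_time::Real=time())
    isfile(path) || return true
    age_in_seconds = current_time - mtime(path)
    return age_in_seconds > max_age_days * 24 * 60 * 60
end

# Example with a freshly created temporary file standing in for the taxonomy.
path, io = mktemp()
close(io)
is_stale(path)                                        # just created, so not stale
is_stale(path; current_time=time() + 31 * 24 * 3600)  # pretend 31 days have passed
```

A load-time hook would then only trigger a download and rebuild when `is_stale` returns true.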

Create a namefinder from a list of IDs

Is your feature request related to a problem? Please describe.
It would save a lot of time to build a custom namefinder when we know what we are likely to find.

Describe the solution you'd like

using NCBITaxonomy

# ... ids are a list of taxa or IDs

namefinder(ids)("Speciosus sp.")

Additional context
All of the internals are in place, all that is needed is to write a wrapper.

rank function for a taxonomic ID

Is your feature request related to a problem? Please describe.
There is no way to know the rank of a taxon, which is problematic when building namefinders.

Describe the solution you'd like

using NCBITaxonomy

rank(ncbi"Vulpes vulpes")

Describe alternatives you've considered
None - the only alternative is to do a lookup in the nodes_table manually, which is what the function would do.

Taxonomic tree

Hi!

First, thank you very much for this package! It has been handy for me.

I think that it would be great to have an integration with Phylo.jl to easily get the common taxonomic tree for a set of taxa as in the NCBI site. That could be really useful for visualization and exploration.

Ideally, this can be a single function, similar to lineage and most probably depending on that, that also accepts stop_at, but that takes a list/set of NCBITaxons. For example:

using NCBITaxonomy

common_tree([ncbi"Bos taurus", ncbi"Mus musculus", ncbi"Homo sapiens"], stop_at=ncbi"Mammalia")

Best regards,
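As a rough illustration of the first step such a function would need, the sketch below finds the part of the lineage shared by several taxa. It is plain Julia and does not use NCBITaxonomy.jl or Phylo.jl: `shared_lineage` is a hypothetical helper, and the lineages are copied from the taxize output quoted earlier on this page.

```julia
# Lineages (root first) copied from the taxize output quoted on this page.
aav3 = ["Viruses", "Monodnaviria", "Shotokuvirae", "Cossaviricota",
        "Quintoviricetes", "Piccovirales", "Parvoviridae", "Parvovirinae",
        "Dependoparvovirus", "Adeno-associated dependoparvovirus A",
        "Adeno-associated virus - 3"]
aav3b = [aav3; "Adeno-associated virus 3B"]
adv20 = ["Viruses", "Varidnaviria", "Bamfordvirae", "Preplasmiviricota",
         "Tectiliviricetes", "Rowavirales", "Adenoviridae",
         "unclassified Adenoviridae", "Adenovirus PREDICT_AdV-20"]

# Walk all lineages in parallel from the root and keep the taxa they agree on.
function shared_lineage(lineages)
    shared = eltype(first(lineages))[]
    for taxa in zip(lineages...)
        all(==(first(taxa)), taxa) || break
        push!(shared, first(taxa))
    end
    return shared
end

last(shared_lineage([aav3, aav3b]))   # deepest taxon shared by the two AAV entries
shared_lineage([aav3, adv20])         # these two only share the root
```

The last element of the shared prefix is the most recent common ancestor; a full common tree would attach each taxon's remaining lineage below that node.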

pathogen.jl extended to also return species ranks where different than matched names

A lot of virus names are not the same as their matched names! For example:

> classification(get_uid("adeno-associated virus 3b"))
==  1 queries  ===============

Retrieving data for taxon 'adeno-associated virus 3b'

√  Found:  adeno-associated+virus+3b
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`68742`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350
12            Adeno-associated virus 3B      no rank   68742

Technically this is about the VIRION workflow and not NCBITaxonomy.jl, but when you have a second, could you also expand pathogen.jl to return the species separately from the matched name?

RegEx search

In some cases, it might be a good idea to allow regex search - this is pretty easy to do
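A minimal sketch of the idea using only Julia's built-in regular expressions. `search_names` is a hypothetical helper rather than part of the package, and the names are the ones that appear elsewhere on this page.

```julia
# Candidate names taken from the taxize output quoted on this page.
virus_names = [
    "Adeno-associated virus - 3",
    "Adeno-associated virus 3B",
    "Adenovirus PREDICT_AdV-20",
    "Adeno-associated dependoparvovirus A",
]

# Keep every name that the pattern matches anywhere in the string.
search_names(pattern::Regex, candidates) =
    filter(name -> occursin(pattern, name), candidates)

# The `i` flag makes the pattern case-insensitive.
search_names(r"^adeno-associated virus"i, virus_names)
```

Because the pattern is an ordinary `Regex`, anchors and flags come for free, which also sidesteps the capitalization problems reported above.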

Return the nearest names

Is your feature request related to a problem? Please describe.
It would be cool to get the closest names rather than a single match.

Describe the solution you'd like

using NCBITaxonomy

similar_names("Box taurus"; ...)
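One possible shape for this, sketched in plain Julia. The signature is an assumption (the request above leaves the arguments open): here `similar_names` takes the candidates and a distance function explicitly, and `char_mismatches` is a crude stand-in for a real string distance.

```julia
# Crude stand-in distance: position-wise character mismatches plus the length difference.
char_mismatches(a, b) =
    count(((x, y),) -> x != y, zip(a, b)) + abs(length(a) - length(b))

# Hypothetical helper: return the `limit` closest candidates as `name => distance`
# pairs, ordered from nearest to farthest.
function similar_names(query, candidates, distance; limit::Integer=3)
    ranked = sort([name => distance(query, name) for name in candidates]; by=last)
    return ranked[1:min(limit, length(ranked))]
end

similar_names("Box taurus", ["Mus musculus", "Bos mutus", "Bos taurus"],
              char_mismatches; limit=2)
```

Returning the scores alongside the names lets the caller decide where to cut off, instead of trusting a single best match.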

Function to disambiguate names

Add a way to get an array of NCBITaxon with identical names in response to multiple match exception.
