Comments (13)

d4straub commented on June 19, 2024

I am benchmarking currently, hope that will shed some light on it.

from ampliseq.

d4straub commented on June 19, 2024

Hi there,

I actually never looked into that. Therefore I am not aware of a solution or whether it could be implemented. Maybe someone else could chime in.

For more context, could you describe why that would help you and what you would gain? For example, what's your use case? Might it be of more general interest as well?

annotatebio commented on June 19, 2024

Thanks for the swift reply! The reason for the request is that when comparing 16S results from databases like SILVA with whole-genome-based methods (which use NCBI Taxonomy), one stumbles upon outdated taxonomic labels in SILVA and discrepancies between OTUs and genomic taxonomy.

Since SILVA stores information on 16S gene primary accession in GenBank (and from what I see some of the ampliseq database files do that too), it is possible to use it for finding what's the NCBI taxonomy assigned to the gene - which is likely more up to date and in line with whole genomes' taxonomy.
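To make the idea concrete, here is a minimal Python sketch of the lookup. It assumes SILVA-style FASTA headers of the form `>ACCESSION.START.STOP taxonomy;path` and a tab-separated accession-to-taxid table with `accession` and `taxid` columns (as in NCBI's accession2taxid dumps). The helper names, the accession, and the taxid below are made up for illustration, and the ampliseq database files may be formatted differently.

```python
import csv
import io


def primary_accession(silva_header: str) -> str:
    """Return the primary accession from a SILVA-style FASTA header.

    Assumes the sequence ID has the form ACCESSION.START.STOP.
    """
    seq_id = silva_header.lstrip(">").split()[0]  # e.g. 'AB000001.1.1500'
    return seq_id.split(".")[0]                   # e.g. 'AB000001'


def load_accession2taxid(handle) -> dict:
    """Read a tab-separated table with 'accession' and 'taxid' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession"]: int(row["taxid"]) for row in reader}


# Toy data: the accession and taxid are placeholders, not real records.
mapping_tsv = (
    "accession\taccession.version\ttaxid\tgi\n"
    "AB000001\tAB000001.1\t12345\t0\n"
)
acc2taxid = load_accession2taxid(io.StringIO(mapping_tsv))

header = ">AB000001.1.1500 Bacteria;Firmicutes;Bacilli"
acc = primary_accession(header)
print(acc, acc2taxid.get(acc))  # taxid is None if the accession is not in the table
```

The taxid would then still have to be resolved to a lineage with an NCBI taxonomy dump or a similar resource, which is not shown here.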

d4straub commented on June 19, 2024

I see. The outdated taxonomies could probably be improved that way.
Regarding comparing it with whole genome-based methods, one could either use GTDB via --dada_ref_taxonomy gtdb (the corresponding shotgun metagenomics assembly classifier can be used in e.g. nf-core/mag) or Kraken2 with the standard database using --kraken2_ref_taxonomy standard (which seems to work just fine in preliminary benchmarks).

annotatebio commented on June 19, 2024

Great recommendations!

I've launched GTDB-based classification with DADA2. However, I see that some crucial taxa are not classified beyond the phylum/class/family level, even though the same ASV gets a genus/species assignment with SILVA. It seems to me that the GTDB database has different content at the sequence level than the default SILVA, and hence the classification results differ?

Would it be the same case for Kraken2 database?

d4straub commented on June 19, 2024

Databases are non-trivial to compare, so if you do not find a "crucial" taxon with one database, turn to another.
Another reason could be that classification with DADA2 is not always the same: taxa close to the cutoff can fall below or rise above it because of tiny numerical differences (these are outside my control, and a fixed seed doesn't help). So running the same taxonomic classification with DADA2 multiple times can lead to missing or added taxonomic levels when the confidence is close to the threshold.
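A toy Python simulation can illustrate why assignments near the cutoff flip between runs. It is not DADA2's code and does not model the actual source of the variation. It only assumes that a taxonomic level is kept when at least 50 of 100 bootstrap replicates agree (DADA2's assignTaxonomy exposes such a cutoff as minBoot), and it uses a seed purely so the illustration is repeatable.

```python
import random

N_BOOT = 100   # assumed number of bootstrap replicates per sequence
MIN_BOOT = 50  # assumed cutoff, analogous to minBoot in DADA2's assignTaxonomy


def bootstrap_support(p_hit: float, rng: random.Random) -> int:
    """Count how many of N_BOOT replicates agree with the top taxon."""
    return sum(rng.random() < p_hit for _ in range(N_BOOT))


def assigned_fraction(p_hit: float, runs: int = 200, seed: int = 0) -> float:
    """Fraction of repeated runs in which the taxon passes the cutoff."""
    rng = random.Random(seed)
    passed = sum(bootstrap_support(p_hit, rng) >= MIN_BOOT for _ in range(runs))
    return passed / runs


for p in (0.95, 0.50, 0.05):
    frac = assigned_fraction(p)
    print(f"per-replicate agreement {p:.2f}: assigned in {frac:.0%} of runs")
```

In this model, a taxon whose replicates agree about half of the time is assigned in some runs and dropped in others, while clearly supported or clearly unsupported taxa behave the same way every time.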

> Would it be the same case for Kraken2 database?

I don't know, you would need to test.

erikrikarddaniel commented on June 19, 2024

IMO, "sbdi-gtdb" is better than "gtdb" as we know there are rRNA-sequences in the GTDB collection that are assigned to the wrong species. "sbdi-gtdb" is phylogenetically vetted to remove these.

annotatebio commented on June 19, 2024

Thanks for all the suggestions - indeed I gave --kraken2_ref_taxonomy standard and --dada_ref_taxonomy sbdi-gtdb a go and while the latter definitely brought some of the results closer on the genus level, phylum level taxonomy still seems to be a jungle in comparison to whole-genome GTDB: sometimes I see Bacteroidota, but sometimes I see Firmicutes. Proteobacteria should be Pseudomonadota so this label is probably also not up to date.

I didn't expect benchmarking the technologies to be such a challenge. It looks like it requires a lot of manual research to map corresponding taxonomic labels onto each other; otherwise, when plotted side by side, the results look completely different.

annotatebio commented on June 19, 2024

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest release, 214. As far as I can see, SBDI is tied to release 207, which means I see Bacillota_A in shotgun but Firmicutes_A in ampliseq results. The major change in phylum names is probably not accounted for in release 207?
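Until both sides use the same release, one workaround is to harmonize the labels before plotting. A minimal Python sketch, assuming bare phylum names with an optional GTDB-style suffix such as `_A`; the rename table is hand-written, covers only the two renames mentioned in this thread, and would need to be extended for real data:

```python
import re

# Old-to-new phylum names. Partial and hand-written for illustration,
# not an official mapping file.
PHYLUM_RENAMES = {
    "Firmicutes": "Bacillota",
    "Proteobacteria": "Pseudomonadota",
}

# Splits 'Firmicutes_A' into ('Firmicutes', '_A'); the suffix is optional.
_SUFFIX = re.compile(r"^(.*?)(_[A-Z]+)?$")


def harmonize_phylum(name: str) -> str:
    """Map an old phylum label to its newer name, keeping any suffix."""
    base, suffix = _SUFFIX.match(name).groups()
    return PHYLUM_RENAMES.get(base, base) + (suffix or "")


for label in ("Firmicutes_A", "Proteobacteria", "Bacteroidota"):
    print(label, "->", harmonize_phylum(label))
```

This only renames labels. It does not account for taxa that were split, merged, or moved between releases.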

erikrikarddaniel commented on June 19, 2024

> Thank you so much Daniel!
>
> Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

That's it. I'm working on SBDI-GTDB 08RS214, and soon release 09 (when that's released, likely in late April).

annotatebio commented on June 19, 2024

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

erikrikarddaniel commented on June 19, 2024

> That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

New releases of databases are included in new releases of the pipeline itself, so yes. Hopefully, I'm done with the next release in time for Ampliseq 2.10.

annotatebio commented on June 19, 2024

That would be fantastic - thanks for all the work, I'm hitting the Watch button then :) Good luck!
