Deion of feature As one of the input database files (silva_s

Accessing 16S gene identifiers about ampliseq HOT 13 OPEN

annotatebio commented on June 19, 2024

Accessing 16S gene identifiers

from ampliseq.

Comments (13)

d4straub commented on June 19, 2024 1

I am benchmarking currently, hope that will shed some light on it.

from ampliseq.

d4straub commented on June 19, 2024

Hi there,

I actually never looked into that. Therefore I am not aware of a solution or whether it could be implemented. Maybe someone else could chime in.

For more context, could you describe why that would help you/what you would gain? E.g. whats your usecase? Might that be of more general interest as well?

from ampliseq.

annotatebio commented on June 19, 2024

Thanks for the swift reply! The reason for the request is that while comparing 16S results from databases like SILVA, with whole genome-based methods (with NCBI Taxonomy used), one stumbles upon a problem of outdated taxonomic labels of SILVA and discrepansies between OTUs and genomic taxonomy.

Since SILVA stores information on 16S gene primary accession in GenBank (and from what I see some of the ampliseq database files do that too), it is possible to use it for finding what's the NCBI taxonomy assigned to the gene - which is likely more up to date and in line with whole genomes' taxonomy.

from ampliseq.

d4straub commented on June 19, 2024

I see. The outdated taxonomies could be probably improved that way.
Regarding comparing it with whole genome-based methods, one could either use GTDB via --dada_ref_taxonomy gtdb (the corresponding shotgun metagenomics assembly classifier can be used in e.g. nf-core/mag) or Kraken2 with the standard database using --kraken2_ref_taxonomy standard (which seems to work just fine in preliminary benchmarks).

from ampliseq.

annotatebio commented on June 19, 2024

Great recommendations!

I've launched GTDB-based classification with DADA2, however I see some crucial taxa are not detected further than phylum/class/family level, even though genus/species level gets assigned to the same ASV with SILVA. It seems to me that the GTDB database have a different content on the sequence level than the default SILVA, and hence the classification results differ?

Would it be the same case for Kraken2 database?

from ampliseq.

d4straub commented on June 19, 2024

Databases are non-trivial to compare, so if you do not find a "crucial" taxa, turn to another one.
Another reason could be that classification with DADA2 is not always same, some taxa close to cutoffs can fall below or raise above said cutoff because of tiny number alterations (that are outside of my control, fixed seed doesnt help). So running the same taxonomic classification with DADA2 multiple times can lead to missing or added taxonomic levels when around the confidence threshold.

Would it be the same case for Kraken2 database?

I dont know, you would need to test.

from ampliseq.

erikrikarddaniel commented on June 19, 2024

IMO, "sbdi-gtdb" is better than "gtdb" as we know there are rRNA-sequences in the GTDB collection that are assigned to the wrong species. "sbdi-gtdb" is phylogenetically vetted to remove these.

from ampliseq.

annotatebio commented on June 19, 2024

Thanks for all the suggestions - indeed I gave --kraken2_ref_taxonomy standard and --dada_ref_taxonomy sbdi-gtdb a go and while the latter definitely brought some of the results closer on the genus level, phylum level taxonomy still seems to be a jungle in comparison to whole-genome GTDB: sometimes I see Bacteroidota, but sometimes I see Firmicutes. Proteobacteria should be Pseudomonadota so this label is probably also not up to date.

I didn't expect it to be such a challenge to benchmark the technologies, looks like it requires a lot of manual research to map the taxonomic labels correspondence, otherwise while plotted one next to another the data looks like the results were completely different.

from ampliseq.

annotatebio commented on June 19, 2024

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

from ampliseq.

erikrikarddaniel commented on June 19, 2024

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

That's it. I'm working on SBDI-GTDB 08RS214, and soon release 09 (when that's released, likely in late April).

from ampliseq.

annotatebio commented on June 19, 2024

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

from ampliseq.

erikrikarddaniel commented on June 19, 2024

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

New releases of databases are included in new releases of the pipeline itself, so yes. Hopefully, I'm done with the next release in time for Ampliseq 2.10.

from ampliseq.

annotatebio commented on June 19, 2024

That would be fantastic - thanks for all the work, I'm hitting the Watch button then :) Good luck!

from ampliseq.

Accessing 16S gene identifiers about ampliseq HOT 13 OPEN

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent