When clustering with swarm, when is an appropriate time to check for chimeras?

Chimera checking and swarm (CLOSED)

torognes commented on June 3, 2024

Chimera checking and swarm

from swarm.

Comments (22)

tobiasgf commented on June 3, 2024 1

Maybe you need to linearize your sequences first, so that there are no line breaks in the sequences? (see the "Swarm front page" here for an "awk" command doing that.)
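
For readers without the front page at hand, a minimal sketch of such a linearization is shown below. It is not necessarily the exact command from the swarm front page, and the directory name, file names and toy sequences are made up for illustration.

```shell
mkdir -p toy_linearize

# Hypothetical wrapped fasta file: each sequence spans several lines.
printf '>a;size=3;\nACGT\nACGT\nAC\n>b;size=1;\nTTTT\nGG\n' \
    > toy_linearize/wrapped.fasta

# Join all sequence lines of an entry so that every entry is exactly
# two lines long (header, then sequence).
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' \
    toy_linearize/wrapped.fasta > toy_linearize/linear.fasta

cat toy_linearize/linear.fasta
```

With wrapped input, a "grep -A 1" on the headers only returns the first sequence line of each entry, which is one way to end up with truncated representatives.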

olar785 commented on June 3, 2024 1

Thanks tobiasgf, it did the trick. I thought my sequences were already linearised, as they all seemed to be on one line, but apparently not. Thanks again!

frederic-mahe commented on June 3, 2024

I do it after swarm, on OTU representatives.

I never tested it properly, but the way swarm builds OTUs should not be affected by chimeras. Chimeras will very likely form independent OTUs, it is then logical to check for chimeras after clustering. In practice, you just have to create a fasta file containing OTU representatives (i.e. the most abundant amplicon in each OTU) and run a chimera check on it. As you may already know, uchime needs abundance values to distinguish parents and chimeras, so it is important to keep them. A simple command should do the trick (if your fasta entries are on two lines):

grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta
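
To make the behaviour of that one-liner concrete, here is a small self-contained sketch on made-up data. It assumes swarm's default output (one OTU per line, amplicons separated by spaces, representative first) and a linearized fasta file. Instead of <( ), it writes the identifiers to a pattern file, which gives the same result and also works in shells without process substitution. The directory and file names are hypothetical.

```shell
mkdir -p toy_representatives

# Hypothetical swarm output: one OTU per line, representative first.
printf 's1;size=10; s2;size=5;\ns3;size=2;\n' > toy_representatives/data.swarms

# Hypothetical linearized, dereplicated fasta file.
printf '>s1;size=10;\nACGTAC\n>s2;size=5;\nACGTAA\n>s3;size=2;\nTTGGCC\n' \
    > toy_representatives/data.fasta

# First field of each line = identifier of the OTU representative.
cut -d " " -f 1 toy_representatives/data.swarms > toy_representatives/seeds.txt

# Fixed-string search for those identifiers; -A 1 also prints the
# sequence line, and sed drops the "--" separators that grep may insert
# between non-adjacent matches.
grep -A 1 -F -f toy_representatives/seeds.txt toy_representatives/data.fasta |
    sed -e '/^--$/d' > toy_representatives/data_representatives.fasta

cat toy_representatives/data_representatives.fasta
```

Because -F matches substrings anywhere in a line, an identifier that is the tail end of another identifier could produce extra hits; this sketch does not guard against that case.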

Torbjørn has implemented uchime in vsearch. It is slightly faster and more sensitive than usearch's implementation. I recommend using it.

colinbrislawn commented on June 3, 2024

Hello Frederic,

Thank you for the advice. I'm currently making great use of vsearch and I appreciate how well it integrates with other bioinformatics programs.

I want to make sure I understand the code snippet you sent me:

grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta

I can see that the inputs are a fasta file and a swarm results file, and the output is a fasta file of representative sequences. Thank you for mentioning that this takes linearized fasta files.

Is the output fasta file comparable to the fasta of OTU centroids generated by usearch/vsearch?
Should the input be my dereplicated fasta file or my seqs before dereplication?
Does this code snippet expect abundance annotations in swarm format or usearch/vsearch format?

Thank you for explaining this to me,
Colin

frederic-mahe commented on June 3, 2024

Hi Colin,

Is the output fasta file comparable to the fasta of OTU centroids generated by usearch/vsearch?

Yes and no. Yes, the output of the code snippet is like a list of usearch centroids. But swarm and uclust (usearch/vsearch) clustering are very different in nature. I personally use swarm to perform clustering for all my amplicon-based projects. Swarm gives higher resolution results and guarantees that each centroid is indeed a local maximum of abundance (more likely to be a "true" sequence). When checking for chimeras, I assume using swarm's centroids will improve the quality of the results.

Should the input be my dereplicated fasta file or my seqs before dereplication?

The dereplicated one, to get the abundance values.

Does this code snippet expect abundance annotations in swarm format or usearch/vsearch format?

The code snippet does not expect abundance annotations per se, but the following step (chimera checking) does.

colinbrislawn commented on June 3, 2024

Hello Frederic,

Thank you for the clarification. I understand how swarm greatly improves the definition of an OTU and provides better biological clusters in both similar and divergent sequences. Swarm introduces some new file types, and I'm just having trouble parsing the output.

I've used the command you provided, but it does not work for me. I ran each command separately and found that the grep command was failing to find any matches in my dereplicated sequences.

cut -d " " -f 1 swarm.swarms > cut.swarms
### returns cut.swarms
grep -A 1 -F -f < cut.swarms seqs.derep.fasta > grep_hits.txt
### returns an empty file.

Here is what my dereplicated sequence file looks like:

head seqs.derep.fasta
>515rcbc_JedArch_206_179;size=108822;
TACGGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCC
GAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTG
AAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTG
GGGAGCAAACAG
>515rcbc_JedArch_173_216;size=70114;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGTTAAGTCAGTTGTGAAAGTTT
GCGGCTCAACCGTAAAATTGCAGTTGAAACTGGCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTG
AAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGACTGCAACTGACACTGAGGCTCGAAAGTGTG
GGTATCAAACAG

Using the first label as an example, '515rcbc' is the primer, 'JedArch' is the project name, '206' is the sample number, and '179' is the order that sequences appeared in that sample (before dereplication).

Thank you for helping me use swarm for the first time. I'm looking forward to seeing the results!

Colin

frederic-mahe commented on June 3, 2024

Hi Colin,

Did you use the parentheses around the cut command?

grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta

It tells the shell to redirect the output of cut to grep as if it was a file (the -f option of grep normally reads from a file).
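
One detail worth illustrating: a plain "<" redirection placed after -f is not a substitute for the parentheses. In bash, <(...) expands to a file name, so it behaves like the first form below. The sketch uses made-up data and hypothetical file names.

```shell
mkdir -p toy_patterns
printf 's1;size=10;\n' > toy_patterns/ids.txt
printf '>s1;size=10;\nACGTAC\n>s2;size=5;\nTTGGCC\n' > toy_patterns/data.fasta

# Correct: a file name directly follows -f, so grep reads its patterns
# from ids.txt and searches data.fasta.
grep -A 1 -F -f toy_patterns/ids.txt toy_patterns/data.fasta \
    > toy_patterns/hits_ok.txt

# Not equivalent: "< ids.txt" is a stdin redirection that the shell removes
# from the argument list. grep then takes data.fasta as its pattern file
# and searches ids.txt (on stdin) instead, finding nothing here.
grep -A 1 -F -f < toy_patterns/ids.txt toy_patterns/data.fasta \
    > toy_patterns/hits_redirect.txt || true

wc -l toy_patterns/hits_ok.txt toy_patterns/hits_redirect.txt
```

On this toy input the first form returns the header and its sequence, while the second returns an empty file.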

colinbrislawn commented on June 3, 2024

When I run the following command, nothing happens.

grep -A 1 -F -f <(cut -d " " -f 1 swarm.swarms) seqs.derep.fasta

Edit: Rather, something does happen; grep searches and finds no matches.

frederic-mahe commented on June 3, 2024

OK, can you copy-paste the first 10 lines of the cut output? I will compare it to the first lines of your dereplicated fasta file to understand why it is not working.

colinbrislawn commented on June 3, 2024

I didn't think to check the cut output. Good idea!

It looks like the cut command is not changing my swarm results at all. As an example:

cut -d " " -f 1 swarm.swarms > cut.swarms
shasum *swarms*
ea44e1201b07bfb02083c30a412f6b9730dbfb9c  cut.swarms
ea44e1201b07bfb02083c30a412f6b9730dbfb9c  swarm.swarms

Here are the first characters of my swarm.swarms file:

swarm_1 436205  515rcbc_JedArch_206_179;size=108822;,515rcbc_JedArch_86_211;size=52828;,515rcbc_JedArch_339_24693;size=2413;,515rcbc_JedArch_229_25294;size=913;,515rcbc_JedArch_176_8284;size=847;,515rcbc_JedArch_91_17504;size=574;,515rcbc_JedArch_253_4914;size=539;,515rcbc_JedArch_240_8296;size=483;,515rcbc_JedArch_3_8302;size=468;,515rcbc_JedArch_200_143;size=368;,515rcbc_JedArch_10_17677;size=343;,515rcbc_JedArch_282_33153;size=335;,515rcbc_JedArch_212_6380;size=319;,515rcbc_JedArch_90_14542;size=314;,515rcbc_JedArch_276_14847;size=307;,515rcbc_JedArch_204_

Edit: Also my swarm output appears to have only one line. Is that normal?

wc -l swarm.swarms 
1 swarm.swarms
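
Identical checksums are what cut produces when the delimiter never occurs in the input: by default, cut passes lines without the delimiter through unchanged. The sketch below shows this on made-up data. It assumes the fields of the one-line file are separated by tabs rather than spaces, which is a guess about the file shown above.

```shell
mkdir -p toy_cut

# Hypothetical one-line, tab- and comma-delimited file (no spaces).
printf 'swarm_1\t2\ts1;size=10;,s2;size=5;\ts3;size=2;\n' > toy_cut/tabbed.swarms

# Space as delimiter: there is no space in the line, so cut passes it
# through intact and the copy is byte-identical.
cut -d " " -f 1 toy_cut/tabbed.swarms > toy_cut/cut_space.txt
cmp toy_cut/tabbed.swarms toy_cut/cut_space.txt && echo "unchanged"

# With -s, lines lacking the delimiter are suppressed, which makes the
# mismatch visible as an empty result.
cut -s -d " " -f 1 toy_cut/tabbed.swarms > toy_cut/cut_space_s.txt
wc -c < toy_cut/cut_space_s.txt
```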

frederic-mahe commented on June 3, 2024

Hi Colin,

your swarm output should normally contain more than one line. Can you please show me the command line you used to perform the clustering?

colinbrislawn commented on June 3, 2024

Sure!

    ### dereplication
vsearch-1.0.7 --derep_fulllength ../../demultiplexing/split_bc_type_12/seqs.fna \
--output seqs.derep.fasta --sizeout

    ### clustering with swarm
swarm-2.0.0-osx seqs.derep.fasta -z -t 6 \
-o swarm.swarms -s swarm.stats -u swarm.uc -r swarm.mothur
    ### not sure how long (hour?). No .mothur file outputted.

The file seqs.fna was created with the qiime script split_libraries_fastq.py

Thanks!
Colin

frederic-mahe commented on June 3, 2024

Hi Colin,

Sorry for the late reply. The problem comes from the -r swarm.mothur option. That option modifies swarm's output and does not create another file. So do not use the -r option if you do not intend to use mothur-formatted results. The -u option also slows down swarm a lot; do not use it if you don't really need results in that format.
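
A quick way to tell the two layouts apart, sketched on made-up data: in swarm's default output each line is one OTU and amplicons are separated by spaces, so counting lines and fields gives the number of OTUs and amplicons. The mothur-style layout assumed here (a single tab-separated line with comma-separated members) is modeled on the file excerpt earlier in the thread. All file names are hypothetical.

```shell
mkdir -p toy_format
printf 's1;size=10; s2;size=5;\ns3;size=2;\n' > toy_format/default.swarms
printf 'swarm_1\t2\ts1;size=10;,s2;size=5;\ts3;size=2;\n' > toy_format/mothur.swarms

for f in toy_format/default.swarms toy_format/mothur.swarms; do
    # A tab anywhere in the file hints at the mothur-style layout.
    if grep -q "$(printf '\t')" "$f"; then
        echo "$f: looks mothur-formatted"
    else
        # Default layout: one OTU per line, NF amplicons on each line.
        awk -v f="$f" '{ n += NF }
            END { print f ": " NR " OTUs, " n " amplicons" }' "$f"
    fi
done > toy_format/report.txt

cat toy_format/report.txt
```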

colinbrislawn commented on June 3, 2024

Hello Frederic,

I was unaware that -r swarm.mothur would also change the normal output. I thought it would make a new file, like -u does.

I will try again without -r and also compare speeds with and without -u.

Thank you for all your assistance!
Colin B

mdehollander commented on June 3, 2024

Hi,

I would like to join in on this topic, because I am also working on chimera detection of swarm output. I have a few questions:

For uchime_denovo does the input need to be in usearch format (with size= in the header)? I have the seqs now formatted as >md5sum_abundance and only the uchime_ref finds chimeras.
What would be the best way to incorporate the uchime output with the guide to make contingency tables (Working with several samples)? Just remove chimeric OTUs?

Thanks for any advice!

torognes commented on June 3, 2024

Hi @mdehollander,

Yes, uchime_denovo needs abundance information because by default it will only use a sequence as a potential chimera parent if its abundance is at least twice the abundance of the potential chimeric sequence. That level can be adjusted, though, with the --abskew option. Vsearch does not recognize abundance information in the swarm format (after the underscore).
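
As a worked illustration of that abundance rule (a toy calculation, not vsearch's own code): with a skew of 2, a sequence seen 10 times can have a sequence seen 25 times as a candidate parent, but not one seen 15 times. The abundances and file names below are invented.

```shell
mkdir -p toy_abskew

# Columns: candidate-parent abundance, query abundance (made-up values).
printf '25 10\n15 10\n' > toy_abskew/pairs.txt

# A parent qualifies when parent / query >= abskew
# (2.0 being the default level described above).
awk -v abskew=2.0 '{
    verdict = ($1 / $2 >= abskew) ? "eligible" : "too rare"
    print $1, $2, verdict
}' toy_abskew/pairs.txt > toy_abskew/verdicts.txt

cat toy_abskew/verdicts.txt
```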

Torbjørn

frederic-mahe commented on June 3, 2024

Hi,

(a quick answer, I will take the time to answer fully as soon as I am available)

if you have swarm-style abundance annotations, you can easily switch to usearch-style annotations:

sed -e '/^>/ s/_/;size=/' file.fasta > file2.fasta

use -i if you want to make the change in place.
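
A quick check of that substitution on made-up headers in the >md5sum_abundance style mentioned above. The command rewrites the first underscore on each header line, so it assumes identifiers contain no other underscore. The second variant, anchored to the trailing number, is a suggested alternative for identifiers that do contain underscores; it is not taken from the swarm documentation.

```shell
mkdir -p toy_headers
printf '>9f8e7d_42\nACGT\n>sample_A_7\nTTGG\n' > toy_headers/file.fasta

# Command from the thread: replaces the first underscore of each header.
sed -e '/^>/ s/_/;size=/' toy_headers/file.fasta > toy_headers/first.fasta

# Hypothetical variant: only rewrite an underscore that is followed by
# digits at the very end of the header.
sed -e '/^>/ s/_\([0-9][0-9]*\)$/;size=\1/' toy_headers/file.fasta \
    > toy_headers/anchored.fasta

grep '^>' toy_headers/first.fasta toy_headers/anchored.fasta
```

On this input the first command turns ">sample_A_7" into ">sample;size=A_7", while the anchored variant yields ">sample_A;size=7".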

frederic-mahe commented on June 3, 2024

Hi @mdehollander,

regarding your second question (how to incorporate chimera detection results in the contingency tables?). What I did recently on my own projects is to perform de novo chimera detection on OTU representatives (so, after swarm). Using vsearch, I obtained a list of chimeras. From there, I see two different possibilities when building the OTU table: 1) discard OTUs whose representative sequence has been identified as a chimera; 2) keep all OTUs but add a "chimeric" column to flag OTUs (0 = not a chimera, 1 = chimera).

The flag option is tempting. It could be interesting to also report the uchime score or the 18th column of the --uchimeout output. It would allow more subtlety than the binary (chimera or not chimera).
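
The flagging option could be sketched as follows, on made-up data. It assumes a tab-separated OTU table whose first column is the representative's identifier, and a plain list of chimeric identifiers (for example, taken from the headers of a fasta file of chimeras). Both layouts and all file names are assumptions for illustration.

```shell
mkdir -p toy_flag

# Hypothetical OTU table: identifier, then one abundance column per sample.
printf 'OTU\tsample1\tsample2\ns1\t8\t2\ns2\t0\t5\ns3\t1\t1\n' \
    > toy_flag/otu_table.tsv

# Hypothetical list of representatives reported as chimeric.
printf 's3\n' > toy_flag/chimeras.txt

# First file: remember the chimeric identifiers. Second file: append a
# "chimera" column (0 = not a chimera, 1 = chimera).
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { chimeric[$1] = 1; next }
     FNR == 1 { print $0, "chimera"; next }
     { print $0, (($1 in chimeric) ? 1 : 0) }' \
    toy_flag/chimeras.txt toy_flag/otu_table.tsv \
    > toy_flag/otu_table_flagged.tsv

cat toy_flag/otu_table_flagged.tsv
```

Discarding instead of flagging then amounts to keeping only the rows where the new column is 0. Note that the NR == FNR idiom misbehaves if the chimera list is empty, so that case would need a separate check.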

frederic-mahe commented on June 3, 2024

Hi @mdehollander,

What would be the best way to incorporate the uchime output with the guide to make contingency tables (Working with several samples)? Just remove chimeric OTUs?

Yes, my advice is to tag and ultimately remove OTUs identified as chimeras. I will not modify the guide to make contingency tables (Working with several samples) to take chimeras into account. I posted this script as an example of code to build an OTU table; it was not meant to be a comprehensive tool.

I've created a FAQ on swarm's wiki and added a question about chimeras. I will close this issue; feel free to reopen it if you still have questions.

olar785 commented on June 3, 2024

Hi,
I would be very keen to use the swarm clustering algorithm and, just like colinbrislawn, I am using vsearch to look for chimeras. My sequences are all about 450 bp long, but after swarm clustering and using the command you proposed above:

grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta

I end up with "OTUs" of 80bp. Any idea why this is happening? Thanks for your help.

Seena1 commented on June 3, 2024

I ran swarm and got two files, representative.fasta and data.swarms. I would like to perform chimera detection with VSEARCH. Could anyone please help with the command?

torognes commented on June 3, 2024

@Seena1, you could try this command:

vsearch --uchime_denovo representative.fasta --nonchimeras nonchimeras.fasta
