Comments (22)
Maybe you need to linearize your sequences first, so that there are no line breaks in the sequences? (see the "Swarm front page" here for an "awk" command doing that.)
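For readers without the front page at hand, here is a minimal sketch of what linearizing can look like. It is not necessarily the same awk command as the one on the swarm front page, and the function name `linearize_fasta` is made up for this example.

```shell
# linearize_fasta: join wrapped sequence lines so that each record is
# exactly two lines (header, then sequence). Reads stdin, writes stdout.
linearize_fasta() {
    awk '
        /^>/ {
            if (seq != "") print seq   # flush the previous record
            print                      # header line, unchanged
            seq = ""
            next
        }
        { seq = seq $0 }               # append one wrapped chunk
        END { if (seq != "") print seq }
    '
}

# Tiny demonstration on a wrapped two-record input.
printf '>a;size=3;\nACGT\nTTGA\n>b;size=1;\nGG\nCC\n' | linearize_fasta
```

On a real file this would be used as `linearize_fasta < wrapped.fasta > linear.fasta` (both file names are placeholders).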
from swarm.
Thanks tobiasgf, it did the trick. I thought my sequences were linearised already as they all seemed to be on one line but apparently not...thanks again!
from swarm.
I do it after swarm, on OTU representatives.
I never tested it properly, but the way swarm builds OTUs should not be affected by chimeras. Chimeras will very likely form independent OTUs, it is then logical to check for chimeras after clustering. In practice, you just have to create a fasta file containing OTU representatives (i.e. the most abundant amplicon in each OTU) and run a chimera check on it. As you may already know, uchime needs abundance values to distinguish parents and chimeras, so it is important to keep them. A simple command should do the trick (if your fasta entries are on two lines):
grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta
Torbjørn has implemented uchime in vsearch. It is slightly faster and more sensitive than usearch's implementation. I recommend using it.
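A possible variant of the command above, as a sketch only: grep -F matches substrings, so with swarm-style identifiers a seed such as a_1 would also select a header such as a_10. The awk version below compares whole identifiers instead. It assumes swarm's default output (one cluster per line, space-separated, seed first), and the name `extract_representatives` is hypothetical.

```shell
# extract_representatives SWARMS FASTA
# Prints the FASTA records whose identifier is exactly equal to the
# first identifier on a line of SWARMS (assumed to be the seed).
extract_representatives() {
    awk '
        NR == FNR { seeds[">" $1] = 1; next }   # first file: seed ids
        /^>/      { keep = ($1 in seeds) }      # header: exact match?
        keep                                    # print header and sequence
    ' "$1" "$2"
}

# Demonstration on throwaway files.
tmp=$(mktemp -d)
printf 'a_1 c_2\nb_5\n' > "$tmp/data.swarms"
printf '>a_10\nTTTT\n>a_1\nACGT\n>b_5\nGGCC\n>c_2\nACGA\n' > "$tmp/data.fasta"
extract_representatives "$tmp/data.swarms" "$tmp/data.fasta"
rm -r "$tmp"
```

In the demonstration only the a_1 and b_5 records are printed; a_10 is left out even though it contains the string a_1.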
from swarm.
Hello Frederic,
Thank you for the advice. I'm currently making great use of vsearch and I appreciate how well it integrates with other bioinformatics programs.
I want to make sure I understand the code snippet you sent me:
grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta
I can see that the inputs are a fasta file and a swarm results file, and the output is a fasta file of representative sequences. Thank you for mentioning that this takes linearized fasta files.
Is the output fasta file comparable to the fasta of OTU centroids generated by usearch/vsearch?
Should the input be my dereplicated fasta file or my seqs before dereplication?
Does this code snippet expect abundance annotations in swarm format or usearch/vsearch format?
Thank you for explaining this to me,
Colin
from swarm.
Hi Colin,
Is the output fasta file comparable to the fasta of OTU centroids generated by usearch/vsearch?
Yes and no. Yes, the output of the code snippet is like a list of usearch centroids. But swarm and uclust (usearch/vsearch) clustering are very different in nature. I personally use swarm to perform clustering for all my amplicon-based projects. Swarm gives higher resolution results and guarantees that each centroid is indeed a local maximum of abundance (more likely to be a "true" sequence). When checking for chimeras, I assume using swarm's centroids will improve the quality of the results.
Should the input be my dereplicated fasta file or my seqs before dereplication?
The dereplicated one, to get the abundance values.
Does this code snippet expect abundance annotations in swarm format or usearch/vsearch format?
The code snippet does not expect abundance annotations per se, but the following step (chimera checking) does.
from swarm.
Hello Frederic,
Thank you for the clarification. I understand how swarm greatly improves the definition of an OTU and provides better biological clusters in both similar and divergent sequences. Swarm introduces some new file types, and I'm just having trouble parsing the output.
I've used the command you provided, but it does not work for me. I ran each command separately and found that the grep command was failing to find any matches in my dereplicated sequences.
cut -d " " -f 1 swarm.swarms > cut.swarms
### returns cut.swarms
grep -A 1 -F -f < cut.swarms seqs.derep.fasta > grep_hits.txt
### returns an empty file.
Here is what my dereplicated sequence file looks like:
head seqs.derep.fasta
>515rcbc_JedArch_206_179;size=108822;
TACGGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCC
GAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTG
AAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTG
GGGAGCAAACAG
>515rcbc_JedArch_173_216;size=70114;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGTTAAGTCAGTTGTGAAAGTTT
GCGGCTCAACCGTAAAATTGCAGTTGAAACTGGCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTG
AAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGACTGCAACTGACACTGAGGCTCGAAAGTGTG
GGTATCAAACAG
Using the first label as an example, '515rcbc' is the primer, 'JedArch' is the project name, '206' is the sample number, and '179' is the order that sequences appeared in that sample (before dereplication).
Thank you for helping me use swarm for the first time. I'm looking forward to seeing the results!
Colin
from swarm.
Hi Colin,
Did you use the parentheses around the cut command?
grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta
It tells the shell to redirect the output of cut to grep as if it was a file (the -f option of grep normally reads from a file).
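To make that concrete, here is a sketch of the same pipeline with an explicit temporary file in place of the `<( )` construct, which should also work in shells that lack process substitution. The function name `grep_seeds` is invented for this example.

```shell
# grep_seeds SWARMS FASTA
# Equivalent of: grep -A 1 -F -f <(cut -d " " -f 1 SWARMS) FASTA
# but the patterns go through a real temporary file.
grep_seeds() {
    patterns=$(mktemp)
    cut -d " " -f 1 "$1" > "$patterns"     # what <(cut ...) would supply
    grep -A 1 -F -f "$patterns" "$2" | sed -e '/^--$/d'
    rm -f "$patterns"
}

# Demonstration on throwaway files.
tmp=$(mktemp -d)
printf 'a;size=9; c;size=2;\nb;size=4;\n' > "$tmp/data.swarms"
printf '>a;size=9;\nACGT\n>c;size=2;\nACGA\n>b;size=4;\nGGCC\n' > "$tmp/data.fasta"
grep_seeds "$tmp/data.swarms" "$tmp/data.fasta"
rm -r "$tmp"
```

Looking at the temporary file (or running the cut step alone) is also a quick way to see which patterns grep actually receives.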
from swarm.
When I run the following command, nothing happens.
grep -A 1 -F -f <(cut -d " " -f 1 swarm.swarms) seqs.derep.fasta
Edit: Rather, something does happen; grep searches and finds no matches.
from swarm.
OK, can you copy-paste the first 10 lines of the cut output? I will compare it to the first lines of your dereplicated fasta file to understand why it is not working.
from swarm.
I didn't think to check the cut output. Good idea!
It looks like the cut command is not changing my swarm results at all. As an example:
cut -d " " -f 1 swarm.swarms > cut.swarms
shasum *swarms*
ea44e1201b07bfb02083c30a412f6b9730dbfb9c cut.swarms
ea44e1201b07bfb02083c30a412f6b9730dbfb9c swarm.swarms
Here are the first characters of my swarm.swarms file:
swarm_1 436205 515rcbc_JedArch_206_179;size=108822;,515rcbc_JedArch_86_211;size=52828;,515rcbc_JedArch_339_24693;size=2413;,515rcbc_JedArch_229_25294;size=913;,515rcbc_JedArch_176_8284;size=847;,515rcbc_JedArch_91_17504;size=574;,515rcbc_JedArch_253_4914;size=539;,515rcbc_JedArch_240_8296;size=483;,515rcbc_JedArch_3_8302;size=468;,515rcbc_JedArch_200_143;size=368;,515rcbc_JedArch_10_17677;size=343;,515rcbc_JedArch_282_33153;size=335;,515rcbc_JedArch_212_6380;size=319;,515rcbc_JedArch_90_14542;size=314;,515rcbc_JedArch_276_14847;size=307;,515rcbc_JedArch_204_
Edit: Also my swarm output appears to have only one line. Is that normal?
wc -l swarm.swarms
1 swarm.swarms
from swarm.
Hi Colin,
your swarm output should normally contain more than one line. Can you please show me the command line you used to perform the clustering?
from swarm.
Sure!
### dereplication
vsearch-1.0.7 --derep_fulllength ../../demultiplexing/split_bc_type_12/seqs.fna \
--output seqs.derep.fasta --sizeout
### clustering with swarm
swarm-2.0.0-osx seqs.derep.fasta -z -t 6 \
-o swarm.swarms -s swarm.stats -u swarm.uc -r swarm.mothur
### not sure how long (hour?). No .mothur file outputted.
The file seqs.fna was created with the QIIME script split_libraries_fastq.py.
Thanks!
Colin
from swarm.
Hi Colin,
Sorry for the late reply. The problem comes from the -r swarm.mothur option. That option changes the format of swarm's normal output; it does not create another file. So do not use the -r option unless you intend to use mothur-formatted results. The -u option also slows swarm down a lot, so do not use it if you don't really need results in that format.
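If re-running swarm is not convenient, the one-line output can probably be converted back. The sketch below assumes the mothur-like layout is a label, a cluster count, then one tab-separated field per cluster with comma-separated members. That assumption fits the excerpt above and would also explain why cut -d " " left the file untouched: cut prints lines that contain no delimiter unchanged, so the separators were presumably tabs. The function name `mothur_to_swarm` is hypothetical.

```shell
# mothur_to_swarm: read mothur-like list lines on stdin and print one
# cluster per line with space-separated members.
# Assumed layout: label <TAB> count <TAB> cluster <TAB> cluster ...
mothur_to_swarm() {
    awk -F '\t' '{
        for (i = 3; i <= NF; i++) {   # skip the label and the count
            gsub(/,/, " ", $i)        # members: commas to spaces
            print $i
        }
    }'
}

# Demonstration with two clusters.
printf 'swarm_1\t2\ta;size=9;,c;size=2;\tb;size=4;\n' | mothur_to_swarm
```

The result has the shape that the cut -d " " -f 1 step expects, with the first identifier of each line taken as the seed.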
from swarm.
Hello Frederic,
I was unaware that -r swarm.mothur would also change the normal output. I thought it would make a new file, like -u does.
I will try again without -r and also compare speeds with and without -u.
Thank you for all your assistance!
Colin B
from swarm.
Hi,
I would like to join in on this topic, because I am also working on chimera detection of swarm output. I have a few questions:
- For uchime_denovo does the input need to be in usearch format (with size= in the header)? I have the seqs now formatted as >md5sum_abundance and only the uchime_ref finds chimeras.
- What would be the best way to incorporate the uchime output with the guide to make contingency tables: Working with several samples? Just remove chimeric OTUs?
Thanks for any advice!
from swarm.
Hi @mdehollander,
Yes, uchime_denovo needs abundance information because by default it will only use a sequence as a potential chimera parent if the parent is at least twice as abundant as the potential chimeric sequence. That level can be adjusted though, with the --abskew option. Vsearch does not recognize abundance information in the swarm format (after the underscore).
- Torbjørn
from swarm.
Hi,
(a quick answer, I will take the time to answer fully as soon as I am available)
if you have swarm-style abundance annotations, you can easily switch to usearch-style annotations:
sed -e '/^>/ s/_/;size=/' file.fasta > file2.fasta
Use -i if you want to make the change in place.
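One caveat, with a hedged sketch: the substitution above replaces the first underscore of each header, which is fine for md5sum_abundance names but would cut a name like 515rcbc_JedArch_206_179 in the wrong place. The variant below only rewrites a trailing underscore followed by digits, and also appends the closing semicolon used elsewhere in this thread. The function name `swarm_to_usearch` is made up for the example.

```shell
# swarm_to_usearch: turn a trailing "_<abundance>" on header lines into
# ";size=<abundance>;". Other underscores and all sequence lines are
# left alone. Reads stdin, writes stdout.
swarm_to_usearch() {
    sed -e '/^>/ s/_\([0-9][0-9]*\)$/;size=\1;/'
}

# Demonstration: only the last underscore-number pair is rewritten.
printf '>515rcbc_JedArch_206_179_108822\nACGT\n' | swarm_to_usearch
```

It assumes the abundance is the very last thing on the header line; headers with a trailing description would need a different pattern.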
from swarm.
Hi @mdehollander,
Regarding your second question (how to incorporate chimera detection results in the contingency tables?): what I did recently on my own projects is to perform de novo chimera detection on OTU representatives (so, after swarm). Using vsearch, I obtained a list of chimeras. From there, I see two different possibilities when building the OTU table: 1) discard OTUs whose representative sequence has been identified as a chimera; 2) keep all OTUs but add a "chimeric" column to flag OTUs (0 = not a chimera, 1 = chimera).
The flag option is tempting. It could be interesting to also report the uchime score or the 18th column of the --uchimeout output. It would allow more subtlety than the binary (chimera or not chimera).
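As a rough illustration of the flag option, here is a sketch under several assumptions that do not come from the thread: the chimera list is a plain text file with one identifier per line, the OTU table is tab-separated with a header row, and its first column holds the representative's identifier written exactly as in the list. The function name `flag_chimeras` is hypothetical.

```shell
# flag_chimeras CHIMERA_IDS OTU_TABLE
# Appends a "chimeric" column (0 = not a chimera, 1 = chimera) to a
# tab-separated OTU table whose first column is the representative id.
flag_chimeras() {
    awk -F '\t' -v OFS='\t' '
        FILENAME == ARGV[1] { chimera[$1] = 1; next }   # load the id list
        FNR == 1            { print $0, "chimeric"; next }  # header row
        { print $0, (($1 in chimera) ? 1 : 0) }
    ' "$1" "$2"
}

# Demonstration on throwaway files.
tmp=$(mktemp -d)
printf 'b\n' > "$tmp/chimeras.txt"
printf 'OTU\ts1\ts2\na\t3\t0\nb\t1\t1\n' > "$tmp/otu_table.tsv"
flag_chimeras "$tmp/chimeras.txt" "$tmp/otu_table.tsv"
rm -r "$tmp"
```

Discarding instead of flagging would then be a matter of filtering on the new column.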
from swarm.
Hi @mdehollander,
What would be the best way to incorporate the uchime output with the guide to make contingency tables: Working with several samples? Just remove chimeric OTUs?
Yes, my advice is to tag and ultimately remove OTUs identified as chimeras. I will not modify the guide to make contingency tables (Working with several samples) to take chimeras into account. I posted this script as an example of code to build an OTU table; it was not meant to be a comprehensive tool.
I've created a FAQ on swarm's wiki and added a question about chimeras. I will close that issue, feel free to reopen it if you still have questions.
from swarm.
Hi,
I would be very keen to use the swarm clustering algorithm and, just like colinbrislawn, I am using vsearch to look for chimeras. My sequences are all about 450bp long, but after swarm clustering and using the command you proposed above:
grep -A 1 -F -f <(cut -d " " -f 1 data.swarms) data.fasta | sed -e '/^--$/d' > data_representatives.fasta
I end up with "OTUs" of 80bp. Any idea why this is happening? Thanks for your help.
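One possible explanation, not confirmed in this thread: grep -A 1 keeps only one line after each header, so if the fasta file is wrapped (for example at 80 characters per line), only the first chunk of each sequence survives. A quick way to check is to count records that span more than one sequence line, as in this sketch (the function name `count_wrapped_records` is made up):

```shell
# count_wrapped_records: print how many FASTA records on stdin have
# their sequence spread over more than one line.
count_wrapped_records() {
    awk '
        /^>/ { if (lines > 1) wrapped++; lines = 0; next }
        { lines++ }                       # one more sequence line
        END { if (lines > 1) wrapped++; print wrapped + 0 }
    '
}

# Demonstration: the first record is wrapped, the second is not.
printf '>a\nACGT\nTTGA\n>b\nGGCC\n' | count_wrapped_records
```

A result other than 0 suggests the file should be linearized before running the grep command.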
from swarm.
I ran swarm and got two files, representative.fasta and data.swarms. I would like to perform chimera detection with VSEARCH. Could anyone please help with the command?
from swarm.
@Seena1, you could try this command:
vsearch --uchime_denovo representative.fasta --nonchimeras nonchimeras.fasta
from swarm.