Comments (35)
@mossishahi Lauren recently fixed some whitespace errors in makeTSVfile.py
. I made a new ARCS release (1.0.5) that includes Lauren's fix: https://github.com/bcgsc/arcs/releases/download/v1.0.5/arcs-1.0.5.tar.gz
Can you try the makeTSVfile.py
in that tarball and see if it solves your problem?
from arcs.
Hi Lauren,
I have now tried to run arcs-make, but still run into problems in the very first steps of the pipeline
'make: *** No rule to make target oenMenl_1.3.fa.fa', needed by
oenMenl_1.3.fa.renamed.fa'. Stop.'
But never mind. - I've found the source of the issue with my own pipeline that both a collaborator and I ran into. As silly as it is, I'll still post it here in case it can be useful for somebody: We both thought that the tsv file output by ARCS can be used as input for LINKS, and that the step over the makeTSVfile.py script can be skipped. Well, it can't, as we found out.
Thanks again for the swift support & very best wishes,
Reto
from arcs.
Hello @mossishahi,
Just to confirm, you are using longranger align
, and aligning your longranger basic
processed reads to your draft genome? Have you sorted the bamfile(s) by name (samtools sort -n
), and is the barcode available either in the BX:Z
tag or in the form readname_barcode
?
It might also help to take a look at our ARCS makefile (available in Examples/arcs-make), which can be used to do many of the different steps in the ARCS pipeline for you.
Lauren
from arcs.
@lcoombe
As the supernova support told us, it's not required to runlongranger align
on the outputs of longranger basic
and it process the barcodes by itself.
We have only one bamfile made by longranger align
and so no need to be sorted . We expect the longranger align
to put the barcodes in the headers in a form like you mentioned.
I've not got familiar with the arcs-make and its difference with the arcs.
from arcs.
@mossishahi : I ask because we normally use longranger basic
followed by bwa mem
for the alignments - Then we can re-use that read set with a bunch of different tools that are not neccessarily from 10x Genomics, and know that the barcode has been removed from the first read of each read pair.
ARCS requires the input BAM file to be sorted by read name, and I don't how the output BAM from longranger align
is sorted. An unsorted input will lead to an error like skipped 531501621 unpaired reads
similar to what you mentioned above. I would definitely check how the BAM is sorted, and run samtools sort -n
if it is not sorted by read name.
If you want to be sure that the format outputted by longranger align
conforms to the format expected by ARCS, feel free to post a couple of lines of it here.
arcs-make is a Makefile that will run all the different steps of the ARCS pipeline for you. The inputs it expects are longranger basic
processed reads and a draft genome. So for example, it will run the BWA mem
alignments, launch the arcs
binary, the makeTSVfile.py
script, etc.
Since you are using a different aligner from the one that we normally use, you might not be able to run the Makefile directly, but you can use it to see what steps need to be run in the pipeline.
from arcs.
@lcoombe
as I noticed your purposed pipeline is somehow like this: ??
-
run
longranger basic
on fastq files -
run the
bwa mem
on the output files oflongranger basic
(I don't have an exact idea, do you know how should I specify the inputs of it? ) -
run
samtools sort -n
on the bam files
Q we have 4 fastq files which are separated into two groups, each group is related to a sample and they yield to two individual results inlongranger basic
process. we decided to run this script on the second file in order to index the barcodes:
gunzip -c lib2.fq.gz| sed '1~4s/-1$/-2/' | pigz >lib2.bx.fq.gz
the question is that doesbwa mem
accepts two inputs and how does it make the outputs? -
specify the bamfile as the input of Arcs (is it required to sort the bamfile list.txt ?)
some lines of our bamfile extracted by samtools view
SN7001394:302:HMMVKBCXY:1:2114:10044:41109 163 DjScaffold1 6000 60 18S133M = 6180 308 TGCCGTTTTCCAAATCACAAAACGGCAATCCCCATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGGTTT DDDDDEHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIGIIIIHIIIIHIIIIIIIIIIIIGHHHIIIIIIIHIIHIHGIIIIHIIIHIIIIIIIHIIHIIIEEHIIF@HFHHIHIIIIHHH RX:Z:ATCATGGAGATGTGGC QX:Z:DDDDDIIIIIIIIIII XS:i:-85 AS:i:-16 XM:Z:0 AM:Z:1 XT:i:0 BX:Z:ATCATGGAGATGTGGC-1 DM:Z:1.100000 RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
SN7001394:302:HMMVKBCXY:1:1113:8868:36728 163 DjScaffold1 6011 60 151M = 6291 376 CCGCATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGATTTTTATATCAGTTAATTCAGAGATTCAACCA DDDDDIHIIIIIGIIIHIIIIIIIIIIIIIIIIIIIIIIIICCHIHIIIHIIIIIIIIHIHHGHHHHHIIIIIIIIIHIGIIHIIIIIIIIEHFHGIIEHIGHHIIIIIIHIIIIHHHIIGIIHIIIHHIIIIIHGIHHHHEHHHIIIGID RX:Z:TAGACACAGTCGACTT QX:Z:DDDDDIIIIIIIIIII XS:i:-105 AS:i:-27 XM:Z:0 AM:Z:1 XT:i:0 BX:Z:TAGACACAGTCGACTT-1 DM:Z:3.250000 RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
SN7001394:302:HMMVKBCXY:2:1207:11417:81454 147 DjScaffold1 6011 60 151M = 5802 -360 CCCCATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGATTTTTATATCAGTTAATTCAGAGATTCAACCA IHFIGHHHHHIIIIIIIHHG?EFHIIIIHHIHHGIHHIHHIIGIIIIHIHIHHIHHIIHIIIIIIGIIIIH?HHHHHHEHHGHHFFHIHHFHHG@1IIIIIIIIIIIHIHGIIIIIIIIIHIIIIIIIIIIIIIHIIIIIIIIIIIDDCDD RX:Z:GCATCTCTCAGCTCTC QX:Z:DDDDDIIIIIIIIIII XS:i:-84 AS:i:-4 XM:Z:0 AM:Z:1 XT:i:0 BX:Z:GCATCTCTCAGCTCTC-1 DM:Z:2.125000 RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
SN7001394:302:HMMVKBCXY:1:2101:7075:63131 99 DjScaffold1 6014 60 128M = 6138 275 CATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGGTTTTTATATCAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIGIIIHIHIIIIFHIIIIHIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIG RX:Z:GTACGTACATGTATGC QX:Z:DDDDDIIIIIIIIIHI TR:Z:CCGAGAG TQ:Z:IIIIIII XS:i:-71 AS:i:0 XM:Z:0 AM:Z:1 XT:i:0 BX:Z:GTACGTACATGTATGC-1 DM:Z:0.833333 RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
SN7001394:302:HMMVKBCXY:1:1210:18584:48645 99 DjScaffold1 6016 60 128M = 6174 310 TCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGATTTTTATATCAGTT IIIIIIIIIIGIIIIIIIIIIGIIIIIIIHIIIIIIIIHIIIHIIIIIIIIIHIIIIIIIHIIIIIIIIIIIIIIIIIGIIHIIHIIIIIHIIIHIIIIIIIIIIIIIIIIIICGCHHIIIIIIIHII RX:Z:ACATCTTTCGCCTGTT QX:Z:DDDDDIIIIIIIIIII TR:Z:ATCCCCA TQ:Z:IIIIIII XS:i:-74 AS:i:-5 XM:Z:0 AM:Z:0 XT:i:0 BX:Z:ACATCTTTCGCCTGTT-1 RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
SN7001394:302:HMMVKBCXY:1:2114:9274:77248 99 DjScaffold1 6025 60 128M = 6212 338 CTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAG
do you recommend to run your suggested pipeline or there is only a small fault with our pipeline?
from arcs.
@mossishahi : Here's a sample command for the arcs-make Makefile, and the commands that would be run. (Using -n
option with a makefile triggers a 'dry-run' -- the commands that would have been run are printed but not executed). Hopefully this will clear up your questions about how to run the ARCS pipeline.
arcs-make arcs -n draft=celegans_LRSIM_pseudohap_noNewLines reads=celegans-N2-LRSIM_longrangerBarcoded
Here, the draft assembly is celegans_LRSIM_pseudohap_noNewLines.fa, and the longranger basic
processed reads are celegans-N2-LRSIM_longrangerBarcoded.fq.gz
Here are the commands run:
perl -ne 'chomp; if(/>/){$ct+=1; print ">$ct\n";}else{print "$_\n";} ' < celegans_LRSIM_pseudohap_noNewLines.fa > celegans_LRSIM_pseudohap_noNewLines.renamed.fa
/usr/bin/time -v bwa index celegans_LRSIM_pseudohap_noNewLines.renamed.fa |& tee celegans_LRSIM_pseudohap_noNewLines_bwa_index.log
/usr/bin/time -v sh -c 'bwa mem -t8 -C -p celegans_LRSIM_pseudohap_noNewLines.renamed.fa celegans-N2-LRSIM_longrangerBarcoded.fq.gz | samtools view -Sb - | samtools sort -@8 -n - -o celegans_LRSIM_pseudohap_noNewLines.sorted.bam' |& tee celegans_LRSIM_pseudohap_noNewLines_bwa_mem.log
echo celegans_LRSIM_pseudohap_noNewLines.sorted.bam > celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof
/usr/bin/time -v arcs --bx -v -f celegans_LRSIM_pseudohap_noNewLines.renamed.fa -a celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof -c 5 -m 50-10000 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500 |& tee celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_arcs.log
python /projects/btl/lcoombe/git/arcs/Examples/../Examples/makeTSVfile.py celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_original.gv celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500.tigpair_checkpoint.tsv celegans_LRSIM_pseudohap_noNewLines.renamed.fa
ln -s celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500.tigpair_checkpoint.tsv celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.tigpair_checkpoint.tsv
/usr/bin/time -v LINKS -f celegans_LRSIM_pseudohap_noNewLines.renamed.fa -s empty.fof -b celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3 -l 5 -a 0.3 -z 500 |& tee celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3_links.log
If you have multiple read files, you can run the bwa mem
command (the output is piped to samtools sort -n
, as you can see above) separately for each read file. Then, make a file (named celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof above) which lists the paths to each individual BAM file. This list of file names is provided to ARCS with the -a
option.
Looking at the sample lines from your alignment file, I can see that the alignments aren't sorted by name, which ARCS does assume.
Hope that helps -- let me know if you have further questions!
from arcs.
@lcoombe I was supposed that running samtools sort -n
on the present bamfile could be useful, I did it but it didn't conduct to a successful Arcs run. It skipped the paired reads as the previous run and it was printing out some messages like this which I break the process.
Warning: Skipping an unpaired read. Read pairs should be consecutive in the SAM/BAM file.
Prev read: SN7001394:302:HMMVKBCXY:1:2210:14481:27671
Curr read: SN7001394:302:HMMVKBCXY:2:2206:21254:21263
Warning: Skipped 1000000 unpaired reads.
Warning: Skipped 2000000 unpaired reads.
now I decided to follow your pipeline using bwa mem but there is a question:
As I contacted the support of Supernova, they expressed that our present fastq files are standard fastq files while your pipeline ideal files are interleaved fastq files. Does it matter?
note: a part of supernova support message about the Arcs:
However, it does appear that they are using the "interleaved" fastq format, where the R1 and R2 reads are present in the same fastq file, rather than in separate files like you have. (Interleaved format is an older format that we no longer use).
from arcs.
Hi @mossishahi,
-
Did you let ARCS finish, or did you stop when you saw those error messages? It is possible that you might still see that warning message but still get a non-empty graph output. That could happen if your aligned outputs supplementary alignments - ARCS reads the BAM file a pair at a time, so will output that warning when it sees a pair of reads that do NOT have the same name, but it will continue reading through the file and identifying correct read pairs. Feel free to post or send me ([email protected]) a portion of your BAM file if you'd like another set of eyes.
-
Sounds good to use BWA mem. I'm not sure which stage of fastq-processing you were discussing with the Supernova support -- The input reads for the Supernova de novo assembler are the raw reads (barcode still in read 1 sequence). For ARCS, the input is the interleaved fastq reads file post
longranger basic
.Longranger basic
outputs an interleaved fastq file calledbarcoded.fastq.gz
,
where the barcode sequence is in aBX:Z
comment of the read headers.
(https://support.10xgenomics.com/genome-exome/software/pipelines/latest/advanced/other-pipelines)
from arcs.
@lcoombe , My apologies for not responding to your comment sooner.
thanks for your guidance.
- Actually, I stopped the Arcs and didn't let it continue because I was supposed that It will result in errors again.
- Now I'm doing some preparations on our data to run the
bwa mem
on them. Support of Supernova informed me that they made a mistake and actually, there is no problem with fastq files.
already I have some questions about the steps you mentioned in above comments:
- one step in the pipeline is:
samtools sort -@8 -n - -o celegans_LRSIM_pseudohap_noNewLines.sorted.bam
,
Q1: I would ask if it is the-
between-n
and-o
correct, and what does it mean? (it seems that the bam file name specified at the end of the command is input name, no ouput name is required? )
Q2: I didn't runtee
functions in the pipeline. is it necessary to run? no.fof
format is generated. is it ok to specify the file names in .txt file?
Q3: in below arcs command, the input parameters seem a bit ambiguous? would you please give me some explanation about the--bx
parameter?
arcs --bx -v -f celegans_LRSIM_pseudohap_noNewLines.renamed.fa -a celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof -c 5 -m 50-10000 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500
Q4: it seems that in above command the default values has been set again, is it necessary? and is it better to chage the values according to our data?
regards
from arcs.
Hi @mossishahi,
- Yes, the
-
between-n
and-o
is correct. The output of bwa mem is being piped directly to thesamtools sort
, so it indicates that it should read from standard input, not from a specified file. The-o
indicates the output file name. - The
tee
functions aren't necessary -- they are just used in the pipeline for capturing the standard out and standard error to log files. - I'm not too sure what you mean about the input parameters being ambiguous -- If you are unsure what they mean, arcs has a help page which explains them:
arcs --help
.--bx
just indicates that the barcode is read out of the BX:Z tag of the alignments (although that option is deprecated -- ARCS now will look for the barcode in that tag, and then in the read header in the form "read_name"_barcode if it doesn't find it in the SAM tag) - You can certainly set the parameters for ARCS as you wish based on your data - Doing runs with various parameters is a good idea to ensure you find the best parameters you can for your particular data. If you want to use the
arcs-make
file, take a look at the help page --arcs-make help
. Parameters can be specified like this:
arcs-make arcs draft=test_scaffolds reads=test_reads m=50-6000 a=0.9
Hope that helps!
from arcs.
@lcoombe I'm grateful for your guidance.
as you recommended I followed this pipe line:
- run the
longranger basic
on the fastq files - run the
bwa index
on our genome reference - run the
bwa mem
on the 10x libraries and reference
conver the output to bam
sorting bamfiles
and now, the below command:
arcs --bx -v -f ~/e/Planaria_10X/Fastq/For_10x_Denovo_Data/PG2103_03BE5/ref/refdata-DjScaff_fnl20141213/PG2103/outs/pre-ref/DjScaff_fnl20141213.fa -a bam-libs.txt -c 5 -m 50-10000 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b ImprovedDjScaff_fnl20141213
but however the output message wasDone
and the messages during the process was like:
On line 40000000
On line 50000000
On line 60000000
On line 70000000
On line 80000000
but theImprovedDjScaff_fnl20141213_original.gv
file is empty! eventhough theImprovedDjScaff_fnl20141213.dist.gv
is about 11MB.
from arcs.
@lcoombe as I saw you had requested a sample of the genome in similar issues, I prepared it:
15 >DjScaffold8
16 TTCTTTTCCTTTAATGCTTCCTCTAATCAGGCTTCCTTTAGTGGGCTTACTTACGTGGTCTACCTTTAACAATCTGGTTAATTTACTTATGACTCAGTTACCAAAGTAACCTGTACAAATGGTTCTACTGAAAACTGTCTAGGCAAAATGCAATTGAATCTA
17 >DjScaffold9
18 TCAATATAATTCACAATATAAACAGATAATTTAAAATAATTAATAACGAAATAACTAATTAATTGTTCAATAAGAACCAAATAAAGAATTTTTTTTCAAAAGTTATTTCTTAAAGACGTTGAGTATAAAATTTACAGAAGACCATAACTATTGAGAATGGCG
19 >DjScaffold10
20 AAAAAAAATAATAAATAAAATTTTTTTTATGAATTATTTTCCCTAAATTTTACGTGGGATTTTAAAGAGTTTTATCTGCTTATATGTAATTTATGAAACATTTTATCAACTTATATGTAATGTTTGACAAATTTGTTCTAATAAATCAAATAAGTTACCAAA
21 >DjScaffold11
22 TTATAATAACAGTTTGTATTTAACTGATTCTTTGTAGTGGAAATAATGATTTTAAATACTTCAATCATTCAGTGAATAAACTAGTATTACAAATGAGAATTCTTGATCGACTGGTTTTTAATAGTGAAAATTTTCAATTTAATCTGGATATTTATATTATAA
23 >DjScaffold12
24 AAAAGGTAATGTTTAAATCGTGTTGCAAAATAATATATTGCCAGGAGTTCATTTCCTAGTTACACAGTACATCTGCTCTGTCTTTGTAAACTTTTTTTGATCCATAGCAGGTAACTTTCTCTTCATTATTTTGTAACTGGCTAAGAACTACCCCCATAGCAT
25 >DjScaffold13
26 AATTGTTGGACATAACCATTGATCGTTTCTTAGAATTCATTTAAAATAAAAGCTTCGATCAAAAAAATTTTTTTTTGTTTTTTTGAGATTGATTCGATATTTTGACCTTGTTATTGTCAGAAATTGAAAGTCAATTTTCGGTTCCACCCAATTATTTCTAAA
27 >DjScaffold14
28 TGGTTGATGGAATTTGTTGGATGATGTGTTGTTTGAGAGGTAATGTGATGGAGAAAATTTTGGATGCCGATTGTGACTTAATTTACGATTTTTACGGCTTGAATTTGATTGGTTGATTTTTTGAAGTAATATAAGTGTGTTAAAAAATATTTTCGAGCGGAG
29 >DjScaffold15
also we had generated 2 fastq files using longranger basic
,
first lines of barcoded_1.fastq
:
1 @SN7001394:302:HMMVKBCXY:1:1110:9233:53199
2 TCAGAAGAAGATAATAAGAAGGGGATAACCAATCCGAAAGTGACAGTGAAACTGAAATGTGTGAGTAGCATAAATATGACAGCGAAGAATTGGTAGATCGTAGCTGGTTGCCGTCATGTGAAGTAAGT
3 +
4 IHGIIIIHIIHIIIIIIIGIIIGIIIIHIHHHIIIIIIIGIHGIFHGHHIIIIIFHHHHIIIIIIHHIGHHHEHIIGHGGHHGHHIHIHHHIIHIHIIHIIHIIIIIIIIGHHHHHFHHHEHHHHII<
5 @SN7001394:302:HMMVKBCXY:1:1110:9233:53199
6 ATGTAGATAATATAAAAGTCCGTTTAATTCCGTTCAACGTTGCTTCATCCGGGTCCTCCAGTTTTTTCTATTTAAAACTTATAACAGTAAGGTTTGTGCCCTTAAAACAAAGAGCTTCAAAAGTTGCCTTAATAGGTGTTAAAGCATCAAA
7 +
8 BDDDDIIIIIIIIIHIIIIIIHEHIIIIIIEHHIGIIHHIIIIIIIIIIIIIIIIIGIIIIIGIIHIIIIIIIIIIIIGHIFHGIIIIIIIIIHIHIEHHIGHHEHGHIHHGHIIIIIIGIIIHHHGEIIHHIIHIHHHHCHHIIIIIIGH
9 @SN7001394:302:HMMVKBCXY:1:2107:13745:91712
10 TCAACAATTATCTGAGCGATGTGATATTCTCACCGAAATGCTTAAAAACCAATTGAGATCAAAGCTAGCCAGGAACTTCATCTCATTGATGTCAAGACTCCATATGATATTGCTAGCTGTATGGAAAA
11 +
12 HHIIIIIIIIEHIIIIIIIIIIIIIIGIIGIIIIIIIHIIIIIIIHIIIIIIIFHIIIIHIIIIIIHIGHGHIHHGHIFHHHHIIHIGIIIIIIIIGIIIIIHHHHIIIIIIHHIHIIIIIIIIIIII
13 @SN7001394:302:HMMVKBCXY:1:2107:13745:91712
14 TCAAACATCAATTGGAGTGCGCAAACTTGGAGAATACATAATACTATGGTTGCCATCGACAATTGATGTTTGATGAAATGCATTTGCATTTAATGTTGATTTGATTTTTCTTGGAGATTTTCAATCGTGAGAGAATCTGATTGTTGTTGGA
15 +
16 DDDDDIIIIEHIIHIHIHHHDHIIIIIIHIHIIIHGHIIIIIIIIIGHHIHIIIIHIIIIIIHIHIHCGEGHIHHICHEHHCHHIIIIGIHIIIIEEEHHHCHHHHIHICHHFIIIF?GHIHHHHHIHHHHHHHHIIIIIIIIIIHIIGHC
17 @SN7001394:302:HMMVKBCXY:1:2111:3028:90075
18 AGATTGAAAATCGCACAAGGAGTTTCAATATGAAGGGTAGGAAAGCATCTCCTGAACTGAACTCGTATTATTATAACTACTAACGTATTTCTCATTGTATATTTACAGACATTTCATATCGTAATTTT
19 +
20 IIHHIIIIIIIIIIIIIIIIIIHHHIIIIGIHIIIIIIIHHIIIHIIIIIHIIIIIIIHHIIIIGHHHHHIIIIIIHIHHHIIIIIHHHHFHHHEHHHIHHHIIIGHHHIIIIHI?FG@HHHIFIHII
21 @SN7001394:302:HMMVKBCXY:1:2111:3028:90075
22 TTTTGCGTTTTTTTTGTTGGCTACAAAAATAAAAAAATTATTTTTAAAAAAAAAAAAAGGAGCAAAAAAAAATAAATATTTTTTTTTTTAAAAAAAAAAATTTTTTTTTTTATAATTAATTTTAGTTTTTTTTATGTGGTTTTAAGTTATT
23 +
24 0<0001//</<1//</11111111<1<111111<</C111<111<<1DC?F//</<//00111<<C@H/CE/111111<<<GE<CCE</1<1C1//</</1011D//////00000000000<000</0///:://///8//.//////;.
25 @SN7001394:302:HMMVKBCXY:1:2103:4787:39729
26 CCATGGCAAGGGAGGAGAACAAGGAAATATGGAGACCGACTTGATTACACCTAGAAGCAATGTGCTGGGCAGATAAAATATGCCACTGCTGTGGCAGAAAAGGGCACATATCAAAAAATTGTAGCAAT
27 +
28 IIIIIIIIIIIIIIIHIIGIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIHHEHHHIIGHIIIIIIIIIIIIIIIHIIIIIIHIIIIIIGIHIIIIIIIIHIGIHII0
and first lines of barcoded_2.fastq:
1 @SN7001394:302:HMMVKBCXY:2:2215:18724:23724
2 GATCAATTTACCAGATATATGTGTCGGCACAACATCAAAATTAAAAAGACCATGGCAAGGACCATATGAGGTGATAGAGGTGACGGAAACGAACCTCAAGGTTCGGAAAAAAGGGAATTTACTGGAGA
3 +
4 IEFFEEHGHIHEGEEHHEHFH<<GHHICHHFEHCFCGHHI@@CHEGHDEHHIHEHIIHHGI1D?@GEH?CGCGH@@HFEH1<FHGHECCEHICFC11DGIH<1<<C0DCC0F@=@/<//FH@//CCH/
5 @SN7001394:302:HMMVKBCXY:2:2215:18724:23724
6 ATTAATCCTTCCCCGTTTTCTTCAAACTCAGCGTCCTCGCTGAGACCGTTTTTATTTATCATGGTTCCTTTACCCCTTGGCAAAGAAAGTATTAGTGTTTTACTGAGCACCGTCATGGCTTGTTGTGATCGTAATGGATAACTGTGTGATA
7 +
8 <<@@<<11CEHE1FE0D1<DG@H<@111DFCE/<D<CCGHEH@1GHEC<0<1<1<11<1D<G1D<GC1<1<1<GHII?<1<C<<E1D1D<<F111<1DC1<1<1<1<<1DFDHEG1@@HH?C111<DE1F1<<C<1<11111<<1<<1111
9 @SN7001394:302:HMMVKBCXY:2:2105:15021:65237
10 AGAGATTAATGAACACAGTATTGATCAATGTAGAACATGTAATACTTTATATGGATGACATTTTGATTGCATCAGAATCATGTGAGCAACATTTAATAAACATAGAGAATATGATGTCCAAATTACAA
11 +
12 IIIIIIIHHIHIIIIIIHIIIIIIHIHHIIIIIIIIIHHHIIIIIIIIIIHIIHGHIIIIIIIIIIIIIIIIHIIIIIGIIIIIIIIIIIIIIGIIIHHHIIGHHIIIGGHIHHIIIIIIIIIIIIII
13 @SN7001394:302:HMMVKBCXY:2:2105:15021:65237
14 TGTTATTGTTTAATGAAAGAAAAACATCTTTGAACAGAGTATGAAACCCATAGTTGCTCGGTATTAATTTTTCCTTTAACGTATCAAATTTTTTATTGATTGTTTCCCCATAGTTTTTGTATTTTCGCTGAATTTTCATCAAGTTTTTTGG
15 +
16 DDDBBHIIIIIIIIIIIIIIIIHIIIHIIIIIIIIIIIIIIIIIIIIIHHIIIIIHIIIIIIHHHIHHIGIIIIHHIGIIGHHIIIIIGIFHIIIIIGIIIGHHHHHIIHIIGHHHHIGIIIHIIEHIHIHHIIHHIHHIHIIHHIHH<E<
17 @SN7001394:302:HMMVKBCXY:2:1205:17368:72333
18 AATTATAAATATATTGTTAAAAGTTTAATAAGTTTTTAATCTTATCTATAATGTGTGTATCTATACGTGCTTTTGGTGTGTATTATTTTTCTTATATCTTTCTATAATGATATAGGTACTTATTAAAC
19 +
20 IHIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIHIIIIIIIHIHIIIIIIIIIIIGIIIEHFEFHHGHDEHHHIGIIIIIIIIGIHIHIIIIIHIIIIIIIHGHIIIIHHHHIIIIIIIF
21 @SN7001394:302:HMMVKBCXY:2:1205:17368:72333
22 ACTAAATTAACCATATTTTATTCATTTAATTTTATATTTTATTCTTTTTAAAAATATTTGATTTTATAAAAATACATTAAAATAAACATCTTTTATTCATTTAATTTTATATTTTAAAAATCCTTAAATTTTAGTAATGTATTTTATAAAT
23 +
24 DDDDDIHHIIIIIHIIIIIIHHIIHIHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIIIIFGHHIIHIIIIIHIIGHHHIIHIHHIIIIIIIDHHIIIHIIIIIHHIHHH<GHHEHHIIIGHEFIHHIIIIIIIIIIIIGHH
25 @SN7001394:302:HMMVKBCXY:2:2213:3592:77185
26 ATATTTAAATTTCAAATAATTAATATATTACGTTGTGGTCTTTTATTATACATATATATATATTGCATTTCATTATTATTTCTTGTAATTAAATTAAAAAAATCGTATATTAAAAAAAAAAAAATAAT
27 +
28 IFHHHHIIIEEFEGHHGHIFEHH@G1FFEH?ECCCG00CGEGFHFHICFE1CF@C@GHHFHHHH@1<CHE1<GEHHFE<CF1<<1<<CCGFEFH@EGF@ECCG10DFEEGH?@E<C/C/C<C</<C<<
from arcs.
@lcoombe now I noticed that unfortunately I've forgot to run the simple command:
perl -ne 'chomp; if(/>/){$ct+=1; print ">$ct\n";}else{print "$_\n";} ' < x.fa > x.renamed.fa
and it might be because of that. if it's the reason, whiche steps should be rerun?
should 'bwa index' be done again?
from arcs.
@mossishahi - Just to be safe, yes I do rename the scaffolds to numbers before running the ARCS pipeline. If you rename the scaffolds, you need to re-run the bwa indexing, alignment and ARCS steps again.
For the read files -- I'm assuming that reads further down the file do have a BX:Z:
tag, just not the first few that you posted?
If that doesn't solve the issue, it would be helpful for you to post the exact commands you ran, the log from ARCS and perhaps a few lines from one of your alignment files. It would also be helpful to know the expected size of your genome, the N50 of your draft assembly and the sequencing coverage of your 10x data.
Also just a head's up that I'm on vacation this week so I won't be checking my e-mails regularly until next week.
from arcs.
@lcoombe I'm really thankful for your attention and sorry because of the distractions four you on your vacation.
I've renamed the scaffolds and re-run the process.
all of the reads have barcodes on their headers.
I hopefully wait for the end of the process and will inform you of the results.
regards.
from arcs.
@lcoombe , I hope not to distract you on your vacation, I put the issue messages here just to be saved by the time I'm facing them and not to forget them.
during this command: bwa mem -t8 -C -p DjScaff_fnl20141213.renamed.fa barcoded_2.fq.gz | samtools view -Sb - | samtools sort -@8 -n - -o DjScaff_fnl20141213.sorted_2.bam
I noticed a suspicious message:
However there is some commands like:
candidate unique pairs for (FF, FR, RF, RR): (2167, 75752, 649, 1871)
analyzing insert size distribution for orientation FF ...
(25, 50, 75) percentile: (112, 191, 285)
low and high boundaries for proper pairs: (1, 804)
and also some lines like this for FR,RF, RR
but as you see more details in the screen, there was some lines like:
skip orientation FF
skip orientation RF
skip orientation RR
and this code fragment is repeated in the process.
does it indicate a failiure in our process?
from arcs.
@lcoombe I re-run the pipeline and fortunately now the_original.gv
file is not empty. but while I try to run the makeTSVfile
it has below error:
File "/s/chopin/a/grad/asharifi/e/Applications/arcs/Examples/makeTSVfile.py", line 82
estDist = True
^
TabError: inconsistent use of tabs and spaces in indentation
kind regards
from arcs.
@mossishahi This looks like normal behaviour of bwa
. I would not worry about it unless it says something like "warning" or "error".
from arcs.
@benvvalk thanks alot but it seems that the folder of URL you mentioned does not contain Example/makeTSVfile.py
and I also couldn't find the file some where else.
from arcs.
@mossishahi Oh dear, I see you are right. That file is not included in the release tarball.
You can download the latest version of the file from here instead: https://github.com/bcgsc/arcs/blob/master/Examples/makeTSVfile.py
(click the "Raw" link)
from arcs.
Thanks @benvvalk!
@mossishahi - Were you able to successfully run the ARCS pipeline using the updated makeTSVfile.py script?
from arcs.
@lcoombe thanks for your follow up. We had a problem with the LINKS, because of the bloomfilter module which I hope to be solved soon.
from arcs.
@lcoombe I'm grateful for all of your friendly and kind guidance.
It seems our process is done, the result is below:
A C G T N IUPAC Other GC GC_stdev
0.3651 0.1349 0.1349 0.3651 0.3573 0.0000 0.0000 0.2698 0.0515
Main genome scaffold total: 200912
Main genome contig total: 450272
Main genome scaffold sequence total: 1565.210 MB
Main genome contig sequence total: 1006.033 MB 35.725% gap
Main genome scaffold N/L50: 11387/28.531 KB
Main genome contig N/L50: 62048/4.169 KB
Main genome scaffold N/L90: 69220/3.615 KB
Main genome contig N/L90: 275294/907
Max scaffold length: 760.01 KB
Max contig length: 406.522 KB
Number of scaffolds > 50 KB: 5411
% main genome in scaffolds > 50 KB: 35.91%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 200,912 450,272 1,565,209,624 1,006,032,708 64.27%
250 200,912 450,272 1,565,209,624 1,006,032,708 64.27%
500 200,886 450,246 1,565,197,188 1,006,020,272 64.27%
1 KB 124,602 371,431 1,510,377,283 951,541,159 63.00%
2.5 KB 81,422 322,785 1,447,707,331 889,406,218 61.44%
5 KB 56,648 276,090 1,356,610,958 833,018,628 61.40%
10 KB 36,233 228,443 1,191,766,199 756,926,814 63.51%
25 KB 13,403 147,175 836,431,662 562,693,175 67.27%
50 KB 5,411 95,363 562,067,174 395,565,653 70.38%
100 KB 2,081 54,534 330,835,144 238,452,291 72.08%
250 KB 178 8,876 57,545,169 43,193,791 75.06%
500 KB 7 468 4,187,947 3,457,694 82.56%
though the detail statistics of our previous draft genome is:
A C G T N IUPAC Other GC GC_stdev
0.3651 0.1349 0.1349 0.3651 0.3572 0.0000 0.0000 0.2698 0.0513
Main genome scaffold total: 202925
Main genome contig total: 450272
Main genome scaffold sequence total: 1565.189 MB
Main genome contig sequence total: 1006.033 MB 35.725% gap
Main genome scaffold N/L50: 13220/27.741 KB
Main genome contig N/L50: 62048/4.169 KB
Main genome scaffold N/L90: 71225/3.615 KB
Main genome contig N/L90: 275294/907
Max scaffold length: 760.01 KB
Max contig length: 406.522 KB
Number of scaffolds > 50 KB: 5780
% main genome in scaffolds > 50 KB: 32.61%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 202,925 450,272 1,565,189,494 1,006,032,708 64.28%
250 202,925 450,272 1,565,189,494 1,006,032,708 64.28%
500 202,899 450,246 1,565,177,058 1,006,020,272 64.28%
1 KB 126,612 371,427 1,510,354,975 951,539,230 63.00%
2.5 KB 83,429 322,776 1,447,679,403 889,398,825 61.44%
5 KB 58,645 276,062 1,356,542,964 832,983,212 61.40%
10 KB 38,209 228,363 1,191,510,678 756,784,954 63.51%
25 KB 14,951 145,472 828,163,693 556,904,021 67.25%
50 KB 5,780 86,260 510,443,509 358,821,201 70.30%
100 KB 1,589 36,264 222,480,828 160,529,419 72.15%
250 KB 37 1,626 12,140,407 9,503,035 78.28%
500 KB 2 20 1,479,103 1,463,639 98.95%
the statistics are extracted form data by bbstats
command of bbmap
One point we noticed after analyzing the result of Arcs in comparison with the previous draft genome is that Arcs utilizes 10X data to make bigger scaffolds but the total size of the genome has no change, Actually, it doesn't insert the 10X reads in the genome string and only links some scaffolds to make bigger scaffolds.
It's important for us to find a way to do that. Do you have any idea?
the other point that you can see the N50 which is reduced and it's surprising.
Thanks.
from arcs.
@mossishahi - Glad the pipeline finished!
That is correct - ARCS is purely a scaffolder using 10X reads, so will join sequences together and introduce N
s between the sequences (with or without the gap sizes estimated, depending on if you used the -D
option). ARCS looks for sequences that have many 10X barcodes in common as evidence that the sequences are near to each other in the genome, and should thus be joined - It doesn't reconstruct any sequence between the joined contigs. (If you want more information about how the algorithm works, take a look at our paper in Bioinformatics: https://doi.org/10.1093/bioinformatics/btx675).
To attempt to fill in the gaps introduced by ARCS in your scaffolded assembly, you would need to use a downstream gap-filling tool such as Sealer (https://github.com/bcgsc/abyss/tree/master/Sealer).
Yes, that is surprising that the N50 is lower - Do you see the same for the NG50 metric? (To calculate this stat I use abyss-fac from the ABySS tool suite.) I don't know what parameters you played with, but in our runs we commonly vary -a
(Higher = less stringent). (Note that in varying parameters for LINKS, such as -a
that you only need to re-run the final LINKS step)
from arcs.
@lcoombe thanks.
as you suggested I ran the abyss-fac, the result:
the new draft:
n n:500 L50 min N80 N50 N20 E-size max sum name
200912 199536 9264 500 3920 21174 82789 47857 745447 1.006e9 --
the old draft:
n n:500 L50 min N80 N50 N20 E-size max sum name
202925 201549 11026 500 3919 20210 62251 36572 745447 1.006e9 --
now, the N50 is reaonable, but if n:500 meaning is the number of scaffolds longer than 500, why it's decreased? whats the meaning of E-size and sum?
the other point, we didn't run Tigmint before the Arcs.
should we re-run the whole process suggested by the tigmint-make arcs
or there is a shortcut to use the previous generated files? (as I noticed there is some differences in commands options, for example in samtoosl view, or bwa mem and etc. )
I'm studying in order to find a way to make toal size of our draft genome bigger. there are some options: one is to use gap-filling tools like sealer (the point is that the draft genome is made of paired-end and mate-pair reads extracted from the library that are different from 10x data library.) the other option is to run a denovo assembly using both the paired-end, mate-pair lib and 10x library (I'm not sure if abyss assembler uses the 10X data only for scaffolding (like the Arcs) or inserts the 10x reads in genome string).
one other idea is to use the draft improved by Tigmint and Arcs and the utilize gap-fillers using paired-end libraries as the inputs.
thanks for your patience.
from arcs.
Hi @mossishahi,
Since ARCS performs scaffolding (merging sequences), you do expect the number of scaffolds to decrease. The n:500 decreasing just indicates that ARCS was able to merge sequences >=500 bp.
The sum
is the total number of bases in scaffolds >= 500bp, and the E-size
is the expected size of a randomly chosen scaffold (By randomly choosing a base in the assembly). Note that by default, abyss-fac
does NOT count ambiguity codes when calculating these stats.
If you wanted to run Tigmint before ARCS, you can use the same arcs-make
file I mentioned above, but use arcs-tigmint
as the target instead of arcs
. Tigmint does also use BWA alignments of reads to the draft genome, so you could potentially re-use the same alignments as you used for ARCS, but you would need to re-run samtools sort
as Tigmint requires the alignments to be sorted by BX:Z
tag. However, you would need to re-do all the other Tigmint steps and all the ARCS steps. If the alignments didn't take too long, it it probably simpler to just re-do all the steps.
Yes, any of the options that you mention would potentially work.
For the gap-filling approach, if you are using Sealer, you can use any set of reads that you want - Ie. you could use the PE reads and the 10X reads if you want (we often use this approach for our genome assemblies).
For running a new de novo assembly, take a look at the newest release of ABySS. You can specify all the read sets that you have (PE, MP, 10X), and it will perform the ABySS assembly, then run tigmint and ARCS (https://github.com/bcgsc/abyss#scaffolding-with-linked-reads). It will use the 10X reads for building the de Bruijn graph, so in that sense bases from the 10X reads will be incorporated into the assembly. However, it uses ARCS for scaffolding with the 10X data following the initial ABySS assembly, so these merges will introduce gaps as we discussed above.
Hope that helps!
from arcs.
@lcoombe I tried to run tigmint-arcs process, but the result was unexpected for me. Surprisingly, the important factors such as N50 and Sum has been decreased by a large magnitude. Is it normal?
n n:500 L50 min N80 N50 N20 E-size max sum name
223600 217764 13453 500 3392 16324 51247 31755 745447 1.004e9 DjScaff_fnl20141213.renamed.tigmint.arcs.fa
It's also amazing for me that the factors have been decreased toward the experiment which the Arcs had been only run!
from arcs.
@mossishahi - So if I understand correctly, you are saying that the N50/sum are lower for Tigmint+ARCS vs. just ARCS? That is possible depending on how many cuts Tigmint made. We do normally see that breaking with Tigmint first does end up increasing the N50 post-ARCS (vs. no Tigmint), but this will likely depend on different factors such as the parameters used for the tools, coverage of 10x data, etc.
Comparing the metrics of baseline, post-Tigmint, post-Tigmint/ARCS, post-ARCS assemblies is a good idea to best understand your results.
Here's the Tigmint biorxiv paper which shows examples of running Tigmint/ARCS on various different assemblies: https://doi.org/10.1101/304253
from arcs.
We are seeing the behaviour your describe (drop in contiguity) in some genome assemblies, @mossishahi. If you have a way of assessing the assembly correctness (using QUAST and a reference or a close reference, I strongly suggests that you do). This is independent of ARCS, however.
It looks like the original issue you were having (empty _original.gv file) has been resolved. Accordingly, I will now close the ticket. Feel free to open a new issue, as needed.
from arcs.
Hi,
we are facing a similar issue, but with inconsistent outcomes depending on the sample/bam file. We have two 10X libraries from the reference individual, one ca 60x and one ca 15x, and two 10X libraries from individuals from the same species with ca 60x and 25x coverage. However, arcs only outputs a complete *original.gv file for the 60x coverage library of the reference individual, for the others it is empty. We use the exactly same pipeline to produce them. Any insights?
(Note that also for the one that outputs the file and the according tsv table, LINKS fails to produce scaffolds; opened a separate issue on the LINKS page).
Thanks a lot for any insights & help!
Reto
from arcs.
If you have a fully empty output graph file, that usually indicates an issue with the input file formatting. Take a look at the log files to see if there are any warning messages, and/or where many alignments are being filtered out.
Are you using the provided ARCS makefile (Examples/arcs-make)? I recommend that for issues like this -- using that helps to fix formatting errors that could be causing the empty graph file and perhaps the LINKS issue you are experiencing.
from arcs.
Dear Lauren,
Thanks a lot for your swift reply. Unfortunately, I have to admit that I have pretty much no clue how to run the make pipeline, for which reason I have followed the procedure outlined under issue #9.
ARCS seems to run just fine, with no error messages whatsoever. Same holds for LINKS. One thing I see in the ARCS log is that there seem to be no scaffold-end barcodes retained (see also below):
{ "All_barcodes_unfiltered":2314172, "All_barcodes_filtered":1549248, "Scaffold_end_barcodes":0, "Min_barcode_reads_threshold":5, "Max_barcode_reads_threshold":10000 }
Is this what your question about where alignments are filtered out refers to?
Our aim of using ARCS+LINKS is to scaffold a ca 1.3 Gb PacBio assembly with 1967 scaffolds that is polished with Pilon and Tigmint. The contig N50 of this assembly is 8.6 Mb, with contigs up to 45 Mb. For scaffolding we have 10X data from the same individual (ca 60x) that has been sequenced from UHMW DNA (mean fragment length 200kb) extracted from blood, DNA from the same individual extracted from liver (ca 15x), but could also use similar data from different individuals (which is something I started trying because we got no scaffolding from the former data).
As already mentioned, we are facing two problems:
- For our main data with 60x coverage, ARCS seems to work, but LINKS produces no scaffolds
- For the other data, the original.gv file is empty, although alignments are produced with the exactly same script.
I have checked both the interleaved reads files and the bam files, and they seem to look fine. Here a couple of lines from one of the individuals on which ARCS fails:
Interleaved fastq.gz:
@A00621:126:HMH3GDSXX:2:2468:19840:5854_AAACACCAGACAATAC TCATAAAAGTTTCTATTACTAGAGTTTAGGATAAATAATAGCAGAACATAAAACAACCAGCCAGGATGTGGCTCCCAAATCCAGAATTTAAATACAGGGATAAAAACTTCCATAAAATATAAATTCTT + FFFFF,:F,:,,FF:F,,,F:,FF:FFFF,:,,,FF,,,,,,F,FFF,,FFFFF,FFFF,F,F,FF:,F,F,:FFFFF,F,,F:,F,,:FF,F:FFFF,F,FFF,,F,,,F:,FFF::FF,FF,FFF, @A00621:126:HMH3GDSXX:2:2468:19840:5854_AAACACCAGACAATAC TCCAAAATGTTTATATAAAATTTATACTGTGAGGAAATGGACAAGATTTTACATTTACATACAAAAATTAGGGGGAAGATCTTTTCATCACATTATAAATTTTATATAAACATTTTGGAAGGGAAATAATATGTGGCAATAAGATAGACTG + FFFFF:F::FFFF:FFFFFFFFFFFF,:FFFFFFFF:FF:F:FFFF::FFFFF,F:FFF::FFFFF:FFFFF:FFF,FF:F:F:FFFFFF::,FF::F:F::FFFFFFF::,,,F,,FF,FFFFFFFFF,:FF,FFF,F,F:,FF,F,FFF @A00621:126:HMH3GDSXX:1:2565:19759:29982_AAACACCAGACAATAC GAAATAAAATATATGCAAACACCAATCCTGGAGAGCTCAAAATGATGCTTCACTACATACCTAAAACACAGACATCACATGTTGCCTCAACACACAACATCGACACTCAATCCCTTCAAAACATCATT + F,FF:,:F::FFFF,,,FFFFFFF:FF,F,,:,:,FFFF,F,F,FFFF,F,,::,:F,FF:FF,,,FFF:,:,:FF:,,,,F,,FFFF:,FFF,F,,,,F,,,,F,F,F,FFFFF,F,,F,F,F,,FF @A00621:126:HMH3GDSXX:1:2565:19759:29982_AAACACCAGACAATAC CTGCTCAGGTCCGTGCAGCTGATCTGAAGCTGCGGGTGTCTGCGTGTGCCGCGGCGCCGAGTGCGTGCGCGCTCCCCCTGCGCTGCACAGCTGCTACCCAAGAAAGGCTAAATATACTCATTCCTCCGCTTTCTGCTGAGAACAGAAACTC + ,F:FFFFF:,,,FFFFF,FF,,,F::FFFFFFF:F::::F,F,FF,,,:FF:,FFFF:FFFFFF,FFFF:FFF::FFFFFFFFFFFFFFF,::F,,FFF::FF:FF:FF::F:FFF,FF,F,FFFFFFF:,FFF,F:::FFFF,FFF:F:
bam file:
A00621:126:HMH3GDSXX:1:1101:1000:12962_AGTTGGTTCACGGTAT 65 289 434235 24 105S23M 191 468286 0 TGTGGTCAGGTGGGGCTGGTCTCCTTCTCCAGGCAGCAACTGACAGAAAGAGAGGACACATTTTCAAGCTGTGTCAAGGGAAAATTAGGTTCGAAATTAGAAAAAAGTTTTTCACAGAAAGAGTGATT A00621:126:HMH3GDSXX:1:1101:1000:12962_AGTTGGTTCACGGTAT 129 191 468286 1 78S57M16S 289 434235 0 CCACAGAATCACTGCAATGTAAGAAAACTTCAAGATCATCGAGTCTAAACCTGCTTTAACACCTGAACTAGAGGATGACACCAAGTGCCATATCCAGCCTTGTTTTAAACACATCCAGGGATGGTGACTCCACCAGCTCCCTGGGCAGACC A00621:126:HMH3GDSXX:1:1101:1000:13275_AGGGAGTGTACCGTTA 69 84 1521596 0 * = 1521596 0 GAAAAATCAGGTACAGAATGAATAACATTAATTAAAAGATTTCTTCAAGAGAGATTCTTTATTTGGACTTCAGTCTTATTTCTATACCCTGATATAGACCTCTGTTGGTCTTATGGTCTAATTTACAG A00621:126:HMH3GDSXX:1:1101:1000:13275_AGGGAGTGTACCGTTA 137 84 1521596 30 25S43M83S = 1521596 0 ACATCTGAGTACAGATTTTTTACTATGTGTGTATATATATATATATATATATATATACACATATATATCTATCTCTCTCTCTCTCAAATTTTTCTTAATATAAAAAGAATTTTCAGAAAATTGTGCTGCTCATAAAAAAAATGAGTCTGTA
Regarding the issue that LINKS does not scaffold any contigs, with e up to 50000 there are mainly single shared barcodes. Still, there are contigs with quite a couple of barcodes. Here the ones with most shared barcodes from the tsv file:
`
1097- | 501+ | T | 1016 | 14615 | 46856 | 1576606 |
---|---|---|---|---|---|---|
501- | 1097+ | T | 1016 | 46856 | 14615 | 1576606 |
1001+ | 752- | T | 622 | 20798 | 19404 | 1576606 |
752+ | 1001- | T | 622 | 19404 | 20798 | 1576606 |
1946+ | 752- | T | 587 | 25938 | 19404 | 1576606 |
752+ | 1946- | T | 587 | 19404 | 25938 | 1576606 |
1859- | 1946- | T | 554 | 17820 | 25938 | 1576606 |
1946+ | 1859+ | T | 554 | 25938 | 17820 | 1576606 |
1859- | 752- | T | 532 | 17820 | 19404 | 1576606 |
752+ | 1859+ | T | 532 | 19404 | 17820 | 1576606 |
`
I have run both ARCS with a range of parameters, including different values for c (2,5), m (5-10000), and e (up to 100000). One thing I observe is that with e up to 50000 there are no scaffold-end barcodes. This starts changing from 75000 on, but to no effect on scaffolding.
from arcs.
Hi @burri-wildlight,
If you go to the README documentation, you see this line:
To run the pipeline in default mode, run Examples/arcs-make arcs. For more info check Examples/arcs-make help.
Running that command Examples/arcs-make help
as documented will give you a very detailed description of how to use the Makefile. Using it is much easier and less prone to errors than following the multi-step process, and I suggest that you start there because it is otherwise hard for me to know if the issue is formatting-related or data-related (There are no commands or full logs for me to look at).
Running the makefile is as easy as:
Example: To run arcs with myDraft.fa, myReads.fq.gz (interleaved, processed by longranger basic with the barcode in the BX tag), and a custom c value and multiplicity range, run:
./arcs-make arcs draft=myDraft reads=myReads m=[User defined multiplicity range] c=[c val]
To ensure that the pipeline runs correctly, make sure that the following tools are in your PATH: bwa, tigmint, samtools, arcs (>= v1.1.0), LINKS (>= v1.8.6)
As stated, any non-default parameters you want are specified there and you can add -n
to do a dry-run.
If using the Makefile doesn't solve the problem, maybe open a fresh issue so we can have the discussion on a dedicated thread to keep things more organized.
from arcs.
I'm very glad you figured it out and thank you for posting your solution! Yes, usually the issues with no scaffolding end up being an issue with one step of the pipeline or file formatting, which is why we made the Makefile in the hope of making these things easier for the user.
Also for the benefit of future users -- without seeing your full arcs-make
command, it looks like you probably ran arcs-make arcs draft=oenMen1_1.3.fa reads=reads
or something like that. Looking at the example I put above, you'll see that the Makefile expects the draft and reads files to be provided without their respective .fa
and .fq.gz
suffixes.
from arcs.
Related Issues (20)
- Regarding the error info "File contains unpaired reads" HOT 7
- About Running ARCS in default mode HOT 3
- GCC version update HOT 1
- arcs-1.2.2: abyss-fixmate-ssq sometimes segfaults HOT 9
- Short read length question: can the tool accept 250? HOT 2
- arcs-long: tiny PE-pairs are produced HOT 3
- Scaffolding Expectations HOT 12
- Running ARCS and ARKS HOT 2
- Parameters for corrected ONT reads HOT 2
- arcs-make in bin HOT 2
- Add optional dependency of pigz HOT 1
- unrecognized option '--fastq' HOT 3
- fastq formatting barcodes in BX tag HOT 8
- ARKS fails to create any links with a highly repetitive input HOT 4
- Parameters for PacBio HiFi data HOT 3
- Can this software use HiC data, or need to use the sequencing technology mentioned in your paper? HOT 1
- `arcs-tigmint` and `arks-tigmint` struggle with input files outside working directory HOT 3
- Incorrect program call? HOT 6
- `arcs-long`/`arks-long` vs `LINKS`? HOT 2
- Understanding specific scaffolding output HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arcs.