arcs's People

Contributors

jtse02, justinchu, jvhaarst, jwcodee, lcoombe, murathangoktas, sarahyeo, sjackman, theottlo, warrenlr


arcs's Issues

empty _original.gv file

Hi!
We followed these steps in a pipeline to improve a draft genome of a Planaria species:

  • ran longranger align on 4 FASTQ files and our draft genome to get a BAM file
  • ran ARCS with a command like this:
    arcs -f ./DjScaff_fn120141213.fa -a ./bamfilename.txt
    but we didn't get the expected output; while it ran we kept seeing the message
    skipped .. unpaired reads
    and the final output message was:
Warning: Skipped 531501621 unpaired reads. Read pairs should be consecutive in the SAM/BAM file.
{ "All_barcodes_unfiltered":2297724, "All_barcodes_filtered":1502436, "Scaffold_end_barcodes":33849, "Min_barcode_reads_threshold":50, "Max_barcode_reads_threshold":10000 }

=> Pairing scaffolds... Thu Jul 19 07:51:44 2018

=> Creating the graph... Thu Jul 19 07:51:44 2018

=> Writing graph file... Thu Jul 19 07:51:44 2018

      Max Degree (-d) set to: 0. Will not delete any vertices from graph.
      Writing graph file to ./pre-ref/DjScaff_fnl20141213.fa.scaff_s98_c5_l0_d0_e30000_r0.05_original.gv...

=> Creating the ABySS graph... Thu Jul 19 07:51:44 2018

=> Writing the ABySS graph file... Thu Jul 19 07:51:44 2018

=> Done. Thu Jul 19 07:51:45 2018

and output files:
DjScaff_fnl20141213.fa.scaff_s98_c5_l0_d0_e30000_r0.05_original.gv
which is empty
and
DjScaff_fnl20141213.fa.scaff_s98_c5_l0_d0_e30000_r0.05.dist.gv
which has a content like this:

digraph arcs {
     2 "DjScaffold1+" [l=33983]
     3 "DjScaffold1-" [l=33983]
     4 "DjScaffold2+" [l=33879]
     5 "DjScaffold2-" [l=33879]
     6 "DjScaffold3+" [l=45345]
     7 "DjScaffold3-" [l=45345]
     8 "DjScaffold4+" [l=25271]
     9 "DjScaffold4-" [l=25271]
    10 "DjScaffold5+" [l=39436]
    11 "DjScaffold5-" [l=39436]
    12 "DjScaffold6+" [l=143049]
    13 "DjScaffold6-" [l=143049]
    14 "DjScaffold7+" [l=7664]
    15 "DjScaffold7-" [l=7664]
    16 "DjScaffold8+" [l=45170]
   ...

and, as expected, running the Python script on it produced the error below:

  File "/s/chopin/a/grad/asharifi/e/Planaria_10X/Fastq/For_10x_Denovo_Data/PG2103_03BE5/ref/refdata-DjScaff_fnl20141213/PG2103/outs/pre-ref/DjScaff_fnl20141213.fa.scaff_s98_c5_l0_d0_e30000_r0.05_original.gv", line 1
    graph G {
          ^
SyntaxError: invalid syntax

Do you have any idea?

thanks
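The "Skipped ... unpaired reads" warning above usually means the alignments are coordinate-sorted rather than name-sorted, so mates are no longer adjacent. A quick sanity check (my own sketch, not part of arcs; the SAM records are made up) counts how many adjacent lines share a read name:

```shell
# Build a tiny SAM-like file; in a properly name-sorted file, most adjacent
# records pair up (same QNAME in column 1).
cat > sample.sam <<'EOF'
readA 99 scaf1 100
readA 147 scaf1 300
readB 99 scaf1 500
readC 147 scaf2 900
EOF
# Count consecutive record pairs that share a read name.
awk 'NR>1 && $1==prev {paired++} {prev=$1} END {print paired+0 " consecutive mate pairs"}' sample.sam
```

On real data, run the same awk over `samtools view yourfile.bam | head -n 100000`; if almost nothing pairs up, re-sort the BAM by read name (e.g. `samtools sort -n`) before running arcs.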

Running runARCSdemo.sh - IOError: [Errno 2] No such file or directory: 'hsapiens-8reformat.fa.scaff_s98_c5_l0_d0_e30000_r0.05_original.gv'

After installing ARCS, I tried to run the example data:

root@user-lubuntu:/mnt/hgfs/SharedFolders/programas/arcs/Examples/arcs_test-demo# ./runARCSdemo.sh
Downloading sample Chromium read alignment .bam file and human genome assembly draft...
--2018-04-28 23:37:55--  ftp://ftp.bcgsc.ca/supplementary/ARCS/testdata/NA24143_genome_phased_namesorted.bam1.sorted.bam
           => “NA24143_genome_phased_namesorted.bam1.sorted.bam”
Resolving ftp.bcgsc.ca (ftp.bcgsc.ca)... 134.87.4.91
Connecting to ftp.bcgsc.ca (ftp.bcgsc.ca)|134.87.4.91|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /supplementary/ARCS/testdata ... done.
==> SIZE NA24143_genome_phased_namesorted.bam1.sorted.bam ... 23218461706
==> PASV ... done.    ==> RETR NA24143_genome_phased_namesorted.bam1.sorted.bam ... done.
Length: 23218461706 (22G) (unauthoritative)

NA24143_genome_phased_name  40%[===============>                        ]   8.69G  --.-KB/s    in 4h 33m

2018-04-29 04:11:19 (556 KB/s) - Data connection: Connection timed out; Control connection closed.
Retrying.

--2018-04-29 04:26:20--  ftp://ftp.bcgsc.ca/supplementary/ARCS/testdata/NA24143_genome_phased_namesorted.bam1.sorted.bam
  (try: 2) => “NA24143_genome_phased_namesorted.bam1.sorted.bam”
Connecting to ftp.bcgsc.ca (ftp.bcgsc.ca)|134.87.4.91|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /supplementary/ARCS/testdata ... done.
==> SIZE NA24143_genome_phased_namesorted.bam1.sorted.bam ... 23218461706
==> PASV ... done.    ==> REST 9335154088 ... done.
==> RETR NA24143_genome_phased_namesorted.bam1.sorted.bam ... done.
Length: 23218461706 (22G), 13883307618 (13G) remaining (unauthoritative)

NA24143_genome_phased_name 100%[++++++++++++++++=======================>]  21.62G   752KB/s    in 5h 47m

2018-04-29 10:13:43 (651 KB/s) - Control connection closed.
Retrying.

--2018-04-29 10:28:45--  ftp://ftp.bcgsc.ca/supplementary/ARCS/testdata/NA24143_genome_phased_namesorted.bam1.sorted.bam
  (try: 3) => “NA24143_genome_phased_namesorted.bam1.sorted.bam”
Connecting to ftp.bcgsc.ca (ftp.bcgsc.ca)|134.87.4.91|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /supplementary/ARCS/testdata ... done.
==> SIZE NA24143_genome_phased_namesorted.bam1.sorted.bam ... 23218461706
The file is already fully retrieved; nothing to do.
2018-04-29 10:28:47 (0.00 B/s) - “NA24143_genome_phased_namesorted.bam1.sorted.bam” saved [23218461706]

--2018-04-29 10:28:47--  ftp://ftp.bcgsc.ca/supplementary/ARCS/testdata/hsapiens-8reformat.fa
           => “hsapiens-8reformat.fa”
Resolving ftp.bcgsc.ca (ftp.bcgsc.ca)... 134.87.4.91
Connecting to ftp.bcgsc.ca (ftp.bcgsc.ca)|134.87.4.91|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /supplementary/ARCS/testdata ... done.
==> SIZE hsapiens-8reformat.fa ... 3074865858
==> PASV ... done.    ==> RETR hsapiens-8reformat.fa ... done.
Length: 3074865858 (2.9G) (unauthoritative)

hsapiens-8reformat.fa      100%[=======================================>]   2.86G   459KB/s    in 86m 58s

2018-04-29 11:55:48 (575 KB/s) - “hsapiens-8reformat.fa” saved [3074865858]

Running ARCS...
Converting graph for LINKS...
Traceback (most recent call last):
  File "./makeTSVfile.py", line 96, in <module>
    readGraphFile(infile)
  File "./makeTSVfile.py", line 15, in readGraphFile
    with open(infile, 'r') as f:
IOError: [Errno 2] No such file or directory: 'hsapiens-8reformat.fa.scaff_s98_c5_l0_d0_e30000_r0.05_original.gv'
Running LINKS...
run complete

number of scaffolds == number of contigs

I am trying to improve the rat reference genome rn6 with 10x data. I ran longranger basic on my R1 and R2 files to get a barcoded.fastq.gz file. I checked the file to make sure the BX:Z:barcode-1 tag follows the read name on the @ line (separated by a space).

$ cat summary.csv 
barcode_diversity,bc_on_whitelist,num_read_pairs
1064967.05402,0.933687055596,394846319

I already ran Tigmint and have a rn6_after_tigmint.fa file. I also have a phased_possorted_bam.bam file from longranger. So far I have run the analysis a few ways and got the same result: scaffolds.fa is essentially identical to rn6_after_tigmint.fa, with one contig per scaffold. I ran assembly-stats on both files and the stats are the same, and the .assembly_correspondence.tsv file contains NA in the last 3 columns of every row.

$ tail *correspondence.tsv
scaffold9553    7080    7080    f       NA      NA      NA
scaffold9554    5570    5570    f       NA      NA      NA
scaffold9555    3375    3375    f       NA      NA      NA
scaffold9556    2162    2162    f       NA      NA      NA
scaffold9557    8135    8135    f       NA      NA      NA
scaffold9558    3363    3363    f       NA      NA      NA
scaffold9559    7157    7157    f       NA      NA      NA
scaffold9560    6728    6728    f       NA      NA      NA
scaffold9561    1136    1136    f       NA      NA      NA
scaffold9562    7468    7468    f       NA      NA      NA

The analysis I've done:

  1. arcs-make arcs draft=rn6_after_tigmint reads=myreads
  2. pipeline_example.sh rn6_after_tigmint phased_possorted_bam.bam
  3. runARCSdemo.sh also produced a links_c5r0.05e30000-l5-a0.9.scaffolds.fa with the same assembly-stats as hsapiens-8reformat.fa and NAs in the .assembly_correspondence.tsv file.

I suspect the barcodes are not being read by the pipeline, but I'm not sure how to check.
Any insights are highly appreciated!

arcs 1.0.5
tigmint 1.1.2
LINKS 1.8.6
longranger 2.2.2
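One way to test the barcode suspicion above (my own sketch, not part of the pipeline; `sample.fq` stands in for the decompressed barcoded.fastq.gz) is to count header lines that carry a BX tag:

```shell
# Two reads: one with a longranger-style barcode comment, one without.
cat > sample.fq <<'EOF'
@read1 BX:Z:ACGTACGTACGTACGT-1
ACGTACGT
+
IIIIIIII
@read2
TTTTTTTT
+
IIIIIIII
EOF
# Count header lines carrying a BX tag; if this is near zero on real data,
# the barcodes were lost upstream and the correspondence columns will be NA.
grep -c 'BX:Z:' sample.fq
```

On real data, check the aligner side too: the comment only reaches the BAM as a BX tag when the aligner is told to carry it (e.g. `bwa mem -C`), so `samtools view file.bam | grep -c 'BX:Z:'` over a sample of records is worth running as well.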

BAM FILES input

Hi,
This is more a question about how to use arcs than an issue report.

I see arcs needs BAM files and a FASTA file as input. I understand the FASTA file holds the scaffold sequences, but I don't really understand what the BAM files are.

Are they Chromium "linked-reads" output? From which software? Supernova produces FASTA files; Cell Ranger, I see, produces BAM files.
Can you explain how I can generate BAM files from Chromium data?

cheers

The Influence of PCR duplicates on Arcs

Hi,

I found that Chromium sequencing has more bias (GC bias, PCR duplicates) than standard WGS. The manuscript mentions:
These form a link between the two sequences, provided that there is sufficient number of read pairs aligned (-c, set to 5 by default)
So I wonder whether ARCS considers PCR duplicates during scaffolding. If not, is it better to remove duplicates before scaffolding with ARCS?

Best,
Danshu

Problem with ARCS with or without Tigmint

I'm working on a plant genome (~220 Mb).
My starting point is a phased assembly of PacBio reads produced with FALCON and FALCON-Unzip (774 contigs).

Important: I have renamed the contigs to simple numeric identifiers:

>1
>2
...
>774

What I have done so far :
1/ longranger basic on the reads.fastq from 10X genomics

2/ Tigmint correction with the following commands :

samtools faidx draft.fa
bwa index draft.fa
bwa mem -t8 -p -C draft.fa reads.fq.gz | samtools sort -@8 -tBX -o draft.reads.sortbx.bam
tigmint-molecule draft.reads.sortbx.bam | sort -k1,1 -k2,2n -k3,3n > draft.reads.molecule.bed
tigmint-cut -p8 -o draft.tigmint.fa draft.fa draft.reads.molecule.bed

3/ When I try to run ARCS on the Tigmint output with:
arcs -f draft.tigmint.fa -a file_of_bamfile -c 5 -e 30000 -r 0.05

I get the following error:

=> Reading alignment files... Mon Jan 21 11:07:40 2019
error: unexpected sequence: 1 of size 2459890+ draft.fa.c5_e30000_r0.05.tigpair_checkpoint.tsv draft.fa
/var/spool/slurmd/job108691/slurm_script: line 50: draft.fa.c5_e30000_r0.05.tigpair_checkpoint.tsv: command not found

4/ When I run ARCS without Tigmint (directly on my draft assembly) like this:
arcs -f draft.fa -a file_of_bamfile -c 5 -e 30000 -r 0.05
I get this log:

=> Reading alignment files... Mon Jan 21 10:15:51 2019
Warning: Skipping an unpaired read. Read pairs should be consecutive in the SAM/BAM file.
  Prev read: H9:1:HV2KYBBXX:6:1209:9090:3952
  Curr read: H9:1:HV2KYBBXX:6:2216:4482:43867
Warning: Skipped 1000000 unpaired reads.
Warning: Skipped 2000000 unpaired reads.
Warning: Skipped 3000000 unpaired reads.
...
Warning: Skipped 210000000 unpaired reads.
Warning: Skipped 210477149 unpaired reads. Read pairs should be consecutive in the SAM/BAM file.
{ "All_barcodes_unfiltered":1958917, "All_barcodes_filtered":1085640, "Scaffold_end_barcodes":1008556, "Min_barcode_reads_threshold":50, "Max_barcode_reads_threshold":10000 }

=> Pairing scaffolds... Mon Jan 21 10:39:45 2019

=> Creating the graph... Mon Jan 21 10:39:50 2019

=> Writing graph file... Mon Jan 21 10:39:50 2019

      Max Degree (-d) set to: 0. Will not delete any vertices from graph.
      Writing graph file to draft.fa.scaff_s98_c5_l0_d0_e30000_r0.05_original.gv...

=> Creating the ABySS graph... Mon Jan 21 10:39:50 2019

=> Writing the ABySS graph file... Mon Jan 21 10:39:50 2019

Thanks for your help
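With renamed numeric contigs as in this report, one thing worth checking (my own sketch; the file names are examples, and a real BAM header would come from `samtools view -H`) is that the FASTA contig names exactly match the @SQ SN: fields of the alignment header:

```shell
# Toy draft with numeric contig names, like the renamed assembly above.
cat > draft.fa <<'EOF'
>1
ACGT
>2
TTTT
EOF
# Toy alignment header (real SAM headers are tab-separated; awk splits either way).
cat > header.sam <<'EOF'
@SQ SN:1 LN:4
@SQ SN:2 LN:4
EOF
# Extract both name sets and compare them.
grep '^>' draft.fa | sed 's/^>//' | sort > fa_names.txt
awk '/^@SQ/ {for (i = 1; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)}' header.sam | sort > sq_names.txt
diff fa_names.txt sq_names.txt && echo "contig names match"
```

A non-empty diff means arcs is being fed alignments against a different FASTA than the one passed with -f, which is one way "unexpected sequence" errors can arise.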

small correction for parallelisation of bwa in arcs-make

Hi, thanks a lot for this package! I found a small change for running bwa in parallel that might help.
The arcs-make command on line 138 is:
/usr/bin/time -v sh -c 'bwa mem -t$t -C -p $< $(reads).fq.gz | samtools view -Sb - | samtools sort -@$t -n - -o $@' |& tee $(patsubst %.sorted.bam,%,$@)_bwa_mem.log
It should be:
/usr/bin/time -v sh -c 'bwa mem -t$(threads) -C -p $< $(reads).fq.gz | samtools view -Sb - | samtools sort -@$t -n - -o $@' |& tee $(patsubst %.sorted.bam,%,$@)_bwa_mem.log

error in bam file?

Hi guys,
I recently mapped demultiplexed 10X Chromium reads to a reference genome using bowtie2. As recommended, the barcodes were appended to the read names as READNAME_BARCODE.
After the mapping finished, I converted the SAM file to BAM using samtools and then used the BAM to find links between contigs in a raw assembly. However, I keep getting the following error:

On line 940000000
On line 950000000
On line 960000000
On line 970000000
On line 980000000
On line 990000000
On line 1000000000
On line 1010000000
On line 1020000000
On line 1030000000
On line 1040000000
error: `/data/Bioinfo/bioinfo-proj-jmontenegro/DENOVO/Dunnart/Results/Mappings/Bowtie/Abyss/10X_abyss.sam': Expected end-of-file and saw `2    127M    =       2707    -246    AGAAAATCTATGTGTGGGAATCTTCAGATAATCGTATGATTGAGCTTTACTAGCAACCATGGTATGGAGGTCATGGATTTACCCATTTCCCAAAGGAGAAATTCAGTGTTTATTGCCTTCTTAGAAT       IHFFCIHHHDHHCHIIIIHGIIIIIIGIHIHIIIIIIHIIIIIIIIHHFHHIIIIHIHHFIIIIIIIIIIIHGHGIIIIIHGIHIIIIHHHIHHIIIIIIIHHHIGIIIIIIIIIHIIIIIHIIIII        AS:i:-5 XN:i:0  XM:i:1  XO:i:0XG:i:0   NM:i:1  MD:Z:58T68      YS:i:-21        YT:Z:CP'

However, I cannot see any issues in the BAM file at all; at least samtools did not report any problems during the conversion. Have you seen this error before?

Cheers,
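The "Expected end-of-file and saw ..." message above often indicates a truncated SAM/BAM. A cheap first check (my own sketch; the demo file is created here, truncated on purpose) is whether the file ends with a newline:

```shell
# Write a SAM record WITHOUT a trailing newline, mimicking a truncated file.
printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII' > truncated.sam
# tail -c 1 returns the last byte; a healthy text SAM ends in a newline,
# so a non-empty result here is suspicious.
if [ -n "$(tail -c 1 truncated.sam)" ]; then
  echo "no trailing newline: file may be truncated"
fi
```

For the BAM itself, `samtools quickcheck file.bam` is the more thorough test, since a BAM can be byte-truncated without any visible text damage.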

Empty graph when running ARCS

Hi,
I am trying to run arcs on an Anopheles genome (300 Mbp), but the graph I get is empty:
graph G {
}
I was hoping you could help me understand what I am doing wrong, or how I can tweak the parameters to get some results.
I am trying to scaffold a SOAPdenovo assembly using 10X Genomics reads, and I am using the pipeline_example.sh script provided with the pipeline.
I don't see any error or warning messages in the pipeline printout (attached).

The BAM file was generated by bwa with default parameters and then sorted by read name.
By the way, just to make sure: my original read names include "/1" or "/2" for read 1 or read 2, but I noticed that after bwa all reads are named "/1", with the read-1/read-2 information carried in the BAM flag: can the "/1" cause issues with arcs?
Example of a read name: ST-E00143:242:HW7TJCCXX:2:1101:1407:29349/1_AGTTGGTAGTGCGCCT

An example of an alignment is:
ST-E00143:242:HW7TJCCXX:2:1101:1407:29349/1_AGTTGGTAGTGCGCCT 65 2R 61455622 60 128M = 61455622 0 GCAGTTTGCAATGGAGCGGATTTAACCTACAGTGTGGTGTGCAGCCGTTGCACTCGAACATTTCATCAACAGTGCACTAACGTGGATGACTCAGTGTATGACCTACATTGGTTCTGTGACATGTGCAC JJJJF<JJJFJJFJJFF<AFA7FJJJFJJJJF--FJFJ7FA-F<AJJJJFJFFJFFJJJJJJJJJJJJJ<JJJFJJJJJFJJ<F<JJJJJJJFFF777AFAJ7<-FF-77F--JFJJ-AFJ<AAAJF) NM:i:1 MD:Z:22A105 AS:i:123 XS:i:93

Another question: about the BAM file, your instructions say that
"index must be included in read name e.g read1_indexA"
Can you please specify what this index is? Is it the 10X Genomics barcode?

Thank you very much!

arcs.output.txt
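If the "/1" suffix does turn out to confuse the read-name_barcode convention, one normalisation sketch (my own, not official guidance; the read name is the example from the report) is to drop the /1 or /2 that sits in front of the _BARCODE part so both mates share one name:

```shell
# Strip a /1 or /2 mate suffix immediately preceding the _BARCODE part of the
# read name; '#' is used as the sed delimiter to avoid escaping the slash.
echo 'ST-E00143:242:HW7TJCCXX:2:1101:1407:29349/1_AGTTGGTAGTGCGCCT' \
  | sed 's#/[12]_#_#'
```

On a real file this would be applied to the FASTQ headers (or SAM QNAME column) before alignment, so the mate information lives only in the flags.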

Recommendation on multiple lanes

Hello ARCS team,

I was wondering if I could get a recommendation on benefits of multiple 10X libraries vs. increased depth per library for ARCS / LINKS scaffolding.

Basically, we would like to increase our scaffolding potential by generating two lanes of 10X data for a genome ~2.5 Gbp (Contig N50 ~100kb). I'm wondering, in your experience, if that is better served with 1) two libraries, each on their own sequencing lane, or 2) 1 library sequenced across two lanes? It would seem to me that option 1 would allow for a greater number of links to span the gap (-l in LINKS), but at the expense of the minimum aligned read pairs per barcode mapping (-c in ARCS). In your opinion, which parameter would be more important in terms of scaffolding, and do you have a recommendation on coverage per library or reads per barcode required to achieve a suitable number of aligned reads (-c in ARCS)? It looks like from the Barcode counts output from ARCS on the first lane we've done, we have an average ~280 reads per barcode.

Thanks for your input and help,

Eric

arcs-make pipeline query

I ran the entire arcs-make (arcs-1.0.5) pipeline on data
from a small marine organism. The makefile was a great help!

Every stage completed successfully according to the output.

However, there is very little change in the draft genome
after running the pipeline. (See Ml.fa, below, the original
draft genome, and Ml_c5_m50-10000_s98...scaffolds.fa, the final
output from the pipeline.)

Can you suggest some stages of the pipeline I can focus on
to improve the result?

The reads are 10x linked reads prepared by longranger basic
with this summary:

barcode_diversity, bc_on_whitelist, num_read_pairs
188166.842687, 0.896626906719, 110838678

Also included below is a summary of the Supernova assembly
of these reads (for information only; the assembly data is not
part of the arcs process).

------------- Ml.fa -------------
number sequences: 5,101
max sequence length: 1,222,598
min sequence length: 987
median sequence length: 1,772
total number residues: 155,875,873
N50: 187314

--Ml_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.scaffolds.fa --
number sequences: 5,084
max sequence length: 1,222,598
min sequence length: 987
median sequence length: 1,764
total number residues: 155,877,573
N50: 189805

(Supernova assembly of the reads .. for information only!)
------------- Mlpseudohap.fasta -------------
number sequences: 25,046
max sequence length: 102,547
min sequence length: 1,000
median sequence length: 3,008
total number residues: 145,149,952
N50: 10665
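Since the comparison above hinges on N50, here is a toy illustration (made-up lengths, not these assemblies) of what the statistic measures: sort lengths descending and take the length at which the running sum first reaches half the total assembly size.

```shell
# Toy N50: lengths 10, 20, 30, 40 sum to 100; descending, the running sum
# first reaches 50 at length 30, so N50 = 30.
printf '%s\n' 10 20 30 40 \
  | sort -rn \
  | awk '{len[NR] = $1; total += $1}
         END {half = total / 2; run = 0
              for (i = 1; i <= NR; i++) {run += len[i]; if (run >= half) {print "N50=" len[i]; exit}}}'
```

The tiny N50 shift above (187314 to 189805) therefore reflects very few joins; the stages worth inspecting first are the ones that decide whether joins happen at all (barcode counts in the ARCS output, and -l/-a in LINKS).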

Running problem

Hi,
I ran into some problems when using arcs:
error: `@SQ SN:000000F_pilon LN:62335206': No such file or directory
My command is:
./arcs -f Pilon.fasta -a test.bam -s 98 -c 5 -l 0 -d 0 -r 0.05 -e 30000 -m 20-10000 -v
I get the same error whether ALIGNMENTS is a SAM or a BAM file. What's going on?

Thank you very much!
Jing
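The "No such file or directory" on an @SQ header line suggests -a was given the alignment file itself: my reading (worth verifying against the README) is that arcs expects -a to be a text file listing the BAM paths, one per line, so it was treating each header line as a filename. A minimal sketch:

```shell
# Build the file-of-filenames that -a expects; 'test.bam' is the BAM from the
# report, and the file can list several BAMs, one path per line.
echo test.bam > bamfiles.txt
cat bamfiles.txt
```

The run would then be `./arcs -f Pilon.fasta -a bamfiles.txt ...` with the same remaining flags as above.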

Mean gaps in .scaffolds csv file are all the same

In your demo and with my own genome, all the mean gaps are estimated as 10. Here is an example:
scaffold77,942471,f15191z72983k40a0.07m10_f13167z221844k4a0.5m10_f15280Z481647k5a0.6m10_f530z165997
scaffold78,1050735,r8877z5568k5a0.6m10_r984z132808k2a0.5m10_f127z174327k6a0.5m10_f128Z473863k23a0.17m10_f129z264169
scaffold79,571225,r15084z29962k16a0.12m10_f165z72736k2a0.5m10_f785Z468527
Does this mean LINKS doesn't do the estimation? And under what circumstances does this function work?

compile error: using ‘typename’ outside of template

g++ -DHAVE_CONFIG_H -I. -I..  -I/home/zhut/src/arcs-1.0.1/Arcs -I/home/zhut/src/arcs-1.0.1/Common -I/home/zhut/src/arcs-1.0.1/DataLayer -I/home/zhut/src/arcs-1.0.1 -I/home/zhut/src/arcs-1.0.1   -isystem/home/zhut/src/arcs-1.0.1/1_58_0 -Wall -Wextra -Werror -std=c++0x -fopenmp -g -O2 -MT arcs-Arcs.o -MD -MP -MF .deps/arcs-Arcs.Tpo -c -o arcs-Arcs.o `test -f 'Arcs.cpp' || echo '/home/zhut/src/arcs-1.0.1/Arcs/'`Arcs.cpp
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/backward/hash_set:60,
                 from /usr/include/boost/graph/adjacency_list.hpp:25,
                 from /usr/include/boost/graph/undirected_graph.hpp:11,
                 from Arcs.h:20,
                 from Arcs.cpp:2:
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/backward/backward_warning.h:28:2: error: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated.
In file included from Arcs.cpp:2:
Arcs.h:86: error: using ‘typename’ outside of template
Arcs.h:100: error: using ‘typename’ outside of template
In file included from Arcs.cpp:3:
../Arcs/DistanceEst.h:49: error: using ‘typename’ outside of template
../Arcs/DistanceEst.h:53: error: using ‘typename’ outside of template
../Arcs/DistanceEst.h:75: error: using ‘typename’ outside of template
../Arcs/DistanceEst.h: In function ‘void buildPairToBarcodeStats(const ARCS::IndexMap&, const std::unordered_map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int> > >&, const ARCS::ContigToLength&, const ARCS::ArcsParams&, PairToBarcodeStats&)’:
../Arcs/DistanceEst.h:216: error: using ‘typename’ outside of template
../Arcs/DistanceEst.h: In function ‘void addEdgeDistances(const PairToBarcodeStats&, const JaccardToDist&, const ARCS::ArcsParams&, ARCS::Graph&)’:
../Arcs/DistanceEst.h:401: error: expected initializer before ‘:’ token
../Arcs/DistanceEst.h:430: error: expected primary-expression before ‘}’ token
../Arcs/DistanceEst.h:430: error: expected ‘;’ before ‘}’ token
../Arcs/DistanceEst.h:430: error: expected primary-expression before ‘}’ token
../Arcs/DistanceEst.h:430: error: expected ‘)’ before ‘}’ token
../Arcs/DistanceEst.h:430: error: expected primary-expression before ‘}’ token
../Arcs/DistanceEst.h:430: error: expected ‘;’ before ‘}’ token
cc1plus: warnings being treated as errors
../Arcs/DistanceEst.h: At global scope:
../Arcs/DistanceEst.h:393: error: unused parameter ‘pairToStats’
../Arcs/DistanceEst.h:393: error: unused parameter ‘params’
../Arcs/DistanceEst.h:393: error: unused parameter ‘g’
../Arcs/DistanceEst.h: In function ‘void writeDistTSV(const std::string&, const PairToBarcodeStats&, const ARCS::Graph&)’:
../Arcs/DistanceEst.h:457: error: expected initializer before ‘:’ token
Arcs.cpp:1054: error: expected primary-expression at end of input
Arcs.cpp:1054: error: expected ‘;’ at end of input
Arcs.cpp:1054: error: expected primary-expression at end of input
Arcs.cpp:1054: error: expected ‘)’ at end of input
Arcs.cpp:1054: error: expected statement at end of input
Arcs.cpp:1054: error: expected ‘}’ at end of input
../Arcs/DistanceEst.h: At global scope:
../Arcs/DistanceEst.h:433: error: unused parameter ‘pairToStats’
../Arcs/DistanceEst.h:433: error: unused parameter ‘g’
make[2]: *** [arcs-Arcs.o] Error 1
make[2]: Leaving directory `/home/zhut/src/arcs-1.0.1/Arcs'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/zhut/src/arcs-1.0.1'
make: *** [all] Error 2

Problem running makeTSVfile.py

Hello,
I am getting an error when trying to run makeTSVfile.py and I was wondering if you had any ideas on what might be the problem.

Many thanks in advance,
Steve

$ python ~/10X_GENOMICS/arcs/Examples/makeTSVfile.py arcs_test_original.gv arcs_test_original.tigpair_checkpoint.tsv canu_sprai_corrected.consensus.fasta

Output to stdout:
Traceback (most recent call last):
  File "/10X_GENOMICS/arcs/Examples/makeTSVfile.py", line 95, in <module>
    makeLinksNumbering(args.fasta_file)
  File "/10X_GENOMICS/arcs/Examples/makeTSVfile.py", line 26, in makeLinksNumbering
    links_numbering[test.group(1)] = str(counter)
AttributeError: 'NoneType' object has no attribute 'group'

Tuning -e parameter

Hi,

I'm currently evaluating the scaffolding capacity of Arcs on a genome that I'm working on currently. The genome has been assembled by Canu using RSII reads and is pretty repeat rich. I'm attaching the stats for the assembly (>1kb contigs) here:

num_seqs 26490
sum_len 898447105 (898Mb)
min_len 2002 (2kb)
avg_len 33916.5 (33.91kb)
max_len 1141553 (1.14Mb)
Q1 14867 (IQR1 14.86Kb)
Q2 21309 (IQR2 21.3 Kb)
Q3 32834 (IQR3 32.83Kb)
N50 44537 (44.5Kb)

I'm wondering what an adequate end-length parameter (-e) would be for this assembly and how I can tune it. I went through the paper, especially this part, and I'm confused:

Thus, depending on the level of contiguity of the input assembly, adjusting –e to a lower or higher value would account for shorter contigs or focus on longer contigs. When ARCS encounters shorter sequences (less than twice the specified –e length), the length of the head and tail regions are assigned as half the total sequence length. This is important, as the selection of –e will impact how ambiguity is mitigated when creating an edge between any two sequences.

So, to scaffold my assembly efficiently, should I drop the -e parameter to the IQR2 length rather than the default 30 kb? How do I tune the parameters for an extremely fragmented assembly?

Harish
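The quoted rule can be made concrete with a small arithmetic sketch (toy lengths; e=30000 is the default from the question): sequences shorter than 2*e get head/tail regions of half their length, longer ones get e.

```shell
# Head/tail ("end") region size under the quoted rule, for two toy lengths:
# 21309 (the Q2 value above, < 2*e) and 80000 (> 2*e).
e=30000
for len in 21309 80000; do
  if [ "$len" -lt $((2 * e)) ]; then
    echo "len=$len end_region=$((len / 2))"
  else
    echo "len=$len end_region=$e"
  fi
done
```

So with the default -e, over half of this assembly's contigs (Q2 = 21.3 kb < 60 kb) would already fall under the half-length rule; lowering -e mainly changes how much of each longer contig is considered an "end".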

Error in sorting order of BAM files

Hi,

I have the following problem in running Arcs:

ERROR! BAM file should be sorted in order of read name. Exiting... 
 Prev Read: ST-E00129:523:HC2HTALXX:3:1101:1083:65283_ATGGCCGTCACAGCCG; Curr Read: ST-E00129:523:HC2HTALXX:3:1101:1083:65318_AACCCTCCATGACGGA

I mapped and sorted the BAM file following the instructions, removed duplicate reads using MarkDuplicates, and then got this error when running Arcs. I sorted my BAM file again using "samtools sort -n" and "picard.jar SortSam SORT_ORDER=queryname", but neither solved the problem.

Best,
Danshu
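One known trap here (an assumption about this particular report, but easy to check): samtools `sort -n` uses a natural ordering in which runs of digits compare numerically, while Picard's queryname order is plain lexicographic, so the two tools can disagree about what counts as "sorted". The difference in miniature:

```shell
# Lexicographic vs natural ordering of the same two names.
printf 'read10\nread2\n' | sort        # lexicographic: read10 sorts first
printf 'read10\nread2\n' | sort -V     # natural/version order: read2 sorts first
```

If arcs checks for one ordering and the file was last sorted under the other, mixing MarkDuplicates (Picard conventions) with samtools sorting can reintroduce exactly this error; re-sorting with samtools `sort -n` as the final step, after MarkDuplicates, is the combination to try.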

How might mis-scaffolding happen?

Hi ARCS developers,

I have a draft assembly based on PacBio data, and I used the ARCS+LINKS pipeline to scaffold it with 10x data. I tuned the ARCS parameters -r and -e and aligned the scaffolds from different parameter sets to an optical map. The alignments to the optical map seem to show some mis-scaffolding. I read the preprint and there is not much discussion of mis-scaffolding; how might it happen?
Besides tuning -r and -e, what other parameters might help improve scaffolding on my data?

Thanks a lot
Qihua

running ARCS multiple times using newly scaffolded assembly as a reference

Hi,

We're trying to improve our extremely fragmented fish genome assembly (number of scaffolds = 175785, scaffold N50 = 8335) using ARCS and 10x data.
We have tested different parameters of ARCS and found that the following adjustments seem to work better with our data: s 98 -c 3 -l 0 -m 25-20000 -d 0 -r 0.05 -e 4000 -z 400. With LINKS we have used quite relaxed parameters: -l 2 -t 2 -a 0.9 -x 1 -z 400.
We observed a substantial improvement after the first run of ARCS+LINKS: the number of scaffolds was reduced to 158971 and the scaffold N50 increased to 12248. We repeated the pipeline 4 times, each time using the newly scaffolded assembly as the reference.
Now after four rounds we have 134882 scaffolds with a scaffold N50 of 95581. BUSCO analysis also shows an improvement: completeness increased from 53.1% to ca. 70%.

The question is: is it sensible to run several rounds of ARCS+LINKS pipeline?

P.S. We're also running Supernova, however, without any result for more than 3 weeks...

chimeric scaffolds detection with ARCS

hello,
I am wondering whether I could use ARCS to detect misassemblies in my scaffold-level diploid assembly (produced from PE, MP and 10x data by NRGene), with the particular caveat that I would be using the same 10x data that built the scaffolds.
If it won't work, do you have any idea what I could use instead?
Thanks,
Dario

Length of fasta sequences vs @SQ and masking

I am attempting to scaffold some output from HGAP, which marks sequence as uppercase or lowercase by default. I mapped my input sequences using BWA-MEM but ran into the error error: mismatched sequence lengths: sequence 004574F|arrow: 0 != 4531 and realised your definition of sequence length differs from BWA's: you exclude soft-masked (lowercase) repeats from your length, BWA does not, and you then check the @SQ lengths against the FASTA length by your definition, which causes issues. What is the correct solution to this? It might be helpful to document it for other lost travellers such as myself.
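One workaround worth trying (an assumption on my part, not official guidance for this tool): uppercase the draft before indexing and scaffolding, so soft-masked bases cannot be excluded from any length accounting.

```shell
# Toy soft-masked draft; the lowercase run stands in for a masked repeat.
cat > masked.fa <<'EOF'
>ctg1
ACGTacgtNNNN
EOF
# Uppercase sequence lines only, leaving headers untouched.
awk '/^>/ {print; next} {print toupper($0)}' masked.fa > unmasked.fa
cat unmasked.fa
```

The masking information is lost, of course, so this only makes sense if nothing downstream of scaffolding needs the lowercase annotation.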

[E::bgzf_flush] File write failed (wrong size)

I'm running Arcs with Tigmint. It runs for about 16 hours and then crashes. I'm using samtools 1.9. longranger was used to create the interleaved barcoded file from a Chromium run; Taro.contigs.fa was assembled from Nanopore runs using Canu. I'm running on a single node with 20 cores and 128 GB RAM.
arcs-make arcs-tigmint draft=Taro.contigs reads=barcoded
End part of the error file is below:
[M::mem_process_seqs] Processed 573478 reads in 662.421 CPU sec, 81.837 real sec
[M::process] 0 single-end sequences; 573478 paired-end sequences
[E::bgzf_flush] [E::bgzf_flush] File write failed (wrong size)[E::bgzf_flush] File write failed (wrong size)
[E::bgzf_flush] File write failed (wrong size)
[E::bgzf_flush] File write failed (wrong size)
[E::bgzf_flush] File write failed (wrong size)
File write failed (wrong size)
[E::bgzf_close] File write failed
[E::bgzf_close] File write failed
[E::bgzf_close] File write failed
[E::bgzf_flush] File write failed (wrong size)
[E::bgzf_flush] File write failed (wrong size)

[E::bgzf_close] File write failed
[E::bgzf_close] File write failed
[E::bgzf_close] File write failed
[E::bgzf_close] File write failed
[E::bgzf_close] File write failed
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0224.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0225.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0226.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0227.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0228.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0229.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0230.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/Taro.contigs.barcoded.sortbx.bam.FfJGIn.0231.bam": No space left on device
time user=478289.20s system=444.58s elapsed=59684.76s cpu=802% memory=4870 job=bwa mem -t8 -pC Taro.contigs.fa barcoded.fq.gz
time user=952.14s system=103.13s elapsed=59684.72s cpu=1% memory=10 job=samtools view -u -F4
time user=3901.95s system=245.10s elapsed=59684.72s cpu=6% memory=6619 job=
make[1]: *** [Taro.contigs.barcoded.sortbx.bam] Error 1
make: *** [Taro.contigs.tigmint.fa] Error 2
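The root cause in the log above is samtools sort filling /tmp ("No space left on device"); the bgzf errors are downstream of that. samtools sort takes -T PREFIX for its temporary files, so directing them to a scratch volume with space is one fix (the directory name is an example; the actual sort flags come from the arcs-make/tigmint makefile):

```shell
# Create a scratch directory on a filesystem with room, then point the sort's
# temporary files at it via -T (shown here as the command to re-run, not run).
mkdir -p scratch
echo "samtools sort -@8 -tBX -T scratch/Taro -o Taro.contigs.barcoded.sortbx.bam"
```

A name-sorted 22 G-scale BAM can need temporary space comparable to the input, so the scratch volume should be sized accordingly.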

Correct fastq format

HI @warrenlr,

I'm trying to generate the right format for the fastq files. I haven't found any examples, so I tried to deduce it from the BAM file NA24143_genome_phased_namesorted.bam1.sorted.bam.
Is this the right approach?

@NB501445:179:H2YG7BGX7:2:11312:9490:14604_AAAAAAAAAGCAATTT
AAATTAAGTGATATATCCTCTTACGAAGTTGTTTCGGTTTAAAATTATTGTTATTATTTTTTAATGATGTGCAGAAAGTAAAGCAACATGAAGCTTTATTTACTACCAACTTCATTTCCCTTCAATA
+
EEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEAE<AEA/A/EE/EEAA</EE/EEEEEEE/EEEEEEE//<EE/EEEEEEEEEEEEEE/EAEAEEEAEE<EAE<</<EAEEAEEE</A/E</EE
@NB501445:179:H2YG7BGX7:3:21602:1860:16340_AAAAAAAAATTATAAA
AAAAAAATGTTACCCACAATCCCATGATTTGCCATGCATGTATATAGGGACATATCCAATGATAGACATCGCGTTGTTACCGCGTCAACGGGGCTCTAGTAAGAATAAATTCGCAGTCACAGCTTTG
+
EEEEEEEEE/EE//E/E/E//6A/EE/A</EEAEEEE/E/<E/EEE<///EE//A/EE/EE/AEEEEEAAE6/EE/EA//EEEEEE<EEA</EE/</EE//AEE//<E/EA<AEA</E/AA/E/E/A

and this is the command to align them with bwa:

bwa mem -t32 $ASSEMBLY.fasta $ARCSR1.gz $ARCSR2.gz | samtools view -Sb - | samtools sort -@ $THREADS -n - > $SPECIES.LinkedReads.bam

Correct?
Best
F
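For what it's worth, the header transformation implied by the BAM above (barcode joined to the read name with an underscore) can be sketched in Python. `append_barcode` is a hypothetical helper, not part of ARCS; it assumes the barcode sits in a `BX:Z:` comment on the FASTQ header line:

```python
import re

def append_barcode(header):
    """Move a BX:Z barcode from the FASTQ comment into the read name.

    '@read1 BX:Z:AAAAAAAAAGCAATTT-1' -> '@read1_AAAAAAAAAGCAATTT'
    Returns None for reads without a barcode, which can then be dropped.
    (append_barcode is a hypothetical helper, not part of ARCS.)
    """
    m = re.match(r'^(@\S+)\s.*?BX:Z:([ACGTN]{16})', header)
    if m is None:
        return None
    return m.group(1) + '_' + m.group(2)
```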

AGP file creation

Hello arcs team,

We were able to make a very good genome assembly using Tigmint+ARCS. Now we are looking to create an AGP file for our ARCS assembly. Is it possible to create an AGP file from the ARCS assembly output?
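As far as I know ARCS/LINKS do not emit AGP directly, but since LINKS joins sequences with runs of N, an AGP can be reconstructed from the scaffold FASTA by splitting on those runs. A minimal sketch; `scaffold_to_agp` and the `gap_type`/`linkage`/evidence values are illustrative assumptions, not ARCS output:

```python
import re

def scaffold_to_agp(name, seq):
    """Emit AGP v2.0-style rows for one scaffold, treating runs of N as gaps.

    Illustrative sketch only: component IDs are invented, and the gap
    line's gap_type/linkage/evidence columns are assumed values.
    """
    rows, part = [], 0
    for m in re.finditer(r'[Nn]+|[^Nn]+', seq):
        start, end = m.start() + 1, m.end()  # AGP coordinates are 1-based, inclusive
        part += 1
        if m.group(0)[0] in 'Nn':
            # gap line: object, beg, end, part, 'N', gap_length, gap_type, linkage, evidence
            rows.append((name, start, end, part, 'N', end - start + 1,
                         'scaffold', 'yes', 'paired-ends'))
        else:
            # component line: 'W' = WGS contig, component coords, orientation
            comp_id = f'{name}_part{part}'
            rows.append((name, start, end, part, 'W', comp_id,
                         1, end - start + 1, '+'))
    return rows
```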

What do the c and a parameters mean?

Hello,
I used ARCS-LINKS for genome scaffolding. When I choose different c, a, and l values,
I get different assembly results. What do c, a, and l mean? Is a higher value better, or more accurate?
Thanks,
Fuyou

Instruction on tuning parameters?

Hi,

Thanks for your help; I have successfully finished this scaffolding pipeline. Now I want to understand better how ARCS works and which parameters fit my genome. Are there any instructions for this? And may I know when your paper will be published or made available online as a preprint?

Best,
Danshu

End length is not 30k by default.

In the code and manual it states that the end length being considered is 30k by default:

" -e, --end_length=N contig head/tail length for masking alignments [30000]\n"

But in the place where it is actually set, the default is zero:

arcs/Arcs/Arcs.h

Lines 65 to 80 in ecd98d1

ArcsParams() :
seq_id(98),
min_reads(5),
dist_est(false),
dist_bin_size(20),
dist_mode(DIST_MEDIAN),
min_links(0),
min_size(500),
gap(100),
min_mult(50),
max_mult(10000),
max_degree(0),
end_length(0),
error_percent(0.05),
verbose(0) {
}

And where it is used, that zero is translated to "use half of the contig length" :

arcs/Arcs/Arcs.cpp

Lines 306 to 314 in e75619f

/*
* If length of sequence is less than 2 x end_length, split
* the sequence in half to determing head/tail
*/
int cutOff = params.end_length;
if (cutOff == 0 || size <= cutOff * 2)
cutOff = size/2;
/*

I don't know which is best; the default (0) might work fine for some projects, although I find that setting it to another number works better in the ones I have tested so far.

But I think it is better either to change the default to 30k, or to change the manual and code to show the effect of using 0 as the default.
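Put together, the two snippets above behave like this (a Python paraphrase of the C++ logic, for clarity):

```python
def effective_cutoff(end_length, scaffold_size):
    """Paraphrase of the head/tail cutoff logic in Arcs.cpp:
    end_length == 0 (the compiled-in default), or a scaffold shorter
    than twice end_length, falls back to half the scaffold length."""
    if end_length == 0 or scaffold_size <= end_length * 2:
        return scaffold_size // 2
    return end_length
```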

I get nothing in original.gv file

Hi,

I'm trying to use arcs pipeline and everything seems to run without error. But I got empty original.gv file. I followed a pipeline aim to Acropora millepora Genome Assembly (Zachary L. Fuller, Yi Liao, Line Bay, Mikhail Matz & Molly Przeworski. August 2, 2018)

(1)
Firstly, I used longranger-basic to deal with 10X data reads:
./longranger basic --id=Ass --fastqs=./10XData/ --sample=10X
It got me output: Ass/outs/barcoded.fastq.gz.

Then convert the fastq file:
gunzip -c Ass/outs/barcoded.fastq.gz | perl -ne 'chomp;$ct++;$ct=1 if($ct>4);if($ct==1){if(/(@\S+)\sBX:Z:(\S{16})/){$flag=1;$head=$1."".$2;print "$head\n";}else{$flag=0;}}else{print "$_\n" if($flag);}' > CHROMIUM_interleaved.fastq

 The output fastq is just like this:
 ###############################################
 (The first read_1_1)
 @CL100089928L2C009R030_475184AAACACCAGACAATAC
 AAAGTGATTGCGGTGCTGATTGTTGTAGTCTGTTGTTTACGGTATTCCC........
 +
 <CFD8D>DFA3@9=DCFC>@,?9C=3BB;FFE8FF0@5F9DEEBDFB..........
 (The paired read_1_2)
 @CL100089928L2C009R030_475184AAACACCAGACAATAC
  ACTGTAGAGAGCAGTTAAGACACTGGTAGGAGAGGAGAGGATTAAT........
 +
  FEBDFEEED@DD/FFFACEACEEFE9EFDCEEAB>7C.E6EEFFEFEFFEEFF;.....
  ################################################

The fastq format is correct, but what concerns me is that the barcode is appended directly to the end of the read name. The arcs results show that it cannot identify the barcode. I don't know what kind of fastq arcs needs. Are there some problems here?

(2) Then I mapped CHROMIUM_interleaved.fastq to the draft assembly using bwa:
bwa index --> bwa mem --> samtools sort -n .. > 10x.bam

(3) Then using ARCS:
./arcs -f Polished.Scaff10x_R2-renamed.fa -a AlignmentFile -s 98 -c 5 -l 0 -d 0 -r 0.05 -e 30000 -m 20-10000 -v > arcs.log

The AlignmentFile lists the path to 10x.bam. The resulting original.gv file is empty. The log file shows:

###############################################
.....
.....
{ "All_barcodes_unfiltered":0, "All_barcodes_filtered":0, "Scaffold_end_barcodes":0, "Min_barcode_reads_threshold":20, "Max_barcode_reads_threshold":10000 }
.....
.....
Max Degree (-d) set to: 0. Will not delete any vertices from graph.
.....
.....
###############################################

Sorry for such a long text. Could you please help with the problem?

Best~
Jing
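One thing that stands out in the headers above: ARCS's default mode takes the token after the last underscore of the read name as the barcode, and here the tile/position digits ('475184') are fused to the barcode, so that token is not a clean 16-mer. A quick sanity check (`plausible_arcs_name` is a hypothetical helper, not part of ARCS):

```python
import re

def plausible_arcs_name(read_name):
    """Heuristic check: does the read name end in '_<16-bp barcode>'?

    ARCS's default mode takes the token after the last underscore as
    the barcode, so the fused form '..._475184AAACACCAGACAATAC' yields
    a token it cannot use.  Hypothetical helper, not part of ARCS.
    """
    token = read_name.rsplit('_', 1)[-1]
    return re.fullmatch(r'[ACGTN]{16}(-\d+)?', token) is not None
```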

N50 become smaller while increasing the reads depth

Hi,
I'm trying to run ARCS at different depths of 10x reads, and I find the scaffold N50 becomes smaller as the read depth increases. Here are the details:
[image: table of scaffold N50 at each read depth]
and here is my command:

java -jar trimmomatic-0.33.jar PE -phred33 -threads 4 ${depth}_R1_interleaved.fastq ${depth}_R2_interleaved.fastq ${depth}_trim_1.fq ${depth}_unpair_1.fastq ${depth}_trim_2.fq ${depth}_unpair_2.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75
bwa mem -t 4 p_ctg.rename.fa ${depth}_trim_1.fq ${depth}_trim_2.fq >${depth}.sam
samtools view -bS ${depth}.sam >${depth}_combine.bam
samtools sort --threads 4 -n ${depth}_combine.bam -o ${depth}_combine.sort.bam
arcs -f p_ctg.rename.fa -a alignments.fof -s 80 -c 2 -l 0 -d 0 -r 0.05 -e 30000 -v 1 -m 20-10000 -b ${depth}_combine
python makeTSVfile.py ${depth}_combine_original.gv ${depth}_combine.tigpair_checkpoint.tsv p_ctg.rename.fa
touch empty.fof
LINKS -f p_ctg.rename.fa -s empty.fof -k 20 -b ${depth}_combine -l 5 -t 2 -a 0.9 -x 1

Why did this happen? Is there anything wrong with my command? Why do the scaffold lengths become smaller? I hope to get some advice; thanks in advance.

samtools: Not a directory

Hello,
I always get the following error when I run arcs:

=>Getting scaffold sizes... Tue Jan  2 10:52:14 2018

=>Starting to read BAM files... Tue Jan  2 10:52:27 2018
samtools: Not a directory
PID 6012 exited with status 1

Thanks.

Mapping of Chromium reads

Hi,

I'm interested in using arcs to scaffold my genome assembly using 10X genomics data. For the first mapping step, bwa mem with default parameters was used in the ABySS 2.0 preprint (http://biorxiv.org/content/early/2016/08/07/068338).
My question is: is "bwa mem with default parameters" enough for mapping Chromium data, or would a specialized linked-read alignment tool such as Lariat improve the final scaffolding results?

Best,
Quan

issue with Autotools

Hello,

I am trying to install ARCS and I have this issue with Automake:

[copettid@kp141-242 arcs]$ ./autogen.sh
+ aclocal
threads object version 2.21 does not match bootstrap parameter 2.16 at /usr/lib64/perl5/DynaLoader.pm line 210.
Compilation failed in require at /usr/share/automake-1.15/Automake/ChannelDefs.pm line 23.
BEGIN failed--compilation aborted at /usr/share/automake-1.15/Automake/ChannelDefs.pm line 26.
Compilation failed in require at /usr/share/automake-1.15/Automake/Configure_ac.pm line 27.
BEGIN failed--compilation aborted at /usr/share/automake-1.15/Automake/Configure_ac.pm line 27.
Compilation failed in require at /usr/bin/aclocal line 39.
BEGIN failed--compilation aborted at /usr/bin/aclocal line 39.
[copettid@kp141-242 arcs]$ cd ..
[copettid@kp141-242 bin]$ sudo dnf install autoconf automake
[sudo] password for copettid:
Last metadata expiration check: 2:45:43 ago on Fri 29 Mar 2019 09:19:18 AM CET.
Package autoconf-2.69-27.fc28.noarch is already installed, skipping.
Package automake-1.15.1-5.fc28.noarch is already installed, skipping.
Dependencies resolved.
Nothing to do.
Complete!

It should be there, right? I don't understand what is wrong at this point.

Absolute zero scaffolding

Hi @lcoombe,

I ran my first test; how is it possible that I got zero scaffolding at all? Is this something to expect? Is there anything I can do to check or confirm this?

Best
F

Configure complains about unary operator

If I run configure, I get this error:

./configure: line 4899: test: !=: unary operator expected

That line is this one :
configure:if test $ac_cv_header_boost_unordered_unordered_map_hpp != yes; then

I tried to fix it with :
configure:if test "$ac_cv_header_boost_unordered_unordered_map_hpp" != yes; then

But then the configure script kept complaining about missing boost libraries, even though I downloaded the latest version and placed it in the source directory.

Without the fix it compiles fine, by the way.

unusual behavior of tigmint

Unexpected results from using tigmint with arcs pipeline

arcs alone

Ml_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.scaffolds.fa -------------
number sequences/scaffolds: 5,084
max sequence length: 1,222,598
min sequence length: 987
median sequence length: 1,764
total number residues: 155,877,573
Scaffold N50: 189805

arcs with tigmint

Ml.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.scaffolds.fa -------------
number sequences/scaffolds: 14,977
max sequence length: 256,959
min sequence length: 1
median sequence length: 1,469
total number residues: 155,789,698
Scaffold N50: 45459

Why are results with tigmint so very different from arcs alone?

Why does the max scaffold length decrease from over 1,000,000 to 256,000?

Why are there ~3,000 seqs with length < 500 when the min seq length was 987? (Includes five seqs of length 1, three of length 2 !!)

Also the N50 decreases significantly.

Too large BAM file?

I see this :

Running: arcs 1.0.1
 pid 2515
 -f XXXXX.fasta
 -a fof.in
 -s 98
 -c 5
 -l 0
 -z 500
 -b XXXXX.fasta.scaff_s98_c5_l0_d0_e0_r0.05
 Min index multiplicity: 50
 Max index multiplicity: 10000
 -d 0
 -e 0
 -r 0.05
 -v 1

=>Getting scaffold sizes... Tue May 16 16:20:13 2017
Saw 31431 sequences.

=>Starting to read BAM files... Tue May 16 16:20:26 2017
Reading bam XXXXX.faa.sorted.bam
On line 10000000
On line 20000000
On line 30000000
<SNIP>
On line 2130000000
On line 2140000000
On line -2140000000
On line -2130000000
On line -2120000000
On line -2110000000
On line -2100000000
On line -2090000000
On line -2080000000
On line -2070000000
On line -2060000000
On line -2050000000
arcs: Arcs.cpp:506: void writeGraph(const string&, ARCS::Graph&): Assertion `out' failed.

=>Starting pairing of scaffolds... Tue May 16 21:14:03 2017

=>Starting to create graph... Tue May 16 21:22:44 2017

=>Starting to write graph file... Tue May 16 21:23:25 2017

      Max Degree (-d) set to: 0. Will not delete any verticies from graph.
      Writting graph file to XXXXX.fasta.scaff_s98_c5_l0_d0_e0_r0.05_original.gv...

(I have removed the info on the genome.)

I guess I see an overflow problem; is there an easy fix?
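The jump from "On line 2140000000" to "On line -2140000000" in the log is exactly what a signed 32-bit line counter does when it passes 2^31 - 1; the likely fix would be a 64-bit counter. A quick model of the wraparound:

```python
def wrap32(n):
    """Value a signed 32-bit counter would hold after counting to n."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n
```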

Barcode length for Arcs

Hi,

I have two chromium libraries and these two libraries have some shared barcodes. To scaffold using the two libraries, I have to change the barcodes to distinguish between them. I plan to append A to the end of barcodes in library 1 and append C to the end of barcodes in library 2. So does Arcs require the barcode length to be 16, or does the barcode length not matter?

Best,
Danshu
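The suffixing described above can be sketched as follows. `retag_read_name` is a hypothetical helper; the key assumption (the question being asked) is that ARCS compares barcodes as opaque strings, so a 17-bp barcode would be acceptable:

```python
def retag_read_name(read_name, lib_suffix):
    """Append a per-library base ('A' or 'C') to the barcode of a
    name-appended read, keeping the two barcode spaces disjoint.

    Hypothetical helper; assumes the barcode is the token after the
    last underscore, and that ARCS treats barcodes as opaque strings.
    """
    stem, barcode = read_name.rsplit('_', 1)
    return f'{stem}_{barcode}{lib_suffix}'
```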

problem with tigmint-cut

Hello,
I am running arcs with arcs-make arcs-tigmint

I am running the command:
./arcs-master/Examples/arcs-make arcs-tigmint draft=draft reads=reads m=40

which I believe runs tigmint-make and arcs-make; with -n these run as follows:

For ./arcs-master/Examples/arcs-make arcs-tigmint -n draft=draft reads=reads m=40

ln -s draft.fasta draft.fa
./tigmint/bin/tigmint tigmint draft=draft reads=reads minsize=2000 as=0.65 nm=5 dist=50000 mapq=0 trim=0 span=20 window=1000 t=40
touch empty.fof
perl -ne 'chomp; if(/>/){$ct+=1; print ">$ct\n";}else{print "$_\n";} ' < draft.tigmint.fa > draft.tigmint.renamed.fa
bwa index draft.tigmint.renamed.fa
sh -c 'bwa mem -t40 -C -p draft.tigmint.renamed.fa reads.fq.gz | /mid_large_2t/important/samtools-1.6/samtools view -Sb - | /mid_large_2t/important/samtools-1.6/samtools sort -@40 -n - -o draft.tigmint.sorted.bam'
echo draft.tigmint.sorted.bam > draft.tigmint_bamfiles.fof
./arcs-master/Arcs/arcs --bx -v -f draft.tigmint.renamed.fa -a draft.tigmint_bamfiles.fof -c 5 -m 40 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b draft.tigmint_c5_m40_s98_r0.05_e30000_z500
python /mid_large_2t/running_arcs/arcs-master/Examples/makeTSVfile.py draft.tigmint_c5_m40_s98_r0.05_e30000_z500_original.gv draft.tigmint_c5_m40_s98_r0.05_e30000_z500.tigpair_checkpoint.tsv draft.tigmint.renamed.fa
ln -s draft.tigmint_c5_m40_s98_r0.05_e30000_z500.tigpair_checkpoint.tsv draft.tigmint_c5_m40_s98_r0.05_e30000_z500_l5_a0.3.tigpair_checkpoint.tsv
/opt/links_v1.8.6/LINKS -f draft.tigmint.renamed.fa -s empty.fof -b draft.tigmint_c5_m40_s98_r0.05_e30000_z500_l5_a0.3 -l 5 -a 0.3 -z 500
rm draft.tigmint_c5_m40_s98_r0.05_e30000_z500_l5_a0.3.tigpair_checkpoint.tsv

./tigmint/bin/tigmint tigmint -n draft=draft reads=reads minsize=2000 as=0.65 nm=5 dist=50000 mapq=0 trim=0 span=20 window=1000 t=40

bwa index draft.fa
bwa mem -t40 -pC draft.fa reads.fq.gz | /mid_large_2t/important/samtools-1.6/samtools view -u -F4 | /mid_large_2t/important/samtools-1.6/samtools sort -@40 -tBX -m 2G -T tmp/ -o draft.reads.sortbx.bam
tigmint/bin/tigmint-molecule -a0.65 -n5 -q0 -d50000 -s2000 draft.reads.sortbx.bam | sort -k1,1 -k2,2n -k3,3n >draft.reads.as0.65.nm5.molecule.size2000.bed
/mid_large_2t/important/samtools-1.6/samtools faidx draft.fa
tigmint/bin/tigmint-cut -p40 -w1000 -n20 -t0 -o draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa draft.fa draft.reads.as0.65.nm5.molecule.size2000.bed
ln -sf draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa draft.tigmint.fa

The arcs run with tigmint only gets as far as tigmint-molecule; the tigmint-cut step produces only
a .fa and a .fa.bed, but the .fa is blank and the .fa.bed is as follows:

-rw-rw-r-- 1 ubuntu ubuntu 697296771 Jan 7 03:20 draft.reads.as0.65.nm5.molecule.size2000.bed
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 7 04:00 draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa
-rw-rw-r-- 1 ubuntu ubuntu 342869 Jan 7 04:00 draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa.bed

DETAIL OF fa.bed file out of tigmint-cut
head draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa.bed
0 0 15517657 0-1
0 15517657 15517670 0-2
0 15517670 17147778 0-3
0 17147778 17147785 0-4
0 17147785 23376074 0-5
0 23376074 23376100 0-6
0 23376100 29696924 0-7
0 29696924 29697326 0-8
0 29697326 29697665 0-9
0 29697665 29697943 0-10

tail draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa.bed
204375 0 33106 204375
204415 0 57665 204415
204459 0 25333 204459
204499 0 64113 204499
204541 0 58322 204541
204581 0 3400 204581
204621 0 53185 204621
204661 0 20990 204661
204701 0 60883 204701
204741 0 113995 204741

If you don't mind, could you please let me know what I may be doing wrong, such that after tigmint-cut I only get the .fa.bed file and the .fa file is empty? Am I missing anything in this command? Since draft.reads.as0.65.nm5.molecule.size2000.trim0.window1000.span20.breaktigs.fa is the input for arcs-tigmint, all downstream files are empty.

-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 5 19:17 draft.tigmint.renamed.fa
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 5 19:17 draft.tigmint.renamed.fa.pac
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 5 19:17 empty.fof

Thanks,

Dharm

How is LINKS using .gv and .tsv files?

Hi,

I'm trying to use the arcs pipeline and everything seems to run without error. I'm not getting any improvement in overall genome contiguity, but I'm not necessarily attributing this to the parameters themselves. What I don't understand quite yet is the:

runLINKS.sh

step: given the script itself, how are the .gv and .tsv files being used by LINKS? Is there an example log file available to determine if the program is functioning as expected?

Supernova fasta output

Hi there :).

I work on a highly heterozygous diploid organism. I have performed a Supernova assembly, and would like to use arcs+links to improve it. However, there are several fasta output options from supernova to choose from. I imagine I could either run arcs+links separately on both haplotype fasta files or once on the raw fasta file. Is there a "correct" way to do this?

Could not open @HD VN:1.3 SO:coordinate. --fatal

Hello,

I am trying to run arcs with a bam file (generated by 10x longranger align), but I am getting the following error. Below is the command I used for running arcs:

command :
arcs -f genome.fa -a possorted_bam.bam -b arcs_out

Error :

=>Getting scaffold sizes... Sun Feb 26 17:16:31 2017
=>Starting to read BAM files... Sun Feb 26 17:16:36 2017
Could not open @HD VN:1.3 SO:coordinate. --fatal.

empty XXXX_original.gv file

Hello,
I preprocessed my 10x paired reads with the https://github.com/tiramisutes/proc10xG scripts and then ran ARCS as follows:

arcs -f genome.fa -a BAM.list -s 98 -c 5 -m 50-1000 -d 0 -r 0.05 -e 30000 -b XXXX

The run completes normally without any error, but the XXXX_original.gv file ends up empty. What happened?

Running: arcs 1.0.1
 pid 83323
 -c 5
 -d 0
 -e 30000
 -l 0
 -m 50-1000
 -r 0.05
 -s 98
 -v 0
 -z 500
 --gap=100
 -b AS_2BESST
 -g NA
 --barcode-counts=NA
 --tsv=NA
 -a BAM.list
 -f genome.fa

=>Getting scaffold sizes... Tue Jan  2 10:59:29 2018

=>Starting to read BAM files... Tue Jan  2 10:59:36 2018
{ "All_barcodes_unfiltered":0, "All_barcodes_filtered":0, "Scaffold_end_barcodes":0, "Min_barcode_reads_threshold":50, "Max_barcode_reads_threshold":1000 }

=>Starting to write reads per barcode TSV file... Tue Jan  2 13:17:14 2018


=>Starting pairing of scaffolds... Tue Jan  2 13:17:14 2018

=>Starting to create graph... Tue Jan  2 13:17:14 2018

=>Starting to write graph file... Tue Jan  2 13:17:14 2018

      Max Degree (-d) set to: 0. Will not delete any vertices from graph.
      Writing graph file to XXXX_original.gv...

=>Starting to create ABySS graph... Tue Jan  2 13:17:14 2018

=>Starting to write ABySS graph file... Tue Jan  2 13:17:14 2018


=>Starting to write TSV file... Tue Jan  2 13:17:14 2018

error: `@HD VN:1.5 SO:unknown': No such file or directory

Hello!
I am trying to use arcs to scaffold PacBio draft genome sequences (draft.fa) using 10X Genomics reads (reads.fq.gz).
So far I have used the tigmint protocol with default options and it was OK.

samtools faidx draft.fa
bwa index draft.fa
bwa mem -t20 -p -C draft.fa reads.fq.gz | samtools sort -@20 -tBX -T /tmp_brew -o draft.reads.sortbx.bam
tigmint-molecule draft.reads.sortbx.bam | sort -k1,1 -k2,2n -k3,3n >draft.reads.molecule.bed
tigmint-cut -p20 -o draft.tigmint.fa draft.fa draft.reads.molecule.bed

However, at the arcs step I got this error message: "error: `@HD VN:1.5 SO:unknown': No such file or directory"

arcs -f draft.tigmint.fa -a draft.reads.sortbx.bam
Running: arcs 1.0.3
pid 13640
-c 5
-d 0
-e 30000
-l 0
-m 50-10000
-r 0.05
-s 98
-v 0
-z 500
--gap=100
-b draft.tigmint.fa.scaff_s98_c5_l0_d0_e30000_r0.05
-g draft.tigmint.fa.scaff_s98_c5_l0_d0_e30000_r0.05.dist.gv
--barcode-counts=NA
--tsv=NA
-a draft.reads.sortbx.bam
-f draft.tigmint.fa
=> Getting scaffold sizes... Mon Jul 2 14:37:54 2018
=> Reading alignment files... Mon Jul 2 14:38:02 2018
error: `@HD VN:1.5 SO:unknown': No such file or directory

Could you help or recommend how to handle these errors?

empty graph.gv file

I am currently working on scaffolding a plant genome with ARCS, but I'm blocked because there is no content in the output graph.gv file from ARCS (graph{}).

I've appended the barcode index to the read names and sorted the alignment file (bwa 0.7.15) by read name. I've tried both 15- and 85-fold coverage of 10X reads, but neither works. There is no error reported and the log file ends with "Done!"

The reference dataset resulted from a Supernova assembly; it is noteworthy that only those scaffolds from one specific chromosome were used as input (we have done genetic mapping). I wonder if this could trigger the error?

The test dataset of human genome worked quite well so I bet there is no problem with the installation.

Here is my command line for running ARCS:
arcs -f ./10X_asm_megabubbles_2_2.fasta -a bam.list -s 98 -c 5 -l 0 -z 1000 -m 5-10000 -d 0 -e 30000 -r 0.1 -v 1

I've tried -s 90, it didn't work either.

Configure problem with write_graphviz_dp

I have a problem installing Arcs caused by the boost library. When I run ./configure I get this error:

[...]
Making install in Arcs
make[1]: Entering directory `/pica/h1/vpeona/arcs/Arcs'
g++ -DHAVE_CONFIG_H -I. -I..  -I/home/vpeona/arcs/Arcs -I/home/vpeona/arcs/Common -I/home/vpeona/arcs/DataLayer -I/home/vpeona/arcs -I/home/vpeona/arcs -isystem../boost_1_61_0/  -isystem/home/vpeona/arcs/1_58_0 -Wall -Wextra -Werror -std=c++0x -fopenmp -g -O2 -MT arcs-Arcs.o -MD -MP -MF .deps/arcs-Arcs.Tpo -c -o arcs-Arcs.o `test -f 'Arcs.cpp' || echo '/home/vpeona/arcs/Arcs/'`Arcs.cpp
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/backward/hash_set:60,
                 from /usr/include/boost/graph/adjacency_list.hpp:25,
                 from /usr/include/boost/graph/undirected_graph.hpp:11,
                 from Arcs.h:20,
                 from Arcs.cpp:1:
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/backward/backward_warning.h:28:2: error: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated.
Arcs.cpp: In function ‘void writeGraph(const std::string&, ARCS::Graph&)’:
Arcs.cpp:505: error: ‘write_graphviz_dp’ is not a member of ‘boost’
make[1]: *** [arcs-Arcs.o] Error 1
make[1]: Leaving directory `/pica/h1/vpeona/arcs/Arcs'
make: *** [install-recursive] Error 1

I also downloaded the boost library and specified the path to configure:
./configure --with-boost=../boost_1_61_0/ && make install

still not working. I'm using gcc 4.4.7. Do you have any suggestions?

N50 decreased from Supernova to Tigmint-ARCS-LINKS

Hello
I run the supernova using

supernova run --id=Yuc27M --fastqs=analyzing_fastq --maxreads=1080624478 --localcores=64

and got the phased fasta file output using the following command:

supernova mkoutput --style=pseudohap2 --asmdir=EG9-30/outs/assembly --outprefix=/mid_large_2t/second_10xdenova_assembly_EG9_30/EG9-30/newoutput_FASTA/EG9-30_10xgenomics_consensus --index

The output is

Yuc27M_10xgenomics_consensus.1.fasta
Yuc27M_10xgenomics_consensus.2.fasta

The contig lengths weren't satisfactory, so we would like to extend them using Tigmint with ARCS and LINKS.

For Tigmint with ARCS and LINKS, the following command was first used to get the linked fastq files:

longranger basic --id=sample_Yuc27M_1_arcs --fastqs=samples_for_10x --localcores=64

Reads from longranger basic were renamed reads.fq.gz, and Yuc27M_10xgenomics_consensus.1.fasta from the Supernova output was renamed draft.fa. Then arcs-make arcs-tigmint was run using the following command:

./arcs-master/Examples/arcs-make arcs-tigmint draft=draft reads=reads

The first run used the LINKS default a=0.3, and I also ran two additional trials with a=0.6 and a=0.9 for the LINKS -a flag. Raising a increased N50 relative to a=0.3, but compared with the Supernova phased FASTA output the results went the opposite way: N50 decreased from the original Supernova contigs. The abyss-fac results for the original Supernova assembly Yuc27M_10xgenomics_consensus.1.fasta and the three LINKS outputs,
draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.scaffolds.fa
draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.6.scaffolds.fa
draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.9.scaffolds.fa
are as follows:

abyss-fac Yuc27M_10xgenomics_consensus.1.fasta
n	n:500	L50	min	N80	N50	N20	E-size	max	sum	name
15157	15121	40	815	5973469	18.56e6	34.71e6	22.1e6	59.6e6	2.476e9	Yuc27M_10xgenomics_consensus.1.fasta

abyss-fac draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.scaffolds.fa
n	n:500	L50	min	N80	N50	N20	E-size	max	sum	name
15600	15057	92	518	2187088	7011644	16.15e6	9827444	41.75e6	2.476e9	draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.scaffolds.fa

abyss-fac draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.6.scaffolds.fa
n	n:500	L50	min	N80	N50	N20	E-size	max	sum	name
15018	14475	84	518	2785433	8790483	17.73e6	10.68e6	41.76e6	2.476e9	draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.6.scaffolds.fa

abyss-fac draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.9.scaffolds.fa
n	n:500	L50	min	N80	N50	N20	E-size	max	sum	name
14619	14076	72	518	3029632	9622543	21.32e6	12.74e6	52.08e6	2.476e9	draft.tigmint_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.9.scaffolds.fa

My question is: am I missing anything while running arcs-make arcs-tigmint that may be shortening the contigs?

I ran arcs-make with all the default parameters except the LINKS -a values mentioned above (0.3/0.6/0.9).

# tigmint Parameters
minsize=2000
as=0.65
nm=5
dist=50000
mapq=0
trim=0
span=20
window=1000

# bwa Parameters
t=40

# ARCS Parameters
c=5
m=50-10000
z=500
s=98
r=0.05
e=30000
D=false
dist_upper=false
d=0
gap=100
B=20
# LINKS Parameters
l=5
a=0.3

Are there any other parameters I can change that could increase contig length and also decrease misassemblies?

Your input is greatly appreciated.

Thanks

Pick up chromium barcode from BX tag

Hi!

Thanks for releasing arcs, I've started trying it out and it looks promising. I'm wondering if you would consider picking up the chromium barcode from the BX tag instead of from the read name? According to the 10x genomics documentation this is where the verified barcode information should be placed.

With kind regards,
Johan Dahlberg
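For context, picking up the barcode from the BX tag rather than the read name would mean scanning the optional fields (columns 12 and up) of each SAM record. A minimal sketch; `bx_tag` is an illustrative helper, not ARCS code:

```python
def bx_tag(sam_line):
    """Return the chromium barcode from a SAM record's BX:Z tag, or None.

    Optional fields start at column 12; BX:Z: is where longranger
    (and 'bwa mem -C') place the verified barcode.
    Illustrative helper, not ARCS code.
    """
    for field in sam_line.rstrip('\n').split('\t')[11:]:
        if field.startswith('BX:Z:'):
            return field[len('BX:Z:'):]
    return None
```

Judging by the `arcs --bx` invocation that appears elsewhere in these issues, ARCS has since gained exactly this option.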
