yechengxi / dbg2olc

A genome assembler that reduces the computational time of human genome assembly from 400,000 CPU hours to 2,000 CPU hours, utilizing long erroneous 3GS sequencing reads and short accurate NGS sequencing reads.

License: GNU General Public License v3.0

C++ 90.34% Makefile 0.01% C 1.00% Python 5.48% Shell 3.16%

dbg2olc's Issues

output error information

Hi
First of all, thank you so much for writing this great program. DBG2OLC is faster than other assemblers.

I have already used it to assemble several datasets and everything was fine. However, when I ran it on another dataset, it printed many "error: complement_strY" messages. Could you tell me how to correct the error?
Thank you ~

raw pacbio

I just want to confirm that DBG2OLC takes raw PacBio reads for the hybrid assembly, rather than corrected PacBio reads, right?
Thanks!

larger genome size

hi there,

I am getting twice the expected genome size, both in backbone_raw.fasta and in the final output. Could you please indicate what might be happening and how to correct it? I have 30X PacBio and 70X Illumina (250 bp and 150 bp).
Below is an example of the arguments used; I also tried several other options with the same result.
DBG2OLC LD1 0 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.01 RemoveChimera 1 k 17

thanks

Core dumped

Hello, I am trying to run DBG2OLC for an illumina-only assembly, and SparseAssembler seems to work correctly, outputting Contigs.txt, but I get this core dumped error when using DBG2OLC with the recommended configuration from the github page for illumina-only assembly:

Loading contigs.
703999571 k-mers in round 1.
668403631 k-mers in round 2.
Scoring method: 3
Match method: 1
Loading long read index
0 selected reads.
0 reads loaded.
./dbg1.sh: line 6: 28378 Floating point exception(core dumped) DBG2OLC LD 0 MinOverlap 31 Contigs Contigs.txt k 21 PathCovTh 2 KmerCovTh 1 i1 $FORWARD i2 $REVERSE RemoveChimera 1 ChimeraTh 2 ContigTh 2

My reads have 150x-200x coverage, and are forward+reverse. Additionally, all the software is being run in a conda environment exclusive to only DBG2OLC and its components.

Do you have any insights on what might be going on here?
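
The log above reports "0 reads loaded." immediately before the crash, and on Linux a "Floating point exception" is commonly raised by an integer division by zero, so one plausible (unconfirmed) reading is that the read files were empty or not found. A minimal Python sketch of a pre-flight check, assuming uncompressed FASTA/FASTQ input; the function name `looks_like_reads` is hypothetical and not part of DBG2OLC:

```python
import os

def looks_like_reads(path):
    """Return True if path exists, is non-empty, and starts like FASTA ('>') or FASTQ ('@').

    Assumption: the file is uncompressed text; a gzipped file will return False.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        return handle.read(1) in (b">", b"@")

if __name__ == "__main__":
    # Hypothetical usage: check the same paths that are passed as i1/i2 (or f).
    for candidate in ("forward.fastq", "reverse.fastq"):
        print(candidate, looks_like_reads(candidate))
```

Running such a check on the values of $FORWARD and $REVERSE before launching DBG2OLC would at least rule out an unset variable or an empty file.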

Problem in step3-Call consensus

The first two steps ran fine, but step 3 fails when I run:
$ sh split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_reads.fasta ./consensus_dir 2 >cns_log.txt
The errors are as follows. What is wrong?
/bin/rm: cannot remove './consensus_dir/backbone-': No such file or directory
./split_and_run_sparc.sh: 17: ./DBG2OLC/DBG2OLC/utility/split_and_run_sparc.sh: ./split_reads_by_backbone.py: not found
ls: cannot access './consensus_dir/.reads.fasta': No such file or directory
./split_and_run_sparc.sh: 1: eval: cannot create ./consensus_dir/final_assembly.fasta: Directory nonexistent

settings for 10x and 7x coverage

Hi,
Does anyone have a good setting for DBG2OLC for low coverage such as 10x and 7x?

Thank you in advance,

Michal

truncated backbone_raw.fasta

Hi there,

thanks a lot for this pipeline which looks really cool.

I'm having an issue where some Backbones listed in the DBG2OLC_Consensus_info.txt output are not appearing in the backbone_raw.fasta file.

(Around 1500 Backbones in the consensus info, 528 in the backbone_raw.fasta)

In the output I'm getting some messages like:

Loading contigs.
180001914 k-mers in round 1.
168517309 k-mers in round 2.
...skipping...
4715199 alignments calculated.
165 secs.
Loading non-contained sequences.
22107 loaded.
error: complement_strR
error: complement_strS
error: complement_strM
error: complement_strR
error: complement_strS
error: complement_strY
error: complement_strY
error: complement_strY
error: complement_strY
error: complement_strR
frag sum: 327727465
offset sum: 158812461
Extension warning.
Extension warning.
Extension warning.

Can you point me to the meaning of these error: complement statements?

best wishes

Matt
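
The trailing letters in these messages (R, S, M, Y) are IUPAC nucleotide ambiguity codes, so one plausible reading, not confirmed against the DBG2OLC source, is that the contigs or reads contain bases other than A, C, G and T. A minimal Python sketch that counts such characters in FASTA-formatted lines; the function name `count_non_acgt` is hypothetical:

```python
from collections import Counter

def count_non_acgt(fasta_lines):
    """Count every sequence character that is not A, C, G or T (case-insensitive)."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        for base in line.upper():
            if base not in "ACGT":
                counts[base] += 1
    return counts

if __name__ == "__main__":
    # Tiny in-memory example; for a real file pass an open file handle instead.
    demo = [">contig1", "ACGTRYACGT", ">contig2", "acgtsmN"]
    print(dict(count_non_acgt(demo)))
```

If the counts for a contig file are non-zero, checking how the short-read assembler emitted those ambiguity codes would be a reasonable next step.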

Zero Unique Matching Kmers

When executing DBG2OLC on contigs (generated from Illumina reads via SPAdes) and 30 cells of PacBio long-reads (approximately 250 gigabase), no matter what parameters I try, I always get a result of "Matching Unique Kmers: 0", followed by a crash.

For one example run, the contigs were found to have 116,907,976 Kmers after round 2 (277,557,509 after round 1). On indexing all the long reads, 462,454,563 Kmers were found. It seems highly unlikely with that number of Kmers to have no matches whatsoever.

Following the zero matching Kmers, there is a segfault, and the program terminates. The total logged outputs from the queue system are below.

Example command: 
For third-gen sequencing: DBG2OLC LD1 0 Contigs contig.fa k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.005 f reads_file1.fq/fa f reads_file2.fq/fa
For sec-gen sequencing: DBG2OLC LD1 0 Contigs contig.fa k 31 KmerCovTh 0 MinOverlap 50 PathCovTh 1 f reads_file1.fq/fa f reads_file2.fq/fa
Parameters:
MinLen: min read length for a read to be used.
Contigs:  contig file to be used.
k: k-mer size.
LD: load compressed reads information. You can set to 1 if you have run the algorithm for one round and just want to fine tune the following parameters.
PARAMETERS THAT ARE CRITICAL FOR THE PERFORMANCE:
If you have high coverage, set large values to these parameters.
KmerCovTh: k-mer matching threshold for each solid contig. (suggest 2-10)
MinOverlap: min matching k-mers for each two reads. (suggest 10-150)
AdaptiveTh: [Specific for third-gen sequencing] adaptive k-mer threshold for each solid contig. (suggest 0.001-0.02)
PathCovTh: [Specific for Illumina sequencing] occurence threshold for a compressed read. (suggest 1-3)
Author: Chengxi Ye [email protected].
last update: Jun 11, 2015.
Loading contigs.
277557509 k-mers in round 1.
116907976 k-mers in round 2.
Analyzing reads...
File1: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio1.fastq.gz
File2: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio1.fastq.gz
File3: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio2.fastq.gz
File4: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio3.fastq.gz
File5: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio4.fastq.gz
File6: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio5.fastq.gz
File7: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio6.fastq.gz
File8: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio7.fastq.gz
File9: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio8.fastq.gz
File10: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio9.fastq.gz
File11: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio10.fastq.gz
File12: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio11.fastq.gz
File13: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio12.fastq.gz
File14: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio13.fastq.gz
File15: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio14.fastq.gz
File16: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio15.fastq.gz
File17: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio16.fastq.gz
File18: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio17.fastq.gz
File19: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio18.fastq.gz
File20: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio19.fastq.gz
File21: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio20.fastq.gz
File22: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio21.fastq.gz
File23: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio22.fastq.gz
File24: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio23.fastq.gz
File25: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio24.fastq.gz
File26: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio25.fastq.gz
File27: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio26.fastq.gz
File28: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio27.fastq.gz
File29: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio28.fastq.gz
File30: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio29.fastq.gz
File31: /users/PAS1172/osu0330/Storage/PacBio/SarraceniaPacBio30.fastq.gz
Long reads indexed. 
Total Kmers: 462454563
Matching Unique Kmers: 0
Compression time: 237 secs.
/var/spool/torque/mom_priv/jobs/2112484.owens-batch.ten.osc.edu.SC: line 60: 123486 Segmentation fault      ~/DBG2OLC/DBG2OLC KmerCovTh 2 AdaptiveTh 0.005 MinOverlap 20 RemoveChimera 1 Contigs ~/Scratch/SPAdes_Out/misc/assembled_scaffolds.fasta k 79 f ~/Storage/PacBio/SarraceniaPacBio1.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio1.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio2.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio3.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio4.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio5.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio6.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio7.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio8.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio9.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio10.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio11.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio12.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio13.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio14.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio15.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio16.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio17.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio18.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio19.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio20.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio21.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio22.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio23.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio24.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio25.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio26.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio27.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio28.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio29.fastq.gz f ~/Storage/PacBio/SarraceniaPacBio30.fastq.gz

-----------------------
Resources requested:
mem=1500gb
nodes=1:ppn=48:hugemem
-----------------------
Resources used:
cput=00:43:20
walltime=00:44:23
mem=25.477GB
vmem=19.941GB

I do know that it is recommended to use outputs from SparseAssembler to run DBG2OLC, and I currently have SparseAssembler running on the Illumina data to try that. When others used contigs that did not come from SparseAssembler, though, the main issue was smaller-than-expected genome outputs. This suggests to me that something more complicated may be occurring here.
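
One detail worth noting in the log above: the usage text suggests k 17 for third-gen sequencing, while the command that crashed used k 79. Under a simple model with independent per-base errors at rate e, the chance that a k-mer from a raw long read is error-free is (1 - e)^k. The sketch below evaluates this; the 15% error rate is an assumed, illustrative figure for raw PacBio reads, not a measurement from this dataset:

```python
def error_free_kmer_fraction(k, error_rate):
    """Probability that all k bases of a k-mer are correct, assuming independent errors."""
    return (1.0 - error_rate) ** k

if __name__ == "__main__":
    assumed_error_rate = 0.15  # assumption: illustrative raw long-read error rate
    for k in (17, 31, 79):
        print(k, error_free_kmer_fraction(k, assumed_error_rate))
```

Under this assumption roughly 6% of 17-mers are error-free, against only a few per million 79-mers. This does not by itself explain an exact count of zero, but it suggests retrying with the k value recommended in the usage text.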

Segmentation fault with large genome

Hi yechengxi,

I have been trying to assemble a ~5-6 Gb (diploid) genome using DBG2OLC. I have 4.7 Gb of Illumina contigs generated with Megahit (50x coverage) with an N50 of 1.7 kb. The PacBio dataset consists of 150 Gb of Sequel subreads, which should be ~25-30x. I used the following command: DBG2OLC Contigs $CTG k 17 RemoveChimera 1 KmerCovTh 3 MinOverlap 30 AdaptiveTh 0.01 f $PBR

The assembly runs well up to the step that generates backbone_raw.fasta, at which point it crashes with a segmentation fault (core dumped, see attached log dbg2olc4_5185358.out.txt). The assembly stats look good but the file appears incomplete. I tried tuning the parameters to match different potential coverages (KmerCovTh=2,3,6 MinOverlap=10,30,60 and AdaptiveTh=0.001,0.01,0.02), but all these attempts resulted in a segmentation fault. Do you have any recommendations?

Best wishes,
dgavr

KeyError from split_reads_by_backbone.py

Hello,

First, thank you very much for this excellent software!

I am trying to perform consensus calling. When I run 'split_and_run_sparc.sh', it shows an error message like:

Traceback (most recent call last):
File "./split_reads_by_backbone.py", line 131, in <module>
main()
File "./split_reads_by_backbone.py", line 114, in main
id = backbone_to_id[backbone]
KeyError: 'seq1435979_len218_cov188'

I have been trying to find the cause, but have not managed to yet.

I would be very grateful for any ideas on how to solve this problem.

Kiwoong

Floating point exception

I have been running DBG2OLC on a few datasets, and for some of them, when I run "SparseAssembler", 0 contigs are reported and most of the generated files are empty. When I next run "DBG2OLC" I get "80415 Floating point exception" for the DBG2OLC step. I tried running these assemblies with different values and combinations of the parameters, but I still get "Floating point exception" and empty files. I would appreciate it if you could tell me how to avoid this error and what combination of parameters I should use to get an output.

DBG2OLC parameters for huge dataset of PacBio

Hi yechengxi,

I have a plant genome of ~350 Mb (assuming diploid) to assemble. I would like to try DBG2OLC with ~20x of Illumina (3 datasets) and ~70x of PacBio (1 dataset).
I have a couple of questions:

Regarding the genome size parameter "GS", does the software expect the haploid or the diploid size of the genome?
With such a large amount of PacBio data (~70x), do you suggest specific values for "AdaptiveTh #", "KmerCovTh #", "MinOverlap #" and "RemoveChimera #"?
I am not sure I understand the function of the parameter "LD 0/1" in SparseAssembler. Can you explain it to me, please?

Thank you very much for your patience

combining ReadsInfoFrom_* fails

After running individual steps for subsets of fasta files I've combined the ReadsInfoFrom_*.fasta with:
/opt/kgapps/DBG2OLC-20170411/DBG2OLC k 17 AdaptiveTh 0.004 KmerCovTh 2 MinOverlap 20 LD0 1 Contigs ../../../../Illumina_contigs/genome.contig.fasta RemoveChimera 1 LD 1 f cell10.fasta f cell11.fasta f cell12.fasta f cell13.fasta f cell14.fasta f cell15.fasta f cell16.fasta f cell17.fasta ... f cell8.fasta f cell9.fasta
seemed to run well until it ends with:

...
total alignments: 8472039
Avg alignment size: 2
Avg sparse alignment size: 2
8644693 alignments calculated.
177 secs.
Loading non-contained sequences.
0 loaded.
frag sum: 508751022
offset sum: 192311552
Empty sequence loaded. It looks like you have messed up the data.
Assembly finished.

Can you help?

Problem with step III

Hi,
I used contigs from Platanus and PacBio reads in FASTA format for step II. Now I'm stuck at step III. Do you have any advice?

./split_and_run_sparc.sh: 27: ./split_and_run_sparc.sh: cmd+=blasr -nproc 64 ./consensus_dir/backbone-1358.reads.fasta ./consensus_dir/backbone-1358.fasta -bestn 1 -m 5 -minMatch 19 -out ./consensus_dir/backbone-1358.mapped.m5; : not found

A problem with the number of files (Python)

Traceback (most recent call last):
File "/export/home/aay2c/app2/DBG2OLC-master/utility/split_reads_by_backbone.py", line 131, in <module>
File "/export/home/aay2c/app2/DBG2OLC-master/utility/split_reads_by_backbone.py", line 122, in main
IOError: [Errno 24] Too many open files: './consensus_dir/backbone-2778.reads.fasta'
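
Errno 24 means the process hit its limit on simultaneously open file descriptors, which is consistent with one output file being kept open per backbone. A minimal Python sketch of one workaround: open each per-backbone file in append mode only for the duration of a single write. This is not the author's script; the function name and the (backbone_id, read_name, sequence) input layout are hypothetical, and only the backbone-N.reads.fasta naming is taken from the error message:

```python
import os

def write_reads_open_close(assignments, out_dir):
    """Append each read to its backbone file, keeping at most one file open at a time.

    assignments: iterable of (backbone_id, read_name, sequence) tuples (hypothetical layout).
    """
    for backbone_id, name, seq in assignments:
        path = os.path.join(out_dir, "backbone-%s.reads.fasta" % backbone_id)
        # Append mode: the file is opened, written and closed for every record.
        with open(path, "a") as handle:
            handle.write(">%s\n%s\n" % (name, seq))
```

The trade-off is speed: reopening a file per record is slower than holding handles open, and because the files are appended to, stale output from an earlier run has to be removed first.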

Short read contig selection for calculation of compressed reads

For "multi-threaded" assembly of large genomes, here you say:

Run DBG2OLC with the same set of parameters and Illumina Contigs in each of these directories...

Forgive a possibly stupid question but does it always need to have the same set of Illumina contigs? In other words, is the calculation of compressed reads dependent on the Illumina contigs?

I ask because it would otherwise be useful to run this step on a bunch of PacBio files and then use the combined compressed set with different Illumina assemblies.

Consensus step - empty output

Hi Chengxi,

Thank you for developing an easy-to-use hybrid assembler for large genomes.

When I run 'split_and_run_sparc.sh' script at the consensus step, both .m5 and .consensus.fasta files are created for each chunk and I am getting the following output message:

[INFO] 2017-05-02T16:27:28 [blasr] started.
[INFO] 2017-05-02T16:27:29 [blasr] ended.
For help: Sparc -h
Backbone size: 181906
Empty ouput. Backbone copied.

[INFO] 2017-05-02T16:27:28 [blasr] started.
[INFO] 2017-05-02T16:27:29 [blasr] ended.
For help: Sparc -h
Backbone size: 145661
Empty ouput. Backbone copied.

[INFO] 2017-05-02T16:27:28 [blasr] started.
[INFO] 2017-05-02T16:27:28 [blasr] ended.
For help: Sparc -h
Backbone size: 59113
Empty ouput. Backbone copied.

Is this expected? I was wondering whether it could be because of my PacBio coverage which is only 5x.

Thank you for your help,
Seyhan

Extension warning

Hi,
Often when I perform a parameter sweep of KmerCovTh, MinOverlap and AdaptiveTh, I get some "Extension warning" messages at the end of the assembly process.
For some parameters I get one or two, sometimes more than 10, and sometimes none at all (with the same dataset).

How should we interpret these messages, and how bad are they? Do they definitely represent misassemblies, such as two falsely joined contigs?

I am working on plant genomes, currently a few hundred megabases in size, with 10-20x PacBio coverage.

For example:
....
total alignments: 4526698
Avg alignment size: 9
Avg sparse alignment size: 3
4959403 alignments calculated.
67 secs.
Loading non-contained sequences.
180272 loaded.
frag sum: 823679464
offset sum: 333620515
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Assembly finished.

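Question about Contigs

Hello Professor,
When using the software, if the contig length is greater than 20, it gets stuck at the "Loading contigs." step. My command is:
./DBG2OLC k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 20 RemoveChimera 1 Contigs Contigs.txt f 200k.fas
What could be causing this?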

shrinking contigs during Sparc consensus step

Hi,

Thanks for developing these interesting tools. I found out about dbg2olc here: http://biorxiv.org/content/early/2015/10/16/029306
and have since seen your arXiv article (https://arxiv.org/pdf/1410.2801.pdf).

It looks like DBG2OLC can be useful for me since I have both long and short read data. I have an insect genome of ~300 Mb to assemble (in case that info helps with suggestions later).

I gave it a try recently. The commands I used were:

SparseAssembler LD 0 k 51 g 15 NodeCovTh 2 EdgeCovTh 1 GS 300000000 f Illumina_50x.fastq

DBG2OLC k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 20 RemoveChimera 1 Contigs Contigs.txt f Pacbio_20x.fasta

cat Contigs.txt Pacbio_20x.fasta > ctg_pb.fasta #was pb_reads.fsta

mkdir consensus_dir
split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_pb.fasta ./consensus_dir 2 >cns_log.txt

Side note: split_reads_by_backbone.py gave a "Too many open files" error, so I made a fork with two other versions of that script (split_reads_by_backbone_readdict.py and split_reads_by_backbone_openclose.py) as well as a modified split_and_run_sparc.sh called split_and_run_sparc.path.sh. The latter script lets one choose which version of that script to use, and it assumes that all scripts it calls are in the PATH instead of the current working directory (e.g. Sparc instead of ./Sparc). The fork is here: https://github.com/JohnUrban/DBG2OLC. These changes are in the branch called split_reads_by_backbone_minimize_open_files, if interested.

So back to my issue - I ran the commands above.
The dbg2olc output contigs in backbone_raw.fasta give an N50 comparable to long read assemblers and have a total assembly size close to the expected genome size. When backbone_raw.fasta is broken up into separate fastas (backbone-0.fasta ..... backbone-N.fasta), the sequences are unchanged. However, after split_and_run_sparc.sh finishes the final_assembly.fasta has fewer contigs, a much smaller assembly size (~5-7% of original size), and much smaller contigs in general. When I look in the backbone-n.fasta files, they are also smaller than they originally started out. Ultimately, in terms of size, backbone_raw.fasta seems more correct than final_assembly.fasta. Nonetheless, it should go through the consensus step - so I need to figure out a solution whether through Sparc or other. It looks like the authors of the biorxiv article used /split_and_run_pbdagcon.sh that I assume was originally one of the utilities offered in DBG2OLC, but I cannot find it.

Can you provide me with any insights into why this is happening and/or any advice to prevent it?

best,

John Urban

Racon vs sparc

Hi,
did anyone try to use Racon after DBG2OLC?

# Correction 1
source activate minimap2
minimap2 -t 8 -ax map-pb \
    output.gfa.fasta \
    input.fasta > output.gfa1.sam

source activate racon
racon -t 8 \
    input.fasta \
    output.gfa1.sam \
    output.gfa.fasta > output.racon1.fasta

# Correction 2
source activate minimap2
minimap2 -t 8 -ax map-pb \
    output.racon1.fasta \
    input.fasta > output.gfa2.sam

source activate racon
racon -t 8 \
    input.fasta \
    output.gfa2.sam \
    output.racon1.fasta > output.racon2.fasta

Thank you in advance,

Michal

error

Hi,

I have downloaded the precompiled versions of AssemblyStatistics, DBG2OLC, etc., and am getting this error: ./AssemblyStatistics: 6: ./AssemblyStatistics: Syntax error: newline unexpected

Any advice?
Thanks

Segmentation fault

Hi! I am trying to perform Step 2 (overlap and layout) using contigs generated with SOAPdenovo2 (from Illumina reads) and PacBio long reads. I keep getting the "Segmentation fault" error message below:

/var/spool/gridengine/execd/localhost/job_scripts/390: line 4: 13803 Segmentation fault (core dumped) ./DBG2OLC k 17 AdaptiveTh 0.001 KmerCovTh 2 MinOverlap 10 RemoveChimera 1 Contigs /home/isabela/scaf_DBG2OLC/sugaracane_soap.fasta f /home/isabela/scaf_DBG2OLC/pacbio_ccs_cor10x.fasta

Would you please enlighten me?

Thank you!

sparse assembly on longest illumina paired reads

Hi,
I am trying to use the DBG assembler to construct short but accurate contigs (step 1) from Illumina paired-end reads.
I ran step 0 to select the longest reads. Since my short reads are Illumina paired-end reads, in the command below I pass all the paired read files as inputs to SelectLongestReads.

My commands:
fastq_ls=""
cd /gpfs0/home/gdlessnicklab/cxt050/Data/A673_WGS_VCFs/A673_10x_fastq/
mkdir selectedReads
for file in *.fastq; do fastq_ls+=" f "$file; done
~/opt/AssemblyUtility-master/compiled/SelectLongestReads sum 96000000000 longest 1 o selectedReads/Illumina_30x.fastq $fastq_ls

Then in the next step I simply pass the selected Illumina reads file with the 'f' option:
~/opt/SparseAssembler g 10 k 51 LD 0 GS 9600000000 NodeCovTh 1 EdgeCovTh 0 f selectedReads/Illumina_30x.fastq

Is this the way you are supposed to run it?
Also, is there a way to speed up the SparseAssembler (step 1)? It's been running for almost 2 days.

Thank you in advance for your advice.

Choose the best assembly

Dear Chengxi Ye,

First, I want to say that I'm new to this kind of processes. So, thank you in advance for your time and patience.
I have performed numerous assemblies with DBG2OLC (about forty!). I noticed that, in general, with more stringent conditions (that is, higher KmerCovTh, MinOverlap and AdaptiveTh) the most important quality statistics (i.e. N50, average contig size, number of contigs, etc.) tend to improve. On the other hand, this leads to a reduction in the total number of assembled bases.
I'm assembling a 360 Mbp genome and I have obtained assemblies like these (simplifying):

(1) Number of assembled bases: 358,530,138
N50: 591,435
(2) Number of assembled bases: 324,113,093
N50: 822,237
(3) Number of assembled bases: 305,918,712
N50: 1,130,615

As you can see, the N50 value of the test number (3) is much higher than the test number (1), but the number of assembled bases is much lower than the estimated genome size.
Going to the point, my questions are:
(a) What kind of sequences are typically eliminated by increasing the stringent conditions?
(b) How can I decide which of the assemblies I've obtained turns out to be the best? Is it better to keep close to the estimated size of the genome?

Thanks for any advice!
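
Since the three assemblies above differ in total size, their N50 values are each computed against a different total. A minimal Python sketch of the standard N50 definition together with NG50, which uses a fixed estimated genome size (360 Mbp here) instead of the assembly total so that the assemblies are compared against the same target; the function names are my own:

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

def ng50(lengths, genome_size):
    """Like N50, but the 50% target is half of an estimated genome size.

    Returns 0 when the assembly covers less than half of genome_size.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return 0
```

With real data the lengths would come from the contig FASTA of each assembly, and `genome_size` would be the 360,000,000 bp estimate quoted above.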

Step1 with illumina reads of different insert size ?

Hi,

I would like to perform a hybrid assembly, but I have paired-end reads with different insert sizes. How should I process them? Should I just concatenate everything?

Segmentation fault (core dumped) when analyzing reads

Hi, I'm trying to run dbg2olc with a small amount of nanopore data and a pre-existing assembly. It has repeatedly hit a segmentation fault at the same step. If you have any insight as to what is going on, I would really appreciate it! I did run SelectLongestReads beforehand, but it just selected all of the reads since the coverage is so low.

Thanks,
Jen

Here is the command:

"$dbg2olcPATH"/./DBG2OLC k 17 AdaptiveTh 0.001 KmerCovTh 2 MinOverlap 20 MinLen 500 RemoveChimera 0 Contigs "$assembly" f "$selectedReads"

Here is the end of the outfile:

Loading contigs.
238092557 k-mers in round 1.
194244643 k-mers in round 2.
Analyzing reads...
File1: /data/processedData/dbg2olc_selectReads/nanoporeLongestReadsAllRuns.fasta

multithreading

Hi, I was just wondering if there's a multithreading option for DBG2OLC, because it seems to be quite slow and uses only 5% CPU on a multicore server.
What am I missing?

Consensus step: using blasr and Sparc without the script split_and_run_sparc.sh

Dear all,

I am using this pipeline to assemble a plant genome whose estimated size is between 1 and 1.5 Gb. I have obtained the backbone. However, the script is not working for me: each time I try to run it I receive this message: slurmstepd: error: execve():permission denied.
Therefore, I decided to tackle the consensus step by running blasr and Sparc independently: I am running blasr to obtain the mapped reads and then running Sparc on them. Is there a problem with this, or should I try to fix the script?

I am using the same command options for blasr that are indicated in the script.

Empty sequence loaded. It looks like you have messed up the data.

Hi,

I am trying to assemble a genome from Illumina short reads and MinION long reads.
I first ran SparseAssembler on the Illumina reads:
SparseAssembler LD 0 NodeCovTh 1 EdgeCovTh 0 k 31 g 15 PathCovTh 100 GS 80000000 i1 $IlluminaReadsR1 i2 $IlluminaReadsR2

Then I ran DBG2OLC using the Contigs.txt file generated by SparseAssembler and the MinION FASTQ:
DBG2OLC LD1 0 k 17 KmerCovTh 10 MinOverlap 20 AdaptiveTh 0.01 Contigs Contigs.txt f longReads.fastq

Everything seems to go fine until DBG2OLC tries to produce the backbone_raw.fasta file:

Loading contigs.
36497978 k-mers in round 1.
34042305 k-mers in round 2.
Analyzing reads...
File1: longReads.fastq
Long reads indexed.
Total Kmers: 10098923862
Matching Unique Kmers: 2127935195
Compression time: 15430 secs.
Scoring method: 3
Match method: 2
Loading long read index
Loading file: ReadsInfoFrom_longReads.fastq
1926913 reads loaded.
Average size: 9
Loaded.
1926913 reads.
Calculating reads overlaps, round 1
1000000 reads aligned.
Avg alignment size: 9
total alignments: 11084129
Avg alignment size: 9
Avg sparse alignment size: 4
20453471 alignments calculated.
Round 1 takes 2799 secs.
Calculating reads overlaps, round 2
1000000 reads aligned.
Avg alignment size: 8
total alignments: 8081896
Avg alignment size: 8
Avg sparse alignment size: 3
15774478 alignments calculated.
Round 2 takes 1577 secs.
1374023 contained out of 1926913
80366 tips in the graph.
Graph simplification.
Iteration: 0
42279 branching positions.
14715 linear nodes.
Iteration: 1
18949 branching positions.
18899 linear nodes.
Iteration: 2
11995 branching positions.
20207 linear nodes.
Iteration: 3
10289 branching positions.
20375 linear nodes.
57091 edges deleted.
0 chimeric reads deleted.
2802 bad nodes removed in aggrassive cleaning.
52312 tips in the graph.
Loading contigs.
Collecting information for consensus.
1926913 reads.
Calculating reads overlaps.
1000000 reads aligned.
Avg alignment size: 13
Avg sparse alignment size: 3
total alignments: 119110145
Avg alignment size: 12
Avg sparse alignment size: 3
226226058 alignments calculated.
27238 secs.
Loading non-contained sequences.
0 loaded.
frag sum: 564254854
offset sum: 172184934
Empty sequence loaded. It looks like you have messed up the data.
Assembly finished.

I tried putting Contigs.txt and longReads.fastq in the working directory of DBG2OLC, but I still get the "Empty sequence loaded" message. Any idea what could be causing the problem?

Thanks

KeyError: 'Backbone_9625'

Hi, I am getting the following message:

sh ./split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_pb.fasta ./consensus_dir 2 >cns_log.txt
Traceback (most recent call last):
File "./split_reads_by_backbone.py", line 131, in <module>
main()
File "./split_reads_by_backbone.py", line 114, in main
id = backbone_to_id[backbone]
KeyError: 'Backbone_9625'

any help will be appreciated,

thanks
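
A KeyError on a name such as Backbone_9625 would be consistent with DBG2OLC_Consensus_info.txt listing a backbone that is absent from backbone_raw.fasta, which is the mismatch another report on this page describes. A minimal Python sketch to list such names, assuming (not verified against the DBG2OLC source) that both files mark backbone records with '>' header lines; the function names are hypothetical:

```python
def header_names(lines):
    """Return the set of names on '>' header lines (first whitespace-delimited token)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names

def missing_backbones(info_lines, fasta_lines):
    """Names that appear as headers in the consensus info but not in the backbone FASTA."""
    return sorted(header_names(info_lines) - header_names(fasta_lines))

if __name__ == "__main__":
    # Hypothetical usage with the two DBG2OLC output files in the current directory.
    with open("DBG2OLC_Consensus_info.txt") as info, open("backbone_raw.fasta") as fasta:
        print(missing_backbones(info, fasta))
```

A non-empty result would point to an incomplete backbone_raw.fasta (for example from an interrupted run) rather than to a problem in the consensus scripts.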

Consensus not working

Good morning,

I'm trying to go through the tutorial and have made it as far as the consensus step and this is my output

[meeseeks] ~/Sparc/utility$ sh ./split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_pb.fasta ./consensus_dir >cns_log.txt
rm: cannot remove './consensus_dir/backbone-*': No such file or directory
  File "./split_reads_by_backbone.py", line 43
    print tuple[0]
              ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(print tuple[0])?
ls: cannot access './consensus_dir/*.reads.fasta': No such file or directory
cat: './consensus_dir/*.consensus.fasta': No such file or directory

I have changed the blasr code to use double dashes as specified, but I cannot seem to get this off the ground. Is ./consensus_dir an empty folder I need to create myself (which I have done)? There is also no reads.fasta file. I'm just a little confused at this step and would greatly appreciate any help.
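
The SyntaxError above is what Python 3 reports when it meets a Python 2 print statement, which suggests that split_reads_by_backbone.py was written for Python 2 and is being run by a Python 3 interpreter here. The later "No such file" messages would then simply follow from the script never having run. A minimal sketch of the difference; the `record` tuple is a hypothetical stand-in for the script's own data:

```python
import sys

# Python 2 statement form, as on line 43 in the traceback above:
#     print tuple[0]
# Function form, which Python 3 requires (and Python 2.7 accepts for one argument):
record = ("backbone-0", 181906)  # hypothetical stand-in value
print(record[0])

def running_python3():
    """Report whether the current interpreter is Python 3 or newer."""
    return sys.version_info[0] >= 3

print("Python 3 interpreter:", running_python3())
```

Running the script with a Python 2 interpreter, or converting its print statements to the function form, are the two usual ways around this.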

DBG2OLC proovreads

Can DBG2OLC take data processed with proovreads as input? I have been trying, but so far it is not working...

Why is the genome assembled with DBG2OLC so small?

Hi,
My genome size is about 400 Mb, but the assembly produced by DBG2OLC is only 200 Mb. Can you give me some advice on how to increase the assembled genome size?

Best,
Monica

DBG2OLC': free(): invalid pointer

@yechengxi
This command:

DBG2OLC k 17 AdaptiveTh 0.001 KmerCovTh 2 MinOverlap 15 RemoveChimera 1 Contigs contigs.fasta f PacBio_CorrectTrimmed_30x.fasta

failed with exceptions:

*** Error in `/software/bioinformatics/DBG2OLC/DBG2OLC': free(): invalid pointer: 0x0000000022ddec42 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d053)[0x2b508ba0d053]
/software/DBG2OLC/DBG2OLC[0x40db64]
/software/DBG2OLC/DBG2OLC[0x403d52]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b508b9b1b15]
/software/DBG2OLC/DBG2OLC[0x408d91]
======= Memory map: ========
00400000-00455000 r-xp 00000000 00:25 4726198077 /software/DBG2OLC/DBG2OLC
00655000-00656000 r--p 00055000 00:25 4726198077 /software/DBG2OLC/DBG2OLC
00656000-00657000 rw-p 00056000 00:25 4726198077 /software/DBG2OLC/DBG2OLC
01371000-19a4c3000 rw-p 00000000 00:00 0 [heap]
2b508af4c000-2b508af6d000 r-xp 00000000 fd:00 396321 /usr/lib64/ld-2.17.so
2b508af6d000-2b508af6f000 rw-p 00000000 00:00 0
......
....
7ffc1d5e4000-7ffc1d606000 rw-p 00000000 00:00 0 [stack]
7ffc1d67e000-7ffc1d680000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
/home/user1/.lsbatch/1477646807.85583: line 8: 11668 Aborted (core dumped)

Is this caused by not enough RAM? How to get it fixed? Any suggestion is very much appreciated.

478 Floating point exception

Hello,

I get '478 Floating point exception' when running a particular contigs fasta file plus only a partial set of DBG2OLC output files:

ContigKmerIndex_HT_idx.txt
ContigKmerIndex_HT_content
ReadsInfoFrom_Pacbio.fasta
LongReadContigIndex_log.txt
LongReadContigIndex_Histogram.txt
selected_reads.txt
CleanedLongReads.txt
ChimeraIdx.txt

I was wondering if this error is known to happen under certain circumstances specific to a particular dataset (the contigs file).

Thanks
Jorge

setting ulimit -n unlimited is impossible on Linux OS

On Linux, setting ulimit -n unlimited is impossible. The default value is 1024. We can raise this value by editing /etc/security/limits.conf and setting:

*   hard    nofile  65535
*   soft    nofile  65535

However, we could not set it to unlimited, as this would destroy the system and make user login fail
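
For users who cannot edit /etc/security/limits.conf, the soft limit of a single process can still be raised up to the existing hard limit, and child processes inherit it. The following is a minimal sketch using Python's standard `resource` module; the function name and the default of 65535 are illustrative choices, not part of DBG2OLC.

```python
import resource

def raise_nofile_soft_limit(wanted=65535):
    """Raise this process's soft open-file limit as far as the hard limit allows.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # Some systems refuse values below the reported hard limit; keep the old limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

A wrapper script could call `raise_nofile_soft_limit()` before launching the step that opens many files. If the hard limit itself is 1024, this does not help and the limits.conf change is still needed.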

How does DBG2OLC deal with heterozygosity?

Dear Chengxi Ye,
First of all, thank you for this very efficient assembler. I would like to know how DBG2OLC deals with heterozygosity. Here is my specific case: I have a highly heterozygous genome of 350 Mb (haploid size) to assemble. With an OLC-based assembler I obtained an assembly of 400 Mb, and I'm fairly sure that the extra 50 Mb is due to heterozygosity. With DBG2OLC, by tuning the parameters and running many assemblies thanks to its speed, I get a set of assemblies whose size varies from 400 Mb down to the estimated 350 Mb. So where has the heterozygosity gone? Has it been collapsed, or discarded?
Thank you in advance!

error correction of pacbio reads

Does DBG2OLC do error correction of PacBio reads?

step0 option total length

Hi,
I'm a beginner with assembly. In step 0, "SelectLongestReads", the option "sum" confuses me.
What does "total length" mean? Is it the genome size, or the total number of bases across all sequenced reads?
Also, in the manual, "SelectLongestReads" takes only the Illumina_R1 file as input. So is the Illumina_R2 file not necessary in DBG2OLC?
Thank you so much

segmentation fault

Hello,
I have been using DBG2OLC for a while and so far have never encountered any problem.
Now I am encountering a segmentation fault (apparently during the extension step) with two new input files:
Command:
DBG2OLC k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.005 LD1 0 MinLen 200 Contigs /path-to-shortread-assembly/Platanus_contigs.fa RemoveChimera 1 f /path-to-longreads/nanopore_raw.fasta
STDERR file:
/pbs/ocelote/i0n3/mom_priv/jobs/1878034.head1.cm.cluster.SC: line 26: 26147 Segmentation fault DBG2OLC k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.005 LD1 0 MinLen 200 Contigs /path-to-shortread-assembly/Platanus_contigs.fa RemoveChimera 1 f /path-to-longreads/nanopore_raw.fasta
LOG file:
Loading contigs.
129303309 k-mers in round 1.
118727528 k-mers in round 2.
Analyzing reads...
File1: /path-to-longreads/nanopore_raw.fasta
Long reads indexed.
Total Kmers: 7571672670
Matching Unique Kmers: 1651108081
Compression time: 5243 secs.
Scoring method: 3
Match method: 2
Loading long read index
Loading file: ReadsInfoFrom_nanopore_raw.fasta
1491967 reads loaded.
Average size: 8
Loaded.
1491967 reads.
Calculating reads overlaps, round 1
Multiple alignment for error correction.
1000000 sequences aligned.
Avg alignment size: 12
total alignments: 46693535
Avg alignment size: 16
Avg sparse alignment size: 3
total alignments: 104831203
Loading file: CleanedLongReads.txt
1398904 reads loaded.
Average size: 6
Done.
MSA time: 2414 secs.
1000000 reads aligned.
Avg alignment size: 9
total alignments: 1676627
Avg alignment size: 15
Avg sparse alignment size: 4
2323906 alignments calculated.
Round 1 takes 2479 secs.
Calculating reads overlaps, round 2
1000000 reads aligned.
Avg alignment size: 25
total alignments: 138384
Avg alignment size: 43
Avg sparse alignment size: 4
363757 alignments calculated.
Round 2 takes 18 secs.
1371784 contained out of 1398904
1121 tips in the graph.
Graph simplification.
Iteration: 0
2766 branching positions.
4953 linear nodes.
Iteration: 1
1298 branching positions.
5853 linear nodes.
Iteration: 2
233 branching positions.
6689 linear nodes.
Iteration: 3
211 branching positions.
6696 linear nodes.
2263 edges deleted.
70538 chimeric reads deleted.
44 bad nodes removed in aggrassive cleaning.
615 tips in the graph.
Loading contigs.
Collecting information for consensus.
1398904 reads.
Calculating reads overlaps.
1000000 reads aligned.
Avg alignment size: 35
Avg sparse alignment size: 2
total alignments: 1744676
Avg alignment size: 42
Avg sparse alignment size: 3
15484865 alignments calculated.
526 secs.
Loading non-contained sequences.
77857 loaded.
frag sum: 423189042
offset sum: 157479740
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.

I first suspected the long-read file, since DBG2OLC completes successfully when I use an alternative file instead. It also completes successfully when I split my nanopore_raw.fasta into multiple files and feed them to DBG2OLC (obviously the output would not make real sense with just a portion of the raw reads). So I thought it might be a memory problem?
Then I saw that others had segmentation fault problems and that you recommended checking the short-read assembly file. I used Platanus, which has worked well every other time I have used it (I'm not sure why it would cause formatting problems with this new data set). I also tried SparseAssembler with the new data set, but DBG2OLC still ended with the segmentation fault.
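
Since formatting problems in the contig file are one suspected cause, a quick independent check of the FASTA file can rule out the obvious ones. The sketch below is a hypothetical helper, not part of DBG2OLC; the specific checks (empty records, duplicate names, characters other than A/C/G/T/N) are assumptions about what might upset a parser, not a list of conditions DBG2OLC is known to reject.

```python
def check_fasta_lines(lines, allowed="ACGTNacgtn"):
    """Return a list of human-readable problems found in FASTA text.

    `lines` is any iterable of strings, for example an open file handle.
    """
    problems = []
    seen = set()
    allowed_set = set(allowed)
    name = None      # name of the record currently being read
    length = 0       # bases seen so far in the current record
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None and length == 0:
                problems.append("record '%s' has no sequence" % name)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if not name:
                problems.append("line %d: empty header" % lineno)
            elif name in seen:
                problems.append("line %d: duplicate record name '%s'" % (lineno, name))
            seen.add(name)
            length = 0
        elif line:
            if name is None:
                problems.append("line %d: sequence before first header" % lineno)
                name = ""
            bad = set(line) - allowed_set
            if bad:
                problems.append("line %d: unexpected characters %s" % (lineno, sorted(bad)))
            length += len(line)
    if name is not None and length == 0:
        problems.append("record '%s' has no sequence" % name)
    return problems
```

Usage would be along the lines of `with open("Platanus_contigs.fa") as fh: print(check_fasta_lines(fh))`. An empty list only means these particular checks passed.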

Do you have any other ideas that I could try to solve the problem?
Thank you so much for your advice.
Coline Jaworski

Small contigs and slurm_script: Floating point exception

Hi,

I am assembling a 150 Mb genome using 2x150 Illumina reads (~70x coverage) and PacBio reads at 25x coverage.
When using SparseAssembler I use the following command:

./SparseAssembler g 15 k 99 LD 0 GS 310000000 NodeCovTh 1 EdgeCovTh 0 i1 ./lumina/lili/LILI_R1_001.fastq i2 ./illumina/lili/LILI_R2_001.fastq

The SparseAssembler step finishes but results in very small contigs. I tried different parameters, and a higher k gives larger but still small contigs:

stats for Contigs.txt
sum = 307302645, n = 1810137, ave = 169.77, largest = 55932
N50 = 170, n = 434542
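
For readers comparing such summaries, N50 is conventionally the length L such that contigs of length at least L together contain at least half of all assembled bases. The sketch below computes it from a list of contig lengths. Reading the `n` printed next to N50 as the number of contigs needed to reach that halfway point is an assumption about the stats tool used here.

```python
def n50(lengths):
    """Return (N50, number of contigs needed to reach it) for contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running total to half the assembly is the N50 contig.
        if running >= half:
            return length, count
    return 0, 0  # empty input
```

An N50 of 170 with an average of about 170 means that most of the assembly sits in contigs barely longer than a read.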

When proceeding and using DBG2OLC with the following command:

./DBG2OLC k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 25 RemoveChimera 1 Contigs ./data/sparseAssembler/Contigs.txt f ./pacbio/LILI_cell1_m54278_180927_190653.subreads.fasta

I get the following std_out:

Example command:
For third-gen sequencing: DBG2OLC LD1 0 Contigs contig.fa k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.005 f reads_file1.fq/fa f reads_file2.fq/fa
For sec-gen sequencing: DBG2OLC LD1 0 Contigs contig.fa k 31 KmerCovTh 0 MinOverlap 50 PathCovTh 1 f reads_file1.fq/fa f reads_file2.fq/fa
Parameters:
MinLen: min read length for a read to be used.
Contigs: contig file to be used.
k: k-mer size.
LD: load compressed reads information. You can set to 1 if you have run the algorithm for one round and just want to fine tune the following parameters.
PARAMETERS THAT ARE CRITICAL FOR THE PERFORMANCE:
If you have high coverage, set large values to these parameters.
KmerCovTh: k-mer matching threshold for each solid contig. (suggest 2-10)
MinOverlap: min matching k-mers for each two reads. (suggest 10-150)
AdaptiveTh: [Specific for third-gen sequencing] adaptive k-mer threshold for each solid contig. (suggest 0.001-0.02)
PathCovTh: [Specific for Illumina sequencing] occurence threshold for a compressed read. (suggest 1-3)
Author: Chengxi Ye [email protected].
last update: Jun 11, 2015.
Loading contigs.
80213454 k-mers in round 1.
41789206 k-mers in round 2.
Analyzing reads...
File1: /home/ls752/genomes/data/raw_data/pacbio/LILI_cell1_m54278_180927_190653.subreads.fasta
Long reads indexed.
Total Kmers: 0
Matching Unique Kmers: 0
Compression time: 0 secs.
Scoring method: 3
Match method: 2
Loading long read index
Loading file: ReadsInfoFrom_LILI_cell1_m54278_180927_190653.subreads.fasta
0 reads loaded.

And the following std_err:

/var/spool/slurmd/job32968/slurm_script: line 15: 44064 Floating point exception(core dumped) ./DBG2OLC k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 25 RemoveChimera 1 Contigs ./sparseAssembler/Contigs.txt f ./pacbio/LILI_cell1_m54278_180927_190653.subreads.fasta

Am I overlooking something? Do you suggest any changes to my parameters to increase the contig length?

Many thanks for creating and maintaining a hybrid assembler,

Bests,
L
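
The log above reports `Total Kmers: 0` and `0 reads loaded` just before the crash, which suggests (but does not prove) that no reads were parsed from the long-read file. One way to check this outside DBG2OLC is to count the records in the file. The sketch below is a hypothetical helper that assumes uncompressed FASTA or four-line FASTQ input.

```python
def count_reads(lines):
    """Count records and bases in FASTA or four-line FASTQ text.

    `lines` is any iterable of strings, for example an open file handle.
    Returns (number_of_reads, number_of_bases).
    """
    reads = 0
    bases = 0
    it = iter(lines)
    for raw in it:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # FASTA header; the sequence lines that follow are counted below.
            reads += 1
        elif line.startswith("@"):
            # FASTQ record: header, sequence, '+' separator, qualities.
            reads += 1
            bases += len(next(it, "").strip())
            next(it, None)  # skip the '+' line
            next(it, None)  # skip the quality line
        else:
            bases += len(line)
    return reads, bases
```

If this returns `(0, 0)` for the path given after `f`, the file is empty, unreadable, or in a format the sketch (and possibly the assembler) does not parse.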

split_and_run_sparc has typos

The blasr command in split_and_run_sparc passes flags with a single hyphen, but blasr currently only accepts options with double hyphens (e.g. '-nproc' should be '--nproc').
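
As an illustration of the fix, the sketch below rewrites single-hyphen multi-letter options in a command string to the double-hyphen form. Only `-nproc` is taken from this report; the function name and the rule "two or more characters means a long option" are assumptions, and each rewritten option should still be checked against the help output of the installed blasr.

```python
import re

# A single hyphen that starts a token and is followed by two or more
# option characters, e.g. "-nproc". Single-letter options such as "-m",
# negative numbers, and hyphens inside file names are left alone.
_SINGLE_HYPHEN_LONG = re.compile(r"(?<![\w-])-([A-Za-z][A-Za-z0-9_]+)\b")

def to_double_hyphen(command):
    """Return `command` with single-hyphen long options given a second hyphen."""
    return _SINGLE_HYPHEN_LONG.sub(r"--\1", command)
```

The function is idempotent, so options that already use two hyphens are not changed on a second pass.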

full_layout.fasta and graph outputs

Hi,

thanks for a nice and very quick program. It seems to have improved my assembly substantially already.

Two questions:

1- what exactly is this file (full_layout.fasta)?

Background: backbone_raw.fasta is only about 60% of the expected genome size (1.3 of ~2.3 Gb), yet full_layout is 1.8 Gb. I expect many extra repeats to be present in full_layout, as this is a repeat-rich plant genome with > 70% repeats.

Can I use the full_layout instead of the backbone, or will downstream programs not work ?

2 - What is the best way to view the .dot files? Also, which .dot file might be useful?

I thought the Bandage package might be ideal, but the .dot output format is not supported; Bandage supports LastGraph, GFA, Trinity FASTA, and so on. Gephi is struggling with 130000+ nodes.

https://rrwick.github.io/Bandage/
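
When a graph is too large for a viewer, a text summary can be a first step. The sketch below counts nodes, edges, and the node-degree distribution in a .dot file. It assumes plain Graphviz edge statements of the form `A -> B` or `A -- B`, one per line; whether DBG2OLC's .dot files follow exactly that layout has not been verified here.

```python
import re
from collections import Counter

# One edge statement per line: optional quotes around each node name,
# then "->" (directed) or "--" (undirected).
_EDGE = re.compile(r'"?([^"\s;\[]+)"?\s*(->|--)\s*"?([^"\s;\[]+)"?')

def summarize_dot(lines):
    """Return (node_count, edge_count, {degree: number_of_nodes})."""
    degree = Counter()
    edges = 0
    for line in lines:
        match = _EDGE.search(line)
        if match:
            source, _, target = match.groups()
            degree[source] += 1
            degree[target] += 1
            edges += 1
    # Nodes that never appear in an edge statement are not counted.
    return len(degree), edges, dict(Counter(degree.values()))
```

Nodes with degree 3 or more are the branching points, so their count gives a rough idea of how tangled the graph is before it is loaded into a viewer.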

Thanks!
Colin

Create_Local_Contig_Kmer_Index(GraphConstruction.h) SegFault

Hey yechengxi,

Thank you for this fast and effective assembly program! I have been assembling a genome with DBG2OLC and encountered an issue while generating contigs. DBG2OLC gets through nearly all local contig kmer indices (writes 1.4 Gb of 1.5 Gb) but encounters two such indices that cause SegFaults. I have tracked the issue down to line 4497 as an index-out-of-bounds error. It appears that the values for aligned_contig and LongReadContigIndex1.CTG2LR[aligned_contig].size() / 2 are both 0 and trigger this SegFault when attempting to use those values as keys in the LongReadContigIndex1.CTG2LR map.

The debugging output for the first problematic index compressed read-read alignment is:

result=5
<blank line>
<blank line>
19236 -34074 32495 86089 59772 -19042 -33188 35243 97100 -59673 -89937 65689 -58388 
119758 97100 -59673 -89937 60211 -58388 -34298 93059 -33188

For the second problematic index, the compressed read-read alignment is:

result=5
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 69682 69682 69682 67666 69682 67666 69682 67666 69682 67666 69682 67666 67666 69682 67666 69682 
-49100 41853 -49100 67666 -49100 -49100 -49100 -49100 41853 -49100 41853 -49100 -49100 41853 69682 -49100 -49100 -49100 -49100 41853 -49100 -49100 -49100 -491
-6008 -59287 96539 41853 -71445 67667 96538 -6008 -69683 67666 -49100 69682 -71444 
67666 -49100 41853 69682 -49100 -49100 -49100 73346 -49100 73344 -49100 -71515 -49502 10153 -49100 -49502 -49502 73345 -49502 -49502

Any help you could provide for fixing/handling this would be appreciated. Thank you!

Best wishes,
bredeson

SelectLongestReads: 'std::length_error'

I tried to run
DBG2OLC/compiled/SelectLongestReads sum 18000000000 longest 0 o Illumina_PElib_30x.fasta f PElib_1.fastq.gz f PElib_2.fastq.gz

The file Illumina_PElib_30x.fasta was created (with a small file size), but the command exited with

terminate called after throwing an instance of 'std::length_error'
what(): basic_string::resize
/home/hrachd/.lsbatch/1476836888.83252: line 8: 28640 Aborted (core dumped)

The FASTQC reports for both PElib_[1/2].fastq.gz files confirmed the read length is 101 bases for all reads.

Any idea how to fix the problem? Thank you very much.
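
One thing worth ruling out, as an assumption rather than a confirmed cause, is that the tool is reading the gzip-compressed bytes as if they were plain FASTQ text. The sketch below detects gzip input by its two-byte signature and writes a decompressed copy; the function names are hypothetical.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_gzipped(first_bytes):
    """Return True if the given leading bytes carry the gzip signature."""
    return first_bytes[:2] == GZIP_MAGIC

def decompress_if_needed(src, dst):
    """Write a plain copy of `src` to `dst` if `src` is gzip-compressed.

    Returns the path that should be handed to the downstream tool.
    """
    with open(src, "rb") as probe:
        compressed = looks_gzipped(probe.read(2))
    if not compressed:
        return src
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large files are fine
    return dst
```

Running the same SelectLongestReads command on the decompressed files would show whether compression is related to the `std::length_error`.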

no more ulimit necessary

The file DBG2OLC/utility/split_reads_by_backbone.py can be changed to this (line 118 and further):

                # 'with' closes the handle after each write, so only one file is open at a time
                with open(options.output_dir + '/' + str(id) + '.reads.fasta', 'a') as new_fp:
                    new_fp.write('>' + tuple[0] + '\n' + tuple[1] + '\n')

This opens and closes the file each time it is needed. It might create additional I/O overhead, but ulimit can be left untouched. Very handy when you're not root.
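
The same idea can be written as a standalone function, together with a variant that reduces the extra I/O by grouping records per backbone before writing. This is a sketch under stated assumptions: the function names and the `(backbone_id, name, sequence)` record layout are hypothetical and not taken from split_reads_by_backbone.py, and the grouped variant trades memory for fewer open/close calls.

```python
import os
from collections import defaultdict

def append_read(output_dir, backbone_id, name, sequence):
    """Append one FASTA record to <output_dir>/<backbone_id>.reads.fasta."""
    path = os.path.join(output_dir, str(backbone_id) + ".reads.fasta")
    with open(path, "a") as handle:
        handle.write(">" + name + "\n" + sequence + "\n")
    return path

def append_reads_grouped(output_dir, records):
    """Write many records with one open/close per backbone.

    `records` is an iterable of (backbone_id, name, sequence) tuples.
    All records are held in memory until they are written.
    """
    grouped = defaultdict(list)
    for backbone_id, name, sequence in records:
        grouped[backbone_id].append(">" + name + "\n" + sequence + "\n")
    for backbone_id, chunks in grouped.items():
        path = os.path.join(output_dir, str(backbone_id) + ".reads.fasta")
        with open(path, "a") as handle:
            handle.writelines(chunks)
    return len(grouped)
```

Either way, at most one output file is open at any moment, so the default limit of 1024 open files is never approached.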

Any information on hardware requirements and expected run times for hybrid assembly of large genomes?

Hi Chengxi,
Thanks for your excellent work on hybrid assembly of large genomes. We are now working on assembling a large plant genome (genome size over 10 Gb) with 30x PacBio and 100x Illumina PE150 (insert size 500 bp). Is it possible to use DBG2OLC for the hybrid assembly? Also, are there any hardware requirements (e.g. RAM, CPUs, disk space) and expected run times for hybrid assembly of large genomes (e.g. genome sizes of 2 Gb, 3 Gb, 6 Gb, or 10 Gb and above)?

Thanks a lot for any reply!

sample data download

Dear Chengxi,

I cannot download the sample data for S. cer (Illumina and PacBio). Is there an issue with the links? Could you redirect me to another place where I could download them?

Thanks,
Alex

k-mer size

A k-mer size of 17 takes a very long time on a large genome, and I'd like to increase it to speed things up. Is it safe to go to 31? What are the expected disadvantages?
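
One general trade-off can be put in numbers. A larger k makes k-mers more specific (there are 4^k possible k-mers), but fewer k-mers in an error-prone read are error-free. The sketch below assumes independent per-base errors and uses 15% as an illustrative error rate for raw long reads; both are simplifying assumptions, and the calculation says nothing about how DBG2OLC itself behaves at k = 31.

```python
def error_free_kmer_fraction(k, per_base_error):
    """Probability that a k-mer contains no sequencing error,
    assuming each base is wrong independently with the given rate."""
    return (1.0 - per_base_error) ** k

def kmer_space(k):
    """Number of distinct DNA k-mers of length k."""
    return 4 ** k

if __name__ == "__main__":
    for k in (17, 31):
        print(k, kmer_space(k), error_free_kmer_fraction(k, 0.15))
```

Under these assumptions about 6% of 17-mers in a raw read are error-free, against under 1% of 31-mers, so a larger k leaves far fewer k-mers able to match the contigs exactly.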
