chjiao / pehaplo Goto Github PK

View Code? Open in Web Editor NEW

11.0 11.0 2.0 10.75 MB

Python 100.00%

pehaplo's People

Contributors

Stargazers

Watchers

Forkers

broadinstitute skshashank74

pehaplo's Issues

warn that pehaplo.py deletes all existing files in the directory where it is run

@chjiao pehaplo.py deletes all existing files in the directory where it is run . Maybe warn about that?

subprocess.CalledProcessError: Command 'sga index Contigs.fa' returned non-zero exit status 1

Hi, I'm experiencing the following error when trying to assemble the paired-end reads:

Size of File:0
Number of Strings: K 0
Size of strings:0
Creating tree: 
User Time for Constructing tree with no sorting required: 0 ms
Segmentation fault (core dumped)
The nodes of the whole graph is: 0; the edges of the whole graph is: 0.
index: input file is empty
Traceback (most recent call last):
  File "../pehaplo.py", line 157, in <module>
    main()
  File "../pehaplo.py", line 102, in main
    subprocess.check_call('sga index Contigs.fa', shell=True)
  File "/home/thermite/anaconda3/envs/tarvir/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'sga index Contigs.fa' returned non-zero exit status 1

I'm running Python 2.7 and followed the steps as stated in the Readme as well as my paired end reads are in the 'reads1/1' format.
Thanks!

Fail in contig correction

The command is: python pehaplo.py -f1 PV194D.virus_recruit_1.fa -f2 PV194D.virus_recruit_2.fa -l 100 -l1 120 -t 30 -r 150 -F 375 -std 100 -n 3 -correct yes

The error info is: Traceback (most recent call last):
File "pehaplo.py", line 157, in
main()
File "pehaplo.py", line 115, in main
subprocess.check_call('samtools view -F 4 -S contigs_alignment.sam > contigs_mapped.sam', shell=True)
File "/home/xiuwan/miniconda2/envs/pehaplo/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'samtools view -F 4 -S contigs_alignment.sam > contigs_mapped.sam' returned non-zero exit status 1

The reason I think is:
grep -n "contig_529_0" Contigs.fa
115:>contig_529_0 contig_529_0 NumDuplicates=1
117:>contig_529_0 contig_529_0 NumDuplicates=1

Why there would be two same id in Contigs.fa?

I have completed the other 10 samples, everything is successful, just this sample meet this problem. It's so strange. Right?

Thanks in advance.

python 3 compatibility

I tried to make a bioconda recipe ( https://bioconda.github.io/ ) for installing PEHaplo. To install it in Python 3 environments, I tried to use the 2to3 program which converts python 2 code to python 3. The problem is that the division operator (/) works differently in Python 3: in Python 2 it's integer division if operands are integer, while in Python 3 it's floating-point division while // is integer division. If possible, could you change your code to be Python 3 compatible, by adding
from future import division
at the top of each script, and then changing / to // wherever you want integer division rather than floating-point division? Then PEHaplo could be installed in Python 3 conda environments alongside other tools.

Failure running PEHaplo

Hi,
I have been trying to run PEHaplo during the last two days with no success. Several failures have jumped, but step by step I could solve them. Unfortunately, the last one it is out of my control:
[ERROR REPORTED]
File "XERMADE/PEHaplo-master/tools/get_pairs_title_from_nondup.py", line 28, in
assert (title_dict[base].endswith('1') and title.endswith('2')) or (title_dict[base].endswith('2') and title.endswith('1'))
AssertionError
But I have changed my reads (they are in fasta format), the current format is: 'read1/1' 'read1/2', which I think is the right one. Am I wrong?
Besides, I have tried to execute aswell the gold benchmark (https://github.com/cbg-ethz/5-virus-mix); but the code fails:
[ERROR REPORTED]
XERMADE/PEHaplo-master/tools/Seperate_strand_single_pair.py", line 466, in
des_list, read_map, read_db = get_seq_from_fa(fa_file, des_file) # read dictionary
File "/home/bfreire/XERMADE/PEHaplo-master/tools/Seperate_strand_single_pair.py", line 14, in get_seq_from_fa
with open(des_file,'r') as f:
IOError: [Errno 2] No such file or directory: 'samp.des'
I think that samp.des is a file created by the code. If not I did not realize about how it should be.
In this case I have trimmed the reads with the same paremeters as SAVAGE did, but I have not applied PEAR. Is everything correct?
Thanks in advice :)

some needed imports are not mentioned in the readme

numpy, scipy, pygraphviz . others?

Error with get_pairs_title_from_nondup.py

Hello, I´m having an error when PEHaplo runs get_pairs_title_from_nondup.py. It exits with non zero status.

I have installed Networkx version 1.11
My reads names are CP1p_1.fa and CP1p_2.fa

Can you guide me through to the solution?

Thanks!

get_pairs_title_from_nondup.py error

Hi,

I installed PEHaplo with conda - this requires and installs python 3.6 (is that correct?).
I tried to install PEHaplo using conda into a py26 environment - but that causes inconsistencies.
Maybe the py version is not the problem?

kept_num.fa is produced and produce some 200k reads with fasta headers - e.g.:

>M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669 NumDuplicates=1
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTCTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669 NumDuplicates=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>M03777:103:000000000-B655P:1:1101:18625:1726 M03777:103:000000000-B655P:1:1101:18625:1726 NumDuplicates=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAATTTAATAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
....

the kept_pairs.txt however contain only 2 entries:
M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669

Traceback (most recent call last): File "/home/ngs/miniconda3/envs/pehaplo/share/pehaplo-v0.1_dev05-0/tools/get_pairs_title_from_nondup.py", line 26, in <module> assert title_dict[base].endswith('1') and title.endswith('2') AssertionError Traceback (most recent call last): File "/home/ngs/miniconda3/envs/pehaplo/bin/pehaplo.py", line 141, in <module> main() File "/home/ngs/miniconda3/envs/pehaplo/bin/pehaplo.py", line 87, in main subprocess.check_call('python "%s"/tools/get_pairs_title_from_nondup.py kept_num.fa kept_pairs.txt' % base_path, shell=True) File "/home/ngs/miniconda3/envs/pehaplo/lib/python3.6/subprocess.py", line 311, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'python "/home/ngs/miniconda3/envs/pehaplo/share/pehaplo-v0.1_dev05-0"/tools/get_pairs_title_from_nondup.py kept_num.fa kept_pairs.txt' returned non-zero exit status 1.

read quality scores

@chjiao I notice that you don't use read quality scores (fastq files), although some of the tools you invoke (Karect, sga) can use them. Is that just to make things simpler, or you've found in your experiments that using quality scores makes results worse?

subprocess.CalledProcessError exit status -11

An issue was arised as followed:
(pehaplo) root 17:36:35 /public/home/hanzhenzhi/ngs-project/WFM/D2587/quasispecies
$ pehaplo.py -f1 tmp/clean_1.fq -f2 tmp/clean_2.fq -l 50 -r 150 -std 100 -t 36 -m 200GB -correct no
Error correction begins -------------------------------------------------
Error Correction Started
Estimated Coverage = 14.745907
Global Num Reads = 145153888
Global Num Bases = 21695686234
Global Num Kmers = 20534455130
Global Max Read Len = 150
Global Avg Read Len = 149.466794
Min Overlap = 35
Min Overlap Percentage = 0.200000
Max Overlap Error Rate = 0.250000
Truncate Factor = 0.000000
Min Straight Edge Val = 73.729537
Min Straight Edge Per = 1.000000
Min Read Weight = 0.000000
Max Mul Len Matches = 13422818
Max Kmer Slots = 100000
Use Qual = 1
Match Type = Hamming Distance
Kmer Size = 13
Max Kmer Errors = 2
Kmer With Errors Size = 13
Read Index Bits = 40
Num Threads = 36
Memory Budget = 10995116277760[10240g]
File Block Size = 10485760[10m]
Cache Block Size = 134217728[128m]
Max Reads Per Step = 1000
Max Part Res = 110
Main CorrectErrors StartFile=0
Total Reads Size = 43681680245
Building Index [0 - 43536526356]
Traceback (most recent call last):
File "/public/Software/bin/pehaplo.py", line 160, in
main()
File "/public/Software/bin/pehaplo.py", line 52, in main
subprocess.check_call("karect -correct -inputfile=%s -inputfile=%s -threads=%s -celltype=haploid -matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25" % (args.input_f1, args.input_f2, args.threads), shell=True)
File "/public/Software/Miniconda3/envs/pehaplo/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'karect -correct -inputfile=tmp/clean_1.fq -inputfile=tmp/clean_2.fq -threads=36 -celltype=haploid -matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25' returned non-zero exit status -11

I performed several methods to solve this issue, such as reducing the overlap threshold as showed for other problems. But this issue could not be solve. I also change to "root" role, for the highest permission of computer. And the command "-correct no" or "-correct yes" return the same issue. Could you check this error and provide a solution?

Zhenzhi

bug?

@chjiao at https://github.com/chjiao/PEHaplo/blob/master/tools/Seperate_strand_single_pair.py#L98
did you mean to assign overlap_len not overlap?

Read quality trimming and adapter removal

The article mentions:

There are five major components in the pipeline of PEHaplo as shown
in Figure 3. In the first pre-processing stage, reads with low-quality
or ambiguous base calls are filtered or trimmed.

however, it's not implemented in the code, as far as I can tell.
There is also no mention of any adapter removal in the methods section.
Should we do the usual adapter removal and/or quality trimming procedure before using PEHaplo, or is the algorithm somehow immune to this noise?

My primary concern is that after trimming the read lengths will be widely distributed and many of them will be shorter than the calculated min overlap parameter.

tag a release

@chjiao Can you tag a github release? This is needed to create a bioconda recipe for installing pehaplo ( https://bioconda.github.io/ ). Thanks!

reading .des files

At https://github.com/chjiao/PEHaplo/blob/master/tools/Seperate_strand_single_pair.py#L14
the .des file is read as a text file. But at the end of this file, after the list of read names, there is binary data, which is currently read in as an (incorrect) read name. It'd be better to use the official API for reading EncSeq files, from https://github.com/genometools/genometools/blob/master/gtpython/gt/core/encseq.py .

division by zero

@chjiao sometimes at
https://github.com/chjiao/PEHaplo/blob/master/tools/identify_misjoin_contigs.py#L427
count_pair is zero, and the tool crashes with division by zero.

AssertionError in get_pairs_title_from_nondup.py

Hi Jiao,

I've been trying to get PEHaplo running for a while now, but I keep moving from error to error so I was hoping you could help me out. First it complained when running "join_pair_end_fasta.py". Since the input given to the script in th main pipeline is fastq and there is also a "join_pair_end_fastq.py" script available in /tools, I replaced the first by the latter script in the pipeline. This seemed to fix the issue. But then it crashed when running "get_pairs_title_from_nondup.py" because of an AssertionError:

assert (title_dict[base].endswith('1') and title.endswith('2')) or (title_dict[base].endswith('2') and title.endswith('1'))

So I renamed my input reads to ensure that forward readnames end with /1 and reverse readnames end with /2 (my input now looks similar to the example data) but PEHaplo still crashes. If I remove the assert statement it will continue to the next step and crash when running "Seperate_strand_single_pair.py" because of a KeyError after the graph construction finishes.

Any help would be very much appreciated!

Best,
Jasmijn

Test run fails

Hi Jiao,

I am having problem getting PEHaplo to run. I think all the requirements are installed correctly, but I am not sure how to test whether that is true. I tried following the README and created an assembly folder in the root folder of the repo. Then changed directory to the assembly folder and ran python ../pehaplo.py -f1 ../raw_test_data/virus_1.fa -f2 ../raw_test_data/virus_2.fa -l 180 -l1 210 -r 250 -F 600 -std 150 -n 3 -correct yes that crashed after a while with the following error:

# gt readjoiner overlap (version 1.2)
# number of reads in filtered readset = 27066
# number of irreducible suffix-prefix matches = 27088
# average irreducible SPM/read = 1.00
# number of transitive suffix-prefix matches = 1351014
Number of total reads:  27066
Number of paired reads: 11472
Number of unpaired reads:       15594
Graph construction finished!
Traceback (most recent call last):
  File "/home/microadmin/Programs/PEHaplo/tools/Seperate_strand_single_pair.py", line 480, in <module>
    plus_reads_dict, minus_reads_dict = BFS_tranverse_graph_pair(G, read_db, des_list, starting_nodes[0], pair1_dict, pair2_dict)
  File "/home/microadmin/Programs/PEHaplo/tools/Seperate_strand_single_pair.py", line 280, in BFS_tranverse_graph_pair
    candidate_nodes.extend(G.predecessors(vertex))
AttributeError: 'dictionary-keyiterator' object has no attribute 'extend'
Traceback (most recent call last):
  File "../pehaplo.py", line 157, in <module>
    main()
  File "../pehaplo.py", line 85, in main
    subprocess.check_call('python "%s"/tools/Seperate_strand_single_pair.py samp.des samp_edges.txt kept_num.fa kept_pairs.txt' % base_path, shell=True)
  File "/opt/miniconda3/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python "/home/microadmin/Programs/PEHaplo"/tools/Seperate_strand_single_pair.py samp.des samp_edges.txt kept_num.fa kept_pairs.txt' returned non-zero exit status 1

Is this a bug or is there something wrong with the installation? Could you maybe create docker image of a working setup that could be used as an alternative to the complicated manual installation of the requirements?

Thanks for your help in advance!

Kind regards,
Balazs