
pehaplo's People

Contributors

chjiao, kennthshang, yannisun


pehaplo's Issues

subprocess.CalledProcessError: Command 'sga index Contigs.fa' returned non-zero exit status 1

Hi, I'm experiencing the following error when trying to assemble the paired-end reads:

Size of File:0
Number of Strings: K 0
Size of strings:0
Creating tree: 
User Time for Constructing tree with no sorting required: 0 ms
Segmentation fault (core dumped)
The nodes of the whole graph is: 0; the edges of the whole graph is: 0.
index: input file is empty
Traceback (most recent call last):
  File "../pehaplo.py", line 157, in <module>
    main()
  File "../pehaplo.py", line 102, in main
    subprocess.check_call('sga index Contigs.fa', shell=True)
  File "/home/thermite/anaconda3/envs/tarvir/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'sga index Contigs.fa' returned non-zero exit status 1

I'm running Python 2.7 and followed the steps in the README; my paired-end reads are in the 'reads1/1' format.
Thanks!
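
For anyone hitting this: the log above ("Size of File:0", "index: input file is empty") indicates that the upstream assembly step produced an empty Contigs.fa, so sga index has nothing to work with. A minimal guard one could put around the call in pehaplo.py (a sketch, not the actual PEHaplo code):

    # Sketch of a guard around the sga index call; not the actual PEHaplo code.
    import os
    import subprocess

    if not os.path.exists('Contigs.fa') or os.path.getsize('Contigs.fa') == 0:
        raise SystemExit('Contigs.fa is missing or empty -- the upstream '
                         'assembly step produced no contigs; check its logs.')
    subprocess.check_call('sga index Contigs.fa', shell=True)

This does not fix the root cause (why no contigs were assembled), but it fails with a clearer message.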

Fail in contig correction

The command is: python pehaplo.py -f1 PV194D.virus_recruit_1.fa -f2 PV194D.virus_recruit_2.fa -l 100 -l1 120 -t 30 -r 150 -F 375 -std 100 -n 3 -correct yes

The error info is:

Traceback (most recent call last):
  File "pehaplo.py", line 157, in <module>
    main()
  File "pehaplo.py", line 115, in main
    subprocess.check_call('samtools view -F 4 -S contigs_alignment.sam > contigs_mapped.sam', shell=True)
  File "/home/xiuwan/miniconda2/envs/pehaplo/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'samtools view -F 4 -S contigs_alignment.sam > contigs_mapped.sam' returned non-zero exit status 1

I think the reason is the following. Running grep -n "contig_529_0" Contigs.fa shows a duplicated header:

115:>contig_529_0 contig_529_0 NumDuplicates=1
117:>contig_529_0 contig_529_0 NumDuplicates=1

Why would there be two identical IDs in Contigs.fa?

I have completed the other 10 samples successfully; only this sample hits this problem. It's strange, right?
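
As a possible workaround while this is looked into, I am considering making the duplicated IDs unique before the mapping step; a minimal sketch (the output filename is hypothetical):

    # Minimal sketch: append a numeric suffix to repeated FASTA IDs so that
    # downstream tools (bowtie/samtools) see unique sequence names.
    from collections import defaultdict

    seen = defaultdict(int)
    with open('Contigs.fa') as fin, open('Contigs.dedup.fa', 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                fields = line[1:].split()
                seen[fields[0]] += 1
                if seen[fields[0]] > 1:
                    fields[0] = '%s_dup%d' % (fields[0], seen[fields[0]] - 1)
                    line = '>' + ' '.join(fields) + '\n'
            fout.write(line)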

Thanks in advance.

python 3 compatibility

I tried to make a Bioconda recipe ( https://bioconda.github.io/ ) for installing PEHaplo. To install it in Python 3 environments, I tried the 2to3 program, which converts Python 2 code to Python 3. The problem is that the division operator (/) works differently in Python 3: in Python 2 it is integer division when both operands are integers, while in Python 3 it is floating-point division and // is integer division. If possible, could you make your code Python 3 compatible by adding
from __future__ import division
at the top of each script, and then changing / to // wherever you want integer division rather than floating-point division? Then PEHaplo could be installed in Python 3 conda environments alongside other tools.
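
To illustrate the difference (a minimal sketch; with the __future__ import, / behaves the same under both interpreters):

    from __future__ import division  # makes / mean true division under Python 2

    print(7 / 2)   # 3.5 under Python 3, and under Python 2 with the import
    print(7 // 2)  # 3 under both: // is always floor division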

Failure running PEHaplo

Hi,
I have been trying to run PEHaplo for the last two days with no success. Several failures have come up, but step by step I could solve them. Unfortunately, the latest one is beyond me:
[ERROR REPORTED]
File "XERMADE/PEHaplo-master/tools/get_pairs_title_from_nondup.py", line 28, in
assert (title_dict[base].endswith('1') and title.endswith('2')) or (title_dict[base].endswith('2') and title.endswith('1'))
AssertionError
I have already renamed my reads (they are in FASTA format); the current format is 'read1/1' 'read1/2', which I think is the right one. Am I wrong?
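
To double-check the naming, here is a minimal sketch of a checker that prints any FASTA headers not ending in /1 or /2 (the script is hypothetical, not part of PEHaplo):

    # check_pair_suffixes.py -- hypothetical helper, not part of PEHaplo.
    # Prints FASTA titles that do not follow the /1 //2 pairing convention.
    import sys

    def bad_titles(fasta_path):
        with open(fasta_path) as f:
            for line in f:
                if line.startswith('>'):
                    title = line[1:].split()[0]
                    if not (title.endswith('/1') or title.endswith('/2')):
                        yield title

    for path in sys.argv[1:]:
        for title in bad_titles(path):
            print('%s: %s' % (path, title))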
Besides, I have also tried to run the gold benchmark (https://github.com/cbg-ethz/5-virus-mix), but the code fails:
[ERROR REPORTED]
  File "XERMADE/PEHaplo-master/tools/Seperate_strand_single_pair.py", line 466, in <module>
    des_list, read_map, read_db = get_seq_from_fa(fa_file, des_file) # read dictionary
  File "/home/bfreire/XERMADE/PEHaplo-master/tools/Seperate_strand_single_pair.py", line 14, in get_seq_from_fa
    with open(des_file,'r') as f:
IOError: [Errno 2] No such file or directory: 'samp.des'
I think samp.des is a file created by the pipeline itself; if not, I have not figured out how it should be produced.
In this case I have trimmed the reads with the same parameters as SAVAGE, but I have not applied PEAR. Is that correct?
Thanks in advance :)

Error with get_pairs_title_from_nondup.py

Hello, I'm having an error when PEHaplo runs get_pairs_title_from_nondup.py: it exits with a non-zero status.

I have installed Networkx version 1.11
My read files are named CP1p_1.fa and CP1p_2.fa

Can you guide me to a solution?

Thanks!

get_pairs_title_from_nondup.py error

Hi,

I installed PEHaplo with conda - this requires and installs python 3.6 (is that correct?).
I tried to install PEHaplo using conda into a py26 environment - but that causes inconsistencies.
Maybe the py version is not the problem?

kept_num.fa is produced and contains some 200k reads with FASTA headers, e.g.:

>M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669 NumDuplicates=1
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTCTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669 NumDuplicates=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>M03777:103:000000000-B655P:1:1101:18625:1726 M03777:103:000000000-B655P:1:1101:18625:1726 NumDuplicates=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAATTTAATAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
....

The kept_pairs.txt file, however, contains only 2 entries:
M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669

Traceback (most recent call last):
  File "/home/ngs/miniconda3/envs/pehaplo/share/pehaplo-v0.1_dev05-0/tools/get_pairs_title_from_nondup.py", line 26, in <module>
    assert title_dict[base].endswith('1') and title.endswith('2')
AssertionError
Traceback (most recent call last):
  File "/home/ngs/miniconda3/envs/pehaplo/bin/pehaplo.py", line 141, in <module>
    main()
  File "/home/ngs/miniconda3/envs/pehaplo/bin/pehaplo.py", line 87, in main
    subprocess.check_call('python "%s"/tools/get_pairs_title_from_nondup.py kept_num.fa kept_pairs.txt' % base_path, shell=True)
  File "/home/ngs/miniconda3/envs/pehaplo/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python "/home/ngs/miniconda3/envs/pehaplo/share/pehaplo-v0.1_dev05-0"/tools/get_pairs_title_from_nondup.py kept_num.fa kept_pairs.txt' returned non-zero exit status 1.
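
For context, the assertion fails because both mates carry the exact same title with no /1 or /2 suffix: they collapse onto one base key, and neither endswith check can hold. A reconstruction of the pairing logic (a sketch, not the actual PEHaplo source):

    # Reconstructed sketch of the pairing check; not the actual PEHaplo source.
    titles = [
        'M03777:103:000000000-B655P:1:1101:15081:1669',  # mate 1, no /1 suffix
        'M03777:103:000000000-B655P:1:1101:15081:1669',  # mate 2, identical title
    ]
    title_dict = {}
    for title in titles:
        base = title.rsplit('/', 1)[0]  # strips a trailing /1 or /2 if present
        if base in title_dict:
            # With identical, suffix-less titles this assertion fails:
            assert title_dict[base].endswith('1') and title.endswith('2')
        else:
            title_dict[base] = title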

read quality scores

@chjiao I notice that you don't use read quality scores (FASTQ files), although some of the tools you invoke (Karect, sga) can use them. Is that just to keep things simpler, or have you found in your experiments that using quality scores makes results worse?

subprocess.CalledProcessError exit status -11

The following issue arose:
(pehaplo) root 17:36:35 /public/home/hanzhenzhi/ngs-project/WFM/D2587/quasispecies
$ pehaplo.py -f1 tmp/clean_1.fq -f2 tmp/clean_2.fq -l 50 -r 150 -std 100 -t 36 -m 200GB -correct no
Error correction begins -------------------------------------------------
Error Correction Started
Estimated Coverage = 14.745907
Global Num Reads = 145153888
Global Num Bases = 21695686234
Global Num Kmers = 20534455130
Global Max Read Len = 150
Global Avg Read Len = 149.466794
Min Overlap = 35
Min Overlap Percentage = 0.200000
Max Overlap Error Rate = 0.250000
Truncate Factor = 0.000000
Min Straight Edge Val = 73.729537
Min Straight Edge Per = 1.000000
Min Read Weight = 0.000000
Max Mul Len Matches = 13422818
Max Kmer Slots = 100000
Use Qual = 1
Match Type = Hamming Distance
Kmer Size = 13
Max Kmer Errors = 2
Kmer With Errors Size = 13
Read Index Bits = 40
Num Threads = 36
Memory Budget = 10995116277760[10240g]
File Block Size = 10485760[10m]
Cache Block Size = 134217728[128m]
Max Reads Per Step = 1000
Max Part Res = 110
Main CorrectErrors StartFile=0
Total Reads Size = 43681680245
Building Index [0 - 43536526356]
Traceback (most recent call last):
  File "/public/Software/bin/pehaplo.py", line 160, in <module>
    main()
  File "/public/Software/bin/pehaplo.py", line 52, in main
    subprocess.check_call("karect -correct -inputfile=%s -inputfile=%s -threads=%s -celltype=haploid -matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25" % (args.input_f1, args.input_f2, args.threads), shell=True)
  File "/public/Software/Miniconda3/envs/pehaplo/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'karect -correct -inputfile=tmp/clean_1.fq -inputfile=tmp/clean_2.fq -threads=36 -celltype=haploid -matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25' returned non-zero exit status -11

I have tried several things to solve this issue, such as reducing the overlap threshold as suggested for other problems, but it could not be solved. I also switched to the "root" account for the highest permissions, and both "-correct no" and "-correct yes" return the same error. Could you check this error and provide a solution?

Zhenzhi
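
A note on the exit status: a negative return code from Python's subprocess means the child process was killed by that signal, and -11 is SIGSEGV, i.e. karect itself crashed rather than exiting with an error code (memory pressure is a plausible suspect with 145M reads). A small Python 3 sketch demonstrating the convention:

    import signal
    import subprocess

    # A child killed by a signal is reported as a negative return code.
    proc = subprocess.run(['sh', '-c', 'kill -SEGV $$'])
    print(proc.returncode)                        # -11 on Linux
    print(signal.Signals(-proc.returncode).name)  # SIGSEGV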

Read quality trimming and adapter removal

The article mentions:

There are five major components in the pipeline of PEHaplo as shown
in Figure 3. In the first pre-processing stage, reads with low-quality
or ambiguous base calls are filtered or trimmed.

however, it's not implemented in the code, as far as I can tell.
There is also no mention of any adapter removal in the methods section.
Should we do the usual adapter removal and/or quality trimming procedure before using PEHaplo, or is the algorithm somehow immune to this noise?

My primary concern is that after trimming the read lengths will be widely distributed and many of them will be shorter than the calculated min overlap parameter.
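
One pragmatic mitigation for the length concern would be to drop trimmed reads shorter than the overlap cutoff before running PEHaplo; a minimal sketch (threshold and filenames are hypothetical):

    # filter_short_reads.py -- sketch; threshold and filenames are hypothetical.
    MIN_LEN = 100  # match this to PEHaplo's computed min overlap

    def filter_fasta(in_path, out_path, min_len):
        # Keep only FASTA records whose sequence is at least min_len bases.
        with open(in_path) as fin, open(out_path, 'w') as fout:
            header, seq = None, []
            for line in fin:
                if line.startswith('>'):
                    if header is not None and len(''.join(seq).replace('\n', '')) >= min_len:
                        fout.write(header)
                        fout.writelines(seq)
                    header, seq = line, []
                else:
                    seq.append(line)
            if header is not None and len(''.join(seq).replace('\n', '')) >= min_len:
                fout.write(header)
                fout.writelines(seq)

    # Caveat: for paired-end files, a read and its mate must be kept or dropped
    # together; this sketch treats each file independently.
    filter_fasta('trimmed_1.fa', 'filtered_1.fa', MIN_LEN)
    filter_fasta('trimmed_2.fa', 'filtered_2.fa', MIN_LEN)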

AssertionError in get_pairs_title_from_nondup.py

Hi Jiao,

I've been trying to get PEHaplo running for a while now, but I keep moving from error to error, so I was hoping you could help me out. First it complained when running "join_pair_end_fasta.py". Since the input given to the script in the main pipeline is FASTQ and there is also a "join_pair_end_fastq.py" script available in /tools, I replaced the former with the latter in the pipeline. This seemed to fix the issue. But then it crashed when running "get_pairs_title_from_nondup.py" because of an AssertionError:

assert (title_dict[base].endswith('1') and title.endswith('2')) or (title_dict[base].endswith('2') and title.endswith('1'))

So I renamed my input reads to ensure that forward readnames end with /1 and reverse readnames end with /2 (my input now looks similar to the example data) but PEHaplo still crashes. If I remove the assert statement it will continue to the next step and crash when running "Seperate_strand_single_pair.py" because of a KeyError after the graph construction finishes.

Any help would be very much appreciated!

Best,
Jasmijn

Test run fails

Hi Jiao,

I am having problems getting PEHaplo to run. I think all the requirements are installed correctly, but I am not sure how to verify that. I tried following the README and created an assembly folder in the root folder of the repo. Then I changed into the assembly folder and ran python ../pehaplo.py -f1 ../raw_test_data/virus_1.fa -f2 ../raw_test_data/virus_2.fa -l 180 -l1 210 -r 250 -F 600 -std 150 -n 3 -correct yes, which crashed after a while with the following error:

# gt readjoiner overlap (version 1.2)
# number of reads in filtered readset = 27066
# number of irreducible suffix-prefix matches = 27088
# average irreducible SPM/read = 1.00
# number of transitive suffix-prefix matches = 1351014
Number of total reads:  27066
Number of paired reads: 11472
Number of unpaired reads:       15594
Graph construction finished!
Traceback (most recent call last):
  File "/home/microadmin/Programs/PEHaplo/tools/Seperate_strand_single_pair.py", line 480, in <module>
    plus_reads_dict, minus_reads_dict = BFS_tranverse_graph_pair(G, read_db, des_list, starting_nodes[0], pair1_dict, pair2_dict)
  File "/home/microadmin/Programs/PEHaplo/tools/Seperate_strand_single_pair.py", line 280, in BFS_tranverse_graph_pair
    candidate_nodes.extend(G.predecessors(vertex))
AttributeError: 'dictionary-keyiterator' object has no attribute 'extend'
Traceback (most recent call last):
  File "../pehaplo.py", line 157, in <module>
    main()
  File "../pehaplo.py", line 85, in main
    subprocess.check_call('python "%s"/tools/Seperate_strand_single_pair.py samp.des samp_edges.txt kept_num.fa kept_pairs.txt' % base_path, shell=True)
  File "/opt/miniconda3/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python "/home/microadmin/Programs/PEHaplo"/tools/Seperate_strand_single_pair.py samp.des samp_edges.txt kept_num.fa kept_pairs.txt' returned non-zero exit status 1

Is this a bug or is there something wrong with the installation? Could you maybe create docker image of a working setup that could be used as an alternative to the complicated manual installation of the requirements?

Thanks for your help in advance!

Kind regards,
Balazs
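
For what it's worth, this AttributeError pattern usually points at a NetworkX version mismatch: NetworkX 2.x changed G.predecessors() to return an iterator, while the code expects the 1.x list behavior (another issue above reports NetworkX 1.11 working). A version-tolerant form of the failing line would be:

    # Works whether predecessors() returns a list (NetworkX 1.x)
    # or an iterator (NetworkX 2.x):
    candidate_nodes.extend(list(G.predecessors(vertex)))

Alternatively, pinning networkx to 1.11 in the environment should avoid the crash.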
