nickjd / orforise Goto Github PK

Comparison pipeline for Prokaryote Protein Coding Gene Predictors

License: GNU General Public License v3.0

Python 100.00%

orforise's Issues

Reverse complement efficiency

Division by zero error when running Annotation_Compare on GeneMarkS gff

I'm having a problem analyzing a GFF file made with GeneMarkS. Running the following command:

python3 -m ORForise.Annotation_Compare \
    -dna ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.dna.chromosome.Chromosome.fa \
    -ref ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.52.chromosome.Chromosome.gff3 \
    -t Prodigal \
    -tp ../genemarks_out/CGT1029_contigs_genemarks_out.gff

throws the following error:

Traceback (most recent call last):
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 173, in <module>
comparator(**vars(args))
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 85, in comparator
orfs = tool_(tool_prediction, genome_Seq)
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/ORForise/src/ORForise/Tools/Prodigal/Prodigal.py", line 18, in Prodigal
if "Prodigal" in line[1] and "CDS" in line[2]:
IndexError: list index out of range
(base) [cnaughton7@biogenome2022 sandbox]$ less ../genemarks_out/CGT1029_contigs_genemarks_out.gff
(base) [cnaughton7@biogenome2022 sandbox]$ python3 ../sandbox/ORForise/src/ORForise/Annotation_Compare.py -dna ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.dna.chromosome.Chromosome.fa -ref ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.52.chromosome.Chromosome.gff3 -t GeneMark_S -tp ../genemarks_out/CGT1029_contigs_genemarks_out.gff
Traceback (most recent call last):
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 173, in <module>
comparator(**vars(args))
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 86, in comparator
all_Metrics, all_rep_Metrics, start_precision, stop_precision, other_starts, other_stops, perfect_Matches, missed_genes, unmatched_orfs, undetected_gene_metrics, unmatched_orf_metrics, orf_Coverage_Genome, matched_ORF_Coverage_Genome, gene_coverage_genome, multi_Matched_ORFs, partial_Hits = tool_comparison(
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/ORForise/src/ORForise/Comparator.py", line 389, in tool_comparison
atg_P, gtg_P, ttg_P, att_P, ctg_P, other_Start_P, other_Starts = start_Codon_Count(orfs)
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/ORForise/src/ORForise/Comparator.py", line 177, in start_Codon_Count
atg_P = format(100 * atg / len(orfs), '.2f')
ZeroDivisionError: division by zero

I've checked the GFF file and confirmed that it contains genes. Do you see anything obvious that I'm doing wrong?
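
The second traceback shows start_Codon_Count dividing by len(orfs), which is zero when the parser for the selected tool matches no lines in the GFF file. A minimal sketch of a guard for that case follows. The helper name percent_of is hypothetical and not part of ORForise, and returning "N/A" for an empty input is an assumption about the desired behaviour:

```python
def percent_of(count, total):
    """Format count as a percentage of total, or 'N/A' when total is zero."""
    if total == 0:
        # No ORFs were parsed, so a percentage is undefined.
        return "N/A"
    return format(100 * count / total, ".2f")


orfs = {}  # stand-in for what a tool parser returns when no line matches
atg = 0

atg_P = percent_of(atg, len(orfs))
print(atg_P)
```

A guard like this only hides the symptom. An empty orfs collection usually means the source column of the GFF file does not match what the chosen parser looks for, so a clear error message at parse time may be the better fix.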

Negative r_Start values

Hi Nick, I've tried to use ORForise with a GFF file produced by Balrog, but I'm getting the following error:

  File "/home/matt/ORForise_1.3.0/ORForise/src/ORForise/Comparator.py", line 79, in nuc_Count
    gc_content = (g + c) * 100 / (a + t + g + c + n)
ZeroDivisionError: division by zero

I've found the cause as well. Balrog predicts genes that run past the end of the sequence and wrap around to the beginning: the last CDS below stops at 2764701, but the sequence is only 2764699 bases long.

##gff-version 3
##sequence-region Chromosome 1 2764699
Chromosome balrog CDS 114 308 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein

Chromosome balrog CDS 2760738 2762504 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Chromosome balrog CDS 2762556 2763095 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Chromosome balrog CDS 2763494 2764701 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein

This leads to a negative value of r_Start, because stop is larger than comp.genome_Size:

def nuc_Count(start, stop, strand):  # Gets correct seq then returns GC
    if strand == '-':
        r_Start = comp.genome_Size - stop
        r_Stop = comp.genome_Size - start
        seq = (comp.genome_Seq_Rev[r_Start:r_Stop + 1])

balrogcpp-Staph.gff
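
One possible way to handle such wrap-around predictions is to split them into two segments before any sequence slicing. The sketch below assumes 1-based inclusive GFF coordinates and a circular chromosome. The function split_wrapped_feature is hypothetical and not part of ORForise:

```python
def split_wrapped_feature(start, stop, genome_size):
    """Split a 1-based inclusive feature that runs past the genome end.

    Returns a list of (start, stop) segments that all lie within
    1..genome_size. Assumes a circular sequence.
    """
    if not 1 <= start <= genome_size:
        raise ValueError(f"start {start} is outside 1..{genome_size}")
    if stop <= genome_size:
        return [(start, stop)]
    # The part past the end continues from position 1.
    return [(start, genome_size), (1, stop - genome_size)]


# Last Balrog CDS from the GFF excerpt above.
segments = split_wrapped_feature(2763494, 2764701, 2764699)
print(segments)
```

Whether to split, clamp, or reject such features is a design decision for the tool. Splitting keeps the full predicted length, while clamping would be simpler but would truncate the gene.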

GFF adder/Intersector

Thank you for the nice code. I am new to Python and TnSeq analysis. Currently I use several gene annotation tools and try to merge their output for use in TnSeq analysis, for which I need the locus tag of each ORF. I wrote some code that does the job, but ORForise is useful and has more of the features I want. Would you consider adding an option to write the “ID” attribute, or the whole 9th column, from both the reference and the tool-generated GFF files to the output files of GFF adder/intersector? The resulting GFF file would help me in further TnSeq analysis.
Thank you in advance

GFF input

I downloaded the not-yet-released ORForise 1.3.0 and got an error with my pyrodigal GFF input.

  File "/home/matt/ORForise_1.3.0/ORForise/src/ORForise/Tools/GFF/GFF.py", line 14, in GFF
    types = args[2]
IndexError: tuple index out of range

args[2] doesn’t exist, but the types variable is compared to line[2] later on.

gene_types = types.split(',')
if any(gene_type == line[2] for gene_type in gene_types) and len(line) == 9:

I checked what line[2] was equal to. Its value is “CDS”, so I set types to “CDS” as well to see what would happen (types = "CDS"). The code ran and I got an output. It's a pretty hacky solution, though, and there is a commented-out part later on where it looks like you're working on “CDS”.

                # elif "CDS" in line[2]:
                #     sys.exit("SAS")

Here is the file:
pyrodigal_E-coli.gff
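
A less hacky variant of the workaround is to fall back to a default only when the third argument is missing. This is a sketch under assumptions: the helper names resolve_gene_types and is_wanted_feature are hypothetical, "CDS" as the default is taken from the workaround above, and the GFF line is made up for illustration:

```python
def resolve_gene_types(args, default="CDS"):
    """Return the list of feature types to keep, using a default if none was passed."""
    types = args[2] if len(args) > 2 else default
    return [t.strip() for t in types.split(",")]


def is_wanted_feature(line, gene_types):
    """True for a 9-column GFF record whose type column is one of gene_types."""
    return len(line) == 9 and line[2] in gene_types


# Hypothetical call with only two positional arguments, as in the traceback.
gene_types = resolve_gene_types(("predictions.gff", "ACGT"))

# Made-up GFF record, split on tabs.
record = "contig_1\tpyrodigal\tCDS\t337\t2799\t.\t+\t0\tID=1_1".split("\t")
print(gene_types, is_wanted_feature(record, gene_types))
```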

Inconsistent results for forward vs backward frames

I tried gaming your script by creating six ORFs that together span the entirety of all six frames (of Myco). I expected 100% partial matches, but it only got 41%.

$ cat game.gff 
##gff-version 3
Chromosome	None	CDS	1	580074	.	+	.	
Chromosome	None	CDS	1	580074	.	-	.	
Chromosome	None	CDS	2	580075	.	+	.	
Chromosome	None	CDS	2	580075	.	-	.	
Chromosome	None	CDS	3	580076	.	+	.	
Chromosome	None	CDS	3	580076	.	-	.	
$ Annotation-Compare -dna Genomes/Myco.fa -ref Genomes/Myco.gff -t GFF -tp game.gff
Thank you for using ORForise
Please report any issues to: https://github.com/NickJD/ORForise/issues
#####
Genome Used: Myco.gff
Reference Used: Genomes/Myco.gff
Tool Compared: GFF
Perfect Matches: 0 [476] - 0.00%
Partial Matches: 199 [476] - 41.81%
Missed Genes: 277 [476] - 58.19%

I tracked down which CDS were missed: all 277 missed genes are forward-strand CDS, while all 199 reverse-strand CDS were found.

$ grep CDS Genomes/Myco.gff | cut -f7 | sort | uniq -c
    277 +
    199 -

Calculate gene length

Hi Nick, it was great chatting with you today. I just want to mention one more tiny thing I noticed. In Comparator.py, you calculate gene length in the following way:
gene_Length = (g_Stop - g_Start)
Are g_Start and g_Stop the first and last bases of the gene? Wouldn't you have to add 1, because the length is start and stop inclusive (i.e. gene_Length = (g_Stop - g_Start) + 1)?
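
The off-by-one can be checked against any CDS with known coordinates. A minimal sketch, assuming 1-based inclusive GFF coordinates (feature_length is a hypothetical helper name). The coordinates 114..308 come from the first Balrog CDS quoted in the "Negative r_Start values" issue:

```python
def feature_length(start, stop):
    """Length in bases of a feature with 1-based inclusive coordinates."""
    return stop - start + 1


length = feature_length(114, 308)
# A complete CDS should be a whole number of codons.
print(length, length % 3)
```

Here the inclusive length is 195 bases, which is exactly 65 codons, while stop - start alone gives 194, which is not a multiple of three.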

I look forward to giving you updates on my project as I continue my analysis.

ORForise.Annotation_Compare failing after a few matches

ORForise.Annotation_Compare seems to be having an issue after matching a few genes. Running the command:

python3 -m ORForise.Annotation_Compare -dna ../reference/Listeria_ref.fasta \
    -ref ../reference/Listeria_geneAnnotations.gff3 \
    -t Prodigal -tp ../prodigal_output_tool/prodigal_gffCGT1005_contigs.gff \
    -v True

gives the output:

No Hit
Out of Frame Predicted CDS
Out of Frame Predicted CDS
Partial Match
Traceback (most recent call last):
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/site-packages/ORForise/Annotation_Compare.py", line 167, in <module>
comparator(**vars(args))
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/site-packages/ORForise/Annotation_Compare.py", line 80, in comparator
all_Metrics, all_rep_Metrics, start_precision, stop_precision, other_starts, other_stops, perfect_Matches, missed_genes, unmatched_orfs, undetected_gene_metrics, unmatched_orf_metrics, orf_Coverage_Genome, matched_ORF_Coverage_Genome, gene_coverage_genome, multi_Matched_ORFs, partial_Hits = tool_comparison(
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/site-packages/ORForise/Comparator.py", line 332, in tool_comparison
previously_Covered_Gene = comp.matched_ORFs[g_pos][-1]
KeyError: '4869,5981'

Can you please advise on this?

Median_Stop_Difference_of_Matched_ORFs is NA

Hey Nick,

When running ORForise on Prodigal or GeneMarkS output, we get N/A for "Median_Stop_Difference_of_Matched_ORFs":

Percentage_of_Genes_Detected,Percentage_of_ORFs_that_Detected_a_Gene,Percent_Difference_of_All_ORFs,Median_Length_Difference,Percentage_of_Perfect_Matches,Median_Start_Difference_of_Matched_ORFs,Median_Stop_Difference_of_Matched_ORFs,Percentage_Difference_of_Matched_Overlapping_CDSs,Percent_Difference_of_Short-Matched-ORFs,Precision,Recall,False_Discovery_Rate
98.05,93.22,5.13,-2.48,94.21,-3.0,N/A,-3.61,-16.87,0.93,0.98,0.07

Any idea what is causing this and how it might be fixed?

Thanks,
Tom

AttributeError: module 'numpy' has no attribute 'int'

Can .gff3 be used as input instead of .gff? I am new to Python. Is this error specific to the GFF file, or is it due to numpy?

(orfwise) C:\Users\tmu258>Annotation-Compare -dna C:/Users/tmu258/orfwise/NZ_CP008776.fa -ref C:/Users/tmu258/orfwise/NZ_CP008776.gff -t Prodigal -tp C:/Users/tmu258/orfwise/prokka_GAS.gff
Thank you for using ORForise
Please report any issues to: https://github.com/NickJD/ORForise/issues

Traceback (most recent call last):
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\tmu258\Anaconda3\envs\orfwise\Scripts\Annotation-Compare.exe\__main__.py", line 7, in <module>
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\ORForise\Annotation_Compare.py", line 184, in main
comparator(options)
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\ORForise\Annotation_Compare.py", line 70, in comparator
all_Metrics, all_rep_Metrics, start_precision, stop_precision, other_starts, other_stops, perfect_Matches, missed_genes, unmatched_orfs, undetected_gene_metrics, unmatched_orf_metrics, orf_Coverage_Genome, matched_ORF_Coverage_Genome, gene_coverage_genome, multi_Matched_ORFs, partial_Hits = tool_comparison(
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\ORForise\Comparator.py", line 392, in tool_comparison
gene_Nuc_Array = np.zeros((comp.genome_Size), dtype=np.int)
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\numpy\__init__.py", line 284, in __getattr__
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'int'
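
The traceback points at NumPy rather than at the GFF input: np.int was only an alias for the built-in int, was deprecated in NumPy 1.20, and was removed in NumPy 1.24. A minimal sketch of the replacement for the failing line, with a toy genome_size standing in for comp.genome_Size:

```python
import numpy as np

genome_size = 10  # toy value standing in for comp.genome_Size

# dtype=int works on both old and new NumPy releases; np.int does not exist in 1.24+.
gene_nuc_array = np.zeros(genome_size, dtype=int)

# Mark positions 3..5 (1-based) as covered by a gene.
gene_nuc_array[2:5] = 1
print(gene_nuc_array.sum())
```

Until the package itself is patched, installing a NumPy release older than 1.24 in the environment is another way around the error.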

tool_comparison function in Comparator.py

Hi Nick, I just used a CPU profiler, scalene, to see where the bottlenecks in ORForise were. Line 280 in Comparator.py seems to be the bottleneck: orf_Set = set(range(o_Start, o_Stop + 1))

By getting rid of that line and calculating the overlap differently, I was able to run a GFF file in about 10 seconds instead of 3 minutes and 50 seconds.

Delete: orf_Set = set(range(o_Start, o_Stop + 1)) on line 280
Replace: overlap = len(gene_Set.intersection(orf_Set))
With: overlap = max(min(o_Stop, g_Stop) - max(o_Start, g_Start), -1) + 1 on lines 287 and 296 (which become 286 and 295)
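
The two ways of computing the overlap can be checked against each other on random intervals. This sketch assumes that gene_Set is built like orf_Set, that is as set(range(g_Start, g_Stop + 1)), and that start <= stop for both intervals. The function names are hypothetical:

```python
import random


def overlap_sets(o_start, o_stop, g_start, g_stop):
    """Overlap in bases via set intersection (memory and time grow with interval length)."""
    gene_set = set(range(g_start, g_stop + 1))
    orf_set = set(range(o_start, o_stop + 1))
    return len(gene_set.intersection(orf_set))


def overlap_arith(o_start, o_stop, g_start, g_stop):
    """Overlap in bases via interval arithmetic (constant time)."""
    return max(min(o_stop, g_stop) - max(o_start, g_start), -1) + 1


random.seed(0)
for _ in range(1000):
    o_start = random.randint(1, 200)
    o_stop = random.randint(o_start, 250)
    g_start = random.randint(1, 200)
    g_stop = random.randint(g_start, 250)
    coords = (o_start, o_stop, g_start, g_stop)
    assert overlap_sets(*coords) == overlap_arith(*coords), coords

print("both methods agree on 1000 random interval pairs")
```

The max(..., -1) + 1 term clamps the result to zero when the intervals are disjoint, which is why the arithmetic version matches the set-based one without building any sets.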
