orforise's Issues

Division by zero error when running Annotation_Compare on GeneMarkS gff

I'm having a problem analyzing a GFF made with GeneMarkS. Running the following command:

python3 -m ORForise.Annotation_Compare \
  -dna ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.dna.chromosome.Chromosome.fa \
  -ref ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.52.chromosome.Chromosome.gff3 \
  -t Prodigal \
  -tp ../genemarks_out/CGT1029_contigs_genemarks_out.gff

throws the following error:

Traceback (most recent call last):
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 173, in <module>
comparator(**vars(args))
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 85, in comparator
orfs = tool_(tool_prediction, genome_Seq)
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/ORForise/src/ORForise/Tools/Prodigal/Prodigal.py", line 18, in Prodigal
if "Prodigal" in line[1] and "CDS" in line[2]:
IndexError: list index out of range
(base) [cnaughton7@biogenome2022 sandbox]$ less ../genemarks_out/CGT1029_contigs_genemarks_out.gff
(base) [cnaughton7@biogenome2022 sandbox]$ python3 ../sandbox/ORForise/src/ORForise/Annotation_Compare.py -dna ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.dna.chromosome.Chromosome.fa -ref ../reference/Listeria_monocytogenes_serotype_4b_str_ll195_gca_000318055.ASM31805v1.52.chromosome.Chromosome.gff3 -t GeneMark_S -tp ../genemarks_out/CGT1029_contigs_genemarks_out.gff
Traceback (most recent call last):
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 173, in <module>
comparator(**vars(args))
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/../sandbox/ORForise/src/ORForise/Annotation_Compare.py", line 86, in comparator
all_Metrics, all_rep_Metrics, start_precision, stop_precision, other_starts, other_stops, perfect_Matches, missed_genes, unmatched_orfs, undetected_gene_metrics, unmatched_orf_metrics, orf_Coverage_Genome, matched_ORF_Coverage_Genome, gene_coverage_genome, multi_Matched_ORFs, partial_Hits = tool_comparison(
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/ORForise/src/ORForise/Comparator.py", line 389, in tool_comparison
atg_P, gtg_P, ttg_P, att_P, ctg_P, other_Start_P, other_Starts = start_Codon_Count(orfs)
File "/home/groupc/analysis/Team3-GenePrediction/sandbox/ORForise/src/ORForise/Comparator.py", line 177, in start_Codon_Count
atg_P = format(100 * atg / len(orfs), '.2f')
ZeroDivisionError: division by zero

I've checked the gff and confirmed there are genes in the file. Do you see anything obvious that I'm doing wrong?
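
One way to narrow this down is to look at what a column-based parser would see in the file. The first traceback would be consistent with a line that splits into fewer fields than expected (a blank line, for example), and the second with a parser that accepted no lines at all, so that len(orfs) is zero. A minimal sketch, assuming a tab-delimited GFF; summarise_gff is a hypothetical helper and the example records are made up, neither comes from ORForise:

```python
from collections import Counter

def summarise_gff(lines):
    """Count (source, type) pairs and tally lines that are not 9-column records."""
    pairs = Counter()
    malformed = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        fields = line.split("\t")
        if len(fields) != 9:
            malformed += 1  # e.g. space-delimited or truncated records
            continue
        pairs[(fields[1], fields[2])] += 1
    return pairs, malformed

# Made-up records for illustration only.
example = [
    "##gff-version 3\n",
    "contig_1\tGeneMark.hmm\tCDS\t10\t300\t.\t+\t0\tgene_id=1\n",
    "\n",
    "contig_1 GeneMark.hmm CDS 400 900 . - 0 gene_id=2\n",  # spaces, not tabs
]
pairs, malformed = summarise_gff(example)
print(pairs, malformed)
```

To run it on a real file, pass an open file handle in place of the list. If the source and type strings in columns 2 and 3 differ from what the parser selected with -t looks for (the Prodigal parser in the traceback tests for "Prodigal" and "CDS"), no ORFs would be collected.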

Negative r_Start values

Hi Nick, I've tried to use ORForise with a gff file produced by Balrog, but I'm getting the following error:

  File "/home/matt/ORForise_1.3.0/ORForise/src/ORForise/Comparator.py", line 79, in nuc_Count
    gc_content = (g + c) * 100 / (a + t + g + c + n)
ZeroDivisionError: division by zero

I've found the cause as well. Balrog predicts genes that go past the end and start back at the beginning (2764701 > 2764699).

##gff-version 3
##sequence-region Chromosome 1 2764699
Chromosome balrog CDS 114 308 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein

Chromosome balrog CDS 2760738 2762504 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Chromosome balrog CDS 2762556 2763095 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Chromosome balrog CDS 2763494 2764701 . - 0 inference=ab initio prediction:Balrog;product=hypothetical protein

This leads to a negative value of r_Start.

def nuc_Count(start, stop, strand):  # Gets correct seq then returns GC
    if strand == '-':
        r_Start = comp.genome_Size - stop
        r_Stop = comp.genome_Size - start
        seq = (comp.genome_Seq_Rev[r_Start:r_Stop + 1])

balrogcpp-Staph.gff
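
The arithmetic can be checked with the numbers above: with a genome size of 2764699 and a stop of 2764701, r_Start is -2. Python reads a negative slice start as an offset from the end of the string, so the slice comes back empty and the GC denominator is zero. A minimal sketch of both points; features_past_end is a hypothetical helper (not part of ORForise), the records are assumed to be tab-delimited, and toy_rev is a ten-base stand-in for comp.genome_Seq_Rev:

```python
def features_past_end(gff_lines, genome_size):
    """Return (start, stop, strand) for features whose stop exceeds genome_size."""
    flagged = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip pragmas, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a feature record
        start, stop, strand = int(fields[3]), int(fields[4]), fields[6]
        if stop > genome_size:
            flagged.append((start, stop, strand))
    return flagged

genome_size = 2764699
gff = [
    "Chromosome\tbalrog\tCDS\t2762556\t2763095\t.\t-\t0\tproduct=hypothetical protein",
    "Chromosome\tbalrog\tCDS\t2763494\t2764701\t.\t-\t0\tproduct=hypothetical protein",
]
flagged = features_past_end(gff, genome_size)
print(flagged)

# Why the slice is empty: a negative start index counts from the end.
toy_rev = "ACGTACGTAC"            # stand-in sequence, length 10
r_start = len(toy_rev) - 12       # stop lies 2 bases past the end, giving -2
r_stop = len(toy_rev) - 5
empty_slice = toy_rev[r_start:r_stop + 1]
print(repr(empty_slice))
```

How such features should be treated (clamped, wrapped around a circular chromosome, or skipped) is a design decision for the tool; the sketch only detects them.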

GFF adder/Intersector

Thank you for the nice code. I am new to Python and TnSeq analysis. Currently I use different gene annotation tools and try to merge their output for use in TnSeq analysis. For the analysis, I require the locus tag for each ORF. I wrote some code that does the job, but I found ORForise useful, with more of the features I want. I was wondering if you would consider adding a feature that writes the “ID” attribute (or the whole 9th column) from both the reference and the tool-generated GFF files to the output files of the GFF adder/intersector? The resulting output GFF file would be helpful for my further TnSeq analysis.
Thank you in advance
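
For reference, the 9th GFF3 column is a semicolon-separated list of key=value pairs, so an ID or locus tag can be pulled out with a few lines. This is a minimal sketch under that assumption; parse_attributes and locus_id are hypothetical helpers, the example record is made up, and none of it reflects how ORForise writes its output:

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def locus_id(gff_line, keys=("locus_tag", "ID")):
    """Return the first of the given attribute keys found on a GFF line, or None."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None  # not a complete feature record
    attrs = parse_attributes(fields[8])
    for key in keys:
        if key in attrs:
            return attrs[key]
    return None

# Made-up record for illustration only.
line = "Chromosome\tref\tCDS\t114\t308\t.\t-\t0\tID=cds-1;locus_tag=ABC_0001;product=hypothetical protein"
print(locus_id(line))
```

The keys tuple sets the order of preference, so passing keys=("ID",) would return the ID attribute instead of the locus tag.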

GFF input

I tried downloading the not yet released ORForise 1.3.0 and I got an error for my GFF pyrodigal input.

  File "/home/matt/ORForise_1.3.0/ORForise/src/ORForise/Tools/GFF/GFF.py", line 14, in GFF
    types = args[2]
IndexError: tuple index out of range

args[2] doesn’t exist, but the types variable is compared to line[2] later on.

gene_types = types.split(',')
if any(gene_type == line[2] for gene_type in gene_types) and len(line) == 9:

I checked what line[2] was equal to. Its value is “CDS”, so I just set types equal to “CDS” as well to see what would happen (types = "CDS"). The code ran and I got an output. It’s a pretty hacky solution though, and you have a commented-out part later on where it looks like you're working on “CDS”.

                # elif "CDS" in line[2]:
                #     sys.exit("SAS")

Here is the file:
pyrodigal_E-coli.gff
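
A slightly less hacky version of the same workaround is to fall back to a default only when the third argument is missing. This is a minimal sketch, assuming the parser receives its inputs as a positional tuple, as the args[2] lookup suggests; feature_types and keep_line are hypothetical names and the "CDS" default is an assumption, not the project's chosen behaviour:

```python
def feature_types(args, default="CDS"):
    """Return the feature types to keep, using a default if none was passed."""
    types = args[2] if len(args) > 2 and args[2] else default
    return [t.strip() for t in types.split(",")]

def keep_line(line, gene_types):
    """True for 9-column records whose type column is one of gene_types."""
    fields = line.rstrip("\n").split("\t")
    # Check the column count first so fields[2] cannot raise IndexError.
    return len(fields) == 9 and fields[2] in gene_types

gene_types = feature_types(("pyrodigal_E-coli.gff", "ACGT"))  # no third argument
record = "Chromosome\tpyrodigal\tCDS\t1\t300\t.\t+\t0\tID=1"   # made-up record
print(gene_types, keep_line(record, gene_types))
```

Checking the column count before indexing also avoids an IndexError on short or blank lines.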

Inconsistent results for forward vs backward frames

I tried gaming your script by creating six ORFs that span the entirety of all six frames (of Myco). I expected 100% partial matches, but it reported only 41.81%.

$ cat game.gff 
##gff-version 3
Chromosome	None	CDS	1	580074	.	+	.	
Chromosome	None	CDS	1	580074	.	-	.	
Chromosome	None	CDS	2	580075	.	+	.	
Chromosome	None	CDS	2	580075	.	-	.	
Chromosome	None	CDS	3	580076	.	+	.	
Chromosome	None	CDS	3	580076	.	-	.	
$ Annotation-Compare -dna Genomes/Myco.fa -ref Genomes/Myco.gff -t GFF -tp game.gff
Thank you for using ORForise
Please report any issues to: https://github.com/NickJD/ORForise/issues
#####
Genome Used: Myco.gff
Reference Used: Genomes/Myco.gff
Tool Compared: GFF
Perfect Matches: 0 [476] - 0.00%
Partial Matches: 199 [476] - 41.81%
Missed Genes: 277 [476] - 58.19%

I tracked down which CDSs were missed: all 277 missed genes are forward-strand CDSs, while all 199 reverse-strand CDSs were found.

$ grep CDS Genomes/Myco.gff | cut -f7 | sort | uniq -c
    277 +
    199 -

Calculate gene length

Hi Nick, it was great chatting with you today. I just want to mention one more tiny thing I noticed. In Comparator.py, you calculate gene length in the following way:
gene_Length = (g_Stop - g_Start)
Are g_Start and g_Stop the first and last bases of the gene? Wouldn't you have to add 1, because the length is start and stop inclusive (i.e. gene_Length = (g_Stop - g_Start) + 1)?

I look forward to giving you updates on my project as I continue my analysis.
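
GFF3 coordinates are 1-based with both start and end included, so the point above can be checked on any CDS: the inclusive length should be a multiple of three. A minimal sketch, assuming g_Start and g_Stop hold the first and last bases exactly as in the GFF; the coordinates 114 and 308 are illustrative values only:

```python
def inclusive_length(start, stop):
    """Length of a feature whose start and stop bases both belong to it."""
    return stop - start + 1

start, stop = 114, 308                      # illustrative CDS coordinates
exclusive = stop - start                    # the current calculation
inclusive = inclusive_length(start, stop)   # the proposed calculation
print(exclusive, exclusive % 3)
print(inclusive, inclusive % 3)
```

If the coordinates are converted to 0-based, end-exclusive values when the file is read, the plain difference would already be correct, so it is worth checking how the values are stored first.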

ORForise.Annotation_Compare failing after a few matches

ORForise.Annotation_Compare seems to be having an issue after matching a few genes. Running the command:

python3 -m ORForise.Annotation_Compare -dna ../reference/Listeria_ref.fasta \
  -ref ../reference/Listeria_geneAnnotations.gff3 \
  -t Prodigal -tp ../prodigal_output_tool/prodigal_gffCGT1005_contigs.gff \
  -v True

gives the output:

No Hit
Out of Frame Predicted CDS
Out of Frame Predicted CDS
Partial Match
Traceback (most recent call last):
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/site-packages/ORForise/Annotation_Compare.py", line 167, in <module>
comparator(**vars(args))
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/site-packages/ORForise/Annotation_Compare.py", line 80, in comparator
all_Metrics, all_rep_Metrics, start_precision, stop_precision, other_starts, other_stops, perfect_Matches, missed_genes, unmatched_orfs, undetected_gene_metrics, unmatched_orf_metrics, orf_Coverage_Genome, matched_ORF_Coverage_Genome, gene_coverage_genome, multi_Matched_ORFs, partial_Hits = tool_comparison(
File "/home/groupc/anaconda3/envs/team3_gene_prediction/lib/python3.10/site-packages/ORForise/Comparator.py", line 332, in tool_comparison
previously_Covered_Gene = comp.matched_ORFs[g_pos][-1]
KeyError: '4869,5981'

Can you please advise on this?

Median_Stop_Difference_of_Matched_ORFs is NA

Hey Nick,

When running ORForise on Prodigal or GeneMarkS output, we get N/A for "Median_Stop_Difference_of_Matched_ORFs":

Percentage_of_Genes_Detected,Percentage_of_ORFs_that_Detected_a_Gene,Percent_Difference_of_All_ORFs,Median_Length_Difference,Percentage_of_Perfect_Matches,Median_Start_Difference_of_Matched_ORFs,Median_Stop_Difference_of_Matched_ORFs,Percentage_Difference_of_Matched_Overlapping_CDSs,Percent_Difference_of_Short-Matched-ORFs,Precision,Recall,False_Discovery_Rate
98.05,93.22,5.13,-2.48,94.21,-3.0,N/A,-3.61,-16.87,0.93,0.98,0.07

Any idea what is causing this and how it might be fixed?

Thanks,
Tom

AttributeError: module 'numpy' has no attribute 'int'

Can .gff3 be used as input instead of .gff? I am new to Python. Is this error specific to the GFF file, or is it due to numpy?

(orfwise) C:\Users\tmu258>Annotation-Compare -dna C:/Users/tmu258/orfwise/NZ_CP008776.fa -ref C:/Users/tmu258/orfwise/NZ_CP008776.gff -t Prodigal -tp C:/Users/tmu258/orfwise/prokka_GAS.gff
Thank you for using ORForise
Please report any issues to: https://github.com/NickJD/ORForise/issues

Traceback (most recent call last):
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\tmu258\Anaconda3\envs\orfwise\Scripts\Annotation-Compare.exe\__main__.py", line 7, in <module>
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\ORForise\Annotation_Compare.py", line 184, in main
comparator(options)
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\ORForise\Annotation_Compare.py", line 70, in comparator
all_Metrics, all_rep_Metrics, start_precision, stop_precision, other_starts, other_stops, perfect_Matches, missed_genes, unmatched_orfs, undetected_gene_metrics, unmatched_orf_metrics, orf_Coverage_Genome, matched_ORF_Coverage_Genome, gene_coverage_genome, multi_Matched_ORFs, partial_Hits = tool_comparison(
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\ORForise\Comparator.py", line 392, in tool_comparison
gene_Nuc_Array = np.zeros((comp.genome_Size), dtype=np.int)
File "C:\Users\tmu258\Anaconda3\envs\orfwise\lib\site-packages\numpy\__init__.py", line 284, in __getattr__
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'int'
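
The traceback points at numpy rather than at the input file: the np.int alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so any call that passes dtype=np.int fails on newer installs regardless of the file extension. A minimal sketch of the usual replacement, where genome_size is a stand-in for comp.genome_Size:

```python
import numpy as np

genome_size = 10  # stand-in for comp.genome_Size

# The builtin int (or an explicit width such as np.int64) replaces np.int.
gene_nuc_array = np.zeros(genome_size, dtype=int)
gene_nuc_array[2:5] = 1  # mark three positions as covered
covered = int(gene_nuc_array.sum())
print(covered)
```

Without editing the installed package, installing a NumPy release older than 1.24 into the environment would be expected to avoid the error as well.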

tool_comparison function in Comparator.py

Hi Nick, I just used a CPU profiler, Scalene, to see where the bottlenecks in ORForise were. Line 280 in Comparator.py seems to be the bottleneck: orf_Set = set(range(o_Start, o_Stop + 1))

By getting rid of that line and calculating the overlap differently, I was able to run a GFF file in about 10 seconds instead of 3 minutes and 50 seconds.

Delete: orf_Set = set(range(o_Start, o_Stop + 1)) on line 280
Replace: overlap = len(gene_Set.intersection(orf_Set))
With: overlap = max(min(o_Stop, g_Stop) - max(o_Start, g_Start), -1) + 1 on lines 287 and 296 (which become 286 and 295 after the deletion)
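
The two calculations can be compared directly on a few intervals. This is a minimal sketch assuming inclusive coordinates with start <= stop on both sides; the function names and test intervals are made up for the comparison and are not taken from Comparator.py:

```python
def overlap_by_sets(o_start, o_stop, g_start, g_stop):
    """Overlap counted by materialising every position (slow for long ORFs)."""
    orf_set = set(range(o_start, o_stop + 1))
    gene_set = set(range(g_start, g_stop + 1))
    return len(gene_set.intersection(orf_set))

def overlap_by_arithmetic(o_start, o_stop, g_start, g_stop):
    """Same overlap from the interval end points, in constant time."""
    return max(min(o_stop, g_stop) - max(o_start, g_start), -1) + 1

cases = [
    (1, 10, 5, 20),    # partial overlap
    (1, 10, 11, 20),   # adjacent, no shared base
    (1, 10, 10, 20),   # single shared base
    (5, 8, 1, 20),     # ORF inside gene
    (30, 40, 1, 20),   # disjoint
]
set_results = [overlap_by_sets(*c) for c in cases]
arith_results = [overlap_by_arithmetic(*c) for c in cases]
print(set_results)
print(arith_results)
```

The max(..., -1) clamp makes disjoint intervals evaluate to 0 rather than a negative number, which matches the size of an empty intersection.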
