Prokka: rapid prokaryotic genome annotation

Introduction

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Installation

Bioconda

If you use Conda you can use the Bioconda channel:

conda install -c conda-forge -c bioconda -c defaults prokka

Brew

If you are using the MacOS Brew or LinuxBrew packaging system:

brew install brewsci/bio/prokka

Docker

Maintained by https://hub.docker.com/u/staphb


docker pull staphb/prokka:latest
docker run staphb/prokka:latest prokka -h

Singularity

singularity build prokka.sif docker://staphb/prokka:latest
singularity exec prokka.sif prokka -h

Ubuntu/Debian/Mint

sudo apt-get install libdatetime-perl libxml-simple-perl libdigest-md5-perl git default-jre bioperl
sudo cpan Bio::Perl
git clone https://github.com/tseemann/prokka.git $HOME/prokka
$HOME/prokka/bin/prokka --setupdb

Centos/Fedora/RHEL

sudo yum install git perl-Time-Piece perl-XML-Simple perl-Digest-MD5 perl-App-cpanminus java perl-CPAN perl-Module-Build
sudo cpanm Bio::Perl
git clone https://github.com/tseemann/prokka.git $HOME/prokka
$HOME/prokka/bin/prokka --setupdb

MacOS

sudo cpan Time::Piece XML::Simple Digest::MD5 Bio::Perl
git clone https://github.com/tseemann/prokka.git $HOME/prokka
$HOME/prokka/bin/prokka --setupdb

Test

Type prokka and it should output its help screen.
Type prokka --version and you should see an output like prokka 1.x
Type prokka --listdb and it will show you what databases it has installed to use.
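
If one of these checks fails, a missing helper program is a common cause. The sketch below is not part of Prokka; it is a small hypothetical shell helper, check_tools, that reports which of the named programs are on your PATH. The binary names in the usage comment are assumptions based on the Dependencies section; prokka --depends (see the command line options) lists the dependencies Prokka itself expects.

```shell
# check_tools: report whether each named program can be found on PATH.
# Returns non-zero if at least one program is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found    %s\n' "$tool"
    else
      printf 'MISSING  %s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example (binary names are assumptions, adjust to your installation):
#   check_tools prokka blastp hmmscan prodigal aragorn barrnap parallel tbl2asn
```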

Invoking Prokka

Beginner

# Vanilla (but with free toppings)
% prokka contigs.fa

# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
% cat PROKKA_yyyymmdd/*.txt

Moderate

# Choose the names of the output files
% prokka --outdir mydir --prefix mygenome contigs.fa

# Visualize it in Artemis
% art mydir/mygenome.gff

Specialist

# Have curated genomes I want to use to annotate from
% prokka --proteins MG1655.gbk --outdir mutant --prefix K12_mut contigs.fa

# Look at tabular features
% less -S mutant/K12_mut.tsv

Expert

# It's not just for bacteria, people
% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC contigs.fa

# Search for your favourite gene
% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less

Wizard

# Watch and learn
% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa

# Check to see if anything went really wrong
% less mydir/EHEC_06072012.err

# Add final details using Sequin
% sequin mydir/EHEC_06072012.sqn

NCBI Genbank submitter

# Register your BioProject (e.g. PRJNA123456) and your locus_tag prefix (e.g. EHEC) first!
% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa

# Check to see if anything went really wrong
% less PRJNA123456/EHEC-Chr1.err

# Add final details using Sequin
% sequin PRJNA123456/EHEC-Chr1.sqn

European Nucleotide Archive (ENA) submitter

# Register your BioProject (e.g. PRJEB12345) and your locus_tag prefix (e.g. EHEC) first!
% prokka --compliant --centre UoN --outdir PRJEB12345 --locustag EHEC --prefix EHEC-Chr1 contigs.fa

# Check to see if anything went really wrong
% less PRJEB12345/EHEC-Chr1.err

# Install and run Sanger Pathogen group's Prokka GFF3 to EMBL converter
# available from https://github.com/sanger-pathogens/gff3toembl
# Find the closest NCBI taxonomy id (e.g. 562 for Escherichia coli)
% gff3_to_embl -i "Submitter, A." \
    -m "Escherichia coli EHEC annotated using Prokka." \
    -g linear -c PROK -n 11 -f PRJEB12345/EHEC-Chr1.embl \
    "Escherichia coli" 562 PRJEB12345 "Escherichia coli strain EHEC" PRJEB12345/EHEC-Chr1.gff

# Download and run the latest EMBL validator prior to submitting the EMBL flat file
# from http://central.maven.org/maven2/uk/ac/ebi/ena/sequence/embl-api-validator/
# which at the time of writing is v1.1.129
% curl -L -O http://central.maven.org/maven2/uk/ac/ebi/ena/sequence/embl-api-validator/1.1.129/embl-api-validator-1.1.129.jar
% java -jar embl-api-validator-1.1.129.jar -r PRJEB12345/EHEC-Chr1.embl

# Compress the file ready to upload to ENA, and calculate MD5 checksum
% gzip PRJEB12345/EHEC-Chr1.embl
% md5sum PRJEB12345/EHEC-Chr1.embl.gz

Crazy Person

# No stinking Perl script is going to control me
% prokka \
        --outdir $HOME/genomes/Ec_POO247 --force \
        --prefix Ec_POO247 --addgenes --locustag ECPOOp \
        --increment 10 --gffver 2 --centre CDC  --compliant \
        --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
        --kingdom Bacteria --gcode 11 --usegenus \
        --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
        --evalue 1e-9 --rfam \
        plasmid-closed.fna

Output Files

Extension	Description
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk	This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna	Nucleotide FASTA file of the input contig sequences.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA).
.sqn	An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa	Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl	Feature Table file, used by "tbl2asn" to create the .sqn file.
.err	Unacceptable annotations - the NCBI discrepancy report.
.log	Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt	Statistics relating to the annotated features found.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
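
As a quick way to summarise a run, the .tsv file can be tallied by feature type. This is a minimal sketch, assuming the first line of the file is a header row and that ftype is the second column, as in the column order listed above. The function name count_ftypes is hypothetical.

```shell
# count_ftypes: print "<ftype><TAB><count>" for each feature type in a Prokka .tsv file.
# Assumes line 1 is a header and column 2 holds the feature type (ftype).
count_ftypes() {
  awk -F'\t' 'NR > 1 { n[$2]++ } END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Example:
#   count_ftypes mydir/mygenome.tsv
```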

Command line options

General:
  --help            This help
  --version         Print version and exit
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix (default 'PROKKA')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDBJ compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --prodigaltf [X]  Prodigal training file (default '')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
Computation:
  --fast            Fast mode - skip CDS /product searching (default OFF)
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Option: --proteins

The --proteins option is recommended when you have good quality reference genomes and want to ensure gene naming is consistent. Some species use specific terminology which will be often lost if you rely on the default Swiss-Prot database included with Prokka.

If you have Genbank or Protein FASTA file(s) that you want to annotate genes from as the first priority, use --proteins myfile.gbk. Please make sure the file has a recognisable extension like .gb or .gbk, or auto-detection will fail. Genbank is recommended over FASTA because it provides /gene and /EC_number annotations that a typical .faa file will not, unless you have specially formatted it for Prokka.

Option: --prodigaltf

Instead of letting prodigal train its gene model on the contigs you provide, you can pre-train it on some good closed reference genomes first using the prodigal -t option. Once you've done that, provide prokka the training file using the --prodigaltf option.

Option: --rawproduct

Prokka annotates proteins by using sequence similarity to other proteins in its database, or the databases the user provides via --proteins. By default, Prokka tries to "clean" the /product names to ensure they are compliant with Genbank/ENA conventions. The main things it does are:

sets vague names to hypothetical protein
normalizes terms like possible, probable, predicted, ... to putative
removes EC, COG and locus_tag identifiers

Full details can be found in the cleanup_product() function in the prokka script. If you feel your annotations are being ruined, try using the --rawproduct option, and please file an issue if you find an example of where it is "behaving badly" and I will fix it.

Databases

The Core (BLAST+) Databases

Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchical approach to make it fast.
A small, core set of well characterized proteins is searched first using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.

The three core databases, applied in order, are:

ISfinder: Only the transposase (protein) sequences; the whole transposon is not annotated.
NCBI Bacterial Antimicrobial Resistance Reference Gene Database: Antimicrobial resistance genes curated by NCBI.
UniProtKB (SwissProt): For each --kingdom we include curated proteins that (i) are from Bacteria (or Archaea or Viruses); (ii) are not "Fragment" entries; and (iii) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.

Making a Core Database

If you want to modify these core databases, the included script prokka-uniprot_to_fasta_db, along with the official uniprot_sprot.dat, can be used to generate a new database to put in /opt/prokka/db/kingdom/. If you add new ones, the command prokka --listdb will show you whether it has been detected properly.

The Genus Databases

⚠️ This is no longer recommended. Please use --proteins instead.

If you enable --usegenus and also provide a Genus via --genus then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka --listdb to see what they are.

Adding a Genus Database

If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called prokka-genbank_to_fasta_db to help. For example, if you had four annotated "Coccus" genomes, you could do the following:

% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
% makeblastdb -dbtype prot -in Coccus
% mv Coccus.p* /path/to/prokka/db/genus/

The HMM Databases

Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in /opt/prokka/db/hmm. Type prokka --listdb to confirm they are recognised.

FASTA database format

Prokka understands two annotation tag formats, a plain one and a detailed one.

The plain one is a standard FASTA-like line with the ID after the > sign, and the protein /product after the ID (the "description" part of the line):

>SeqID product

The detailed one consists of a specially encoded four-part description line. The parts are the /EC_number, the /gene code, the /product, and the COG, separated by a special "~~~" sequence:

>SeqID EC_number~~~gene~~~product~~~COG

Here are some examples. Note that not all parts need to be present, but the "~~~" should still be there:

>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase~~~COG1234
MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
>YP_492697.1 ~~~traB~~~transfer complex protein TraB~~~
MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
>YP_492694.1 ~~~~~~transposase~~~
MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*

The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:

NAME  PRK00001
ACC   PRK00001
DESC  2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase~~~COG1234
LENG  284
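
To check that a custom database follows this format before handing it to Prokka, the description line can be split on the "~~~" separator. The following is only an illustrative sketch, not Prokka's own parser. The function name parse_prokka_header is hypothetical, and it assumes the sequence ID ends at the first space. A plain header (no "~~~") is treated as having a product only.

```shell
# parse_prokka_header: split one FASTA header line into tab-separated fields:
#   id, EC_number, gene, product, COG
# Missing parts come out as empty fields.
parse_prokka_header() {
  printf '%s\n' "$1" | awk '
    {
      line = $0
      sub(/^>/, "", line)
      sp = index(line, " ")
      if (sp == 0) { id = line; desc = "" }
      else         { id = substr(line, 1, sp - 1); desc = substr(line, sp + 1) }

      # Plain format: the whole description is the product.
      ec = ""; gene = ""; product = desc; cog = ""

      # Detailed format: EC_number~~~gene~~~product~~~COG
      if (index(desc, "~~~") > 0) {
        n = split(desc, part, /~~~/)
        ec = part[1]; gene = part[2]; product = part[3]
        cog = (n >= 4 ? part[4] : "")
      }
      printf "%s\t%s\t%s\t%s\t%s\n", id, ec, gene, product, cog
    }'
}

# Example:
#   parse_prokka_header ">YP_492697.1 ~~~traB~~~transfer complex protein TraB~~~"
```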

FAQ

Where does the name "Prokka" come from?
Prokka is a contraction of "prokaryotic annotation". It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.
Can I annotate my eukaryote genome with Prokka?
No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
Why does Prokka keep on crashing when it gets to the "tbl2asn" stage?
It seems that the tbl2asn program from NCBI "expires" after 6-12 months, and refuses to run. Unfortunately you need to install a newer version, which you can download from NCBI.
The hmmscan step seems to hang and do nothing?
The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the --gnu option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change parallel to parallel --gnu.
Why does prokka fail when it gets to hmmscan?
Unfortunately HMMER keeps changing its database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to "re-press" the files. This can be done as follows:

cd /path/to/prokka/db/hmm
mkdir new
for D in *.hmm ; do hmmconvert $D > new/$D ; done
cd new
for D in *.hmm ; do hmmpress $D ; done
mv * ..
rmdir new

Why can't I load Prokka .GBK files into Mauve?
Mauve uses BioJava to parse GenBank files, and it is very picky about them. It does not like long contig names, like those from Velvet or SPAdes; one solution is to use --centre XXX in Prokka, which renames all your contigs to be NCBI (and Mauve) compliant. It also does not like the ACCESSION and VERSION strings that Prokka produces via the "tbl2asn" tool. The following Unix command will remove them: egrep -v '^(ACCESSION|VERSION)' prokka.gbk > mauve.gbk
How can I make my GFF not have the contig sequences in it?

sed '/^##FASTA/Q' prokka.gff > nosequence.gff
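
The Q command used above is a GNU sed extension, so it may not work with other sed implementations (for example the default sed on MacOS). A portable alternative, sketched here with awk, stops printing at the ##FASTA line. The function name strip_gff_fasta is hypothetical.

```shell
# strip_gff_fasta: print a GFF file up to, but not including, the "##FASTA" line.
strip_gff_fasta() {
  awk '/^##FASTA/ { exit } { print }' "$1"
}

# Example:
#   strip_gff_fasta prokka.gff > nosequence.gff
```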

Bugs

Submit problems or requests to the Issue Tracker.

Changes

Read the release notes
Read the ChangeLog.txt
Look at the Github commits

Citation

Seemann T.
Prokka: rapid prokaryotic genome annotation
Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063

Dependencies

Mandatory

BioPerl
Used for input/output of various file formats
Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.
GNU Parallel
A shell tool for executing jobs in parallel using one or more computers
O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.
BLAST+
Used for similarity searching against protein sequence libraries
Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.
Prodigal
Finds protein-coding features (CDS)
Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.
TBL2ASN
Prepare sequence records for Genbank submission
Tbl2asn home page

Aragorn
Finds transfer RNA features (tRNA)
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6.
Barrnap
Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmer.
Manuscript under preparation.
HMMER3
Used for similarity searching against protein family profiles
Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.

Optional

minced
Finds CRISPR arrays
Minced home page
RNAmmer
Finds ribosomal RNA features (rRNA)
Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.
SignalP
Finds signal peptide features in CDS (sig_peptide)
Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.
Infernal
Used for similarity searching against ncRNA family profiles
D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011.

Licence

GPL v3

Author

Torsten Seemann
Web: https://tseemann.github.io/
Twitter: @torstenseemann
Blog: The Genome Factory

prokka's Issues

tbl2asn: no .gbk or .sqn files

I'm sorry to bother you with this issue here, but despite installing a new version of tbl2asn and being able to call it from the command line, we've not been able to produce .gbk files using prokka. An example final set of lines from the .log follow:

[21:15:14] Writing outputs to /home/cooper/Vaughn/tmp/pneumo/ref//
[21:15:17] Generating annotation statistics file
[21:15:17] Generating Genbank and Sequin files
[21:15:17] Running: tbl2asn -V b -a r10k -l paired-ends -M n -N 1 -y 'Annotated using prokka 1.10 from http://www.vicbioinformatics.com' -Z /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.err -i /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.fsa 2> /dev/null
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//errorsummary.val
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.dr
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.fixedproducts
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.ecn
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.val
[21:15:17] Output files:
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.faa
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.tbl
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.txt
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.log
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.fsa
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.fna
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.err
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.gff
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.ffn

minced version upgrade error

Hi,

I recently queued Prokka for a large dataset, split into 7-8 parts so I could run them all together. After they all finished, my next script to extract output data failed for some runs. Upon some investigation I saw that there were a few failed runs with the log entry -
" Prokka needs minced 1.6 or higher. Please upgrade and try again. "

I am not sure why this error came up for only some runs while the others finished successfully. It was an extremely small fraction of the total runs, and I just queued those contigs again - but I thought worthwhile to mention it here, in case someone else had a similar issue.

Thanks!
Chandni

tbl2asn fails silently with pipe >> | << character in contig names

James Doonan: I couldn't get the full complement of files from the prokka output for two of my whole genome sequences. I discovered that it was to do with the contig names. The whole genome contig was called >scf7180000000002|quiver. The output from prokka was missing the genbank and sequin files. When I changed the contig name to just '>scf71' it gave me all the files as output.
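
The workaround in this report (renaming the contigs before annotation) can be applied to a whole FASTA file in one pass. This is a minimal sketch, not a Prokka feature. The function name sanitize_contig_names is hypothetical, and replacing the pipe with an underscore is an arbitrary choice.

```shell
# sanitize_contig_names: replace every "|" in FASTA header lines with "_".
# Sequence lines are passed through unchanged.
sanitize_contig_names() {
  sed '/^>/ s/|/_/g' "$1"
}

# Example:
#   sanitize_contig_names contigs.fa > contigs.renamed.fa
```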

Support .GBK/.GFF for --proteins option

Instead of having to prepare a .faa file from it manually, perhaps support within prokka.

For GBK would be simple to run "prokka-genbank_to_fasta_db" from within prokka.

Select Barrnap or RNAmmer

I have both Barrnap and RNAmmer installed, Prokka detects both, and seems to use Barrnap by default. How do I select which is used?

[14:36:48] Looking for 'barrnap' - found /usr/local/bin/barrnap
[14:36:48] Determined barrnap version is 0.4
[14:36:49] Looking for 'rnammer' - found /usr/local/bin/rnammer
[14:36:49] Determined rnammer version is 1.2
[14:36:49] Predicting Ribosomal RNAs
[14:36:49] Running Barrnap with 4 threads

prokka-genbank_to_fasta_db does not use the correct translation table

I've tried to make a local database of a genome from candidate division SR1, which uses translation table 25: http://www.ncbi.nlm.nih.gov/nuccore/CP006913

However, it seems like prokka-genbank_to_fasta_db does not use the correct translation table (it is encoded in the genbank file).

Checksum for tarball

It would be useful if there was an md5 checksum for the stable-release tarball, for verifying the download from http://www.vicbioinformatics.com/software.prokka.shtml.

Changing annotations to Hypothetical Protein

Hi again,
I was going through the prokka script as well as the log file, and I noticed that some of the annotations change themselves to Hypothetical protein, even though they don't look like they are annotated as "Hypothetical Protein". I could not find a suitable explanation for the same in the script. Can you help me out with this and let me know why it is changing some particular annotations and making them hypothetical?

Thanks!
Chandni

Include more info from minced in the CRISPR annotations

Lizzy Wilbanks has left a new comment on your post "Prokka - rapid prokaryotic annotation":

Thanks for this great tool! So useful!! One thing that might be a nice addition for future releases would be providing more of the information from minced about the CRISPR regions - maybe as a separate output file? I've been re-running this to get the locations of the direct repeats and spacer sequences.

Posted by Lizzy Wilbanks to The Genome Factory at 31 July 2014 04:26

signalp 3.0b version checking problem

Peter Cock ‏@pjacock Mar 26

prokka 1.8 dependency version checking unhappy with 3.0b (as in a,b,c not beta):

$ signalp -v
3.0b, Dec 2005

Exception: Bad end parameter

Running prokka 1.9, with --metagenome option.

Prokka falls down with same bad end parameter exception on two separate contigs from two separate assemblies.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (5209). End must be less than the total length of sequence (total=5208)
STACK: Error::throw
STACK: Bio::Root::Root::throw /srv/sw/cpan-modules/lib/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeq.pm:432
STACK: Bio::PrimarySeq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeq.pm:387
STACK: Bio::Seq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/Seq.pm:630
STACK: Bio::PrimarySeqI::trunc /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeqI.pm:435
STACK: /srv/sw/prokka/1.9/prokka-1.9/bin/prokka:1054

E.g. Troublesome contig:

707_L1_merged_contig_150143
CGTATAAAGGCATTGCTTGCTGAATTTATGAATCCGGAATATGGGGTTGAAAATGTTCGTCCTTATTCGCCAAGTCAGCAAGAAATATTGCGGATTTATGAGGATACGGTTTTGAAAGGGGAAGAACAGATTCCGGAAGATATAGATGTAATATTGAAAAAATTCAATAATAGCAAACTACCGACAAAATCAGAGTTTTTGCGTTATAAATTATGGTTGGAACAGAAGTATCGTTCGCCTTATACCGGTGAGTTGATACCTTTGGGAAAATTGTTTACGGCTGCGTATGAGATAGAACATATAATTCCTCAATCTCGTTATTTTGATGATTCTTTTTCTAACAAGGTGATATGTGAATCTGCTGTGAATAAATTGAAAGATAATCAATTGGGGTATGAGTTTATCAAGAATCATCACGGGCAGAAAGTTGAAGTGGGTTTTGGAAAAACGGTAGAAATTCTTTCTGTGGATAGCTACGAATGTTTTGTAAAAGAACAATATGCTAAATCGGGCGTGAAAATGAAGAAATTGTTGATGGATGATATTCCCGAGCAATTTATTGAGCGCCAATTGAACGATAGCCGGTATATCAGCAAGGTTGTTAAAGGGCTTTTGTCGAATATTGTTCGTGAAAAGAATGATAGCGGTGAATATGAGCCGGAGGCTGTTTCAAAAAATATATTAGTTTGTACGGGAAGCGTGACGGACAGGCTGAAAAAGGATTGGGGGATGAATGATGTTTGGAACAGTATTGTATATCCTCGTTTTGAACGTTTAAACGCTTTGACTGGAACACAGTGCTTTGGGCATTGGGAGAATAAAGATGGAAAAAAAGTTTTTCAGACGGAATTGCCCCTTGAATATCAGAAAGGGTTTAGTAAGAAACGTATTGACCATAGGCATCATGCCATGGATGCAATAGTGATAGCTTGCGCTACGCGGAATCATGTGAACTATTTGAGCAATGAGTCTGCAAGCCGTAATGCCAAAATCTCCCGTTATGATTTGCAGAGATTGTTGTGTGATAAGAGCAGAGTAGATGGTACTGGTAATTATAGATGGATTATAAAGAAACCATGGAATACTTTTACACAAGATGCAAGGGAGGCATTGGATAAAATAGTGATTAGCTCGAAGCAGAATTTGCGTATAATAAATAAAACAACTAATATTTATCAACATTTTGATACAGAAGGAAATCGTGTTTATAAGAAACAGGAAACCGGTGATAGTTGGGCTATTCGTAAACCGATGCATAAAGATACGGTTTTTGGAACAGTGAATTTACGAAAAGTAAAAAGTGTACGATTGTCTGTGGCTTTGGATACTCCTACCATGATTGTTGATAAGAGAGTGAAAGGCAAGGTTCTTGAATTGTTATCATATAAATATGATAAGAAGAAAATTGAAAAATATTTCAAAGAGAATGTTTTCTTTTGGAAGGATTTGGATATAGCTAAAGTTGCAGTCTATTATTTTACAGAAAATACTTCTGAACCTTTGGTTGCGGTGCGTAAACCACTTGATTCTACTTTCAATGAGAAGAAAATAAAAGAATCGGTAACGGATACTGGCATACAGAAAATTCTTTTGAATCATTTATCTGCAAAAGAAGGAAAGACGGATTTGGCTTTTTCTGCAGAAGGAATAGAAGAAATGAATCGTAATATTTTACAGTTGAATGATGGAAAAGAACATCAGCCAATATATAAAGTGAGAGTGTATGAACCACGTGGAAATAAATTTAGAGTTGGTGCATTTGGTAATAAAGGGACTAAATGGGTGGAAGCCGCTAAGGGTACTAATTTGTTCTTTGCTATTTATGCAACAGAAGATGGAAAAAGGACGTATGAGACTGTCCCCTTAAATTTGGTTATAGAACGTGAGAAACAAGGGCTTATTCCTGTTCCGGATAGGAACGAAAAAGGGGATAAACTGTTGTTTTGGTTATCTCCTAATGATTTGGTGTATCTGCCAACTGAAGAAGAACGGGAATTTGG
TAGGATAAATGAGCCGATAGATAGGGGGCGGGTTTATAAAATGGTAAGTTGTACTGGGAATGAGGGACATTTTATTCCTGTAAATGTGGCTAATCCAATATTGCCGACTATTGAATTAGGAAGTAATAATAAGGCCCAGAGAGCATGGAATAATGAAATGGTAAAAGATATTTGTATCCCAGTAAAAGTTGATAGATTGGGTCGTATTATAGAAGTTAAGTATAAAGCAAATGAATAATATAAAGTTATTTCAAGAAAAGAAAATCCGTTCCATGTGGAACGAAGAAGAGCAGCAATGGTACTTTTCTGTTGTTGATGTAGTTGGTGTATTGACTGATAGCGTGAATCCTACGGACTATCTGAAGAAGATGAGAAAACGGGATGAAGAACTGGCTACTTACCTGGGGACAAATTGTCCCCAGGTAGAAATGCTGACAGATACAGGAAAAAAAAGAAAAACTTTGGCGGCAAATGTACAGGCTTTATTCCGTATCATTCAATCCATCTCCTCTCCTAAAGCTGAACCTTTTAAACTTTGGCTGGCACAGGTGGGGTATGAGCGTGTGCAGGAAATTGAAAATCCGGAATTGGCTCAGGAACGCATGAAAGAACTTTATGAGCAGAAGGGTTATCCAAAGGATTGGATTGATAAACGTCTGAGAGGAATTGCCATTCGTCAGAATTTGACGGATGAGTGGAAAGAAAGGGGAATCACGGATGCCATTCTTACGGCAGAAATATCTAAGGCAACGTTTGGATTAAGCCCTTCGGATTATAAAATATATAAAGGACTGACAAAGAAGAATCAGAATCTTCGTGACCATATGTCCGATTTGGAATTGATATTCACGATGCTTGGCGAGCGTGTCACTACGGAAATCTCTCAGAAAGAGAAACCGGATACATTTACTAAAAGTAAACAAGTTGCACAGCGTGGTGGAAATGTTGCCGGAGTAGCACGTGAACAGGCTGAAAAAGAACTGGGTAGAAGTATTATTTCTTCCGACAATTTTTTGTTGGATTCAGATAAGCAAGATGATACCTTAAAACTTCCTTTTGAGGAAAATGATGAATGAATAATTTGTAAAATCTGTATACTATGATTAAGAAAACGCTTTATTTCGGAAATCCTGTTTATCTCTCTTTGAAAAATGCTCAGTTGGTGATTAAATTGCCGGAGGTCGTAAAAAGCTGTGCTTTGCCCGAAGGGTTCAAGCAAGTGTCTGAGGTGACTAAGCCAATAGAGGATATTGGGATAGTGGTATTGGATAATAAACAGATAACTGTTACTTCGGGAGTGTTGGAGGCTTTACTTGAAAATAATTGTGCAGTCATAACTTGTGACTCTAAAAGTATGCCGGTTGGTCTGATGCTTCCTTTGTATGGAAATACTACACAAAATGAGAGGTTTCGACAGCAACTTGGCGCTTCTCTGCCATTGATGAAACAACTTTGGCAGCAAACGATAAAGGCTAAAATAGAAAATCAGGCGGCGGTATTGAGTAAATGTACTGGAGAGGAAATAAAGTGTATGAAGATATGGGCTGCTGATGTGAAAAGTGGAGATCCGGATAACTTGGAGGCTCGTGCAGCTGCTTATTATTGGAAAAATTTGTTCAAAATAAAAGGTTTTACAAGAGATAGAGAAGGTATTCCACCTAATAATCTGTTGAATTATGGGTATGCTATTTTGCGGGCGGTCGTTGCCCGTGGTTTGGTTGCAAGTGGACTTTTACCTACTTTGGGAATACATCATCATAATCGTTATAATGCTTATTGTTTGGCGGATGATATAATGGAGCCTTATCGCCCCTATGTGGATAGGTTGGTATATGATATGATTAAAGGAGAAGAAATAAATTGTATTGGATTGACAAAAGAATTGAAAGCACAGCTGCTTACTATTCCTACGTTGGATACTATTATTTCGGGAAAACGTAGTCCGTTGATGGTGGCTGTTGGGCAGACTACGGCTTCTCTATATAAATGTTTTAGCGGTG
AGTTACGCAGAATATCTTATCCGGAGATGTAATGGAACGGTTTAGTGAATATCGGATTATGTGGGTACTTGTATTGTTTGATTTGCCAACCGAAACAAAAAAAGATAAAAAGGCATATGCGGACTTTAGAAAAAATCTGCAAAAGGATGGATTTACGATGTTTCAATTTTCTATATATGTTCGCCATTGCGCAAGTAGTGAGAATGCGGAGGTACATATAAAAAGAGTTAAGTCTATTTTGCCTGAGCACGGAAGTATTGGAATAATGTGTATTACAGATAAACAATTTGGAAATATAGAACTTTTTTATGGGAAAAAAACAGTAGATGTGAATACTCCCGGGCAGCAGTTAGAACTATTCTGAAAAGAAAATCCCGCTATATAGCGGGATTTCTTTCTTGGAAACTATATCTTTTTTAAATTCTAATGTTTAATATAACTGTATGTATATTAGTTTGTTACTGATGTTCGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTGATACTTTCTTTGTCTTTCATCTTTTAACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTCGCAAAGAACAGCAACGATAAAATGATTGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAAGTTAATCCCAATTCGCTTAATCCTTTGTGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAAACATTGGACGCTTGAAGCAAAGTACAGGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACCAGGAGAAACGGAGAAAAACCGGCATATATGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACGGGATAATGCCATTTATCCTGAAACTAACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACATGTTGATTACGGATGCAAAATTAGACGATGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAATATGCTTTTTGATAATAATAGTTGGACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTCCTTAACTTCATCAAACTTATCTGCCGTTACTGTTTTCTATGGTTCAAAGATACTAAAATGAAAGCAAATCACAA

'--proteins' reference introduces too many "paralogs"

I noticed that if you use a reference genome for annotation with option '--proteins', lots of false paralogs will be annotated (gene names which are numerated with an underscore '_'; see hash %collide in lines https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L985-1013).

For annotation Prokka just takes the best BLASTP hit (lowest evalue). Might it be useful to have some more options to include the possibility of subject/query coverage/identity cutoffs, in addition to option '--evalue'?

For this purpose BioPerl includes blast HSP tiling via the 'frac*' methods, which can be used to skip BLASTP results which don't satisfy these restraints, e.g.:
$hit->frac_identical('query')
$hit->frac_aligned_hit
These could be included in the BLASTP parsing routine (line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L921 onwards).
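
As a rough illustration of the kind of cutoff being requested, the sketch below filters tabular BLASTP hits on identity and approximate query coverage. It is written in Python rather than Prokka's Perl, and it assumes BLAST was run with `-outfmt "6 qseqid sseqid pident length qlen slen evalue"`, a column layout chosen for this example and not necessarily what Prokka uses. The function name `passes_cutoffs`, its thresholds and the two hits are hypothetical.

```python
def passes_cutoffs(line, min_identity=80.0, min_coverage=0.8):
    """Return True if a tabular BLASTP hit meets both cutoffs.

    Expects the columns: qseqid sseqid pident length qlen slen evalue
    (an assumed layout for this sketch, not Prokka's actual BLAST call).
    """
    fields = line.rstrip("\n").split("\t")
    pident = float(fields[2])
    aln_len = int(fields[3])
    qlen = int(fields[4])
    # Alignment length includes gaps, so this only approximates query coverage.
    coverage = min(aln_len / qlen, 1.0)
    return pident >= min_identity and coverage >= min_coverage


# Two made-up hits for the same query protein of length 300.
hits = [
    "q1\tsubjA\t95.0\t290\t300\t310\t1e-150",  # long, near-identical hit
    "q1\tsubjB\t98.0\t60\t300\t900\t1e-20",    # short local match only
]
kept = [h.split("\t")[1] for h in hits if passes_cutoffs(h)]
print(kept)
```

With an e-value cutoff alone both hits would be accepted; the coverage check is what rejects the short local match.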

prokka running always using rnammer

I am running Prokka to annotate several genomes. It worked well until now, but suddenly it has started looking for rnammer, which I do not have installed, even though I did not select the flag -rnammer.
Nothing changes if I type the flag -rnammer.
Shouldn't it use barrnap (which I have installed and running) as the default?

Mitochondrial mode for plants

prokka --kingdom mito sets the genetic code to metazoa mt (5) and enables the metazoan mitochondrial tRNA mode of Aragorn (aragorn -mt). Is --kingdom mito intended specifically for metazoa, or should it also work for plants?

https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L274

Check database indexes exist before spawning searches

If --setupdb failed for some reason, prokka will still attempt to run BLAST, HMMER, etc., giving a strange error. Best to check all is well before starting.
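
A minimal sketch of such a pre-flight check, in Python rather than Prokka's Perl. It relies on the file suffixes that makeblastdb writes for a single-volume protein database (.phr, .pin, .psq) and that hmmpress writes (.h3f, .h3i, .h3m, .h3p). The helper name `missing_index_files` is hypothetical, and multi-volume BLAST databases, which use numbered suffixes, are not handled.

```python
import os
import tempfile

BLAST_SUFFIXES = (".phr", ".pin", ".psq")        # written by makeblastdb -dbtype prot
HMM_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")  # written by hmmpress


def missing_index_files(db_path, suffixes):
    """Return the index files that should sit next to db_path but do not exist."""
    return [db_path + s for s in suffixes if not os.path.exists(db_path + s)]


# Demonstration on a temporary directory holding a partly built BLAST index.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "sprot")
    for suffix in (".phr", ".pin"):
        open(db + suffix, "w").close()
    missing = [os.path.basename(p) for p in missing_index_files(db, BLAST_SUFFIXES)]

print(missing)
```

Running a check like this for every database before the first search would turn the strange downstream error into a clear message pointing back at --setupdb.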

Problem running prokka on isolate genome

Hi,

this is a little feature request.

I have the following genome Abiotrophia defectiva ATCC 49176 (s.a. http://www.ncbi.nlm.nih.gov/genome/?term=txid592010[Organism:noexp]) in fasta format and wanted to run prokka on it for test purposes.
However, I get the following error when running the following command: prokka --notrna --norrna --cpus 1 Abiotrophia_defectiva_ATCC_49176.fasta with the prokka-binary directory being in my PATH.

[17:25:50] Loading and checking input file: Abiotrophia_defectiva_ATCC_49176.fasta
[17:25:50] Wrote 20 contigs
[17:25:50] Skipping tRNA search at user request.
[17:25:50] Disabling rRNA search: --kingdom=Bacteria or --norrna=1
[17:25:50] Skipping ncRNA search, enable with --rfam if desired.
[17:25:50] Total of 0 tRNA + rRNA features
[17:25:50] Predicting coding sequences
[17:25:50] Contigs total 629 bp, so using meta mode
[17:25:50] Running: prodigal -i PROKKA_09042014/PROKKA_09042014.fna -c -m -g 11 -p meta -f sco -q
[17:26:17] Found 1875 CDS
[17:26:17] Connecting features back to sequences
[17:26:17] Option --gram not specified, will NOT check for signal peptides.
[17:26:17] Not using genus-specific database. Try --usegenus to enable it.
[17:26:17] Annotating CDS, please be patient.
[17:26:17] Will use 1 CPUs for similarity searching.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (834). End must be less than the total length of sequence (total=629)
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/users/claczny/perl5/lib/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::subseq /home/users/claczny/perl5/lib/perl5/Bio/PrimarySeq.pm:452
STACK: Bio::Seq::subseq /home/users/claczny/perl5/lib/perl5/Bio/Seq.pm:630
STACK: Bio::PrimarySeqI::trunc /home/users/claczny/perl5/lib/perl5/Bio/PrimarySeqI.pm:458
STACK: Bio::SeqFeature::Generic::seq /home/users/claczny/perl5/lib/perl5/Bio/SeqFeature/Generic.pm:705
STACK: /work/projects/ecosystem_biology/local_tools/prokka-1.7/bin/prokka:754
-----------------------------------------------------------

I found the line [17:25:50] Contigs total 629 bp, so using meta mode suspicious. After looking into this, I found out that it appears to be related to the fasta headers. For this genome, the fasta header of the first contig is >Abiotrophia defectiva ATCC 49176 : ACIN03000001 (and similar for the other contigs). After replacing the whitespaces with underscores, prokka appears to run nicely through. The corresponding line now says [17:31:08] Contigs total 2041839 bp, so using single mode, which appears to be correct.
Hence, I suspect prokka needs unique fasta headers (which is not the case here).
Accordingly, I think it would be a useful feature to integrate or extend the input format check and let the user know when the fasta headers are not unique.

Looking forward to your comments.

Best,

Cedric

[EDIT]
The above is with respect to prokka-1.7. I installed prokka-1.10 now and discovered that some input format check is applied that was apparently not in the prior version (1.7) -> [11:11:27] WARNING: Contig IDs must be less than 38 characters for Genbank compliance - Abiotrophia_defectiva_ATCC_49176_:_ACIN03000001. I do not know though if there is a check already integrated as suggested above (uniqueness of IDs).
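
The uniqueness check suggested above could look roughly like the following Python sketch. It takes the contig ID to be the text between '>' and the first whitespace, which is how BioPerl and most other FASTA parsers separate the ID from the description. The function name `duplicate_fasta_ids` is hypothetical, and the second header is made up to mirror the "similar for the other contigs" description.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return IDs that occur on more than one header line, in first-seen order."""
    ids = [ln[1:].split()[0] for ln in lines if ln.startswith(">") and ln[1:].strip()]
    counts = Counter(ids)
    return [seq_id for seq_id in dict.fromkeys(ids) if counts[seq_id] > 1]


fasta = [
    ">Abiotrophia defectiva ATCC 49176 : ACIN03000001",
    "ACGT",
    ">Abiotrophia defectiva ATCC 49176 : ACIN03000002",  # invented second contig
    "TTGA",
]
print(duplicate_fasta_ids(fasta))

# Replacing spaces with underscores, as described above, makes the IDs unique.
fixed = [ln.replace(" ", "_") if ln.startswith(">") else ln for ln in fasta]
print(duplicate_fasta_ids(fixed))
```

Every contig in the original file collapses to the ID "Abiotrophia", which would explain how 20 contigs were reported as a 629 bp total.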

COORDINATES: qualifier for Infernal/Aragorn output

I have been using Prokka to annotate de novo generated whole genome sequences of bacteria, based on species or a trusted database of proteins. I use the GBK output of Prokka to import the genome sequence into Artemis, where I do tweaks to the annotation, such as missed pseudogenes, for instance. I save the files as EMBL flat files for submission to ENA/SRA. Before submission I run the EnaValidator.jar to check for issues with the EMBL file. During these checks, it gives an error that turns out to be because of a space after the " COORDINATES: " qualifier. When I remove this in Artemis manually, the error is gone. I don't know where in the Prokka pipeline this space gets inserted, but it would be helpful to fix this (if possible).

Bug for option '--hypo' in 'prokka-genbank_to_fasta_db'

Option --hypo in prokka-genbank_to_fasta_db doesn't work.

Line 41 (https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka-genbank_to_fasta_db#L41) has to be changed from:

next if $prod eq 'hypothetical protein';

to:

next if !$hypo and $prod eq 'hypothetical protein';

Option to keep intermediate files

I'd like to inspect the .bls file for debugging (trying to track down some missing genes found by MAKER-P). I've disabled delfile( $faa_name, $bls_name);. A command line option for this purpose would be useful.

https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L934

Equivalent option for HMMs like --proteins for BLASTP ?

New --hmms option to prioritise a custom HMM (like --proteins does for BLASTP)

Connor Driscoll

Gene name attribute from --proteins evidence

The genes annotated using the --proteins evidence don't get gene= attributes in the GFF file. My FASTA file of proteins is formatted like so:

>psbK photosystem II protein K
MPVMLNIFLDDAFIYSNNIFFGKLPEAYAISDPIVDVMPIIPVLSFLLAFVWQAAVSFR
>psbI photosystem II protein I
MLTLKLFVYTVVIFFISLFIFGFLSNDPGRNPGRKE
>ycf12 hypothetical protein
MNLEVIAQLTVLTLTVVSGPLVIVLLAVRKGNL

Proper /inference when --proteins has gi/gb/ref ID

/inference="similar to AA sequence:trusted.faa:gi|302750786|gb|ADL64963.1|"

should be :Genbank:ADLXXXXX.1

Missing sequence ID on 'gene' features (via Chris Fields)

Hi Torsten! Got something for you re: Prokka. I have a small bug fix, but it’s not worth a fork if you have the time.

BTW, are the Prokka scripts available on Github? Just curious...

We’re running Prokka 1.8 (BTW, great tool!) using the following:

prokka --locustag 'CBEIJ_B593' --gram pos \
  --cpus $PBS_NUM_PPN \
  --genus Clostridium \
  --species beijerinckii \
  --strain B593 \
  --addgenes \
  --mincontiglen 200 \
  --centre 'CBC' \
  --rfam \
  454Scaffolds.fna.GC2

Everything looks fine except the GFF; the reference seq ID for the added ‘gene’ feature looks like this:

gnl|CBC|contig000001 Prodigal:2.60 CDS 378 1526 . - 0 ID=CBEIJ_B593_00001;gene=mlc;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P50456;locus_tag=CBEIJ_B593_00001;product=Making large colonies protein;protein_id=gnl|CBC|CBEIJ_B593_00001
SEQ prokka gene 378 1526 . - 1 gene=mlc;locus_tag =CBEIJ_B593_00001
gnl|CBC|contig000001 Prodigal:2.60 CDS 1717 3219 . - 0 ID=CBEIJ_B593_00002;eC_number=2.7.1.17;gene=xylB_1;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P35850;locus_tag=CBEIJ_B593_00002;product=Xylulose kinase;protein_id=gnl|CBC|CBEIJ_B593_00002
SEQ prokka gene 1717 3219 . - 1 gene=xylB_1;locus_tag =CBEIJ_B593_00002
…

(note the replacement of the reference with ‘SEQ’). It’s easy enough to fix on my end, as the generic ‘SEQ’ comes from Bio::SeqFeature::Generic when no seq_id is present, just need to pass the seq_id along. Starting at line 957 in the main prokka script:

if ($addgenes) {
  # make a 'sister' gene feature for the CDS feature
  # (ideally it would encompass the UTRs as well, but we don't know them)
  my $g = Bio::SeqFeature::Generic->new(
    -primary    => 'gene',
    -seq_id     => $f->seq_id,  # <---
    -start      => $f->start,
    -end        => $f->end,
    -strand     => $f->strand,
    -source_tag => $EXE,
    -tag        => { 'locus_tag '=> $ID },
  );

chris

Overly long locustag/prefix results in bad GenBank LOCUS lines

Prokka 1.10 can produce broken GenBank output with over-long identifiers in the LOCUS lines.

Sample input (anonymised since this is for a collaborator):

/opt/prokka-1.10/bin/prokka --outdir XYZ123draft2_prokka --prefix XYZ123draft2 --locustag XYZ123draft2 --compliant --kingdom Bacteria --gram neg --genus ... --species ... --strain XYZ123 --quiet XZY123_draft2.fasta

Sample output:

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig000001119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig000002170983 bp   DNA   linear       19-AUG-2014
...

The LOCUS identifier is too long for the strict GenBank format, and there is no white space between the (truncated) identifier and the sequence length, meaning for example Biopython complains.

Possible output with truncation (not ideal) to ensure a white space would be something like this:

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig0000 1119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig0000 2170983 bp   DNA   linear       19-AUG-2014
...

Possible output abusing the LOCUS line (also not ideal, but some parsers will cope):

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig000001 1119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig000002 2170983 bp   DNA   linear       19-AUG-2014
...

Expected output: Fail early complaining about the overly long identifiers which will cause problems, specifying which option should be changed.

Remove doc/LICENSE.TIGRFAMs

Since TIGRFAMs HMM has been removed in Prokka 1.9, its license file can also be deleted.

Support for --kingdom ALL for mixed metagenomes

Allow Kingdom=ALL or ANY for metagenomes [Andreas Bremges]

Tag stable releases

Tagging stable releases could be a good way to download the Prokka code without downloading the large databases.

Circular genome

The first line of the genbank file indicates the genome is linear. The default should be circular for bacteria (perhaps with a linear override option?).

LOCUS       205522                129078 bp    DNA     linear       20-MAY-2014

should be

LOCUS       205522                129078 bp    DNA     circular       20-MAY-2014

prokka --setupdb should check binaries

I think that the check of versions of tools and the PATH extension with $BINDIR should be done before running setup_db sub:

$ ./bin/prokka --setupdb
[19:59:35] Cleaning databases in /tmp/prokka-1.9/bin/../db
[19:59:35] Cleaning complete.
[19:59:35] Making kingdom BLASTP database: /tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot
[19:59:35] Running: makeblastdb -dbtype prot -in '/tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot' -logfile /dev/null
sh: 1: makeblastdb: not found
[19:59:35] Could not run command: makeblastdb -dbtype prot -in '/tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot' -logfile /dev/null

Also the item "use included binary if PATH one is wrong version [Simon Gladman]" from TODO in doc/ChangeLog.txt would be helpful, since having a wrong version of hmmpress in the PATH leads to this error:

$ ./bin/prokka --setupdb
...
[20:01:03] Pressing HMM database: /tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm
[20:01:03] Running: hmmpress '/tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Error: File /tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm does not appear to be in a recognized HMM format.

[20:01:03] Could not run command: hmmpress '/tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

New/Customized HMM Databases

Hi,
I've been using Prokka a bit with the default options and databases. I recently added the vFAM HMM database in /opt/prokka/db/hmm. After indexing, it was recognized successfully (prokka --listdb).
However, upon running Prokka, I see (from the log) that hmmer3 runs only for the default HMM databases (Pfam,CLUSTERS,HAMAP). Is there any way to confirm that the Prokka run actually used the new database?
Also, we work mostly on metagenomics projects so we are really looking forward to the kingdom=ALL option from the To-Do list.

Thank you,
Chandni

prokka and parallel parallel-20140222

HI.
Working with Prokka, really nice package and super-easy to run.

I noticed a small bug: when using Prokka with parallel-20140222 installed I got an error during the annotation step, this:
[16:05:51] Could not run command: cat MyAnnotation_MyGenomeproteins.faa | parallel --gnu -j 4 --block 166030 --recstart ...........

Launching the command outside the pipeline, I found that parallel was crashing. This is the message:

parallel: Error: -g has been retired. Use --group.
parallel: Error: -B has been retired. Use --bf.
parallel: Error: -T has been retired. Use --tty.
parallel: Error: -U has been retired. Use --er.
parallel: Error: -W has been retired. Use --wd.
parallel: Error: -Y has been retired. Use --shebang.
parallel: Error: -H has been retired. Use --halt.
parallel: Error: --tollef has been retired. Use -u -q --arg-sep -- and --load for -l.

so I reverted back to parallel-20130422 and now everything seems to work properly, even inside the prokka pipeline.

Maybe this is of some help.
best
m.

Marco Fondi, PhD
Dep. of Biology, University of Florence
Via Madonna del Piano 6, S. Fiorentino, Florence, Italy
Tel. +39 055 4574736

Could I use prokka with scaffolds?

Hello everybody,

I'd like to use prokka and I have scaffolds of a draft genome. Could I use prokka to annotate it or should I use contigs?

Best Regards,

Daniel

sig_peptide coordinates still in protein space

Signal peptide annotation is incorrect. Amino acid coordinates are used without x3 multiplication.
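
For reference, the conversion being asked for can be sketched as follows (Python, not Prokka's actual code). It assumes 1-based inclusive coordinates and a signal peptide at the N-terminus of the protein. The function name `sig_peptide_span` and the example coordinates are made up.

```python
def sig_peptide_span(cds_start, cds_end, strand, peptide_len_aa):
    """Map an N-terminal signal peptide onto nucleotide coordinates.

    cds_start/cds_end are 1-based inclusive genome coordinates of the CDS,
    strand is +1 or -1, peptide_len_aa is the cleavage position in residues.
    """
    nt_len = peptide_len_aa * 3  # three nucleotides per amino acid
    if strand >= 0:
        # Forward strand: the N-terminus is at the low-coordinate end.
        return cds_start, cds_start + nt_len - 1
    # Reverse strand: the N-terminus is at the high-coordinate end.
    return cds_end - nt_len + 1, cds_end


print(sig_peptide_span(100, 400, +1, 20))
print(sig_peptide_span(100, 400, -1, 20))
```

On the reverse strand the peptide has to be anchored at the CDS end coordinate, so multiplying by three alone is not enough there.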

Yevgeny Nikolaichik

CLUSTERS.hmm corrupted in the tarball?

Greetings
When updating to Prokka 1.9 and running 'prokka --setupdb', I got this error message and the setup aborted:

[16:37:51] Running: hmmpress '/home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Error: File /home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm does not appear to be in a recognized HMM format.

[16:37:51] Could not run command: hmmpress '/home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Any idea how to solve this? Is it a corrupted file? I downloaded the tarball just a few minutes ago. Thanks in advance!

Batch run issues

Thanks for the great software! I have several files to be processed. Running PROKKA on them either serially individually or in batches of say 10 or 50 or 100 often results in partially completed outputs (> 80% of inputs are incomplete). The most common error is:

Could not run command: cat ~/proteins.faa | parallel --gnu -j 8 --block 943 --recstart '>' --pipe hmmscan --noali --notextw --acc -E 1e-06 --cpu 1 ~/tools/prokka/prokka-1.10/bin/../db/hmm/CLUSTERS.hmm /dev/stdin > ~/proteins.bls 2> /dev/null

Output directories usually have only the final *fna completed.

Any suggestions? Many thanks for your time and efforts.

parallel version

I am sorry to bother you with this question. Is there a problem with the new parallel 20141022 version? Prokka does not recognise its version number correctly and always asks me to update it, even though I am using the most up-to-date version. Should I downgrade it? Which is the preferred parallel version?
Thanks for any help!

Could not run command: makeblastdb -dbtype prot

I previously installed prokka in Biolinux8 and everything worked well.
I have now had to create a new Biolinux account and tried to reinstall prokka-1.10.
Everything worked but when I try

prokka --setupdb

I got the following error:
manager@bl8vbox[lib] prokka --setupdb [12:01PM]
[12:02:05] Cleaning databases in /usr/local/lib/prokka-1.10/bin/../db
[12:02:05] Cleaning complete.
[12:02:05] Looking for 'makeblastdb' - found /usr/bin/makeblastdb
[12:02:06] Determined makeblastdb version is 2.2
[12:02:06] Making kingdom BLASTP database: /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot
[12:02:06] Running: makeblastdb -dbtype prot -in /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot -logfile /dev/null
[12:02:06] Could not run command: makeblastdb -dbtype prot -in /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot -logfile /dev/null

suggestions?

Included aragorn OSX binary hangs

aragorn binary in OS X distribution doesn’t work (Prokka hangs at tRNA prediction stage), at least on my mac (with OS X 10.9.4). Recompiling aragorn from the source fixes this.

Yevgeny Nikolaichik

*.faa result files methionine/stop codon

The resulting .faa files from Prokka include stop codons '*', and proteins with atypical (non-ATG) start codons don't start with methionine (neither is NCBI standard). This is remedied by changing line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L1101:

$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode) );

to:

$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode, -complete => 1) );

See BioPerl HOWTO: http://www.bioperl.org/wiki/HOWTO:Beginners#Translating

Namespace collisions with default contig ID naming

Having just used Prokka 1.8 on several strains I am left with *.fna and *.gbk (etc) files with ambiguous identifiers like gnl|PROKKA|contig000001 which appear in all my strains.

Referring to http://www.ncbi.nlm.nih.gov/genomes/static/Annotation_pipeline_README.txt (linked to in the Prokka script - thank you) the NCBI say:

The fasta file should look like this:
 >gnl|center|<ID1> [organism=<ORGANISM NAME STRAIN NAME>] [strain=<STRAIN NAME>] [gcode=11]
 <NUCLEOTIDE SEQUENCE>

NOTE: The |center|<ID1> part of the header must be less than 38 characters

An example of a fasta header for the Bacterium bacterius 253 is:

>gnl|LrgU|Contig01 [organism=Bacterium bacterius 253] [strain=253] [gcode=11]

I propose that rather than using gnl|$centre|contig%06d where the ID is just contig000001 etc, Prokka prefixes this (if a prefix is specified, and under 38 - 6 = 32 characters). This prefix could be a new command line option, or perhaps reuse the existing strain or locus tag prefix?

e.g. I would like to be able to request gnl|PROKKA|XXX_contig000001 and gnl|PROKKA|YYY_contig000001 for strains XXX and YYY.
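
A small Python sketch of the proposed naming scheme, with the length limit from the NCBI note enforced up front. It reads "less than 38 characters" as applying to the `centre|ID` string, which is only one interpretation of the quoted README. The function name `contig_id` and its parameters are hypothetical.

```python
def contig_id(centre, prefix, number, max_len=37):
    """Build a gnl|centre|prefix_contigNNNNNN identifier.

    Raises ValueError if "centre|ID" exceeds max_len characters (assumed
    reading of the NCBI "less than 38 characters" rule).
    """
    local = f"{prefix}_contig{number:06d}" if prefix else f"contig{number:06d}"
    checked = f"{centre}|{local}"
    if len(checked) > max_len:
        raise ValueError(f"'{checked}' is {len(checked)} characters; limit is {max_len}")
    return f"gnl|{centre}|{local}"


print(contig_id("PROKKA", "XXX", 1))
print(contig_id("PROKKA", "YYY", 1))
print(contig_id("PROKKA", "", 1))
```

An empty prefix reproduces the current `gnl|PROKKA|contig000001` form, so the default behaviour would not change.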

Missing short genes

Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.

Translation table 25 is not supported

Prokka does not support translation table 25: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG25

Prokka fails when --outdir has spaces in it

I have been very lazy and not been shell-quoting my pipe commands. Bad programmer.
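
To illustrate the failure mode (in Python rather than the Perl that Prokka is written in): an unquoted path containing spaces is split into several shell words, while a quoted one survives as a single argument. The directory name below is hypothetical; the prodigal flags are the ones that appear in Prokka's log output.

```python
import shlex

outdir = "my results/run 1"   # hypothetical output directory containing spaces
fna = f"{outdir}/PROKKA.fna"

unsafe = f"prodigal -i {fna} -f sco -q"
safe = f"prodigal -i {shlex.quote(fna)} -f sco -q"

# shlex.split mimics how a POSIX shell would break each command into words.
print(shlex.split(unsafe))
print(shlex.split(safe))
```

In the unquoted command the tool would receive `my` as its input file and treat the remaining fragments as stray arguments.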

Rfam Update

I am a Prokka user; thanks for providing, maintaining and regularly updating Prokka. Could you please update the Rfam database to Rfam 12 in the next Prokka release? There seem to have been many changes compared to the earlier version of Rfam.

I also suggest that Prokka expose whichever file it uses for the ncRNA "Rfam analysis" (Rfam.cm or Rfam.fasta), so that people can easily change or update the Rfam database themselves and no longer need to wait for the next Prokka version. Thank you and have a nice day.

Prokka reorders contigs

If you give prokka a contig set ordered by reference, it reorders the contigs alphabetically in the output GenBank file. It would be nice if it preserved the original contig order, preferably without renaming the contigs (we submit contigs with GenBank-format-friendly names).

Improve the cleanup_product() function

This function makes lots of mistakes:

Bug: HI0933-like protein => -like protein
Bug: IS1251-like transposase => -like transposase
Bug: transcription termination factor Rho => hypothetical protein
Bug: xx kDa SS-A/Ro ribonucleoprotein homolog => hypothetical protein
[12:09:23] Modify product: conserved protein with nucleoside triphosphate hydrolase domain => hypothetical protein
[12:09:23] Modify product: 23S rRNA m(2)G2445 methyltransferase => 23S rRNA m(2) methyltransferase
[12:09:23] Modify product: DNA replication terminus site-binding protein => hypothetical protein
[12:09:24] Modify product: conserved inner membrane protein => hypothetical protein
[12:09:24] Modify product: 16S ribosomal RNA m2G1207 methyltransferase => 16S ribosomal RNA methyltransferase
[12:09:25] Modify product: hypothetical protein TTC0453 => hypothetical protein
[12:09:25] Modify product: type VI secretion protein, VC_A0107 family => type VI secretion protein, family
[12:09:26] Modify product: conserved hypothetical pathogenicity island protein => hypothetical protein
[12:09:26] Modify product: IS1400 transposase B => transposase B
[12:09:26] Modify product: Dyp-type peroxidase family => Dyp-type peroxidase family protein

Minor .gbk file issues

The LOCUS entries in the .gbk file don't put a space in between the name of the contig (in this case) and its size, which means BioPerl gets upset when it tries to read the file.
Additionally, it would be helpful if there was a default entry for the ACCESSION and VERSION fields, just a '.' would do, as other programs read the file incorrectly when these entries are blank (such as pmauve).

Output a .PTT file for the CDS features

Andrew Buultjens requests PTT file output: https://www.biostars.org/p/16405/

Perl Exceptions

I installed Prokka and it ran fine. I then wanted to add SignalP, and since then it has all gone wrong! I thought I needed to add a Perl module and may have updated something through cpan; now I get:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file 'minced -gff 'PROKKA_07152014/PROKKA_07152014.fna' |': No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.18.2/Bio/Root/Root.pm:449
STACK: Bio::Root::IO::_initialize_io /usr/local/share/perl/5.18.2/Bio/Root/IO.pm:270
STACK: Bio::Tools::GFF::new /usr/local/share/perl/5.18.2/Bio/Tools/GFF.pm:200
STACK: /usr/local/bin/prokka:589

The same exception occurs for barrnap.

Both programs seem to run ok on their own.

Any suggestions?

Many thanks

Support prodigal 2.7 (git head)

Prodigal 2.7 has unfortunately changed the command line options in a non-compatible manner. -m was renamed to -n and -p was renamed to -m.

2.6	2.7
-m	-n
-p meta	-m anon
-p single	-m normal
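
One way to support both versions is to choose the flags from the detected version number. The Python sketch below encodes only the mapping in the table above. The function name `prodigal_flags` and the version-tuple convention are hypothetical, and the 2.7 option names are taken from this report rather than checked against a Prodigal release.

```python
def prodigal_flags(version, metagenome):
    """Return [mask_flag, mode_flag, mode_value] for a (major, minor) version.

    Mapping taken from the 2.6 vs 2.7 table in this report.
    """
    if version >= (2, 7):
        mask_flag, mode_flag = "-n", "-m"
        mode = "anon" if metagenome else "normal"
    else:
        mask_flag, mode_flag = "-m", "-p"
        mode = "meta" if metagenome else "single"
    return [mask_flag, mode_flag, mode]


print(prodigal_flags((2, 6), metagenome=True))
print(prodigal_flags((2, 7), metagenome=True))
print(prodigal_flags((2, 7), metagenome=False))
```

Because `-m` means different things in the two versions, passing the old flags to the new binary would not necessarily fail outright, which is why branching on the version is safer than trying one form and falling back.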