Converts Prokka GFF3 files to EMBL files for uploading annotated assemblies to EBI

Home Page: http://sanger-pathogens.github.io/gff3toembl/

gff3toembl's Introduction

GFF3toEMBL

Converts GFF3 files from Prokka into a format suitable for submission to EMBL.

Introduction
Installation
Usage
- Example data
License
Feedback/Issues
Citation
Known Issues

Introduction

Submitting annotated genomes to EMBL is a difficult and time-consuming process. This software converts GFF3 files from Prokka, the most commonly used prokaryote annotation tool, into a format that is suitable for submission to EMBL. It has been used to prepare more than 30% of all annotated genomes in EMBL/GenBank.

N.B. This implements some EMBL-specific conventions and is not a generic conversion tool. It is also not a validator, so you need to pass in parameters that are acceptable to EMBL.

Installation

GFF3toEMBL has the following dependencies:

Required dependencies

Genometools

There are a number of ways to install GFF3toEMBL and details are provided below. If you encounter an issue when installing GFF3toEMBL please contact your local system administrator. If you encounter a bug please log it here or email us at [email protected].

Docker

A Docker container is provided with all of the dependencies set up and installed. To install the container:

docker pull sangerpathogens/gff3toembl

To run the script from within the container on test data (substituting /home/ubuntu/data for your own directory):

docker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/gff3toembl gff3_to_embl --output_filename /data/output_file.embl ABC 123 PRJ1234 ABC /opt/gff3toembl-1.1.0/gff3toembl/tests/data/single_feature.gff

From source

This is for advanced users. The homebrew recipe, the Dockerfile and the TravisCI install dependencies script all contain steps to set up the dependencies and install the software, so they may be worth consulting for hints.

Install genometools including python bindings
git clone git@github.com:sanger-pathogens/gff3toembl.git
python setup.py install

Running the tests

Run python setup.py test

Usage

usage: gff3_to_embl [-h] [--authors AUTHORS] [--title TITLE]
                    [--publication PUBLICATION] [--genome_type GENOME_TYPE]
                    [--classification CLASSIFICATION]
                    [--output_filename OUTPUT_FILENAME]
                    [--locus_tag LOCUS_TAG]
                    [--translation_table TRANSLATION_TABLE]
                    [--chromosome_list CHROMOSOME_LIST] [--version]
                    organism taxonid project_accession description file

Converts prokaryote GFF3 annotations to EMBL for ENA submission. Cite
http://dx.doi.org/10.21105/joss.00080

positional arguments:
  organism              Organism
  taxonid               Taxon id
  project_accession     Accession number for the project
  description           Genus species subspecies strain of organism
  file                  GFF3 filename

optional arguments:
  -h, --help            show this help message and exit
  --authors AUTHORS, -i AUTHORS
                        Authors (in the EMBL RA line style)
  --title TITLE, -m TITLE
                        Title of paper (in the EMBL RT line style)
  --publication PUBLICATION, -p PUBLICATION
                        Publication or journal name (in the EMBL RL line
                        style)
  --genome_type GENOME_TYPE, -g GENOME_TYPE
                        Genome type (linear/circular)
  --classification CLASSIFICATION, -c CLASSIFICATION
                        Classification (PROK/UNC/..)
  --output_filename OUTPUT_FILENAME, -f OUTPUT_FILENAME
                        Output filename
  --locus_tag LOCUS_TAG, -l LOCUS_TAG
                        Overwrite the locus tag in the annotation file
  --translation_table TRANSLATION_TABLE, -n TRANSLATION_TABLE
                        Translation table
  --chromosome_list CHROMOSOME_LIST, -d CHROMOSOME_LIST
                        Create a chromosome list file, and use the supplied
                        name
  --version             show program's version number and exit

An example:

gff3_to_embl --authors 'John' --title 'Some title' --publication 'Some journal' \
             --genome_type 'circular' --classification 'PROK' \
             --output_filename /tmp/single_feature.embl --translation_table 11 \
             Organism 1234 'My project' 'My description' gff3toembl/tests/data/single_feature.gff

Example data

The directory 'example_data' contains an input GFF file and the output file along with the command.

License

GFF3toEMBL is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues to the issues page or email [email protected].

Citation

If you use this software please cite:

GFF3toEMBL: Preparing annotated assemblies for submission to EMBL
Andrew J. Page, Sascha Steinbiss, Ben Taylor, Torsten Seemann, Jacqueline A. Keane
The Journal of Open Source Software, 1 (6) 2016. doi: 10.21105/joss.00080

Known Issues

This doesn't work with some versions of Genometools on Mac OS X; it appears to work with Genometools 1.5.4

gff3toembl's Issues

Should unescape %2C in GFF column 9 as comma in EMBL

See http://www.sequenceontology.org/gff3.shtml

Prokka produces GFF files which correctly escape a comma in the product as %2C. For example, 1,6-anhydro-N-acetylmuramyl-L-alanine amidase AmpD in GFF becomes 1%2C6-anhydro-N-acetylmuramyl-L-alanine amidase AmpD:

gnl|JHI|xxx_contig000001    Prodigal:2.60   CDS 5020468 5021064 .   +   0   ID=xxx_04718;eC_number=3.5.1.28;gene=ampD;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P13016;locus_tag=xxx_04718;product=1%2C6-anhydro-N-acetylmuramyl-L-alanine amidase AmpD;protein_id=gnl|JHI|xxx_04718

Using gff3_to_embl it became the following in EMBL format:

FT   CDS             5020468..5021064
FT                   /product="1%2C6-anhydro-N-acetylmuramyl-L-alanine amidase
FT                   AmpD"
FT                   /inference="ab initio prediction:Prodigal:2.60"
FT                   /db_xref="UniProtKB/Swiss-Prot:P13016"
FT                   /locus_tag="xxx_04718"
FT                   /EC_number="3.5.1.28"
FT                   /gene="ampD"
FT                   /transl_table=11

This should use a comma as in the GenBank file from Prokka:

     CDS             5020468..5021064
                     /gene="ampD"
                     /locus_tag="xxx_04718"
                     /EC_number="3.5.1.28"
                     /inference="ab initio prediction:Prodigal:2.60"
                     /inference="similar to AA sequence:UniProtKB:P13016"
                     /codon_start=1
                     /transl_table=11
                     /product="1,6-anhydro-N-acetylmuramyl-L-alanine amidase
                     AmpD"
                     /protein_id="JHI:xxx_04718"
                     /translation=...
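
A minimal sketch of the decoding step, assuming the attributes column is split on its separators first and percent-decoded afterwards, as the GFF3 specification describes. The helper name parse_attributes is hypothetical and is not taken from the gff3toembl source; on Python 2.7 the equivalent unquote lives in urllib rather than urllib.parse.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 column 9 string into a dict of decoded value lists.

    Hypothetical helper for illustration only.
    """
    attributes = {}
    for pair in column9.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Split on literal commas BEFORE decoding, so that an escaped
        # comma (%2C) stays inside a single value.
        attributes[unquote(key)] = [unquote(item) for item in value.split(",")]
    return attributes


example = (
    "ID=xxx_04718;gene=ampD;"
    "inference=ab initio prediction:Prodigal:2.60,"
    "similar to AA sequence:UniProtKB:P13016;"
    "product=1%2C6-anhydro-N-acetylmuramyl-L-alanine amidase AmpD"
)
parsed = parse_attributes(example)
print(parsed["product"][0])
print(parsed["inference"])
```

Decoding after the split keeps the two inference values separate while the product is restored with its comma intact.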

Non-standard indentation of Python code

Most of the Python community has settled on four space indentation as described in PEP8 https://www.python.org/dev/peps/pep-0008/#indentation

Much/all of gff3toembl seems to be using two spaces. There are tools like autopep8 which can be used to automatically fix this.

Long author lists break instead of line wrapping

Further to #45 about how to format the authors, if the author list is too long, rather than wrapping it into multiple RA lines we get an error, e.g.

$ gff3_to_embl --authors 'Jagger M., Richards K., Watts C., Wood R., Jones B., Stewart I., Wyman B., Taylor M.' --title 'Some title' --publication 'Some journal' --genome_type 'circular' --classification 'PROK' --output_filename /tmp/single_feature.embl --translation_table 11 Organism 1234 'My project' 'My description' gff3toembl/tests/data/single_feature.gff
Traceback (most recent call last):
  File "/usr/local/bin/gff3_to_embl", line 4, in <module>
    __import__('pkg_resources').run_script('gff3toembl==1.0.1', 'gff3_to_embl')
  File "/Library/Python/2.7/site-packages/pkg_resources/__init__.py", line 745, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Library/Python/2.7/site-packages/pkg_resources/__init__.py", line 1677, in run_script
    exec(script_code, namespace, namespace)
  File "/Library/Python/2.7/site-packages/gff3toembl-1.0.1-py2.7.egg/EGG-INFO/scripts/gff3_to_embl", line 38, in <module>

  File "build/bdist.macosx-10.10-intel/egg/gff3toembl/EMBLWriter.py", line 96, in parse_and_run
  File "build/bdist.macosx-10.10-intel/egg/gff3toembl/EMBLWriter.py", line 45, in create_output_file
  File "build/bdist.macosx-10.10-intel/egg/gff3toembl/EMBLContig.py", line 25, in format
ValueError: Could not format contig, a line exceeded 80 characters in length
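
One possible way to wrap instead of failing, sketched under the assumption that continuation RA lines are acceptable to EMBL and that breaks should fall only between authors. format_ra_lines is a hypothetical helper, not part of gff3toembl.

```python
def format_ra_lines(authors, width=80):
    """Pack a comma-separated author string into RA lines of at most `width`.

    Hypothetical helper; a single name longer than the line is left unbroken.
    """
    prefix = "RA   "
    names = [name.strip() for name in authors.split(",") if name.strip()]
    lines, current = [], prefix
    for index, name in enumerate(names):
        # Every name is followed by a comma except the last, which ends in ';'.
        token = name + (";" if index == len(names) - 1 else ",")
        if current == prefix:
            current += token
        elif len(current) + 1 + len(token) > width:
            lines.append(current)
            current = prefix + token
        else:
            current += " " + token
    lines.append(current)
    return lines


authors = ("Jagger M., Richards K., Watts C., Wood R., Jones B., "
           "Stewart I., Wyman B., Taylor M.")
for line in format_ra_lines(authors):
    print(line)
```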

Can we pip install this?

No module named 'gt' from bioconda install

Installed via bioconda, got the following error:

Traceback (most recent call last):
  File "/scratch/schrat01/gff3_toembl_env/bin/gff3_to_embl", line 4, in <module>
    __import__('pkg_resources').run_script('gff3toembl==1.1.4', 'gff3_to_embl')
  File "/scratch/schrat01/gff3_toembl_env/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/scratch/schrat01/gff3_toembl_env/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1460, in run_script
    exec(script_code, namespace, namespace)
  File "/scratch/schrat01/gff3_toembl_env/lib/python3.7/site-packages/gff3toembl-1.1.4-py2.7.egg/EGG-INFO/scripts/gff3_to_embl", line 8, in <module>
  File "/scratch/schrat01/gff3_toembl_env/lib/python3.7/site-packages/gff3toembl-1.1.4-py2.7.egg/gff3toembl/EMBLWriter.py", line 3, in <module>
ModuleNotFoundError: No module named 'gt'

I created the environment with conda create -p gff3_toembl_env/ bioconda::gff3toembl.
Conda installed Python 3.7 in that environment. Does that work with the py2.7.egg?

(gff3_toembl_env)[schrat01]$ conda list                   
# packages in environment at /scratch/schrat01/gff3_toembl_env:                                          
#                                                                        
# Name                    Version                   Build  Channel       
_libgcc_mutex             0.1                        main                
ca-certificates           2019.5.15                     1                
certifi                   2019.6.16                py37_1                
genometools-genometools   1.5.10               h470a237_1    bioconda    
gettext                   0.19.8.1             hd7bead4_3                
gff3toembl                1.1.4                      py_1    bioconda    
libedit                   3.1.20181209         hc058e9b_0                
libffi                    3.2.1                hd88cf55_4                
libgcc-ng                 9.1.0                hdf63c60_0                
libstdcxx-ng              9.1.0                hdf63c60_0                
ncurses                   6.1                  he6710b0_1                
openssl                   1.1.1d               h7b6447c_1                
pip                       19.2.2                   py37_0                
python                    3.7.4                h265db76_1                
readline                  7.0                  h7b6447c_5                
setuptools                41.0.1                   py37_0                
six                       1.12.0                   py37_0                
sqlite                    3.29.0               h7b6447c_0                
tk                        8.6.8                hbc83047_0                
wheel                     0.33.4                   py37_0                
xz                        5.2.4                h14c3975_4                
zlib                      1.2.11               h7b6447c_3
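
A quick way to see whether the interpreter in that environment can find the Genometools bindings at all. This is a sketch assuming Python 3.4 or later (for importlib.util.find_spec); module_available is a hypothetical helper, and the only name taken from the report above is the gt module.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate module `name`."""
    return importlib.util.find_spec(name) is not None


# Run this with the same interpreter that runs gff3_to_embl.
print("Python %d.%d" % sys.version_info[:2])
print("gt importable:", module_available("gt"))
```

If this reports that gt is not importable, the bindings are missing for that interpreter regardless of which genometools package conda lists.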

Sequence header SQ line can exceed 80 characters

Testing gff3_to_embl on a bacterial assembly with lots of N bases generated this SQ line:

SQ   Sequence 5090820 BP; 1202378 A; 1343737 C; 1345174 G; 1198528 T; 1003 other;

This caused this slightly cryptic exception:

Traceback (most recent call last):
  File "/home/xxxx/bin/gff3_to_embl", line 4, in <module>
    __import__('pkg_resources').run_script('gff3toembl==1.1.0', 'gff3_to_embl')
  File "/home/xxxx/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/xxxx/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1511, in run_script
    exec(script_code, namespace, namespace)
  File "/home/xxxx/lib/python2.7/site-packages/gff3toembl-1.1.0-py2.7.egg/EGG-INFO/scripts/gff3_to_embl", line 38, in <module>
    
  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLWriter.py", line 96, in parse_and_run
  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLWriter.py", line 45, in create_output_file
  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLContig.py", line 26, in format
ValueError: Could not format contig, a line exceeded 80 characters in length

Simple hack for testing:

$ git diff
diff --git a/gff3toembl/EMBLContig.py b/gff3toembl/EMBLContig.py
index 4214a4e..15f8b8d 100644
--- a/gff3toembl/EMBLContig.py
+++ b/gff3toembl/EMBLContig.py
@@ -23,7 +23,9 @@ class EMBLContig(object):
     line_lengths = map(len, formatted_string.split('\n'))
     maximum_line_length = max(line_lengths)
     if maximum_line_length > 80:
-      raise ValueError("Could not format contig, a line exceeded 80 characters in length")
+      # raise ValueError("Could not format contig, a line exceeded 80 characters in length")
+      import sys
+      sys.stderr.write("WARNING: Exceeded 80 character per line limit\n")
     return formatted_string
 
   def add_header(self, **kwargs):

Sadly, whether and how to line-wrap the SQ line is not made explicit in ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

3.4.17  The SQ Line
The SQ (SeQuence header) line marks the beginning of the sequence data and 
Gives a summary of its content. An example is:
     SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other; 
As shown, the line contains the length of the sequence in base pairs followed
by its base composition.  Bases other than A, C, G and T are grouped 
together as "other". (Note that "BP" is also used for single stranded RNA
sequences, which is not strictly accurate, but has been used for consistency
of format.) This information can be used as a check on accuracy or for
statistical  purposes. The word "Sequence" is present solely as a marker for
readability.
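
The SQ line quoted in this report is 81 characters long, one over the limit. A less cryptic failure could name the offending line; the sketch below shows one way to find it. find_long_lines is a hypothetical helper and the 80-character limit is taken from the check in the diff above.

```python
def find_long_lines(formatted_string, limit=80):
    """Return (line_number, length, text) for each line longer than `limit`."""
    return [
        (number, len(line), line)
        for number, line in enumerate(formatted_string.split("\n"), start=1)
        if len(line) > limit
    ]


sq_line = ("SQ   Sequence 5090820 BP; 1202378 A; 1343737 C; "
           "1345174 G; 1198528 T; 1003 other;")
for number, length, text in find_long_lines(sq_line):
    print("line %d is %d characters long: %s" % (number, length, text))
```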

linuxbrew recipe needs updating

Current linuxbrew recipe only goes to version 1.0.9. Docker version is 1.1.0.

Document versions of Python supported

After spending some time on this, I have concluded that the tool currently only works or at least is only tested under Python 2.7.

There are incompatibilities under Python 2.6, e.g. using "blah {} {}".format(...) without numbering the insertion points, or use of "new" unittest methods like .assertIsInstance. This looks fixable with moderate effort, but probably not worthwhile? Instead make it clear if Python 2.6 is not supported.
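
For reference, the str.format difference mentioned above looks like this. The variable names are made up for illustration and are not taken from the gff3toembl source.

```python
organism, taxon_id = "Organism", 1234

# Auto-numbered fields need Python 2.7 or later; Python 2.6 raises a
# ValueError for this call.
auto_numbered = "{} {}".format(organism, taxon_id)

# Explicit indices work on Python 2.6, 2.7 and 3.
explicit = "{0} {1}".format(organism, taxon_id)

print(auto_numbered == explicit)
```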

There are incompatibilities under Python 3, such as clear syntax errors e.g. #39. This does seem worth fixing, #40.

Python 3 support

The current README does not say which versions of Python are supported. I would guess only Python 2.6 and 2.7? Support for Python 3 would be desirable.

See also #39 as a small step in this direction.

self.assertItemsEqual is Python 2.7 only

You could use self.assertCountEqual under Python 3 (see #40), or just replace it with explicit sorting and call the generic unittest method self.assertEqual instead?
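
Both options can be sketched in one test case. The class and method names below are hypothetical; the lookup falls back to assertItemsEqual only where assertCountEqual does not exist (Python 2.7).

```python
import unittest


class ItemsEqualCompat(unittest.TestCase):
    """Hypothetical test case showing two order-insensitive comparisons."""

    def assert_same_items(self, first, second):
        # assertCountEqual is the Python 3 name, assertItemsEqual the 2.7 name.
        check = getattr(self, "assertCountEqual", None)
        if check is None:
            check = getattr(self, "assertItemsEqual")
        check(first, second)

    def test_order_is_ignored(self):
        self.assert_same_items(["CDS", "tRNA", "gene"], ["gene", "CDS", "tRNA"])

    def test_explicit_sorting(self):
        # Works everywhere, provided the items can be sorted.
        self.assertEqual(sorted(["tRNA", "CDS"]), sorted(["CDS", "tRNA"]))
```

Explicit sorting is the simpler choice when the items are plain strings or numbers; the helper is only needed for items that cannot be ordered.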

I have added this to my brew repo (updated to 1.0.9)

brew tap homebrew/science
brew tap tseemann/homebrew-bioinformatics-linux
brew install gff3toembl

Installation and running problems

Hello all,

I'm having problems installing and running gff3toembl. First of all, installation of python-genometools does not make python gt module available - any ideas why?

sudo apt-get install python-genometools
python -c 'import gt; print(gt)'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named gt

Second, when I try to run the tool from the docker container, I get an error message that I have no idea how to debug:

fopen(): cannot open file '/home/apredeus/EMBL/ENA_D23/D23580_liv.prokka.gff_fixed.gff': No such file or directory
Failed to sort and tidy gff file with GT

I would appreciate any help.

v1.0.0 vs v1.0

It's quite confusing to have two versions tagged that differ by .0. Consider deleting one of the two tags.

Cannot open *.gff_fixed.gff: No such file or directory

I installed GFF3toEMBL via docker and this error appears when running the tool:

irene@irene-VirtualBox:~$ docker run --rm -it -v /home/irene:/data sangerpathogens/gff3toembl gff3_to_embl --authors "Ortega, I." --genome_type linear --classification PROK --output_filename H660_240122.embl --translation_table 11 "Campylobacter jejuni" 197 PRJEB50313 "Campylobacter jejuni strain H660" /home/irene/H660.gff
fopen(): cannot open file '/home/irene/H660.gff_fixed.gff': No such file or directory
Failed to sort and tidy gff file with GT

How can I solve it?
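
One thing worth checking, offered as an assumption rather than a confirmed diagnosis: with -v /home/irene:/data the host directory is visible inside the container as /data, so a host path such as /home/irene/H660.gff would not exist there. The sketch below translates a host path into the path the container would see; container_path is a hypothetical helper.

```python
import posixpath


def container_path(host_path, host_dir, mount_point="/data"):
    """Map a host path under `host_dir` to its location under `mount_point`."""
    relative = posixpath.relpath(host_path, host_dir)
    if relative == ".." or relative.startswith("../"):
        raise ValueError(
            "%s is not inside the mounted directory %s" % (host_path, host_dir)
        )
    return posixpath.join(mount_point, relative)


print(container_path("/home/irene/H660.gff", "/home/irene"))
```

The same reasoning would apply to --output_filename: the README's Docker example writes to a path under /data so that the result lands in the mounted directory.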

Bad line-breaks in long words, consider breaking at hyphenation

e.g. Prokka GFF file containing this in column 9:

product=2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase

The EMBL file from converting the GFF gave:

FT                   /product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridin
FT                   e pyrophosphokinase"

Notice this has inserted the line-break mid-word, which is bad.

The Prokka GBK file had:

                     /product="2-amino-4-hydroxy-6-
                     hydroxymethyldihydropteridine pyrophosphokinase"

Notice it broke on the hyphen, which is much better.
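
A sketch of a wrapper that prefers spaces and hyphens as break points. Python's textwrap does not help here, because its hyphen rule only breaks between letters and so skips "6-hydroxy...". wrap_qualifier is a hypothetical helper, and the width of 58 is an assumption read off the first FT line above.

```python
import re


def wrap_qualifier(text, width=58):
    """Greedy wrap that breaks at spaces or just after a hyphen.

    Hypothetical helper; a single chunk longer than `width` is left unbroken.
    """
    # Chunks end just after a hyphen or at whitespace, e.g. "amino-", "word ".
    chunks = re.findall(r"[^\s-]*-|[^\s-]+\s*|\s+", text)
    lines, current = [], ""
    for chunk in chunks:
        if current and len((current + chunk).rstrip()) > width:
            lines.append(current.rstrip())
            current = chunk.lstrip()
        else:
            current += chunk
    if current.strip():
        lines.append(current.rstrip())
    return lines


product = ('/product="2-amino-4-hydroxy-6-'
           'hydroxymethyldihydropteridine pyrophosphokinase"')
for line in wrap_qualifier(product):
    print(line)
```

For this product the sketch breaks after "6-", matching the Prokka GBK layout shown above.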

How to prepare a novel strain/isolate of a bacteria?

(Some months back I did this successfully to submit a new strain from a different genus, so while I might be doing something wrong/different, I suspect the ENA validator has become stricter in the meantime)

For an un-named Serratia which does not (yet) have a unique NCBI taxonomy entry - the parent would be Serratia, taxid 613,

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=613&lvl=3&lin=f&keep=1&srchmode=1&unlock

I have tried that, and the entry Serratia sp., taxid 616

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=616&lvl=5&lin=f&keep=1&srchmode=1&unlock

$ gff3_to_embl --authors "Other A.N." -m "Serratia sp. XYZ annotated using Prokka." -g circular -c PROK -l XYZ -n 11 -f XYZ.embl "Serratia sp." 616 PRJEB00000 "Serratia sp. XYZ" XZY.gff

Either taxid approach fails validation:

$ java -jar embl-api-validator-1.1.149.jar XYZ.embl
...
ERROR: Scientific_name "Serratia sp." is not submittable. (MasterEntrySourceCheck_2)  line: 1 of XYZ.embl
ERROR: At least one of the following qualifiers "strain, environmental_sample, isolate" must exist when organism belongs to Bacteria. (OrganismAndRequiredQualifierCheck)  line: 17 of XYZ.embl
...

Here line 17 was the source feature. Manually editing the EMBL file to add a strain qualifier to the feature worked for me, but what exactly it wants for species name eludes me.

Am I missing something simple?

[Update: Yes, I was not giving the full organism name to gff3_to_embl, but also there was a problem with this version of the validator]

Should gff3_to_embl have options for inserting source feature qualifiers "strain, environmental_sample, isolate" (or should I have done this in prokka)?

Thanks!

Failure to install on Biolinux

I tried to install via LinuxBrew, but this fails on fontconfig (see Homebrew/legacy-homebrew#45260). I then tried installing from git: I ran the install dependencies shell script and then python setup.py install. Setuptools is installed. The install itself appeared to go fine, but running the tool gives this error:

Traceback (most recent call last):
  File "/usr/local/bin/gff3_to_embl", line 5, in <module>
    pkg_resources.run_script('gff3toembl==1.0.1', 'gff3_to_embl')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1401, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/gff3toembl-1.0.1-py2.7.egg/EGG-INFO/scripts/gff3_to_embl", line 8, in <module>

  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLWriter.py", line 3, in <module>
ImportError: No module named gt

Where am I going wrong? Thanks :)

Does this cope with Prokka GFF3 files ok?

Does the GFF3 file that Prokka produces satisfy the input requirements for this script?

I would really like to ditch tbl2asn !

Format for listing authors not defined

The built-in help lists the --authors or -i argument:

$ gff3_to_embl -h
usage: gff3_to_embl [-h] [--authors AUTHORS] [--title TITLE]
...
  --authors AUTHORS, -i AUTHORS
                        Authors
...

This does not explain how to format multiple author lists (commas? and? first last?).

Experiments based on the example in the README which just uses --authors John show the string is inserted into the RA line as is, e.g.

$ gff3_to_embl --authors 'John Lennon, Paul McCartney, George Harrison, Ringo Starr' --title 'Some title' --publication 'Some journal' --genome_type 'circular' --classification 'PROK' --output_filename /tmp/single_feature.embl --translation_table 11 Organism 1234 'My project' 'My description' gff3toembl/tests/data/single_feature.gff
$ grep ^RA  /tmp/single_feature.embl 
RA   John Lennon, Paul McCartney, George Harrison, Ringo Starr;

I infer that we should be using Surname I., Surname I., Surname I., Other A.N. following published EMBL flat files. e.g.

$ gff3_to_embl --authors 'Lennon J., McCartney P., Harrison G., Starr R.' --title 'Some title' --publication 'Some journal' --genome_type 'circular' --classification 'PROK' --output_filename /tmp/single_feature.embl --translation_table 11 Organism 1234 'My project' 'My description' gff3toembl/tests/data/single_feature.gff
$ grep ^RA  /tmp/single_feature.embl
RA   Lennon J., McCartney P., Harrison G., Starr R.;

mac brew install does not work.

It seems the libpng dependency is out of date:

OSX and dylib

On OS X, setting an explicit path for DYLD_LIBRARY_PATH isn't a great idea.

Instead, you should be instructing people to just create a symlink for libgenometools.dylib in /usr/lib that points to wherever the file is located. This is simpler in my opinion and removes any chance that a user setting an explicit search path will screw up anything else on their system.

Change PROKKA to Prokka in repo description

Thanks :)