
geneidc's Introduction


Caveat: unofficial geneid repository

geneid (master branch)

  • version: 1.4.5+

Synopsis

Geneid predicts genes in eukaryotic genomes, using Hidden Markov Models to detect signals in the DNA sequence. Integrating predictions from multiple sources is also supported, but this feature is quite basic at this stage.

Installation, setup and basic usage of geneid are fairly easy. The command line options control the program's output and behaviour.

Installation

Requirements

Platforms/compilers

The program is written in ANSI C but has started moving towards the C11 standard. It compiles on multiple Linux/Unix systems with their default compilers. The lists below cover only a subset of platforms and compiler versions, to give an idea of the recent distributions on which the builds were tested.

  • Linux:
    • Manjaro 18.1 (M_18.1)
    • Ubuntu 18.04.1 (U_18.0)

Programs and libraries

Tested with the following compilers:

  • gcc:
    • 9.1.0 (M_18.1)
    • 7.4.0 (U_18.0)
  • clang: 8.0.1 (M_18.1)
  • cc (from Oracle Solaris Studio): 12.6 (M_18.1)
  • icc (Intel): 19.0.5.281 (U_18.0)

TODO

  • Python for testing (future)

Obtaining the sources

Options:

TODO:

  • Release
tar -xzvf geneid.tar.gz

Compilation

Go to the geneid directory and type:

#gcc
make

#clang
make --file=Makefile.clang

#Oracle CC
make --file=Makefile.oracle_cc

The executable is created as bin/geneid.

Testing

#help
./bin/geneid -h

#running with predictions

./bin/geneid  -GP param/dictyostelium.param ./samples/dict_1chr.fa

Usage

Memory requirements

With the default configuration, geneid requires:

  • 31 GB for a human-sized genome
  • 9 GB for the D. melanogaster genome (140 Mbp)

For estimating the virtual memory needed for geneid:

geneid -B genome.fa

If the required memory size exceeds the RAM + swap size, one can split the genome into individual chromosomes/contigs.
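Splitting a multi-FASTA file into one file per chromosome/contig can be done with any FASTA tool. The following is a minimal Python sketch, assuming that records start with `>` header lines and that the first whitespace-delimited token of each header is unique and usable as a file name. The function name `split_fasta` and the `.fa` output naming are illustrative and not part of geneid.

```python
import os


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file.

    Returns the list of files written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # New record: close the previous file, open the next one.
                    if out is not None:
                        out.close()
                    fields = line[1:].split()
                    name = fields[0] if fields else f"record_{len(written) + 1}"
                    # Assumption: sequence IDs are unique and, apart from "/",
                    # safe to use as file names.
                    path = os.path.join(out_dir, name.replace("/", "_") + ".fa")
                    out = open(path, "w")
                    written.append(path)
                # Lines before the first header are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Each resulting file can then be checked with `geneid -B` and run separately.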

Inputs

TODO

Outputs

TODO

Basic

To run geneid, type:

geneid -P parameter_filename genome.fasta
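When geneid is run on many sequences, it can help to assemble the command line in a script. The sketch below builds an argument list using only options documented in this README (`-P`, `-G`, `-j`, `-k`). The helper names `geneid_command` and `run_geneid`, the default executable path `bin/geneid`, and the assumption that predictions are written to standard output are illustrative and should be checked against your build.

```python
import subprocess


def geneid_command(fasta, param_file, gff=False, start=None, end=None,
                   executable="bin/geneid"):
    """Return the argv list for one geneid run (nothing is executed)."""
    cmd = [executable, "-P", param_file]
    if gff:
        cmd.append("-G")           # print predictions in GFF format
    if start is not None:
        cmd += ["-j", str(start)]  # begin prediction at this coordinate
    if end is not None:
        cmd += ["-k", str(end)]    # end prediction at this coordinate
    cmd.append(fasta)
    return cmd


def run_geneid(fasta, param_file, **options):
    """Run geneid and return whatever it wrote to standard output.

    Assumption: predictions go to stdout; a non-zero exit status raises
    subprocess.CalledProcessError.
    """
    cmd = geneid_command(fasta, param_file, **options)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.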

Typical

Setting environment variables

XXX Alternatively you can set the parameter file using the environment variable GENEID. XXX not explained

[Using existing parameter file on a new genome]

TBD

Advanced

TODO

  • Training
  • MEME
  • splice branches
  • other

Best practices

TODO

Parameter files

For gene prediction, geneid needs information about splice sites, nucleotide and k-mer frequencies, codon biases, etc. This is provided in species-specific .param files. One (??) .param file is used for an individual geneid prediction run. Using existing parameters from a species close to the one you are working with often gives satisfactory results.

To download individual species parameter files (compressed with XXX): TODO XXX dir with .param files XXX (if possible rename all of these according to some rule. Plus get more comments inside how & when these were created)

To download all/reviewed? current parameter files: XXX link to geneid_all_param.tar.gz

Creating param file for new species

TODO

Command line options

Short help

	geneid	[-bdaefitnxszru]
		[-TDAZU]
		[-p gene_prefix]
		[-G] [-3] [-X] [-M] [-m]
		[-WCF] [-o]
		[-j lower_bound_coord]
		[-k upper_bound_coord]
		[-N numer_nt_mapped]
		[-O <gff_exons_file>]
		[-R <gff_annotation-file>]
		[-S <gff_homology_file>]
		[-P <parameter_file>]
		[-E exonweight]
		[-V evidence_exonweight]
		[-Bv] [-h]
		<locus_seq_in_fasta_format>

Long help

geneid [flags] <locus_seq_in_fasta_format>

-b: Output Start codons
-d: Output Donor splice sites
-a: Output Acceptor splice sites
-e: Output Stop codons
-f: Output Initial exons
-i: Output Internal exons
-t: Output Terminal exons
-n: Output introns
-s: Output Single genes
-x: Output all predicted exons
-z: Output Open Reading Frames

-T: Output genomic sequence of exons in predicted genes
-D: Output genomic sequence of CDS in predicted genes
-A: Output amino acid sequence derived from predicted CDS
-p: Prefix this value to the names of predicted genes, peptides and CDS

-G: Use GFF format to print predictions
-3: Use GFF3 format to print predictions
-X: Use extended-format to print gene predictions
-M: Use XML format to print gene predictions
-m: Show DTD for XML-format output

-j <coord>: Begin prediction at this coordinate
-k <coord>: End prediction at this coordinate
-N <num_reads>: Millions of reads mapped to genome
-W: Only Forward sense prediction (Watson)
-C: Only Reverse sense prediction (Crick)
-U: Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
-r: Use recursive splicing
-F: Force the prediction of one gene structure
-o: Only running exon prediction (disable gene prediction)
-O <exons_filename>: Only running gene prediction (not exon prediction)
-Z: Activate Open Reading Frames searching
-R <exons_filename>: Provide annotations to improve predictions
-S <HSP_filename>: Using information from protein sequence alignments to improve predictions

-u: Turn on UTR prediction. Only valid with -S option: HSP/EST/short read ends are used to determine UTR ends
-E: Add this value to the exon weight parameter (see parameter file)
-V: Add this value to the score of evidence exons
-P <parameter_file>: Use other than default parameter file (human)

-B: Display memory required to execute geneid given a sequence
-v: Verbose. Display info messages
-h: Show this help
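Output produced with `-G` can be post-processed with a small parser. The sketch below is not taken from geneid. It assumes the output follows the usual tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, and an optional group column) with `#` comment lines, which should be verified against real geneid output. The names `GffRecord` and `parse_gff` are hypothetical.

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    group: str


def parse_gff(lines):
    """Yield GffRecord tuples from GFF text, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            # Assumption: lines with fewer than eight columns are not features.
            continue
        group = fields[8] if len(fields) > 8 else ""
        yield GffRecord(fields[0], fields[1], fields[2],
                        int(fields[3]), int(fields[4]),
                        fields[5], fields[6], fields[7], group)
```

For example, `[r for r in parse_gff(open("predictions.gff")) if r.strand == "+"]` would keep only forward-strand features, where `predictions.gff` is a placeholder file name.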


Help

Legal

License

GNU General Public License

Authors

Code:

  • Enrique Blanco
  • Roderic Guigo
  • Tyler Alioto

Parameter files:

  • Genis Parra
  • Francisco Camara

Contributions:

  • Josep F. Abril
  • Moises Burset
  • Xavier Messeguer

Code cleanup:

  • darked89

Citation

If used for published results, please cite: TODO

Geneid publications

TODO

geneidc's Issues

AddressSanitizer: strcpy-param-overlap: memory ranges src/BackupGenes.c

While running geneid compiled with gcc 9.1.0 with -fsanitize=address

./bin/geneid  -GP param/dictyostelium.param ./samples/dict_1chr.fa
=================================================================
==29498==ERROR: AddressSanitizer: strcpy-param-overlap: memory ranges [0x7fd8e204c5b5,0x7fd8e204c5b8) and [0x7fd8e204c5b5, 0x7fd8e204c5b8) overlap
    #0 0x7fd9d5c53552 in __interceptor_strcpy /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cc:429
    #1 0x560a23e5268c in backupExon src/BackupGenes.c:65
    #2 0x560a23e539ce in backupGene src/BackupGenes.c:123
    #3 0x560a23e54c61 in BackupArrayD src/BackupGenes.c:216
    #4 0x560a23e5197b in main src/geneid.c:532
    #5 0x7fd9d4f6aee2 in __libc_start_main (/usr/lib/libc.so.6+0x26ee2)
    #6 0x560a23e5046d in _start (/home/darked89/proj_soft/geneidc/crg_github/geneid/bin/geneid+0x1c46d)

0x7fd8e204c5b5 is located 3874229 bytes inside of 37337552-byte region [0x7fd8e1c9a800,0x7fd8e40361d0)
allocated by thread T0 here:
    #0 0x7fd9d5cc6ce8 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:153
    #1 0x560a23ea1738 in RequestMemoryDumpster src/RequestMemory.c:801
    #2 0x560a23e50afd in main src/geneid.c:268
    #3 0x7fd9d4f6aee2 in __libc_start_main (/usr/lib/libc.so.6+0x26ee2)

SUMMARY: AddressSanitizer: strcpy-param-overlap /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cc:429 in __interceptor_strcpy
==29498==ABORTING

valgrind: Conditional jump [] uninitialised value ReadIsochore (readparam.c:586)

clang 8.0.1 with:

OPTS=-I$(INCLUDE) -std=c11 -ggdb -g2 -march=native -Weverything -Wshadow -Wstrict-overflow -fno-strict-aliasing -Weverything

==11637== Conditional jump or move depends on uninitialised value(s)
==11637==    at 0x4A2DE5F: __printf_fp_l (in /usr/lib/libc-2.29.so)
==11637==    by 0x4A46050: __vfprintf_internal (in /usr/lib/libc-2.29.so)
==11637==    by 0x4A52583: __vsprintf_internal (in /usr/lib/libc-2.29.so)
==11637==    by 0x4A32987: sprintf (in /usr/lib/libc-2.29.so)
==11637==    by 0x13D95A: ReadIsochore (readparam.c:586)
==11637==    by 0x13F11E: readparam (readparam.c:1019)
==11637==    by 0x109649: main (geneid.c:276)

sprintf bugs in readparam.c

./src/readparam.c: In function ‘ReadProfileSpliceSites’:
./src/readparam.c:512:75: warning: ‘%s’ directive writing up to 1599 bytes into a region of size 1573 [-Wformat-overflow=]
  512 |                                 sprintf(mess, "Wrong format: profile name %s \n\tis not admitted for donors [only %s, %s, %s, %s, %s, %s or %s]",
      |                                                                           ^~
  513 |                                         header, sprofileDON, sprofileU12gtagDON, sprofileU12atacDON, sprofileU2gcagDON, sprofileU2gtaDON, sprofileU2gtgDON, sprofileU2gtyDON);
      |                                         ~~~~~~                             
./src/readparam.c:512:33: note: ‘sprintf’ output between 211 and 1810 bytes into a destination of size 1600
  512 |                                 sprintf(mess, "Wrong format: profile name %s \n\tis not admitted for donors [only %s, %s, %s, %s, %s, %s or %s]",
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  513 |                                         header, sprofileDON, sprofileU12gtagDON, sprofileU12atacDON, sprofileU2gcagDON, sprofileU2gtaDON, sprofileU2gtgDON, sprofileU2gtyDON);
      |                                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./src/readparam.c:425:71: warning: ‘%s’ directive writing up to 1599 bytes into a region of size 1573 [-Wformat-overflow=]
  425 |                             sprintf(mess, "Wrong format: profile name %s \n\tis not admitted for acceptors [only %s, %s, %s, %s, %s or %s]",
      |                                                                       ^~
  426 |                                     header, sprofileACC, sprofilePPT, sprofileBP, sprofileU12BP, sprofileU12gtagACC, sprofileU12atacACC);
      |                                     ~~~~~~                             
./src/readparam.c:425:29: note: ‘sprintf’ output between 217 and 1816 bytes into a destination of size 1600
  425 |                             sprintf(mess, "Wrong format: profile name %s \n\tis not admitted for acceptors [only %s, %s, %s, %s, %s or %s]",
      |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  426 |                                     header, sprofileACC, sprofilePPT, sprofileBP, sprofileU12BP, sprofileU12gtagACC, sprofileU12atacACC);
      |                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
