xxay-16 / genome-wide-annotation-pipeline

This project forked from unavailable-2374/genome-wide-annotation-pipeline

This workflow combines multiple software tools and is mainly used for whole-genome annotation of eukaryotes. The goal is to port the Perl-based implementation to Python while adding more usage details.

Shell 0.24% Python 1.23% Perl 98.29% R 0.21% Makefile 0.02%

genome-wide-annotation-pipeline's Introduction

Zhou Lab @ AGIS Genome-Wide-Annotation-Pipeline

This workflow combines multiple software tools, mainly for whole-genome annotation of eukaryotes.

The GWAP workflow

Requirements

Tools

The following tools are required. Available options and compatibility may depend on the software version.

Software Installation

1. Download the latest pipeline:

git clone https://github.com/unavailable-2374/Genome-Wide-annotation-pipeline.git

2. Install

If you do not have much experience compiling software, we recommend using conda to install most of the required software.

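cd Genome-Wide-annotation-pipeline
# Add the pipeline's bin directory to PATH for future shells (replace /PATH/TO with the real location)
echo 'export PATH=/PATH/TO/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
mamba env create -f anno_tools.yml
conda activate GWAP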
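
After activating the environment, it can help to confirm that the external programs are reachable. The following is a minimal sketch, not part of the pipeline: the helper `missing_tools` is hypothetical, and the list of executables is an assumption based only on the tools named in this README (Augustus, hisat2, RepeatMasker, RepeatModeler, Perl), not the full requirement list.

```python
import shutil

# Assumption: executable names for tools mentioned in this README.
# The complete requirement list is not shown here, so adjust as needed.
ASSUMED_TOOLS = ["perl", "augustus", "hisat2", "RepeatMasker", "RepeatModeler"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked tools were found on PATH.")
```

Run it inside the activated GWAP environment; any name it reports probably needs a manual install or a PATH fix.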

Manual installation

Download the Pfam databases and concatenate them into one file

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-A.hmm.gz 
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-B.hmm.gz 
gzip -dc Pfam-A.hmm.gz > Pfam-AB.hmm
gzip -dc Pfam-B.hmm.gz >> Pfam-AB.hmm
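
The two `gzip -dc` commands above decompress each archive and append the result to a single file. For systems without the `gzip` command-line tool, the same step can be sketched in Python with the standard library; the function name `concat_gzipped` is hypothetical and chosen only for illustration.

```python
import gzip
import shutil


def concat_gzipped(gz_paths, out_path):
    """Decompress each .gz file in order and append it to out_path."""
    with open(out_path, "wb") as out:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as src:
                # Stream in chunks so large HMM files are never held in memory.
                shutil.copyfileobj(src, out)


# Example call, assuming both archives were downloaded to the current directory:
# concat_gzipped(["Pfam-A.hmm.gz", "Pfam-B.hmm.gz"], "Pfam-AB.hmm")
```

The order of the input list determines the order in the output, so Pfam-A comes first, as in the shell commands.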

Usage

    Usage:
        perl GWAP.pl [options]
    For example:
        perl GWAP.pl --genome genome.fasta -1 rna_1.1.fq.gz,rna_2.1.fq.gz -2 rna_1.2.fq.gz,rna_2.2.fq.gz --protein homolog.fasta --out_prefix out --cpu 80 --gene_prefix Vitis --Pfam_db /PATH-to/Pfam-AB.hmm
    Parameters:
    [General]
        --genome <string>     Required
        Genome file in FASTA format.
        -1 <string> -2 <string>    Required
        FASTQ files containing paired-end RNA-seq data. If your data come from multiple libraries, provide multiple FASTQ files separated by commas. Gzip-compressed files (.gz) are also accepted.
        --protein <string>    Required
        FASTA file of homologous protein sequences (sequences derived from multiple species are recommended).
        --augustus_species <string>    Required when --use_existed_augustus_species is not provided
        Species identifier for Augustus. The HMM files produced by Augustus training will be created with this prefix. If HMM files for this species already exist, the program first deletes that HMM directory and then starts the Augustus training steps.
    [Other]
        --out_prefix <string>    default: out
        the prefix of outputs.
        --use_existed_augustus_species <string>    Required when --augustus_species is not provided
        Species identifier for Augustus. This parameter conflicts with --augustus_species. When it is set, --augustus_species is ignored, the HMM files from a previous Augustus training for this species must already exist, and the Augustus training step is skipped (which saves a lot of running time).
        --RM_species <string>    default: None
        Species identifier for RepeatMasker. Acceptable values are listed in the file $dirname/RepeatMasker_species.txt, for example Eukaryota for eukaryotes, Fungi for fungi, Viridiplantae for plants, Metazoa for animals. When this parameter is set, repeats in the genome sequences are searched against the Repbase database.
        --RM_lib <string>    default: None
        A FASTA file of repeat sequences, typically the output of RepeatModeler. If not set, RepeatModeler is run to produce this file automatically, which is time-consuming.
        --augustus_species_start_from <string>    default: None
        Species identifier for Augustus. The optimization step of Augustus training will start from the parameter file of this species, so choosing a closely related species may save considerable time.
        --cpu <int>    default: 4
        the number of threads.
        --strand_specific    default: False
        Enable analysis of the strand-specific information provided by the "XS" tag in SAM-format alignments. When this parameter is set, the hisat2 parameter "--rna-strandness" should usually be set to "RF".
        --Pfam_db <string>    default: None
        Absolute path of the protein family HMM database used to filter false-positive gene models. Multiple databases can be given; separate the database file prefixes with commas.
        --gene_prefix <string>    default: gene
        The prefix of the gene IDs shown in the output file.
        --help|-h Display this help info
        
        Version: 1.0
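
When the pipeline is launched from a script, the option rules in the help text can be checked before the run starts. The sketch below assembles the example command from this section as an argument list. The function `build_gwap_command` is hypothetical and not part of GWAP. It only uses flags listed above, and it assumes that paired-end input means the `-1` and `-2` lists have the same length, which the help text implies but does not state.

```python
import shlex


def build_gwap_command(genome, reads1, reads2, protein,
                       augustus_species=None,
                       use_existed_augustus_species=None,
                       out_prefix="out", cpu=4,
                       pfam_db=None, gene_prefix="gene"):
    """Return the GWAP.pl invocation as an argument list."""
    # Assumption: paired-end data means one -2 file for every -1 file.
    if not reads1 or len(reads1) != len(reads2):
        raise ValueError("-1 and -2 need the same, non-zero number of files")
    # The help text describes these two options as conflicting.
    if augustus_species and use_existed_augustus_species:
        raise ValueError("--augustus_species conflicts with "
                         "--use_existed_augustus_species")
    cmd = ["perl", "GWAP.pl",
           "--genome", genome,
           "-1", ",".join(reads1),   # multiple libraries are comma-separated
           "-2", ",".join(reads2),
           "--protein", protein,
           "--out_prefix", out_prefix,
           "--cpu", str(cpu),
           "--gene_prefix", gene_prefix]
    optional = {
        "--augustus_species": augustus_species,
        "--use_existed_augustus_species": use_existed_augustus_species,
        "--Pfam_db": pfam_db,
    }
    for flag, value in optional.items():
        if value:
            cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    cmd = build_gwap_command(
        "genome.fasta",
        ["rna_1.1.fq.gz", "rna_2.1.fq.gz"],
        ["rna_1.2.fq.gz", "rna_2.2.fq.gz"],
        "homolog.fasta",
        cpu=80, gene_prefix="Vitis", pfam_db="/PATH-to/Pfam-AB.hmm",
    )
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with unusual file names.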

genome-wide-annotation-pipeline's People

Contributors

unavailable-2374
