Want to annotate ncRNA in a genome, but having trouble navigating the dozens of different tools and databases out there? Trying to find functions for lncRNAs, but finding almost nothing?

This software may help you solve those problems.

RNA Gatherer

RNA Gatherer is software with ready-to-use pipelines for:

  • explorer.py: Annotation and prediction of ncRNA in genomes, taking into account transcriptome data, covariance models, reference sequences, reference annotations and data from public APIs;
  • prophet.py: Computational prediction of lncRNA functions using gene coexpression.

Installation

RNA Gatherer requires some databases and software in order to run. It was developed for Linux x64 environments and uses a command line interface.

First of all, you should clone (or download) this repository:

git clone https://github.com/pentalpha/rna_gatherer.git
cd rna_gatherer

Databases

| File | Is it mandatory? | Download link |
|---|---|---|
| Gene Ontology Graph | Yes | http://purl.obolibrary.org/obo/go.obo |
| RFAM Covariance Models | Yes | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz |
| Non-Redundant Proteins | Only if you want to remove mRNA of known proteins from the lncRNA data | ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz |
| ncRNA Database FASTAs | Only if you want to look for known ncRNA through alignment | It can be ANY .fasta file. We suggest using RNA Central's database: ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/sequences/rnacentral_active.fasta.gz |
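The two mandatory downloads can also be scripted. The following is a minimal sketch, not part of RNA Gatherer: the helper names `local_path`, `gunzip` and `fetch` are hypothetical, and decompressing the `.gz` archives is an assumption based on the uncompressed example paths (`Rfam.cm`, `nr.fasta`) used in the configuration file.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# URLs copied from the table above; only the two mandatory files are listed.
DATABASE_URLS = {
    "go_obo": "http://purl.obolibrary.org/obo/go.obo",
    "rfam_cm": "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz",
}


def local_path(url: str, dest_dir: Path) -> Path:
    """Return where the file behind `url` would be stored inside `dest_dir`."""
    return dest_dir / url.rsplit("/", 1)[-1]


def gunzip(path: Path) -> Path:
    """Decompress a .gz file next to itself and return the new path."""
    out_path = path.with_suffix("")  # "Rfam.cm.gz" becomes "Rfam.cm"
    with gzip.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def fetch(url: str, dest_dir: Path) -> Path:
    """Download `url` into `dest_dir`, decompressing it if it ends in .gz."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = local_path(url, dest_dir)
    urllib.request.urlretrieve(url, target)
    return gunzip(target) if target.suffix == ".gz" else target


# Hypothetical usage (starts real downloads, so it is left commented out):
# for key, url in DATABASE_URLS.items():
#     print(key, fetch(url, Path("databases")).resolve())
```

The printed absolute paths are the values that would go into the configuration file. The optional databases are left out of the dictionary because they are very large; they can be added in the same way.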

After downloading them, edit the config.json file to include the full paths. If the file does not already exist, create it:

cp config.dummy.json config.json

Now, open config.json with your favorite text editor. Fill in the empty fields with the paths to the downloaded files:

[...]
    
    "rna_dbs": {"DB Name": "path/to/db.fasta",
        "DB Name 2": "path/to/db_2.fasta", ...},
    "non_redundant": "path/to/nr.fasta",
    "go_obo": "path/to/go.obo",
    "rfam_cm": "path/to/Rfam.cm"
}

Non-mandatory fields can be left empty.
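Before running the pipelines, it can help to confirm that the configured paths actually exist. The sketch below is not part of RNA Gatherer; it assumes that config.json contains the four keys shown above and that only `go_obo` and `rfam_cm` are mandatory, as the databases table states. The function names are hypothetical.

```python
import json
from pathlib import Path

# Assumption: these are the fields marked as mandatory in the databases table.
MANDATORY_KEYS = ("go_obo", "rfam_cm")


def check_config(config: dict) -> list:
    """Return a list of human-readable problems found in the configuration."""
    problems = []
    for key in MANDATORY_KEYS:
        value = config.get(key, "")
        if not value:
            problems.append(f"{key}: mandatory field is empty")
        elif not Path(value).is_file():
            problems.append(f"{key}: file not found: {value}")

    # Optional entries are only checked when they have been filled in.
    optional = {"non_redundant": config.get("non_redundant", "")}
    for name, path in (config.get("rna_dbs") or {}).items():
        optional[f"rna_dbs[{name}]"] = path
    for key, value in optional.items():
        if value and not Path(value).is_file():
            problems.append(f"{key}: file not found: {value}")
    return problems


def check_config_file(path: str = "config.json") -> list:
    """Load a JSON configuration file and check the paths it contains."""
    with open(path) as handle:
        return check_config(json.load(handle))
```

Calling `check_config_file()` from the repository root returns an empty list when every filled-in path points to an existing file.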

Required software

The required software is listed in the environment.yml file. Using conda, you can create the environment with one command:

conda env create -f environment.yml

Now activate the new environment in order to use the software:

conda activate rna

How To Use

explorer.py

This is an extensive pipeline for detecting ncRNA in a given genome. Given a genome (and, optionally, additional inputs), it produces a non-redundant .GFF annotation file and a .TSV file with functional annotations, based on RFAM and other databases.

A basic command would be:

python explorer.py -g [genome.fasta] \
    -tx [taxonomic ID for species] \
    -o [output directory]

These are the only required input arguments, but other inputs can be passed in order to make the annotation a lot better!

Passing a transcriptome enables the annotation of lncRNA transcripts:

python explorer.py -g [genome.fasta] \
    -tx [taxonomic ID for species] \
    -tr [transcriptome.fasta] \
    -o [output directory]

This includes an ncRNA reference annotation file (.gff format):

python explorer.py -g [genome.fasta] \
    -tx [taxonomic ID for species] \
    -gff [reference.gff] \
    -o [output directory]

You can find reference files like these for many species here. Please note that including mRNA in the reference annotation can mess things up a little bit...

Many species have reference ncRNA sequences out there with no position in the genome. RNA Gatherer can map them for you:

python explorer.py -g [genome.fasta] \
    -tx [taxonomic ID for species] \
    -ref [reference.fasta] \
    -o [output directory]

For a more detailed description of the command line arguments, use --help:

python explorer.py --help
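Once explorer.py has finished, a quick way to inspect the resulting .GFF annotation is to count how many features of each type it contains. The sketch below only assumes the standard tab-separated GFF layout, where the third column holds the feature type; the function name and the output file name in the usage comment are hypothetical, since the exact output names are not stated here.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF features by type (third tab-separated column)."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts


# Hypothetical usage; replace the path with the .GFF file in your output directory:
# with open("output_directory/annotation.gff") as handle:
#     for feature_type, total in count_feature_types(handle).most_common():
#         print(feature_type, total)
```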

prophet.py

Given a count reads table, a list of lncRNA names and an annotation of coding genes, this tool enables you to predict the functions (Gene Ontology terms) of lncRNA.

An example command:

python prophet.py -cr test_data/counts/mus_musculus_tpm.tsv \
    -reg test_data/lnc_list/mus_musculus_lncRNA.txt \
    -ann test_data/annotation/mgi_genes_annotation.tsv \
    -o output_directory

-cr: The count reads table is a simple .TSV table where the first row contains the sample names and each following row starts with a gene name, followed by the read counts in each sample. The counts must be normalized, preferably with TPM. It must include counts for both lncRNA and mRNA (example).

-reg: The lncRNA list specifies which ones of the genes in the count reads table are lncRNA. It's a simple .TXT file where every line is a lncRNA name (example).

-ann: The functional annotation for the coding genes, another .TSV table. Each line contains a gene name, a GO term and the respective ontology - molecular_function, biological_process or cellular_component (example).
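The three input formats described above can be checked with a short script before running prophet.py. This is a minimal sketch, not the parser used by the tool: the function names are hypothetical, the toy data is made up, and it assumes that the header row may or may not carry a label above the gene-name column.

```python
import csv
import io

ONTOLOGIES = {"molecular_function", "biological_process", "cellular_component"}


def read_counts(handle):
    """Parse the -cr table into (sample_names, {gene_name: [counts]})."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if len(rows) < 2:
        raise ValueError("expected a header row plus at least one gene row")
    header, body = rows[0], rows[1:]
    counts = {row[0]: [float(value) for value in row[1:]] for row in body}
    widths = {len(values) for values in counts.values()}
    if len(widths) != 1 or 0 in widths:
        raise ValueError("every gene row needs the same, non-zero number of counts")
    n_samples = widths.pop()
    # Keep only the last n_samples header cells, in case the first cell is a label.
    return header[-n_samples:], counts


def read_lnc_list(handle):
    """Parse the -reg file: one lncRNA name per line, blank lines ignored."""
    return [line.strip() for line in handle if line.strip()]


def read_annotation(handle):
    """Parse the -ann table; rows without a known ontology are returned separately."""
    valid, rejected = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 3 and row[2] in ONTOLOGIES:
            valid.append((row[0], row[1], row[2]))
        elif row:
            rejected.append(row)
    return valid, rejected


def missing_lncrnas(lnc_names, counts):
    """Return the lncRNA names that have no row in the count reads table."""
    return [name for name in lnc_names if name not in counts]


# Toy data (made up) in the three formats described above.
samples, counts = read_counts(io.StringIO(
    "sample_1\tsample_2\nlnc1\t1.5\t0.0\ngeneA\t3.0\t2.5\n"))
lnc_names = read_lnc_list(io.StringIO("lnc1\nlnc2\n"))
annotation, rejected = read_annotation(io.StringIO(
    "geneA\tGO:0003677\tmolecular_function\n"))
not_in_table = missing_lncrnas(lnc_names, counts)
```

Here `not_in_table` lists the names from the lncRNA list that have no counts, which is a common cause of mismatched inputs.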

For a more detailed description of the command line arguments, use --help:

python prophet.py --help
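To give an intuition for function prediction through gene coexpression, the sketch below transfers GO terms from coding genes to the lncRNAs whose expression profiles correlate strongly with them. It is an illustration only, not the algorithm implemented in prophet.py: the use of Pearson correlation, the `min_corr` threshold, the function names and the toy data are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no coexpression signal
    return cov / sqrt(var_x * var_y)


def predict_functions(counts, lnc_names, annotation, min_corr=0.9):
    """Assign to each lncRNA the GO terms of its strongly coexpressed coding genes."""
    go_by_gene = {}
    for gene, go_term, _ontology in annotation:
        go_by_gene.setdefault(gene, set()).add(go_term)

    predictions = {}
    for lnc in lnc_names:
        terms = set()
        for gene, gene_terms in go_by_gene.items():
            if gene in counts and pearson(counts[lnc], counts[gene]) >= min_corr:
                terms |= gene_terms
        predictions[lnc] = terms
    return predictions


# Toy data (made up): lnc1 rises with geneA and falls as geneB rises.
counts = {
    "lnc1": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
}
annotation = [
    ("geneA", "GO:0003677", "molecular_function"),
    ("geneB", "GO:0005634", "cellular_component"),
]
predictions = predict_functions(counts, ["lnc1"], annotation)
```

In this toy case only geneA passes the threshold, so lnc1 inherits geneA's term and nothing from geneB. A real analysis would also need to assess statistical confidence, which this sketch leaves out.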

Project structure

To-do list

  • Create a utility to download the databases for the user.

Contributors

pentalpha
