
proteins's Introduction

Set of command-line tools operating on FASTA files suitable for:

  • extend.py - protein sequence enrichment
  • scan-pfam.py - protein domain finding
  • fisher-test.py - measuring the significance of protein enrichment

Pipfile and Pipfile.lock (for Pipenv) listing all required libraries are included.
In addition, the test script test.sh, the sample input data example.fasta, and all output files generated for this input are included.

The APIs used here are:

Usage scenario

Suppose that we have a set of protein subsequences (e.g. from a protein microarray) that are known to be derived from a larger set of protein sequences, and that we are interested in what this larger set looks like, in particular which protein domains are present in it.

One possible solution is to extend these subsequences to known sequences that are very similar to them. A search for protein domains then needs to be run on these longer sequences. Finally, we can calculate the significance of this set of domains in comparison to the domains from the input subsequences, using Fisher's exact test. Significance is measured as deviation from the null hypothesis that all types of domains occur equally often in both sets of sequences.

The list of steps needed to obtain such a result is as follows:

  1. You have an input FASTA file example.fasta with incomplete protein sequences.
  2. Pass it to extend.py -i example.fasta to get extended protein sequences example-out.fasta from the BLAST nr database.
  3. Pass the result file example-out.fasta to scan-pfam.py example-out.fasta example-out.csv.
  4. Pass the input file example.fasta to scan-pfam.py example.fasta example.csv.
  5. Pass both example.csv and example-out.csv to fisher-test.py.
  6. At the end, we get the probability of this domain distribution.

This whole set of instructions can be executed with test.sh example.fasta.
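The steps above can also be sketched in Python. This is only an illustration of the command sequence, not the contents of test.sh: the function names are hypothetical, and the sketch assumes that output names for other inputs follow the same `<name>-out.fasta` / `<name>.csv` pattern as the example and that the scripts are run with the current Python interpreter.

```python
import subprocess
import sys
from pathlib import Path


def pipeline_commands(fasta: str) -> list[list[str]]:
    """Build the commands for steps 2-5 (hypothetical helper, not part of the repo)."""
    base = str(Path(fasta).with_suffix(""))   # "example.fasta" -> "example"
    extended = f"{base}-out.fasta"            # assumed naming, as in step 2
    return [
        [sys.executable, "extend.py", "-i", fasta],
        [sys.executable, "scan-pfam.py", extended, f"{base}-out.csv"],
        [sys.executable, "scan-pfam.py", fasta, f"{base}.csv"],
        [sys.executable, "fisher-test.py", f"{base}.csv", f"{base}-out.csv"],
    ]


def run_pipeline(fasta: str) -> None:
    """Run each step in order and stop at the first failing command."""
    for command in pipeline_commands(fasta):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("example.fasta")` from the repository directory would then run the four scripts one after another.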

extend.py

Usage: extend.py -i <input_file> --evalue <evalue> --minident <minimal_percent_of_identity>

Searches the NCBI nr (non-redundant protein sequences) database for protein sequences that match at least one sequence from input_file. The resulting file does not contain duplicates. The match criterion is determined by the following parameters:

  • minident - Minimal percentage of identical amino acid pairs relative to the alignment length. Defaults to 90%.
  • evalue - Maximal expected number of chance matches in a random model. Defaults to 10E-9.
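How the two thresholds and the duplicate removal could combine is sketched below. The `Hit` record and both functions are hypothetical and do not reflect the internals of extend.py. Treating the thresholds as inclusive and identifying duplicates by subject id are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical BLAST hit record holding only the fields the filter needs."""
    subject_id: str
    identities: int      # number of identical amino acid pairs
    align_length: int    # length of the alignment
    evalue: float


def passes(hit: Hit, min_ident: float = 90.0, max_evalue: float = 10e-9) -> bool:
    """Apply both criteria; the defaults are written as in the README."""
    percent_identity = 100.0 * hit.identities / hit.align_length
    return percent_identity >= min_ident and hit.evalue <= max_evalue


def select_unique(hits: list[Hit]) -> list[str]:
    """Return the ids of passing hits, each id once, in first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for hit in hits:
        if passes(hit) and hit.subject_id not in seen:
            seen.add(hit.subject_id)
            kept.append(hit.subject_id)
    return kept
```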

scan-pfam.py

Usage: scan-pfam.py <input_file> <output_file>

Generates a .csv file of all domains found in the input file according to the Pfam database of protein domains.
The .csv output file has the following structure:

  • Protein sequence ids in table rows
  • Protein domain ids in table columns
  • At the intersection of row and column:
    • 1 if this domain was found in this sequence
    • 0 otherwise
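A minimal sketch of writing such a presence/absence table with the standard csv module follows. The function name, the `sequence_id` header cell, and the alphabetical column order are assumptions. The real output of scan-pfam.py may be laid out differently.

```python
import csv
import io


def write_domain_matrix(domains_by_sequence: dict[str, set[str]], out) -> None:
    """Write one row per sequence id and one 0/1 column per domain id."""
    # Collect every domain id seen in any sequence; sorted for a stable column order.
    domain_ids = sorted(set().union(*domains_by_sequence.values()))
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sequence_id", *domain_ids])
    for seq_id, found in domains_by_sequence.items():
        writer.writerow([seq_id, *(1 if d in found else 0 for d in domain_ids)])


# Example with made-up sequence ids and two Pfam-style domain ids.
buffer = io.StringIO()
write_domain_matrix({"seq1": {"PF00001"}, "seq2": {"PF00001", "PF00002"}}, buffer)
print(buffer.getvalue())
```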

fisher-test.py

Usage: fisher-test.py <file1> <file2>

Calculates Fisher's exact test for the protein domain distributions in the .csv files file1 and file2.
This is done using the fisher_exact function from the scipy.stats library, as well as the author's own implementation.
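For orientation, a two-sided Fisher's exact test on a single 2x2 table can be written in pure Python as below. This is a sketch and not the repository's own implementation. How fisher-test.py builds its contingency table from the two .csv files is not described here, so the layout (a, b = sequences with and without a given domain in file1; c, d = the same counts in file2) is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum the probabilities of all tables that are no more likely than the
    # observed one; the small tolerance guards against float rounding.
    p_value = sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)
```

For the same table, `scipy.stats.fisher_exact([[a, b], [c, d]])` returns an odds ratio together with a two-sided p-value, which can serve as a cross-check.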
