cod-rna's Introduction

Project Title:

COD-RNA Prediction


Overview

This project aims to predict/detect novel non-coding RNA (ncRNA) sequences in sequenced genomes. For more information, see the published paper by the authors (listed under Authors and Citation).

Stack

All stack elements are open-sourced:

Python, Jupyter Notebook, PyCharm, Scikit-Learn, Git

Features and Data Information

There are eight (8) features contained in the data, and they are as follows:

  • The deltaG_total value computed by the Dynalign algorithm, stored multiplied by 10 (divide the stored value by 10 to recover deltaG_total)
  • The length of shorter sequence
  • 'A' frequencies of sequence 1
  • 'U' frequencies of sequence 1
  • 'C' frequencies of sequence 1
  • 'A' frequencies of sequence 2
  • 'U' frequencies of sequence 2
  • 'C' frequencies of sequence 2
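
As a minimal sketch of the feature layout above, the snippet below maps one 8-value row to named fields and recovers deltaG_total from the stored first feature. The field names, the helper `describe_sample`, and the sample values are illustrative assumptions, not identifiers or data from this project.

```python
# Feature order follows the list above; the names are hypothetical labels.
FEATURE_NAMES = [
    "deltaG_total_x10",    # stored value; divide by 10 to get deltaG_total
    "shorter_seq_length",
    "seq1_A_freq",
    "seq1_U_freq",
    "seq1_C_freq",
    "seq2_A_freq",
    "seq2_U_freq",
    "seq2_C_freq",
]


def describe_sample(row):
    """Map one 8-value feature row to a name -> value dict and add deltaG_total."""
    values = [float(v) for v in row]
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(
            f"expected {len(FEATURE_NAMES)} features, got {len(values)}"
        )
    sample = dict(zip(FEATURE_NAMES, values))
    # Undo the x10 scaling of the first feature.
    sample["deltaG_total"] = sample["deltaG_total_x10"] / 10.0
    return sample


# Made-up values, for illustration only.
example = describe_sample([-766.0, 105.0, 0.25, 0.21, 0.27, 0.23, 0.26, 0.24])
print(example["deltaG_total"])
```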

Motivation

Quick Start

The dataset for this project is too large to store on GitHub. As such, an abstraction is provided that reads the dataset into memory and compresses it for storage on the fly. The compressed data file is in the data/archive directory. Before use in the project, decompress it into the data/dataset directory.
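
The compress/decompress round trip described above could look roughly like the sketch below. It assumes gzip as the archive format and uses hypothetical helper names (`compress_file`, `decompress_file`); the project's own abstraction may use a different format and interface.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(src, archive_dir):
    """Gzip `src` into `archive_dir` and return the path of the archive."""
    src = Path(src)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / (src.name + ".gz")
    # Stream the bytes so the whole file never has to sit in memory at once.
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


def decompress_file(archive, dataset_dir):
    """Gunzip `archive` into `dataset_dir` and return the restored file path."""
    archive = Path(archive)
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    dest = dataset_dir / archive.stem  # drops the trailing ".gz"
    with gzip.open(archive, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest
```

For example, `decompress_file("data/archive/<name>.gz", "data/dataset")` would restore the file into the dataset directory, where `<name>` stands for whatever the archive is actually called.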

To run the scripts, enter the following in the terminal:

  1. Navigate to the scripts directory.
./ $ cd scripts
  2. Next, run the main.py file with the following syntax:

    py main.py --argument argument_value

Example:

./scripts/ $ py main.py --r_state 42

Acceptable arguments include:

  • visualize (default = False)
  • r_state (default = 42; random state)
  • data_dir (data directory)
  • arch_dir (compressed file directory)
  • thresh (minimum limit for feature importance)
  • train (create train split?)
  • valid (create valid split?)
  • test (create test split?)

Others may be found in the main.py script.

  3. Generated diagnostics (text and images) will populate the reports/text and reports/images directories respectively.
  4. The trained model artefact is saved in the artefacts directory.
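
For orientation, the arguments listed above could be declared with `argparse` roughly as follows. This is a sketch, not the contents of main.py: the argument types, the `str_to_bool` helper, and every default other than `visualize` and `r_state` are assumptions.

```python
import argparse


def str_to_bool(value):
    """Parse common true/false spellings.

    argparse's `type=bool` would treat any non-empty string as True,
    so an explicit converter is used instead.
    """
    text = str(value).strip().lower()
    if text in {"true", "t", "yes", "y", "1"}:
        return True
    if text in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="COD-RNA prediction (sketch)")
    # Defaults stated in the README.
    parser.add_argument("--visualize", type=str_to_bool, default=False)
    parser.add_argument("--r_state", type=int, default=42, help="random state")
    # Assumed defaults: the README names the arguments but not their values.
    parser.add_argument("--data_dir", type=str, default="data/dataset")
    parser.add_argument("--arch_dir", type=str, default="data/archive")
    parser.add_argument("--thresh", type=float, default=None,
                        help="minimum limit for feature importance")
    parser.add_argument("--train", type=str_to_bool, default=True)
    parser.add_argument("--valid", type=str_to_bool, default=True)
    parser.add_argument("--test", type=str_to_bool, default=True)
    return parser


args = build_parser().parse_args(["--r_state", "7", "--visualize", "true"])
print(args.r_state, args.visualize)
```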

Training Procedure

Training was done via composite models, i.e. estimators and transformers chained via the Pipeline API in Scikit-Learn. The dataset is class-imbalanced, so an oversampling technique was applied during modelling.

The Pipeline steps are:

  • MinMax scaler
  • Quantile transformer
  • SMOTE Oversampler
  • Learning algorithm

Two learning algorithms were attempted:

  • Logistic Regression
  • Support Vector Machines (SVMs)

The SVM was the final algorithm selected.

Performance Report

Test-set performance improved considerably, from ~89 % for a LogisticRegression Pipeline to ~94 % for an SVM. These figures were obtained for the major classification metrics (accuracy, F1-score, recall, precision, and AUC score), computed with macro averaging.
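
A minimal sketch of how such a macro-averaged report could be computed with Scikit-Learn is shown below. The helper name `macro_report` and the toy labels are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def macro_report(y_true, y_pred, y_score):
    """Return the main classification metrics, macro-averaged over classes."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro",
                                     zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro",
                               zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        # AUC uses continuous scores rather than hard labels.
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Toy labels and scores, for illustration only.
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.6, 0.9, 0.4, 0.3]
report = macro_report(y_true, y_pred, y_score)
print(report)
```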

Thresholding was carried out on the trained model; the maximal test performance on the ROC-AUC metric (94.42 %) was obtained at a threshold of 0.494.
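
One way such a threshold search could work is sketched below. It assumes the ROC-AUC was computed on hard labels obtained by cutting the predicted probabilities at each candidate threshold (for binary labels this equals the mean of the true-positive and true-negative rates), and that the grid has a step of 0.001. Both are assumptions, as is the helper name `best_threshold`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def best_threshold(y_true, y_score, thresholds=None):
    """Sweep thresholds and return (threshold, auc) maximizing hard-label ROC-AUC."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 1001)  # assumed grid, step 0.001
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    best_t, best_auc = None, -np.inf
    for t in thresholds:
        # Positive class whenever the score reaches the threshold.
        y_pred = (y_score >= t).astype(int)
        auc = roc_auc_score(y_true, y_pred)
        if auc > best_auc:  # strict '>' keeps the first threshold on ties
            best_t, best_auc = float(t), float(auc)
    return best_t, best_auc


# Toy data, for illustration only.
t, auc = best_threshold([0, 0, 1, 1], [0.1, 0.3, 0.6, 0.8],
                        thresholds=[0.2, 0.5, 0.9])
print(t, auc)
```

Selecting the threshold on the test set, as described above, makes the reported figure optimistic; a held-out validation split would be the more conservative choice for the sweep.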

To-Dos

  1. Flesh out this README.
  2. Tighten up the main.py file.

Appendix

Data Source:

Source: [ Train ] || [ Test ]

Authors and Citation

  1. Andrew V Uzilov, Joshua M Keegan, and David H Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(173), 2006. [ Link ]
