Project Title:

COD-rNA Prediction

Overview

This project aims to predict and detect novel non-coding RNA (ncRNA) sequences in sequenced genomes. For more information, see the paper listed under Authors and Citation.

Stack

All stack elements are open-sourced:

Features and Data Information

There are eight (8) features contained in the data, and they are as follows:

deltaG_total value computed by the Dynalign algorithm (divide the stored value by 10 to obtain it)
The length of shorter sequence
'A' frequencies of sequence 1
'U' frequencies of sequence 1
'C' frequencies of sequence 1
'A' frequencies of sequence 2
'U' frequencies of sequence 2
'C' frequencies of sequence 2
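
As an illustration of how the eight features above fit together, here is a minimal Python sketch that builds one feature vector from a pair of sequences. The function names `base_frequency` and `pair_features`, the example sequences, and the raw energy value are hypothetical. The sketch also assumes that each "frequency" is the fraction of bases in the sequence, which the dataset description does not state explicitly.

```python
def base_frequency(seq, base):
    """Fraction of positions in `seq` equal to `base` (assumed definition)."""
    return seq.count(base) / len(seq) if seq else 0.0


def pair_features(raw_delta_g, seq1, seq2):
    """Build the eight-feature vector in the order listed above."""
    seq1, seq2 = seq1.upper(), seq2.upper()
    features = [
        raw_delta_g / 10.0,                 # deltaG_total: stored value divided by 10
        float(min(len(seq1), len(seq2))),   # length of the shorter sequence
    ]
    # 'A', 'U', 'C' frequencies of sequence 1, then of sequence 2
    for seq in (seq1, seq2):
        for base in "AUC":
            features.append(base_frequency(seq, base))
    return features


# Hypothetical inputs, for illustration only
example = pair_features(-235.0, "AUGCAUGC", "AACCUU")
print(example)
```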

Motivation

Quick Start

The dataset for this project is too large to be hosted on GitHub. An abstraction is therefore provided that reads the dataset into memory and compresses it on the fly for storage. The compressed data file is kept in the data/archive directory; before use in the project, it must be decompressed into the data/dataset directory.
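
The compress/decompress round trip described above could look roughly like the sketch below. It assumes gzip as the compression format and uses hypothetical helper names (`compress_file`, `decompress_file`) and a hypothetical file name; the project's actual abstraction may differ.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def compress_file(src, arch_dir):
    """Gzip `src` into `arch_dir` and return the archive path."""
    arch_dir.mkdir(parents=True, exist_ok=True)
    dest = arch_dir / (src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, no full load in memory
    return dest


def decompress_file(archive, data_dir):
    """Decompress a .gz archive into `data_dir` and return the output path."""
    data_dir.mkdir(parents=True, exist_ok=True)
    dest = data_dir / archive.stem  # "sample.txt.gz" -> "sample.txt"
    with gzip.open(archive, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Round trip in a temporary directory that mirrors data/archive and data/dataset
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "sample.txt"  # hypothetical file name
    source.write_text("hello")
    archive_path = compress_file(source, root / "data" / "archive")
    restored_path = decompress_file(archive_path, root / "data" / "dataset")
    archive_name = archive_path.name
    restored_text = restored_path.read_text()
```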

To run the scripts, type as below in the Terminal:

Navigate to the scripts directory.

./ $ cd scripts

Next, run the main.py file with the following syntax:

py main.py --argument argument_value

Example:

./scripts/ $ py main.py --r_state 42

Acceptable arguments include:

visualize (default = False)
r_state (default = 42; random state)
data_dir (data directory)
arch_dir (compressed file directory)
thresh (minimum limit for feature importance)
train (create train split?)
valid (create valid split?)
test (create test split?)

Others may be found in the main.py script.
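
For orientation, the arguments listed above could be declared with `argparse` along the lines of the sketch below. Only the argument names and the defaults for `visualize` and `r_state` come from this README; the types, the remaining defaults, and the `str_to_bool` helper are assumptions, so main.py remains the authoritative reference.

```python
import argparse


def str_to_bool(value):
    """Parse common true/false spellings (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--visualize", type=str_to_bool, default=False)
    parser.add_argument("--r_state", type=int, default=42, help="random state")
    # Defaults below are assumptions, not taken from main.py
    parser.add_argument("--data_dir", type=str, default="../data/dataset")
    parser.add_argument("--arch_dir", type=str, default="../data/archive")
    parser.add_argument("--thresh", type=float, default=None,
                        help="minimum limit for feature importance")
    parser.add_argument("--train", type=str_to_bool, default=True)
    parser.add_argument("--valid", type=str_to_bool, default=True)
    parser.add_argument("--test", type=str_to_bool, default=True)
    return parser


# Parse an explicit argument list so the sketch runs outside a terminal
args = build_parser().parse_args(["--r_state", "7", "--visualize", "true"])
print(args.r_state, args.visualize)
```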

Generated diagnostics (text and images) are written to the reports/text and reports/images directories, respectively.
The trained model artefact is saved in the artefacts directory.

Training Procedure

Training was done via composite models, i.e. estimators and transformers chained via the Pipeline API in scikit-learn. The dataset suffers from class imbalance, so an oversampling technique was applied during modelling.

The Pipeline steps are:

MinMax scaler
Quantile transformer
SMOTE Oversampler
Learning algorithm

Two learning algorithms were attempted:

Logistic Regression
Support Vector Machines (SVMs)

The SVM was the final algorithm selected.

Performance Report

Test-set performance improved considerably, from ~89 % for the Logistic Regression pipeline to ~94 % for the SVM pipeline. These figures were obtained for the major classification metrics (accuracy, F1-score, recall, precision, and AUC score), via macro averaging.
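
To make the macro averaging concrete, the sketch below computes macro-averaged recall by hand on a small, made-up set of labels. The function name `macro_recall` and the labels are hypothetical; they are not the project's results.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Made-up imbalanced example: eight negatives, two positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_recall(y_true, y_pred)
print(accuracy, macro)
```

In this example accuracy is 0.9 while macro recall is 0.75, because macro averaging gives the minority class the same weight as the majority class. That is why it is a reasonable choice for imbalanced data.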

Thresholding was carried out for the trained model, and the maximal test performance on the ROC-AUC metric (~94.42 %) was obtained at a threshold of 0.494.
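
A threshold sweep of this kind can be sketched as below. The sketch scores each candidate threshold with balanced accuracy, which for hard binary predictions equals the area under the single-point ROC curve. Whether the project computes the metric this way is an assumption, and the labels, scores, grid, and function names are made up for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def best_threshold(y_true, scores, thresholds):
    """Return the first threshold that maximises balanced accuracy."""
    best_t, best_score = None, -1.0
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        score = balanced_accuracy(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up labels and predicted probabilities
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
grid = [i / 1000 for i in range(1, 1000)]  # candidate thresholds 0.001 .. 0.999

threshold, score = best_threshold(y_true, scores, grid)
print(threshold, score)
```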

To-Dos

Flesh out this README.
Tighten up the main.py file.

Appendix

Data Source:

Source: [ Train ] || [ Test ]

Authors and Citation

Andrew V Uzilov, Joshua M Keegan, and David H Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(173), 2006. [ Link ]
