This project aimed to predict/detect novel non-coding RNA (ncRNA) strands in sequenced genomes. For more information, see the published paper by the authors.
All stack elements are open-source.
The data contains eight (8) features:
- The deltaG_total value computed by the Dynalign algorithm (divide the stored value by 10 to get it)
- The length of the shorter sequence
- 'A' frequencies of sequence 1
- 'U' frequencies of sequence 1
- 'C' frequencies of sequence 1
- 'A' frequencies of sequence 2
- 'U' frequencies of sequence 2
- 'C' frequencies of sequence 2
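As a rough illustration of the layout above, the sketch below names the eight columns and recovers deltaG_total from the first one. The column names, the FEATURE_NAMES list, and the label_row helper are hypothetical, and the sample values are made up; only the ordering and the divide-by-10 rule come from the feature list.

```python
# Hypothetical column names; only their order follows the feature list above.
FEATURE_NAMES = [
    "deltaG_total_x10",  # stored value; divide by 10 to get deltaG_total
    "shorter_seq_length",
    "freq_A_seq1", "freq_U_seq1", "freq_C_seq1",
    "freq_A_seq2", "freq_U_seq2", "freq_C_seq2",
]


def label_row(values):
    """Map one row of eight raw values to named features."""
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(
            f"expected {len(FEATURE_NAMES)} values, got {len(values)}"
        )
    row = dict(zip(FEATURE_NAMES, values))
    # Recover the free energy change reported by Dynalign.
    row["deltaG_total"] = row["deltaG_total_x10"] / 10
    return row


# Made-up numbers, for illustration only.
example = label_row([-766.0, 128, 0.23, 0.31, 0.22, 0.25, 0.29, 0.21])
print(example["deltaG_total"])
```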
The dataset for this project exceeds GitHub's file size limit. As such, an abstraction is provided to read the dataset into memory and compress it for storage on the fly. The compressed data file is found in the data/archive directory. For use in the project, the compressed file must be decompressed into the data/dataset directory.
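The compress/decompress round trip described above might look like the sketch below. It assumes gzip as the compression format and streams the file in chunks; the project's actual format, file names, and helper functions are not stated here, so compress_file, decompress_file, and sample.csv are hypothetical. The demo runs inside a temporary directory that mirrors the data/archive and data/dataset layout.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def compress_file(src: Path, archive_dir: Path) -> Path:
    """Gzip `src` into `archive_dir` and return the archive path."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / (src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # copies in chunks
    return dest


def decompress_file(archive: Path, dataset_dir: Path) -> Path:
    """Gunzip `archive` into `dataset_dir` and return the restored path."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    dest = dataset_dir / archive.stem  # "sample.csv.gz" -> "sample.csv"
    with gzip.open(archive, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Demo on a throwaway file; the real dataset file name is not assumed.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "sample.csv"
    raw.write_text("1,2,3\n" * 1000)
    archive = compress_file(raw, root / "data" / "archive")
    restored = decompress_file(archive, root / "data" / "dataset")
    round_trip_ok = restored.read_bytes() == raw.read_bytes()
    archive_is_smaller = archive.stat().st_size < raw.stat().st_size

print(round_trip_ok, archive_is_smaller)
```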
To run the scripts, enter the following in the terminal:
- Navigate to the scripts directory.
  ./ $ cd scripts
- Next, run the main.py file with the following syntax:
  py main.py --argument argument_value
  Example:
  ./scripts/ $ py main.py --r_state 42
Acceptable arguments include:
- visualize (default = False)
- r_state (default = 42; random state)
- data_dir (data directory)
- arch_dir (compressed file directory)
- thresh (minimum limit for feature importance)
- train (create train split?)
- valid (create valid split?)
- test (create test split?)
Others may be found in the main.py script.
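For orientation, the arguments listed above could be declared with argparse roughly as follows. This is a sketch, not the contents of main.py: the argument names and the two stated defaults (visualize and r_state) come from the list, while the types, the str_to_bool helper, and the None placeholders for unstated defaults are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'true'/'false'-style strings, since each flag takes a value."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--visualize", type=str_to_bool, default=False)
    parser.add_argument("--r_state", type=int, default=42,
                        help="random state")
    # Defaults below are not stated in this README, so None is a placeholder.
    parser.add_argument("--data_dir", type=str, default=None,
                        help="data directory")
    parser.add_argument("--arch_dir", type=str, default=None,
                        help="compressed file directory")
    parser.add_argument("--thresh", type=float, default=None,
                        help="minimum limit for feature importance")
    for split in ("train", "valid", "test"):
        parser.add_argument(f"--{split}", type=str_to_bool, default=None,
                            help=f"create {split} split?")
    return parser


# Equivalent to: py main.py --r_state 42 --visualize true
args = build_parser().parse_args(["--r_state", "42", "--visualize", "true"])
print(args.r_state, args.visualize)
```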
- Generated diagnostics, text and images, will populate the reports/text and reports/images directories respectively.
- Find the trained model artefact in the artefacts directory.
Training was done via composite models, i.e. estimators and transformers chained via the Pipeline API in scikit-learn. The dataset was affected by class imbalance, so an oversampling technique was applied during modelling.
The Pipeline steps are:
- MinMax scaler
- Quantile transformer
- SMOTE Oversampler
- Learning algorithm
Two learning algorithms were attempted:
- Logistic Regression
- Support Vector Machines (SVMs)
The SVM was the final algorithm selected.
A considerable performance improvement was observed on the test set, from ~89 % for a LogisticRegression Pipeline to ~94 % for an SVM. These figures were obtained for the major classification metrics (accuracy, f1-score, recall, precision, and AUC score), via macro averaging.
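Macro averaging computes a metric per class and then takes the unweighted mean, so the minority class counts as much as the majority class. The sketch below works this through for the f1-score on made-up labels; the helper names are hypothetical and the numbers are unrelated to the project's results.

```python
def per_class_f1(y_true, y_pred, label):
    """F1-score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, lbl) for lbl in labels) / len(labels)


# Made-up labels: eight majority (0) and two minority (1) samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_class_f1(y_true, y_pred, 0))  # 0.875
print(per_class_f1(y_true, y_pred, 1))  # 0.5
print(macro_f1(y_true, y_pred))         # 0.6875
print(accuracy)                         # 0.8
```

Here the macro f1-score (0.6875) sits well below the accuracy (0.8) because the weaker minority-class score carries equal weight.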
Thresholding was carried out for the trained model, and the maximal test performance on the roc-auc metric (≈ 94.42 %) was obtained at a threshold of 0.494.
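A threshold search of this kind can be sketched as below. It assumes the roc-auc is computed on thresholded (0/1) predictions, in which case it equals (TPR + TNR) / 2, and that candidate thresholds lie on a 0.001 grid; both are assumptions, as are the helper names. The scores are made up and do not come from the trained model.

```python
def hard_label_auc(y_true, y_pred):
    """ROC AUC of 0/1 predictions, which equals (TPR + TNR) / 2.

    Both classes must be present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return (tp / pos + tn / neg) / 2


def best_threshold(y_true, scores, step=0.001):
    """Return (threshold, auc) for the first grid point with maximal AUC."""
    best_t, best_auc = 0.0, -1.0
    for i in range(round(1 / step) + 1):
        t = i * step
        preds = [int(s >= t) for s in scores]
        auc = hard_label_auc(y_true, preds)
        if auc > best_auc:
            best_t, best_auc = t, auc
    return best_t, best_auc


# Made-up labels and positive-class scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.10, 0.20, 0.35, 0.45, 0.55, 0.70, 0.90, 0.60]

threshold, auc = best_threshold(y_true, scores)
print(f"best threshold ~ {threshold:.3f}, roc-auc = {auc:.3f}")
```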
- Flesh out this README.
- Tighten up the main.py file.
Data Source:
- Andrew V Uzilov, Joshua M Keegan, and David H Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(173), 2006.