Giter VIP home page Giter VIP logo

neural-ensemble-method-for-cancer-prediction's Introduction

A snapshot neural ensemble method for cancer type prediction based on CNVs

Code and supplementary materials for the following two papers:

  • "Cancer Risk and Type Prediction Based on Copy Number Variations with LSTM and Deep Belief Networks" submitted to Artificial Intelligence International Conference โ€“ A2IC 2018, Barcelona, Spain, November 2018.
  • "A snapshot neural ensemble method for cancer type prediction based on copy number variations", submitted to Journal of Neural Computing and Applications.

Initial preprocessing

TCGA provide samples from each patient in a single .txt file. CNV data is extracted from these sample files and merged all together in a sinle .csv file. These data processing is done in raw_cnv_data_processing module. This module has four functions

  1. merge_cancer_files: This function merge each cancer sample data into a single .csv file. While merging the samples, it does some inital preprocessing such as numbering of samples, adding gender (male 1, female 0), cancer type and renaming of columns for our convenience. Here, cancer types 1-14 are 'COAD', 'GBM', 'KIRC', 'LGG', 'LUAD', 'LUSC', 'OV', 'UCEC', 'BRCA', 'HNSC', 'THCA', 'PRAD', 'STAD', 'BLCA' respectively.

  2. merge_normal_files: This function does the similar task as 'merge_cancer_files' but for healthy samples. Also, all healthy samples are considered as type 15 initially.

  3. data_preprocessing: In previous two functions, we mainly extracted data from txt files to a single csv file. In this function, we prepare CNV data for machine learning models. Raw CNV data contains chromosome 'X' and 'Y' as string. We replace them by 23 and 24 respectively. Also, we determine CNV gain or loss and CNV length within this function. Finally, we remove all copy numbers that has segmentation mean less than 0.301 and probe number less than 11.

  4. generate_data_summary: Here we generate a simple statistical sumaary on the data. For example, total no of samples, total CNVs, no of CNVs per sample, average CNVs per male, female samples etc.

Gene data processing

For final data processing we need human gene list along with their locations. TCGA provides gene details as json format. Gene processing is done in gene_data_processing module.

  1. extract_gene_names: Gene details are extracted from json files to a .csv file.

  2. Rscript to convert cytobands to gene coordinates: TCGA provides gene locations as cytobands. To convert cytobands to DNA coordinates we used a Rscript which uses ensemble package of biomaRt library. It takes chromosome name, cytobands as input and return start and end positions of that chromosome.

  3. preprocess_gene_list: Now that we have extracted gene details, we need to process gene data to remove noise. For example, chromosome names 'X', 'Y' are represented as 23, 24 respectively. Remove duplicate gene names etc.

Final CNVs with oncogenes:

common_moduls->prepare_input_data: here only the oncogenes are taken as features. In TCGA data, is_cancer_gene_census = True are the oncogenes.

Final CNVs with protein-coding genes

Recursive feature elimination: common_moduls->feature_ranking: We used Recursive Feature Elimination(RFE) from sklearn.feature_selection.

A quick example

A quick example on a small dataset can be performed as follows:

  • $ cd GradCAM
  • $ python3 load_data.py (make sure that the data in CSV format in the 'data' folder)
  • $ python3 model.py
  • $ python3 grad_cam.py

Citation request

If you use the code of this repository in your research, please consider citing the folowing papers:

@inproceedings{karim2018recurrent,
    title={Cancer Risk and Type Prediction Based on Copy Number Variations with LSTM and Deep Belief Networks},
    author={Karim, Md Rezaul and Rahman, Md Ashiqur and Decker, Stefan and Beyan Deniz},
    booktitle={Artificial Intelligence International Conference (A2IC2018)},
    year={2018}
}
@article{karim2019cae,
    title={A snapshot neural ensemble method for cancer type prediction based on copy number variations},
    author={Karim, Md Rezaul and Rahman, Md Ashiqur and Decker, Stefan and Beyan Deniz},
    journal={Neural Computing and Applications},
    year={2019}
}

Contributing

For any questions, feel free to open an issue or contact at [email protected]

neural-ensemble-method-for-cancer-prediction's People

Contributors

rezacsedu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

neural-ensemble-method-for-cancer-prediction's Issues

Regarding GradCam

Hi sir,
I guess there is some code missing in Gradcam load_data.py. Can you look? It is not able to generate the files.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.