Lung_cancer_subtyping

Description of the project

The goal of this class project is to build and evaluate a mathematical model that discriminates between two lung cancer subtypes. To build the model, we apply unsupervised k-means clustering (Euclidean distance, k=2) to 58 NSCLC tumors. To evaluate the model, we compute its accuracy: the percentage of samples that the model assigns to the correct subtype out of all the samples it classifies.
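
Because k-means is unsupervised, the cluster numbers it returns (0 or 1) carry no subtype meaning, so an accuracy figure needs some rule for matching clusters to subtypes. The sketch below shows one common convention, not necessarily the one this project uses: score both possible matchings and keep the better one. The function name `two_cluster_accuracy` and the toy labels are illustrative assumptions.

```python
def two_cluster_accuracy(true_subtypes, cluster_ids):
    """Return the fraction of samples assigned to the correct subtype.

    Assumes exactly two subtypes and cluster ids 0 and 1. Both ways of
    matching clusters to subtypes are scored and the higher score is kept.
    """
    if len(true_subtypes) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    subtypes = sorted(set(true_subtypes))
    if len(subtypes) != 2:
        raise ValueError("expected exactly two subtypes")
    first, second = subtypes
    best = 0.0
    for mapping in ({0: first, 1: second}, {0: second, 1: first}):
        correct = sum(mapping[c] == t for t, c in zip(true_subtypes, cluster_ids))
        best = max(best, correct / len(true_subtypes))
    return best


# Toy example (made-up labels, not results from GSE10245).
truth = ["AD", "AD", "AD", "SCC", "SCC"]
clusters = [1, 1, 0, 0, 0]
print(two_cluster_accuracy(truth, clusters))
```

Under this convention the score for two clusters can never fall below 0.5, because every sample is correct under exactly one of the two matchings.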

Data

The data contains 40 adenocarcinoma (AD) samples and 18 squamous cell carcinoma (SCC) samples.

The data are available as a SOFT formatted family file under the Download header at the following link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10245
The SOFT formatted gz file is also available in the data folder of this repository.
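
For readers unfamiliar with the SOFT layout: entity lines start with `^`, attribute lines start with `!`, and each attribute is written as `key = value`. The project parses the file with GEOparse; the sketch below swaps in a hand-rolled, standard-library reader purely to illustrate that layout. The function name `read_sample_attributes` is hypothetical and is not part of GEOparse or gse_tools.

```python
import gzip
from collections import defaultdict


def read_sample_attributes(path):
    """Collect the `!Sample_*` attributes of every sample in a gzipped SOFT file.

    Returns {sample_id: {attribute_name: [values, ...]}}. Data tables and
    non-sample entities (platform, series) are skipped.
    """
    samples = {}
    current = None  # attributes of the sample block being read, if any
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith("^"):
                # Entity line, e.g. "^SAMPLE = <sample id>".
                kind, _, name = line[1:].partition("=")
                if kind.strip().upper() == "SAMPLE":
                    current = samples.setdefault(name.strip(), defaultdict(list))
                else:
                    current = None
            elif line.startswith("!Sample_") and current is not None:
                # Attribute line, split at the first "=" only.
                key, sep, value = line[1:].partition("=")
                if sep:
                    current[key.strip()].append(value.strip())
    return samples
```

Calling `read_sample_attributes` on `GSE10245_family.soft.gz` should return one entry per sample. Which attribute carries the AD or SCC label is not stated here and would need to be checked in the file itself.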

Packages

All packages used in this project are Python packages:

  • pandas
  • sklearn (scikit-learn)
  • GEOparse
  • matplotlib
  • seaborn

The GEOparse package is used to parse the SOFT formatted file and extract the data.
The sklearn package is used to perform the k-means clustering algorithm.
The matplotlib and seaborn packages are used to plot the data and the results.
The pandas package is used to manipulate the data.

How to install the packages

The following commands clone the repository and install the gse_tools package together with its dependencies:

git clone https://github.com/QuanEvans/Lung_cancer_subtyping.git
cd Lung_cancer_subtyping/python/gse_tools
pip install .

Features

  • Parse the SOFT formatted file and extract the data
  • Perform the k-means clustering algorithm
  • Compute the model accuracy and archive the results
  • Plot the results (bar plot accuracy and scatter plot of the clustering results)
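
The clustering feature listed above relies on sklearn in this project. To make the underlying iteration concrete, the sketch below hand-rolls the standard Lloyd iteration for k-means with Euclidean distance, using only the standard library. It illustrates the technique and is not the project's implementation; the name `kmeans` and the toy points are assumptions.

```python
import math
import random


def kmeans(points, k=2, seed=0, max_iter=100):
    """Cluster `points` (sequences of floats) into k groups; return one label per point."""
    rng = random.Random(seed)
    # Initialise the centroids at k distinct, randomly chosen data points.
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy points forming two well-separated groups (not expression data).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(toy, k=2, seed=575))
```

Which group receives label 0 depends on the random initialisation, so only the grouping is meaningful, not the label values.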

How to run the code

The code is written in Python and is available in the python folder of this repository. We suggest running it in a Jupyter notebook. The following example shows how to run the code:

from gse_tools.GSEs import GSEs  # or import GSEs if you installed the package
filepath = "./../data/GSE10245_family.soft.gz"
gse = GSEs(file_path=filepath)  # create an instance of the GSEs class

gse.set_seed(575)  # set the seed for the random number generator
# Note: predict() automatically archives the results, and train_model() automatically resets all the parameters.
# The following are three examples of training on different subsets of the data.
gse.train_model(n_clusters=2, train_frac=0.5).predict(testOnTrain=True).accuracy  # get the model accuracy
gse.train_model(n_clusters=2, train_frac=0.5).predict(testOnTrain=True).accuracy
gse.train_model(n_clusters=2, train_frac=0.5).predict().accuracy

# The DataFrame of sample labels, cluster numbers, and subtypes (AD or SCC) can be accessed through the following attribute:
gse.accuracy_matrix
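
How gse_tools interprets `train_frac` and the seed is not documented in this README. One plausible reading is a seeded random split of the samples into a training and a held-out part, sketched below with the standard library. The function `split_samples` and the placeholder sample ids are hypothetical and are not part of gse_tools.

```python
import random


def split_samples(sample_ids, train_frac, seed):
    """Shuffle sample ids with a seeded RNG and split them into (train, test)."""
    if not 0.0 <= train_frac <= 1.0:
        raise ValueError("train_frac must lie between 0 and 1")
    rng = random.Random(seed)  # same seed, same split
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


# 58 placeholder ids standing in for the 58 tumors.
ids = [f"sample_{i}" for i in range(58)]
train, test = split_samples(ids, train_frac=0.5, seed=575)
print(len(train), len(test))
```

Under this reading, repeating a call with the same seed reproduces the same split, while changing the seed gives a different subset.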

The bar plot of the model accuracy can be plotted using the following code:

gse.plot_accuracy()

[Figure: example bar plot]

The scatter plot of the clustering results (PCA) can be plotted using the following code:

gse.plot_cluster()

[Figure: example PCA plot]

Contributors

groyumich, kylexyx0930, quanevans
