scRNAGAN

This repository implements AC-GAN with both the classical GAN and the WGAN-GP loss functions. I created it to train GAN models on single-cell RNA-seq datasets and evaluate their performance. The AC-GAN modification makes use of cell-type labels both during training and when generating novel samples, so the generator can be asked to produce cells of a specific cell type.
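As a rough illustration (not the exact code in this repository), AC-GAN conditioning amounts to concatenating the noise vector with a one-hot cell-type label before feeding it to the generator, so sampling a specific cell type is just a matter of fixing that label:

import numpy as np

# Hypothetical sketch; names and shapes are illustrative, not this repo's API.
z_dim, n_classes, mb_size = 100, 2, 40          # matches "z_dim": [100] in the config below

z = np.random.normal(size=(mb_size, z_dim))     # input noise
labels = np.eye(n_classes)[np.random.randint(n_classes, size=mb_size)]  # one-hot cell types

g_input = np.concatenate([z, labels], axis=1)   # the generator sees noise + desired cell type
# generated = generator(g_input)                # the generator then outputs expression profiles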

For evaluation I used PCA plots, t-SNE plots, and a measure that compares the marker gene expression of generated samples with that of real samples.
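To give an idea of what such an evaluation looks like, here is a minimal PCA sketch, assuming real and generated expression matrices stored as (samples x genes) npy files (the generated-samples file name is a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

real = np.load('train.npy')                     # real expression matrix
generated = np.load('generated_samples.npy')    # placeholder name for logged samples

pca = PCA(n_components=2).fit(real)             # fit on the real cells only
real_2d, gen_2d = pca.transform(real), pca.transform(generated)

plt.scatter(real_2d[:, 0], real_2d[:, 1], s=5, label='real')
plt.scatter(gen_2d[:, 0], gen_2d[:, 1], s=5, label='generated')
plt.legend()
plt.savefig('pca_real_vs_generated.png')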

The repository also provides a systematic way to run a grid search over the hyperparameter space. Models and generated samples are logged at regular epoch intervals so they can be evaluated during hyperparameter tuning.

Creating and running an experiment using script.py:

Specify the hyperparameters and the location of the dataset in script.py.

Example:

config = {

    "data_path": ["/home/halilbilgin/data/alphabeta_joint_500/"], # location of train.npy and train_labels.npy
    "log_transformation": [0],                                     # whether to log-transform the data before training
    "scaling": ["minmax"],                                         # "minmax", "standard" (z-score) or "none" scaling before training
    "d_hidden_layers": [[1500, 200], [120, 450]],                  # discriminator's hidden layer architecture
    "g_hidden_layers": [[1000], [250, 500]],                       # generator's hidden layer architecture
    "activation_function": ["tanh"],                               # activation function used in the hidden layers
    "leaky_param": [0.1],                                          # alpha parameter if leaky ReLU is used
    "learning_rate": [0.00001, 0.0001, 0.001, 0.01],               # learning rate
    "learning_schedule": ['no_schedule', 'search_then_converge'],  # learning rate schedule
    "optimizer": ['Adam', 'Adadelta'],                             # optimizer
    "wgan": [1],                                                   # 1 for the WGAN-GP loss, 0 for the classical GAN loss
    "z_dim": [100],                                                # dimension of the input noise given to the generator
    "mb_size": [40, 60, 80, 100],                                  # minibatch size
    "d_dropout": [0],                                              # dropout applied to the discriminator's hidden layers
    "g_dropout": [0],                                              # dropout applied to the generator's hidden layers
    "label_noise": [0]                                             # amount of label noise (see https://github.com/soumith/ganhacks)

}
..
..
..
args.epochs = 30                                                   # total number of epochs
args.log_sample_freq = 5                                           # how often to log generated samples (e.g. every 5 epochs)
args.log_sample_size = 500                                         # how many samples to generate from the generator (e.g. 500)
.
.
.
repeat = 3                                                         # how many times to repeat the same experiment

Then, run

python script.py --exp_dir /vol1/ibrahim/out/gene_500_batchsize
or
sbatch script.py --exp_dir /vol1/ibrahim/out/gene_500_batchsize # if you use Slurm

where --exp_dir specifies the directory in which the experiment will be stored.

When the experiment has run to completion, you can execute:

python analysis.py --exp_dir /vol1/ibrahim/out/gene_500_batchsize

to generate PCA plots, marker plots, and index scores. All the results will be stored in /vol1/ibrahim/out/gene_500_batchsize/analysis/

config allows grid search: every list entry is a value to try, so setting "mb_size": [40, 60, 80, 100] trains a separate model for each minibatch size. In the grid search configuration above, script.py creates 256 different hyperparameter combinations (2 choices each for d_hidden_layers, g_hidden_layers, learning_schedule and optimizer, and 4 choices each for learning_rate and mb_size: 2 * 2 * 2 * 2 * 4 * 4 = 256).
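The expansion is simply a Cartesian product over the config lists; a minimal sketch of the idea (not the repository's actual implementation):

import itertools

grid = {
    "d_hidden_layers": [[1500, 200], [120, 450]],
    "g_hidden_layers": [[1000], [250, 500]],
    "learning_rate": [0.00001, 0.0001, 0.001, 0.01],
    "learning_schedule": ['no_schedule', 'search_then_converge'],
    "optimizer": ['Adam', 'Adadelta'],
    "mb_size": [40, 60, 80, 100],
}

keys, values = zip(*grid.items())
combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]
print(len(combinations))  # 2 * 2 * 4 * 2 * 2 * 4 = 256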

Each hyperparameter combination created by the grid search is stored in a subfolder named with a randomly generated ID. For example, if --exp_dir is /vol1/ibrahim/out/gene_500_batchsize, then a hyperparameter combination could be stored in the folder /vol1/ibrahim/out/gene_500_batchsize/h_bn_BDBSUMVLWQ.

The ID and hyperparameters of each combination are stored in a config.json inside its subfolder. In addition, analysis.py creates a results.csv that stores, for each experiment, its ID, hyperparameters, and index scores.
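For instance, assuming results.csv contains one row per experiment ID, you could rank configurations by score with pandas (the path and column name here are guesses; check the actual header):

import pandas as pd

# Path and column name are assumptions based on the description above.
results = pd.read_csv('/vol1/ibrahim/out/gene_500_batchsize/analysis/results.csv')
print(results.sort_values('index_score', ascending=False).head())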

Dataset

  1. train.npy
  2. train_labels.npy
  3. class_details.csv

train.npy -> the training dataset, saved as either an npy or an rds file. If it is rds, change args.IO to 'rds' in script.py. Rows are samples, columns are features.

train_labels.npy -> allows classes to be used both in training and analysis. Columns represent labels in one-hot vector format.

class_details.csv -> columns are: class name, marker gene, and marker id (the marker's column number in train.npy).
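A minimal sketch of preparing these three files with NumPy (the expression values, class names, and marker genes below are toy examples):

import numpy as np

n_cells, n_genes, n_classes = 500, 100, 2

expression = np.random.rand(n_cells, n_genes)   # rows = samples, columns = features
class_idx = np.random.randint(n_classes, size=n_cells)
one_hot = np.eye(n_classes)[class_idx]          # columns = labels, one-hot format

np.save('train.npy', expression)
np.save('train_labels.npy', one_hot)

# class name, marker gene, marker id (column number in train.npy)
with open('class_details.csv', 'w') as f:
    f.write('alpha,GCG,0\n')
    f.write('beta,INS,1\n')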

A dataset I used in the presentation is located in data/alpha_beta_joint_500

Reproducing the experiments shown in the slides

Below is the command-line way of creating and running an experiment whose hyperparameter configuration is saved as a JSON file.

You can access the files used in my presentation from the out folder of this repo.

In out/slides_3layers/exp.json, replace the data path with the absolute data path on your computer, and use that absolute path in the commands below as well.

python create_experiments.py -cfg /home/.../scRNAGAN/out/slides_3layers/exp.json -epath /home/.../scRNAGAN/out/slides_3layers/
python train_all.py -epath /home/.../scRNAGAN/out/slides_3layers/ -repeat 4 -epochs 30 -l_freq 5 -l_size 500
python analysis.py -exp_dir /home/.../scRNAGAN/out/slides_3layers/

Running differential gene expression analysis on generated samples after training the model

from libraries.analysis import Analysis
analysis = Analysis('/home/.../scRNAGAN/out/slides_3layers/h_bn_BDBSUMVLWQ')

results = analysis.differential_gene_expression(20)

The results variable stores the percentage of differentially expressed genes for each cell type, computed from the samples generated at the 20th epoch.
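Assuming results maps cell types to these percentages (the exact return type is not documented here), you could inspect it like so:

for cell_type, pct in results.items():
    print(cell_type, pct)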

Prerequisites

What you need to install to run the software:

Matplotlib

NumPy

TensorFlow >= 1.3

scikit-learn

enum34 (for Python 2.7)

rpy2 (optional, for differential gene expression analysis)
