deepore's Introduction

Deepore: Deep learning for base calling MinION reads

The MinION device by Oxford Nanopore Technologies (ONT) is the first portable USB sequencing device, and it promises to play a unique part in the future of DNA sequencing.

Not only is it portable, but the underlying technology can also produce long reads (up to about 1 Mb), compared with the current status quo of short reads (100 to 300 bp).

However, it suffers from a high sequencing error rate.

The objective of this project is to apply deep neural network models to improve the base calling procedure. Initial base callers were based on hidden Markov models (HMMs); however, several deep neural network implementations have already been published: DeepNano (RNN) (Boža et al. 2017) and Chiron (CNN + RNN) (Teng et al. 2017).

The problem of base calling in computational biology runs parallel to machine translation in natural language processing (NLP) as both fields attempt to translate one sequence to another sequence.

Hence, we can try to cross-pollinate methods between the two fields and see what results this experiment gives.

git clone --recursive https://github.com/etheleon/deepore.git

Docker Container

We use NVIDIA's customised Docker wrapper, nvidia-docker.

The image is based on the 8.0-cudnn6-runtime-ubuntu16.04 tag, with TensorFlow 1.3.0 and Python 2.7.

We modified the Docker setup from https://github.com/anurag/fastai-course-1.git

To start the container:

nvidia-docker run -it \
    --entrypoint /bin/zsh \
    -v /data/nanopore/new/fast5Dir/:/data \
    --name nanopore \
    -w /home/docker \
    -p 8889:8888 \
    etheleon/chiron

To start a new shell in an existing, running container:

containername="awesome_benz"
nvidia-docker exec -it $containername /bin/zsh

To train the model (E. coli):

  1. run preprocessing first
  2. run chiron_rcnn_train.py

Before training, remember to check two things:

  1. the raw file directory, which must contain the .signal and .label files
  2. the logs directory, which by default points to /home/docker/out/logs. Back up the contents of this folder before training a new model, otherwise the checkpoint data will be overwritten.
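The two checks above can be sketched as a small pre-flight script. This is only an illustration and is not part of the repository: the function names `find_unpaired` and `backup_logs`, and the backup location `/home/docker/out/log_backups`, are hypothetical. It assumes each read is stored as one `.signal` file and one `.label` file sharing the same file stem.

```python
import os
import shutil
import time


def find_unpaired(raw_dir):
    """Return file stems that have a .signal but no .label file, or vice versa."""
    signals, labels = set(), set()
    for name in os.listdir(raw_dir):
        stem, ext = os.path.splitext(name)
        if ext == ".signal":
            signals.add(stem)
        elif ext == ".label":
            labels.add(stem)
    # Symmetric difference: stems present in exactly one of the two sets.
    return sorted(signals ^ labels)


def backup_logs(log_dir, backup_root):
    """Copy log_dir into a timestamped folder under backup_root.

    Returns the new folder path, or None if log_dir does not exist yet.
    """
    if not os.path.isdir(log_dir):
        return None
    dest = os.path.join(backup_root, "logs-" + time.strftime("%Y%m%d-%H%M%S"))
    shutil.copytree(log_dir, dest)  # dest must not exist; parents are created
    return dest


if __name__ == "__main__":
    raw_dir = os.path.expanduser("~/ecoli/data/ecoli_raw")
    if os.path.isdir(raw_dir):
        print("unpaired reads: {}".format(find_unpaired(raw_dir)))
    # Hypothetical backup location; pick any folder outside the logs directory.
    print(backup_logs("/home/docker/out/logs", "/home/docker/out/log_backups"))
```

An empty list from `find_unpaired` means every read has both files; any stem it reports is worth inspecting before training starts.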

For the ecoli dataset, the raw files are in /home/docker/ecoli/data/ecoli_raw

➜  deepore git:(master) ✗ ls ~/ecoli/data/ecoli_raw | head
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read381_strand1.label
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read381_strand1.signal
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read423_strand.label
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read423_strand.signal

To launch training on a chosen GPU:

export CUDA_VISIBLE_DEVICES="1"
newChiron="/path/2/new/chiron"  # placeholder: replace with the path to your chiron checkout
python "$newChiron/chiron/chiron_rcnn_train.py"

To run the original Chiron, use the 8.0-cudnn5-runtime-ubuntu16.04 tag instead, since TensorFlow 1.0.1 relies on cuDNN 5.

Training data: Ecoli

Reference sequence NC_000913

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.fna

  1. E. coli reads in fast5 format from Nick Loman's lab [need citation]

Validation data: Zika

Reference sequence NC_012532.1

The reads are amplicons from 36 primers designed to cover the whole Zika genome (Quick et al. 2017).

Additional data

  1. A subset of 254 reads from the human genome (chromosome 12 part 9; Chiron used chromosome 23 part 3) from the nanopore WGS consortium [need citation]

Preprocessing

1. Resquiggling

Starting from the sequence produced by the proprietary base caller, we align each read to the reference sequence NC_000913 to correct base calling errors.

bash ./preprocessing/resquiggle.sh

Remember to edit the variables in resquiggle.sh before running it.

Dataset | # reads | Failed alignment | Reference sequence
Ecoli   | 164472  | 171              | NC_000913
Zika    | 9608    |                  | NC_012532
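As a quick sanity check on the table, the share of Ecoli reads lost at the alignment step can be worked out directly from the two counts. The helper name `failure_rate` is ours and purely illustrative; no figure is computed for Zika because the table gives no failed alignment count for it.

```python
def failure_rate(n_reads, n_failed):
    """Fraction of reads that failed to align to the reference."""
    return float(n_failed) / n_reads


ecoli_reads, ecoli_failed = 164472, 171
ecoli_rate = failure_rate(ecoli_reads, ecoli_failed)

# Reads that remain usable after resquiggling.
ecoli_kept = ecoli_reads - ecoli_failed

print("Ecoli: {:.3%} failed, {} reads kept".format(ecoli_rate, ecoli_kept))
# prints "Ecoli: 0.104% failed, 164301 reads kept"
```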

2. Extracting the raw signal

bash ./preprocessing/runraw.sh
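Once the raw signal has been extracted, it can help to inspect a .signal/.label pair by hand. The sketch below is not taken from runraw.sh. It assumes the Chiron-style layout, which should be checked against your own files: a .signal file holds whitespace-separated integer current values, and each .label line holds a start index, an end index (treated here as exclusive) and a base. The function names are hypothetical.

```python
def read_signal(path):
    """Read whitespace-separated raw current values as a list of ints."""
    with open(path) as handle:
        return [int(value) for value in handle.read().split()]


def read_label(path):
    """Read (start, end, base) rows, one per line; skip malformed lines."""
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # blank or unexpected line
            rows.append((int(fields[0]), int(fields[1]), fields[2]))
    return rows


def signal_per_base(signal, rows):
    """Pair each labelled base with the slice of signal it covers."""
    return [(base, signal[start:end]) for start, end, base in rows]
```

Calling `signal_per_base(read_signal(p + ".signal"), read_label(p + ".label"))` for a file stem `p` gives one (base, samples) pair per labelled base, which makes it easy to spot empty or overlapping segments.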

References

Boža, V, Brejová, B, Vinař, T (2017). DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLoS ONE, 12, 6:e0178751.

Quick, J., Grubaugh, N. D., Pullan, S. T., Claro, I. M., Smith, A. D., Gangavarapu, K., … (2017). Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nature Protocols, 12(6). http://doi.org/10.1038/nprot.2017.066

Teng, H, Hall, M B, Duarte, T, Cao, M D, Coin, L (2017). Chiron: Translating nanopore raw signal directly into nucleotide sequence using deep learning. bioRxiv.

deepore's People

Contributors

etheleon, neoanarika

deepore's Issues

Logs

Include options to remove or add logs

Entry.py

Entry.py cannot load our models. We need to find out why and how to resolve this.

SRU + seq2seq

The implementation is rife with dimension incompatibilities. There are two reasons for this:

  1. I don't yet fully understand how the inputs and outputs work for seq2seq and SRU, which is why the implementation is so hard. More experimentation is needed, perhaps by first building seq2seq in a toy setup.
  2. The SRU computation may be incompatible with TensorFlow's built-in seq2seq.

SRU

Implementing SRU first because it is easier and will speed up training.
