
methbert's Introduction

On the application of BERT models for nanopore methylation detection

MethBERT explores a non-recurrent modeling approach for nanopore methylation detection based on bidirectional encoder representations from transformers (BERT). Compared with the state-of-the-art model using bi-directional recurrent neural networks (RNNs), BERT offers faster model inference because it is not constrained to computing in sequential order. We provide two types of BERT: the basic one [Devlin et al.] and a refined one. The refined BERT is adapted to task-specific features, including:

  • learnable positional embedding
  • self-attention with relative position representation [Shaw et al.]
  • concatenation of the center positions for the output layer
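As a minimal PyTorch sketch of two of these refinements, a learnable positional embedding and center-position concatenation before the output layer can look like the following. The layer sizes, names, and the two-layer encoder are illustrative assumptions, not methBERT's actual implementation.

```python
import torch
import torch.nn as nn

class RefinedHead(nn.Module):
    """Illustrative sketch, not the methBERT source code."""
    def __init__(self, seq_len=21, d_model=128, n_center=3, n_classes=2):
        super().__init__()
        # learnable positional embedding: one trained vector per position
        self.pos_emb = nn.Parameter(torch.zeros(seq_len, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.n_center = n_center
        # the output layer sees the concatenated center hidden states
        self.fc = nn.Linear(n_center * d_model, n_classes)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        h = self.encoder(x + self.pos_emb)   # add learned positions
        mid = x.size(1) // 2
        lo = mid - self.n_center // 2
        center = h[:, lo:lo + self.n_center, :]   # center positions
        return self.fc(center.flatten(1))    # concatenate, then classify

logits = RefinedHead()(torch.randn(2, 21, 128))
print(logits.shape)  # torch.Size([2, 2])
```

The relative-position self-attention of [Shaw et al.] replaces the standard attention score with one that adds learned embeddings of the pairwise position offsets; it is omitted here for brevity.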

The model structures are shown in the figure above.

Installation

git clone https://github.com/yaozhong/methBERT.git
cd methBERT
pip install .

Docker environment

We also provide a Docker image for running this source code:

docker pull yaozhong/ont_methylation:0.6
  • ubuntu 16.04.10
  • Python 3.5.2
  • Pytorch 1.5.1+cu101
nvidia-docker run -it --shm-size=64G -v LOCAL_DATA_PATH:MOUNT_DATA_PATH yaozhong/ont_methylation:0.6

Training

Data sampling and split

We use reads from completely methylated data and amplicon (control) data for training models. We provide the following two choices for sampling reads:

  • random and balanced selection of reads
  • region-based selection
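The "random and balanced" option can be sketched as downsampling the larger class so that both classes contribute the same number of reads. The function and variable names below are illustrative, not the repository's actual API.

```python
import random

def balanced_sample(pos_reads, neg_reads, seed=0):
    """Randomly downsample both read sets to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(pos_reads), len(neg_reads))
    return rng.sample(pos_reads, n), rng.sample(neg_reads, n)

# e.g. 100 methylated vs. 40 control reads -> 40 of each
pos, neg = balanced_sample(list(range(100)), list(range(40)))
print(len(pos), len(neg))  # 40 40
```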
N_EPOCH=50
W_LEN=21
LR=1e-4
MODEL="biRNN_basic"
MOTIF="CG"
NUCLEOTIDE_LOC_IN_MOTIF=0
POSITIVE_SAMPLE_PATH=<methylated fast5 path>
NEGATIVE_SAMPLE_PATH=<unmethylated fast5 path>
MODEL_SAVE_FILE=<model saved path>

# training biRNN model
python3 train_biRNN.py --model ${MODEL}  --model_dir ${MODEL_SAVE_FILE} --gpu cuda:0 --epoch ${N_EPOCH} \
 --positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
 --motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR --data_balance_adjust

# training bert models
MODEL="BERT_plus"  # options: "BERT", "BERT_plus"
python3 train_bert.py --model ${MODEL}  --model_dir ${MODEL_SAVE_FILE} --gpu cuda:0 --epoch ${N_EPOCH} \
 --positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
 --motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR  --data_balance_adjust

Detection

We provide independently trained models for the 5mC and 6mA datasets of different motifs and methyltransferases in the ./trained_model folder.

MODEL="BERT_plus" 
MODEL_SAVE_PATH=<model saved path>
REF=<reference genome fasta file>
FAST5_FOLD=<fast5 files to be analyzed>
OUTPUT=<output file>

time python detect.py --model ${MODEL} --model_dir ${MODEL_SAVE_PATH} \
--gpu cuda:0  --fast5_fold ${FAST5_FOLD} --num_worker 12 \
--motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --evalMode test_mode --w_len ${W_LEN} --ref_genome ${REF} --output_file ${OUTPUT}

We generate the same output format as deepSignal (https://github.com/bioinfomaticsCSU/deepsignal).

# output example
NC_000913.3     4581829 +       4581829 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     3.0398369e-06   0       TGCGGGTCTTCGCCATACACG
NC_000913.3     4581838 +       4581838 43ea7b03-8d2b-4df3-b395-536b41872137    t       0.9999996       0.00013372302   0       TCGCCATACACGCGCTCAAAC
NC_000913.3     4581840 +       4581840 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     0.0     0       GCCATACACGCGCTCAAACGG
NC_000913.3     4581848 +       4581848 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     0.0     0       CGCGCTCAAACGGCTGCAAAT
NC_000913.3     4581862 +       4581862 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     0.0     0       TGCAAATGCTCGTCGGTAAAC
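A minimal parser for the per-site output above, assuming the deepSignal column order (chrom, pos, strand, pos_in_strand, read id, read strand, unmethylated probability, methylated probability, called label, k-mer); verify the column semantics against your deepSignal/methBERT version before relying on them.

```python
from collections import namedtuple

Site = namedtuple("Site", "chrom pos strand pos_in_strand read_id "
                          "read_strand prob_unmeth prob_meth label kmer")

def parse_line(line):
    """Parse one tab-separated detection record into a Site tuple."""
    f = line.rstrip("\n").split("\t")
    return Site(f[0], int(f[1]), f[2], int(f[3]), f[4], f[5],
                float(f[6]), float(f[7]), int(f[8]), f[9])

example = ("NC_000913.3\t4581829\t+\t4581829\t"
           "43ea7b03-8d2b-4df3-b395-536b41872137\tt\t1.0\t3.0398369e-06\t0\t"
           "TGCGGGTCTTCGCCATACACG")
site = parse_line(example)
print(site.chrom, site.pos, site.label)  # NC_000913.3 4581829 0
```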

Available benchmark dataset

We test models on 5mC and 6mA datasets sequenced with Nanopore R9 flow cells, which are commonly used as benchmark data in previous work.

The fast5 reads are expected to be pre-processed with re-squiggle (Tombo). For more detailed information on data pre-processing, please refer to methBERT.readthedoc
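For reference, a typical invocation of the Tombo re-squiggle step looks like the following; check `tombo resquiggle --help` for the options available in your Tombo version, and substitute your own paths for the placeholders.

```shell
# re-annotate raw signal against the reference before training/detection
tombo resquiggle <single-read fast5 directory> <reference genome fasta file> --processes 8
```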

Reference genome

  • E. coli: K-12 substr. MG1655
  • H.sapiens: GRCh38

Reference

This source code refers to the following GitHub projects.

methbert's People

Contributors

yaozhong

methbert's Issues

UnboundLocalError: local variable 'net' referenced before assignment

Hi yaozhong,

Thanks for your help last time. Recently, I tried to use detect.py and ran into the following problem:

python /home/yqfu/software/methBERT/methBERT/detect.py --model M_dam_gAtc --model_dir /home/yqfu/software/methBERT/trained_model/Stoiber/random_split_balanced/R9_6mA/M_dam_gAtc/BERT_plus_W21_E50_stoiber_ecoli-M_dam_gAtc_basic_lr-1e-4-128.pth  \
>  --fast5_fold /home/data/yqfu/nanopore/single_fast5/C_fast5 --num_worker 12 \
> --motif A --ref_genome /home/data/yqfu/nanopore/ref_genome/ref.fa  --output_file /home/data/yqfu/nanopore/methBERT
[+] Detecting methylation A-motif with for 0-th position [A] for nanopore fast5 data ...
/home/data/yqfu/nanopore/single_fast5/C_fast5
 |- Loading single fold data with label-[1] ...
/home/yqfu/miniconda3/lib/python3.9/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
* Loading trained model ...
Traceback (most recent call last):
  File "/home/yqfu/software/methBERT/methBERT/detect.py", line 166, in <module>
    detection_run(data_gen, device, args.model, args.model_dir, args.evalMode, False, args.ref_genome, args.output_file)
  File "/home/yqfu/software/methBERT/methBERT/detect.py", line 87, in detection_run
    net.to(device)
UnboundLocalError: local variable 'net' referenced before assignment

It seems that 'net' was never assigned.

software updating

We are currently making a major revision of methBERT.

  • The current models are trained on R9 dataset. R9.4 models will be provided when our experiment data is ready.

ModuleNotFoundError: No module named 'scrappy'

Hi there,

I followed the instructions and tried to run python3 train_biRNN.py, but I received the following error message:

Traceback (most recent call last):                                                                                                                             
  File "train_biRNN.py", line 7, in <module>                                                                                                                   
    from dataProcess.data_loading import *
  File "/hsinlun/bin/methBERT/methBERT/dataProcess/data_loading.py", line 3, in <module>
    from .ref_util import get_fast5s, sim_seq_singal, pdist
  File "hsinlun/bin/methBERT/methBERT/dataProcess/ref_util.py", line 10, in <module>
    import scrappy
ModuleNotFoundError: No module named 'scrappy'

I would appreciate any suggestions you may have for resolving this issue.

Thank you for your time and assistance.

Best regards,
Hsin

Where is the trained model?

Hi,
Thanks for the methBERT tool. You mentioned that "We provided independent trained models on each 5mC and 6mA datasets of different motifs and methyltransferases in the ./trained_model fold.", but I didn't find the folder. Any chance to provide the pre-trained BERT model for direct usage? Thanks!
