Here is a repo to pretrain Mistral large language model for DNA sequences. Here the Mixtral model (https://huggingface.co/mistralai/Mistral-7B-v0.1) was modified to significantly reduce the number of parameters mostly by removing layers, such that it could be trained on a GPU such as an RTX3090.
If you have an Nvidia GPU, then you must install CUDA and cuDNN libraries. See:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
https://developer.nvidia.com/cudnn
Be aware that you should check the compatibility between your graphic card and the versions of CUDA and cuDNN you want to install.
This is a bit tricky and time consuming!
To know the version of your NVIDIA driver (if you use an NVIDIA GPU) and the CUDA version, you can type:
nvidia-smi
The versions that were used here were :
- NVIDIA-SMI 535.129.03
- Driver Version: 535.129.03
- CUDA Version: 12.2
The models were developed with python and transformers.
Before installing python packages, you need to install python3 (>=3.10.12) (if you don't have it):
sudo apt update
sudo apt install python3-dev python3-pip python3-venv
Make mistral-dna environment:
conda create -n mitral-dna python=3.8
conda activate mitral-dna
To install pytorch:
pip install torch>=1.13.0
Other python packages need to be installed:
pip install transformers>=4.37.0.dev0 numpy>=1.24.4 pandas>=1.4.4 sklearn==0.0 datasets>=2.14.4 peft>=0.7.2.dev0
pip install flash-attn==0.2.4
pip install accelerate>=0.21.0
To generate the data, you need to first install R packages using the following command (to type inside R):
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
BiocManager::install("GenomicRanges")
BiocManager::install("Biostrings")
If you want to pretrain the model using the whole human genome, first use the R script:
- scriptR/script_generate_dna_sequences.R to generate the DNA sequences \
You will obtain the following file (too large to be stored on github):
- data/genome_sequences/hg38/sequences_hg38_200b.csv.gz (100% of the human genome)
Alternatively, you can skip this step and use smaller files stored on github:
- data/genome_sequences/hg38/sequences_hg38_200b_small.csv.gz (10% of the human genome)
- data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz (1% of the human genome)
Second, in the python folder "scriptPython/", you'll find the jupyter notebook:
- script_pretrain_mistral-dna.ipynb to pretrain Mixtral model on DNA sequences. \
Select the data you want to pretrain the model on (full data, small data and very small data).