mistral-dna's Introduction

Mistral-DNA: Mistral large language model for DNA sequences

Overview

This repository contains code to pretrain a Mistral large language model on DNA sequences. The Mixtral model (https://huggingface.co/mistralai/Mistral-7B-v0.1) was modified to substantially reduce the number of parameters, mostly by removing layers, so that it can be trained on a GPU such as an RTX 3090.
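To give a feel for why removing layers shrinks the model so much, here is a rough sketch that counts the weights of a dense Mistral-style decoder. The Mistral-7B-v0.1 sizes come from its published configuration; the 8-layer variant is purely hypothetical and is not the configuration used in this repo. A mixture-of-experts model such as Mixtral has several MLP copies per layer plus a router, which this sketch does not count.

```python
def count_decoder_params(vocab_size, hidden_size, intermediate_size,
                         num_layers, num_heads, num_kv_heads):
    """Count weights of a dense Mistral-style decoder (no biases, untied embeddings)."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim
    # q and o projections are hidden x hidden; k and v are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate, up and down projections of the MLP.
    mlp = 3 * hidden_size * intermediate_size
    # two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms
    # input embedding matrix plus output head.
    embeddings = 2 * vocab_size * hidden_size
    # the trailing hidden_size term is the final RMSNorm.
    return num_layers * per_layer + embeddings + hidden_size


MISTRAL_7B = dict(vocab_size=32000, hidden_size=4096, intermediate_size=14336,
                  num_layers=32, num_heads=32, num_kv_heads=8)

# Hypothetical reduced model: same widths, only the layer count is cut.
FEWER_LAYERS = dict(MISTRAL_7B, num_layers=8)

print(f"32 layers: {count_decoder_params(**MISTRAL_7B):,} parameters")
print(f" 8 layers: {count_decoder_params(**FEWER_LAYERS):,} parameters")
```

Because almost all weights sit inside the repeated decoder layers, the total falls nearly in proportion to the number of layers kept.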

Requirements

If you have an NVIDIA GPU, you must install the CUDA and cuDNN libraries. See:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
https://developer.nvidia.com/cudnn
Check that your graphics card is compatible with the versions of CUDA and cuDNN you plan to install. This can be tricky and time-consuming!

To find the version of your NVIDIA driver and the CUDA version it supports, run:

nvidia-smi

The versions used here were:

  • NVIDIA-SMI 535.129.03
  • Driver Version: 535.129.03
  • CUDA Version: 12.2
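If you prefer to check these versions from Python, the following is a minimal sketch that runs nvidia-smi and reads the driver and CUDA versions from its banner. The function names are illustrative and not part of this repo, and the parsing assumes the banner format shown above.

```python
import re
import shutil
import subprocess


def parse_nvidia_smi_header(text):
    """Extract the driver and CUDA versions from nvidia-smi's banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


def query_gpu_versions():
    """Run nvidia-smi if it is on PATH; return None when it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True,
                            text=True, check=False)
    return parse_nvidia_smi_header(result.stdout)


if __name__ == "__main__":
    print(query_gpu_versions())
```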

The models were developed with Python and the transformers library.

Before installing the Python packages, install python3 (>=3.10.12) if you don't already have it:

sudo apt update
sudo apt install python3-dev python3-pip python3-venv

Create the mistral-dna environment:

conda create -n mistral-dna python=3.10
conda activate mistral-dna

To install PyTorch (the quotes stop the shell from treating ">=" as a redirection):

pip install "torch>=1.13.0"

Then install the other Python packages:

pip install "transformers>=4.37.0.dev0" "numpy>=1.24.4" "pandas>=1.4.4" scikit-learn "datasets>=2.14.4" "peft>=0.7.2.dev0"
pip install "flash-attn==0.2.4"
pip install "accelerate>=0.21.0"
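After installation, a quick way to see which versions actually ended up in the environment is to query the package metadata. This is a small sketch using only the standard library; the package list mirrors the commands above, and the helper name is illustrative.

```python
from importlib import metadata

# Distribution names as published on PyPI.
REQUIRED = ["torch", "transformers", "numpy", "pandas", "scikit-learn",
            "datasets", "peft", "flash-attn", "accelerate"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```

Compare the printed versions against the minimum versions listed above.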

To generate the data, first install the required R packages by running the following commands inside R:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
BiocManager::install("GenomicRanges")
BiocManager::install("Biostrings")

Generate the data to pretrain the model

If you want to pretrain the model using the whole human genome, first use the R script:

  • scriptR/script_generate_dna_sequences.R to generate the DNA sequences

You will obtain the following file (too large to be stored on GitHub):

  • data/genome_sequences/hg38/sequences_hg38_200b.csv.gz (100% of the human genome)
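The "200b" in the file name suggests that the genome is cut into sequences of 200 bases. As an illustration only, and assuming non-overlapping windows, the idea can be sketched in Python as follows. This is not the logic of the R script, which may handle overlaps, unknown bases or chromosome ends differently.

```python
def split_into_windows(sequence, window=200):
    """Yield consecutive, non-overlapping windows of `window` bases.

    A trailing remainder shorter than `window` is dropped.
    """
    for start in range(0, len(sequence) - window + 1, window):
        yield sequence[start:start + window]


# Toy example with a short window so the result is easy to read.
toy_sequence = "ACGTACGTACGT"
print(list(split_into_windows(toy_sequence, window=5)))
```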

Alternatively, you can skip this step and use the smaller files stored on GitHub:

  • data/genome_sequences/hg38/sequences_hg38_200b_small.csv.gz (10% of the human genome)
  • data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz (1% of the human genome)
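Before training, it can help to look at the first few rows of one of these files. The sketch below reads a gzip-compressed CSV with the standard library only. It makes no assumption about the column layout, which is not documented here, and the helper name is illustrative.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(path, n_rows=5):
    """Return the first n_rows rows of a gzip-compressed CSV as lists of strings."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(islice(csv.reader(handle), n_rows))
```

For example, peek_csv_gz("data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz") returns the first five rows of the smallest file, assuming you run it from the repository root.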

Pretraining the model

Second, in the Python folder "scriptPython/", you'll find the Jupyter notebook:

  • script_pretrain_mistral-dna.ipynb to pretrain the Mixtral model on DNA sequences.

Select the data you want to pretrain the model on (full data, small data or very small data).
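One simple way to make this choice explicit is a small lookup from a dataset label to its file path. The labels "full", "small" and "verysmall" and the helper below are illustrative assumptions; the notebook may select the file differently. The paths are the ones listed in the data section.

```python
from pathlib import Path

DATA_DIR = Path("data/genome_sequences/hg38")

# Hypothetical labels mapped to the file names listed in this README.
DATASETS = {
    "full": "sequences_hg38_200b.csv.gz",            # 100% of the human genome
    "small": "sequences_hg38_200b_small.csv.gz",      # 10% of the human genome
    "verysmall": "sequences_hg38_200b_verysmall.csv.gz",  # 1% of the human genome
}


def dataset_path(choice):
    """Return the path of the chosen dataset, or raise ValueError if unknown."""
    if choice not in DATASETS:
        raise ValueError(
            f"unknown dataset {choice!r}; expected one of {sorted(DATASETS)}"
        )
    return DATA_DIR / DATASETS[choice]


print(dataset_path("verysmall"))
```

Starting with the very small file is a cheap way to check that the whole pipeline runs before committing GPU time to the full genome.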

Contact:

[email protected]

mistral-dna's People

Contributors

raphaelmourad
