Biotrainer

Biotrainer is an open-source tool that simplifies the training of machine-learning models for biological applications. It specializes in training models that predict features of proteins. Using biotrainer is as simple as providing your sequence and label data in the correct format, along with a configuration file.

Data standardization

Biotrainer provides a number of data standards designed to ease the use of machine learning in biology. This standardization is also expected to improve communication between scientific disciplines and to help maintain an overview of the rapidly developing field of protein prediction.

Available protocols

The protocol defines how the input data is interpreted and which kind of prediction task is applied. The following protocols are already implemented:

D=embedding dimension (e.g. 1024)
B=batch dimension (e.g. 30)
L=sequence dimension (e.g. 350)
C=number of classes (e.g. 13)

- residue_to_class --> Predict a class (one of C) for each residue of a sequence of length L, with every residue encoded in D dimensions. Input BxLxD --> output BxLxC
- residues_to_class --> Predict a single class (one of C) from all residues of a sequence of length L, with every residue encoded in D dimensions. Input BxLxD --> output BxC
- residues_to_value --> Predict a single value from all residues of a sequence of length L, with every residue encoded in D dimensions. Input BxLxD --> output Bx1
- sequence_to_class --> Predict a class (one of C) for each sequence encoded in a fixed dimension D. Input BxD --> output BxC
- sequence_to_value --> Predict a value for each sequence encoded in a fixed dimension D. Input BxD --> output Bx1
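
The shape conventions in the list above can be sketched in a few lines of Python. This is only an illustration of the input and output shapes given there: the helper `expected_output_shape` is hypothetical and not part of biotrainer, and the default of 13 classes is simply the example value for C.

```python
from typing import Tuple


def expected_output_shape(
    protocol: str, input_shape: Tuple[int, ...], num_classes: int = 13
) -> Tuple[int, ...]:
    """Return the output shape implied by a protocol for a given input shape.

    Hypothetical helper for illustration only; it mirrors the list of
    protocols above and does not call into biotrainer.
    """
    if protocol in ("residue_to_class", "residues_to_class", "residues_to_value"):
        # Per-residue embeddings: batch x sequence length x embedding dimension
        batch, length, _embedding_dim = input_shape
        if protocol == "residue_to_class":
            return (batch, length, num_classes)  # one prediction per residue
        if protocol == "residues_to_class":
            return (batch, num_classes)  # one prediction per sequence
        return (batch, 1)  # one value per sequence
    if protocol in ("sequence_to_class", "sequence_to_value"):
        # Per-sequence embeddings: batch x embedding dimension
        batch, _embedding_dim = input_shape
        if protocol == "sequence_to_class":
            return (batch, num_classes)
        return (batch, 1)
    raise ValueError(f"Unknown protocol: {protocol}")


# Example values from the legend: B=30, L=350, D=1024, C=13
per_residue_out = expected_output_shape("residue_to_class", (30, 350, 1024))
per_sequence_out = expected_output_shape("sequence_to_value", (30, 1024))
```

With the example dimensions, the first call yields a BxLxC shape and the second a Bx1 shape.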

Input file standardization

For every protocol, we defined a standard for how the input data must be provided. You can find detailed information for each protocol here.

Below, we show an example of how the sequence and label files must look for the residue_to_class protocol:

sequences.fasta

>Seq1
SEQWENCE

labels.fasta

>Seq1 SET=train VALIDATION=False
DVCDVVDD
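
As a rough sanity check before training, the two files can be compared in plain Python. The sketch below assumes only what the example shows: FASTA-style headers, optional KEY=VALUE attributes after the identifier, and one label character per residue. The functions `parse_fasta` and `check_residue_labels` are hypothetical helpers, not biotrainer functions.

```python
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, str], str]  # (header attributes, sequence or label string)


def parse_fasta(text: str) -> Dict[str, Record]:
    """Parse FASTA text into {identifier: (attributes, body)}."""
    records: Dict[str, Record] = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("Header line without an identifier")
            current_id = fields[0]
            # Attributes such as SET=train or VALIDATION=False
            attributes = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            records[current_id] = (attributes, "")
        elif current_id is not None:
            attributes, body = records[current_id]
            records[current_id] = (attributes, body + line)
    return records


def check_residue_labels(
    sequences: Dict[str, Record], labels: Dict[str, Record]
) -> List[str]:
    """Return a list of problems; an empty list means the files are consistent."""
    problems = []
    for seq_id, (_, sequence) in sequences.items():
        if seq_id not in labels:
            problems.append(f"{seq_id}: no labels found")
        elif len(labels[seq_id][1]) != len(sequence):
            problems.append(f"{seq_id}: label length differs from sequence length")
    return problems


sequences = parse_fasta(">Seq1\nSEQWENCE\n")
labels = parse_fasta(">Seq1 SET=train VALIDATION=False\nDVCDVVDD\n")
problems = check_residue_labels(sequences, labels)
```

For the example files above, `problems` is empty because both entries for Seq1 contain eight characters.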

Configuration file

To run biotrainer, you need to provide a configuration file in .yaml format along with your sequence and label data. An example file for the residue_to_class protocol can be found here. All configuration options are listed here.

Example configuration for residue_to_class:

protocol: residue_to_class
sequence_file: sequences.fasta # Specify your sequence file
labels_file: labels.fasta # Specify your label file
model_choice: CNN # Model architecture 
optimizer_choice: adam # Model optimizer
learning_rate: 1e-3 # Optimizer learning rate
loss_choice: cross_entropy_loss # Loss function 
use_class_weights: True # Balance class weights by using class sample size in the given dataset
num_epochs: 200 # Number of maximum epochs
batch_size: 128 # Batch size
embedder_name: Rostlab/prot_t5_xl_uniref50 # Embedder to use
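
Before starting a long run, it can help to check a configuration for obvious mistakes such as a misspelled protocol name. The following is a minimal sketch under strong assumptions: `read_flat_config` is a hypothetical helper that only understands flat `key: value # comment` lines like those in the example and keeps every value as a string. It is not a YAML parser and not how biotrainer reads its configuration; for anything beyond this flat layout, a proper YAML library is the better choice.

```python
from typing import Dict

# Protocol names as listed under "Available protocols"
KNOWN_PROTOCOLS = {
    "residue_to_class",
    "residues_to_class",
    "residues_to_value",
    "sequence_to_class",
    "sequence_to_value",
}

EXAMPLE_CONFIG = """\
protocol: residue_to_class
sequence_file: sequences.fasta # Specify your sequence file
labels_file: labels.fasta # Specify your label file
learning_rate: 1e-3 # Optimizer learning rate
embedder_name: Rostlab/prot_t5_xl_uniref50 # Embedder to use
"""


def read_flat_config(text: str) -> Dict[str, str]:
    """Read flat 'key: value # comment' lines into a dict of strings."""
    config: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split(" #", 1)[0].strip()  # drop a trailing comment
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"Not a 'key: value' line: {raw_line!r}")
        config[key.strip()] = value.strip()
    return config


config = read_flat_config(EXAMPLE_CONFIG)
if config.get("protocol") not in KNOWN_PROTOCOLS:
    raise ValueError(f"Unknown protocol: {config.get('protocol')}")
learning_rate = float(config["learning_rate"])  # values stay strings until converted
```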

(Bio-)Embeddings

To convert sequence data into more meaningful input for a model, embeddings generated by protein language models (pLMs) have become widely used in recent years. Hence, biotrainer can automatically calculate embeddings on a per-sequence or per-residue level, depending on the protocol. Take a look at the embeddings options to find out about all available embedding methods. It is also possible to provide your own embeddings file created with your own embedder, independent of the provided calculation pipeline. Please refer to the data standardization document and the relevant examples to learn how to do this. Pre-calculated embeddings can be used for training via the embeddings_file parameter, as described in the configuration options.
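
To make the difference between the two levels concrete: a per-residue embedding has shape LxD, while a per-sequence embedding has the fixed shape D. One common way to obtain the latter from the former is to average over the residues. The sketch below illustrates that idea with plain Python lists; `mean_pool` is a hypothetical helper, and whether biotrainer reduces embeddings in exactly this way is an assumption that should be checked against the embeddings options.

```python
from typing import List


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Average an L x D per-residue embedding into one D-dimensional vector."""
    if not per_residue:
        raise ValueError("Cannot pool an empty embedding")
    length = len(per_residue)
    dim = len(per_residue[0])
    if any(len(row) != dim for row in per_residue):
        raise ValueError("All residues must have the same embedding dimension")
    return [sum(row[d] for row in per_residue) / length for d in range(dim)]


# Toy embedding with L=2 residues and D=3 dimensions
toy_embedding = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
pooled = mean_pool(toy_embedding)  # one vector of length D
```

The pooled vector no longer depends on the sequence length, which is what the sequence_to_class and sequence_to_value protocols expect as input.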

Installation

  1. Make sure you have poetry installed:
curl -sSL https://install.python-poetry.org/ | python3 -
  2. Install dependencies and biotrainer via poetry:
# In the base directory:
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter

Running

cd examples/residue_to_class
poetry run biotrainer config.yml

You can also use the provided run-biotrainer.py file for development and debugging (you might want to set up your IDE to directly execute run-biotrainer.py with the provided virtual environment):

# residue_to_class
poetry run python3 run-biotrainer.py examples/residue_to_class/config.yml
# sequence_to_class
poetry run python3 run-biotrainer.py examples/sequence_to_class/config.yml

Docker

# Build
docker build -t biotrainer .
# Run
docker run --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    biotrainer:latest /mnt/config.yml

After the run, the output can be found in the directory of the provided configuration file.

Citation

If you are using biotrainer for your work, please add a citation:

@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}
