deepchainbio / bio-transformers

bio-transformers is a wrapper on top of the ESM/ProtBert models, which are trained on millions of proteins and used to compute embeddings.

Home Page: https://bio-transformers.readthedocs.io/en/latest/getting_started/install.html

License: Apache License 2.0

Languages: Python 98.86%, Dockerfile 1.14%
Topics: artificial-intelligence, bioinformatics, embeddings, transformer

bio-transformers's People

Contributors

adsodemelk, ahmed-dj, delfosseaurelien, hippolytej, jbsevestre, johnhunter, kevineloff, martinp7, ranzentom, theomeb

bio-transformers's Issues

Get probabilities dicts

Add a function to allow the user to get dictionaries of probabilities over the amino acids at each position in the sequence.
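
For illustration, a minimal sketch of what such per-position dictionaries could look like, assuming per-position logits from the backend model are available; the alphabet and tensor shapes below are placeholders, not the library's API:

import torch

# Hypothetical inputs: per-position logits of shape (seq_len, vocab_size)
# and the amino-acid alphabet assumed for the backend model.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")
logits = torch.randn(120, len(ALPHABET))  # placeholder values

# Softmax over the vocabulary axis gives one probability distribution per position.
probs = torch.softmax(logits, dim=-1)

# One dict per position, mapping each amino acid to its probability.
probability_dicts = [
    {aa: float(p) for aa, p in zip(ALPHABET, position)}
    for position in probs
]
print(probability_dicts[0])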

Potential overlap

Hi folks :)

Looks like a great resource! Funnily enough, to be on-topic: it seems like an instance of convergent evolution :) https://github.com/sacdallago/bio_embeddings

I see that you are going in some interesting/different directions compared to what was done on the bio_embeddings side: we'd be happy to see if there are synergies and whether we can combine efforts rather than spreading them across many different fronts. I also see from this blog post that you are already way ahead of me on making the probability plots look beautiful, compared to my ugly ones :D

In any case, let's chat if wanted!

CC @konstin

add finetuning method

Add a method to finetune on a personal protein dataset, based on a backend (ProtBert/ESM), and save the trained model (see the sketch below).
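
For illustration, a minimal sketch of the requested workflow, reusing the BioTransformers constructor and finetune arguments that appear in the fine-tuning issue further down this page; the backend name, sequences, and hyperparameters are placeholders:

import ray
from biotransformers import BioTransformers

# Placeholder sequences standing in for a personal protein dataset.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MEEPQSDPSVEPPLSQETFSDLWKLLPEN",
]

ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=1)
bio_trans.finetune(
    sequences,
    lr=1.0e-5,
    epochs=5,
    save_last_checkpoint=True,  # assumed to keep the final checkpoint on disk
)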

Trouble using Multi-GPU on cluster

Hi,
I am trying to use multiple GPUs for inference. However, when I run it I encounter the following error:
Traceback (most recent call last):
File "main.py", line 10, in
embeddings = bio_trans.compute_embeddings(sequences, pool_mode=('cls','mean'))
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/wrappers/transformers_wrappers.py", line 669, in compute_embeddings
_, batch_embeddings = self._model_pass(batch_inputs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/wrappers/esm_wrappers.py", line 141, in _model_pass
repr_layers=[self.repr_layers],
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'tokens'

Can you help me look at this issue please?
Thanks

Typo in docstring of get_batch_indices defined in biotransformers/lightning_utils/data.py

In the docstring of the get_batch_indices function there is a typo in the example provided:

    Example:
        returning [[(1, 100), (3, 600)],[(4, 100), (7, 1200), (10, 600)], [(12, 1000)]]
        means that the first batch  will be composed of sequence at index 1 and 8 with
        lengths 100 and  600. The third batch contains only sequence 12 with a length
        of 1000.

The 8 should be replaced by a 3.

add multi-gpu support

Allow the use of torch multi-GPU for inference.
Use nn.DataParallel or ray to run inference on multiple batches in parallel (see the sketch below).
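
For illustration, a minimal sketch of the nn.DataParallel route, assuming a generic torch module and a batch of pre-tokenized inputs; none of this is the library's actual wrapper code:

import torch
import torch.nn as nn

# Illustrative stand-in for a transformer backend: any nn.Module works here.
model = nn.Sequential(nn.Embedding(33, 64), nn.Linear(64, 64))

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and splits the batch across them.
    model = nn.DataParallel(model)
model = model.to(device)

# Placeholder batch of token ids: (batch_size, sequence_length).
tokens = torch.randint(0, 33, (8, 128), device=device)
with torch.no_grad():
    embeddings = model(tokens)  # each replica processes a slice of the batch
print(embeddings.shape)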

Cannot import properly after newest version installation

Hi,
I installed the newest version of bio-transformers, and when I import it I get this error:

Traceback (most recent call last):
File "main.py", line 1, in <module>
from biotransformers import BioTransformers
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/__init__.py", line 1, in <module>
from biotransformers.bio_transformers import BioTransformers # noqa
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/bio_transformers.py", line 5, in <module>
from biotransformers.wrappers.esm_wrappers import ESMWrapper
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/wrappers/esm_wrappers.py", line 10, in <module>
from biotransformers.lightning_utils.data import AlphabetDataLoader
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/lightning_utils/data.py", line 11, in <module>
from torch._six import int_classes as _int_classes
ImportError: cannot import name 'int_classes' from 'torch._six' (/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/_six.py)
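
For context, a commonly used workaround when torch._six no longer exposes int_classes (it was removed in recent PyTorch releases) is to fall back to the builtin int; this is a suggestion, not necessarily the fix the maintainers adopted:

# In newer PyTorch versions torch._six.int_classes was removed,
# so guard the import and fall back to the builtin int type.
try:
    from torch._six import int_classes as _int_classes
except ImportError:
    _int_classes = int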

How to use MSA transformer embeddings?

When I compute MSA embeddings with the MSA Transformer model, I get a matrix rather than a vector. How can I use these embeddings in downstream tasks? For example, with the ['mean'] option I get a matrix containing one vector per sequence in the a3m file. For a downstream task, should I use the mean over all rows, or just the first row?
Thank you.
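
For reference, a minimal numpy sketch contrasting the two options described above, assuming the returned matrix has shape (num_msa_sequences, embedding_dim); which pooling is appropriate depends on the downstream task:

import numpy as np

# Illustrative embeddings matrix: one row per sequence in the MSA.
msa_embeddings = np.random.rand(64, 768)  # (num_msa_sequences, embedding_dim)

# Option 1: average over all rows to summarise the whole alignment.
mean_embedding = msa_embeddings.mean(axis=0)  # shape (768,)

# Option 2: keep only the first row, i.e. the query sequence of the alignment.
query_embedding = msa_embeddings[0]  # shape (768,)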

Saving and Loading Fine Tuned Model

I'm not sure whether I'm saving or loading the fine-tuned model incorrectly. After fine-tuning and running an evaluation script, the accuracy before loading the model and after loading it is exactly the same. I've tried this with my own data, but as a check I attempted to replicate the example here: https://bio-transformers.readthedocs.io/en/latest/tutorial/finetuning.html

Training Script:

import biodatasets
import numpy as np
from biotransformers import BioTransformers
import ray

data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Train on small sequence
length = np.array(list(map(len, X))) < 200
train_seq = X[length][:10000]
val_seq = X[length][10000:15000]

ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=4)


bio_trans.finetune(
    train_seq,
    validation_sequences=val_seq,
    lr=1.0e-5,
    warmup_init_lr=1e-7,
    toks_per_batch=2000,
    epochs=20,
    acc_batch_size=256,
    warmup_updates=1024,
    accelerator="ddp",
    checkpoint=None,
    save_last_checkpoint=False,
    amp_level=None,
)

After running it, the logs directory is created with hparams.yaml (which is empty), metrics.csv, and a checkpoints folder containing the last checkpoint (epoch=19-step=39.ckpt).

Then I run the evaluation script:

import biodatasets
import numpy as np
from biotransformers import BioTransformers
import ray

data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Train sequence with length less than 200 AA
# Test on sequence that was not used for training.
length = np.array(list(map(len, X))) < 200
train_seq = X[length][15000:20000]


ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=4)
acc_before = bio_trans.compute_accuracy(train_seq, batch_size=32)
print(f"Accuracy before finetuning : {acc_before}")


bio_trans.load_model("logs/finetune_masked/version_0/checkpoints/epoch=19-step=39.ckpt")
acc_after = bio_trans.compute_accuracy(train_seq, batch_size=32)
print(f"Accuracy after finetuning : {acc_after}")

Which outputs:

Accuracy before finetuning : 0.3469025194644928
Accuracy after finetuning : 0.3469025194644928

Am I saving or loading incorrectly?

CUDA RUNTIME ERROR

Hi,
Thanks for updating the multi-GPU instructions. Since no CUDA toolkit is installed during the default installation, do I need to install it separately from the main bio-transformers packages? Also, when I try to run with multiple GPUs, I get this error:
RTX A6000 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the RTX A6000 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
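
As a quick diagnostic, the standard torch calls below compare the installed PyTorch build against the GPU's compute capability; this is generic PyTorch, not bio-transformers API:

import torch

print("PyTorch:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)  # None means a CPU-only build
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # An RTX A6000 reports capability (8, 6); the build must list sm_86 to use it.
    print("Device capability:", torch.cuda.get_device_capability(0))
    print("Supported archs:", torch.cuda.get_arch_list())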

Use absolute import instead of relative import

In the current code, we use relative imports such as https://github.com/DeepChainBio/bio-transformers/blob/main/biotransformers/esm_wrappers.py#L12

I believe absolute imports are recommended.

In general, absolute imports are preferred over relative imports.

https://chrisyeh96.github.io/2017/08/08/definitive-guide-python-imports.html#absolute-vs-relative-import

Thus, since our package is named bio-transformers and the source folder is biotransformers, we should import like:

from biotransformers.transformers_wrappers import (
    NATURAL_AAS,
    TransformersInferenceConfig,
    TransformersModelProperties,
    TransformersWrapper,
)

instead of

from .transformers_wrappers import (
    NATURAL_AAS,
    TransformersInferenceConfig,
    TransformersModelProperties,
    TransformersWrapper,
)

I think this is clearer. What do you think?

CONTRIBUTING - feedback / problems

While doing MR #31 I followed the CONTRIBUTING guide step by step, and I have some feedback / remarks / problems:

  1. Once the conda environment is installed, pre-commit was not available (even though it is listed in environment_dev.yaml). Running pip install pre-commit did not solve the issue, but conda install -c conda-forge pre-commit did. Maybe we can update environment_dev.yaml by moving the pre-commit requirement from pip to dependencies (and add the conda-forge channel, which is commented out for now).
  2. Once pre-commit is installed, it is advised to run pre-commit run --all-files. With the latest version of main I had several errors (see the attached screenshot).
  3. Regarding pre-commit: isort is mentioned in CONTRIBUTING.md but is not defined in .pre-commit-config.yaml (remark: mypy is used but not mentioned in CONTRIBUTING.md).
  4. To enforce the git conventions, you can add a pre-commit hook by adding the following lines to .pre-commit-config.yaml:
  - repo: https://github.com/alessandrojcm/commitlint-pre-commit-hook
    rev: v2.2.0
    hooks:
      - id: commitlint
        stages: [commit-msg]
        additional_dependencies: ["@commitlint/config-conventional"]

Some information regarding my laptop:

  • Pop OS: 19.10
  • conda: 4.8.2

Wrong formatting in CONTRIBUTING.md - Git conventions

In the Git conventions section of the CONTRIBUTING.md file, the last item is wrongly formatted (see the attached screenshot).

It should be:

  • section:
    • it is optional
    • It is an extension of the section used to add a longer description about the changes if relevant

Also, why is it named section rather than <body>?

a worker died or was killed while executing a task by an unexpected system error

I use 4 GPUs to compute MSA embeddings, but each time the process terminates, ray raises the error "a worker died or was killed while executing a task by an unexpected system error", and the GPU processes terminate one by one. I have tried several times and updated ray to the latest version, but the problem persists. How can I fix it?
Thanks!

ESM1v support

Hi,
Thank you for writing this package. Is ESM1v supported as described in the original Facebook ESM page?
Best,
Albert
