deepchainbio / bio-transformers

bio-transformers is a wrapper on top of the ESM/ProtBert models, which are trained on millions of proteins and used to compute embeddings.

Home Page: https://bio-transformers.readthedocs.io/en/latest/getting_started/install.html

License: Apache License 2.0

Languages: Python 98.86%, Dockerfile 1.14%
Topics: artificial-intelligence, bioinformatics, embeddings, transformer

bio-transformers's People

Contributors

adsodemelk, ahmed-dj, delfosseaurelien, hippolytej, jbsevestre, johnhunter, kevineloff, martinp7, ranzentom, theomeb

bio-transformers's Issues

Get probabilities dicts

Add a function to allow the user to get dictionaries of probabilities over the amino acids at each position in the sequence.
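
For illustration, a minimal sketch of what such per-position dictionaries could look like, assuming per-position logits from the backend model are available; the alphabet and tensor shapes below are placeholders, not the library's API:

import torch

# Hypothetical inputs: per-position logits of shape (seq_len, vocab_size)
# and the amino-acid alphabet assumed for the backend model.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")
logits = torch.randn(120, len(ALPHABET))  # placeholder values

# Softmax over the vocabulary axis gives one probability distribution per position.
probs = torch.softmax(logits, dim=-1)

# One dict per position, mapping each amino acid to its probability.
probability_dicts = [
    {aa: float(p) for aa, p in zip(ALPHABET, position)}
    for position in probs
]
print(probability_dicts[0])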

Potential overlap

Hi folks :)

Looks like a great resource! Funnily enough, to be on-topic: it seems like an instance of convergent evolution :) https://github.com/sacdallago/bio_embeddings

I see that you are going in some interesting/different directions compared to what was done on the bio_embeddings side: we'd be happy to see if there are synergies and whether we can combine efforts rather than spreading them across many different fronts. I also see from this blog post that you are already way ahead of me on making the probability plots look beautiful, compared to my ugly ones :D

In any case, let's chat if wanted!

CC @konstin

add finetuning method

Add a method to finetune on a personal protein dataset, based on a backend (ProtBert/ESM), and save the trained model (see the sketch below).
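
For illustration, a minimal sketch of the requested workflow, reusing the BioTransformers constructor and finetune arguments that appear in the fine-tuning issue further down this page; the backend name, sequences, and hyperparameters are placeholders:

import ray
from biotransformers import BioTransformers

# Placeholder sequences standing in for a personal protein dataset.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MEEPQSDPSVEPPLSQETFSDLWKLLPEN",
]

ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=1)
bio_trans.finetune(
    sequences,
    lr=1.0e-5,
    epochs=5,
    save_last_checkpoint=True,  # assumed to keep the final checkpoint on disk
)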

Trouble using Multi-GPU on cluster

Hi,
I am trying to use multiple GPUs for inference. However, when I run it I encounter the following error:
Traceback (most recent call last):
File "main.py", line 10, in
embeddings = bio_trans.compute_embeddings(sequences, pool_mode=('cls','mean'))
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/wrappers/transformers_wrappers.py", line 669, in compute_embeddings
_, batch_embeddings = self._model_pass(batch_inputs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/wrappers/esm_wrappers.py", line 141, in _model_pass
repr_layers=[self.repr_layers],
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'tokens'

Can you help me look at this issue please?
Thanks

Typo in docstring of get_batch_indices defined in biotransformers/lightning_utils/data.py

In the docstring of the get_batch_indices function there is a typo in the example provided:

    Example:
        returning [[(1, 100), (3, 600)],[(4, 100), (7, 1200), (10, 600)], [(12, 1000)]]
        means that the first batch  will be composed of sequence at index 1 and 8 with
        lengths 100 and  600. The third batch contains only sequence 12 with a length
        of 1000.

The 8 should be replaced by a 3.

add multi-gpu support

Allow the use of torch multi-GPU for inference.
Use nn.DataParallel or ray to run inference on multiple batches in parallel (see the sketch below).
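
For illustration, a minimal sketch of the nn.DataParallel route, assuming a generic torch module and a batch of pre-tokenized inputs; none of this is the library's actual wrapper code:

import torch
import torch.nn as nn

# Illustrative stand-in for a transformer backend: any nn.Module works here.
model = nn.Sequential(nn.Embedding(33, 64), nn.Linear(64, 64))

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and splits the batch across them.
    model = nn.DataParallel(model)
model = model.to(device)

# Placeholder batch of token ids: (batch_size, sequence_length).
tokens = torch.randint(0, 33, (8, 128), device=device)
with torch.no_grad():
    embeddings = model(tokens)  # each replica processes a slice of the batch
print(embeddings.shape)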

Cannot import properly after newest version installation

Hi,
I installed the newest version of bio-transformers, and when I import it I get this error:

Traceback (most recent call last):
File "main.py", line 1, in <module>
from biotransformers import BioTransformers
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/__init__.py", line 1, in <module>
from biotransformers.bio_transformers import BioTransformers # noqa
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/bio_transformers.py", line 5, in <module>
from biotransformers.wrappers.esm_wrappers import ESMWrapper
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/wrappers/esm_wrappers.py", line 10, in <module>
from biotransformers.lightning_utils.data import AlphabetDataLoader
File "/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/biotransformers/lightning_utils/data.py", line 11, in <module>
from torch._six import int_classes as _int_classes
ImportError: cannot import name 'int_classes' from 'torch._six' (/om2/user/kaiyi/anaconda/envs/bio-transformers/lib/python3.7/site-packages/torch/_six.py)
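
For context, a commonly used workaround when torch._six no longer exposes int_classes (it was removed in recent PyTorch releases) is to fall back to the builtin int; this is a suggestion, not necessarily the fix the maintainers adopted:

# In newer PyTorch versions torch._six.int_classes was removed,
# so guard the import and fall back to the builtin int type.
try:
    from torch._six import int_classes as _int_classes
except ImportError:
    _int_classes = int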

How to use MSA transformer embeddings?

When I compute MSA embeddings with the MSA Transformer model, I get a matrix rather than a vector. How can I use these embeddings in downstream tasks? For example, with the ['mean'] option I get a matrix containing one vector per sequence in the a3m file. For a downstream task, should I use the mean over all rows, or just the first row?
Thank you.
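
For reference, a minimal numpy sketch contrasting the two options described above, assuming the returned matrix has shape (num_msa_sequences, embedding_dim); which pooling is appropriate depends on the downstream task:

import numpy as np

# Illustrative embeddings matrix: one row per sequence in the MSA.
msa_embeddings = np.random.rand(64, 768)  # (num_msa_sequences, embedding_dim)

# Option 1: average over all rows to summarise the whole alignment.
mean_embedding = msa_embeddings.mean(axis=0)  # shape (768,)

# Option 2: keep only the first row, i.e. the query sequence of the alignment.
query_embedding = msa_embeddings[0]  # shape (768,)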

Saving and Loading Fine Tuned Model

I'm not sure whether I'm saving or loading the fine-tuned model incorrectly. After fine-tuning and running an evaluation script, the accuracy before loading the model and after loading it is exactly the same. I've tried this with my own data, but as a check I attempted to replicate the example here: https://bio-transformers.readthedocs.io/en/latest/tutorial/finetuning.html

Training Script:

import biodatasets
import numpy as np
from biotransformers import BioTransformers
import ray

data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Train on small sequence
length = np.array(list(map(len, X))) < 200
train_seq = X[length][:10000]
val_seq = X[length][10000:15000]

ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=4)


bio_trans.finetune(
    train_seq,
    validation_sequences=val_seq,
    lr=1.0e-5,
    warmup_init_lr=1e-7,
    toks_per_batch=2000,
    epochs=20,
    acc_batch_size=256,
    warmup_updates=1024,
    accelerator="ddp",
    checkpoint=None,
    save_last_checkpoint=False,
    amp_level=None,
)

After running it, the logs directory is created with hparams.yaml (which is empty), metrics.csv, and a checkpoints folder containing the last checkpoint (epoch=19-step=39.ckpt).

Then I run the evaluation script:

import biodatasets
import numpy as np
from biotransformers import BioTransformers
import ray

data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Train sequence with length less than 200 AA
# Test on sequence that was not used for training.
length = np.array(list(map(len, X))) < 200
train_seq = X[length][15000:20000]


ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=4)
acc_before = bio_trans.compute_accuracy(train_seq, batch_size=32)
print(f"Accuracy before finetuning : {acc_before}")


bio_trans.load_model("logs/finetune_masked/version_0/checkpoints/epoch=19-step=39.ckpt")
acc_after = bio_trans.compute_accuracy(train_seq, batch_size=32)
print(f"Accuracy after finetuning : {acc_after}")

Which outputs:

Accuracy before finetuning : 0.3469025194644928
Accuracy after finetuning : 0.3469025194644928

Am I saving or loading incorrectly?

CUDA RUNTIME ERROR

Hi,
Thanks for updating the multi-GPU instructions. Since no CUDA toolkit is installed during the default installation, do I need to install it separately from the main bio-transformers packages? Also, when I try to run with multiple GPUs, I get this error:
RTX A6000 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the RTX A6000 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
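
As a quick diagnostic, the standard torch calls below compare the installed PyTorch build against the GPU's compute capability; this is generic PyTorch, not bio-transformers API:

import torch

print("PyTorch:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)  # None means a CPU-only build
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # An RTX A6000 reports capability (8, 6); the build must list sm_86 to use it.
    print("Device capability:", torch.cuda.get_device_capability(0))
    print("Supported archs:", torch.cuda.get_arch_list())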

Use absolute import instead of relative import

In the current code, we use relative imports such as https://github.com/DeepChainBio/bio-transformers/blob/main/biotransformers/esm_wrappers.py#L12

I believe absolute imports are recommended.

In general, absolute imports are preferred over relative imports.

https://chrisyeh96.github.io/2017/08/08/definitive-guide-python-imports.html#absolute-vs-relative-import

Thus, since our package is named bio-transformers and the source folder is biotransformers, we should import like:

from biotransformers.transformers_wrappers import (
    NATURAL_AAS,
    TransformersInferenceConfig,
    TransformersModelProperties,
    TransformersWrapper,
)

instead of

from .transformers_wrappers import (
    NATURAL_AAS,
    TransformersInferenceConfig,
    TransformersModelProperties,
    TransformersWrapper,
)

I think this is clearer. What do you think?

CONTRIBUTING - feedback / problems

While doing MR #31 I followed the CONTRIBUTING guide step by step, and I have some feedback / remarks / problems:

  1. Once the conda environment is installed, pre-commit was not available (even though it is listed in environment_dev.yaml). Running pip install pre-commit did not solve the issue, but conda install -c conda-forge pre-commit did. Maybe we can update environment_dev.yaml by moving the pre-commit requirement from pip to dependencies (and add the conda-forge channel, which is commented out for now).
  2. Once pre-commit is installed, it is advised to run pre-commit run --all-files. With the latest version of main I had several errors (see the attached screenshot).
  3. Regarding pre-commit: isort is mentioned in CONTRIBUTING.md but is not defined in .pre-commit-config.yaml (remark: mypy is used but not mentioned in CONTRIBUTING.md).
  4. To enforce the git conventions, you can add a pre-commit hook by adding the following lines to .pre-commit-config.yaml:
  - repo: https://github.com/alessandrojcm/commitlint-pre-commit-hook
    rev: v2.2.0
    hooks:
      - id: commitlint
        stages: [commit-msg]
        additional_dependencies: ["@commitlint/config-conventional"]

Some information regarding my laptop:

  • Pop OS: 19.10
  • conda: 4.8.2

Wrong formatting in CONTRIBUTING.md - Git conventions

In the Git conventions section of the CONTRIBUTING.md file, the last item is wrongly formatted (see the attached screenshot).

It should be:

  • section:
    • it is optional
    • It is an extension of the section used to add a longer description about the changes if relevant

Also, why is it named section rather than <body>?

a worker died or was killed while executing a task by an unexpected system error

I use 4 GPUs to compute MSA embeddings, but each time the process terminates, ray raises the error "a worker died or was killed while executing a task by an unexpected system error", and the GPU processes terminate one by one. I have tried several times and updated ray to the latest version, but the problem persists. How can I fix it?
Thanks!

ESM1v support

Hi,
Thank you for writing this package. Is ESM1v supported as described in the original Facebook ESM page?
Best,
Albert
