oxpig / calm

Protein language model trained on coding DNA

License: BSD 3-Clause "New" or "Revised" License

Python 69.10% Jupyter Notebook 30.90%

codons computational-biology deep-learning dna language-models llms machine-learning protein-language-model

calm's Introduction

CaLM

The Codon adaptation Language Model

This repository encapsulates all code required to reproduce the results of the paper "Codon language embeddings provide strong signals for use in protein engineering", by Carlos Outeiral and Charlotte M. Deane.

Citation

If you use our work, please cite:

Outeiral, Carlos, and Charlotte M. Deane. "Codon language embeddings provide strong signals for use in protein engineering." Nature Machine Intelligence 6.2 (2024): 170-179.

Installation

git clone https://github.com/oxpig/CaLM
cd CaLM
python setup.py install

Usage

from calm import CaLM

model = CaLM()
model.embed_sequence('ATGGTATAGAGGCATTGA')

calm's Issues

UnpicklingError When Loading Model Weights in training.py

Dear Developer Team,

I hope this message finds you well. I am reaching out to report an issue I encountered while working with your software, specifically when attempting to load a model that I trained using the training.py file. Upon executing the test code to utilize the trained model, I encountered an UnpicklingError that halted the process.

To provide a detailed context, here is the error message that was generated:

UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.

This error occurred during the following operation in the code:

model = CaLM(weights_file='/home/hugeng/39-CaLM/production-run/latest-0.ckpt')

Thank you for your time and assistance. I look forward to your response and any suggestions you may offer to resolve this issue.

Best regards,

Geng Hu
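
A note on the likely cause: this message is what Python's pickle.load raises when it is given a file written by torch.save, and PyTorch Lightning writes its .ckpt checkpoints with torch.save. The sketch below assumes (not confirmed in this thread) that the CaLM loader reads weights_file with pickle.load and expects a plain state dict. The function names, the output path argument and the "model." key prefix are hypothetical and may need adjusting to the actual checkpoint.

```python
import pickle


def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def convert_checkpoint(ckpt_path, out_path, prefix="model."):
    """Re-save the weights of a Lightning checkpoint as a plain pickle file."""
    import torch  # imported here so strip_prefix can be used without PyTorch

    # Lightning checkpoints are written with torch.save, so read them with torch.load.
    # Recent PyTorch versions may also need weights_only=False for trusted files.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    # Lightning keeps the model weights under the "state_dict" key.
    # ASSUMPTION: the keys carry a "model." prefix that the loader does not expect.
    state_dict = strip_prefix(checkpoint["state_dict"], prefix)
    with open(out_path, "wb") as handle:
        pickle.dump(state_dict, handle)
    return out_path
```

The converted file could then be passed as weights_file, provided the key names match what the model expects; printing a few keys from checkpoint["state_dict"] is a quick way to check the prefix.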

The issue of data usage

Hello! I noticed that in the data you provided, some sequences do not begin with "ATG", for example, 'TTGAAAAGAAAAGCCAGTATCATGTTTGTCCATCAAGACAAGTACGAAGAATACAAACAGCGGCATGATGACATTTGGCCTGAGATGGCAGAAGCACTCAAAGCTCATGGAGCACACCATTATTCCATTTTTCTAGACGAGGAAACAGGCAGGCTTTTTGCATATTTAGAAATAGAGGATGAAGAGAAATGGAGAAAGATGGCGGACACGGAAGTTTGCCAAAGATGGTGGAAATCGATGGCGCCATTAATGAAAACAAATTCGGATTTCAGTCCTGTTGCGATAGATCTAAAGGAAGTTTTTTATTTGGATTGA'.
When tokenizing, should I discard the part before ATG and start from ATG, or should I just use the entire sequence as it is?
Similarly, when translating it into an amino acid sequence, should I translate the entire sequence directly or start translating from ATG?
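
For context, TTG and GTG are used as alternative start codons in bacteria, so a coding sequence that begins with TTG is not necessarily truncated. The two options in the question can be compared with a minimal sketch like the one below. The helper names and the toy sequence are made up for illustration, and which option matches the preprocessing used for training is something only the authors can confirm.

```python
def split_codons(seq):
    """Split a coding sequence into consecutive three-letter codons."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def trim_to_first_atg(seq):
    """Return seq starting at its first in-frame ATG, or None if there is none."""
    codons = split_codons(seq)
    if "ATG" not in codons:
        return None
    return "".join(codons[codons.index("ATG"):])


toy = "TTGAAAATGGCATGA"  # made-up sequence that starts with TTG

# Option 1: keep the whole sequence
print(split_codons(toy))       # ['TTG', 'AAA', 'ATG', 'GCA', 'TGA']
# Option 2: discard everything before the first in-frame ATG
print(trim_to_first_atg(toy))  # ATGGCATGA
```

Trimming drops the codons before the first in-frame ATG, so the two options give different token sequences and different translated proteins.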

AlphaBet

Hi,

Thank you for your work.

I noticed this is trained on cDNA data, while the tokeniser seems to use an RNA vocabulary (https://github.com/oxpig/CaLM/blob/main/calm/alphabet.py).

Can you please clarify the data preprocessing pipeline?
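
One common way to reconcile a DNA input with an RNA codon vocabulary is to replace T with U before splitting the sequence into codons. The sketch below only illustrates that idea. The function name is hypothetical, and whether the CaLM pipeline performs this exact step is not stated in this thread.

```python
def dna_to_rna_codons(seq):
    """Convert a coding DNA sequence to RNA letters and split it into codons."""
    rna = seq.upper().replace("T", "U")  # DNA thymine becomes RNA uracil
    if len(rna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


# The sequence from the Usage section above
print(dna_to_rna_codons("ATGGTATAGAGGCATTGA"))
# ['AUG', 'GUA', 'UAG', 'AGG', 'CAU', 'UGA']
```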

Missing license

Great resource, however the repository does not contain a license, which makes it difficult to use/reuse.

FileNotFoundError

Dear Developer Team,

I am writing to seek your assistance regarding an issue I encountered while attempting to run the code associated with your paper titled "Codon language embeddings provide strong signals for use in protein engineering". When running the 'training.py' file, I encountered the following traceback error:

Traceback (most recent call last):
  File "training.py", line 141, in <module>
    ckpt_path='production-run/latest-56000.ckpt')
  ...
  FileNotFoundError: [Errno 2] No such file or directory: 'training_data.fasta'

It seems that the 'training_data.fasta' file is not found, leading to this error. I would greatly appreciate it if you could provide some guidance on how to address this issue.

Thank you very much for your time and consideration. I look forward to your valuable guidance.

Sincerely,
Geng Hu

Fine-tune on top of your pre-trained model

Can you please share your PyTorch Lightning model snapshot, so I can fine-tune a model on top of yours?

Currently you only share a weights file (calm_weights.pkl). A full PyTorch Lightning model snapshot would allow me to run Trainer() from where you left off.

GPU device management

Hi and thanks again for the great work,

I will fix this in my local copy of the repo, but as far as I can see, the CaLM class in pretrained.py, which is used for inference, doesn't support a device argument. Setting e.g. model.model.cuda() results in a conflict between the model's device and the device of the tensors passed to the forward method.
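
The usual remedy for this kind of mismatch is to move the model and its input tensors to the same device before the forward pass. A minimal sketch of that pattern follows. The helper name is hypothetical and not part of the CaLM API, and it relies only on the .to() method that PyTorch modules and tensors both provide.

```python
def run_on_device(model, tokens, device="cpu"):
    """Move the model and its input to the same device, then run the forward pass."""
    model = model.to(device)    # nn.Module.to moves the parameters and returns the module
    tokens = tokens.to(device)  # Tensor.to returns a tensor on the target device
    return model(tokens)
```

With PyTorch this would be called as run_on_device(net, batch, "cuda"), where net is an nn.Module and batch is a tensor of token indices. Inside a class like CaLM, the same idea would mean storing the device at construction time and moving the tokenized input to it before calling the underlying model.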

Environment installation issues

Hi @couteiral and thanks for sharing your work,

I ran python setup.py install, but when I then run from calm import CaLM, the PyTorch initialisation raises an error:
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory

So I tried building the environment myself, starting from a fresh conda environment in which pytorch/cuda can be imported without issues. I then installed your repository within this environment. Everything seems to go through and the imports work, but when I try to initialise the model, it downloads the weights and then raises another CUDA-related error:
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

Do you have any hints on how to finalise the installation please?