agemagician / ankh Goto Github PK

Ankh: Optimized Protein Language Model

License: Other

Python 100.00%

ankh's Introduction

Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling

Ankh is the first general-purpose protein language model trained on Google's TPU-V4 surpassing the state-of-the-art performance with dramatically less parameters, promoting accessibility to research innovation via attainable resources.

This repository will be updated regulary with new pre-trained models for proteins in part of supporting the biotech community in revolutinizing protein engineering using AI.

Installation
Models Availability
Dataset Availability
Usage
Original downstream Predictions
Followup use-cases
Comparisons to other tools
Community and Contributions
Have a question?
Found a bug?
Requirements
Sponsors
Team
License
Citation

Installation

python -m pip install ankh

Models Availability

Model	ankh	Hugging Face
Ankh Large	`ankh.load_large_model()`	Ankh Large
Ankh Base	`ankh.load_base_model()`	Ankh Base

Datasets Availability

Dataset	Hugging Face
Remote Homology	`load_dataset("proteinea/remote_homology")`
CASP12	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP12.csv']})`
CASP14	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP14.csv']})`
CB513	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CB513.csv']})`
TS115	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['TS115.csv']})`
DeepLoc	`load_dataset("proteinea/deeploc")`
Fluorescence	`load_dataset("proteinea/fluorescence")`
Solubility	`load_dataset("proteinea/solubility")`
Nearest Neighbor Search	`load_dataset("proteinea/nearest_neighbor_search")`

Usage

Loading pre-trained models:

  import ankh

  # To load large model:
  model, tokenizer = ankh.load_large_model()
  model.eval()


  # To load base model.
  model, tokenizer = ankh.load_base_model()
  model.eval()

Feature extraction using ankh large example:

  model, tokenizer = ankh.load_large_model()
  model.eval()

  protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH', 
  'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

  protein_sequences = [list(seq) for seq in protein_sequences]


  outputs = tokenizer.batch_encode_plus(protein_sequences, 
                                    add_special_tokens=True, 
                                    padding=True, 
                                    is_split_into_words=True, 
                                    return_tensors="pt")
  with torch.no_grad():
    embeddings = model(input_ids=outputs['input_ids'], attention_mask=outputs['attention_mask'])

Loading downstream models example:

  # To use downstream model for binary classification:
  binary_classification_model = ankh.ConvBertForBinaryClassification(input_dim=768, 
                                                                     nhead=4, 
                                                                     hidden_dim=384, 
                                                                     num_hidden_layers=1, 
                                                                     num_layers=1, 
                                                                     kernel_size=7, 
                                                                     dropout=0.2, 
                                                                     pooling='max')

  # To use downstream model for multiclass classification:
  multiclass_classification_model = ankh.ConvBertForMultiClassClassification(num_tokens=2, 
                                                                             input_dim=768, 
                                                                             nhead=4, 
                                                                             hidden_dim=384, 
                                                                             num_hidden_layers=1, 
                                                                             num_layers=1, 
                                                                             kernel_size=7, 
                                                                             dropout=0.2)

  # To use downstream model for regression:
  # training_labels_mean is optional parameter and it's used to fill the output layer's bias with it, 
  # it's useful for faster convergence.
  regression_model = ankh.ConvBertForRegression(input_dim=768, 
                                                nhead=4, 
                                                hidden_dim=384, 
                                                num_hidden_layers=1, 
                                                num_layers=1, 
                                                kernel_size=7, 
                                                dropout=0, 
                                                pooling='max', 
                                                training_labels_mean=0.38145)

Original downstream Predictions

Secondary Structure Prediction (Q3):

Model	CASP12	CASP14	TS115	CB513
Ankh 2 Large	84.18%	76.82%	88.59%	88.78%
Ankh Large	83.59%	77.48%	88.22%	88.48%
Ankh Base	80.81%	76.67%	86.92%	86.94%
ProtT5-XL-UniRef50	83.34%	75.09%	86.82%	86.64%
ESM2-15B	83.16%	76.56%	87.50%	87.35%
ESM2-3B	83.14%	76.75%	87.50%	87.44%
ESM2-650M	82.43%	76.97%	87.22%	87.18%
ESM-1b	79.45%	75.39%	85.02%	84.31%

Secondary Structure Prediction (Q8):

Model	CASP12	CASP14	TS115	CB513
Ankh 2 Large	72.90%	62.84%	79.88%	79.01%
Ankh Large	71.69%	63.17%	79.10%	78.45%
Ankh Base	68.85%	62.33%	77.08%	75.83%
ProtT5-XL-UniRef50	70.47%	59.71%	76.91%	74.81%
ESM2-15B	71.17%	61.81%	77.67%	75.88%
ESM2-3B	71.69%	61.52%	77.62%	75.95%
ESM2-650M	70.50%	62.10%	77.68%	75.89%
ESM-1b	66.02%	60.34%	73.82%	71.55%

Contact Prediction Long Precision Using Embeddings:

Model	ProteinNet (L/1)	ProteinNet (L/5)	CASP14 (L/1)	CASP14 (L/5)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	48.93%	73.49%	16.01%	29.91%
Ankh Base	43.21%	66.63%	13.50%	28.65%
ProtT5-XL-UniRef50	44.74%	68.95%	11.95%	24.45%
ESM2-15B	31.62%	52.97%	14.44%	26.61%
ESM2-3B	30.24%	51.34%	12.20%	21.91%
ESM2-650M	29.36%	50.74%	13.71%	22.25%
ESM-1b	29.25%	50.69%	10.18%	18.08%

Contact Prediction Long Precision Using attention scores:

Model	ProteinNet (L/1)	ProteinNet (L/5)	CASP14 (L/1)	CASP14 (L/5)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	31.44%	55.58%	11.05%	20.74%
Ankh Base	25.93%	46.28%	9.32%	19.51%
ProtT5-XL-UniRef50	30.85%	51.90%	8.60%	16.09%
ESM2-15B	33.32%	57.44%	12.25%	24.60%
ESM2-3B	33.92%	56.63%	12.17%	21.36%
ESM2-650M	31.87%	54.63%	10.66%	21.01%
ESM-1b	25.30%	42.03%	7.77%	15.77%

Localization (Q10):

Model	DeepLoc Dataset
Ankh 2 Large	82.57%
Ankh Large	83.01%
Ankh Base	81.38%
ProtT5-XL-UniRef50	82.95%
ESM2-15B	81.22%
ESM2-3B	81.22%
ESM2-650M	82.08%
ESM-1b	80.51%

Remote Homology:

Model	SCOPe (Fold)
Ankh 2 Large	62.09%
Ankh Large	61.01%
Ankh Base	61.14%
ProtT5-XL-UniRef50	59.38%
ESM2-15B	54.48%
ESM2-3B	59.24%
ESM2-650M	51.36%
ESM-1b	56.93%

Solubility:

Model	Solubility
Ankh 2 Large	75.86%
Ankh Large	76.41%
Ankh Base	76.36%
ProtT5-XL-UniRef50	76.26%
ESM2-15B	60.52%
ESM2-3B	74.91%
ESM2-650M	74.56%
ESM-1b	74.91%

Fluorescence (Spearman Correlation):

Model	Fluorescence
Ankh 2 Large	0.62
Ankh Large	0.62
Ankh Base	0.62
ProtT5-XL-UniRef50	0.61
ESM2-15B	0.56
ESM-1b	0.48
ESM2-650M	0.48
ESM2-3B	0.46

Nearest Neighbor Search using Global Pooling:

Model	Lookup69K (C)	Lookup69K (A)	Lookup69K (T)	Lookup69K (H)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	0.83	0.72	0.60	0.70
Ankh Base	0.85	0.77	0.63	0.72
ProtT5-XL-UniRef50	0.83	0.69	0.57	0.73
ESM2-15B	0.78	0.63	0.52	0.67
ESM2-3B	0.79	0.65	0.53	0.64
ESM2-650M	0.72	0.56	0.40	0.53
ESM-1b	0.78	0.65	0.51	0.63

Team

Technical University of Munich:

Ahmed Elnaggar	Burkhard Rost

Proteinea:

Hazem Essam	Wafaa Ashraf	Walid Moustafa	Mohamed Elkerdawy

University of Columbia:

Charlotte Rochereau

License

Ankh pretrained models are released under the under terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Community and Contributions

The Ankh project is a open source project supported by various partner companies and research institutions. We are committed to share all our pre-trained models and knowledge. We are more than happy if you could help us on sharing new ptrained models, fixing bugs, proposing new feature, improving our documentation, spreading the word, or support our project.

Have a question?

We are happy to hear your question in our issues page Ankh! Obviously if you have a private question or want to cooperate with us, you can always reach out to us directly via Hello.

Found a bug?

Feel free to file a new issue with a respective title and description on the the Ankh repository. If you already found a solution to your problem, we would love to review your pull request!.

✏️ Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@article{elnaggar2023ankh,
  title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={arXiv preprint arXiv:2301.06568},
  year={2023}
}

ankh's People

Contributors

Stargazers

Watchers

Forkers

dongcf eltociear obinnaobeleagu stjordanis biocad guruace biocoder007 danielnzg85 lajd schmigle ddofer darrengao628 kimdn or2513 qqlaoxia usmanovsky lhallee spyfighting rufus-willy

ankh's Issues

How to compute the perplexity of a protein sequence by ankh PLM?

Database of generated embeddings

Hi, thanks so much for your work !

I was wondering, using Ankh, have you generated the embeddings on uniref50 or another database and made it available somewhere by any chance ? It would be awesome and time saving ! Both in float16 or float64. (float16 would be easier to store, and I think it is as performant as float64 on downstream tasks ?)

Thanks

Decode From Protein Representation

Hi,

Thanks for this amazing work.

I am wondering if AnKh supports decoding from extracted protein representation.

Thank you!

Secondary structure prediction dataset

Hi!

When running examples/secondary_structure_prediction_3_states.ipynb, I have an error with downloading dataset from HuggingFace. It looks like there are no such files as stated in the script.

FileNotFoundError: Couldn't find a dataset script at /home/u/Ankh/examples/proteinea/SSP/SSP.py or any data file in the same directory. Couldn't find 'proteinea/SSP' on the Hugging Face Hub either: FileNotFoundError: Unable to find training_data.csv in dataset repository proteinea/secondary_structure_prediction with any supported extension ['csv', ...]

Also, is there a straightforward way to use Ankh for secondary structure prediction on a single sequence protein?

Availability of training/fine-tuning scripts

Hi! Are there any publicly available training/fine-tuning scripts? I would like to fine-tune the pre-trained Ankh on a custom dataset of protein sequences using the masked setup described in section 4.6.1 of the manuscript. It would be wonderful if I could utilize your code, but unfortunately, I couldn't find it anywhere. Thank you!

is feature extraction correct due to padding?

Thank you for this great protein LM. My question is regarding the feature extraction due to padding (given that the input sequences are from different length), in this case shouldn't the attention_mask be used as an additional input to the model for correct results, which makes sure that padding tokens of the inputs are ignored? like this:

  model, tokenizer = ankh.load_large_model()
  model.eval()

  protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH', 
  'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

  protein_sequences = [list(seq) for seq in protein_sequences]


  outputs = tokenizer.batch_encode_plus(protein_sequences, 
                                    add_special_tokens=True, 
                                    padding=True, 
                                    is_split_into_words=True, 
                                    return_tensors="pt")
  with torch.no_grad():
    embeddings = model(input_ids=outputs['input_ids'], attention_mask=outputs["attention_mask"])

thank you.

Missing model architecture and training details

Hello,

I have just read the paper, Great work!

I hoped to get more insights into the model architecture and training details, but none seems available on the repo.

It is a bit intriguing that a paper focused on so many pre-training experiments does not include the model architecture and training functions.

Binary solubility example notebook error.

Hi,
I'm trying to run the example notebook given for solubility prediction.

I am getting an error after running code from one of the cells as shown below. I have used the code from your utils file which checks for the presence of the accelerate module as an additional cell in the running notebook. This returns True for the presence of accelerate. Is there any guidance that you could provide yto make this work? Thanks in afdvance.

Cheers,
Saif

---------Code from cell-------------------------
model_type = 'ankh_large'
experiment = f'solubility_{model_type}'

training_args = TrainingArguments(
output_dir=f'./results_{experiment}',
num_train_epochs=5,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
warmup_steps=1000,
learning_rate=1e-03,
weight_decay=0.0,
logging_dir=f'./logs_{experiment}',
logging_steps=200,
do_train=True,
do_eval=True,
evaluation_strategy="epoch",
gradient_accumulation_steps=16,
fp16=False,
fp16_opt_level="02",
run_name=experiment,
seed=seed,
load_best_model_at_end=True,
metric_for_best_model="eval_accuracy",
greater_is_better=True,
save_strategy="epoch"
)

---------Errror--------------------------

ImportError Traceback (most recent call last)
Cell In[34], line 4
1 model_type = 'ankh_large'
2 experiment = f'solubility_{model_type}'
----> 4 training_args = TrainingArguments(
5 output_dir=f'./results_{experiment}',
6 num_train_epochs=5,
7 per_device_train_batch_size=1,
8 per_device_eval_batch_size=1,
9 warmup_steps=1000,
10 learning_rate=1e-03,
11 weight_decay=0.0,
12 logging_dir=f'./logs_{experiment}',
13 logging_steps=200,
14 do_train=True,
15 do_eval=True,
16 evaluation_strategy="epoch",
17 gradient_accumulation_steps=16,
18 fp16=False,
19 fp16_opt_level="02",
20 run_name=experiment,
21 seed=seed,
22 load_best_model_at_end=True,
23 metric_for_best_model="eval_accuracy",
24 greater_is_better=True,
25 save_strategy="epoch"
26 )

File :121, in init(self, output_dir, overwrite_output_dir, do_train, do_eval, do_predict, evaluation_strategy, prediction_loss_only, per_device_train_batch_size, per_device_eval_batch_size, per_gpu_train_batch_size, per_gpu_eval_batch_size, gradient_accumulation_steps, eval_accumulation_steps, eval_delay, learning_rate, weight_decay, adam_beta1, adam_beta2, adam_epsilon, max_grad_norm, num_train_epochs, max_steps, lr_scheduler_type, lr_scheduler_kwargs, warmup_ratio, warmup_steps, log_level, log_level_replica, log_on_each_node, logging_dir, logging_strategy, logging_first_step, logging_steps, logging_nan_inf_filter, save_strategy, save_steps, save_total_limit, save_safetensors, save_on_each_node, save_only_model, no_cuda, use_cpu, use_mps_device, seed, data_seed, jit_mode_eval, use_ipex, bf16, fp16, fp16_opt_level, half_precision_backend, bf16_full_eval, fp16_full_eval, tf32, local_rank, ddp_backend, tpu_num_cores, tpu_metrics_debug, debug, dataloader_drop_last, eval_steps, dataloader_num_workers, past_index, run_name, disable_tqdm, remove_unused_columns, label_names, load_best_model_at_end, metric_for_best_model, greater_is_better, ignore_data_skip, fsdp, fsdp_min_num_params, fsdp_config, fsdp_transformer_layer_cls_to_wrap, deepspeed, label_smoothing_factor, optim, optim_args, adafactor, group_by_length, length_column_name, report_to, ddp_find_unused_parameters, ddp_bucket_cap_mb, ddp_broadcast_buffers, dataloader_pin_memory, dataloader_persistent_workers, skip_memory_metrics, use_legacy_prediction_loop, push_to_hub, resume_from_checkpoint, hub_model_id, hub_strategy, hub_token, hub_private_repo, hub_always_push, gradient_checkpointing, gradient_checkpointing_kwargs, include_inputs_for_metrics, fp16_backend, push_to_hub_model_id, push_to_hub_organization, push_to_hub_token, mp_parameters, auto_find_batch_size, full_determinism, torchdynamo, ray_scope, ddp_timeout, torch_compile, torch_compile_backend, torch_compile_mode, dispatch_batches, split_batches, include_tokens_per_second, include_num_input_tokens_seen, neftune_noise_alpha)

File ~/ankh_venv/lib/python3.10/site-packages/transformers/training_args.py:1483, in TrainingArguments.post_init(self)
1477 if version.parse(version.parse(torch.version).base_version) == version.parse("2.0.0") and self.fp16:
1478 raise ValueError("--optim adamw_torch_fused with --fp16 requires PyTorch>2.0")
1480 if (
1481 self.framework == "pt"
1482 and is_torch_available()
-> 1483 and (self.device.type != "cuda")
1484 and (self.device.type != "npu")
1485 and (self.device.type != "xpu")
1486 and (get_xla_device_type(self.device) != "GPU")
1487 and (self.fp16 or self.fp16_full_eval)
1488 ):
1489 raise ValueError(
1490 "FP16 Mixed precision training with AMP or APEX (--fp16) and FP16 half precision evaluation"
1491 " (--fp16_full_eval) can only be used on CUDA or NPU devices or certain XPU devices (with IPEX)."
1492 )
1494 if (
1495 self.framework == "pt"
1496 and is_torch_available()
(...)
1503 and (self.bf16 or self.bf16_full_eval)
1504 ):

File ~/ankh_venv/lib/python3.10/site-packages/transformers/training_args.py:1921, in TrainingArguments.device(self)
1917 """
1918 The device used by this process.
1919 """
1920 requires_backends(self, ["torch"])
-> 1921 return self._setup_devices

File ~/ankh_venv/lib/python3.10/site-packages/transformers/utils/generic.py:54, in cached_property.get(self, obj, objtype)
52 cached = getattr(obj, attr, None)
53 if cached is None:
---> 54 cached = self.fget(obj)
55 setattr(obj, attr, cached)
56 return cached

File ~/ankh_venv/lib/python3.10/site-packages/transformers/training_args.py:1831, in TrainingArguments._setup_devices(self)
1829 if not is_sagemaker_mp_enabled():
1830 if not is_accelerate_available():
-> 1831 raise ImportError(
1832 f"Using the Trainer with PyTorch requires accelerate>={ACCELERATE_MIN_VERSION}: "
1833 "Please run pip install transformers[torch] or pip install accelerate -U"
1834 )
1835 AcceleratorState._reset_state(reset_partial_state=True)
1836 self.distributed_state = None

ImportError: Using the Trainer with PyTorch requires accelerate>=0.21.0: Please run pip install transformers[torch] or pip install accelerate -U

Half-precision

Will performance be affected if half-precision is used with Ankh? As in model.half() to convert to FP16.
I am trying to optimize inference speed as much as possible.

Decoding labels in masking

Hi there!

Awesome work! Quick question, in the paper you mention

performed a full demasking/reconstruction of the sequence (i.e., all tokens are reconstructed as individual tokens).

Isn't it different from T5 decoder implementation where the recovered labels are just the masked ones? Do you have a separate implementation for this and if so could you please point it? I am sorry if I missed it.

Thank so much!

python script for solubility score?

Hi,

Reading through the different documents in this repo, it's not clear to me how to run the solubility prediction to get a solubility score for a given protein.

Is there one such script? Thx.

Generating sequences longer than 115

Hello!

I see that this implementation of T5 has ~115 extra ids for masking. Is it possible to tile or pattern these in someway to get the model to generate sequences longer than 115, or do new extra ids need to be added for this purpose? Didn't know if anyone had tried this yet.

Downstream task: what model for remote homology prediction ?

Hi, could you tell me what model you used for the remote homology prediction task please ?
I was able to run your example notebooks for secondary structure prediction but I am unsure how to do to regenerate your results for remote homology downstream task.
Thanks !

MultiLabelClassification Issue

Hello @agemagician @hazemessamm ,

In the current multi-label convbert the output is logits that are batch_size, seq_len, emb_dim to compare with labels that are batch_size, emb_dim. F.binary_cross_entropy_with_logits requires the label and pred be the same size, so this will not work if I understand it correctly. One potential solution is to average across the sequence dimension to get batch_size, emb_dim but I'm not sure if this is optimal. Does it work as intended or have I actually found a bug?
Best,
Logan

High-N generation

Dear authors,
Thank you for open-sourcing your amazing PLM.
Can you please provide an example of how to fine-tune model for sequence generation based on protein family?
Is that the same as fine-tuning causal language modeling with transformers library?

Thank you.
Ai

Generation within Sequence/Mask prediction

Thank you authors for your publication and work in open sourcing :)
Was able to run the simple_generation example you provided - this is great if one wants to generate at the end of a sequence.
Is it possible to generate within the middle of a sequence, using the existing model and mask token as well?

Fine-tuning multilabel classification

Hello,

Thanks for your work!
I am trying to fine-tune the model on some of my data on a multilabel classification task.
From my understanding of the example notebooks, the downstream architectures always take embeddings (generated by ankh-large or ankh-base) as inputs and not tokenized sequences directly.

How can one train the full network (and thus modify the embeddings representations given to a sequence) for a downstream task?

My question is simple and might have an obvious response and I'm sorry if it's the case.

Thanks a lot for your help!

Error: Missing Model Files for ElnaggarLab/ankh2-large on HF

Hello,

I encountered an issue while trying to use the ankh2-large model on the Hugging Face model hub. Here is the code snippet I used:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


tokenizer = AutoTokenizer.from_pretrained("ElnaggarLab/ankh2-large")
model = AutoModelForSeq2SeqLM.from_pretrained("ElnaggarLab/ankh2-large")

However, I received the following error:

OSError: ElnaggarLab/ankh2-large does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

It seems that the model files are missing from the repository. Could you please check and upload the necessary files?

Thank you for your assistance.

Best regards,
Alex

Availability of fine-tuned models for downstream tasks

Hi there, great work on making these large pre-trained protein LMs available for us to use.

Are the fine-tuned models for prediction of secondary structure, contacts, solubility, fluorescence etc going to be made available?

Thanks!

agemagician / ankh Goto Github PK

ankh's Introduction

Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling

Table of Contents

Installation

Models Availability

Datasets Availability

Usage

Original downstream Predictions

Team

Sponsors

License

Community and Contributions

Have a question?

Found a bug?

✏️ Citation

ankh's People

Contributors

Stargazers

Watchers

Forkers

ankh's Issues

Recommend Projects

Recommend Topics

Recommend Org