
protein_bert's People

Contributors

ddofer, judewells, linjames0, mikemacintosh, nadavbra, prathamsoni, r-kellerm


protein_bert's Issues

Pretraining from scratch - Error while creating the h5 dataset file?

Hi,

I'm trying to pretrain from scratch with UniRef50.
Up to running the script "create_uniref_h5_dataset" (step 1 (5)), everything seems to be going fine.
After running this script, the output log suggests it did some work (it took 5.5 hours), except that the last two lines are:
'Finished. Failed finding the sequence for 51333317 of 51333317 records.'
'Done'.
Also, the dataset.h5 file weighs only 290 KB. Is this really an error? Isn't that size too small for this file?

Thank you for your help,
Alon

Missing files/directories for running

In the notebook "ProteinBERT - final paper analyses.ipynb" (https://github.com/nadavbra/protein_bert/blob/bc2c881f44073a9d105d719f20229e70ab975ec6/bin/ProteinBERT%20-%20final%20paper%20analyses.ipynb)

there are commands which refer to local directories:

/cs/phd/nadavb/logs/slurm_jobs/finetuning_from_different_pretraining_snapshots/
/cs/phd/nadavb/proteinbert_project/data/test_set_performance.csv
MODEL_DUMP_DIRS = ['/cs/labs/michall/nadavb/proteinbert_models/original_dumps', '/cs/labs/michall/nadavb/proteinbert_models/continue_original']
H5_DATASET_FILE_PATH = '/cs/phd/nadavb/proteinbert_project/data/original_dataset.h5'



Would it be possible to make them available through a link (e.g. Google Drive)?

Generate sequence based on natural one

Hi,
First of all, thank you for sharing ProteinBERT code.
I read the article as part of my thesis, and started playing with the model.

My questions:

  1. How can I generate a new sequence with ProteinBERT based on a natural sequence?
    For example:
    Natural seq: "DQSVRKLVRKLPDEGLDREKVKTYLDKLGVDREELQKFSDAIGLESSGGS"
    A new generated seq: "GSSDIEITVEGKEQADKVIEEMKRRNLEVHVEEHNGQYIDKASLESSGGS"

  2. In the generation process, is it possible to specify which position(s) in the natural sequence I want ProteinBERT to change? If yes, how?
    Example: I want to change only the 3rd position in the sequence, so:
    Natural seq: "DQSV..."
    Possible generated sequence: "DQGV..."

Best Regards,
Aviv.

Obtain dataset for pretraining ProteinBERT

Hi, I would like to pretrain the ProteinBERT model with a smaller dataset, such as human proteins in UniRef90. I have searched both the UniProtKB and UniRef90 websites but could not find out how to obtain the .xml.gz file containing GO annotations similar to your uniref90.xml.gz file.
Could you explain how to get that kind of input file?

Thank you!

pretraining without annotation

Hello and thank you for this amazing github repository protein_bert!

I'm thinking about pretraining the protein_bert model from scratch on very short sequences (7 to 20 amino acids, e.g. CDRs from TCRs).
I have a large FASTA file containing millions of these very short sequences. But because they are highly variable and there is a vast number of them, I have no annotations for the sequences.

When I try to run following command to create the h5 file which is used for pretraining:
nohup create_uniref_h5_dataset --protein-fasta-file=fasta_file.fasta --output-h5-dataset-file=./dataset.h5 --min-records-to-keep-annotation=100 >&! ./log_create_uniref_h5_dataset.txt
I get the error message:
create_uniref_h5_dataset: error: the following arguments are required: --protein-annotations-sqlite-db-file, --go-annotations-meta-csv-file

Should I provide an empty db-file and an empty csv-file for create_uniref_h5_dataset or should I directly create a h5-file from the fasta which will be used as input for pretrain_proteinbert? What would I have to consider when creating an h5-file with the sequences?
Or can you think of a way I could pretrain protein_bert on the described FASTA file without annotations?
Is it even possible to pretrain protein_bert on sequences alone, without annotations?

Any help is highly appreciated!

Visualizing Attention

Hi, I'm trying to use BertViz for visualizing attention. I got an error about an incorrect number of dimensions. Is there any way to work around this error?

[screenshot of the error attached]

GO Annotation Vector

Hi @nadavbra,

Cool model, I just have a few questions. The first being, did you ever recover the GO annotation vector that was lost in #6? I can't seem to find it anywhere, and I unfortunately do not have a month nor a powerful enough GPU to retrain the model myself.

My second question is about the global_representations vector that is returned when using the model to predict the function of a protein sequence, which is an array of floating-point values between 0 and 1. Do you have any intuition on how to interpret these values? I get that values closer to one represent a higher probability of a given protein belonging to that GO annotation. But what is the cutoff? Is a value greater than 0.1 sufficient? Should it be higher, or can you go lower? What values did you use in your paper? The model seems fairly stringent; is this by design?

Cheers,
Rhys

can not load pkl

I got an error loading your trained model: "pickle data was truncated". I don't know how to solve it.

Questions about ProteinBERT in the global attention visualization section

Hi @nadavbra, I have some questions about ProteinBERT in the global attention visualization section.

  1. In the task of predicting secondary structure, when I use
    ('disorder_secondary_structure', OutputType(True, 'binary'))
    the shapes of finetuned_attention_values and pretrained_attention_values in
    axes[2].barh(np.arange(seq_len), 100 * (finetuned_attention_values - pretrained_attention_values).sum(axis=0), color='#28B463')
    are different, so they cannot be subtracted; but when I use (False, 'binary'), their shapes are the same. How should I change this part of the code?

  2. For global attention visualization, when the task is predicting the signal peptide, a sequence has only one label, so we can easily analyse the effect of each amino acid's global attention value on the whole-sequence prediction, i.e. on the final predicted value. But in the secondary-structure task, each amino acid position has its own label, so how do I analyse the global attention values in relation to these positions?

Thank you.

Save model after finetune process

Hi Nadav,
First of all, thank you very much for releasing ProteinBERT for public use.
I tried to use ProteinBERT for stability prediction. Here is the code from your Colab notebook "ProteinBERT demo.ipynb", with some changes to fine-tune ProteinBERT only for the stability-prediction task:

import os

import pandas as pd
from IPython.display import display

from tensorflow import keras

from sklearn.model_selection import train_test_split

from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model, finetune, evaluate_by_len, log
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

local_model_dump_dir = '/content/drive/MyDrive/Thesis/proteinbert_models'
local_model_dump_file_name = 'epoch_92400_sample_23500000_10_03.pkl'

BENCHMARKS = [
    # name, output_type
    # ('signalP_binary', OutputType(False, 'binary')),
    # ('fluorescence', OutputType(False, 'numeric')),
    # ('remote_homology', OutputType(False, 'categorical')),
    ('stability', OutputType(False, 'numeric'))
    # ('scop', OutputType(False, 'categorical')),
    # ('secondary_structure', OutputType(True, 'categorical')),
    # ('disorder_secondary_structure', OutputType(True, 'binary')),
    # ('ProFET_NP_SP_Cleaved', OutputType(False, 'binary')),
    # ('PhosphositePTM', OutputType(True, 'binary')),
]

settings = {
    'max_dataset_size': None,
    'max_epochs_per_stage': 1,
    'seq_len': 512,
    'batch_size': 32,
    'final_epoch_seq_len': 1024,
    'initial_lr_with_frozen_pretrained_layers': 1e-02,
    'initial_lr_with_all_layers': 1e-04,
    'final_epoch_lr': 1e-05,
    'dropout_rate': 0.5,
    'training_callbacks': [
        keras.callbacks.ReduceLROnPlateau(patience = 1, factor = 0.25, min_lr = 1e-05, verbose = 1),
        keras.callbacks.EarlyStopping(patience = 2, restore_best_weights = True),
    ],
}

####### Uncomment for debug mode
# settings['max_dataset_size'] = 500
# settings['max_epochs_per_stage'] = 1

def run_benchmark(benchmark_name, pretraining_model_generator, input_encoder, pretraining_model_manipulation_function = None):
    
    log('========== %s ==========' % benchmark_name)  
    
    output_type = get_benchmark_output_type(benchmark_name)
    log('Output type: %s' % output_type)
    
    train_set, valid_set, test_set = load_benchmark_dataset(benchmark_name)        
    log(f'{len(train_set)} training set records, {len(valid_set)} validation set records, {len(test_set)} test set records.')
    
    if settings['max_dataset_size'] is not None:
        log('Limiting the training, validation and test sets to %d records each.' % settings['max_dataset_size'])
        train_set = train_set.sample(min(settings['max_dataset_size'], len(train_set)), random_state = 0)
        valid_set = valid_set.sample(min(settings['max_dataset_size'], len(valid_set)), random_state = 0)
        test_set = test_set.sample(min(settings['max_dataset_size'], len(test_set)), random_state = 0)
    
    if output_type.is_seq or output_type.is_categorical:
        train_set['label'] = train_set['label'].astype(str)
        valid_set['label'] = valid_set['label'].astype(str)
        test_set['label'] = test_set['label'].astype(str)
    else:
        train_set['label'] = train_set['label'].astype(float)
        valid_set['label'] = valid_set['label'].astype(float)
        test_set['label'] = test_set['label'].astype(float)
        
    if output_type.is_categorical:
        
        if output_type.is_seq:
            unique_labels = sorted(set.union(*train_set['label'].apply(set)) | set.union(*valid_set['label'].apply(set)) | \
                    set.union(*test_set['label'].apply(set)))
        else:
            unique_labels = sorted(set(train_set['label'].unique()) | set(valid_set['label'].unique()) | set(test_set['label'].unique()))
            
        log('%d unique labels.' % len(unique_labels))
    elif output_type.is_binary:
        unique_labels = [0, 1]
    else:
        unique_labels = None
        
    output_spec = OutputSpec(output_type, unique_labels)
    model_generator = FinetuningModelGenerator(pretraining_model_generator, output_spec, pretraining_model_manipulation_function = \
            pretraining_model_manipulation_function, dropout_rate = settings['dropout_rate'])
    finetune(model_generator, input_encoder, output_spec, train_set['seq'], train_set['label'], valid_set['seq'], valid_set['label'], \
            seq_len = settings['seq_len'], batch_size = settings['batch_size'], max_epochs_per_stage = settings['max_epochs_per_stage'], \
            lr = settings['initial_lr_with_all_layers'], begin_with_frozen_pretrained_layers = True, lr_with_frozen_pretrained_layers = \
            settings['initial_lr_with_frozen_pretrained_layers'], n_final_epochs = 1, final_seq_len = settings['final_epoch_seq_len'], \
            final_lr = settings['final_epoch_lr'], callbacks = settings['training_callbacks'])
    
    for dataset_name, dataset in [('Training-set', train_set), ('Validation-set', valid_set), ('Test-set', test_set)]:
        
        log('*** %s performance: ***' % dataset_name)
        results, confusion_matrix = evaluate_by_len(model_generator, input_encoder, output_spec, dataset['seq'], dataset['label'], \
                start_seq_len = settings['seq_len'], start_batch_size = settings['batch_size'])
    
        with pd.option_context('display.max_rows', None, 'display.max_columns', None):
            display(results)
        
        if confusion_matrix is not None:
            with pd.option_context('display.max_rows', 16, 'display.max_columns', 10):
                log('Confusion matrix:')
                display(confusion_matrix)
                
    return model_generator

def load_benchmark_dataset(benchmark_name):
    
    train_set_file_path = os.path.join(BENCHMARKS_DIR, '%s.train.csv' % benchmark_name)
    valid_set_file_path = os.path.join(BENCHMARKS_DIR, '%s.valid.csv' % benchmark_name)
    test_set_file_path = os.path.join(BENCHMARKS_DIR, '%s.test.csv' % benchmark_name)
    
    train_set = pd.read_csv(train_set_file_path).dropna().drop_duplicates()
    test_set = pd.read_csv(test_set_file_path).dropna().drop_duplicates()
          
    if os.path.exists(valid_set_file_path):
        valid_set = pd.read_csv(valid_set_file_path).dropna().drop_duplicates()
    else:
        log(f'Validation set {valid_set_file_path} missing. Splitting training set instead.')
        train_set, valid_set = train_test_split(train_set, stratify = train_set['label'], test_size = 0.1, random_state = 0)
    
    return train_set, valid_set, test_set

def get_benchmark_output_type(benchmark_name):
    for name, output_type in BENCHMARKS:
        if name == benchmark_name:
            return output_type
        
pretrained_model_generator, input_encoder = load_pretrained_model(local_model_dump_dir=local_model_dump_dir,
                                                                  local_model_dump_file_name=local_model_dump_file_name) # load_pretrained_model()

for benchmark_name, _ in BENCHMARKS:
    run_benchmark(benchmark_name, pretrained_model_generator, input_encoder, pretraining_model_manipulation_function = \
            get_model_with_hidden_layers_as_outputs)
        
log('Done.')

After the fine-tuning process is finished, I want to save the fine-tuned model, so that in the next Colab run I can load it without re-training it:
pretrained_model_generator, input_encoder = load_pretrained_model(local_model_dump_dir="my/local/dir", local_model_dump_file_name="my/finetuned/model")
How do I do such saving?

Regards,
Aviv.
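A minimal sketch of one way to do this, drawn from the pickling approach shown in a later issue in this list (the weights path and file names are placeholders): save model_generator.model_weights after fine-tuning, then rebuild the model by passing those weights back into FinetuningModelGenerator.

import pickle

from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

WEIGHTS_PATH = '/content/drive/MyDrive/Thesis/proteinbert_models/stability_finetuned_weights.pkl'  # placeholder

# Right after finetune(...) has finished, persist the learned weights
# (model_generator is the FinetuningModelGenerator that was fine-tuned above).
with open(WEIGHTS_PATH, 'wb') as f:
    pickle.dump(model_generator.model_weights, f)

# In a later Colab run, rebuild the fine-tuned model from the saved weights.
with open(WEIGHTS_PATH, 'rb') as f:
    saved_weights = pickle.load(f)

pretrained_model_generator, input_encoder = load_pretrained_model(
    local_model_dump_dir='/content/drive/MyDrive/Thesis/proteinbert_models',
    local_model_dump_file_name='epoch_92400_sample_23500000_10_03.pkl')

output_spec = OutputSpec(OutputType(False, 'numeric'), None)
model_generator = FinetuningModelGenerator(
    pretrained_model_generator, output_spec,
    pretraining_model_manipulation_function=get_model_with_hidden_layers_as_outputs,
    model_weights=saved_weights)
model = model_generator.create_model(512)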

How to customize a different fine-tuning task like protein-protein interaction (PPI)?

Hello:
As mentioned in the title, I wonder how to customize a fine-tuning task such as protein-protein interaction. I am not sure which part of finetuning.py (or perhaps FinetuningModelGenerator in model_generation.py) I should change; I checked all the fine-tuning tasks in this repo, and they always have one protein sequence per sample.
The following CSV format is the dataset I want to use for my task. Many thanks.

label seq_A seq_B
0 MAVSVTPIRDTKWLTLEVCREFQRGTCSRPDTECKFAHPSKSCQVENGRVIACFDSLKGRCSRENCKYLH MATTNSFTILIFMILATTSSTFATLGEMVTVLSIDGGGIKGIIPATILEFLEGQLQEVDNNTDARLADYF
1 MCCEKWSRVAEMFLFIEEREDCKILCLCSRAFVEDRKLYNLGLKGYYIRDSGNNSGDQATEEEEGGYSCG MSASSRFIPEHRRQNYKGKGTFQADELRRRRETQQIEIRKQKREENLNKRRNLVDVQEPAEETIPLEQDK
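A sketch of one possible workaround (not a feature of this repository, just an illustration): merge each pair into a single input column and treat PPI as a binary sequence-level task, which the existing fine-tuning pipeline already handles. The file name here is hypothetical.

import pandas as pd
from proteinbert import OutputType, OutputSpec

# Hypothetical whitespace-separated file with the columns shown above: label, seq_A, seq_B.
df = pd.read_csv('ppi_pairs.csv', sep=r'\s+')

# Simplest option: concatenate the two sequences into one input string.
# Whether and how to mark the boundary between the two chains is a modelling
# decision, not something the library prescribes.
df['seq'] = df['seq_A'] + df['seq_B']
df['label'] = df['label'].astype(int)

OUTPUT_TYPE = OutputType(False, 'binary')
OUTPUT_SPEC = OutputSpec(OUTPUT_TYPE, [0, 1])

# df[['seq', 'label']] can then be split into train/validation sets and passed
# to finetune(...) exactly as in the demo notebook.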

Hello World

Can you please post a hello world code example? I need to use the pre-trained model to predict one missing residue in a single sequence. I have spent hours trying to figure it out using the demo examples and code snippets in the issues, but haven't had any success. I would be really grateful for any help with this.
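A minimal sketch of one possible approach, not verified against the library internals: as I understand the pretraining setup, the raw pretrained model (created without the hidden-layers wrapper) returns per-position token probabilities as its first output, so a missing residue can be filled in by taking the argmax at that position. Treat the output layout, the use of 'X' for the unknown residue, and the +1 offset for the start token as assumptions.

import numpy as np
from proteinbert import load_pretrained_model

seq = 'DQSVRKLVRKLPDEGLDREKVKTYLDKLGVDREELQKFSDAIGLESSGGS'
missing_pos = 10                                   # 0-based index of the residue to predict
seq_with_gap = seq[:missing_pos] + 'X' + seq[missing_pos + 1:]

seq_len = len(seq) + 2                             # room for the start/end tokens (assumption)
pretrained_model_generator, input_encoder = load_pretrained_model()
model = pretrained_model_generator.create_model(seq_len)   # raw pretrained model, no wrapper

X = input_encoder.encode_X([seq_with_gap], seq_len)
seq_probs, annotation_probs = model.predict(X, batch_size=1)

# Assumes position 0 holds the start token, so the residue of interest sits at missing_pos + 1.
predicted_token_index = int(np.argmax(seq_probs[0, missing_pos + 1]))
print('Predicted token index:', predicted_token_index)
# Mapping this index back to an amino acid letter requires the tokenizer's
# vocabulary (see proteinbert's tokenization module); the exact mapping is not reproduced here.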

Finetuning problem

When I use my own protein sequence dataset for fine-tuning, since my protein sequences have no GO annotations, what should the input be for the annotations input?

UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.

In model_generation.py:

model.compile(optimizer = self.optimizer_class(lr = self.lr,...

should be changed to:

model.compile(optimizer = self.optimizer_class(learning_rate = self.lr,...

To avoid the following deprecation warnings:
UserWarning: The `lr` argument is deprecated, use `learning_rate` instead. super(Adam, self).__init__(name, **kwargs)

Question: Load finetuned weights and create model for prediction

Hi,
Following up on the closed issue #47, I saved the fine-tuned weights (in a Colab notebook) for stability prediction as shown below:

with open(f'{fine_tuned_model_dir}/{fine_tuned_model_fname}', 'wb') as f:
    pickle.dump(model_generator.model_weights, f)

I wanted to verify my usage when loading the saved weights:

SEQ_LEN = 27
DUMMY_SEQS = ['DLIPTSSKLVVVDTSLQVKKAFFALVT']

BATCH_SIZE = 1

MODEL_DIR = 'D:\\python_projects\\genetic_algo\\ProteinBert_Weights'

fine_tuned_model_dir = MODEL_DIR
fine_tuned_model_fname = 'fine_tuned_model_FULL_TRAIN.pkl'

local_model_dump_dir = MODEL_DIR
local_model_dump_file_name = 'epoch_92400_sample_23500000_10_03.pkl'

# unpickling the weights
with open(f'{fine_tuned_model_dir}\\{fine_tuned_model_fname}', 'rb') as f:
    saved_model_weights = pickle.load(f)

saved_pretrained_model_generator, saved_input_encoder = load_pretrained_model(
    local_model_dump_dir=local_model_dump_dir,
    local_model_dump_file_name=local_model_dump_file_name)

stability_type = OutputType(False, 'numeric')
stability_spec = OutputSpec(output_type=stability_type,
                            unique_labels=None)

saved_model_generator = FinetuningModelGenerator(saved_pretrained_model_generator,
                                                 stability_spec,
                                                 pretraining_model_manipulation_function=get_model_with_hidden_layers_as_outputs,
                                                 model_weights=saved_model_weights)

model = saved_model_generator.create_model(SEQ_LEN+2)

X = saved_input_encoder.encode_X(DUMMY_SEQS, SEQ_LEN)
Y = model.predict(X, batch_size=BATCH_SIZE)

Is it correct?

Best Regards,
from Aviv

Using ProteinBERT

!wget ftp://ftp.cs.huji.ac.il/users/nadavb/protein_bert/epoch_92400_sample_23500000.pkl
model = torch.load("/content/epoch_92400_sample_23500000.pkl")

ERRORS

RuntimeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 model = torch.load("/content/epoch_92400_sample_23500000.pkl")

1 frames
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
1033 magic_number = pickle_module.load(f, **pickle_load_args)
1034 if magic_number != MAGIC_NUMBER:
-> 1035 raise RuntimeError("Invalid magic number; corrupt file?")
1036 protocol_version = pickle_module.load(f, **pickle_load_args)
1037 if protocol_version != PROTOCOL_VERSION:

RuntimeError: Invalid magic number; corrupt file?
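For context, the .pkl is a plain Python pickle produced by this TensorFlow/Keras project (a stack trace in a later issue shows it being read with pickle.load), not a PyTorch checkpoint, so torch.load cannot parse it. A minimal sketch of loading it through the library instead (the directory is a placeholder):

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Point the loader at the directory holding the downloaded .pkl file.
pretrained_model_generator, input_encoder = load_pretrained_model(
    local_model_dump_dir='/content',
    local_model_dump_file_name='epoch_92400_sample_23500000.pkl')

seq_len = 512  # example sequence length
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))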

Question on stability.valid.csv

Dear all,

I noticed the following:

wc -l stability.*csv
12852 stability.test.csv
53615 stability.train.csv
2513 stability.valid.csv
68980 total

Now, when I look at stability.valid.csv in particular, all sequences end with "ESSGGS".

Is there any particular reason for this?

Likewise, the number of records appears quite low in stability.valid.csv.

Best wishes,

Christoph

cannot load pkl when loading the pre-trained protein_bert model

Hi nadavbra,

I'm running into the same issue as WangShixianChina (commented on Sep 29). When I try to load the pre-trained model in my protein_bert.py script:


from tensorflow import keras
from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model, finetune, evaluate_by_len
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model()


I get an error message stating: pickle.UnpicklingError: pickle data was truncated.

I need help unloading/unpickling the epoch_92400_sample_23500000.pkl file. I tried to install the exact or closest Python packages required by ProteinBERT. I also show the stack trace when I try to load the pretrained model (see below).

$ pip install tensorflow==2.4.0
Collecting tensorflow==2.4.0
Successfully installed gast-0.3.3 grpcio-1.32.0 numpy-1.19.5 tensorflow-2.4.0 tensorflow-estimator-2.4.0

$ pip install tensorflow_addons==0.12.1
Successfully installed tensorflow_addons==0.12.1

$ pip install numpy==1.20.1
ERROR: Could not find a version that satisfies the requirement numpy==1.20.1 (from versions: 1.3.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.9.1, 1.9.2, 1.9.3, 1.10.0.post2, 1.10.1, 1.10.2, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.12.0, 1.12.1, 1.13.0rc1, 1.13.0rc2, 1.13.0, 1.13.1, 1.13.3, 1.14.0rc1, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0rc1, 1.15.0rc2, 1.15.0, 1.15.1, 1.15.2, 1.15.3, 1.15.4, 1.16.0rc1, 1.16.0rc2, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.16.4, 1.16.5, 1.16.6, 1.17.0rc1, 1.17.0rc2, 1.17.0, 1.17.1, 1.17.2, 1.17.3, 1.17.4, 1.17.5, 1.18.0rc1, 1.18.0, 1.18.1, 1.18.2, 1.18.3, 1.18.4, 1.18.5, 1.19.0rc1, 1.19.0rc2, 1.19.0, 1.19.1, 1.19.2, 1.19.3, 1.19.4, 1.19.5)
ERROR: No matching distribution found for numpy==1.20.1
$ pip install numpy==1.19.5
Requirement already satisfied: numpy==1.19.5

$ pip install pandas==1.2.3
ERROR: Could not find a version that satisfies the requirement pandas==1.2.3 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5)
ERROR: No matching distribution found for pandas==1.2.3
$ pip install pandas==1.1.5
Successfully installed pandas-1.1.5

$ pip install h5py==3.2.1
ERROR: Could not find a version that satisfies the requirement h5py==3.2.1 (from versions: 2.2.1, 2.3.0b1, 2.3.0, 2.3.1, 2.4.0b1, 2.4.0, 2.5.0, 2.6.0, 2.7.0rc2, 2.7.0, 2.7.1, 2.8.0rc1, 2.8.0, 2.9.0rc1, 2.9.0, 2.10.0, 3.0.0rc1, 3.0.0, 3.1.0)
ERROR: No matching distribution found for h5py==3.2.1
$ pip install h5py==3.1.0
Successfully installed cached-property-1.5.2 h5py-3.1.0

$ pip install lxml==4.3.2
Successfully installed lxml-4.3.2

$ pip install pyfaidx==0.5.8
Successfully installed pyfaidx-0.5.8

$ python protein_bert.py
2021-12-02 08:33:15.585243: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-02 08:33:15.585275: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "protein_bert.py", line 40, in
pretrained_model_generator, input_encoder = load_pretrained_model()
File "/home/williamsawran/anaconda3/envs/protein_bert/lib/python3.6/site-packages/proteinbert/existing_model_loading.py", line 53, in load_pretrained_model
other_optimizer_kwargs = other_optimizer_kwargs, annots_loss_weight = annots_loss_weight, load_optimizer_weights = load_optimizer_weights)
File "/home/williamsawran/anaconda3/envs/protein_bert/lib/python3.6/site-packages/proteinbert/model_generation.py", line 159, in load_pretrained_model_from_dump
n_annotations, model_weights, optimizer_weights = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated

How can I download this pkl?

'ftp://ftp.cs.huji.ac.il/users/nadavb/protein_bert/epoch_92400_sample_23500000.pkl'

I cannot access this path; can you help me?

The annotation number does not match your pretrained model

Hi Protein_bert team,

Thanks for providing such a useful model. I found one odd thing about your pretrained model: after I created the UniRef90 database and merged it with the GO database through the pipeline you provided, I got 9211 annotation records. This differs from the number in your manuscript, 8943. Would you mind looking into this?

'Adam' object has no attribute 'get_weights'

I ran the fine-tuning on Google Colab and it returned this error message:

AttributeError                            Traceback (most recent call last)
[<ipython-input-12-5e6f22672ea3>](https://localhost:8080/#) in <module>
----> 1 finetune(model_generator, input_encoder, OUTPUT_SPEC, train_set['seq'], train_set['label'], valid_set['seq'], valid_set['label'], \
      2         seq_len = 512, batch_size = 32, max_epochs_per_stage = 5, lr = 1e-04, begin_with_frozen_pretrained_layers = True, \
      3         lr_with_frozen_pretrained_layers = 1e-02, n_final_epochs = 1, final_seq_len = 1024, final_lr = 1e-05, callbacks = training_callbacks)

2 frames
[/usr/local/lib/python3.9/dist-packages/proteinbert/model_generation.py](https://localhost:8080/#) in update_state(self, model)
     33     def update_state(self, model):
     34         self.model_weights = copy_weights(model.get_weights())
---> 35         self.optimizer_weights = copy_weights(model.optimizer.get_weights())
     36 
     37     def _init_weights(self, model):

AttributeError: 'Adam' object has no attribute 'get_weights'

I tried to change the optimizer via model_generator.compile but it didn't work out.
Could you suggest how I can fix this error, please?

final epochs with a different length when fine-tuning the model

Dear authors,

Could you please explain why we need to train the model for extra epochs with a different length (final_seq_len)? Is that to make the fine-tuned model better resemble your pretraining conditions, or did you find that it results in better performance?

https://github.com/nadavbra/protein_bert/blob/2090b78f09e70f0b960d7b591bf80d3d411c9d3f/proteinbert/finetuning.py#L59-L64
if n_final_epochs > 0:
    log('Training on final epochs of sequence length %d...' % final_seq_len)
    final_batch_size = max(int(batch_size / (final_seq_len / seq_len)), 1)
    encoded_train_set, encoded_valid_set = encode_train_and_valid_sets(train_seqs, train_raw_Y, valid_seqs, valid_raw_Y, input_encoder, output_spec, final_seq_len)
    model_generator.train(encoded_train_set, encoded_valid_set, final_seq_len, final_batch_size, n_final_epochs, lr = final_lr, callbacks = callbacks, \
            freeze_pretrained_layers = False)

And how should the final fine-tuned model be created: model_generator.create_model(seq_len) or model_generator.create_model(final_seq_len)?

Thank you for your help.
Ai

Restore amino acid sequence from local_representations or global_representations

Hello,
I have a question.

Is there any way to recover the original one-dimensional amino acid sequence from local_representations or global_representations?

The code is as follows.

model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations= model.predict(X, batch_size = batch_size)

Thanks for your help!

Clustering protein sequences

Good afternoon, devs. I would like to use ProteinBert to perform protein sequence-based clustering. I have several questions:

  • Is it possible with this tool?
  • Would training it from scratch be very beneficial? (it's a step I'd like to skip)
  • How would I apply the model to a set of sequences? I could not find any guidance in the README file.

Thank you in advance
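A minimal sketch of one possible approach (not from the maintainers): embed each sequence with the pretrained model, as shown in other issues here, and then cluster the global representations with scikit-learn; no training from scratch is needed for this. The example sequences and number of clusters are placeholders.

from sklearn.cluster import KMeans

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Example sequences; in practice, parse them from your FASTA file.
seqs = [
    'DQSVRKLVRKLPDEGLDREKVKTYLDKLGVDREELQKFSDAIGLESSGGS',
    'GSSDIEITVEGKEQADKVIEEMKRRNLEVHVEEHNGQYIDKASLESSGGS',
    'DLIPTSSKLVVVDTSLQVKKAFFALVT',
]
seq_len = 512        # must exceed the longest sequence plus the added special tokens
batch_size = 32

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(X, batch_size=batch_size)

# Cluster the per-protein (global) embeddings; the number of clusters (2 here) is arbitrary.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(global_representations)
print(labels)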

Error whilst evaluating fine-tuned model with categorical GO terms

Hello,

I am specifically looking to use the model to predict GO terms for sequences less than 150 AAs in length, so I am attempting to fine-tune the model on my dataset of small sequences. The fine-tuning process seems successful; however, when I run the evaluate_by_len method in finetuning.py I get an out-of-memory error (see below). The error originates on line 90, from y_pred = model.predict(X, batch_size = batch_size). I have reduced the batch_size right down to 2 in an attempt to prevent the error, but with no success. I just wanted to check that I am inputting the data correctly and that it isn't a problem with my code.

Code & inputs into methods
 save_data = {
      "benchmark_name": BENCHMARK_NAME,
      "samples": [('Training-set', (train_set_seqs, train_set_labels)), ('Validation-set', (valid_set_seqs, valid_set_labels))],
      "model_generator": model_generator,
      "input_encoder": input_encoder,
      "output_spec": output_spec,
      "start_seq_len" : settings['seq_len'],
      "start_batch_size": settings['batch_size']
    }

for dataset_name, dataset in saved_data["samples"]:
          log('*** %s performance: ***' % dataset_name)
          log('batch size: ', saved_data["start_batch_size"])
          results, confusion_matrix = evaluate_by_len(saved_data["model_generator"], saved_data["input_encoder"], saved_data["output_spec"], dataset[0], dataset[1], \
                  start_seq_len = saved_data["start_seq_len"], start_batch_size = saved_data["start_batch_size"])

(These are entered as dataframes which I assume is the correct input format, but just want to check.)
dataset[0]:
(dataframes.head() method)

Name: seq, dtype: object
P82299                               GAYGQGQNIGQLFVNILIFLFY
O46577    SVVKSEDFSLPAYMDRRDHPLPEVAHVKHLSASQKALKEKEKASWS...
A0QKT3    MPTYAPKAGDTTRSWYVIDATDVVLGRLAVAAANLLRGKHKPTFAP...
Q4QM28    MYAVFQSGGKQHRVSEGQVVRLEKLELATGATVEFDSVLMVVNGED...
A7GR09    MKWWKLSGQILLLFCFAWTGEWIAKQVHLPIPGSIIGIFLLLISLK...

dataset[1]:

P82299                             [GO:0005615, GO:0005344]
O46577    [GO:0016021, GO:0005743, GO:0005751, GO:000412...
A0QKT3                 [GO:0005840, GO:0003735, GO:0006412]
Q4QM28     [GO:0005840, GO:0019843, GO:0003735, GO:0006412]
A7GR09     [GO:0016021, GO:0005886, GO:0019835, GO:0012501]
OOM Error
2022-03-01 15:33:19.980713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:84:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-03-01 15:33:19.980756: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-03-01 15:33:19.980797: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-03-01 15:33:19.980817: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-03-01 15:33:19.980840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-01 15:33:19.980859: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-01 15:33:19.980881: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-01 15:33:19.980904: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-03-01 15:33:19.980948: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-03-01 15:33:19.981759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2022-03-01 15:33:19.981796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-03-01 15:33:20.725621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-01 15:33:20.725680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1 
2022-03-01 15:33:20.725691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N N 
2022-03-01 15:33:20.725697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   N N 
2022-03-01 15:33:20.726963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11119 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:08:00.0, compute capability: 6.0)
2022-03-01 15:33:20.728077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11119 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:84:00.0, compute capability: 6.0)
2022-03-01 15:33:22.452796: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-03-01 15:33:22.453398: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3196210000 Hz
2022-03-01 15:33:23.836672: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-03-01 15:33:24.074444: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-03-01 15:33:24.080029: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-03-01 15:33:24.943426: W tensorflow/stream_executor/gpu/asm_compiler.cc:63] Running ptxas --version returned 256
2022-03-01 15:33:24.995202: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2022-03-01 15:33:54.356200: W tensorflow/core/common_runtime/bfc_allocator.cc:433] Allocator (GPU_0_bfc) ran out of memory trying to allocate 128.0KiB (rounded to 131072)requested by op model_2/global-attention-block5/transpose_3
Current allocation summary follows.
2022-03-01 15:33:54.356891: I tensorflow/core/common_runtime/bfc_allocator.cc:972] BFCAllocator dump for GPU_0_bfc
2022-03-01 15:33:54.356918: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (256): 	Total Chunks: 31, Chunks in use: 31. 7.8KiB allocated for chunks. 7.8KiB in use in bin. 240B client-requested in use in bin.
2022-03-01 15:33:54.356930: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (512): 	Total Chunks: 50, Chunks in use: 50. 25.5KiB allocated for chunks. 25.5KiB in use in bin. 25.2KiB client-requested in use in bin.
2022-03-01 15:33:54.356939: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (1024): 	Total Chunks: 2, Chunks in use: 1. 2.8KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2022-03-01 15:33:54.356948: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (2048): 	Total Chunks: 38, Chunks in use: 38. 76.5KiB allocated for chunks. 76.5KiB in use in bin. 76.0KiB client-requested in use in bin.
2022-03-01 15:33:54.356956: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (4096): 	Total Chunks: 1, Chunks in use: 0. 5.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-03-01 15:33:54.356965: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (8192): 	Total Chunks: 2, Chunks in use: 1. 26.0KiB allocated for chunks. 13.0KiB in use in bin. 13.0KiB client-requested in use in bin.
2022-03-01 15:33:54.356981: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (16384): 	Total Chunks: 3, Chunks in use: 1. 71.5KiB allocated for chunks. 20.5KiB in use in bin. 20.3KiB client-requested in use in bin.
2022-03-01 15:33:54.356990: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (32768): 	Total Chunks: 1, Chunks in use: 1. 51.0KiB allocated for chunks. 51.0KiB in use in bin. 34.9KiB client-requested in use in bin.
2022-03-01 15:33:54.356999: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (65536): 	Total Chunks: 8, Chunks in use: 8. 661.2KiB allocated for chunks. 661.2KiB in use in bin. 557.5KiB client-requested in use in bin.
2022-03-01 15:33:54.357008: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (131072): 	Total Chunks: 11, Chunks in use: 11. 1.50MiB allocated for chunks. 1.50MiB in use in bin. 1.32MiB client-requested in use in bin.
2022-03-01 15:33:54.357016: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (262144): 	Total Chunks: 17, Chunks in use: 17. 4.41MiB allocated for chunks. 4.41MiB in use in bin. 4.25MiB client-requested in use in bin.
..... (A lot more chunk messages)
memory_limit_: 11659697216 available bytes: 64 curr_region_allocation_bytes_: 23319394816
2022-03-01 15:34:14.429783: I tensorflow/core/common_runtime/bfc_allocator.cc:1048] Stats: 
Limit:                     11659697216
InUse:                     11659624704
MaxInUse:                  11659624704
NumAllocs:                     1112241
MaxAllocSize:                592642048
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
Traceback (most recent call last):
  File "/home/mah51/files/protein_bert/evaluate_go.py", line 37, in <module>
    run_job()
  File "/home/mah51/files/protein_bert/evaluate_go.py", line 27, in run_job
    start_seq_len = saved_data["start_seq_len"], start_batch_size = saved_data["start_batch_size"])
  File "/shared/home/mah51/files/protein_bert/proteinbert/finetuning.py", line 93, in evaluate_by_len
    y_pred = model.predict(X, batch_size = 1)
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1629, in predict
    tmp_batch_outputs = self.predict_function(iterator)
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 862, in _call
    results = self._stateful_fn(*args, **kwds)
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[128,4,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node model_2/global-attention-block5/transpose_3 (defined at shared/home/mah51/files/protein_bert/proteinbert/conv_and_global_attention_model.py:77) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_predict_function_6175]

Function call stack:
predict_function

Thank you in advance for any help!

problem with finetuning

Hello,
I have been trying to run the example given in this notebook (https://github.com/nadavbra/protein_bert/blob/master/ProteinBERT%20demo.ipynb). However, when the model tried to fine-tune and run for 40 epochs, after 3 epochs this error happened:
AttributeError Traceback (most recent call last)

in <cell line: 1>()
----> 1 finetune(model_generator, input_encoder, OUTPUT_SPEC, train_set['seq'], train_set['label'], valid_set['seq'], valid_set['label'],
2 seq_len = 512, batch_size = 32, max_epochs_per_stage = 10, lr = 1e-04, begin_with_frozen_pretrained_layers = True,
3 lr_with_frozen_pretrained_layers = 1e-02, n_final_epochs = 1, final_seq_len = 1024, final_lr = 1e-05, callbacks = training_callbacks)

2 frames

/usr/local/lib/python3.10/dist-packages/proteinbert/model_generation.py in update_state(self, model)
33 def update_state(self, model):
34 self.model_weights = copy_weights(model.get_weights())
---> 35 self.optimizer_weights = copy_weights(model.optimizer.get_weights())
36
37 def _init_weights(self, model):

AttributeError: 'Adam' object has no attribute 'get_weights'

I tried to uninstall TensorFlow and reinstall an older version, but again, after a couple of epochs, this error happened:
Epoch 1/10
468/468 [==============================] - ETA: 0s - loss: 0.0998

WARNING:tensorflow:evaluate() received a value for sample_weight, but weighted_metrics were not provided. Did you mean to pass metrics to weighted_metrics in compile()? If this is intentional you can pass weighted_metrics=[] to compile() in order to silence this warning.

468/468 [==============================] - 1594s 3s/step - loss: 0.0998 - val_loss: 0.0641 - lr: 0.0100
Epoch 2/10
468/468 [==============================] - ETA: 0s - loss: 0.0741

WARNING:tensorflow:evaluate() received a value for sample_weight, but weighted_metrics were not provided. Did you mean to pass metrics to weighted_metrics in compile()? If this is intentional you can pass weighted_metrics=[] to compile() in order to silence this warning.

468/468 [==============================] - 1597s 3s/step - loss: 0.0741 - val_loss: 0.0635 - lr: 0.0100
Epoch 3/10
468/468 [==============================] - ETA: 0s - loss: 0.0735

WARNING:tensorflow:evaluate() received a value for sample_weight, but weighted_metrics were not provided. Did you mean to pass metrics to weighted_metrics in compile()? If this is intentional you can pass weighted_metrics=[] to compile() in order to silence this warning.

Epoch 3: ReduceLROnPlateau reducing learning rate to 0.0024999999441206455.
468/468 [==============================] - 1568s 3s/step - loss: 0.0735 - val_loss: 0.1107 - lr: 0.0100
[2023_06_14-16:24:03] Training the entire fine-tuned model...
[2023_06_14-16:24:20] Incompatible number of optimizer weights - will not initialize them.
Epoch 1/10
8/468 [..............................] - ETA: 1:14:39 - loss: 0.1079

(I just changed the epoch numbers so it would finish faster, but it is the same as with 40 epochs.)
I just wanted to run the example to see the overall fine-tuning process. I don't want to override any methods or functions. How can I deal with these errors?

Finetuning

I need your help: how can I add a layer during fine-tuning?

Question related to Fine-Tuning Phase

Hello,
Thanks for this great publication and for open-sourcing the whole codebase.
In the supplementary material of the paper, the following is mentioned: "for tasks involving global predictions (at the entire protein level), the final layer was connected to its global representations. Specifically, we used a concatenation of all the global normalization layers (2 per block) and the output annotations". I have gone through your notebook related to the fine-tuning step but did not find the answer. Does this statement mean that we load the pretrained model weights for the annotation part only, provide the protein sequence and a GO annotation vector of all zeros as input, pass it through a dropout layer with a rate of 0.5 and then a dense layer mapping to 8943 dimensions, and predict the GO annotations from the protein sequence? Can you explain this fine-tuning process? I want to predict GO annotations from protein sequences.

Minibatch sequence length

Hi,

First, thank you very much for making the code and the pretrained weight available and so user-friendly.

Wanted to ask about the minibatch data preparation -
In the paper (section 2.2) you write: "When encoding a protein longer than the chosen sequence length, we selected a random subsequence of the protein, leaving out at least one of its two ends."
However, in the code you filter out samples that exceed the current sequence length, but you still add the START and END tokens.
Was this left out on purpose?

Thanks!

finetune with dataset of different length sequences

Dear authors,

Thank you for sharing your great work.
I would like to fine-tune your pretrained model on my dataset. The only problem is that the sequences in the dataset have varying lengths. I wonder if the model is still suitable for this task.

Bests,
Ai

Cannot convert a symbolic Tensor

Dear Developer,

I tried to run "Fine-tune the model for the signal peptide benchmark" in the demo but got the following error. Could you please help? Thanks!

File "C:\Users\miniconda3\envs\tensor25\lib\site-packages\tensorflow\python\framework\ops.py", line 870, in array
" a NumPy call, which is not supported".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (seq-merge1-norm-block1/mul_1:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

chunk_size value

I figured out that model.fit takes batch_size * batches_per_epoch samples. However, we load 100,000 samples each time we need new data (chunk_size). Can we reduce this number to batch_size * batches_per_epoch samples so that memory usage decreases (in the case of a fixed batch_size of 64)?

What to do with the local_representations and global_representations

Hello everyone,

After using the model, I have two arrays: the local_representations and the global_representations.

# After parsing the sequences from the FASTA file into 'seqs' and choosing 'seq_len' (e.g. 512) and 'batch_size' (e.g. 32)

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations= model.predict(X, batch_size = batch_size)

But now, what do I need to do to obtain the GO annotations of my sequences?

all the best
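A hedged sketch, based on my reading of the pretraining architecture (treat the output layout as an assumption): the hidden-layers wrapper returns representations, whereas the raw pretrained model also emits the annotation head, whose sigmoid values can be read as per-GO-term probabilities. Mapping the output columns back to GO identifiers requires the annotation metadata used at pretraining time.

import numpy as np
from proteinbert import load_pretrained_model

seqs = ['DQSVRKLVRKLPDEGLDREKVKTYLDKLGVDREELQKFSDAIGLESSGGS']  # example sequence
seq_len = 512
batch_size = 32

pretrained_model_generator, input_encoder = load_pretrained_model()
raw_model = pretrained_model_generator.create_model(seq_len)   # no hidden-layers wrapper

X = input_encoder.encode_X(seqs, seq_len)
# Assumption: the raw model's second output is the annotation head, one value per GO term.
seq_output, annotation_probs = raw_model.predict(X, batch_size=batch_size)

# Keep GO terms above an arbitrary 0.5 cutoff; the right threshold is an open question
# (see the "GO Annotation Vector" issue above).
predicted_go_indices = [np.where(p > 0.5)[0] for p in annotation_probs]
print(predicted_go_indices)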

FTP link doesn't work

Greetings.

The ftp link does not work. Could you please upload the trained models somewhere else?

Question related to text processing

Hello,
Is there a possibility to generate embedding representations of GO term definitions, such as "Any process that modulates a measurable attribute of any biological process, quality or function.", by fine-tuning the ProteinBERT model? Would fine-tuning ProteinBERT be beneficial for this task, given that it is trained on protein sequences along with GO labels, or would I need to pre-train ProteinBERT from scratch? If pre-training ProteinBERT from scratch on GO term definitions is required, can you give some suggestions? How much time would it take to pretrain the model from scratch?

WARNING:tensorflow:`evaluate()` received a value for `sample_weight`, but `weighted_metrics` were not provided

Hi,

I'm getting the following warning when running the "Run all benchmarks" section of the demo notebook with some of my own data. Is this something to worry about? The fine-tuning finishes and yields results for training/test/validation. Could it be related to using CPUs rather than GPUs?

WARNING:tensorflow:`evaluate()` received a value for `sample_weight`, but `weighted_metrics` were not provided. Did you mean to pass metrics to `weighted_metrics` in `compile() ? If this is intentional you can pass `weighted_metrics=[]` to `compile()` in order to silence this warning.

Also, while at it, would you recommend experimenting with any of the settings in that block (lr, dropout, batch size, etc.)? I wasn't sure whether any parameters are being varied automatically, since a validation set is present, or whether changing something manually might be beneficial.

Thank you!

Finetuning question

I want to use your method to fine-tune on my own problem, but I cannot download
ftp.cs.huji.ac.il/users/nadavb/protein_bert/protein_benchmarks
Please, what should I do?

save finetuned model

Hi, I have a question about saving a fine-tuned model, please.

After fine-tuning, I'm using save_weights like this:

finetune(model_generator,  ...)
...
model_generator.save_weights('fine_tuned_model.h5')

and if I want to use the model later:

pretrained_model_generator, input_encoder = load_pretrained_model(local_model_dump_dir='')
model_generator = FinetuningModelGenerator(
    pretrained_model_generator, 
    OUTPUT_SPEC, 
    pretraining_model_manipulation_function = get_model_with_hidden_layers_as_outputs,
    dropout_rate = 0.5)
fine_tuned_model = model_generator.create_model(512)
fine_tuned_model.load_weights('fine_tuned_model.h5')

where OUTPUT_SPEC is the same one I used to fine-tune the model.

Is this ok?

Downloading benchmark data

Hi,
I tried to download the protein benchmark data via ftp command
wget ftp://ftp.cs.huji.ac.il/users/nadavb/protein_bert/protein_benchmarks

but get the error
No such file ‘protein_benchmarks’.

Can someone please help with the right command line?

Thank you!

Input types while pretraining alone

Hi @nadavbra,

I'm trying to pretrain the model on an antibody dataset (without annotations). Is the following screenshot a good example of inputs to the model, in terms of types (for example, should seqs be an array of strings, or maybe an array of arrays)?

[screenshot of the input arrays attached]

Thank you very much for your help
Alon
