
Comments (49)

nadavbra avatar nadavbra commented on July 24, 2024 1

@xinformatics Ok, I realized what your problem is: you provided some sequences whose lengths exceed the seq_len you specified (512). When you encode sequences, you need to use a seq_len of at least (L + 2), where L is the length of the longest sequence you provide.
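For example (a minimal sketch, assuming the sequences are already collected in a Python list called 'seqs'):

# Pick an encoding length that covers the longest sequence plus the added start/end tokens.
seq_len = max(len(seq) for seq in seqs) + 2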


nadavbra avatar nadavbra commented on July 24, 2024 1

This representation is a concatenation of the representations from all layers of the model. If you want a smaller dimension, you can take the representations of a specific layer in the model (e.g. the output layer, which is a prediction of GO annotations).
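For instance (a sketch of my own, assuming 'pretrained_model_generator', 'input_encoder', 'seqs', 'seq_len' and 'batch_size' are set up as in the snippet later in this thread, and assuming the raw pretrained model exposes the per-position output and the annotation output as its two outputs, in that order):

# Use the raw pretrained model (no hidden-layer concatenation) to get a
# lower-dimensional global vector: the GO-annotation prediction head.
model = pretrained_model_generator.create_model(seq_len)
X = input_encoder.encode_X(seqs, seq_len)
_, go_annotation_scores = model.predict(X, batch_size = batch_size)  # shape (n, n_annotations)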


ddofer avatar ddofer commented on July 24, 2024 1

Another popular approach is to add a tf.keras.layers.GlobalAveragePooling1D layer on the output
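A rough sketch of that idea (my own, not from the thread; it assumes the model was built with get_model_with_hidden_layers_as_outputs as shown later in the thread, so that its first output is the per-residue local representation):

import tensorflow as tf
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

seq_len = 512  # assumed; must cover the longest sequence + 2

pretrained_model_generator, input_encoder = load_pretrained_model()
base = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

# Average the local (per-residue) output over the sequence dimension to get
# one fixed-size vector per protein.
pooled = tf.keras.layers.GlobalAveragePooling1D()(base.outputs[0])
embedding_model = tf.keras.Model(inputs = base.inputs, outputs = pooled)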


nadavbra avatar nadavbra commented on July 24, 2024 1

@Sherman-1 It's difficult to advise you based on the limited information you've shared (I don't know what you're planning to use the embeddings for). Generally speaking, it usually makes more sense to allow different residues to have different embeddings depending on their specific contexts, even if they are the same amino acid.


nadavbra avatar nadavbra commented on July 24, 2024

You have to use ProteinBert's Python interface.
You can parse the FASTA file with Biopython, and use ProteinBert's pretrained model to get the embeddings of each sequence (similar to how it is finetuned in our demo):

# After parsing the sequences from the FASTA file into 'seqs' and choosing 'seq_len' (e.g. 512) and 'batch_size' (e.g. 32)

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(X, batch_size = batch_size)
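For completeness, 'seqs' itself could be built from the FASTA file roughly like this (a sketch; 'my_proteins.fasta' is a placeholder name):

from Bio import SeqIO

# Convert each Biopython record to a plain string, as encode_X expects a list of strings.
seqs = [str(record.seq) for record in SeqIO.parse("my_proteins.fasta", "fasta")]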


dongli96 avatar dongli96 commented on July 24, 2024

Hello. Does this embedding script involve any training? I ran it on a machine with a 16 GB GPU, yet it still reported "out of memory". All I want is to load the existing model weights and perform some rounds of matrix multiplication to get the embedding results.


nadavbra avatar nadavbra commented on July 24, 2024

Strange, I'm pretty sure it should work with 16 GB of GPU memory (that's what we used). Are you sure your GPU isn't being used by another process?
Maybe try a smaller batch_size...


dongli96 avatar dongli96 commented on July 24, 2024

Solved. I misunderstood the problem: it came from RAM rather than GPU memory. Anyway, thanks for the reply.


xinformatics avatar xinformatics commented on July 24, 2024

Hi @nadavbra, I tried the code with a protein sequence of 431 amino acids, setting 'seq_len' to 512 as suggested. I am getting local_representations as a NumPy ndarray of shape (431, 512, 1562) and global_representations as a NumPy ndarray of shape (431, 15599). Could you shed some light on the dimensions I am getting? The first dimension, 431, is the protein length, but I am unable to understand the other dimensions of the output. Also, could you say more about the difference between local and global representations?

Thank you so much.


nadavbra avatar nadavbra commented on July 24, 2024

@xinformatics Did you give it a single protein sequence? The code I gave you expects 'seqs' to be a list of sequences (i.e. a list of strings). Like everything in deep learning, it is a vectorized process. If you really want to process just a single sequence, then you still need to wrap it inside a list (of size 1).
The global representations (of shape (n, d_global) where n is the number of sequences) provide information about the protein as a whole, whereas the local representations (of shape (n, l, d_local) where l is 512 in your case) provide information about each of the amino acids in each protein. You can read more in our paper.
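For instance (a sketch; 'single_seq' is a placeholder name, and 'input_encoder', 'model' and 'seq_len' are assumed to be defined as in the snippet above):

single_seq = "MTVKTEAAKG"  # hypothetical short peptide
X = input_encoder.encode_X([single_seq], seq_len)  # still a list, just of length 1
local_repr, global_repr = model.predict(X, batch_size = 1)
# local_repr has shape (1, seq_len, d_local); global_repr has shape (1, d_global)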


xinformatics avatar xinformatics commented on July 24, 2024

@nadavbra, my FASTA file contains 964 sequences. When I followed the steps you suggested, I got the following error

at the step X = input_encoder.encode_X(seqs, 512)

TypeError Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
in ()
----> 1 X = input_encoder.encode_X(seqs, 512)

1 frames
/usr/local/lib/python3.7/dist-packages/proteinbert/model_generation.py in encode_X(self, seqs, seq_len)
149 def encode_X(self, seqs, seq_len):
150 return [
--> 151 tokenize_seqs(seqs, seq_len),
152 np.zeros((len(seqs), self.n_annotations), dtype = np.int8)
153 ]

/usr/local/lib/python3.7/dist-packages/proteinbert/model_generation.py in tokenize_seqs(seqs, seq_len)
170 def tokenize_seqs(seqs, seq_len):
171 # Note that tokenize_seq already adds <START> and <END> tokens.
--> 172 return np.array([seq_tokens + (seq_len - len(seq_tokens)) * [additional_token_to_index['<PAD>']] for seq_tokens in map(tokenize_seq, seqs)], dtype = np.int32)
173
174 def clear_session():

ValueError: setting an array element with a sequence.


nadavbra avatar nadavbra commented on July 24, 2024

@xinformatics Notice that 'seqs' needs to be a list of strings and nothing else (e.g. you need to convert Biopython's Seq objects into strings).


xinformatics avatar xinformatics commented on July 24, 2024

@nadavbra, yes, I ensured that. They are not Biopython Seq objects; I converted them to strings before appending them to the list.


nadavbra avatar nadavbra commented on July 24, 2024

@xinformatics This is strange...
Could you provide the full code you are running? (And if you are using data files such as FASTA, could you provide them as well?)


xinformatics avatar xinformatics commented on July 24, 2024

@nadavbra, The Colab notebook is available at Colab Notebook

and the fasta file is available at Fasta File


xinformatics avatar xinformatics commented on July 24, 2024

Hi @LDCS96, could you please share how you were able to run the model on a FASTA file and extract embeddings?


xinformatics avatar xinformatics commented on July 24, 2024

Thank you for your help, it worked finally. :)


xinformatics avatar xinformatics commented on July 24, 2024

Hi @nadavbra, could you share why the global representations have dimensions of (num_seqs, 15599)? Usually, FAIR's ESM model outputs a dimension of 1280 and Rostlab's ProtBert model outputs a dimension of 1024.


xinformatics avatar xinformatics commented on July 24, 2024

Thank you @nadavbra and @ddofer


dongli96 avatar dongli96 commented on July 24, 2024

Can I ask why the 0s padded to the end of each sequence are embedded as different 1562-dimensional vectors? Shouldn't these 0s be meaningless symbols that are either ignored or at least embedded as the same vector?


nadavbra avatar nadavbra commented on July 24, 2024

@LDCS96 The embeddings also depend on the context of each token, and not just on the token itself (otherwise they wouldn't be very useful...). Anyway, I suppose that in most cases you'd want to ignore the embeddings at the protein's padding.
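One way to do that (a sketch of my own; it assumes 'seqs', 'seq_len' and 'local_representations' from the earlier snippet, and that each encoded sequence occupies len(seq) + 2 positions because of the added start/end tokens):

import numpy as np

lengths = np.array([len(seq) + 2 for seq in seqs])      # occupied positions per sequence (assumption)
mask = np.arange(seq_len)[None, :] < lengths[:, None]   # (n, seq_len), True at non-padding positions
masked_local = local_representations * mask[:, :, None] # zero out embeddings at padding
mean_local = masked_local.sum(axis = 1) / mask.sum(axis = 1, keepdims = True)  # per-protein mean over real positions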


Nadam0707 avatar Nadam0707 commented on July 24, 2024

Excellent work, Nadav and colleagues, and a very productive discussion. I want to be sure I am not missing something here. Part of what I want to do is use ProteinBert as a service, i.e., get the embeddings of a set of input proteins and use them as input to our downstream CNN block, as a replacement for one-hot encoding.
I followed the instructions above; below is the code I implemented, using a protein dataset stored in JSON format.

seq_len=1000
batch_size=256

fpath='/content/drive/kiba/'

proteins = json.load(open(fpath+"proteins.txt"), object_pairs_hook=OrderedDict)
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(proteins, seq_len)
local_representations, global_representations= model.predict(X, batch_size)

######################## Output below ############

local_representations (229, 1000, 1562)

############# global_representations (229, 15599)
Is it correct that the model's embedding, which I will use as input to the CNN block as a replacement for the embedding layer, is given by the global_representations?
Also, I tried the following, but it gave no error message; the run just stopped.
It seems to me that the code below should work properly. BTW, I tried seq_len + 2, but that did not help.
Thanks for your time.

seqs = []
for t in proteins.keys():
    seqs.append(str(proteins[t]))

X = input_encoder.encode_X(seqs, seq_len)


nadavbra avatar nadavbra commented on July 24, 2024

@Nadam0707
If you are using a CNN, then it would make sense to feed it the local representations, not the global ones (which would be a better fit for dense layers).
What error are you getting? I'll need to see a full stack trace to be able to help.
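To illustrate the first point (a sketch of my own, not code from this thread; the shapes are the ones reported above, and the architecture is just a placeholder):

import tensorflow as tf

seq_len, d_local = 1000, 1562  # assumed from the shapes discussed above
inputs = tf.keras.Input(shape = (seq_len, d_local))
x = tf.keras.layers.Conv1D(64, kernel_size = 9, activation = 'relu')(inputs)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(1)(x)
cnn_head = tf.keras.Model(inputs, outputs)
cnn_head.compile(optimizer = 'adam', loss = 'mse')
# Train on the precomputed embeddings (y is your target vector):
# cnn_head.fit(local_representations, y, batch_size = 32, epochs = 10)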


Nadam0707 avatar Nadam0707 commented on July 24, 2024

Here is the code I ran:

seq_len = 1000
batch_size = 256
proteins = json.load(open(fpath + "proteins.txt"), object_pairs_hook = OrderedDict)
print("#### proteins ####", "\n", proteins, "\n")
seqs = []
for t in proteins.keys():
    seqs.append(str(proteins[t]))
print("#### len(seqs),seqs[0] ####", "\n", len(seqs), "\n", seqs[0], "\n")
X = input_encoder.encode_X(seqs, seq_len + 2)
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
local_representations, global_representations = model.predict(X, batch_size)

#local_representations, global_representations= model.predict(X, batch_size)
#local_representations, global_representations= model.predict(XT, batch_size)
print("### local_representations",local_representations.shape,"#####","\n", local_representations,"\n")
print("############# global_representations", global_representations.shape, "\n")
#print("#################### X","\n", X, "###########################", "\n")
print("###################################################################################################")

The output and the error are below. Thanks.

proteins

OrderedDict([('O00141', 'MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNSYACKHPEVQSILKISQPQEPELMNANPSPPPSPSQQINLGPSSNPHAKPSDFHFLKVIGKGSFGKVLLARHKAEEVFYAVKVLQKKAILKKKEEKHIMSERNVLLKNVKHPFLVGLHFSFQTADKLYFVLDYINGGELFYHLQRERCFLEPRARFYAAEIASALGYLHSLNIVYRDLKPENILLDSQGHIVLTDFGLCKENIEHNSTTSTFCGTPEYLAPEVLHKQPYDRTVDWWCLGAVLYEMLYGLPPFYSRNTAEMYDNILNKPLQLKPNITNSARHLLEGLLQKDRTKRLGAKDDFMEIKSHVFFSLINWDDLINKKITPPFNPNVSGPNDLRHFDPEFTEEPVPNSIGKSPDSVLVTASVKEAAEAFLGFSYAPPTDSFL'), ('O00311', .....])

len(seqs),seqs[0]

229
MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNSYACKHPEVQSILKISQPQEPELMNANPSPPPSPSQQINLGPSSNPHAKPSDFHFLKVIGKGSFGKVLLARHKAEEVFYAVKVLQKKAILKKKEEKHIMSERNVLLKNVKHPFLVGLHFSFQTADKLYFVLDYINGGELFYHLQRERCFLEPRARFYAAEIASALGYLHSLNIVYRDLKPENILLDSQGHIVLTDFGLCKENIEHNSTTSTFCGTPEYLAPEVLHKQPYDRTVDWWCLGAVLYEMLYGLPPFYSRNTAEMYDNILNKPLQLKPNITNSARHLLEGLLQKDRTKRLGAKDDFMEIKSHVFFSLINWDDLINKKITPPFNPNVSGPNDLRHFDPEFTEEPVPNSIGKSPDSVLVTASVKEAAEAFLGFSYAPPTDSFL


TypeError Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
in ()
35 seqs.append(str(proteins[t]))
36 print("#### len(seqs),seqs[0] ####","\n",len(seqs),"\n", seqs[0],"\n" )
---> 37 X = input_encoder.encode_X(seqs, seq_len+2)
38 model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
39 local_representations, global_representations= model.predict(X, batch_size)

1 frames
/usr/local/lib/python3.7/dist-packages/proteinbert/model_generation.py in tokenize_seqs(seqs, seq_len)
170 def tokenize_seqs(seqs, seq_len):
171 # Note that tokenize_seq already adds <START> and <END> tokens.
--> 172 return np.array([seq_tokens + (seq_len - len(seq_tokens)) * [additional_token_to_index['<PAD>']] for seq_tokens in map(tokenize_seq, seqs)], dtype = np.int32)
173
174 def clear_session():

ValueError: setting an array element with a sequence.


nadavbra avatar nadavbra commented on July 24, 2024

@Nadam0707
Maybe one of your sequences is longer than 256 amino-acid letters?


Nadam0707 avatar Nadam0707 commented on July 24, 2024

No. It works fine with the one-hot encoding and label methods.
Also, it works and I get results with the following code. But notice that in the code below I am passing proteins (the keys and the values, not just the values as per your instructions).

pretrained_model_generator, input_encoder = load_pretrained_model("/content/drive/My Drive/protein_bert-master/",'epoch_92400_sample_23500000.pkl')
seq_len=1000
batch_size=128
proteins = json.load(open(fpath+"proteins.txt"), object_pairs_hook=OrderedDict)
#print("#### proteins ####","\n",proteins,"\n" )
print("### proteins.values(0)", proteins.values(),"\n")
#seqs=[]

#for t in proteins.keys():
#seqs.append(str(proteins[t]))

#seqs=str(proteins.values())
#print("#### len(seqs),seqs[0] ####","\n",len(seqs),"\n", seqs[0],"\n" )
#X = input_encoder.encode_X(seqs, seq_len+2)

X = input_encoder.encode_X(proteins, seq_len+2)
print("#### len(X),X[0] ####","\n",len(X),"\n", X[0],"\n" )

model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len+2))
local_representations, global_representations= model.predict(X, batch_size)
print("### local_representations",local_representations.shape,"#####","\n", local_representations,"\n"
, "########## global_representations", global_representations.shape, "\n")
------------ Output ------------

len(X),X[0]

2
[[23 22 22 ... 25 25 25]
[23 22 22 ... 25 25 25]
[23 22 22 ... 25 25 25]
...
[23 13 22 ... 25 25 25]
[23 13 22 ... 25 25 25]
[23 13 22 ... 25 25 25]]

/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adam.py:105: UserWarning: The lr argument is deprecated, use learning_rate instead.
super(Adam, self).__init__(name, **kwargs)

local_representations (229, 1002, 1562)

[[[ 2.07034871e-02 1.17638186e-02 -1.00728534e-01 ... 1.00000000e+00
6.24292718e-12 6.53283466e-18]
[ 3.29996608e-02 8.29180470e-04 -7.95217976e-02 ... 6.07004985e-02
3.97656159e-03 1.28602295e-09]
[ 6.28490970e-02 6.93527237e-03 -7.52108544e-02 ... 7.58444937e-03
1.67014319e-02 1.18731724e-09]
...

And when I use the local representations as input to the CNN block, I receive the following error:
ValueError: Found unexpected instance while processing input tensors for keras functional model. Expecting KerasTensor which is from tf.keras.Input() or output from keras layer __call__(). Got: [[[ 2.07034871e-02 1.17638186e-02 -1.00728534e-01 ... 1.00000000e+00
6.29207103e-12 6.57054872e-18]


Nadam0707 avatar Nadam0707 commented on July 24, 2024

I duplicated your code using the same protein file you used (964_seqs.fasta) and encountered the same error I got with my protein data (ValueError: setting an array element with a sequence.); see below.

pretrained_model_generator, input_encoder = load_pretrained_model("/content/drive/My Drive/protein_bert-master/",'epoch_92400_sample_23500000.pkl')
seq_len=1000
batch_size=128

seqs = []
id = []

for record in SeqIO.parse("/content/drive/My Drive/DeepDTA/DeepDTA-master/source/data/kiba/964_seqs.fasta", "fasta"):
#print(record.seq)
id.append(record.id)
seqs.append(str(record.seq))

print("#### len(seqs),seqs[0] ####","\n",len(seqs),"\n", seqs[0],"\n" )
X = input_encoder.encode_X(seqs, seq_len+2)
print("#### len(X),X[0] ####","\n",len(X),"\n", X[0],"\n" )

model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len+2))
local_representations, global_representations= model.predict(X, batch_size)
print("### local_representations",local_representations.shape,"#####","\n", local_representations,"\n"
, "########## global_representations", global_representations.shape, "\n")
#print("#################### X","\n", X, "###########################", "\n")
print("###################################################################################################")
---------------------------------------- output ------------------------------------

len(seqs),seqs[0]

964
MKSVHSSPQNTSHTIMTFYPTMEEFADFNTYVAYMESQGAHQAGLAKVIPPKEWKARQMYDDIEDILIATPLQQVTSGQGGVFTQYHKKKKAMRVGQYRRLANSKKYQTPPHQNFADLEQRYWKSHPGNPPIYGADISGSLFEESTKQWNLGHLGTILDLLEQECGVVIEGVNTPYLYFGMWKTTFAWHTEDMDLYSINYLHFGEPKTWYVVPPEHGQHLERLARELFPDISRGCEAFLRHKVALISPTVLKENGIPFNCMTQEAGEFMVTFPYGYHAGFNHGFNCAEAINFATPRWIDYGKMASQCSCGESTVTFSMDPFVRIVQPESYELWKHRQDLAIVEHTEPRVAESQELSNWRDDIVLRRAALGLRLLPNLTAQCPTQPVSSGHCYNPKGCGTDAVPGSAFQSSAYHTQTQSLTLGMSARVLLPSTGSWGSGRGRGRGQGQGRGCSRGRGHGCCTRELGTEEPTVQPASKRRLLMGTRSRAQGHRPQLPLANDLMTNLSL


TypeError Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
in ()
14
15 print("#### len(seqs),seqs[0] ####","\n",len(seqs),"\n", seqs[0],"\n" )
---> 16 X = input_encoder.encode_X(seqs, seq_len+2)
17 print("#### len(X),X[0] ####","\n",len(X),"\n", X[0],"\n" )
18

1 frames
/usr/local/lib/python3.7/dist-packages/proteinbert/model_generation.py in tokenize_seqs(seqs, seq_len)
170 def tokenize_seqs(seqs, seq_len):
171 # Note that tokenize_seq already adds <START> and <END> tokens.
--> 172 return np.array([seq_tokens + (seq_len - len(seq_tokens)) * [additional_token_to_index['<PAD>']] for seq_tokens in map(tokenize_seq, seqs)], dtype = np.int32)
173
174 def clear_session():

ValueError: setting an array element with a sequence.


nadavbra avatar nadavbra commented on July 24, 2024

@Nadam0707
The embeddings you get from ProteinBERT seem to be in order.
If you are having trouble feeding them into a CNN, I think you will get better answers on a general Keras/TensorFlow forum.


Nadam0707 avatar Nadam0707 commented on July 24, 2024

The above code, which you provided and I duplicated using the same protein dataset as input, resulted in an error, as you can see above!! It does not produce an embedding!
The embedding I do get is the result of passing the labels and the values of each protein record as input... That should not be the case; the input should be only the values, as strings, right?
Is it possible to share your code that works with no errors? It cannot be more than 5-6 statements anyway. Every version you provided above in this post had some error.


nadavbra avatar nadavbra commented on July 24, 2024

@Nadam0707
I'm a bit confused. As you indicated above you appear to have been able to get a proper local_representations output as a numpy array of shape (229, 1002, 1562). Why don't you consider it a successful use of ProteinBERT?


Nadam0707 avatar Nadam0707 commented on July 24, 2024

I included above the code that resulted in an array of shape (228, 102, 1562). But this result is obtained by passing the entire JSON record (key and values) of each protein as input: X = input_encoder.encode_X(proteins, seq_len+2).
However, your code and instructions below generate an error, as I showed above. Is there an explanation for that?
Also, what makes me suspicious of the result with shape (228, 102, 1562) is that when I use this array as input to the CNN block instead of the one-hot encoding, I encounter an error.
The real success, in my opinion, is to demonstrate that using the output of your model as input to a CNN results in better predictions than one-hot encoding, and this is what I am trying to show!
"ValueError: Found unexpected instance while processing input tensors for keras functional model. Expecting KerasTensor which is from tf.keras.Input() or output from keras layer __call__(). Got: [[[ 2.07034871e-02 1.17638186e-02 -1.00728534e-01 ... 1.00000000e+00
6.29207103e-12 6.57054872e-18]"

for record in SeqIO.parse("/content/drive/My Drive/964_seqs.fasta", "fast"):
id.append(record.id)
seqs.append(str(record.seq))
X = input_encoder.encode_X(seqs, seq_len+2)


nadavbra avatar nadavbra commented on July 24, 2024

@Nadam0707
(228, 102, 1562) should be a valid input shape for a CNN.
About your code that fails, what's the longest sequence among the 964 sequences?


Nadam0707 avatar Nadam0707 commented on July 24, 2024

I just used the same code and the same proteins you have in your earlier post. Here it is:

for record in SeqIO.parse("/content/drive/My Drive/964_seqs.fasta", "fast"):
id.append(record.id)
seqs.append(str(record.seq))
X = input_encoder.encode_X(seqs, seq_len+2)

--- error
TypeError Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
in ()
14
15 print("#### len(seqs),seqs[0] ####","\n",len(seqs),"\n", seqs[0],"\n" )
---> 16 X = input_encoder.encode_X(seqs, seq_len+2)
17 print("#### len(X),X[0] ####","\n",len(X),"\n", X[0],"\n" )
18

1 frames
/usr/local/lib/python3.7/dist-packages/proteinbert/model_generation.py in tokenize_seqs(seqs, seq_len)
170 def tokenize_seqs(seqs, seq_len):
171 # Note that tokenize_seq already adds <START> and <END> tokens.
--> 172 return np.array([seq_tokens + (seq_len - len(seq_tokens)) * [additional_token_to_index['<PAD>']] for seq_tokens in map(tokenize_seq, seqs)], dtype = np.int32)
173
174 def clear_session():

ValueError: setting an array element with a sequence.


nadavbra avatar nadavbra commented on July 24, 2024

@Nadam0707 So what's the length of the longest sequence?


Nadam0707 avatar Nadam0707 commented on July 24, 2024

Please have a look at your message of July 25, 2021, where you provided the following code, which I duplicated and used the SAME FASTA File (964_seqs) you provided in your July 28, 2021 post.
I assume you did test it and generated the correct results, that is, generated an embedding from input proteins and used this embedding as input to a CNN or other DL model as a replacement for one-hot encoding or for the embedding layer of the network.

If so, it might be more productive if you would kindly provide a copy of working code that has been tested and fully executed with no errors. The one you included here has errors. Please try it yourself: https://colab.research.google.com/drive/1aTEfqp4iVecglkFfPujWWVjL8kuGuXlZ?usp=sharing#scrollTo=LqCI5uzrIWDx

Also, if there are restrictions on the length of the proteins, it would be useful to let us know about that.

------------ Your (Nadav) July 25, 2021 message ----
“You have to use ProteinBert's Python interface.
You can parse the FASTA file with Biopython, and use ProteinBert's pretrained model to get the embeddings of each sequence (similar to how it is finetuned in our demo):

# After parsing the sequences from the FASTA file into 'seqs' and choosing 'seq_len' (e.g. 512) and 'batch_size' (e.g. 32)

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

for record in SeqIO.parse("/964_seqs.fasta", "fasta"):
    #print(record.seq)
    id.append(record.id)
    seqs.append(str(record.seq))

X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(X, batch_size = batch_size)”


xixinhy avatar xixinhy commented on July 24, 2024

Solved. I misunderstood the problem: it came from RAM rather than GPU memory. Anyway, thanks for the reply.

I also ran into the same problem, an out-of-memory error. How did you solve it?


dongli96 avatar dongli96 commented on July 24, 2024

@xixinhy Out of which memory? GPU memory or RAM? You might try with a smaller batch size.
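If RAM is the issue, one workaround (a sketch of my own, assuming 'seqs', 'seq_len', 'input_encoder' and 'model' as in the snippet earlier in the thread) is to predict in chunks and keep only what you need:

import numpy as np

chunk_size = 64  # assumed; tune to your memory budget
global_chunks = []
for i in range(0, len(seqs), chunk_size):
    X_chunk = input_encoder.encode_X(seqs[i:i + chunk_size], seq_len)
    _, global_chunk = model.predict(X_chunk, batch_size = 32)  # discard the large local representations
    global_chunks.append(global_chunk)
global_representations = np.concatenate(global_chunks, axis = 0)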


wangze09 avatar wangze09 commented on July 24, 2024

Hi @nadavbra, could you share why the global representations have dimensions of (num_seqs, 15599)? Usually, FAIR's ESM model outputs a dimension of 1280 and Rostlab's ProtBert model outputs a dimension of 1024.

Hello guys @Nadam0707 @xinformatics
I also ran into this question when I tried to extract protein features from the embeddings for machine-learning jobs. I just wonder which is better to choose as protein features: the local representations or the global representations? I was also confused by the numbers 15599 and 1562; do they have something in common with d_local = 128 and d_global = 512?
Sorry, I'm new to deep learning. Thanks for your time!


ddofer avatar ddofer commented on July 24, 2024

I would try either, or both at once (with mean or max pooling of the local representations). If you have a sequence-level task, start with the global representations; if it's a local-level task (e.g. predicting PTMs per position), use the local ones.
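For instance (a sketch, assuming 'local_representations' and 'global_representations' were computed as earlier in the thread):

import numpy as np

pooled_local = local_representations.mean(axis = 1)   # (n, 1562) mean-pooled per-residue features
# or: pooled_local = local_representations.max(axis = 1)
combined = np.concatenate([pooled_local, global_representations], axis = 1)  # (n, 1562 + 15599)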


nadavbra avatar nadavbra commented on July 24, 2024

@wangze09 The global representation is indeed a much higher dimension (especially when concatenating all layers). The ESM dimension of 1280 is a whole different story because it is 1) per position (i.e. local, not global), and 2) it is only for the last hidden layer (not all layers combined).


wangze09 avatar wangze09 commented on July 24, 2024

@wangze09 The global representation is indeed a much higher dimension (especially when concatenating all layers). The ESM dimension of 1280 is a whole different story because it is 1) per position (i.e. local, not global), and 2) it is only for the last hidden layer (not all layers combined).

Okay, I got that. Thanks for your kind explanations, have a nice weekend!


wangze09 avatar wangze09 commented on July 24, 2024

I would try either, or both at once (with mean or max pooling of the local representations). If you have a sequence-level task, start with the global representations; if it's a local-level task (e.g. predicting PTMs per position), use the local ones.

Okay, I will have a try. Thanks for your kind explanations, have a nice weekend!


xinformatics avatar xinformatics commented on July 24, 2024

Hey @wangze09, I had the same question when I started working on the embeddings. I ended up using ESM-1v, and it was sufficient for my objective.

That said, protein_bert's global embeddings are helpful. I didn't use them because my dataset was large and I was unable to load the (batch, 15599)-dimensional matrix into memory. Hope this helps.


wangze09 avatar wangze09 commented on July 24, 2024

Hey @wangze09, I had the same question when I started working on the embeddings. I ended up using ESM-1v, and it was sufficient for my objective.

That said, protein_bert's global embeddings are helpful. I didn't use them because my dataset was large and I was unable to load the (batch, 15599)-dimensional matrix into memory. Hope this helps.

Thanks for your advice, that really helps me. Have a nice weekend!


sjr1405 avatar sjr1405 commented on July 24, 2024

Hi, could you please tell me how I can train ProteinBERT on another FASTA file?


ddofer avatar ddofer commented on July 24, 2024

@sjr1405 Look at the example notebooks. Load the protein sequences (and labels) from the FASTA file and another file, then train on them following the notebook examples.
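Very roughly, the fine-tuning flow in the demo notebook looks like the sketch below (my own condensation; 'train_seqs', 'train_labels', 'valid_seqs' and 'valid_labels' are placeholder lists you would load from your FASTA/label files, and the hyperparameters are the demo's defaults as I recall them, so double-check against the notebook):

from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model, finetune
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# A binary sequence-level task; adjust OutputType/OutputSpec to your labels.
output_type = OutputType(False, 'binary')
output_spec = OutputSpec(output_type, [0, 1])

pretrained_model_generator, input_encoder = load_pretrained_model()
model_generator = FinetuningModelGenerator(pretrained_model_generator, output_spec,
        pretraining_model_manipulation_function = get_model_with_hidden_layers_as_outputs, dropout_rate = 0.5)

finetune(model_generator, input_encoder, output_spec,
        train_seqs, train_labels, valid_seqs, valid_labels,
        seq_len = 512, batch_size = 32, max_epochs_per_stage = 40, lr = 1e-04,
        begin_with_frozen_pretrained_layers = True, lr_with_frozen_pretrained_layers = 1e-02,
        n_final_epochs = 1, final_seq_len = 1024, final_lr = 1e-05, callbacks = [])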


mat10d avatar mat10d commented on July 24, 2024

Just a quick question about global vs. local embeddings, given that this has been heavily touched on in this thread. I am working in a low-n setting, with, say, close to 100 proteins, where I am training a small top-layer model on some PLM embeddings, such as the mean embeddings from ESM.

Therefore, I would prefer to have a single vector embedding for each protein that I am working with. Would it be best to work with global embeddings, or some transformation of local embeddings? Do the global embeddings retain information from each layer in protein_bert?

In reading the paper, I notice that for predicting attributes of proteins, like fluorescence in the Sarkisyan dataset, you used global embeddings. But in that case, you had over 20 thousand sequences to work with. What would you suggest? I can also compare global embeddings vs. some transformation of local embeddings, but I'm not sure what transformation to use in this case: should I average over positions to get a 1562-dimensional embedding, or should I just take some specific layer of this two-dimensional embedding?


nadavbra avatar nadavbra commented on July 24, 2024

Is the thing you are trying to predict a global property of the protein, or something that's different per position? If it's global I'd use the global embeddings.


Sherman-1 avatar Sherman-1 commented on July 24, 2024

Hi, I hope my question fits in this discussion. I'm trying to extract, for a given set of proteins, the per-position embeddings of every protein.
My final goal is to construct a graph where each node represents an amino acid and is labelled with features for that specific amino acid.
Two options: compute generic features for each category of amino acid and label each node according to those physico-chemical features, or compute per-position features for every protein using ProteinBERT.

In this setup, would the hypothetical peptide ABCDA have different features for the first and the last "A" amino acid?

Thank you for your amazing work and dedication to answering our issues!

