Comments (10)
I will say again that it sounds like you should use the full embeddings of ProteinBERT (`local_representations` and `global_representations` in your above code) rather than just the per-amino-acid embeddings.
If you insist on using the amino-acid embeddings, you can extract them along these lines:

```python
embedding_layer = model.get_layer('embedding-seq-input')
aa_embeddings = embedding_layer.get_weights()[0]
```

`aa_embeddings` should be of shape `(vocab_size, d_hidden_seq)`.
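For completeness, a minimal end-to-end sketch (model loading as in the repo's demo; the `seq_len` of 512 is an arbitrary choice here, since the token-embedding matrix does not depend on it):

```python
from proteinbert import load_pretrained_model

# Load the pretrained model generator (downloads the default dump if missing).
pretrained_model_generator, input_encoder = load_pretrained_model()
model = pretrained_model_generator.create_model(512)

# The token-embedding matrix is the first weight of the 'embedding-seq-input' layer.
embedding_layer = model.get_layer('embedding-seq-input')
aa_embeddings = embedding_layer.get_weights()[0]
print(aa_embeddings.shape)  # (vocab_size, d_hidden_seq), reported as (26, 128) below
```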
@thnhan Why do you want per-aa embeddings rather than embeddings for the entire protein sequences? ProteinBERT was not designed to embed individual amino acids.
If you really are interested in ProteinBERT amino-acid embeddings, I would try to get them from the first layer of the model (of type `keras.layers.Embedding`).
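A minimal sketch of that approach, assuming the model is built with `tensorflow.keras` and that the token embedding is its only `Embedding` layer:

```python
from tensorflow import keras

# Grab the first Embedding layer by type rather than by name.
embedding_layer = next(layer for layer in model.layers
                       if isinstance(layer, keras.layers.Embedding))
aa_embeddings = embedding_layer.get_weights()[0]
```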
@nadavbra Thank you for your help.
I want per-AA embeddings because my model was designed to use amino-acid embeddings through an Embedding layer (a Keras layer of my model). After this Embedding layer, an entire fixed-length protein sequence can be converted to embeddings and fed to MLPs. My model will need to further adjust the weights of that layer.
In practice, I also used the `global_representations` for my model, but to use them I removed that layer from my model. I used the code below, following your GitHub demo:
```python
# Imports as in the repo's demo notebook.
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model(
    local_model_dump_dir=r"D:\protein_bert\protein_bert",
    local_model_dump_file_name='default.pkl',
    download_model_dump_if_not_exists=False)

protbert_model = get_model_with_hidden_layers_as_outputs(
    pretrained_model_generator.create_model(fixlen))

encoded_x = input_encoder.encode_X(protseq, fixlen)
local_representations, global_representations = protbert_model.predict(encoded_x, batch_size=64)
```
Would you help me get the per-AA embeddings? Or could you show me the code to get them?
Thank you so much!
- Question: are you looking for the embeddings of each individual AA (i.e., just extract them from the embedding layer of the model), or the transformer's contextual embeddings of each position? (For the latter, look at the position-specific classification examples, e.g. secondary structure.)
- But I agree with Nadav: naively, it sounds like you may just want to use the full embeddings. The vocab/tokenizer is per-AA, not multi-character, so it's small and easy to learn. (The positional and contextual embeddings are where the real meat is.)
Nadav's solution will get you the per-AA embedding matrix.
Thank you so much, @nadavbra and @ddofer,
I am attempting to get the embedding of an entire protein sequence from the embeddings of the amino acids in that sequence. I think your solution will help me a lot. My code involving `local_representations` and `global_representations` is an example of how I used your model for my work.
I also consulted two additional threads, "GO Annotation Vector #38" and "What to do with the local_representations and global_representations #6", to understand how to use the representations.
Kindly, could I ask you two more questions?

- Would you tell me which row of the `(vocab_size, d_hidden_seq)` matrix each amino acid corresponds to? Or can I find this information in your paper?
- The `global_representations` vector that I am getting is of size 15599, so how do I get the 8943-dimensional `global_representations` vector?
Thank you!
- In the `tokenization` submodule you can use the `token_to_index` and `index_to_token` dictionaries that map between tokens and token indices (rows in the token embedding matrix).
- `global_representations` is of dimension 15599 because you used `get_model_with_hidden_layers_as_outputs`. If you don't use it, you'll only get features from the output layer.
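A short sketch of that mapping, assuming `aa_embeddings` was extracted as above and that `index_to_token` is keyed by integer indices:

```python
from proteinbert.tokenization import index_to_token, token_to_index

# Row i of aa_embeddings is the embedding of the token index_to_token[i].
for i in sorted(index_to_token):
    print(i, index_to_token[i])

# Look up the embedding row of a specific amino acid, e.g. alanine:
alanine_vector = aa_embeddings[token_to_index['A']]
```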
Thank you @nadavbra,
following your solution, I have gotten the `aa_embeddings` matrix of size (26, 128) and `global_representations` of dimension 8943.

- I used the command `model = pretrained_model_generator.create_model(seq_len)` (without `get_model_with_hidden_layers_as_outputs`) to get the 8943-dimensional `global_representations`.
- In the token sequence of the protein sequence, I removed its first and last tokens because they are not amino-acid tokens (a sketch of this follows below).
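A sketch of what I did, assuming `encode_X` returns `[token_ids, annotations]` and that the tokenizer adds one special (non-AA) token at each end of the sequence:

```python
# Per-AA embedding rows for one encoded sequence.
token_ids, _ = encoded_x
aa_ids = token_ids[0][1:-1]              # drop the first/last special tokens
per_aa_vectors = aa_embeddings[aa_ids]   # shape: (seq_len - 2, 128)
# (If the sequence is shorter than seq_len, padding positions would also need handling.)
```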
Is my usage above exactly as you suggest?
Thank you!
Yes, that seems about right.
I will say that I expect the higher-dimensional `global_representations` (with `get_model_with_hidden_layers_as_outputs`) to be more useful for most tasks.
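The two variants side by side, a sketch using the variables from the code earlier in this thread (import path as in the repo's demo):

```python
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Output layer only: 8943-dimensional global representations.
base_model = pretrained_model_generator.create_model(fixlen)
_, global_8943 = base_model.predict(encoded_x, batch_size=64)

# With hidden-layer outputs concatenated: 15599-dimensional global representations.
rich_model = get_model_with_hidden_layers_as_outputs(
    pretrained_model_generator.create_model(fixlen))
_, global_15599 = rich_model.predict(encoded_x, batch_size=64)
```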
Thank you so much, I'm glad to have your help.
I will try to use both types of `global_representations`.