
Comments (10)

nadavbra commented on June 29, 2024

I will say again that it sounds like you should use the full embeddings of ProteinBERT (local_representations and global_representations in your above code) rather than just the per-amino-acid embeddings.

If you insist on using the amino-acid embeddings, you can extract them along these lines:

embedding_layer = model.get_layer('embedding-seq-input')
aa_embeddings = embedding_layer.get_weights()[0]

aa_embeddings should be of shape (vocab_size, d_hidden_seq).
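For context, a quick sanity check on the extracted matrix (assuming model above is a ProteinBERT model created by the pretrained model generator; the concrete numbers are the ones reported later in this thread):

vocab_size, d_hidden_seq = aa_embeddings.shape  # e.g. (26, 128)
print(vocab_size, d_hidden_seq)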


nadavbra commented on June 29, 2024

@thnhan Why do you want per-aa embeddings rather than embeddings for the entire protein sequences? ProteinBERT was not designed to embed individual amino acids.
If you really are interested in ProteinBERT amino-acid embeddings, I would try to get them from the first layer of the model (of type keras.layers.Embedding).
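A minimal sketch of that suggestion, assuming model is the Keras model returned by pretrained_model_generator.create_model(...): locate the token-embedding layer by type rather than by name.

from tensorflow import keras

# Pick out the first Embedding layer of the ProteinBERT model
embedding_layers = [layer for layer in model.layers
                    if isinstance(layer, keras.layers.Embedding)]
aa_embedding_matrix = embedding_layers[0].get_weights()[0]
print(aa_embedding_matrix.shape)  # (vocab_size, d_hidden_seq)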


thnhan commented on June 29, 2024

@nadavbra Thank you for your help.
I want per-aa embeddings because my model was designed to use amino-acid embeddings through an Embedding layer (a Keras layer of my model).
After this Embedding layer, an entire fixed-length protein sequence is converted to embeddings and fed to MLPs. My model needs to further adjust (fine-tune) the weights of that layer during training.
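A minimal sketch of that setup (layer sizes and fixlen are hypothetical): a Keras Embedding layer initialised from ProteinBERT's per-amino-acid matrix (aa_embeddings from the snippet above) and left trainable so its weights can still be adjusted.

from tensorflow import keras

fixlen = 512                               # hypothetical fixed sequence length
vocab_size, d_embed = aa_embeddings.shape  # e.g. (26, 128)

my_model = keras.Sequential([
    # Initialise from ProteinBERT's per-AA embeddings; keep trainable for further adjustment
    keras.layers.Embedding(vocab_size, d_embed,
                           embeddings_initializer=keras.initializers.Constant(aa_embeddings),
                           input_length=fixlen, trainable=True),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation='relu'),   # MLP on the flattened embeddings
    keras.layers.Dense(1, activation='sigmoid'),
])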

In practice, I have also used the global_representations for my model, but to use them I had to remove that Embedding layer from my model. I used the code below, which follows your GitHub demo.

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Load the pretrained ProteinBERT model from a local dump
pretrained_model_generator, input_encoder = load_pretrained_model(
    local_model_dump_dir=r"D:\protein_bert\protein_bert",
    local_model_dump_file_name='default.pkl',
    download_model_dump_if_not_exists=False)
protbert_model = get_model_with_hidden_layers_as_outputs(
    pretrained_model_generator.create_model(fixlen))

# Encode the protein sequences (protseq) to length fixlen and extract representations
encoded_x = input_encoder.encode_X(protseq, fixlen)
local_representations, global_representations = protbert_model.predict(encoded_x, batch_size=64)

Would you help me get per-aa embeddings? Or could you show me the code to get them?
Thank you so much!


ddofer commented on June 29, 2024
  • Question: are you looking for the embeddings of each individual AA (i.e., just extracted from the embedding layer of the model), or the transformer model's embeddings of each position? (For the latter, look at the position-specific classification examples, e.g. secondary structure, and the sketch below.)
    But I agree with Nadav: naively, it sounds like you may just want to use the full embeddings. The vocabulary/tokenizer is per AA, not multi-character, so it is small and easy to learn. (The positional and contextual embeddings are where the real meat is.)
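If it is the transformer's per-position embeddings you want, a minimal sketch reusing local_representations from the snippet earlier in this thread (index values are illustrative):

# local_representations has shape (n_proteins, fixlen, d); each row is the
# context-dependent embedding of one position (including the <START>/<END> ends).
contextual_vectors = local_representations[0]     # all positions of the first protein
fifth_position_vec = local_representations[0, 5, :]  # one specific position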


ddofer commented on June 29, 2024

Nadav's solution will get you the per AA embedding matrix.


thnhan commented on June 29, 2024

I will say again that it sounds like you should use the full embeddings of ProteinBERT (local_representations and global_representations in your above code) rather than just the per-amino-acid embeddings.

If you insist on using the amino-acid embeddings, you can extract them along these lines:

embedding_layer = model.get_layer('embedding-seq-input')
aa_embeddings = embedding_layer.get_weights()[0]

aa_embeddings should be of shape (vocab_size, d_hidden_seq).

Thank you so much, @nadavbra and @ddofer.
I am attempting to get the embedding of an entire protein sequence from the embeddings of the amino acids in that sequence. I think your solution will help me a lot. My code with local_representations and global_representations is an example of how I used your model for my work.

I also consulted two additional threads, "GO Annotation Vector #38" and "What to do with the local_representations and global_representations #6", to understand how to use the representations.
Could I kindly ask you two more questions?

  1. Could you tell me which row of this (vocab_size, d_hidden_seq) matrix each amino acid corresponds to, in order? Or can I find this information in your paper?
  2. The global_representations vector I am getting is of size 15599, so how do I get the 8943-dimensional global_representations vector?

Thank you!


nadavbra commented on June 29, 2024
  1. In the tokenization submodule you can use the token_to_index and index_to_token dictionaries, which map between tokens and token indices (rows of the token embedding matrix); see the sketch after this list.
  2. global_representations is of dimension 15599 because you used get_model_with_hidden_layers_as_outputs. If you don't use it, you'll only get the features of the output layer.
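A minimal sketch of point 1, assuming token_to_index and index_to_token can be imported from the tokenization submodule as plain dictionaries:

from proteinbert.tokenization import token_to_index, index_to_token

row = token_to_index['A']                # row of alanine in the token embedding matrix
alanine_embedding = aa_embeddings[row]   # aa_embeddings from the snippet above
print(index_to_token[row])               # 'A'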


thnhan commented on June 29, 2024
  1. In the tokenization submodule you can use the token_to_index and index_to_token dictionaries, which map between tokens and token indices (rows of the token embedding matrix).
  2. global_representations is of dimension 15599 because you used get_model_with_hidden_layers_as_outputs. If you don't use it, you'll only get the features of the output layer.

Thank you @nadavbra.
Following your solution, I have obtained the aa_embeddings matrix of shape (26, 128) and a global_representations vector of dimension 8943.

  1. For global_representations I used the command model = pretrained_model_generator.create_model(seq_len).
  2. From the token sequence of each protein, I removed the first and last tokens because they do not correspond to amino acids (see the sketch below).
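A minimal sketch of those two steps (dimensions are the ones reported in this thread; it assumes the first element of encoded_x holds the token-id sequences):

# Step 1: the plain model; its second output is the 8943-dimensional global vector
plain_model = pretrained_model_generator.create_model(fixlen)
_, global_repr_8943 = plain_model.predict(encoded_x, batch_size=64)

# Step 2: drop the first and last tokens (<START>/<END>) before looking up aa_embeddings
token_ids = encoded_x[0]                         # shape (n_proteins, fixlen)
aa_token_ids = token_ids[:, 1:-1]                # amino-acid positions only
per_aa_embeddings = aa_embeddings[aa_token_ids]  # shape (n_proteins, fixlen - 2, 128)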

Is my above usage exactly as you suggest?

Thank you!


nadavbra commented on June 29, 2024

Yes, that seems about right.
I will say that I expect the higher-dimensional global_representations (with get_model_with_hidden_layers_as_outputs) to be more useful for most tasks.
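For contrast, the higher-dimensional variant is the same call used in the earlier snippet: wrapping the model with get_model_with_hidden_layers_as_outputs concatenates hidden-layer features into the larger global vector (15599-dimensional in this thread).

rich_model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(fixlen))
local_representations, global_repr_15599 = rich_model.predict(encoded_x, batch_size=64)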


thnhan commented on June 29, 2024

Yes, that seems about right. I will say that I expect the higher-dimensional global_representations (with get_model_with_hidden_layers_as_outputs) to be more useful for most tasks.

Thank you so much, I'm glad to have your help.
I will try to use both types of global_representations.

