Comments (10)
I will say again that it sounds like you should use the full embeddings of ProteinBERT (`local_representations` and `global_representations` in your above code) rather than just the per-amino-acid embeddings.
If you insist on using the amino-acid embeddings, you can extract them along these lines:

```python
embedding_layer = model.get_layer('embedding-seq-input')
aa_embeddings = embedding_layer.get_weights()[0]
```

`aa_embeddings` should be of shape `(vocab_size, d_hidden_seq)`.
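For completeness, a minimal end-to-end sketch (model loading as in the repo's demo; the `seq_len` of 512 is an arbitrary choice here, since the token-embedding matrix does not depend on it):

```python
from proteinbert import load_pretrained_model

# Load the pretrained model generator (downloads the default dump if missing).
pretrained_model_generator, input_encoder = load_pretrained_model()
model = pretrained_model_generator.create_model(512)

# The token-embedding matrix is the first weight of the 'embedding-seq-input' layer.
embedding_layer = model.get_layer('embedding-seq-input')
aa_embeddings = embedding_layer.get_weights()[0]
print(aa_embeddings.shape)  # (vocab_size, d_hidden_seq), reported as (26, 128) below
```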
@thnhan Why do you want per-aa embeddings rather than embeddings for the entire protein sequences? ProteinBERT was not designed to embed individual amino acids.
If you really are interested in ProteinBERT amino-acid embeddings, I would try to get them from the first layer of the model (of type `keras.layers.Embedding`).
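A minimal sketch of that approach, assuming the model is built with `tensorflow.keras` and that the token embedding is its only `Embedding` layer:

```python
from tensorflow import keras

# Grab the first Embedding layer by type rather than by name.
embedding_layer = next(layer for layer in model.layers
                       if isinstance(layer, keras.layers.Embedding))
aa_embeddings = embedding_layer.get_weights()[0]
```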
@nadavbra Thank you for your help.
I want per-AA embeddings because my model was designed to use amino-acid embeddings through an Embedding layer (a Keras layer of my model). After this Embedding layer, an entire fixed-length protein sequence can be converted to embeddings and fed to MLPs. My model will need to further adjust the weights of that layer.
In practice, I also used the `global_representations` for my model, but to use them I removed that layer from my model. I used the code below, following your GitHub demo:
```python
# Imports as in the repo's demo notebook.
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model(
    local_model_dump_dir=r"D:\protein_bert\protein_bert",
    local_model_dump_file_name='default.pkl',
    download_model_dump_if_not_exists=False)

protbert_model = get_model_with_hidden_layers_as_outputs(
    pretrained_model_generator.create_model(fixlen))

encoded_x = input_encoder.encode_X(protseq, fixlen)
local_representations, global_representations = protbert_model.predict(encoded_x, batch_size=64)
```
Would you help me get the per-AA embeddings? Or could you show me the code to get them?
Thank you so much!
- Question: are you looking for the embeddings of each individual AA (i.e., just extract them from the embedding layer of the model), or the transformer's contextual embeddings of each position? (For the latter, look at the position-specific classification examples, e.g. secondary structure.)
- But I agree with Nadav: naively, it sounds like you may just want to use the full embeddings. The vocab/tokenizer is per-AA, not multi-character, so it's small and easy to learn. (The positional and contextual embeddings are where the real meat is.)
Nadav's solution will get you the per-AA embedding matrix.
Thank you so much, @nadavbra and @ddofer,
I am attempting to get the embedding of an entire protein sequence from the embeddings of the amino acids in that sequence. I think your solution will help me a lot. My code involving `local_representations` and `global_representations` is an example of how I used your model for my work.
I also consulted two additional threads, "GO Annotation Vector #38" and "What to do with the local_representations and global_representations #6", to understand how to use the representations.
Kindly, could I ask you two more questions?

- Would you tell me which row of the `(vocab_size, d_hidden_seq)` matrix each amino acid corresponds to? Or can I find this information in your paper?
- The `global_representations` vector that I am getting is of size 15599, so how do I get the 8943-dimensional `global_representations` vector?
Thank you!
- In the `tokenization` submodule you can use the `token_to_index` and `index_to_token` dictionaries that map between tokens and token indices (rows in the token embedding matrix).
- `global_representations` is of dimension 15599 because you used `get_model_with_hidden_layers_as_outputs`. If you don't use it, you'll only get features from the output layer.
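A short sketch of that mapping, assuming `aa_embeddings` was extracted as above and that `index_to_token` is keyed by integer indices:

```python
from proteinbert.tokenization import index_to_token, token_to_index

# Row i of aa_embeddings is the embedding of the token index_to_token[i].
for i in sorted(index_to_token):
    print(i, index_to_token[i])

# Look up the embedding row of a specific amino acid, e.g. alanine:
alanine_vector = aa_embeddings[token_to_index['A']]
```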
Thank you @nadavbra,
following your solution, I have gotten the `aa_embeddings` matrix of size (26, 128) and `global_representations` of dimension 8943.

- I used the command `model = pretrained_model_generator.create_model(seq_len)` (without `get_model_with_hidden_layers_as_outputs`) to get the 8943-dimensional `global_representations`.
- In the token sequence of the protein sequence, I removed its first and last tokens because they are not amino-acid tokens (a sketch of this follows below).
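A sketch of what I did, assuming `encode_X` returns `[token_ids, annotations]` and that the tokenizer adds one special (non-AA) token at each end of the sequence:

```python
# Per-AA embedding rows for one encoded sequence.
token_ids, _ = encoded_x
aa_ids = token_ids[0][1:-1]              # drop the first/last special tokens
per_aa_vectors = aa_embeddings[aa_ids]   # shape: (seq_len - 2, 128)
# (If the sequence is shorter than seq_len, padding positions would also need handling.)
```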
Is my usage above exactly as you suggest?
Thank you!
Yes, that seems about right.
I will say that I expect the higher-dimensional `global_representations` (with `get_model_with_hidden_layers_as_outputs`) to be more useful for most tasks.
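The two variants side by side, a sketch using the variables from the code earlier in this thread (import path as in the repo's demo):

```python
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Output layer only: 8943-dimensional global representations.
base_model = pretrained_model_generator.create_model(fixlen)
_, global_8943 = base_model.predict(encoded_x, batch_size=64)

# With hidden-layer outputs concatenated: 15599-dimensional global representations.
rich_model = get_model_with_hidden_layers_as_outputs(
    pretrained_model_generator.create_model(fixlen))
_, global_15599 = rich_model.predict(encoded_x, batch_size=64)
```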
Thank you so much, I'm glad to have your help.
I will try to use both types of `global_representations`.