kensho-technologies / bubs
Keras Implementation of Flair's Contextualized Embeddings
License: Apache License 2.0
This repository provides a way to use the PyTorch-based pretrained Flair embedding weights by converting them for Keras.
Flair embeddings are character-level, so they rely on a mapping from characters to integers.
This repository ships a (275, 100) mapping table (in char_to_int.py), which matches the news-fast weights.
However, when I tried the news weights instead of news-fast, the char-to-int mapping raised an error, because the news weights require a (300, 100) mapping table.
I don't fully understand the difference between news and news-fast.
Where can I get a (300, 100) char-to-int mapping table?
I could not find any char-to-int mapping table in the Flair repository.
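For context, a minimal sketch of how such a table is used: each character of a sentence is mapped to an integer id before being fed to the char-level LSTM. The toy table below is a stand-in for the real (275, 100)-style table in char_to_int.py; the unknown-character fallback and ids here are illustrative assumptions, not the actual vocabulary.

```python
# Toy stand-in for a char-to-int table; real tables come from the
# pretrained language model's character vocabulary.
UNK_ID = 0
char_to_int = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz .")}

def encode(sentence):
    """Map characters to integer ids, falling back to UNK_ID for unknown chars."""
    return [char_to_int.get(ch, UNK_ID) for ch in sentence.lower()]

ids = encode("hi there")  # one integer per character
```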
I understand that the forward and backward index inputs are used to gather token embeddings from the char-LSTM sequence output. This is done by storing pairs of (sentence id within the batch, forward/backward index) — so each index input has shape (batch_size, MAX_TOKEN_SEQUENCE_LEN, 2) — and calling tf.gather_nd() in embedding_layer.py. However, tf.gather_nd() accepts a batch_dims argument (the number of batch dimensions; default 0). With batch_dims=1, the sentence id is no longer needed: we can gather token embeddings from the forward/backward index alone, and the index inputs can have shape (batch_size, MAX_TOKEN_SEQUENCE_LEN, 1). This is especially convenient when using model.fit() directly rather than a generator.
I changed the relevant code as follows:
def _prepare_index_array(self, index_list):
    """Make a 3D array where each row is a padded array of character-level token-end indices."""
    pad_len = self.max_token_sequence_len
    batch_size = len(index_list)
    padding_index = 0
    # padded_sentences = np.full((batch_size, pad_len, 2), padding_index, dtype=np.int32)
    padded_sentences = np.full((batch_size, pad_len, 1), padding_index, dtype=np.int32)
    for i in range(batch_size):
        clipped_len = min(len(index_list[i]), pad_len)
        # padded_sentences[i, :, 0] = i
        if self.prepad:
            # padded_sentences[i, pad_len - clipped_len:, 1] = index_list[i][:clipped_len]
            padded_sentences[i, pad_len - clipped_len:, 0] = index_list[i][:clipped_len]
        else:
            # padded_sentences[i, :clipped_len, 1] = index_list[i][:clipped_len]
            padded_sentences[i, :clipped_len, 0] = index_list[i][:clipped_len]
    return padded_sentences
def batch_indexing(inputs):
    """Index a character-level embedding matrix at token end locations.

    Args:
        inputs: a list of two tensors:
            tensor1: tensor of (batch_size, max_char_seq_len, char_embed_dim*2) of all
                char-level embeddings
            tensor2: tensor of (batch_size, max_token_seq_len, 1) of indices of token
                ends within each sentence. With batch_dims=1, the batch (sentence)
                dimension is handled implicitly, so no explicit sentence index is needed.

    Returns:
        A tensor of (batch_size, max_token_seq_len, char_embed_dim*2) of char-level
        embeddings at ends of tokens
    """
    embeddings, indices = inputs
    # this will break on deserialization if we simply import tensorflow
    # we have to use keras.backend.tf instead of tensorflow
    # return tf.gather_nd(embeddings, indices)
    return tf.gather_nd(embeddings, indices, batch_dims=1)
forward_index_input = Input(
    # batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 2), name="forward_index_input", dtype="int32"
    batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 1), name="forward_index_input", dtype="int32"
)
backward_index_input = Input(
    # batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 2), name="backward_index_input", dtype="int32"
    batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 1), name="backward_index_input", dtype="int32"
)
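To make the equivalence concrete, here is a plain-NumPy illustration of what the two indexing schemes compute (np.take_along_axis plays the role of tf.gather_nd with batch_dims=1; the shapes and values are made up for the demo):

```python
import numpy as np

batch_size, max_char_len, dim = 2, 5, 3
embeddings = np.arange(batch_size * max_char_len * dim, dtype=np.float32).reshape(
    batch_size, max_char_len, dim
)

# Old scheme: shape (batch, tokens, 2) with explicit (sentence_id, token_end) pairs.
pairs = np.array([[[0, 1], [0, 4]], [[1, 2], [1, 3]]])
old = embeddings[pairs[..., 0], pairs[..., 1]]

# New scheme: shape (batch, tokens, 1); the batch dimension is implicit,
# which is what tf.gather_nd(..., batch_dims=1) does.
idx = pairs[..., 1:]  # drop the sentence-id column
new = np.take_along_axis(embeddings, idx, axis=1)

assert np.array_equal(old, new)  # both pick the same token-end embeddings
```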
The ContextualizedEmbedding layer does not seem to support masking:
it neither sets self.supports_masking = True
nor implements compute_mask().
Is it possible to add masking support?
For example, I tried implementing compute_mask() as follows:
def compute_mask(self, inputs, mask=None):
    (
        forward_input,
        backward_input,
        forward_index_input,
        backward_index_input,
        forward_mask_input,
        backward_mask_input,
    ) = inputs
    return [forward_mask_input, backward_mask_input]
Is this correct?
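For reference, a minimal standalone sketch of a Keras layer that derives its output mask from an explicit 0/1 mask input, in the same spirit as the compute_mask() above. The layer name and inputs are illustrative assumptions, not part of bubs:

```python
import tensorflow as tf

class MaskFromInput(tf.keras.layers.Layer):
    """Pass embeddings through and expose the explicit mask input as a Keras mask."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True  # tell Keras this layer produces a mask

    def call(self, inputs):
        embeddings, mask_input = inputs
        return embeddings

    def compute_mask(self, inputs, mask=None):
        # Build the mask from the explicit 0/1 input, not from an upstream mask.
        _, mask_input = inputs
        return tf.cast(mask_input, tf.bool)
```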
Hello, how can I get pooled contextualized embeddings, as described in http://alanakbik.github.io/papers/naacl2019_embeddings.pdf?
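The core idea of that paper is to keep a memory of every contextual embedding seen for a word and pool it with the current one. A conceptual sketch (this is my own illustration of the pooling mechanism, not something bubs implements; the mean pool and concatenation follow the paper's description):

```python
import numpy as np

class PooledEmbedder:
    """Accumulate contextual embeddings per word and pool them over time."""

    def __init__(self, pool=np.mean):
        self.memory = {}  # word -> list of contextual embeddings seen so far
        self.pool = pool

    def embed(self, word, contextual_vec):
        self.memory.setdefault(word, []).append(contextual_vec)
        pooled = self.pool(np.stack(self.memory[word]), axis=0)
        # The paper concatenates the pooled memory with the current embedding.
        return np.concatenate([pooled, contextual_vec])
```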
The Flair embedding result is pre-padded, but I want post-padded embeddings.
Is it possible to switch from pre-padding to post-padding by modifying InputEncoder?
I modified _prepare_index_array() and _prepare_mask_array() in the InputEncoder class and got post-padded embeddings, but I am not sure whether the result is actually correct when used in a downstream task.
Below is the code I modified:
I added self.prepad (a bool) and pad on the chosen side according to its value.
def _prepare_index_array(self, index_list):
    """Make a 3D array where each row is a padded array of character-level token-end indices."""
    pad_len = self.max_token_sequence_len
    batch_size = len(index_list)
    padding_index = 0
    padded_sentences = np.full((batch_size, pad_len, 2), padding_index, dtype=np.int32)
    for i in range(batch_size):
        clipped_len = min(len(index_list[i]), pad_len)
        padded_sentences[i, :, 0] = i
        if self.prepad:
            padded_sentences[i, pad_len - clipped_len:, 1] = index_list[i][:clipped_len]
        else:
            padded_sentences[i, :clipped_len, 1] = index_list[i][:clipped_len]
    return padded_sentences
def _prepare_mask_array(self, index_list):
    """Make 2D array where each row contains 1's where real tokens were and 0's where padded."""
    pad_len = self.max_token_sequence_len
    batch_size = len(index_list)
    mask = np.zeros((batch_size, pad_len))
    for i, inds in enumerate(index_list):
        if self.prepad:
            mask[i, pad_len - len(inds):] = 1
        else:
            mask[i, :len(inds)] = 1
    return mask
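A quick standalone check of the prepad switch, showing the two mask layouts side by side (the function below restates the mask logic above without the class, with pad_len chosen arbitrarily as 4):

```python
import numpy as np

def prepare_mask(index_list, pad_len, prepad):
    """1 where a real token sits, 0 where padding sits; side chosen by prepad."""
    mask = np.zeros((len(index_list), pad_len), dtype=np.int32)
    for i, inds in enumerate(index_list):
        if prepad:
            mask[i, pad_len - len(inds):] = 1  # tokens pushed to the right
        else:
            mask[i, :len(inds)] = 1  # tokens kept at the left
    return mask

tokens = [[3, 7], [2, 5, 9]]          # token-end indices for two sentences
pre = prepare_mask(tokens, 4, prepad=True)
post = prepare_mask(tokens, 4, prepad=False)
```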