
bubs's People

Contributors

hwchase17, obi1kenobi, ydovzhenko


bubs's Issues

char to int map for news, not news-fast

This repository provides a way to use the pretrained PyTorch-based flair embedding weights by converting them.
Flair embeddings are character-level, which means they rely on a mapping from characters to integers.
This repository also ships a (275, 100) mapping table (in char_to_int.py), which fits the news-fast weights well.
However, when I tried to use the news weights instead of news-fast, the char-to-int mapping raised an error, because the news weights require a (300, 100) mapping table.
I don't fully understand the difference between news and news-fast.
Where can I get a (300, 100) char-to-int mapping table?
I could not find any char-to-int mapping table in the Flair repository.
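One place the mapping may come from (an assumption on my part, not something bubs documents): flair ships each character LM with a flair.data.Dictionary, reachable as FlairEmbeddings('news-forward').lm.dictionary, whose item2idx attribute maps UTF-8-encoded characters (bytes keys) to row indices; for the news model that dictionary would presumably have 300 entries versus 275 for news-fast. A minimal sketch of converting such a bytes-keyed mapping into the char_to_int.py format, with a toy dict standing in for the real dictionary:

```python
# Sketch: turn a flair-style bytes-keyed item2idx into a plain char-to-int
# dict like the one in char_to_int.py. `toy_item2idx` stands in for
# lm.dictionary.item2idx (assumed attribute; check your flair version).
toy_item2idx = {b"<unk>": 0, b"a": 1, b"b": 2, b"\xc3\xa9": 3}  # b"\xc3\xa9" is UTF-8 for 'e' with acute accent

char_to_int = {key.decode("utf-8"): idx for key, idx in toy_item2idx.items()}

assert char_to_int["a"] == 1  # single-byte chars map directly
assert char_to_int["\u00e9"] == 3  # multi-byte chars decode cleanly
```

If the real lm.dictionary.item2idx has 300 entries, the resulting dict should line up with the (300, 100) table the news weights expect.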

On forward and backward index input

I understand that the forward and backward index inputs are used to gather token embeddings from the char-LSTM sequence output. This is done by computing a sentence id (within the batch) together with a forward/backward index, so the index inputs have shape (batch_size, MAX_TOKEN_SEQUENCE_LEN, 2), and then calling tf.gather_nd() in embedding_layer.py. However, tf.gather_nd() also accepts batch_dims (the number of batch dimensions; the default is 0). With batch_dims=1 we can gather the token embeddings from the forward/backward index alone, without the sentence id, so the index input shape can shrink to (batch_size, MAX_TOKEN_SEQUENCE_LEN, 1). This is especially convenient when using model.fit() (rather than a generator).

So, I changed the relevant code as follows:

    def _prepare_index_array(self, index_list):
        """Make a 2D array where each row is a padded array of character-level token-end indices."""
        pad_len = self.max_token_sequence_len
        batch_size = len(index_list)
        padding_index = 0
        # padded_sentences = np.full((batch_size, pad_len, 2), padding_index, dtype=np.int32)
        padded_sentences = np.full((batch_size, pad_len, 1), padding_index, dtype=np.int32)
        for i in range(batch_size):
            clipped_len = min(len(index_list[i]), pad_len)
            # padded_sentences[i, :, 0] = i
            if self.prepad:
                # padded_sentences[i, pad_len - clipped_len:, 1] = index_list[i][:clipped_len]
                padded_sentences[i, pad_len - clipped_len:, 0] = index_list[i][:clipped_len]
            else:
                # padded_sentences[i, :clipped_len, 1] = index_list[i][:clipped_len]
                padded_sentences[i, :clipped_len, 0] = index_list[i][:clipped_len]
        return padded_sentences
def batch_indexing(inputs):
    """Index a character-level embedding matrix at token end locations.

    Args:
        inputs: a list of two tensors:
            tensor1: tensor of (batch_size, max_char_seq_len, char_embed_dim*2) of all char-level
                embeddings
            tensor2: tensor of (batch_size, max_token_seq_len, 1) of indices of token ends
                within each sentence, e.g. [[[1], [5]], [[2], [3]], ...]. The sentence index is
                no longer needed because gather_nd is called with batch_dims=1.

    Returns:
        A tensor of (batch_size, max_token_seq_len, char_embed_dim*2) of char-level embeddings
            at ends of tokens
    """
    embeddings, indices = inputs
    # this will break on deserialization if we simply import tensorflow
    # we have to use keras.backend.tf instead of tensorflow
    # return tf.gather_nd(embeddings, indices)
    return tf.gather_nd(embeddings, indices, batch_dims=1)
forward_index_input = Input(
    # batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 2), name="forward_index_input", dtype="int32"
    batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 1), name="forward_index_input", dtype="int32"
)
backward_index_input = Input(
    # batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 2), name="backward_index_input", dtype="int32"
    batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 1), name="backward_index_input", dtype="int32"
)
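The batch_dims equivalence argued above can be checked outside TensorFlow with a small NumPy sketch, where np.take_along_axis plays the role of tf.gather_nd(..., batch_dims=1) and fancy indexing plays the role of the original (sentence_id, token_index) gather:

```python
import numpy as np

# Check, in NumPy, that per-batch gathering with a single token-end index
# (the batch_dims=1 form) matches the original (sentence_id, index) pairs.
batch, chars, dim = 2, 4, 3
embeddings = np.arange(batch * chars * dim).reshape(batch, chars, dim)

pairs = np.array([[[0, 1], [0, 3]], [[1, 0], [1, 2]]])  # (batch, tokens, 2)
single = pairs[..., 1:]                                 # (batch, tokens, 1)

# Equivalent of tf.gather_nd(embeddings, pairs):
gathered_pairs = embeddings[pairs[..., 0], pairs[..., 1]]
# Equivalent of tf.gather_nd(embeddings, single, batch_dims=1):
gathered_single = np.take_along_axis(embeddings, single, axis=1)

assert np.array_equal(gathered_pairs, gathered_single)  # same result, smaller index input
```

The last index dimension broadcasts against char_embed_dim, so each token picks up the full embedding row at its end position.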

Supports masking

The ContextualizedEmbedding layer does not seem to support masking:
it does not set self.supports_masking = True,
and it does not implement compute_mask().

Would it be possible to add code to support masking?
For example, I tried implementing compute_mask() as follows:

    def compute_mask(self, inputs, mask=None):
        (
            forward_input,
            backward_input,
            forward_index_input,
            backward_index_input,
            forward_mask_input,
            backward_mask_input,
        ) = inputs
        return [forward_mask_input, backward_mask_input]

Is this correct?
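As a plain-Python sanity check (not a substitute for testing inside Keras): compute_mask for a layer whose two outputs are the forward and backward token embeddings should return one mask per output, each of shape (batch_size, max_token_seq_len). Passing the two mask inputs through unchanged satisfies that contract, since they already mark real-token positions; a toy sketch:

```python
import numpy as np

def compute_mask_passthrough(inputs, mask=None):
    # Mirrors the proposed compute_mask: ignore the incoming mask and
    # return the explicit mask inputs, one per output tensor.
    (_, _, _, _, forward_mask, backward_mask) = inputs
    return [forward_mask, backward_mask]

batch, max_tokens = 2, 4
dummy = np.zeros((batch, max_tokens))  # stand-ins for the char/index inputs
fwd_mask = np.array([[1, 1, 0, 0], [1, 1, 1, 0]])
bwd_mask = fwd_mask.copy()

masks = compute_mask_passthrough([dummy, dummy, dummy, dummy, fwd_mask, bwd_mask])
# One mask per output, each (batch_size, max_token_seq_len)
assert len(masks) == 2 and masks[0].shape == (batch, max_tokens)
```

Whether Keras actually propagates these masks to downstream layers also depends on self.supports_masking being set, so both changes would be needed together.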

Changing pre-padding to post-padding

The flair embedding result is pre-padded, but I want to get post-padded embeddings.
Is it possible to change pre-padding to post-padding by modifying InputEncoder?
I modified _prepare_index_array() and _prepare_mask_array() in the InputEncoder class and got post-padded embedding results, but I am not sure whether those results are actually sound when used in a downstream task.

Below is the code I modified.
I added self.prepad (bool) and pad on one side or the other depending on its value.

    def _prepare_index_array(self, index_list):
        """Make a 2D array where each row is a padded array of character-level token-end indices."""
        pad_len = self.max_token_sequence_len
        batch_size = len(index_list)
        padding_index = 0
        padded_sentences = np.full((batch_size, pad_len, 2), padding_index, dtype=np.int32)
        for i in range(batch_size):
            clipped_len = min(len(index_list[i]), pad_len)
            padded_sentences[i, :, 0] = i
            if self.prepad:
                padded_sentences[i, pad_len - clipped_len:, 1] = index_list[i][:clipped_len]
            else:
                padded_sentences[i, :clipped_len, 1] = index_list[i][:clipped_len]
        return padded_sentences

    def _prepare_mask_array(self, index_list):
        """Make 2D array where each row contains 1's where real tokens were and 0's where padded."""
        pad_len = self.max_token_sequence_len
        batch_size = len(index_list)
        mask = np.zeros((batch_size, pad_len))
        for i, inds in enumerate(index_list):
            clipped_len = min(len(inds), pad_len)  # clip long sentences, as in _prepare_index_array
            if self.prepad:
                mask[i, pad_len - clipped_len:] = 1
            else:
                mask[i, :clipped_len] = 1
        return mask
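To see what flipping self.prepad changes, here is the same mask logic as a free function (a sketch mirroring _prepare_mask_array above, with the index lists clipped to pad_len). Each row keeps the same number of 1's; only their position moves, which is why a downstream layer that respects the mask should treat pre- and post-padded batches equivalently:

```python
import numpy as np

def prepare_mask_array(index_list, pad_len, prepad=True):
    # Free-function version of the modified _prepare_mask_array.
    mask = np.zeros((len(index_list), pad_len), dtype=np.int32)
    for i, inds in enumerate(index_list):
        n = min(len(inds), pad_len)
        if prepad:
            mask[i, pad_len - n:] = 1  # real tokens at the right edge
        else:
            mask[i, :n] = 1            # real tokens at the left edge
    return mask

index_list = [[3, 7], [2, 5, 9]]  # token-end indices for two sentences
pre = prepare_mask_array(index_list, pad_len=4, prepad=True)
post = prepare_mask_array(index_list, pad_len=4, prepad=False)
# pre  -> [[0, 0, 1, 1], [0, 1, 1, 1]]
# post -> [[1, 1, 0, 0], [1, 1, 1, 0]]
```

Each post-padded row is the pre-padded row mirrored, so the token count per sentence (mask.sum(axis=1)) is identical either way.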
