
bubs's People

Contributors

hwchase17, obi1kenobi, ydovzhenko


bubs's Issues

char to int map for news, not news-fast

This repository provides a way to use the pretrained PyTorch-based flair embedding weights by converting them.
Flair embeddings are character-level, which means they rely on a mapping from characters to integers.
This repository also ships a (275, 100) mapping table (in char_to_int.py), which fits the news-fast weights well.
However, when I tried to use the news weights instead of news-fast, the char-to-int mapping raised an error, because the news weights require a (300, 100) mapping table.
I don't fully understand the difference between news and news-fast.
Where can I get a (300, 100) char-to-int mapping table?
I could not find any char-to-int mapping table in the Flair repository.
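One place the mapping may come from (an assumption on my part, not something bubs documents): flair ships each character LM with a flair.data.Dictionary, reachable as FlairEmbeddings('news-forward').lm.dictionary, whose item2idx attribute maps UTF-8-encoded characters (bytes keys) to row indices; for the news model that dictionary would presumably have 300 entries versus 275 for news-fast. A minimal sketch of converting such a bytes-keyed mapping into the char_to_int.py format, with a toy dict standing in for the real dictionary:

```python
# Sketch: turn a flair-style bytes-keyed item2idx into a plain char-to-int
# dict like the one in char_to_int.py. `toy_item2idx` stands in for
# lm.dictionary.item2idx (assumed attribute; check your flair version).
toy_item2idx = {b"<unk>": 0, b"a": 1, b"b": 2, b"\xc3\xa9": 3}  # b"\xc3\xa9" is UTF-8 for 'e' with acute accent

char_to_int = {key.decode("utf-8"): idx for key, idx in toy_item2idx.items()}

assert char_to_int["a"] == 1  # single-byte chars map directly
assert char_to_int["\u00e9"] == 3  # multi-byte chars decode cleanly
```

If the real lm.dictionary.item2idx has 300 entries, the resulting dict should line up with the (300, 100) table the news weights expect.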

On forward and backward index input

I understand that the forward and backward index inputs are used to gather token embeddings from the char-LSTM sequence output. This is done by computing a sentence id (within the batch) together with a forward/backward index, so the index inputs have shape (batch_size, MAX_TOKEN_SEQUENCE_LEN, 2), and then calling tf.gather_nd() in embedding_layer.py. However, tf.gather_nd() also accepts batch_dims (the number of batch dimensions; the default is 0). With batch_dims=1 we can gather the token embeddings from the forward/backward index alone, without the sentence id, so the index input shape can shrink to (batch_size, MAX_TOKEN_SEQUENCE_LEN, 1). This is especially convenient when using model.fit() (rather than a generator).

So, I changed the relevant code as follows:

    def _prepare_index_array(self, index_list):
        """Make a 2D array where each row is a padded array of character-level token-end indices."""
        pad_len = self.max_token_sequence_len
        batch_size = len(index_list)
        padding_index = 0
        # padded_sentences = np.full((batch_size, pad_len, 2), padding_index, dtype=np.int32)
        padded_sentences = np.full((batch_size, pad_len, 1), padding_index, dtype=np.int32)
        for i in range(batch_size):
            clipped_len = min(len(index_list[i]), pad_len)
            # padded_sentences[i, :, 0] = i
            if self.prepad:
                # padded_sentences[i, pad_len - clipped_len:, 1] = index_list[i][:clipped_len]
                padded_sentences[i, pad_len - clipped_len:, 0] = index_list[i][:clipped_len]
            else:
                # padded_sentences[i, :clipped_len, 1] = index_list[i][:clipped_len]
                padded_sentences[i, :clipped_len, 0] = index_list[i][:clipped_len]
        return padded_sentences
def batch_indexing(inputs):
    """Index a character-level embedding matrix at token end locations.

    Args:
        inputs: a list of two tensors:
            tensor1: tensor of (batch_size, max_char_seq_len, char_embed_dim*2) of all char-level
                embeddings
            tensor2: tensor of (batch_size, max_token_seq_len, 1) of indices of token ends
                within each sentence, e.g. [[[1], [5]], [[2], [3]], ...]. The sentence index is
                no longer needed because gather_nd is called with batch_dims=1.

    Returns:
        A tensor of (batch_size, max_token_seq_len, char_embed_dim*2) of char-level embeddings
            at ends of tokens
    """
    embeddings, indices = inputs
    # this will break on deserialization if we simply import tensorflow
    # we have to use keras.backend.tf instead of tensorflow
    # return tf.gather_nd(embeddings, indices)
    return tf.gather_nd(embeddings, indices, batch_dims=1)
forward_index_input = Input(
    # batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 2), name="forward_index_input", dtype="int32"
    batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 1), name="forward_index_input", dtype="int32"
)
backward_index_input = Input(
    # batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 2), name="backward_index_input", dtype="int32"
    batch_shape=(None, MAX_TOKEN_SEQUENCE_LEN, 1), name="backward_index_input", dtype="int32"
)
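The batch_dims equivalence argued above can be checked outside TensorFlow with a small NumPy sketch, where np.take_along_axis plays the role of tf.gather_nd(..., batch_dims=1) and fancy indexing plays the role of the original (sentence_id, token_index) gather:

```python
import numpy as np

# Check, in NumPy, that per-batch gathering with a single token-end index
# (the batch_dims=1 form) matches the original (sentence_id, index) pairs.
batch, chars, dim = 2, 4, 3
embeddings = np.arange(batch * chars * dim).reshape(batch, chars, dim)

pairs = np.array([[[0, 1], [0, 3]], [[1, 0], [1, 2]]])  # (batch, tokens, 2)
single = pairs[..., 1:]                                 # (batch, tokens, 1)

# Equivalent of tf.gather_nd(embeddings, pairs):
gathered_pairs = embeddings[pairs[..., 0], pairs[..., 1]]
# Equivalent of tf.gather_nd(embeddings, single, batch_dims=1):
gathered_single = np.take_along_axis(embeddings, single, axis=1)

assert np.array_equal(gathered_pairs, gathered_single)  # same result, smaller index input
```

The last index dimension broadcasts against char_embed_dim, so each token picks up the full embedding row at its end position.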

Supports masking

The ContextualizedEmbedding layer does not seem to support masking:
it does not set self.supports_masking = True,
and it does not implement compute_mask().

Would it be possible to add code to support masking?
For example, I tried implementing compute_mask() as follows:

    def compute_mask(self, inputs, mask=None):
        (
            forward_input,
            backward_input,
            forward_index_input,
            backward_index_input,
            forward_mask_input,
            backward_mask_input,
        ) = inputs
        return [forward_mask_input, backward_mask_input]

Is this correct?
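As a plain-Python sanity check (not a substitute for testing inside Keras): compute_mask for a layer whose two outputs are the forward and backward token embeddings should return one mask per output, each of shape (batch_size, max_token_seq_len). Passing the two mask inputs through unchanged satisfies that contract, since they already mark real-token positions; a toy sketch:

```python
import numpy as np

def compute_mask_passthrough(inputs, mask=None):
    # Mirrors the proposed compute_mask: ignore the incoming mask and
    # return the explicit mask inputs, one per output tensor.
    (_, _, _, _, forward_mask, backward_mask) = inputs
    return [forward_mask, backward_mask]

batch, max_tokens = 2, 4
dummy = np.zeros((batch, max_tokens))  # stand-ins for the char/index inputs
fwd_mask = np.array([[1, 1, 0, 0], [1, 1, 1, 0]])
bwd_mask = fwd_mask.copy()

masks = compute_mask_passthrough([dummy, dummy, dummy, dummy, fwd_mask, bwd_mask])
# One mask per output, each (batch_size, max_token_seq_len)
assert len(masks) == 2 and masks[0].shape == (batch, max_tokens)
```

Whether Keras actually propagates these masks to downstream layers also depends on self.supports_masking being set, so both changes would be needed together.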

Changing pre-padding to post-padding

The flair embedding result is pre-padded, but I want to get post-padded embeddings.
Is it possible to change pre-padding to post-padding by modifying InputEncoder?
I modified _prepare_index_array() and _prepare_mask_array() in the InputEncoder class and got post-padded embedding results, but I am not sure whether those results are actually sound when used in a downstream task.

Below is the code I modified.
I added self.prepad (bool) and pad on one side or the other depending on its value.

    def _prepare_index_array(self, index_list):
        """Make a 2D array where each row is a padded array of character-level token-end indices."""
        pad_len = self.max_token_sequence_len
        batch_size = len(index_list)
        padding_index = 0
        padded_sentences = np.full((batch_size, pad_len, 2), padding_index, dtype=np.int32)
        for i in range(batch_size):
            clipped_len = min(len(index_list[i]), pad_len)
            padded_sentences[i, :, 0] = i
            if self.prepad:
                padded_sentences[i, pad_len - clipped_len:, 1] = index_list[i][:clipped_len]
            else:
                padded_sentences[i, :clipped_len, 1] = index_list[i][:clipped_len]
        return padded_sentences

    def _prepare_mask_array(self, index_list):
        """Make 2D array where each row contains 1's where real tokens were and 0's where padded."""
        pad_len = self.max_token_sequence_len
        batch_size = len(index_list)
        mask = np.zeros((batch_size, pad_len))
        for i, inds in enumerate(index_list):
            clipped_len = min(len(inds), pad_len)  # clip long sentences, as in _prepare_index_array
            if self.prepad:
                mask[i, pad_len - clipped_len:] = 1
            else:
                mask[i, :clipped_len] = 1
        return mask
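To see what flipping self.prepad changes, here is the same mask logic as a free function (a sketch mirroring _prepare_mask_array above, with the index lists clipped to pad_len). Each row keeps the same number of 1's; only their position moves, which is why a downstream layer that respects the mask should treat pre- and post-padded batches equivalently:

```python
import numpy as np

def prepare_mask_array(index_list, pad_len, prepad=True):
    # Free-function version of the modified _prepare_mask_array.
    mask = np.zeros((len(index_list), pad_len), dtype=np.int32)
    for i, inds in enumerate(index_list):
        n = min(len(inds), pad_len)
        if prepad:
            mask[i, pad_len - n:] = 1  # real tokens at the right edge
        else:
            mask[i, :n] = 1            # real tokens at the left edge
    return mask

index_list = [[3, 7], [2, 5, 9]]  # token-end indices for two sentences
pre = prepare_mask_array(index_list, pad_len=4, prepad=True)
post = prepare_mask_array(index_list, pad_len=4, prepad=False)
# pre  -> [[0, 0, 1, 1], [0, 1, 1, 1]]
# post -> [[1, 1, 0, 0], [1, 1, 1, 0]]
```

Each post-padded row is the pre-padded row mirrored, so the token count per sentence (mask.sum(axis=1)) is identical either way.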
