
xswem's Introduction

XSWEM


A simple and explainable deep learning model for NLP implemented in TensorFlow.

Based on SWEM-max as proposed by Shen et al. in Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms, 2018.

This package is currently in development. Its purpose is to make it easy to train and explain SWEM-max.

You can find demos of the functionality we have implemented in the notebooks directory of the package. Each notebook has a badge that allows you to run it yourself in Google Colab. We will add more notebooks as new functionality is added.

For a demo of how to train a basic SWEM-max model see train_xswem.ipynb.
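As a rough illustration of the architecture (not the xswem API itself), a basic SWEM-max classifier in Keras looks something like the sketch below. The vocabulary size, embedding dimension, dropout rate, and number of classes are illustrative values, not defaults fixed by the package.

```python
import tensorflow as tf

# Illustrative hyperparameters, not values fixed by xswem.
VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 300      # assumed embedding dimension
NUM_CLASSES = 4      # e.g. the four ag_news classes

# SWEM-max: word embeddings -> dropout -> max over the sequence -> classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```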

Local Explanations

We are currently implementing some methods we have developed for local explanations.

local_explain_most_salient_words

So far we have only implemented the local_explain_most_salient_words method. This method extracts the words the model has learnt as most salient from a given input sentence. Below we show an example of this method using a sample from the ag_news dataset. This method is explained in more detail in the local_explain_most_salient_words.ipynb notebook.

local_explain_most_salient_words.png

Global Explanations

We have implemented the global explainability method proposed in section 4.1.1 of the original paper. You can see a demo of this method in the notebook global_explain_embedding_components.ipynb.

How to install

This package is hosted on PyPI and can be installed using pip.

pip install xswem

xswem's People

Contributors

kieranlitschel


xswem's Issues

Add option to initialize model with pre-trained GloVe word embeddings

We have found that we are able to achieve similar performance by initializing word embeddings randomly, but in the original paper the authors initialized them with pre-trained GloVe word embeddings. We should enable this functionality by implementing the following:

  • Allow users to initialize the embedding layer with their own pre-trained weights. We should recommend that they use the pre-trained GloVe weights. Words that the user does not have pre-trained weights for should be initialized using a random uniform distribution with values in the range -0.01 to 0.01.

The authors also sometimes added a Dense layer between the embedding and pooling layers to allow the model to adapt the embeddings to the task of interest. We should allow users to do this by implementing the following:

  • Allow users to optionally specify that a Dense layer should be included between the embedding and pooling layers. It should have the same number of units as the embedding layer and use a ReLU activation function.

Typically we would freeze the embedding layer when using pre-trained weights, but the authors do not mention this explicitly in their paper, nor do they freeze the weights in their original source code. So in our implementation the embedding weights are trainable.
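A minimal sketch of how this could look in Keras is shown below. The helper names (`build_embedding_matrix`, `glove_vectors`, `word_index`, `adapt_embeddings`) are hypothetical and only illustrate the behaviour described above; they are not part of the xswem API.

```python
import numpy as np
import tensorflow as tf

def build_embedding_matrix(word_index, glove_vectors, embed_dim=300):
    """Initialize rows from GloVe where available, otherwise U(-0.01, 0.01).

    word_index: hypothetical dict mapping each word to a row index.
    glove_vectors: hypothetical dict mapping words to pre-trained GloVe vectors.
    """
    matrix = np.random.uniform(-0.01, 0.01, size=(len(word_index) + 1, embed_dim))
    for word, idx in word_index.items():
        if word in glove_vectors:
            matrix[idx] = glove_vectors[word]
    return matrix

def build_swem_max(embedding_matrix, num_classes, adapt_embeddings=False):
    vocab_size, embed_dim = embedding_matrix.shape
    layers = [
        tf.keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=True,  # embeddings stay trainable, matching the original code
        ),
    ]
    if adapt_embeddings:
        # Optional Dense layer between embedding and pooling, with the same
        # number of units as the embedding layer and a ReLU activation.
        layers.append(tf.keras.layers.Dense(embed_dim, activation="relu"))
    layers += [
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ]
    return tf.keras.Sequential(layers)
```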

Dropout before max pooling killing embedding components during training

When a unit is dropped out, its value is set to 0. As we are applying dropout directly to the word embeddings, for long input sequences it becomes increasingly likely that at least one component in each dimension will be set to zero. This means that negative components can often die, as they get stuck at negative values because the zeros introduced by dropout are taken as the maximum instead.

This is particularly problematic as our distribution for initializing embeddings is centred at zero, meaning around half of the components are initialized to values less than zero. The histogram below exemplifies this issue.

[Histogram of the trained embedding weights]

One possible solution is to initialize all embedding weights with values greater than zero. This should significantly reduce the number of dying units, but units will still die if they are updated to a value less than zero.

A better solution would be to make it so that zero is ignored during the max-pooling operation, but this may slow down training significantly, which would make the first solution preferable.
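As an illustration of the second idea, the sketch below replaces exact zeros with a very negative value before taking the max, so dropped-out entries can never win the max operation. This is only an assumption about how it could be implemented, not code from the package, and the cost of the extra masking would need to be measured against the concern about training speed noted above.

```python
import tensorflow as tf

class NonZeroGlobalMaxPooling1D(tf.keras.layers.Layer):
    """Max over the sequence dimension that ignores exact zeros."""

    def call(self, inputs):
        # inputs: (batch, sequence_length, embed_dim). Entries that are exactly
        # zero (e.g. zeroed by dropout) are replaced with a very negative value
        # so they can never be selected as the maximum.
        very_negative = tf.ones_like(inputs) * inputs.dtype.min
        masked = tf.where(tf.equal(inputs, 0.0), very_negative, inputs)
        return tf.reduce_max(masked, axis=1)
```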

Implement global explainability for word embedding components

In section 4.1.1 of the original paper, the authors proposed a method for interpreting the components of the embeddings learned by SWEM-max. We should implement this method in XSWEM.

To do this we first need to implement a function that allows users to generate a histogram of their word embeddings, so that they can confirm whether the embeddings learned by their model are also sparse. Second, we need to implement a function that returns the n words with the largest values for each component (n=5 should be the default).
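A rough sketch of what these two helpers could look like, operating on a NumPy embedding matrix and an index-to-word mapping (both names are illustrative, not the xswem API), is given below.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_embedding_histogram(embedding_weights, bins=100):
    """Histogram of all embedding components, to check whether they are sparse."""
    plt.hist(embedding_weights.flatten(), bins=bins)
    plt.xlabel("Embedding component value")
    plt.ylabel("Count")
    plt.show()

def top_words_per_component(embedding_weights, index_to_word, n=5):
    """For each embedding component, return the n words with the largest values."""
    # embedding_weights: (vocab_size, embed_dim); index_to_word maps row -> word.
    top = {}
    for component in range(embedding_weights.shape[1]):
        top_indices = np.argsort(embedding_weights[:, component])[::-1][:n]
        top[component] = [index_to_word[i] for i in top_indices]
    return top
```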

Implement method to determine most salient words

At most, d words (where d is the embedding dimension) from the input sentence contribute to the output of the network. This is because the max-pooling layer keeps only the maximum value of each dimension across the embeddings of the input sentence, so at most d words contribute to its output.

Thus, where d is smaller than the number of unique words in the input sentence, the max-pooling layer has the effect of shortlisting the d most important words needed to make a prediction. If d is larger than the number of unique words in the input sentence, it can still have the effect of shortlisting words, because some words may have the maximum value for multiple dimensions, but shortlisting is not guaranteed.

We can find the shortlisted words by taking an argmax for each dimension across the embeddings of the input sentence. We should add a function to XSWEM to do this, which can then be used as a method for local explainability.
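A minimal sketch of this argmax-based shortlisting is given below. The function name and the idea of ranking words by how many dimensions they win are illustrative assumptions, not the exact xswem implementation.

```python
import numpy as np

def most_salient_words(sentence_tokens, embedding_weights, word_index):
    """Words whose embeddings win the max in at least one dimension.

    sentence_tokens: list of words in the input sentence (hypothetical input).
    embedding_weights: (vocab_size, embed_dim) embedding matrix.
    word_index: maps word -> row in the embedding matrix.
    """
    # Embeddings of the words in the sentence: (sentence_length, embed_dim).
    sentence_embeddings = np.stack(
        [embedding_weights[word_index[w]] for w in sentence_tokens])
    # For each dimension, the position of the word with the maximum value.
    winners = np.argmax(sentence_embeddings, axis=0)
    # Count how many dimensions each word wins; more wins = more salient
    # (ranking by win count is an assumption made for this sketch).
    counts = np.bincount(winners, minlength=len(sentence_tokens))
    order = np.argsort(counts)[::-1]
    return [(sentence_tokens[i], int(counts[i])) for i in order if counts[i] > 0]
```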

Allow users to set the parameters of any layer

We should allow users to set the parameters of any layer.

We could do this using a configuration dictionary passed to the constructor of the model, which would map layer names to a configuration dictionary for the corresponding layer. Each layer's configuration dictionary could then be unpacked in that layer's constructor using **kwargs.
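A minimal sketch of this idea is shown below; the layer names used as dictionary keys and the `build_model` helper are illustrative, not part of the xswem API.

```python
import tensorflow as tf

def build_model(vocab_size, embed_dim, num_classes, layer_config=None):
    """Sketch: `layer_config` maps layer names to kwargs for that layer's constructor."""
    layer_config = layer_config or {}
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim,
                                  **layer_config.get("embedding", {})),
        tf.keras.layers.Dropout(0.5, **layer_config.get("dropout", {})),
        tf.keras.layers.GlobalMaxPooling1D(**layer_config.get("max_pool", {})),
        tf.keras.layers.Dense(num_classes, activation="softmax",
                              **layer_config.get("output", {})),
    ])

# Example: customise the embedding initializer and disable the output bias.
model = build_model(
    vocab_size=20000, embed_dim=300, num_classes=4,
    layer_config={"embedding": {"embeddings_initializer": "uniform"},
                  "output": {"use_bias": False}})
```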

Investigate setting embedding weights not seen in training to zero to reduce saved model size

In #10 we observed that a lot of the weights in the embeddings appear never to be seen during training and so maintain their initialized values. From what we can tell, this seems to happen for most weights initialized with a negative value. If we randomly initialize our embedding layer, weights that have never been seen during training contribute little to prediction at test time, as their values are random. We may be able to use this to make our saved models smaller.

After training, we could check which weights have not changed from their initialized values and set them to zero. Then, when saving the matrix of embedding weights, we only need to save the non-zero values and where they occur in the matrix.
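A rough sketch of this pruning and sparse-saving step is given below; the use of SciPy's sparse matrices and the function name are purely illustrative, not the package's saving mechanism.

```python
import numpy as np
from scipy import sparse

def sparsify_unchanged_weights(initial_weights, trained_weights):
    """Zero weights that never moved from their initial values, then keep only
    the non-zero entries and their positions as a sparse matrix."""
    unchanged = np.isclose(initial_weights, trained_weights)
    pruned = np.where(unchanged, 0.0, trained_weights)
    return sparse.coo_matrix(pruned)

# Illustrative save/restore of the pruned embedding matrix:
# sparse.save_npz("embedding_weights.npz",
#                 sparsify_unchanged_weights(w_init, w_trained).tocsr())
# restored = sparse.load_npz("embedding_weights.npz").toarray()
```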
