
gated-conv-nets

This is an implementation of a sigmoid-gated convolutional network for language modeling, as described in Dauphin et al., "Language Modeling with Gated Convolutional Networks" (https://arxiv.org/pdf/1612.08083v1.pdf).

[figure: model architecture]

To run:

Run the training process on the GPU:

unset CUDA_VISIBLE_DEVICES
python model.py --train True

Simultaneously run the validation process on the CPU:

export CUDA_VISIBLE_DEVICES=""
python model.py --valid True

The validation process continually loads the latest training weights as both processes run.

Discussion:

The technique described in this paper sets up a convolutional network to capture the same sort of contextual information that an LSTM or RNN is traditionally good at, while exploiting the CNN's non-sequential, parallelizable nature for large speed gains.
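Concretely, each hidden layer computes two convolutions over its input and multiplies one elementwise by the sigmoid of the other: h(X) = (X*W + b) ⊗ σ(X*V + c). A minimal sketch of this gating in PyTorch (the repo itself appears to use TensorFlow; all names here are illustrative):

    import torch
    import torch.nn as nn

    class GatedConv1d(nn.Module):
        """One gated layer: h(X) = (X*W + b) * sigmoid(X*V + c)."""
        def __init__(self, channels, kernel_width):
            super().__init__()
            self.linear = nn.Conv1d(channels, channels, kernel_width)
            self.gate = nn.Conv1d(channels, channels, kernel_width)

        def forward(self, x):  # x: (batch, channels, time)
            return self.linear(x) * torch.sigmoid(self.gate(x))

The sigmoid gate plays a role analogous to an LSTM's gates, controlling how much of each convolution's output passes up the stack.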

For the language modeling task, this means that the inputs are a sequence of learned word embeddings, and the targets are that same sequence shifted one position to the left. The output vector at each position in the final hidden layer is therefore trained to predict the next word, with the probability computed by a softmax over the vocabulary for each such vector.
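For example, with a toy tokenized sequence, the input/target pairing looks like this (illustrative only):

    tokens  = ["the", "cat", "sat", "on", "the", "mat"]
    inputs  = tokens[:-1]  # ["the", "cat", "sat", "on", "the"]
    targets = tokens[1:]   # ["cat", "sat", "on", "the", "mat"]  (shifted left by one)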

In this way, all output projections are computed simultaneously through the convolutional layers. Care must be taken to pad the inputs such that no convolution kernel can see tokens ahead of the position it is predicting, as that would constitute data leakage.
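Causal padding can be achieved by padding the start of the time axis with k - 1 zeros and adding no padding at the end, so that position t sees only positions up to t (a sketch, using the same layout as above):

    import torch
    import torch.nn.functional as F

    k = 3                                # kernel width
    x = torch.randn(1, 256, 25)          # (batch, channels, time)
    x = F.pad(x, (k - 1, 0))             # left-pad the time axis with k-1 zeros
    conv = torch.nn.Conv1d(256, 256, k)  # no internal padding
    y = conv(x)                          # y[:, :, t] depends only on x up to time t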

We can see that "context" for each word in this scenario is determined by the receptive field of the last layer, which depends on the kernel widths throughout the entire network and is thus finite, as opposed to recurrent architectures, which theoretically can have infinite contexts. For stacked stride-1 convolutions the receptive field grows linearly with depth: L layers of kernel width k see 1 + L(k - 1) tokens, so a net with three hidden layers and a kernel width of 3 has, for each token, a receptive field of 7. In this way it is easy to achieve large contexts with moderate depth and kernel width. The authors also note that in practice, performance tends to decrease for both GCNNs and RNNs as context size increases above 20-25 steps.
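As a quick check (hypothetical helper, not from the repo):

    def receptive_field(num_layers, kernel_width):
        # Each stride-1 layer extends the receptive field by (kernel_width - 1).
        return 1 + num_layers * (kernel_width - 1)

    print(receptive_field(3, 3))   # 7
    print(receptive_field(11, 3))  # 23 -- roughly this implementation's configuration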

As with most language modeling tasks, the most expensive part of the computation is the softmax stage, where each output vector must be scored against every word in the vocabulary. This is expensive because all of the output probabilities must sum to one, so the normalizing denominator is a sum over every possible token in the vocabulary.
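Every output position therefore pays O(|V|) for the normalization (a sketch with this repo's approximate shapes):

    import torch
    import torch.nn.functional as F

    vocab_size, hidden = 33000, 256
    hidden_states = torch.randn(500 * 25, hidden)  # (batch * seq_len, hidden)
    output_proj = torch.randn(hidden, vocab_size)

    logits = hidden_states @ output_proj           # (positions, vocab): the O(|V|) step
    log_probs = F.log_softmax(logits, dim=-1)      # normalizes over all ~33k words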

The authors use a newer technique called the adaptive softmax to approximate the softmax for speed. In this implementation, I use a full softmax, as the vocabulary for Wikitext-2 is not prohibitively large at ~33k. Of course, this would slow things down considerably on a bigger dataset like Wikitext-103. However, until I have time to read the adaptive softmax paper [0] and understand how it works, the full softmax suffices.

In this implementation, I use a depthwise 2D convolution, treating each input embedding dimension as a separate channel. This particular net is 11 layers deep, with 2 residual layers and an output embedding projection of 256 dimensions. The kernel width is 3 throughout the network, and the sequence length is 25. I use weight normalization [1], as per the paper. The batch size is 500.
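Putting the pieces together, one residual gated layer might look like the following. This is a minimal sketch of the general shape in PyTorch, with hypothetical names; the actual model.py uses a depthwise 2D convolution in TensorFlow and differs in detail:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedResidualBlock(nn.Module):
        """Causal gated conv layer with weight normalization and a residual connection."""
        def __init__(self, channels=256, k=3):
            super().__init__()
            self.k = k
            # Weight normalization as in Salimans & Kingma [1]
            self.linear = nn.utils.weight_norm(nn.Conv1d(channels, channels, k))
            self.gate = nn.utils.weight_norm(nn.Conv1d(channels, channels, k))

        def forward(self, x):                   # x: (batch, channels, time)
            h = F.pad(x, (self.k - 1, 0))       # causal left-padding
            h = self.linear(h) * torch.sigmoid(self.gate(h))
            return x + h                        # residual connection

Stacking 11 such blocks with k = 3 gives the 23-token receptive field computed above, close to the 25-token sequence length.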

Note: In the original paper, the initial sequence padding is given as k/2 (k = kernel width). It should in fact be k - 1, as padding by only k/2 would allow future context to leak into the current word. I have verified this with the authors, and the error will be corrected in the next version of the paper.
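The leak is easy to verify numerically (a hypothetical check, not from the repo):

    import torch
    import torch.nn.functional as F

    k = 3
    x = torch.zeros(1, 1, 6)
    x[0, 0, 4] = 1.0                          # mark a "future" token at position 4
    conv = torch.nn.Conv1d(1, 1, k, bias=False)
    torch.nn.init.ones_(conv.weight)

    leaky = conv(F.pad(x, (k // 2, k // 2)))  # symmetric k/2 padding
    causal = conv(F.pad(x, (k - 1, 0)))       # k-1 left padding
    print(leaky[0, 0, 3].item())              # 1.0 -> position 3 already sees position 4
    print(causal[0, 0, 3].item())             # 0.0 -> no future leakage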

Moving forward, there are likely some interesting experiments to be done around the various kernel-width/layer-depth permutations that yield identical effective context sizes.

Performance:

In 14k steps (at a batch size of 500), this implementation is able to achieve a test perplexity of 275 on Wikitext-2. SOTA is 68.9 (number of steps unknown) [2]. Below is the validation loss over time:

[figure: validation loss over time]

[0] https://arxiv.org/pdf/1609.04309v2.pdf

[1] https://arxiv.org/pdf/1602.07868v3.pdf

[2] https://openreview.net/pdf?id=B184E5qee
