
Comments (5)

jerrybai1995 commented on August 15, 2024

You should be able to use the TCN as a direct replacement for the LSTM interface (so if you "snaked through" the image with an LSTM, you can also do it with a TCN). However, since the handwriting images are in left-to-right order, I would start with a bunch of 2D convolutions that collapse the "height" dimension. In particular, given an image of shape HxWxC, you can probably perform a series of Conv2D(in_chan, out_chan, kernel_size=3, stride=(2,1), padding=1) layers, so that after a few such downsamplings your hidden unit will be of shape 1xWxC. You should then be able to pass it to a Conv1D sequence model like the TCN (as the width is now the length).
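A minimal PyTorch sketch of that idea (the channel counts, five-layer depth, and 32-pixel input height here are illustrative assumptions, not anything from this repo):

```python
import torch
import torch.nn as nn

# Each Conv2d halves the height (stride 2) while preserving the width
# (stride 1), so a 32-pixel-tall image collapses to height 1 after 5 layers.
class HeightCollapser(nn.Module):
    def __init__(self, in_chan=1, out_chan=64, num_layers=5):
        super().__init__()
        layers, c = [], in_chan
        for _ in range(num_layers):
            layers += [nn.Conv2d(c, out_chan, kernel_size=3,
                                 stride=(2, 1), padding=1),
                       nn.ReLU()]
            c = out_chan
        self.net = nn.Sequential(*layers)

    def forward(self, x):               # x: (N, C, H, W)
        x = self.net(x)                 # -> (N, out_chan, 1, W)
        return x.squeeze(2)             # -> (N, out_chan, W) for a Conv1D/TCN

x = torch.randn(8, 1, 32, 128)          # batch of 32x128 grayscale images
print(HeightCollapser()(x).shape)       # torch.Size([8, 64, 128]): width kept
```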

Since alignment will be important in your particular application, I think CTC would be a good choice. However, you can also always just do it the traditional way: compress the image to a single vector, and pass it to the TCN one time step at a time. (And to answer your question: no, I have not experimented with image-to-text models.)
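For the CTC route, a rough sketch of the wiring (assuming PyTorch's nn.CTCLoss; the 64-dim features, 26-letter vocabulary, and blank index 0 are placeholder assumptions):

```python
import torch
import torch.nn as nn

vocab_size = 27                          # 26 characters + CTC blank at index 0
head = nn.Linear(64, vocab_size)

feats = torch.randn(8, 64, 128)          # (N, C, W) from the conv stack / TCN
logits = head(feats.transpose(1, 2))     # (N, W, vocab): per-column scores
log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, N, C)

targets = torch.randint(1, vocab_size, (8, 20))     # padded label batch
input_lengths = torch.full((8,), 128, dtype=torch.long)
target_lengths = torch.randint(5, 21, (8,))

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```

Note that the width (128 here) must stay at least as long as the longest target sequence for CTC to apply.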

Let me know if this helps!


addisonklinke commented on August 15, 2024

@jerrybai1995 Thank you for the quick response! I see your point about collapsing the height to 1 via convolutions; however, for images with a more square aspect ratio than my example, I do not think this would be feasible. As the height shrinks, the width (i.e. sequence length) dimension is reduced as well. If that becomes shorter than your maximum target sequence length, then you do not have enough timesteps to make a full prediction and the CTC loss would not apply.

A couple of options come to mind:

  1. Convolve until the width is at the minimum acceptable value (relative to target lengths), and then concatenate the height dimension into one long vector
  2. Same as above, but have some simple linear layer(s) map the column features into a single dimension
  3. Selectively upsample the width after the height has been reduced to one, although I am not really sure how you would implement this with convolutions, and it would probably distort the feature maps

Any of those sound better than the others?

Also, in OpenNMT's recurrent architecture the decoder LSTM does "snake" through the image in a sense. However, it uses an attention mechanism, so at each timestep the decoder can choose to focus selectively on different rows of the feature volume. Correct me if I am wrong, but I do not think the TCN needs anything like attention, since it has access to all the timesteps at once, whereas the decoder LSTM has to handle them one by one.


jerrybai1995 commented on August 15, 2024

Why is the width reduced as the height shrinks? As I mentioned, if you take stride=(2,1) (i.e., stride 2 on height, stride 1 on width), the width dimension will NOT be changed but the height dimension will reduce by a factor of 2 for each layer.
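A quick shape check (PyTorch assumed) makes this concrete:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 16, kernel_size=3, stride=(2, 1), padding=1)
x = torch.randn(1, 1, 32, 128)
print(conv(x).shape)  # torch.Size([1, 16, 16, 128]): height halved, width kept
```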

This should save you from the three proposed options, I think.


addisonklinke commented on August 15, 2024

I see now; sorry, I glossed over your stride specification.

What are your thoughts about the attention mechanism? To restate my earlier question: since a TCN has access to all the timesteps at once, whereas the decoder LSTM has to handle them one by one, does the TCN need anything like attention at all?

I also found a paper implementing a similar conv2conv architecture (except for video captioning). They use an attention mechanism, discussed on pg. 5, and their decoder seems quite similar to the TCN concept. The main addition looks to be the temporal deformations in their encoder; however, they don't have an ablation study as substantial as your paper's to show whether those are really necessary or whether simple convolutions would suffice.


jerrybai1995 commented on August 15, 2024

But still, in the decoding process you would want to do things in a generative, one-by-one way: you first generate t=1, and then using that generation you can produce t=2, and so forth. Although a TCN does have access to all the time steps during training, at generation time you do not yet have all the time steps; you still need to generate them (t=1 first, then t=2, etc.).
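A sketch of that generation loop (assumed: a toy causal Conv1d standing in for a TCN decoder; the vocabulary size and start token are placeholders):

```python
import torch
import torch.nn as nn

class TinyCausalDecoder(nn.Module):
    def __init__(self, vocab=27, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3)  # causal via left-pad
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                    # tokens: (N, T)
        h = self.embed(tokens).transpose(1, 2)    # (N, dim, T)
        h = self.conv(nn.functional.pad(h, (2, 0)))  # pad left: no future leak
        return self.head(h.transpose(1, 2))       # (N, T, vocab)

model = TinyCausalDecoder()
tokens = torch.zeros(1, 1, dtype=torch.long)      # start-of-sequence placeholder
for _ in range(10):                               # t=1 first, then t=2, ...
    logits = model(tokens)[:, -1]                 # score for the next step only
    nxt = logits.argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)      # feed the generation back in
```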

