
Comments (5)

jerrybai1995 commented on August 15, 2024

You should be able to use the TCN as a direct replacement for the LSTM interface (so if you "snaked through" the image with an LSTM, you can also do it with a TCN). However, since the handwriting images are in left-to-right order, I would start with a bunch of 2D convolutions that collapse the "height" dimension. In particular, given an image of shape HxWxC, you can probably perform a series of Conv2D(in_chan, out_chan, kernel_size=3, stride=(2,1), padding=1) layers, so that after a few such downsamplings your hidden unit will be of shape 1xWxC. You should then be able to pass it to a Conv1D sequence model like the TCN (as the width is now the length).
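A minimal PyTorch sketch of that idea (the channel counts, five-layer depth, and 32-pixel input height here are illustrative assumptions, not anything from this repo):

```python
import torch
import torch.nn as nn

# Each Conv2d halves the height (stride 2) while preserving the width
# (stride 1), so a 32-pixel-tall image collapses to height 1 after 5 layers.
class HeightCollapser(nn.Module):
    def __init__(self, in_chan=1, out_chan=64, num_layers=5):
        super().__init__()
        layers, c = [], in_chan
        for _ in range(num_layers):
            layers += [nn.Conv2d(c, out_chan, kernel_size=3,
                                 stride=(2, 1), padding=1),
                       nn.ReLU()]
            c = out_chan
        self.net = nn.Sequential(*layers)

    def forward(self, x):               # x: (N, C, H, W)
        x = self.net(x)                 # -> (N, out_chan, 1, W)
        return x.squeeze(2)             # -> (N, out_chan, W) for a Conv1D/TCN

x = torch.randn(8, 1, 32, 128)          # batch of 32x128 grayscale images
print(HeightCollapser()(x).shape)       # torch.Size([8, 64, 128]): width kept
```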

Since alignment will be important in your particular application, I think CTC would be a good choice. However, you can also always just do it the traditional way: compress the image to a single vector, and pass it to the TCN one time step at a time. (And to answer your question: no, I have not experimented with image-to-text models.)
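For the CTC route, a rough sketch of the wiring (assuming PyTorch's nn.CTCLoss; the 64-dim features, 26-letter vocabulary, and blank index 0 are placeholder assumptions):

```python
import torch
import torch.nn as nn

vocab_size = 27                          # 26 characters + CTC blank at index 0
head = nn.Linear(64, vocab_size)

feats = torch.randn(8, 64, 128)          # (N, C, W) from the conv stack / TCN
logits = head(feats.transpose(1, 2))     # (N, W, vocab): per-column scores
log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, N, C)

targets = torch.randint(1, vocab_size, (8, 20))     # padded label batch
input_lengths = torch.full((8,), 128, dtype=torch.long)
target_lengths = torch.randint(5, 21, (8,))

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```

Note that the width (128 here) must stay at least as long as the longest target sequence for CTC to apply.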

Let me know if this helps!


addisonklinke commented on August 15, 2024

@jerrybai1995 Thank you for the quick response! I see your point about collapsing the height to 1 via convolutions; however, for images with a more square aspect ratio than my example, I do not think this would be feasible. As the height shrinks, the width (i.e. sequence length) dimension is reduced as well. If that becomes shorter than your maximum target sequence length, then you do not have enough timesteps to make a full prediction and the CTC loss would not apply.

A couple of options come to mind:

  1. Convolve until the width is at the minimum acceptable value (relative to target lengths), and then concatenate the height dimension into one long vector
  2. Same as above, but have some simple linear layer(s) map the column features into a single dimension
  3. Selectively upsample the width after the height has been reduced to one, although I am not really sure how you would implement this with convolutions, and it would probably distort the feature maps

Any of those sound better than the others?

Also, in OpenNMT's recurrent architecture the decoder LSTM does "snake" through the image in a sense. However, it uses an attention mechanism, so at each timestep the decoder can choose to focus selectively on different rows of the feature volume. Correct me if I am wrong, but I do not think the TCN needs anything like attention, since it has access to all the timesteps at once, whereas the decoder LSTM has to handle them one by one.


jerrybai1995 commented on August 15, 2024

Why is the width reduced as the height shrinks? As I mentioned, if you take stride=(2,1) (i.e., stride 2 on height, stride 1 on width), the width dimension will NOT be changed but the height dimension will reduce by a factor of 2 for each layer.
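A quick shape check (PyTorch assumed) makes this concrete:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 16, kernel_size=3, stride=(2, 1), padding=1)
x = torch.randn(1, 1, 32, 128)
print(conv(x).shape)  # torch.Size([1, 16, 16, 128]): height halved, width kept
```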

This should save you from the three proposed options, I think.


addisonklinke commented on August 15, 2024

I see now; sorry, I glossed over your stride specification.

What are your thoughts about the attention mechanism? To restate my earlier question: since a TCN has access to all the timesteps at once, whereas the decoder LSTM has to handle them one by one, does the TCN need anything like attention at all?

I also found a paper implementing a similar conv2conv architecture (except for video captioning). They use an attention mechanism, discussed on pg. 5, and their decoder seems quite similar to the TCN concept. The main addition looks to be the temporal deformations in their encoder; however, they don't have an ablation study as substantial as your paper's to show whether those are really necessary or whether simple convolutions would suffice.


jerrybai1995 commented on August 15, 2024

But still, in the decoding process you would want to do things in a generative, one-by-one way: you first generate t=1, and then using that generation you can produce t=2, and so forth. Although a TCN does have access to all the time steps during training, at generation time you do not yet have all the time steps; you still need to generate them (t=1 first, then t=2, etc.).
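A sketch of that generation loop (assumed: a toy causal Conv1d standing in for a TCN decoder; the vocabulary size and start token are placeholders):

```python
import torch
import torch.nn as nn

class TinyCausalDecoder(nn.Module):
    def __init__(self, vocab=27, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3)  # causal via left-pad
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                    # tokens: (N, T)
        h = self.embed(tokens).transpose(1, 2)    # (N, dim, T)
        h = self.conv(nn.functional.pad(h, (2, 0)))  # pad left: no future leak
        return self.head(h.transpose(1, 2))       # (N, T, vocab)

model = TinyCausalDecoder()
tokens = torch.zeros(1, 1, dtype=torch.long)      # start-of-sequence placeholder
for _ in range(10):                               # t=1 first, then t=2, ...
    logits = model(tokens)[:, -1]                 # score for the next step only
    nxt = logits.argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)      # feed the generation back in
```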

