
lstm-ctc-ocr's Introduction

LSTM-CTC-OCR Toy experiment

This project is a toy experiment that applies CTC and LSTM to the OCR problem. So far I have only succeeded in recognizing 20-digit sequences; text with longer context is still hard to train. I may or may not pick this project up again in the future, so for now it mainly serves as a summary.

The trend of line recognition

Recognizing lines of unconstrained text from images has always suffered from segmentation problems, which require carefully designed character segmentation methods and heuristic tuning of cost functions. However, with the development of Recurrent Neural Networks, especially LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit), the trend is to recognize the whole line at once and output the line text end to end.

CTC, Connectionist Temporal Classification

CTC, which was devised by Alex Graves in 2006, is essentially a kind of loss function. In temporal classification and sequence labelling tasks, the alignment between the inputs and outputs is unknown, so we need the CTC loss function to measure the distance between the softmax activations and the ground-truth labels. It does this by summing the probability of every frame-level alignment that collapses to the label sequence.
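To make the idea concrete, here is a minimal sketch of the CTC forward recursion in plain Python. It is not the code used in this project. The function names, the choice of index 0 for the blank symbol, and the use of raw probabilities instead of log probabilities are all assumptions made for illustration. Practical implementations typically work in log space to avoid underflow on long sequences.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def collapse(path, blank=BLANK):
    """Map a frame-level path to a label: merge repeats, then drop blanks."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out


def ctc_neg_log_likelihood(probs, label, blank=BLANK):
    """Negative log of the total probability of all alignments of `label`.

    probs: T x C nested list, per-frame softmax outputs.
    label: list of class ids without blanks.
    """
    # Extended label with blanks around every symbol: [1, 2] -> [0, 1, 0, 2, 0].
    ext = [blank]
    for c in label:
        ext += [c, blank]
    size = len(ext)

    # alpha[s]: probability mass of partial alignments ending at ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last symbol or on the trailing blank.
    p = alpha[size - 1] + (alpha[size - 2] if size > 1 else 0.0)
    return -math.log(p) if p > 0 else float("inf")


# Two frames, two classes (blank and one digit), uniform softmax output.
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, [1])
```

With two frames and the label `[1]`, three paths collapse to the label, namely `(1, 1)`, `(0, 1)` and `(1, 0)`, so the summed probability is 0.75 and the loss is `-log(0.75)`.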

Baidu Research has implemented a fast parallel version of CTC, warp-ctc, along with bindings for Torch. Refer to the warp-ctc README for more information about CTC and warp-ctc.

Original Reference

Application of CTC

Alex Graves developed CTC and applied it to speech recognition and handwriting recognition. Other researchers have continued his work, for example the ocropy project, paragraph recognition, [this version](https://arxiv.org/abs/1604.08352), and online sequence learning.

You can also refer to [Recursive Recurrent Nets with Attention Modeling for OCR in the Wild](http://arxiv.org/abs/1603.03101) to compare these two different modern architectures.


lstm-ctc-ocr's People

Contributors

halfish, nightfury13

lstm-ctc-ocr's Issues

Bad argument #1 to 'set'

I get this output when I try to run 3_phonernn.lua:

    train size = 8, valid size = 2
    nn.Sequential {
      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
      (1): nn.SplitTable
      (2): nn.Sequencer @ nn.FastLSTM(36 -> 128)
      (3): nn.Sequencer @ nn.Recursor @ nn.ReLU
      (4): nn.Sequencer @ nn.Recursor @ nn.BatchNormalization
      (5): nn.Sequencer @ nn.Recursor @ nn.Dropout(0.5, busy)
      (6): nn.Sequencer @ nn.Recursor @ nn.Linear(128 -> 12)
      (7): nn.JoinTable
      (8): nn.View(255, 12)
    }
    [=================== 1/1 =====================>] Tot: 0ms | Step: 0ms
    /home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
    In 8 module of nn.Sequential:
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tensor.lua:458: bad argument #1 to 'set' (expecting number or torch.DoubleTensor or torch.DoubleStorage at /home/ubuntu/torch/pkg/torch/generic/Tensor.c:1125)

What's happening here? How do I solve this?
I've run 1_generateImage.py and 2_dump.lua successfully prior to this. Thanks in advance.

Query: How to work with variable-width input images?

@Halfish the images of the numeral-sequences you create are all of the same dimensions (36 x 255). How can one go about using variable width images instead? (36 x var-width)

I see that you feed a complete image to the model at a time, rather than as a sequence of 255 pixel strips, each 36 pixels high. Can you suggest how this can be modified to handle variable-length inputs?
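One common way to approach this question, sketched below in plain Python, is to treat each image as a sequence of column strips and to pad a batch to its widest image while recording the true widths. This is not taken from the repository and is not the author's answer. The helper names `image_to_strips` and `pad_batch` are hypothetical, and how the recorded lengths are used (for example, to ignore padded time steps in the loss) depends on the framework.

```python
def image_to_strips(image):
    """Turn an H x W image (list of pixel rows) into W column strips of length H."""
    height = len(image)
    width = len(image[0])
    return [[image[r][c] for r in range(height)] for c in range(width)]


def pad_batch(sequences, pad_value=0.0):
    """Pad strip sequences to the longest one; also return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    strip_size = len(sequences[0][0])
    padded = [
        seq + [[pad_value] * strip_size for _ in range(max_len - len(seq))]
        for seq in sequences
    ]
    return padded, lengths


# Two tiny "images" of height 2 with widths 3 and 1.
wide = [[1, 2, 3],
        [4, 5, 6]]
narrow = [[7],
          [8]]
batch, lengths = pad_batch([image_to_strips(wide), image_to_strips(narrow)])
```

Here every sequence in `batch` has three time steps of two pixels each, and `lengths` records that only the first step of the second sequence is real data.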
