Python-based tools for document analysis and OCR

License: Apache License 2.0

ocropy

OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do some image preprocessing, and possibly also train new models.

In addition to the recognition scripts themselves, there are a number of scripts for ground truth editing and correction, measuring error rates, determining confusion matrices, etc. OCRopus commands will generally print a stack trace along with an error message; this is not generally indicative of a problem (in a future release, we'll suppress the stack trace by default since it seems to confuse too many users).

Installing

To install OCRopus dependencies system-wide:

$ sudo apt-get install $(cat PACKAGES)
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ sudo python setup.py install

Alternatively, dependencies can be installed into a Python Virtual Environment:

$ virtualenv ocropus_venv/
$ source ocropus_venv/bin/activate
$ pip install -r requirements.txt
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ python setup.py install

An additional method using Conda is also possible:

$ conda create -n ocropus_env python=2.7
$ conda activate ocropus_env
$ conda install --file requirements.txt
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ python setup.py install

To test the recognizer, run:

$ ./run-test

Running

To recognize pages of text, you need to run separate commands: binarization, page layout analysis, and text line recognition. The default parameters and settings of OCRopus assume 300dpi binary black-on-white images. If your images are scanned at a different resolution, the simplest thing to do is to downscale/upscale them to 300dpi. The text line recognizer is fairly robust to different resolutions, but the layout analysis is quite resolution dependent.
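The rescaling advice above comes down to a single ratio. A minimal sketch, assuming the scan resolution is known; the helper name `size_at_300dpi` is hypothetical and not part of OCRopus:

```python
def size_at_300dpi(width, height, source_dpi, target_dpi=300):
    """Return (scale, new_width, new_height) for resampling to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # A 200 dpi scan needs a factor of 300/200 = 1.5 in both directions.
    scale = float(target_dpi) / source_dpi
    return scale, int(round(width * scale)), int(round(height * scale))


scale, new_w, new_h = size_at_300dpi(1700, 2200, source_dpi=200)
print(scale, new_w, new_h)
```

The resulting size can be handed to whatever image library you already use for resampling before running the binarization step.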

Here is an example for a page of Fraktur text (German); you need to download the Fraktur model from https://github.com/zuphilip/ocropy-models/raw/master/fraktur.pyrnn.gz to run this example:

# perform binarization
./ocropus-nlbin tests/ersch.png -o book

# perform page layout analysis
./ocropus-gpageseg 'book/????.bin.png'

# perform text line recognition (on four cores, with a fraktur model)
./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png'

# generate HTML output
./ocropus-hocr 'book/????.bin.png' -o ersch.html

# display the output
firefox ersch.html
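If you want to drive the same four steps from a script, one possible sketch is shown here. The function names `fraktur_pipeline` and `run_pipeline` are hypothetical; the command names, options, and file patterns are copied from the example above. The glob patterns are passed through unexpanded, as they are in the quoted shell commands, on the assumption that the tools expand them themselves.

```python
import subprocess


def fraktur_pipeline(image, book_dir="book", model="models/fraktur.pyrnn.gz",
                     cores=4, html_out="ersch.html"):
    """Build the four command lines from the example above as argv lists."""
    pages = book_dir + "/????.bin.png"
    lines = book_dir + "/????/??????.bin.png"
    return [
        ["./ocropus-nlbin", image, "-o", book_dir],                 # binarization
        ["./ocropus-gpageseg", pages],                              # layout analysis
        ["./ocropus-rpred", "-Q", str(cores), "-m", model, lines],  # recognition
        ["./ocropus-hocr", pages, "-o", html_out],                  # HTML output
    ]


def run_pipeline(commands):
    """Run each step in order; check_call raises on the first failing step."""
    for argv in commands:
        subprocess.check_call(argv)


# Print the commands instead of running them, so the sketch has no side effects.
for argv in fraktur_pipeline("tests/ersch.png"):
    print(" ".join(argv))
```

Calling `run_pipeline(fraktur_pipeline("tests/ersch.png"))` from the repository root would then execute the steps in sequence.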

There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including "?"), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release, and, of course, you are welcome to contribute new, trained models.

You can also generate training data using ocropus-linegen:

ocropus-linegen -t tests/tomsawyer.txt -f tests/DejaVuSans.ttf

This will create a directory "linegen/..." containing training data suitable for training OCRopus with synthetic data.

Roadmap

Project Announcements
The text line recognizer has been ported to C++ and is now a separate project, the CLSTM project, available here: https://github.com/tmbdev/clstm
New GPU-capable text line recognizers and deep-learning based layout analysis methods are in the works and will be published as separate projects some time in 2017.
Please welcome @zuphilip and @kba as additional project maintainers. @tmb is busy developing new DNN models for document analysis (among other things). (10/15/2016)

A lot of excellent packages have become available for deep learning, vision, and GPU computing over the last few years. At the same time, it has become feasible now to address problems like layout analysis and text line following through attentional and reinforcement learning mechanisms. I (@tmb) am planning on developing new software using these new tools and techniques for the traditional document analysis tasks. These will become available as separate projects.

Note that for text line recognition and language modeling, you can also use the CLSTM command line tools. Except for taking different command line options, they are otherwise drop-in replacements for the Python-based text line recognizer.

Contributing

OCRopy and CLSTM are both command line driven programs. The best way to contribute is to create new command line programs using the same (simple) persistent representations as the rest of OCRopus.

The biggest needs are in the following areas:

text/image segmentation
text line detection and extraction
output generation (hOCR and hOCR-to-* transformations)

CLSTM vs OCRopy

The CLSTM project (https://github.com/tmbdev/clstm) is a replacement for ocropus-rtrain and ocropus-rpred in C++ (it used to be a subproject of ocropy but has been moved into a separate project now). It is significantly faster than the Python versions and has minimal library dependencies, so it is suitable for embedding into C++ programs.

Python and C++ models cannot be interchanged, both because the save file formats are different and because the text line normalization is slightly different. Error rates are about the same.

In addition, the C++ command line tool (clstmctc) has different command line options and currently requires loading training data into HDF5 files, instead of being trained off a list of image files directly (image file-based training will be added to clstmctc soon).

The CLSTM project also provides LSTM-based language modeling that works very well with post-processing and correcting OCR output, as well as solving a number of other OCR-related tasks, such as dehyphenation or changes in orthography (see our publications). You can train language models using clstmtext.

Generally, your best bet for CLSTM and OCRopy is to rely only on the command line tools; that makes it easy to replace different components. In addition, you should keep your OCR training data in .png/.gt.txt files so that you can easily retrain models as better recognizers become available.
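To check that a training directory follows the .png/.gt.txt convention, a small script can pair each line image with its transcript. This is only a sketch: the function name `pair_ground_truth` is hypothetical, and it assumes that the shared base name is everything before the first dot (so `010001.bin.png` goes with `010001.gt.txt`).

```python
import os


def pair_ground_truth(paths):
    """Pair line images with transcripts sharing a directory and base name.

    Assumption: the base name is everything before the first dot.
    Returns (paired, missing), where missing lists images without a transcript.
    """
    images, truths = {}, {}
    for path in paths:
        directory, name = os.path.split(path)
        key = (directory, name.split(".")[0])
        if name.endswith(".gt.txt"):
            truths[key] = path
        elif name.endswith(".png"):
            images[key] = path
    paired = [(images[k], truths[k]) for k in sorted(images) if k in truths]
    missing = [images[k] for k in sorted(images) if k not in truths]
    return paired, missing


paired, missing = pair_ground_truth([
    "book/0001/010001.bin.png",
    "book/0001/010001.gt.txt",
    "book/0001/010002.bin.png",
])
print(paired)
print(missing)
```

Images reported in `missing` still need a transcript before they can be used for retraining.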

After making CLSTM a full replacement for ocropus-rtrain/ocropus-rpred, the next step will be to replace the binarization, text/image segmentation, and layout analysis in OCRopus with trainable 2D LSTM models.

ocropy's Issues

can't download en-default.pyrnn.gz

$ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
--2015-01-09 20:06:04--  http://www.tmbdev.net/en-default.pyrnn.gz
Resolving www.tmbdev.net (www.tmbdev.net)... 69.163.203.33
Connecting to www.tmbdev.net (www.tmbdev.net)|69.163.203.33|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-01-09 20:06:05 ERROR 404: Not Found.

Could you recommend some materials about the algorithm you use?

Hi! I find this project very interesting and I want to learn from it.
Could you recommend some materials (papers or books) that you referred to in this project?
Thank you very much.

multi-language documents

Hello,
can ocropy handle multi-language text in the same document (image)?

ocropus-linegen does not handle Unicode well

This version of ocropus-linegen does not handle languages other than English. For example, if you try to use it to generate French text lines, you will see boxes in place of some accented letters. Beware!

test case failed

After I installed everything and ran the test, there was an error in ocropus-hocr related to matplotlib:

  File "./ocropus-hocr", line 13, in <module>
    from pylab import *
  File "/usr/local/lib/python2.7/dist-packages/pylab.py", line 1, in <module>
    from matplotlib.pylab import *
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/pylab.py", line 274, in <module>
    from matplotlib.pyplot import *
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 109, in <module>
    _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
    globals(),locals(),[backend_name],0)
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_gtk.py", line 36, in <module>
    from matplotlib.backends.backend_gdk import RendererGDK, FigureCanvasGDK
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_gdk.py", line 33, in <module>
    from matplotlib.backends._backend_gdk import pixbuf_get_pixels_array
ImportError: No module named _backend_gdk

Please have a look.
My matplotlib version is 1.4.2. Which version should I use?
I am using Python 2.7 on Ubuntu 14.04.

Thank you.

Please provide en-default.pyrnn.gz and fraktur.pyrnn.gz models

http://www.tmbdev.net is not accessible.

It would be great if the en-default.pyrnn.gz and fraktur.pyrnn.gz models could be bundled with the code.

ocropus-rpred crashes when printing unicode text without locale

When ocropus-rpred runs without a locale (e.g. when started via subprocess.Popen) and recognizes non-ASCII characters, it tries to print them directly to the command line. This causes a UnicodeEncodeError, and the line is then skipped. The offending line is 197 in ocropus-rpred:

   print fname,":",pred

As similar calls appear in a wide range of ocropus utilities, it would be sensible to wrap them all in the correct encode() calls.
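One way such a wrapper could look is sketched below. The helper names `format_prediction` and `write_prediction` are hypothetical and not part of ocropus, and UTF-8 is an assumed default; the point is only that the line is encoded explicitly and written as bytes, so the result does not depend on the locale.

```python
import io


def format_prediction(fname, pred, encoding="utf-8"):
    """Return one output line as bytes, so no locale is needed to emit it."""
    line = u"%s : %s\n" % (fname, pred)
    return line.encode(encoding)


def write_prediction(stream, fname, pred, encoding="utf-8"):
    """Write the encoded line to a binary stream (e.g. a file opened 'wb')."""
    stream.write(format_prediction(fname, pred, encoding))


# Demonstration with an in-memory binary stream and a non-ASCII prediction.
buf = io.BytesIO()
write_prediction(buf, u"0001/010001.bin.png", u"Gr\u00fc\u00dfe")
print(repr(buf.getvalue()))
```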

Possible bugs?

Hello,
When I run ocropus-ltrain, it occasionally reports "FloatingPointError: overflow encountered in exp", and the program seems to restart from the most recent saved state. The problem occurs mainly in the "ffunc" function in lstm.py, which defines the logistic (sigmoid) function as 1.0/(1.0+exp(-x)). The same problem also occurs in the "sigmoid" function. I think this may be caused by values of x with large magnitude. In the CLSTM source code, x is clipped to 20 for positive values and to -20 for negative values. With this clipping, the program runs without the warning.
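The clipping described above can be illustrated with a scalar version of the function. This is a sketch using only the standard library, not the code from lstm.py or CLSTM; `clipped_sigmoid` is a hypothetical name, and the limit of 20 is the value quoted above. For NumPy arrays, `numpy.clip` would play the same role.

```python
import math


def clipped_sigmoid(x, limit=20.0):
    """Logistic function 1/(1+exp(-x)) with x clipped to [-limit, limit]."""
    # Without clipping, math.exp(-x) overflows for very negative x.
    x = max(-limit, min(limit, x))
    return 1.0 / (1.0 + math.exp(-x))


for value in (-1000.0, 0.0, 1000.0):
    print(value, clipped_sigmoid(value))
```

The price of clipping is that the output saturates slightly inside (0, 1) instead of reaching the limits, which is harmless for gate activations.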

Another problem is that the "backward" method in class "Parallel" returns None. This is correct for a single-layer BLSTM system, but in a multi-layer configuration that stacks parallel BLSTM layers on top of one another it leads to an error, because the deltas of the subsequent layer are assigned as the current deltas. So perhaps the method should return the deltas.

Best,

doc

Is there any documentation for this?

Pickle fails with EOFError while loading models on Windows

Here's the traceback (I edited the path to \ocropy):

D:\ocropy>python ocropus-rpred -Q 2 -m D:\ocropy\models\fraktur.pyrnn.gz  T:\0001.bin.png

########## ocropus-rpred -Q 2 -m D:\ocropy\models\fr

#inputs 1
# loading object D:\ocropy\models\fraktur.pyrnn.gz
Traceback (most recent call last):
  File "ocropus-rpred", line 103, in <module>
    network = ocrolib.load_object(args.model,verbose=1)
  File "D:\ocropy\ocrolib\common.py", line 513, in load_object
    return unpickler.load()
EOFError

0001.bin.png is:

fraktur.pyrnn.gz is freshly downloaded from http://www.tmbdev.net/fraktur.pyrnn.gz. I am running Win7 x64 and Python 2.7.6 x64.

Also, why do the instructions first suggest downloading en-default.pyrnn.gz and then running with fraktur.pyrnn.gz?

ValueError: shape mismatch in `ocropus-gpageseg`

When I run ocropus-gpageseg on the following image:

I get a ValueError which prevents any lines from being extracted:

$ ocropus-gpageseg -n --minscale 10 --maxcolseps 0 book-703662b.crop/0001.bin.png

########## /usr/local/bin/ocropus-gpageseg -n --minscale 10 --maxcolsep

book-703662b.crop/0001.bin.png
scale 13.8564064606
computing segmentation
computing column separators
computing lines
propagating labels
spreading labels
number of lines 4
finding reading order
writing lines
1: (slice(3L, 32L, None), slice(3L, 607L, None))
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-gpageseg", line 435, in safe_process1
    process1(job)
  File "/usr/local/bin/ocropus-gpageseg", line 408, in process1
    binline = psegutils.extract_masked(1-cleaned,l,pad=args.pad,expand=args.expand)
  File "/usr/local/lib/python2.7/site-packages/ocrolib/toplevel.py", line 213, in argument_checks
    result = f(*args,**kw)
  File "/usr/local/lib/python2.7/site-packages/ocrolib/psegutils.py", line 114, in extract_masked
    line = where(mask,line,amax(line))
ValueError: shape mismatch: objects cannot be broadcast to a single shape

error while training

After executing the following command (on 156 files of ground truth text and images):
ocropus-rtrain gt/????/*.png -F 10000 -o mub_combined &
I get the following reproducible error:

454 150.32 (1486, 48) gt/0001/01000b.bin.png
TRU: u'quod dicitur Fulda, quod est situm in pago Grapfeld, constructum in honore sancti'
ALN: u'quuod dicituur Fuulda, qquod et situumm in pagoo Grapfeld, construuctuuumm in honnore '
OUT: u' iiii ii te ti imm tm e iii eutmut m mi eii '

oops, got FloatingPointError overflow encountered in exp

Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 228, in <module>
pcs = network.trainSequence(line,cs,update=do_update,key=fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 863, in trainSequence
self.outputs = array(self.lstm.forward(xs))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 587, in forward
xs = net.forward(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 636, in forward
outputs = [net.forward(xs) for net in self.nets]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 545, in forward
self.WIP,self.WFP,self.WOP)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 419, in forward_py
go[t] = ffunc(gox[t])
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 367, in ffunc
return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 232, in <module>
network = ocrolib.load_object(last_save)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 502, in load_object
fname = ocropus_find_file(fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 680, in ocropus_find_file
if os.path.exists(fname):
File "/usr/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

Another case, with half of the files (directory 0001 only):

960 110.63 (1490, 48) gt/0001/010022.bin.png
TRU: u'in honorem\u2074 domini salvatoris Jesu Christi et beate Marie genetricis\u2075 eius episco-'
ALN: u'in honorem~ domini salvatoris Jesu Christi et beate MMarie genetricis eius episco-'
OUT: u'iu bouoreu ouiui salvatoris lesu bristi et beate arie geuetricis eius episoo-'

oops, got FloatingPointError overflow encountered in exp

Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 228, in <module>
pcs = network.trainSequence(line,cs,update=do_update,key=fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 863, in trainSequence
self.outputs = array(self.lstm.forward(xs))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 587, in forward
xs = net.forward(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 636, in forward
outputs = [net.forward(xs) for net in self.nets]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 619, in forward
return self.net.forward(xs[::-1])[::-1]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 545, in forward
self.WIP,self.WFP,self.WOP)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 419, in forward_py
go[t] = ffunc(gox[t])
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 367, in ffunc
return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 232, in <module>
network = ocrolib.load_object(last_save)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 502, in load_object
fname = ocropus_find_file(fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 680, in ocropus_find_file
if os.path.exists(fname):
File "/usr/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found
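In both runs above, the second traceback arises because load_object is called with last_save still set to None, that is, before any checkpoint exists. A guard along the following lines would avoid the TypeError. This is a sketch under assumptions, not the code in ocropus-rtrain: `recover_network` and its two callable parameters are hypothetical names.

```python
def recover_network(last_save, load_object, make_network):
    """Return a network to continue training with after a numeric failure.

    last_save    -- path of the most recent checkpoint, or None if none exists
    load_object  -- callable that loads a checkpoint from a path
    make_network -- callable that builds a freshly initialized network
    """
    if last_save is None:
        # No checkpoint has been written yet, so there is nothing to reload.
        return make_network()
    return load_object(last_save)


# Stand-in callables to show both branches.
print(recover_network(None, lambda p: ("loaded", p), lambda: "fresh"))
print(recover_network("model-00001000.pyrnn.gz", lambda p: ("loaded", p), lambda: "fresh"))
```

Whether restarting from a fresh network is the right recovery is a design choice; stopping with a clear error message would be an equally reasonable alternative.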

Understanding ocropy installation

Hello everyone,

I'm working on a project and need to use ocropy. I tried to install it on Windows but failed, so I moved to Ubuntu. I'm not an expert when it comes to Ubuntu, so I'm stuck now.

I have installed Python 2.7 on Ubuntu and all the packages in requirements 1 and 2. I've also installed OpenCV.

Then I tried to install ocropy as described in the README but failed at this line:
mv en-default.pyrnn.gz models/

I don't actually understand it: the previous line downloads a .gz file, then we move it to a models directory (which has not been created yet!), and then we need to run setup.py, which is not there. So I don't know if I'm missing something. I know I might sound ignorant to some of you, but I'm really new to this and I'm doing my best to understand. I also didn't find any helpful information on the net regarding my issue.

Any help is appreciated, Thank you in advance.

Installing Ocropus in Mac Yosemite

Hi,
I want to install OCRopus on a Mac. I've followed the guidelines from http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html and from the ocropy repository. I managed to do this part:
$ brew install python
$ brew install opencv
$ brew install homebrew/python/scipy
and also this part:
$ cd /usr/local/Cellar/python/2.7.6_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
$ rm cv.py cv2.so
$ ln -s /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv.py cv.py
$ ln -s /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv2.so cv2.so
But I'm not sure about the instructions provided on the ocropy GitHub page. Which section should I use, the system-wide one or the Python virtual environment one? I'm on Mac OS X Yosemite.

To install OCRopus dependencies system-wide:

$ sudo apt-get install $(cat PACKAGES)
$ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ sudo python setup.py install

Alternatively, dependencies can be installed into a Python Virtual Environment:

$ virtualenv ocropus_venv/
$ source ocropus_venv/bin/activate
$ pip install -r requirements_1.txt

tables has some dependencies which must be installed first:

$ pip install -r requirements_2.txt
$ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/

Could someone give a step-by-step guide to follow? I'm a bit lost. Thanks!

A maxheight option for lines

I'm running ocropus-nlbin and ocropus-gpageseg on this image:

The first line that comes out of ocropus-gpageseg is this:

i.e. two lines joined into one. Naturally this produces nonsensical output when I run it through ocropus-rpred. Admittedly this is a hard case (there's literally one pixel separating the two lines in that image), but it might be nice to have some kind of maxlineheight option I could pass to ocropus-gpageseg to give it a hint that this is wrong.

Running ocropus-gpageseg --scale 11 --minscale 11 does split the lines, but I'm reluctant to explicitly set --scale for all my images. There are at least two fonts present in the collection, one of which is taller than the other. So it's hard to say the x-height in advance. But I can safely say that if any image of a line is >50 px, then something went wrong.

I'm not sure if such an option would make sense or if there's a better solution, but I wanted to toss the idea out there!
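Until such an option exists, the ">50 px means something went wrong" rule can be applied as a check after segmentation. The sketch below is a hypothetical helper, not an ocropus-gpageseg feature; it assumes the heights of the extracted line images have already been read into a dictionary.

```python
def flag_tall_lines(line_heights, max_height=50):
    """Return names of extracted line images taller than max_height pixels.

    line_heights maps an image name to its height in pixels.
    """
    return sorted(name for name, height in line_heights.items()
                  if height > max_height)


heights = {"010001.bin.png": 31, "010002.bin.png": 64, "010003.bin.png": 29}
print(flag_tall_lines(heights))
```

Pages with flagged lines could then be re-run with an explicit --scale, leaving the default behavior for the rest of the collection.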

Commands:

ocropus-nlbin -n 734090b.crop.png -o book
ocropus-gpageseg -n [--scale 11 --minscale 11] --maxcolseps 0 book/????.bin.png

Thanks for the great OCR library!

plan for supporting CUDA or OpenCL?

Hello,
I'm currently training a big model, and learning seems rather slow without a GPU.
Do you have any plans to support CUDA or OpenCL?
Thanks!

Python-based ocropy faster than c++ version

Hello,
When training a text line recognizer, I found that the Python-based ocropy is even faster than the C++ version (on Red Hat 4.4.7, using ocropus-ltrain). This is very strange, since CLSTM is said to be faster than ocropy. Has anyone else seen the same behavior? What is the reason behind it?
Best,
Thanks!

Using Character Probabilities as Confidence Estimates

Is there a way that I could use the character probabilities output by the LSTM network (and shown in the --show diagrams of rpred) to estimate confidence for a given transcription? It's unclear how I would actually access those values.

Where can I find source code for previous Ocropus?

The site http://code.google.com/p/ocropus is no longer available. I am reading a few papers related to the old OCRopus and want to take a look at the code.

Avoid text recognition in image areas of the page

In several tests ocropus/ocropy tries to recognize some text within an image area in my pages. How can this be avoided? Here is an example of such a page: normal-beispiel-mit-bild

I run the same sequence of commands as in run-test, and ocropus/ocropy recognizes two columns: one with the actual text and the other with some nonsensical symbols from the picture.

README virtualenv instructions

Hi,

I think that the README instructions should read:

source ocropus_venv/bin/activate

instead of

source ocropus_venv/bin/source

TypeError: dot() takes no keyword arguments

When I execute the ./run-test and ./run-rtrain commands, the error below appears. Please check and help me:

TypeError: dot() takes no keyword arguments
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rpred", line 245, in safe_process1
return process1(arg)
File "/usr/local/bin/ocropus-rpred", line 150, in process1
pred = network.predictString(line)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 934, in predictString
cs = self.predictSequence(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 884, in predictSequence
self.outputs = array(self.lstm.forward(xs))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 605, in forward
xs = net.forward(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 661, in forward
outputs = [net.forward(xs) for net in self.nets]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 559, in forward
self.WIP,self.WFP,self.WOP)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 414, in forward_py
dot(WGI,source[t],out=gix[t])

Executing ./run-rtrain gives the following error message:

$ ./run-rtrain

tar -zxf tests/uw3-500.tgz
ocropus-rtrain 'book/*/*.bin.png' -d 5 -o uw3-500-model
inputs 500

tests None

CenterNormalizer

using default codec

charset size 157 [ ~!"#$%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz[]^_`abcdefghijklmnopqrstuvwxyz{|}隆垄拢搂漏芦庐掳露禄驴妹?
```
                                                    妹妹妹妹妹妹妹犆⒚っγ┟疵睹访姑幻济颗排糕犫♀⑩ｂ光衡猹猥猞]
```
last_trial 0
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 285, in <module>
pcs = network.trainSequence(line,cs,update=do_update,key=fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 890, in trainSequence
self.outputs = array(self.lstm.forward(xs))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 605, in forward
xs = net.forward(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 661, in forward
outputs = [net.forward(xs) for net in self.nets]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 559, in forward
self.WIP,self.WFP,self.WOP)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 414, in forward_py
dot(WGI,source[t],out=gix[t])
TypeError: dot() takes no keyword arguments

Question about line segmenting

(Examples taken from this pdf - https://www.dropbox.com/s/6sy77shnro7sqdf/6.pdf?dl=0)

I have a bunch of files from which I've extracted the text both line by line and as one coherent blob, and I'm trying to understand the best practices for using ocropus-linegen.

An example in the document given is lines 5-8 (reproduced below):

流動資産は、たな卸資産が減少したものの、受取手形及び売掛金などが増加したことなどにより、 前連結会計年度末に比べ5億65百万円増加し、630億33百万円となりました。固定資産は、有形固 定資産、無形固定資産ともに減価償却により減少したものの、投資有価証券の評価差額が増加した ことにより、前連結会計年度末に比べ5億8百万円増加し、212億86百万円となりました。

Here, I could feed that whole blob to ocropus-linegen or I could feed it line by line:

流動資産は、たな卸資産が減少したものの、受取手形及び売掛金などが増加したことなどにより、
...
ことにより、前連結会計年度末に比べ5億8百万円増加し、212億86百万円となりました

I get the sense that the latter is what it expects. Is that right?

For another example, see the table further down on that page. The second row is:

自己資本比率    28.1    ...    23.4

Does ocropus-linegen want the full line (row), the full line with the spacing, or would it rather have each cell individually?

Thanks.

ValueError: setting an array element with a sequence.

On some images, the dewarp function used by rpred and rtrain throws an error because the padding is insufficient. This seems to be particularly true of images that have borders or other noise near the top and bottom edges. Attached is an example.

I have a simple but unsatisfying fix for this, in which I double the amount of padding. See the commit in my GitHub fork (branch bugfix) at:
braddockcg@be2e6d1

The full error is:

Traceback (most recent call last):
File "/usr/local/bin/ocropus-rpred", line 245, in safe_process1
return process1(arg)
File "/usr/local/bin/ocropus-rpred", line 145, in process1
line = lnorm.normalize(line,cval=amax(line))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lineest.py", line 59, in normalize
dewarped = self.dewarp(img,cval=cval,dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lineest.py", line 56, in dewarp
dewarped = array(dewarped,dtype=dtype).T
ValueError: setting an array element with a sequence.

module pylab not contained

I just want to let you know that some packages may be missing from the installation instructions. I followed them, and when I got to the point of running the test, I got the following error:

~/ocropy$ ./run-test
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-nlbin", line 3, in <module>
    from pylab import *

Running

sudo apt-get install python-numpy python-scipy python-matplotlib

afterwards resolved the problem for me and I could successfully run the test.

pylab.uint32 error (ubuntu 14.04)

Hi. I tried installing the dependencies using pip on master and ran the tests (on Ubuntu 14.04.2 LTS), and I get ImportError: cannot import name uint32 (full stack trace at the bottom). I also tried installing the exact versions listed in requirements_1.txt, with no luck.

I checked out v1.0, removed the pip-installed libraries, and installed the dependencies with aptitude, and I still get the same error:

Traceback (most recent call last):
  File "/usr/local/bin/ocropus-nlbin", line 9, in <module>
    import ocrolib
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/__init__.py", line 12, in <module>
    from common import *
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 16, in <module>
    import ligatures
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/ligatures.py", line 8, in <module>
    from pylab import uint32
ImportError: cannot import name uint32

What do you think is wrong?

Cannot run the tests on OS X Yosemite

Problem

I am following the installation guide with virtualenv installed and managed to install requirements 1 and 2 successfully. The next step is to run the tests, but I cannot execute them. First, simply running ./run-test failed with a zsh error along the lines of "command ocropus-nlbin cannot be found", so I changed the script as follows:

#!/bin/zsh -e

rm -rf temp
./ocropus-nlbin tests/testpage.png -o temp
./ocropus-gpageseg 'temp/????.bin.png'
./ocropus-rpred -n 'temp/????/??????.bin.png'
./ocropus-hocr 'temp/????.bin.png' -o temp.html
./ocropus-visualize-results temp
./ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html

echo "to see recognition results, type: firefox temp.html"
echo "to see correction page, type: firefox temp-correction.html"
echo "to see details on the recognition process, type: firefox temp/index.html"

After that it apparently found the commands, but ./run-test produced the following error:

clang: warning: -O4 is equivalent to -O3
ld: library not found for -lgomp
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Traceback (most recent call last):
  File "./ocropus-nlbin", line 9, in <module>
    import ocrolib
  File "/Users/nyxz/dev/workspace/python/ocropy/ocrolib/__init__.py", line 12, in <module>
    from common import *
  File "/Users/nyxz/dev/workspace/python/ocropy/ocrolib/common.py", line 18, in <module>
    import lstm
  File "/Users/nyxz/dev/workspace/python/ocropy/ocrolib/lstm.py", line 32, in <module>
    import nutils
  File "/Users/nyxz/dev/workspace/python/ocropy/ocrolib/nutils.py", line 25, in <module>
    lstm_native = compile_and_load(lstm_utils)
  File "/Users/nyxz/dev/workspace/python/ocropy/ocrolib/native.py", line 67, in compile_and_load
    path = compile_and_find(c_string,**keys)
  File "/Users/nyxz/dev/workspace/python/ocropy/ocrolib/native.py", line 63, in compile_and_find
    raise CompileError()
ocrolib.native.CompileError

Previous steps

To get the requirements to install successfully, I needed to install some additional things. Here is everything I had to install:

pip
python3
virtualenv
hdf5
pylab

I use zsh, as you may have noticed. I don't know what impact this has on the installation, so I am just mentioning it.

Any suggestions on how to finish the installation so that the tests run normally are highly appreciated!

Metadata about detected characters: quality scores + alternatives

The ocropus-rpred tool outputs text files of predicted text for each image. It would be nice if there were a way for it to output quality scores for each character, as well as alternatives.

For example, this line:

is being transcribed as:
2. 14E St. Lrand Loncourse, n.w. cor.

It's possible that G is the second most-likely candidate for the first letter in Lrand and C for Loncourse. If I were to build some kind of language model as a post-processing step, it would be clear that G and C are the better choices at those positions.

Some kind of JSON output would be helpful. It might look something like:

[
  {
    "x": 216,
    "char": "L",
    "candidates": [
      {
        "char": "L",
        "score": 0.9
      },
      {
        "char": "G",
        "score": 0.8
      },
      ...
    ]
  },
  ...
]
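To make the proposed format concrete, here is a sketch that builds one such record from a per-character score table. Everything in it is hypothetical: the function `top_candidates` is not an ocropus API, and the scores are made-up values in the spirit of the example above, since ocropus-rpred does not currently expose them.

```python
import json


def top_candidates(x, scores, k=2):
    """Build one record in the proposed format from a {char: score} mapping."""
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
    return {
        "x": x,
        "char": ranked[0][0],
        "candidates": [{"char": c, "score": s} for c, s in ranked],
    }


# Made-up scores for the first letter of "Lrand".
record = top_candidates(216, {"L": 0.9, "G": 0.8, "C": 0.1})
print(json.dumps([record], indent=2))
```

A post-processing language model could then consider every entry in "candidates" rather than only the top choice.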

What is the "ALN" line?

When I run ocropus-rtrain, I see lines like this:

1000 56.70 (726, 48) 704213b-crop-01000d.png
   TRU: u'Eugene L. Armbruster Collection.'
   ALN: u'Eugene L. Armbbbruster Collection.'
   OUT: u're S. rrter eleoton.'

TRU and OUT are pretty clear. But what is ALN? It's usually better than OUT, especially when I first start training the model (as in the example above). Is ALN based on the predictions, or does it somehow incorporate truth data? Could I use it instead of the output of ocropus-rpred (which matches OUT)?

(Let me know if you'd prefer that I post these sorts of questions on the mailing list. I generally find content much easier to find on GitHub than in mailing list archives.)

clstm.py not found

When running ocropus-ltrain, a 'clstm module not found' error pops up. There is a setup.py file in the clstm directory; when I run it, the following complaint appears:
file clstm.py (for module clstm) not found

Where can I find clstm.py file?

FloatingPointError while training

I followed the "run-rtrain" document to train the RNN. I got the training set with " tar -zxf tests/uw3-500.tgz". I didn't change anything, neither the RNN structure nor the parameters. I ran into the following problem while training: FloatingPointError: overflow encountered in exp. However, the program does not stop. It keeps iterating while showing the above FloatingPointError message again and again.
I reduced the learning rate to half of the default value, but the same problem occurs. Am I training the network the wrong way?
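
The message indicates that an exponential was evaluated on an argument too large for a float. NumPy reports this as FloatingPointError when its error handling is set to raise; plain Python's math.exp raises OverflowError instead. One common mitigation is to clamp the argument before calling exp. The sketch below shows that idea in plain Python; the bound of 100 and the name `safe_sigmoid` are assumptions for illustration and are not taken from ocropy.

```python
import math


def naive_sigmoid(x):
    """Logistic function without any protection against overflow."""
    return 1.0 / (1.0 + math.exp(-x))


def safe_sigmoid(x, bound=100.0):
    """Logistic function with the argument clamped to [-bound, bound]."""
    # Clamping keeps exp() away from arguments large enough to overflow.
    x = max(-bound, min(bound, x))
    return 1.0 / (1.0 + math.exp(-x))


try:
    naive_sigmoid(-1000.0)
except OverflowError as exc:
    print("naive version failed:", exc)

print(safe_sigmoid(-1000.0))
```

Clamping hides the symptom only; if the activations grow this large, the training setup itself may still need attention.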

Possible bug in ocropus-rtrain

During training, an exception is raised if a floating point error occurs. The script (ocropus-rtrain) then loads the previous model using load_lstm(last_save) (line 289 in ocropus-rtrain).
Now, if the floating point error is raised during the first mini-batch, no model is found, because last_save is still None. The result is:

Traceback (most recent call last):
  File "../ocropy/ocropus-rtrain", line 313, in <module>
    network = load_lstm(last_save)
  File "../ocropy/ocropus-rtrain", line 191, in load_lstm
    network = ocrolib.load_object(last_save)
  File "/export/home/adnan/ocropy/ocrolib/common.py", line 503, in load_object
    fname = ocropus_find_file(fname)
  File "/export/home/adnan/ocropy/ocrolib/common.py", line 682, in ocropus_find_file
    if os.path.exists(fname):
  File "/usr/lib/python2.7/genericpath.py", line 18, in exists
    os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

In this case, there should be a way to restart the training.
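
One way to handle this case is to check for a missing checkpoint before reloading. The sketch below is not the ocropus-rtrain code; `reload_or_restart`, `load_model`, and `make_model` are hypothetical names standing in for whatever the script uses to load a saved network and to build a fresh one.

```python
def reload_or_restart(last_save, load_model, make_model):
    """Reload the last checkpoint if one exists, otherwise start afresh."""
    if last_save is None:
        # No checkpoint has been written yet, so there is nothing to load.
        return make_model()
    return load_model(last_save)


# Stand-ins for the real loader and constructor (assumptions for the demo).
def fake_load(path):
    return "model loaded from %s" % path


def fake_make():
    return "fresh model"


print(reload_or_restart(None, fake_load, fake_make))
print(reload_or_restart("model-00001000.pyrnn.gz", fake_load, fake_make))
```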

Sorry, I accidentally closed the issue "FloatingPointError while training".

My environment:

python 2.7.6

Ubuntu 14.04.1 LTS

gcc version 4.8.4

ocropus-gpageseg --> TypeError in error message for scale

This is the problem I hit when trying to OCR my PNGs:

book/0001.bin.png SKIPPED too many connnected components for a page image (21112 > 1176) (use -n to disable this check)

./ocropus-gpageseg -n book/0001.bin.png
INFO:
INFO: book/0001.bin.png
INFO: scale 4.472135955
Traceback (most recent call last):
  File "./ocropus-gpageseg", line 423, in safe_process1
    process1(job)
  File "./ocropus-gpageseg", line 373, in process1
    print_error("%s: scale (%g) less than --minscale; skipping\n"%(fname,str(scale)))
TypeError: float argument required, not str
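
The traceback points at the format call itself: `%g` needs a number, but the code passes `str(scale)`. A small reproduction with the values from the log above, followed by the same format string with the number passed directly (the variable names here are only for the demonstration):

```python
fname = "book/0001.bin.png"
scale = 4.472135955

# Passing a string to %g raises TypeError, as in the traceback above.
try:
    "%s: scale (%g) less than --minscale; skipping" % (fname, str(scale))
    failed = False
except TypeError:
    failed = True

# Passing the float itself works; %g formats it with six significant digits.
message = "%s: scale (%g) less than --minscale; skipping" % (fname, scale)
print(failed)
print(message)
```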

Error "expected a segmentation image" in ocropus-gpageseg with uint64

OS: OS X El Capitan.

 ./ocropus-gpageseg 'book/0001.bin.png'
INFO:  
INFO:  ########## ./ocropus-gpageseg book/0001.bin.png
INFO:  
INFO:  book/0001.bin.png
INFO:  scale 41.701318924
INFO:  computing segmentation
INFO:  computing column separators
INFO:  computing lines
INFO:  propagating labels
Traceback (most recent call last):
  File "./ocropus-gpageseg", line 423, in safe_process1
    process1(job)
  File "./ocropus-gpageseg", line 379, in process1
    segmentation = compute_segmentation(binary,scale)
  File "./ocropus-gpageseg", line 320, in compute_segmentation
    llabels = morph.propagate_labels(boxmap,seeds,conflict=0)
  File "/Users/lihanli/projects/ocropy/ocrolib/toplevel.py", line 209, in argument_checks
    raise e
CheckError: 
CheckError for argument labels of function <function propagate_labels at 0x1095f5578>
<ndarray-7ff76272e160 (5753, 4304) uint64 [0,168]> of type <type 'numpy.ndarray'>: expected a segmentation image

Line detection with different font sizes

The header line (title) of a document is often set in a larger font than the normal text. I have found that ocropus sometimes cuts a line in a larger font into two lines (which are then recognized as nonsense). If the header font is not too much larger (twice the size seems okay), the line splitting works. The problem occurs when the header font is three times the size of the normal font (36pt vs. 12pt). E.g. ocropus-gpageseg of 0002 bin

where the headline is split up into three lines:

i.e.

Can the parameters of ocropus-gpageseg be set to avoid such behaviour? Or can line detection be tweaked in general?

Other Languages

Is there support for non-Latin languages like Chinese, Japanese or Thai?

Applying patterns to line recognition

I have a large photo to OCR in which every line follows the same pattern, which can be expressed as the regular expression (\d+ [A-Z]+). Because the documents are old and unclear, parts of many lines are misrecognized. I am wondering whether I can make the recognizer read the first part of each line as digits only, which might help. (I know I could clean up the output afterwards, but that does not seem like a good option.) I am not familiar with LSTMs; do you know which file I should look at?
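
For comparison, the clean-up route mentioned in parentheses can be sketched as follows. This is post-processing of the recognized text, not a constraint inside the recognizer; the substitution table `TO_DIGIT` and the function `repair_line` are illustrative assumptions, and a real table would come from measured confusions.

```python
import re

# The line pattern from the question, anchored to the whole line.
LINE_PATTERN = re.compile(r"^(\d+) ([A-Z]+)$")

# Illustrative look-alike substitutions (assumed, not measured).
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1",
                          "S": "5", "B": "8"})


def repair_line(text):
    """Map look-alike letters in the first field to digits.

    Returns the repaired line, or None if it still does not fit the pattern.
    """
    head, sep, tail = text.partition(" ")
    candidate = head.translate(TO_DIGIT) + sep + tail
    return candidate if LINE_PATTERN.match(candidate) else None


print(repair_line("1O5 SMITH"))
print(repair_line("l2 JONES"))
print(repair_line("?? JONES"))
```

Lines that come back as None can be flagged for manual review instead of being silently accepted.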

Installing on Unix

Hello,

I am trying to install on Mac OS X Yosemite (10.10.1), using Anaconda. I have installed the packages in ./PACKAGES manually (curl python-scipy python-matplotlib python-tables firefox imagemagick python-opencv python-bs4), mostly using brew. The following commands threw no warnings:
$ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ sudo python setup.py install

The test throws the following error:
kungfujams-mbp:ocropy-master kungfujam$ ./run-test
clang: warning: -O4 is equivalent to -O3
ld: library not found for -lgomp
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Traceback (most recent call last):
  File "/Users/kungfujam/anaconda/bin/ocropus-nlbin", line 9, in <module>
    import ocrolib
  File "/Users/kungfujam/anaconda/lib/python2.7/site-packages/ocrolib/__init__.py", line 12, in <module>
    from common import *
  File "/Users/kungfujam/anaconda/lib/python2.7/site-packages/ocrolib/common.py", line 18, in <module>
    import lstm
  File "/Users/kungfujam/anaconda/lib/python2.7/site-packages/ocrolib/lstm.py", line 32, in <module>
    import nutils
  File "/Users/kungfujam/anaconda/lib/python2.7/site-packages/ocrolib/nutils.py", line 25, in <module>
    lstm_native = compile_and_load(lstm_utils)
  File "/Users/kungfujam/anaconda/lib/python2.7/site-packages/ocrolib/native.py", line 67, in compile_and_load
    path = compile_and_find(c_string,**keys)
  File "/Users/kungfujam/anaconda/lib/python2.7/site-packages/ocrolib/native.py", line 63, in compile_and_find
    raise CompileError()
ocrolib.native.CompileError

This may be a gcc/clang issue as seen here: http://stackoverflow.com/questions/20321988/error-enabling-openmp-ld-library-not-found-for-lgomp-and-clang-errors

Any advice gratefully received.

James

Unable to install on windows... wtf is 'source'?

The virtualenv/ stuff seems to install, but wtf is 'source'? as in:

source ocropus_venv/bin/activate

Can't do it! Can't find it - try finding something called source on google. Can't be done. wtf, over....

differences in Softmax implementation with clstm (separate-derivs branch)

How do W2 and DW2 in ocropy's Python Softmax implementation relate to W and w in the C++ implementation in clstm?

"ocropus-gtedit" in ./run-test does not exist

Could you please fix an issue on line 9 of ./run-test?

It refers to a script "ocropus-gtedit", but that script does not exist in the repo.

Error "array must contain integer values" in ocropus-gpageseg with uint64

Hello,

I'm trying to run the test script on OS X 10.10 with Python 2.7. An exception is thrown at the second step, when running ocropus-gpageseg.

MacBook-Pro-2:ocropy pujia$ PATH=$PATH:. ./run-test 
# tests/testpage.png
=== tests/testpage.png 1
estimating skew angle
estimating thresholds
rescaling
tests/testpage.png lo-hi (0.39 1.44) angle  0.1  no-normalization
writing

########## ./ocropus-gpageseg temp/????.bin.png

temp/0001.bin.png
Traceback (most recent call last):
  File "./ocropus-gpageseg", line 414, in safe_process1
    process1(job)
  File "./ocropus-gpageseg", line 356, in process1
    scale = psegutils.estimate_scale(binary)
  File "/Users/pujia/git-workspace/ocropy/ocrolib/psegutils.py", line 41, in estimate_scale
    objects = binary_objects(binary)
  File "/Users/pujia/git-workspace/ocropy/ocrolib/psegutils.py", line 37, in binary_objects
    objects = morph.find_objects(labels)
  File "/Users/pujia/git-workspace/ocropy/ocrolib/toplevel.py", line 209, in argument_checks
    raise e
CheckError: 
CheckError for argument image of function <function find_objects at 0x103972398>
<ndarray-7fbead2b6420 (3000, 2078) uint64 [0,5187]> of type <type 'numpy.ndarray'>: array must contain integer values

########## ./ocropus-rpred -n temp/????/??????.bin.png

Traceback (most recent call last):
  File "./ocropus-rpred", line 92, in <module>
    inputs = ocrolib.glob_all(args.files)
  File "/Users/pujia/git-workspace/ocropy/ocrolib/toplevel.py", line 213, in argument_checks
    result = f(*args,**kw)
  File "/Users/pujia/git-workspace/ocropy/ocrolib/common.py", line 654, in glob_all
    raise FileNotFound("%s: expansion did not yield any files"%arg)
ocrolib.common.FileNotFound: file not found temp/????/??????.bin.png: expansion did not yield any files

I just want to check whether this issue is known before I start debugging from the bottom up. Thanks.

Install guide for fedora

Hello,

Not sure this is a valid issue, so please just close it if not.

But could someone add a description of how to install everything on Fedora?

I tried it, and Fedora evidently does not use exactly the same package names,
since I couldn't install the requirements with yum/dnf or even pip.

Even just a confirmation that someone got it running on Fedora would be nice.

Error in ocropus-econf with some k-options

The ocropus-econf script fails with an error when comparisons are restricted to letters or digits only, e.g.

$ ocropus-econf -k digits output/*/*.gt.txt
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-econf", line 59, in <module>
    outputs = sorted(list(outputs))
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 560, in parallel_map
    result = fun(e)
  File "/usr/local/bin/ocropus-econf", line 50, in process1
    err,cs = edist.xlevenshtein(txt,gt,context=args.context)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/edist.py", line 43, in xlevenshtein
    cost = current[n]
UnboundLocalError: local variable 'current' referenced before assignment
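
An UnboundLocalError of this kind typically arises when a variable is assigned only inside a loop that never runs. One possible cause here (an assumption, not confirmed against ocrolib/edist.py) is that `-k digits` reduces a line without digits to an empty string. The sketch below is a plain edit distance, not ocropy's xlevenshtein, written so that the row variable exists even for empty input:

```python
def levenshtein(a, b):
    """Plain edit distance between sequences a and b, safe for empty inputs."""
    # Initialise the row before the loop so it exists even when b is empty.
    current = list(range(len(a) + 1))
    for i, cb in enumerate(b, start=1):
        previous, current = current, [i] + [0] * len(a)
        for j, ca in enumerate(a, start=1):
            current[j] = min(
                previous[j] + 1,               # extra character in b
                current[j - 1] + 1,            # extra character in a
                previous[j - 1] + (ca != cb),  # substitution or match
            )
    return current[len(a)]


print(levenshtein("kitten", "sitting"))
print(levenshtein("123", ""))
print(levenshtein("", ""))
```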

GPageSeg

Having installed Ocropy, I get a CheckError trying to run the example given in the Readme.

book/0001.bin.png
Traceback (most recent call last):
  File "./ocropus-gpageseg", line 414, in safe_process1
    process1(job)
  File "./ocropus-gpageseg", line 356, in process1
    scale = psegutils.estimate_scale(binary)
  File "[...]/ocropy-master/ocrolib/psegutils.py", line 41, in estimate_scale
    objects = binary_objects(binary)
  File "[...]/ocropy-master/ocrolib/psegutils.py", line 37, in binary_objects
    objects = morph.find_objects(labels)
  File "[...]/ocropy-master/ocrolib/toplevel.py", line 209, in argument_checks
    raise e
CheckError:
CheckError for argument image of function <function find_objects at 0x10e37d9b0>
<ndarray-7f87ebc09080 (5753, 4304) uint64 [0,11315]> of type <type 'numpy.ndarray'>: array must contain integer values

how to get the confidence of predicted output ?

Hello,
How can I get the confidence of the predicted output?
Any advice will be welcome. Thanks in advance!
Best regards,
Thanks!

hOCR per word basis?

Hi,

I really like your tool; its recognition seems to be better than Tesseract's in some cases. Tesseract, however, has more detailed hOCR output:

Each word gets wrapped in a span with class ocrx_word and has a bbox and x_wconf property.

The bbox property for each word lets users write their own layout detection, while x_wconf allows omitting words that were probably not recognized correctly.

Is this also possible with ocropy or is this planned?

Thank you.
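
To illustrate what word-level hOCR makes possible, here is a minimal sketch that reads ocrx_word spans and drops low-confidence words. The sample markup is made up for the demonstration (it is not ocropy output), and the class name `WordCollector` and the threshold of 60 are assumptions.

```python
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect text, bbox and x_wconf from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            # The title attribute holds ';'-separated "name value..." properties.
            props = {}
            for part in (attrs.get("title") or "").split(";"):
                fields = part.split()
                if fields:
                    props[fields[0]] = fields[1:]
            self._current = {
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


# Made-up sample in the word-level hOCR style described above.
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 93'>Grand</span> "
    "<span class='ocrx_word' title='bbox 110 92 230 116; x_wconf 41'>Loncourse</span>"
    "</span>"
)

parser = WordCollector()
parser.feed(SAMPLE)
parser.close()
confident = [w["text"] for w in parser.words
             if w["conf"] is not None and w["conf"] >= 60]
print(confident)
```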

Documentation

Is there any sort of documentation on using the software and an overview of how it works for a new user? Perhaps there is already a work in progress? I would be willing to help write some basic documentation if that is of interest.

ocropus-gpageseg - Segmentation

I'm from Thailand, and I use OCRopus to recognize text in my language. It works!
However, I have a problem with segmentation: Thai script has upper and lower characters (not attached to the middle line),
and your segmentation cuts off part of the upper Thai characters. I tried adjusting the argument values,
but it still does not work. Can you tell me how to add extra height above each line, or guide me to the part of the code that I can edit?

the upper character is cut
