mittagessen / kraken

OCR engine for all the languages

License: Apache License 2.0

Python 99.36% HTML 0.16% CSS 0.22% JavaScript 0.26%

ocr neural-networks alto-xml hocr handwritten-text-recognition htr layout-analysis optical-character-recognition page-xml

kraken's Introduction

Description

kraken is a turn-key OCR system optimized for historical and non-Latin script material.

kraken's main features are:

Fully trainable layout analysis, reading order, and character recognition

Right-to-Left, BiDi, and Top-to-Bottom script support

ALTO, PageXML, abbyyXML, and hOCR output

Word bounding boxes and character cuts

Multi-script recognition support

Public repository of model files

Variable recognition network architecture

Installation

kraken only runs on Linux or Mac OS X. Windows is not supported.

The latest stable releases can be installed either from PyPI:

$ pip install kraken

or through conda:

$ conda install -c conda-forge -c mittagessen kraken

For direct PDF and multi-image TIFF/JPEG2000 support, install the pdf extras package from PyPI:

$ pip install kraken[pdf]

or install pyvips manually with pip:

$ pip install pyvips

Conda environment files are provided for the seamless installation of the main branch as well:

$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment.yml

or:

$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment_cuda.yml

for CUDA acceleration with the appropriate hardware.

Finally you'll have to scrounge up a model to do the actual recognition of characters. To download the default model for printed French text and place it in the kraken directory for the current user:

$ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by running:

$ kraken list

Quickstart

Recognizing text on an image using the default parameters including the prerequisite steps of binarization and page segmentation:

$ kraken -i image.tif image.txt binarize segment ocr

To binarize a single image using the nlbin algorithm:

$ kraken -i image.tif bw.png binarize

To segment an image (binarized or not) with the new baseline segmenter:

$ kraken -i image.tif lines.json segment -bl

To segment and OCR an image using the default model(s):

$ kraken -i image.tif image.txt segment -bl ocr -m catmus-print-fondue-large.mlmodel

All subcommands and options are documented. Use the help option to get more information.
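
When the pipeline is driven from a script, the command lines above can be assembled programmatically. The following is a minimal sketch, assuming the kraken executable is on the PATH; the helper names build_kraken_command and run_kraken are hypothetical, and only flags shown in the Quickstart are used.

```python
import subprocess


def build_kraken_command(image, output, model=None, baseline=True):
    """Assemble a kraken invocation that segments and recognizes one image."""
    cmd = ["kraken", "-i", image, output, "segment"]
    if baseline:
        cmd.append("-bl")  # baseline segmenter, as in the Quickstart
    cmd.append("ocr")
    if model is not None:
        cmd += ["-m", model]  # explicit recognition model file
    return cmd


def run_kraken(image, output, model=None):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_kraken_command(image, output, model), check=True)
```

Building the argument list separately from running it makes it possible to log or inspect the exact command before anything is executed.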

Documentation

Have a look at the docs.

Related Software

These days kraken is closely linked to the eScriptorium project, developed in the same eScripta research group. eScriptorium provides a user-friendly interface for annotating data, training models, and running inference (and much more). There is a gitter channel that is mostly intended for coordinating technical development, but it is also a good place to find people with experience applying kraken to a wide variety of material.

Funding

kraken is developed at the École Pratique des Hautes Études, Université PSL.

This project was partially funded through the RESILIENCE project, funded from the European Union’s Horizon 2020 Framework Programme for Research and Innovation.

kraken also received funding from the Programme d’Investissements d’Avenir.

This work benefited from French state aid managed by the Agence Nationale de la Recherche under the Programme d’Investissements d’Avenir, reference ANR-21-ESRE-0005 (Biblissima+).


kraken's Issues

Error while using clstm models that I trained and tested

Hi!
When I use a default clstm model such as arabic-beirut-200.clstm, everything works and the image is converted successfully:
user@user ~/Desktop/mags $ kraken -i images/tt1.jpg image.txt binarize segment ocr -m arabic-beirut-200.clstm
Loading RNN default ✓
Binarizing ✓
Segmenting ✓
Processing ✓
Writing recognition results for /tmp/tmpq1EMeq ✓
But when I use any clstm model that I trained and tested myself, I get:
user@user ~/Desktop/mags $ kraken -i images/tt1.jpg image.txt binarize segment ocr -m persian-keyhan-5000.clstm
Loading RNN default ✓
Binarizing ✓
Segmenting ✓
Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1093, in invoke
    return _process_result(rv)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1031, in _process_result
    **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 167, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 125, in recognizer
    for pred in it:
  File "/usr/local/lib/python2.7/dist-packages/kraken/rpred.py", line 211, in mm_rpred
    pred = nets[script].predictString(line)
  File "/usr/local/lib/python2.7/dist-packages/kraken/lib/models.py", line 88, in predictString
    line = line.reshape(-1, self.rnn.ninput(), 1)
ValueError: can only specify one unknown dimension

I am using kraken version 0.9.6:
user@user ~/Desktop/mags $ kraken --version
kraken, version 0.9.6.dev8
and I compiled the separate-derivs branch to train my clstm model.

kraken binarize produces a VisibleDeprecationWarning

When I use

kraken binarize

I always receive the following warning:

/usr/lib/python2.7/site-packages/numpy/core/numeric.py:190: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  a = empty(shape, dtype, order)

The binarized image is still computed. But I am trying to run the command unattended and raise an error if something is reported to the standard error output (which the warning unfortunately is).

Do you know of a way to disable the warning or where I would have to search in order to fix this?
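
One possible workaround for unattended runs is sketched below. It assumes that the command is a Python program that does not reset the warning filters itself: Python reads the PYTHONWARNINGS environment variable at startup, so launching the process with PYTHONWARNINGS=ignore should keep warnings off standard error. The helper name run_without_warnings is hypothetical.

```python
import os
import subprocess
import sys


def run_without_warnings(cmd):
    """Run `cmd` with Python warnings silenced; return (exit code, stderr text)."""
    # PYTHONWARNINGS is honoured by the Python interpreter of the child process.
    env = dict(os.environ, PYTHONWARNINGS="ignore")
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # Stand-in for a kraken call: a child interpreter that emits a warning.
    code, err = run_without_warnings(
        [sys.executable, "-c", "import warnings; warnings.warn('noisy')"]
    )
    print(code, repr(err))
```

With the warnings filtered this way, anything that still arrives on standard error can be treated as a real failure by the calling script.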

Training a new model for telugu language fails

I am trying to train an ocropy model for the Telugu language. Ten samples of training data are available here.

When I try to run it, it fails with different errors:

chillaranand@pavilion:~/projects/python/ocr/data/samples |
→ ketos linegen 0000.gt.txt
Reading texts   ✓
Read 1 unique lines
Σ (len: 22)
Symbols:  ంఇఉకటడణదనబమరసహాిుెేొ
Combining Characters: TELUGU SIGN VIRAMA
Writing images  ⣽/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py:278: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  hs = gaussian_filter(np.random.randn(4*h, 1.5*w), sigma)
/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py:279: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  ws = gaussian_filter(np.random.randn(4*h, 1.5*w), sigma)
Traceback (most recent call last):
  File "/home/chillaranand/.virtualenvs/p35/bin/ketos", line 11, in <module>
    sys.exit(cli())
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/ketos.py", line 407, in line_generator
    im = linegen.degrade_line(im, np.random.normal(mean), np.random.normal(sigma), np.random.normal(density))
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py", line 239, in degrade_line
    im += np.random.normal(mean, sigma, im.shape)
  File "mtrand.pyx", line 1902, in mtrand.RandomState.normal (numpy/random/mtrand/mtrand.c:17755)
ValueError: scale <= 0

For another file, it throws this error

chillaranand@pavilion:~/projects/python/ocr/data/samples |
→ ketos linegen 0002.gt.txt
Reading texts   ✓
Read 1 unique lines
Σ (len: 23)
Symbols:  ంఅచజడతదనపభమయరలవాిీుెో
Combining Characters: TELUGU SIGN VIRAMA
Writing images  ⣽/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py:278: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  hs = gaussian_filter(np.random.randn(4*h, 1.5*w), sigma)
/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py:279: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  ws = gaussian_filter(np.random.randn(4*h, 1.5*w), sigma)
Traceback (most recent call last):
  File "/home/chillaranand/.virtualenvs/p35/bin/ketos", line 11, in <module>
    sys.exit(cli())
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/ketos.py", line 407, in line_generator
    im = linegen.degrade_line(im, np.random.normal(mean), np.random.normal(sigma), np.random.normal(density))
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py", line 241, in degrade_line
    coords = [np.random.randint(0, i - 1, int(flipped)) for i in im.shape]
  File "/home/chillaranand/.virtualenvs/p35/lib/python3.5/site-packages/kraken/linegen.py", line 241, in <listcomp>
    coords = [np.random.randint(0, i - 1, int(flipped)) for i in im.shape]
  File "mtrand.pyx", line 1266, in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:14292)
  File "mtrand.pyx", line 1267, in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:14131)
  File "mtrand.pyx", line 749, in mtrand._rand_int64 (numpy/random/mtrand/mtrand.c:9764)
ValueError: negative dimensions are not allowed

Any ideas on how to fix this?
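
Both tracebacks pass through the same call, in which the noise parameters are themselves drawn from normal distributions (np.random.normal(sigma), np.random.normal(density)). Such a draw can come out negative, which would explain both "scale <= 0" and "negative dimensions are not allowed". The following is a minimal sketch of one way to guard against that, using the standard library's random.gauss as a stand-in for np.random.normal; the function name and the clamping bounds are assumptions, not kraken's actual fix.

```python
import random


def sample_degradation_params(mean, sigma, density, rng=random):
    """Draw per-line noise parameters and clamp them to usable ranges."""
    drawn_mean = rng.gauss(mean, 1.0)
    # A standard deviation must be strictly positive.
    drawn_sigma = max(rng.gauss(sigma, 1.0), 1e-6)
    # A density later converted to a pixel count must not be negative.
    drawn_density = max(rng.gauss(density, 1.0), 0.0)
    return drawn_mean, drawn_sigma, drawn_density
```

Clamping changes the distribution of the drawn values slightly, so rejecting and redrawing invalid samples would be an alternative design.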

Error: Too many open files

I am trying to create an HTML file for the transcription of 1050 PNG images, but the following command fails with an error:
$ ketos transcrib -o output.html *.png
Reading images ?
Error: Could not open file 001020.png: Too many open files

So I reduced the number of images to 1019; the result was:
$ ketos transcrib -o output.html *.png
Reading images ?
Writing output ?Error: Could not open file output.html: Too many open files

So I split the images evenly into two folders, each containing a little over 500 PNG files, and it worked and created an output.html.

What was the problem?
Is there a limit on the number of pages or .png files that ketos transcrib can handle?
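
A failure at roughly 1020 files is consistent with the per-process limit on open file descriptors (1024 is a common soft default on Linux), which would be reached if every input image were kept open at once. This is an assumption about the cause, not something the output above confirms. The sketch below reads the current limit and splits a file list into batches that stay below it; the helper names are hypothetical.

```python
import resource


def open_file_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def batches(paths, size):
    """Yield consecutive slices of `paths` with at most `size` entries each."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]
```

Each batch can then be passed to a separate ketos transcrib call, which matches the manual workaround of splitting the images into two folders.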

Kraken installation and training

The Kraken documentation is unclear. There is a pressing need for a video demonstrating the following:

Installing clstm
Installing kraken and its dependencies
Training kraken on some Arabic text, with both clstm and pyrnn
Using the trained model to recognize an image
along with the example image used and its trained model

@mittagessen Thank you for your hard work. I really hope you can create a step-by-step video guide, since I am creating freely licensed models and training data for the Arabic language and need your help.
Waiting for your reply

how to use the API

Hi,
Do you have a sample of API usage?
I tried to use the API in Python, but the libraries could not be found.
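
When a package cannot be found from Python, a common cause is that it was installed for a different interpreter than the one running the script. The sketch below only checks whether a module is importable in the current interpreter; it uses the standard library and makes no assumption about kraken's API. The function name module_available is hypothetical.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Shows which interpreter is in use and whether it can see kraken.
    print(sys.executable, module_available("kraken"))
```

If the check reports that the module is missing, installing with the same interpreter (for example via its own pip) is usually the next step to try.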

Installation requirements on Debian 8.6 (Jessie)

Following the installation instructions in the documentation didn't work on Debian 8.6 (Jessie).

What I had to do:

apt-get install build-essential
apt-get install git

apt-get install libpangocairo-1.0 libxml2 libblas3 liblapack3

apt-get install libxml2-dev
apt-get install libxslt1-dev
apt-get install python-scipy
apt-get install python-pip

pip install lxml
pip install jinja2
pip install regex
pip install python-bidi
pip install numpy
pip install kraken

TypeError: a bytes-like object is required, not 'str'

Hi, I get the following error when I run the command below:
kraken -i image.png image.txt binarize segment ocr
Loading RNN ✓
Binarizing ✓
Segmenting Traceback (most recent call last):
  File "/home/rahnema/anaconda3/bin/kraken", line 11, in <module>
    sys.exit(cli())
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/click/core.py", line 1093, in invoke
    return _process_result(rv)
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/click/core.py", line 1031, in _process_result
    **ctx.params)
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/kraken/kraken.py", line 142, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/home/rahnema/anaconda3/lib/python3.6/site-packages/kraken/kraken.py", line 81, in segmenter
    json.dump(res, fp)
  File "/home/rahnema/anaconda3/lib/python3.6/json/__init__.py", line 180, in dump
    fp.write(chunk)
TypeError: a bytes-like object is required, not 'str'
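
In Python 3, json.dump writes str chunks, so this error appears when the target file object is a binary stream. The sketch below reproduces the failure with an in-memory binary stream and shows the text-mode alternative. That kraken opened its output file in binary mode is an assumption drawn from the error message, and the helper names are hypothetical.

```python
import io
import json


def binary_stream_rejects_json(result):
    """Return True if json.dump fails with TypeError on a binary stream."""
    try:
        json.dump(result, io.BytesIO())
    except TypeError:
        return True
    return False


def dump_segmentation(result, path):
    """Write `result` as JSON; text mode accepts the str chunks json.dump emits."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(result, fp)
```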

Some GT of the ocropus models found

Some ground truth for the ocropus models can be found at http://www.tmbdev.net/ocrdata-hdf5/, but I don't know how complete it is. For example, the Google-1000-Books data seems to be missing. However, the MNIST data, which consists only of (handwritten) digits, is there. I found this (again) while looking at the IPython notebook https://github.com/tmbdev/clstm/blob/master/misc/lstm-mnist-py.ipynb and remembered that we had talked about it. (This may be only partially related to kraken, but my email bounced.)

Training new features

Dear all,
I have installed kraken with the following command: pip install kraken.
I would like to ask whether this framework can be trained on a new set of features.
If so, how is that done?
Thank you.

[Suggestion] Regarding ketos linegen

Peace be upon you

ketos linegen generates png/txt pairs, but when it finishes it doesn't generate a manifest.txt.

Please let linegen generate the manifest.txt automatically, for example with:
ls *.png > manifest.txt

~~2. When using ketos linegen for the Arabic language, the generated text in the txt files is in RTL order and thus needs to be reordered before it is used in training.~~

~~- Please give ketos linegen the option of --reorder to reorder the text, instead of manually using reorder.py~~
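
The manifest request from the first point can also be scripted. The following is a minimal sketch equivalent to the ls one-liner, assuming the manifest should list the PNG file names one per line in sorted order; the function name write_manifest is hypothetical.

```python
from pathlib import Path


def write_manifest(directory, manifest_name="manifest.txt"):
    """List all PNG files in `directory` in a manifest file; return the names."""
    directory = Path(directory)
    names = sorted(p.name for p in directory.glob("*.png"))
    manifest = directory / manifest_name
    manifest.write_text("".join(name + "\n" for name in names), encoding="utf-8")
    return names
```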

Training kraken and RTL support?

@amitdo commented here on the specific RTL support in kraken. Since I have been unsuccessful in training OCR models for Hebrew with ocropy, I wonder whether kraken could do the job. Can anyone introduce me to the details of kraken's RTL support? I could not find the related information in the documentation. Many thanks in advance!

Kraken crashes when using a clstm model with 800 hidden units

Peace be upon you
I have trained two clstm models, the first with nhidden=100 and the second with nhidden=800.
The purpose was to see whether recognition would improve.
I used the separate-derivs branch of clstm to train both models.

Later on, I used the first model in Kraken, which recognized the text perfectly.
But with the second clstm model, created with nhidden=800, Kraken crashes with an error in the ocr step.
It seems that kraken can't handle any model created with more than nhidden=100.

The nhidden=100 model
The nhidden=800 model
The training data

Here is the error when using the nhidden=800 model:

bmwmy@ubuntu:~/Desktop/test$ kraken -i 000001.png out.txt binarize segment ocr -m arabic-9000-800.clstm
Loading RNN	✓
Binarizing	✓
Segmenting	✓
Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1093, in invoke
    return _process_result(rv)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1031, in _process_result
    **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 142, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 101, in recognizer
    for pred in it:
  File "/usr/local/lib/python2.7/dist-packages/kraken/rpred.py", line 222, in rpred
    yield bidi_record(ocr_record(pred, pos, conf))
  File "/usr/local/lib/python2.7/dist-packages/kraken/rpred.py", line 98, in bidi_record
    for i, j in enumerate(record):
  File "/usr/local/lib/python2.7/dist-packages/future/types/newobject.py", line 71, in next
    return type(self).__next__(self)
  File "/usr/local/lib/python2.7/dist-packages/kraken/rpred.py", line 56, in __next__
    return (self.prediction[self.idx], self.cuts[self.idx],
IndexError: list index out of range

Bad Credentials

Hello, I installed kraken using pip3 and everything went fine, but I cannot use it.
As soon as I try to get the default model, I get the following error message:

$ kraken get default
Retrieving model	⣾Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1092, in invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kraken/kraken.py", line 346, in get
    partial(spin, 'Retrieving model'))
  File "/usr/local/lib/python3.6/site-packages/kraken/repo.py", line 46, in get_model
    raise KrakenRepoException(resp['message'])
kraken.lib.exceptions.KrakenRepoException: Bad credentials

I removed kraken and tried again with a pip2 install, but the result remains the same:

$ kraken get default
Retrieving model	⣾Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1092, in invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/kraken/kraken.py", line 346, in get
    partial(spin, 'Retrieving model'))
  File "/usr/local/lib/python2.7/site-packages/kraken/repo.py", line 46, in get_model
    raise KrakenRepoException(resp['message'])
kraken.lib.exceptions.KrakenRepoException: Bad credentials

What did I miss?
(I am using macOS 10.12.6 with Python 2.7 or 3.6.)

Training Kraken, am I doing something wrong?

A video demonstration of the problem I am facing while training a new CLSTM model.

I have transcribed and extracted a sample Arabic image of 7 lines, just for testing, see attachment.

I have used train.sh from mittagessen/kraken-vagrant.

I tested training with train.sh twice: first with the values set to 1000, then with the values set to 10.

On both occasions, train.sh created multiple arabic-*.clstm files, all of the same size.
I used one of the .clstm files for recognition in kraken; the result was a .txt file containing only 6 empty lines.
I downloaded a kraken model just to test whether recognition itself works, and yes, it works.

Am I doing something wrong here?
Waiting for your reply

Update: this problem is solved. I discovered that I needed a high number of iterations for the trained model to work.

ketos transcrib Issue

Using ketos transcrib with a two-column text doesn't give the desired result. Column detection such as that in segment might help. From what I have noticed, there seems to be an empty-space threshold that separates text boxes into two columns.

use another model

I get the error below when I try to use another model.

My command is: kraken -i en.png image.txt binarize segment ocr -m toy

Usage: kraken ocr [OPTIONS]

Error: Invalid value: No model found

New training interface

I know this is already listed as a missing feature in the documentation. However, I am wondering what exactly the plans are going forward. What sort of interface, API, etc. will it have, and what options are under consideration?

Error while using clstm models

Hi,
I have installed kraken and ran the default pyrnn model successfully, but when I try to use any of the clstm models I get "Loading RNN Segmentation fault (core dumped)" while the RNN is loading.
The clstm tests also pass.
N.B. I am using a VM with these specs: 8 cores, 8 GB RAM, 2 GB graphics RAM.

Regarding the clstm bindings commit

Hi there,
I noticed that you added the clstm bindings commit. Does that mean I can use "ketos train" to train a new model, or that clstm is installed automatically along with "clstmocrtrain"?
I ask because I just installed kraken using pip, but when I run "clstmocrtrain" to train a new model, the command is not recognized.
Should I build clstm from source and then install kraken from pip?
Waiting for your reply
@mittagessen

Reorder and Normalize an already extracted .txt

Hi there,
I am training a .clstm model on 1,050 lines of Arabic, using the double-checked Arabic training data.
The .txt files have already been extracted; the problem is that the text in them is in RTL order.

How can I reorder the text to LTR and/or normalize the already extracted .txt files?
Thank you @mittagessen for your hard work
Waiting for your reply
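
The normalization half of this question can be handled with Python's standard library. The sketch below applies a Unicode normalization form to a string or to every line of a text file; the choice of NFC and the helper names are assumptions. Reordering between logical and visual order is a separate step that needs a BiDi implementation (the python-bidi package appears in the Debian installation notes on this page) and is not shown here.

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Return `text` in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


def normalize_file(src, dst, form="NFC"):
    """Copy `src` to `dst`, normalizing each line along the way."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(normalize_text(line, form))
```

For example, NFC composes ALEF (U+0627) followed by MADDAH ABOVE (U+0653) into the single code point U+0622, so that ground truth and model output use the same encoding of the same letter.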

Strange IOError in current version

I'm trying to run OCR on a test image (http://www.anycount.com/WordCountBlog/wp-content/uploads/2009/07/test-english-shht.jpg) with the current version.

When I run the command (after renaming the image):

kraken -i input1.jpg ocr.txt binarize segment ocr --model ~/.config/kraken/en-default.pronn

I get this strange IOError:

Traceback (most recent call last):
  File "/usr/bin/kraken", line 10, in <module>
    sys.exit(cli())
  File "/usr/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 1087, in invoke
    return _process_result(rv)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 1025, in _process_result
    **ctx.params)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/kraken/kraken.py", line 144, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/usr/lib/python2.7/site-packages/kraken/kraken.py", line 61, in binarizer
    low, high)
  File "/usr/lib/python2.7/site-packages/kraken/binarization.py", line 64, in nlbin
    raw = pil2array(im)
  File "/usr/lib/python2.7/site-packages/kraken/lib/util.py", line 22, in pil2array
    a = np.fromstring(im.tobytes(), 'B')
  File "/usr/lib/python2.7/site-packages/PIL/Image.py", line 673, in tobytes
    self.load()
  File "/usr/lib/python2.7/site-packages/PIL/ImageFile.py", line 222, in load
    "(%d bytes not processed)" % len(b))

From StackOverflow I saw a possible solution but didn't try it out myself yet.
https://stackoverflow.com/questions/12984426/python-pil-ioerror-image-file-truncated-with-big-images
(not the accepted solution but the second one)

Do you have any ideas what might cause this?
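
The last frame of the traceback ("bytes not processed") suggests that the image loader considered the file truncated, for instance after an incomplete download. The following is a rough standard-library heuristic for checking this before running OCR; it is an assumption-laden sketch, not part of kraken or PIL, and it only inspects the JPEG start and end markers.

```python
def jpeg_looks_complete(data):
    """Heuristic: a complete JPEG starts with SOI (FF D8) and ends with EOI (FF D9)."""
    starts_ok = data[:2] == b"\xff\xd8"
    # Some writers pad the file with zero bytes after the end marker.
    ends_ok = data.rstrip(b"\x00")[-2:] == b"\xff\xd9"
    return starts_ok and ends_ok


def file_looks_complete(path):
    """Apply the marker check to a file on disk."""
    with open(path, "rb") as fp:
        return jpeg_looks_complete(fp.read())
```

A file that fails this check is worth downloading again before any changes are made to the OCR setup.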

SUGG: Print Kraken version and List installed Models

Hi there,
@mittagessen Thank you for all your hard work, you're amazing.

This is more of a suggestion than an issue:

Ability to print the current Kraken version.
Ability to list the currently installed .clstm models and their location.

[Suggestion] Reorganize Kraken repo

Peace be upon you, here are some suggestions for you @mittagessen

Create repository kraken-ocr
For kraken-ocr Include:
kraken Kraken Open Source OCR Engine (main repository)
kraken-models Kraken recognition models for various languages (Beta)
kraken-scripts Scripts to automate various aspects of Kraken
kraken-clstm A small C++ implementation of LSTM networks, focused on OCR
kraken-research Research and documents on Kraken
For kraken create a wiki, include:
Kraken Ocr Part 1: Building CLSTM https://youtu.be/ST_XrfcCpKE
Kraken Ocr Part 3: Creating and transcribing the HTML file https://youtu.be/No87TADb9zQ
Kraken Ocr Part 4: Training a new CLSTM model https://youtu.be/Ec9Qi7S8cvA
Also mention that it uses a modified version of the clstm separate-derivs
For kraken add tags of kraken kraken-ocr ocr-engine machine-learning
For kraken-scripts include:
Training
For Training include:

pretrain.sh
#!/bin/bash
set -x
set -a
sort -R manifest.txt > /tmp/manifest2.txt
sed 1,100d /tmp/manifest2.txt > train.txt
sed 100q /tmp/manifest2.txt > test.txt
train.sh
#!/bin/bash
set -x
set -a
report_every=1000
save_every=1000
maxtrain=50000
target_height=48
dewarp=center
display_every=1000
test_every=1000
nhidden=100
lrate=1e-4
save_name=arabic
clstmocrtrain train.txt test.txt

For kraken-clstm fork the clstm separate-derivs and modify clstm.h & extras.h by changing isnan to std::isnan
For kraken-research include the PDF of Important New Developments in Arabographic Optical Character Recognition;
future research and recognition tests might also be posted there.
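
The split performed by pretrain.sh above (shuffle the manifest, keep the first 100 lines for testing and the rest for training) can be sketched in Python as follows. The function name split_manifest and the seed parameter are hypothetical additions for reproducibility.

```python
import random


def split_manifest(lines, test_size=100, seed=None):
    """Shuffle `lines`; return (train, test) with `test_size` lines held out."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # First `test_size` shuffled lines go to the test set, as in pretrain.sh.
    return shuffled[test_size:], shuffled[:test_size]
```

Passing a fixed seed yields the same split on every run, which the sort -R version does not guarantee.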

outdated module docstring in lib/models.py

Wraps around legacy pyrnn and HDF5 models to provide a single interface. In the
future it will also include support for clstm models.

Problems in pageseg.py

Hi,

I wanted to test your tool so I installed it on my Ubuntu machine and tried to run it, but I got the following error:

File "/usr/local/lib/python2.7/dist-packages/kraken/pageseg.py", line 250, in compute_line_seeds
    bmarked *= (bottom > threshold*np.amax(bottom)*threshold)*(1-colseps)
TypeError: Cannot cast ufunc multiply output from dtype('int32') to dtype('bool') with casting rule 'same_kind'

Although I don't know Python, I found a stackoverflow answer suggesting to add braces and .astype(). I found that two rows seem to be affected and after replacing them, kraken works for me. The new rows look as follows:

bmarked *= ((bottom > threshold*np.amax(bottom)*threshold)*(1-colseps)).astype(bmarked.dtype)

and

tmarked *= ((top > threshold*np.amax(top)*threshold/2)*(1-colseps)).astype(bmarked.dtype)
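
The error can be reproduced in isolation: recent NumPy versions refuse to store an integer product in a boolean array during an in-place multiplication. The sketch below assumes NumPy is installed and uses made-up arrays in place of the segmenter's data; the helper names are hypothetical.

```python
import numpy as np


def inplace_multiply_fails(marked, mask):
    """Return True if `marked *= mask` raises the casting TypeError."""
    tmp = marked.copy()
    try:
        tmp *= mask
    except TypeError:
        return True
    return False


def apply_mask(marked, mask):
    """Multiply in place after casting the mask to the target dtype."""
    marked *= mask.astype(marked.dtype)
    return marked
```

Casting the right-hand side to the dtype of the array being updated is the same idea as the .astype() workaround quoted above.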

Writing images ⣽Segmentation fault (core dumped)

Hi,
I get the error below during the training data generation phase:
Writing images ⣽Segmentation fault (core dumped)

Ketos Linegen is randomizing the lines by default

Peace Be upon you
@mittagessen When running the command ketos linegen ...... txtfile.txt,
it doesn't generate the line images in the order of the lines in the txt file, meaning the first line in the .txt file is not 000000.png.

Python throws 'TypeError' when starting kraken

Using the latest nidaba[kraken] release, I end up with a TypeError when starting kraken. Most likely a misconfiguration problem. I ran through the procedure at http://openphilology.github.io/nidaba/index.html and I am on Ubuntu 14.04

Any hints are greatly appreciated. Many thanks!

Traceback:
~/built/nidaba$ kraken

Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 7, in <module>
    from kraken.kraken import cli
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 25, in <module>
    from kraken.lib import models
  File "/usr/local/lib/python2.7/dist-packages/kraken/lib/models.py", line 29, in <module>
    from kraken.lib import pyrnn_pb2
  File "/usr/local/lib/python2.7/dist-packages/kraken/lib/pyrnn_pb2.py", line 20, in <module>
serialized_pb=b'\n\x11proto/pyrnn.proto\x12\x06kraken"'\n\x05\x61rray\x12\x0b\n\x03\x64im\x18\x01 \x03(\r\x12\x11\n\x05value\x18\x02 \x03(\x02\x42\x02\x10\x01"\xca\x01\n\x04lstm\x12\x1a\n\x03wgi\x18\x01 \x02(\x0b\x32\r.kraken.array\x12\x1a\n\x03wgf\x18\x02 \x02(\x0b\x32\r.kraken.array\x12\x1a\n\x03wgo\x18\x03 \x02(\x0b\x32\r.kraken.array\x12\x1a\n\x03wci\x18\x04 \x02(\x0b\x32\r.kraken.array\x12\x1a\n\x03wip\x18\x05 \x02(\x0b\x32\r.kraken.array\x12\x1a\n\x03wfp\x18\x06 \x02(\x0b\x32\r.kraken.array\x12\x1a\n\x03wop\x18\x07 \x02(\x0b\x32\r.kraken.array"$\n\x07softmax\x12\x19\n\x02w2\x18\x01 \x02(\x0b\x32\r.kraken.array"\xb1\x01\n\x05pyrnn\x12\x0c\n\x04kind\x18\x01 \x02(\t\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\x0e\n\x06ninput\x18\n \x02(\r\x12\x0f\n\x07noutput\x18\x0b \x02(\r\x12\r\n\x05\x63odec\x18\x0c \x03(\t\x12\x1c\n\x06\x66wdnet\x18\r \x02(\x0b\x32\x0c.kraken.lstm\x12\x1c\n\x06revnet\x18\x0e \x02(\x0b\x32\x0c.kraken.lstm\x12 \n\x07softmax\x18\x0f \x02(\x0b\x32\x0f.kraken.softmax'
TypeError: __init__() got an unexpected keyword argument 'syntax'

"kraken get default" fails

running

kraken get default

fails with

$ kraken get default
Retrieving model    ⣽Traceback (most recent call last):
  File "/usr/bin/kraken", line 11, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.4/site-packages/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.4/site-packages/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.4/site-packages/click/core.py", line 1053, in invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.4/site-packages/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.4/site-packages/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.4/site-packages/click/decorators.py", line 16, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3.4/site-packages/kraken/kraken.py", line 249, in get
    partial(spin, 'Retrieving model'))
  File "/usr/lib/python3.4/site-packages/kraken/repo.py", line 44, in get_model
    desc = json.loads(raw)
  File "/usr/lib/python3.4/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
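The traceback points at json.loads receiving bytes, which Python 3.4 rejects. A minimal sketch of one way around it, assuming the response body is UTF-8 encoded: decode to text before parsing. The helper name `parse_response` and the payload are hypothetical and are not part of kraken.

```python
import json


def parse_response(raw):
    """Decode a bytes payload to text before handing it to json.loads."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')  # assumption: the server sends UTF-8
    return json.loads(raw)


# Made-up payload standing in for the repository's model description.
desc = parse_response(b'{"name": "example-model"}')
```

Newer Python versions accept bytes in json.loads directly, so the explicit decode mainly matters on older interpreters such as the 3.4 shown in the traceback.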

Broken file mode

Attempted to run the example shown in the documentation. Got this:

kraken -i page-00000.tiff out.html binarize segment ocr -h
Loading RNN	✓
Binarizing	✓
Segmenting	Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1093, in invoke
    return _process_result(rv)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1031, in _process_result
    **ctx.params)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/parkerhancock/Projects/kraken/kraken/kraken.py", line 142, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/Users/parkerhancock/Projects/kraken/kraken/kraken.py", line 81, in segmenter
    json.dump(res, fp)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 180, in dump
    fp.write(chunk)
TypeError: a bytes-like object is required, not 'str'

Pulled the repo, and found the culprit. On line 80, the "open" command needs to be changed from "wb" to "w".

The json.dump method writes str chunks to the file object, which raises an error when the file was opened in binary mode. I have a branch if you'd like a pull request, but it's a one-line fix. I have kraken installed in editable mode on my machine with the change, but it would be super handy if the fix made it into the repo version.

Running on Python 3.6.1

Thanks!
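For reference, the failure and the one-line fix described above can be reproduced outside kraken with the standard library alone. This is a sketch: the `res` dictionary and the file name are made up and do not reflect kraken's actual segmentation output.

```python
import json
import os
import tempfile

res = {'boxes': [[0, 0, 100, 20]]}  # invented stand-in for a segmentation result
path = os.path.join(tempfile.mkdtemp(), 'lines.json')

failed = False
try:
    # Binary mode: json.dump hands str chunks to fp.write, which expects bytes.
    with open(path, 'wb') as fp:
        json.dump(res, fp)
except TypeError:
    failed = True

# Text mode accepts the str chunks that json.dump produces.
with open(path, 'w') as fp:
    json.dump(res, fp)

with open(path) as fp:
    loaded = json.load(fp)
```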

TensorFlow/Keras support?

Hi there,

I just found this project and I'm really interested in such a challenging task! I have some experience with RNNs and their implementation in TensorFlow, and I was wondering whether TensorFlow models are supported here. The reason I want to use TensorFlow is the ability to build on other models, which may give kraken a huge performance boost.

The 'kraken' package in PyPI is outdated

(0.7.6)
https://pypi.python.org/pypi/kraken

Already forked ocropy, so how to fork kraken?

The Github GUI does not let me do that.

linegen: OpenType ligatures

tesseract-ocr/tesseract#288

This PR is from Nick White, so I guess it's important for Ancient Greek.

[Suggestion] linegen - Use PangoCoverage

https://developer.gnome.org/pango/stable/pango-Coverage-Maps.html

Tesseract's text2image tool uses it. You might want to use it in linegen.py.

Search for 'coverage' in these files:
https://github.com/tesseract-ocr/tesseract/blob/master/training/text2image.cpp
https://github.com/tesseract-ocr/tesseract/blob/master/training/pango_font_info.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/pango_font_info.cpp
https://github.com/tesseract-ocr/tesseract/blob/master/training/stringrenderer.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/stringrenderer.cpp

dict object has no attribute 'iteritems'

Hello, I tried this: kraken -i dwg.jpg convertex.txt binarize segment ocr
but got the error: dict object has no attribute 'iteritems'.

In kraken.py => for k, v in model.iteritems():
I tried several input images (jpg, png) with the same result.

kraken was installed with pip install kraken on Anaconda Python (v3.5.3, 64-bit).
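The error comes from a Python 2 idiom: dict.iteritems() was removed in Python 3, where dict.items() serves the same purpose. A minimal sketch follows; the contents of the `model` mapping are hypothetical and only illustrate the loop quoted above.

```python
# Hypothetical mapping; only the iteration idiom matters here.
model = {'default': 'example-model.clstm'}

# On Python 3, dictionaries no longer have an iteritems() method.
has_iteritems = hasattr(model, 'iteritems')

# items() works on both Python 2 and Python 3.
pairs = [(k, v) for k, v in model.items()]
```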

Multi-Page Input for the CLI

Friendly suggestion - I love that Kraken supports python3 and is fairly lightweight on dependencies, but what is starting to be a deal-breaker is the lack of support for multi-page input. My workflow (which I suspect is fairly common) is to take a .PDF, split it into Group 4 TIFFs, and then OCR the TIFF images into an hOCR document (and then on to NLP-land).

Ocropy handles glob characters (? and * wildcards) to handle multiple pages of input, and can generate a consolidated hOCR file for the whole document. As far as I (and probably a lot of people) am concerned, these are must-have features.

So, for your consideration, I'd recommend either (1) allowing the CLI to accept glob-like input, or (2) building/documenting an API so it can be used from Python code, without the CLI, for multi-page documents.

Maybe 2 already exists in some form or fashion, with some selective imports/etc. But on cursory review, it's tough to pick out the pieces.

Thanks!
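Until something like option (1) exists, one workaround is to expand the glob outside kraken and call the CLI once per page. The sketch below only builds the argument lists; it reuses the `kraken -i INPUT OUTPUT binarize segment ocr` form quoted elsewhere on this page. The helper name `build_commands` and the output naming scheme are assumptions, and it does not produce a consolidated hOCR file.

```python
import glob
import os


def build_commands(pattern, suffix='.txt'):
    """Return one kraken argument list per image matching the glob pattern."""
    commands = []
    for image in sorted(glob.glob(pattern)):
        # Assumption: write each result next to its input, swapping the extension.
        output = os.path.splitext(image)[0] + suffix
        commands.append(['kraken', '-i', image, output,
                         'binarize', 'segment', 'ocr'])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to process the pages one after another.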

Issues in using kraken

Hi,

I'm facing several issues while running kraken. I'm running Ubuntu 16.04 with python 2.7.12.
One of them is that, after successful binarization, segment throws a segmentation fault.

root@de2e05f9d21b:~/integrated-ocr/images# kraken -i test_image1.jpg image.jpg binarize
Binarizing      ✓
root@de2e05f9d21b:~/integrated-ocr/images# kraken -i image.jpg lines.txt segment
Segmenting      Segmentation fault (core dumped)

The other issue I face is this:

root@de2e05f9d21b:~/integrated-ocr/images# kraken -i test_image1.jpg image.txt binarize segment ocr
Usage: kraken ocr [OPTIONS]

Error: Invalid value for "-m" / "--model": Mappings must be in format script:model

Any clues as to why this is happening? Let me know. I'm even ready to help you in fixing it (if these issues are indeed reproducible).

ERROR=1 & OUT=blank

Peace Be Upon you,
I have been training a new Arabic model for 32 hours, reaching 269,000 epochs. Starting from epoch 8000, the Error remains 1 and the OUT is empty (from 8000 to 269,000).

My training data are generated artificially, with these features:

Arabic
Contains no diacritics
300 dpi
Times New Roman, regular, size 18
I'm sure that the transcription is 100% correct; it is based on tanzil.net/download "simple clean, no pause marks, no signs"

Attached (click-on):
The transcribed html file
The extracted png/gt.txt files
The training script train.sh
The produced clstm models
The complete terminal log

My training script:

set -x
set -a
sort -R manifest.txt > /tmp/manifest2.txt
sed 1,100d /tmp/manifest2.txt > train.txt
sed 100q /tmp/manifest2.txt > test.txt

report_every=1000
save_every=1000
maxtrain=2000000
target_height=48
dewarp=center
display_every=1000
test_every=1000
hidden=100
lrate=1e-4
save_name=arabic
'/home/bmwmy/Desktop/kra/clstm/clstmocrtrain' train.txt test.txt
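For readers less familiar with the sed calls at the top of the script: after shuffling, the first 100 manifest lines become the test set and everything after them becomes the training set. A rough Python equivalent is sketched below. The function name `split_manifest` and its parameters are hypothetical, and `random.shuffle` is swapped in for `sort -R`, which orders lines differently.

```python
import random


def split_manifest(lines, n_test=100, seed=None):
    """Shuffle manifest lines and return (train, test) like the sed calls."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]   # sed 100q: keep the first n_test lines
    train = shuffled[n_test:]  # sed 1,100d: drop the first n_test lines
    return train, test
```

The two sets never overlap, so the test error is measured on lines the model has not been trained on.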

No Arabic recognition in Kraken version 0.9.4

Hi there,
When using Arabic clstm models in Kraken versions 0.9.3 or 0.9.4 an error appears at the ocr step.
Note that version 0.9.4 and 0.9.3 can use English models without problems, and also note that earlier versions don't have this problem with Arabic.
The Arabic clstm model arabic-75000.zip

chris@ubuntu:~/Desktop/Untitled Folder$ kraken -i 000002.png out.txt binarize segment ocr -m arabic-75000.clstm
Loading RNN default	✓
Binarizing	✓
Segmenting	✓
Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1093, in invoke
    return _process_result(rv)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1031, in _process_result
    **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 161, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 119, in recognizer
    for pred in it:
  File "/usr/local/lib/python2.7/dist-packages/kraken/rpred.py", line 193, in mm_rpred
    'boxes': map(lambda x: x[1], line)})):
  File "/usr/local/lib/python2.7/dist-packages/kraken/rpred.py", line 133, in extract_boxes
    raise KrakenInputException('Line outside of image bounds')
kraken.lib.exceptions.KrakenInputException: Line outside of image bounds

kraken linegen?

Expected Behavior

@amitdo pointed me to kraken's linegen implementation using pango+cairo. However, 'linegen' is not a valid subcommand of 'kraken'. It would be neat if one could access the contents of 'linegen.py' from kraken's CLI.

Current Behavior

'linegen.py' is not wrapped in kraken.py

Possible Solution

Add 'linegen' as a subcommand to 'kraken'.

More Installation Details

For the documentation: On a clean install of Linux Mint the following were needed:

pip install -U pip setuptools
apt install build-essential python-dev

ketos train fails with error

After building ground truth data, the train command fails with:

API rate limit exceeded

I haven't used VirtualBox before, and may be making some basic error, but I keep stumbling at the same spot. I followed the installation guide, got everything running and ran kraken on some images successfully. Then I started getting an API rate limit exceeded error. The only way I could get around it was to uninstall and reinstall. Then things would be fine for a while, then the error would return.

I realize the problem is likely on my end, but I'd be grateful for advice. I'm on OSX.

Can't use .png with transparency

Peace be upon you
Original image:

Using the command ketos transcribe -o output.html Untitled-1.png, the output.html result:

How to use multiple CPUs to speed up the training process

Does your library support using multiple CPUs to speed up the training process?
If so, how?

Failure using pip install kraken

Peace be upon you
Trying to install Kraken on a fresh Ubuntu 16.04

Using pip install kraken & sudo pip install kraken, results after installation:
The program 'kraken' is currently not installed.
Using pip install -U pip setuptools and sudo apt install build-essential python-dev and pip install kraken, results after installation:
Kraken command can be issued but using any clstm model fails.
Video demonstration: https://youtu.be/Myv25XE11lM
Log:

chris@ubuntu:~/Desktop/test$ kraken -i 000000.png out.txt binarize segment ocr -m arabic-beirut.clstm 
Loading RNN default	✗
Traceback (most recent call last):
  File "/usr/local/bin/kraken", line 11, in <module>
    sys.exit(cli())
  File "/home/chris/.local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/chris/.local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/chris/.local/lib/python2.7/site-packages/click/core.py", line 1092, in invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "/home/chris/.local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/chris/.local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/chris/.local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 262, in ocr
    rnn = models.load_any(location)
  File "/usr/local/lib/python2.7/dist-packages/kraken/lib/models.py", line 213, in load_any
    seq = load_pyrnn(fname)
  File "/usr/local/lib/python2.7/dist-packages/kraken/lib/models.py", line 320, in load_pyrnn
    raise KrakenInvalidModelException(str(e))
kraken.lib.exceptions.KrakenInvalidModelException: invalid load key, '

Faulty line segmentation

Hi mittagessen,

I have a tiny little problem with kraken-pageseg: in some cases (e.g. the first two lines of the attached paragraph image), it reverses the order of lines for no obvious reason. Maybe it's because of left_of (and not right_of) in pageseg.reading_order?
My first try was to switch off column detection by setting maxcolseps=0, but this doesn't change anything. I made some additions to switch off columns and reordering if maxcolseps=0 in andbue@69c8009. This fixes the issue for me, as my pages are pre-segmented using LAREX anyway, but that doesn't solve the general problem here.

Cheers,
Andreas

Error trying to get hOCR output

Hi,

After getting kraken to work on my machine, I tried to get hOCR output, sadly without any success.

A normal call to kraken works and looks like this:

kraken -i 1.tiff test.txt

No problem so far.

But when I try to call for hOCR output as follows, it doesn't work:

kraken -i 1.tiff test.hocr ocr -h

I get the following error:

File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 132, in process_pipeline
    task(base_image=base_image, input=input, output=output)
  File "/usr/local/lib/python2.7/dist-packages/kraken/kraken.py", line 83, in recognizer
    in csv.reader(fp)]
_csv.Error: line contains NULL byte

The usage help says Usage: kraken [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]..., so I don't get what I'm doing wrong.

Any help is appreciated. Thank you.