Origami

Origami is a self-contained suite of batches and tools for OCR processing of historical newspapers. It covers many essential steps of a digitization pipeline, including (1) building training data for models, and (2) generating Page-XML OCR output from page images using trained models.

Apart from its specific features, Origami is

easy to set up
easy to use
based on file-based intermediary results that allow customization

Origami's current default implementation features:

DNN segmentation
dewarping
reading order detection
simple table support
Page-XML export

Origami also provides additional tools for:

annotating ground truth
debugging
creating annotated images
evaluation of OCR quality

Installation

conda create --name origami python=3.7 -c defaults -c conda-forge --file origami/requirements/conda.txt
conda activate origami
pip install -r origami/requirements/pip.txt

General Usage

cd /path/to/origami
python -m origami.batch.detect.segment

When called without arguments, as above, every command line tool prints help information on its arguments.

The given data path should contain processed pages as images. Generated data is put into the same path. Images may be structured into any hierarchy of sub folders.

Batches

Artifacts

Origami's processing happens in separated stages, with batches that read and write information from well-defined files (also called artifacts). Each batch creates and depends upon various artifacts, as shown in the following table. Rows depict artifacts, columns depict detection batches (i.e. the batches found under origami.batch.detect). Blank circles indicate a read, filled circles indicate a write. As illustrated here, later batches depend on information provided by earlier batches.


	segment	contours	flow	dewarp	layout	lines	order	ocr	compose
page image	◯		◯		◯	◯		◯
segment.zip	⬤	◯			◯	◯	◯
contours.0.zip		⬤	◯	◯	◯
flow.zip			⬤	◯
lines.0.zip			⬤		◯
contours.1.zip				⬤	◯		◯
dewarp.zip				⬤
contours.2.zip					⬤	◯	◯
tables.json					⬤	◯		◯	◯
contours.3.zip						⬤	◯		◯
lines.3.zip						⬤	◯	◯	◯
order.json							⬤		◯
ocr.zip								⬤	◯
compose.zip									⬤
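
The dependency rule expressed by the table can be checked mechanically. The following is a minimal sketch that transcribes the table by hand into two Python dictionaries and reports any batch that would read an artifact before an earlier batch has written it. The names READS, WRITES and missing_inputs are illustrative and not part of Origami; if the transcription and the table disagree, the table is authoritative.

```python
# Artifacts read by each batch (blank circles), transcribed by hand from the table.
READS = {
    "segment": {"page image"},
    "contours": {"segment.zip"},
    "flow": {"page image", "contours.0.zip"},
    "dewarp": {"contours.0.zip", "flow.zip"},
    "layout": {"page image", "segment.zip", "contours.0.zip",
               "lines.0.zip", "contours.1.zip"},
    "lines": {"page image", "segment.zip", "contours.2.zip", "tables.json"},
    "order": {"segment.zip", "contours.1.zip", "contours.2.zip",
              "contours.3.zip", "lines.3.zip"},
    "ocr": {"page image", "tables.json", "lines.3.zip"},
    "compose": {"tables.json", "contours.3.zip", "lines.3.zip",
                "order.json", "ocr.zip"},
}

# Artifacts written by each batch (filled circles), transcribed by hand from the table.
WRITES = {
    "segment": {"segment.zip"},
    "contours": {"contours.0.zip"},
    "flow": {"flow.zip", "lines.0.zip"},
    "dewarp": {"contours.1.zip", "dewarp.zip"},
    "layout": {"contours.2.zip", "tables.json"},
    "lines": {"contours.3.zip", "lines.3.zip"},
    "order": {"order.json"},
    "ocr": {"ocr.zip"},
    "compose": {"compose.zip"},
}


def missing_inputs(order):
    """Map each batch to the artifacts it reads that no earlier batch wrote."""
    available = {"page image"}  # the only artifact that exists before any batch runs
    problems = {}
    for batch in order:
        missing = READS[batch] - available
        if missing:
            problems[batch] = missing
        available |= WRITES[batch]
    return problems
```

Calling missing_inputs with the batches in the column order of the table returns an empty dictionary, while an order such as ["contours", "segment"] reports that contours is missing segment.zip.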

Running Batches

Order

Given an OCR model, and as illustrated in the table in the previous section, the necessary order of detection batches for performing OCR on a folder of documents is:

1	segment
2	contours
3	flow
4	dewarp
5	layout
6	lines
7	order
8	ocr
9	compose
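
The nine batches can be chained from a small driver script. The sketch below is one possible way to do this, not an official Origami entry point. It assumes that every detection batch accepts the data path as its first positional argument (this README only shows that form for compose and evaluate) and that the script is started from the Origami checkout, as in General Usage. The names BATCHES, batch_command and run_pipeline are made up for this example. Options that individual batches need, such as the OCR model for the ocr batch, are not documented here, so they are passed through unchanged via extra_args.

```python
import subprocess
import sys

# Detection batches in the required order.
BATCHES = ["segment", "contours", "flow", "dewarp", "layout",
           "lines", "order", "ocr", "compose"]


def batch_command(name, data_path, extra_args=()):
    """Build the argv for one detection batch (data path as positional argument is an assumption)."""
    return [sys.executable, "-m", f"origami.batch.detect.{name}",
            str(data_path), *extra_args]


def run_pipeline(data_path, extra_args=None, runner=subprocess.run):
    """Run all batches in order; stop at the first one that exits with an error."""
    extra_args = extra_args or {}
    for name in BATCHES:
        cmd = batch_command(name, data_path, extra_args.get(name, ()))
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit code.
        runner(cmd, check=True)
```

Here extra_args is a dictionary from batch name to a list of additional command line arguments for that batch; consult each batch's command line help for the options it actually supports.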

Concurrency

Batch processes can be run concurrently. To coordinate them, Origami supports locking either through lock files or through a database (see --lock-strategy). The database strategy is more compatible and is the default. Use --lock-database to specify the path to a lock database (if none is specified, Origami will create one in your data folder).
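
As an illustration, several worker processes for one batch could be started from Python as sketched below. This rests on assumptions that are not confirmed by this README: that concurrency means starting the same batch several times on the same data path, that the data path is a positional argument, and that --lock-database is followed by the path as a separate argument. worker_command and run_workers are hypothetical helper names.

```python
import subprocess
import sys


def worker_command(batch, data_path, lock_database=None):
    """Build the argv for one worker process of the given detection batch."""
    cmd = [sys.executable, "-m", f"origami.batch.detect.{batch}", str(data_path)]
    if lock_database is not None:
        # All workers must point at the same lock database to coordinate.
        cmd += ["--lock-database", str(lock_database)]
    return cmd


def run_workers(batch, data_path, n_workers, lock_database=None,
                popen=subprocess.Popen):
    """Start n_workers identical processes, wait for all, return their exit codes."""
    cmd = worker_command(batch, data_path, lock_database)
    workers = [popen(cmd) for _ in range(n_workers)]
    return [worker.wait() for worker in workers]
```

If lock_database is left as None, the flag is omitted and Origami's default applies.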

Modifying Results

It is possible to replace Origami pipeline stages/batches by custom implementations by simply reading and writing Origami's artifacts using the documented file formats.

It is also possible to run Origami stages and then postprocess the generated artifacts before continuing with later stages.
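
A postprocessing step between two stages might follow the pattern sketched below for a .json artifact. The sketch assumes that such artifacts are plain UTF-8 JSON files. The schema of each artifact is not described here, so the actual modification is left to a caller-supplied function, and rewrite_json_artifact is a hypothetical helper, not an Origami API. The file is replaced in a single step so that a later batch never sees a half-written artifact.

```python
import json
import os
import tempfile
from pathlib import Path


def rewrite_json_artifact(path, transform):
    """Load a JSON artifact, apply transform(data), and replace the file atomically."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    new_data = transform(data)
    # Write to a temporary file in the same folder, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(new_data, f)
    os.replace(tmp_name, path)
    return new_data
```

Run the earlier batches first, apply the helper to the artifact in question, and then continue with the later batches.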

The Detection Batches

segment

origami.batch.detect.segment: Performs segmentation (e.g. separation into text and background) on all images using a neural network model. By default, this uses Origami's own model. The predicted classes and labels are embedded in the downloaded model.

contours

origami.batch.detect.contours: From the pixelwise segmentation information, detects connected components to produce vectorized polygonal contours for blocks and separator lines.
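
To illustrate the idea of connected components, here is a generic textbook sketch, not Origami's implementation. It groups the set pixels of a small binary mask into 4-connected components with a breadth-first search and summarizes each component by its bounding box. Bounding boxes are a deliberate simplification; the contours batch produces polygonal contours. The function names are chosen for this example only.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected components of truthy cells as lists of (row, col)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search starting from an unvisited set pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def bounding_box(component):
    """Return (x_min, y_min, x_max, y_max) of a component, in pixel coordinates."""
    ys = [y for y, _ in component]
    xs = [x for _, x in component]
    return (min(xs), min(ys), max(xs), max(ys))
```

In a pipeline of this kind, one such mask would exist per predicted class, so that blocks and separators yield separate sets of components.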

flow

origami.batch.detect.flow: Detects baselines and warping in separators to produce an overall description of page curvature.

dewarp

origami.batch.detect.dewarp: Creates a dewarping transformation that is used in subsequent stages.

layout

origami.batch.detect.layout: Refines regions by fixing over- and under-segmentation via heuristic rules.

lines

origami.batch.detect.lines: Detects baselines and line boundaries for each text line.

order

origami.batch.detect.order: Finds a reading order using a variant of the XY Cut algorithm.
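
For readers unfamiliar with XY Cut, the following is a minimal sketch of the basic recursive idea. It is a simplified textbook version and not the variant Origami implements. Regions are given as bounding boxes (x0, y0, x1, y1). The boxes are first split at empty vertical strips into columns, read left to right; if no such strip exists, they are split at empty horizontal strips into rows, read top to bottom. The function names are illustrative.

```python
def _split(boxes, axis):
    """Group boxes separated by gaps in their projection onto axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2  # indices of the start and end coordinate on this axis
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest coordinate covered by the current group
    for box in ordered[1:]:
        if box[lo] > reach:
            groups.append([box])  # empty strip found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes):
    """Return the boxes in reading order using a basic recursive XY cut."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (0, 1):  # try cutting into columns first, then into rows
        groups = _split(boxes, axis)
        if len(groups) > 1:
            return [box for group in groups for box in xy_cut(group)]
    # No clean cut exists: fall back to top-to-bottom, left-to-right order.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

For a headline that spans two columns, this yields the headline first, then the left column, then the right column. The basic version has a known weakness: if the gaps between blocks happen to line up across columns, it cuts into rows and reads across the columns. Handling cases like this is one reason why practical systems use variants of the algorithm.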

ocr

origami.batch.detect.ocr: Performs OCR on each detected line using the specified Calamari OCR model. Note that the binarization you can specify here is independent of the one performed in origami.batch.detect.binarize.

compose

origami.batch.detect.compose: Composes text into one file using the detected reading order. Can also produce PageXML output.

Debugging

origami.batch.detect.stats: Prints out statistics on computed artifacts and errors. This is useful for understanding how many pages were processed, and for which stages this processing is finished.

origami.batch.annotate.contours: Produces debug images for understanding the result of the contours batch stage.

origami.batch.annotate.lines: Produces debug images for understanding the line detection stage.

origami.batch.annotate.layout: Produces debug images for understanding the results of the layout and order batch stages.

Tools for Ground Truth and Evaluation

Tools

origami.tool.annotate: Tool for annotating, viewing and searching for ground truth.

origami.tool.pick: Tool for adding or removing single lines from the ground truth for fine tuning.

origami.tool.sample: Create a new annotation database by randomly sampling lines from a corpus. The details of sampling (numbers of items for each segmentation label type per page) can be specified. Allows import of transcriptions stored in accompanying PageXML. See command line help for more details.

origami.tool.schema: Run an annotation normalization schema on the given ground truth text files.

origami.tool.export: From the given annotation database, export line images of the specified height and binarization together with accompanying ground truth text files. Annotation normalization through a schema is supported. Use this command to generate training data for Calamari. See command line help for details.

origami.tool.xycut: Debug internal X-Y cut implementation.

origami.batch.export.lines (debugging only): Export images of lines detected during the lines batch.

origami.batch.export.pagexml (debugging only): Export polygons of lines detected during the lines batch as PageXML.

How to create ground truth

For generating ground truth for training an OCR engine from a corpus, we suggest this general process:

Run batches up to lines on your page images.
Sample random lines using origami.tool.sample.
Fine tune your training corpus using origami.tool.pick (optional).
Annotate using origami.tool.annotate.
Export annotations using origami.tool.export.
Train your OCR model.

Origami Models

For line-based OCR, Origami uses Calamari internally and therefore can be used with any Calamari model. However, Origami's way of segmenting lines is slightly different from other pipelines: lines are not binarized and they are not scaled horizontally (therefore they might be wider than what some models are trained on).

One model specifically trained for Origami is the model used to perform OCR on the Berliner Börsen-Zeitung. More information can be found under https://github.com/poke1024/origami_models

Evaluation via Dinglehopper

To evaluate performance using Dinglehopper, you probably want to use:

python -m origami.batch.utils.evaluate DATA_PATH

Alternatively, you can create PAGE XMLs manually:

python -m origami.batch.detect.compose DATA_PATH \
    --page-xml --only-page-xml-regions \
    --regions regions/TEXT \
    --ignore-letters "{}[]"
