Origami

Origami is a self-contained suite of batches and tools for OCR processing of historical newspapers. It covers many essential steps in a digitization pipeline, including (1) building training data for OCR models, and (2) generating Page-XML OCR output from page images using trained models.

Apart from its specific features, Origami is

  • easy to set up
  • easy to use
  • based on file-based intermediary results that allow customization

Origami's current default implementation features:

  • DNN segmentation
  • dewarping
  • reading order detection
  • simple table support
  • Page-XML export

Origami also provides additional tools for:

  • annotating ground truth
  • debugging
  • creating annotated images
  • evaluation of OCR quality

Installation

conda create --name origami python=3.7 -c defaults -c conda-forge --file origami/requirements/conda.txt
conda activate origami
pip install -r origami/requirements/pip.txt

General Usage

cd /path/to/origami
python -m origami.batch.detect.segment

All command line tools will give you help information on their arguments when called as above.

The given data path should contain the pages to be processed as images. Generated data is written to the same path. Images may be organized into any hierarchy of subfolders.
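
To illustrate what "any hierarchy of sub folders" means in practice, here is a minimal sketch that collects page images at any depth below a data path. The function name `find_page_images` and the set of file extensions are assumptions made for this example; they are not taken from Origami, whose own list of accepted image types may differ.

```python
from pathlib import Path

# Assumption: illustrative set of extensions, not Origami's actual list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_page_images(data_path):
    """Return all image files below data_path, at any folder depth, sorted by path."""
    root = Path(data_path)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `find_page_images("/path/to/data")` on a folder such as `1923/01/page1.jpg` plus `page0.png` would return both files, regardless of nesting.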

Batches

Artifacts

Origami's processing happens in separate stages, with batches that read and write information from well-defined files (also called artifacts). Each batch creates and depends upon various artifacts, as shown in the following table. Rows depict artifacts, columns depict detection batches (i.e. the batches found under origami.batch.detect). Blank circles indicate a read, filled circles indicate a write. As illustrated here, later batches depend on information provided by earlier batches.

Batches (columns, left to right): segment, contours, flow, dewarp, layout, lines, order, ocr, compose

Artifacts (rows, top to bottom):

  • page image
  • segment.zip
  • contours.0.zip
  • flow.zip
  • lines.0.zip
  • contours.1.zip
  • dewarp.zip
  • contours.2.zip
  • tables.json
  • contours.3.zip
  • lines.3.zip
  • order.json
  • ocr.zip
  • compose.zip

Running Batches

Order

Given an OCR model, and as illustrated in the table in the previous section, the detection batches must be run in the following order to perform OCR on a folder of documents:

1 segment
2 contours
3 flow
4 dewarp
5 layout
6 lines
7 order
8 ocr
9 compose
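
The order above can be scripted. The following is a minimal sketch, not part of Origami: it assumes that every detection batch accepts the data path as a positional argument (as the compose example in this README does) and that any additional required arguments, such as the model for the ocr batch, are supplied by the caller. The names `batch_command` and `run_pipeline` are hypothetical; check each batch's command line help for its actual arguments.

```python
import subprocess
import sys

# Batch order as listed above.
DETECT_BATCHES = [
    "segment", "contours", "flow", "dewarp", "layout",
    "lines", "order", "ocr", "compose",
]


def batch_command(batch, data_path, extra_args=()):
    """Build the command line for one detection batch."""
    # Assumption: the data path is passed as a positional argument.
    module = f"origami.batch.detect.{batch}"
    return [sys.executable, "-m", module, str(data_path), *extra_args]


def run_pipeline(data_path, extra_args=None, runner=subprocess.run):
    """Run all detection batches in order, stopping at the first failure.

    extra_args maps a batch name to a list of additional arguments.
    """
    extra_args = extra_args or {}
    for batch in DETECT_BATCHES:
        cmd = batch_command(batch, data_path, extra_args.get(batch, ()))
        # check=True raises CalledProcessError if a batch exits with an error.
        runner(cmd, check=True)
```

The `runner` parameter exists only so the sequence can be inspected or tested without actually launching the batches.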

Concurrency

Batch processes can be run concurrently. Origami supports locking either through lock files or through a database (see --lock-strategy). The latter strategy is more compatible and is the default. Use --lock-database to specify the path to a lock database (if none is specified, Origami will create one in your data folder).
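
For readers unfamiliar with file-based locking, the general idea can be sketched as follows. This illustrates the technique only and is not Origami's implementation; the name `file_lock` and the lock file naming are assumptions. It relies on exclusive file creation: whichever process creates the lock file first owns the page, and other processes skip it.

```python
import os
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path):
    """Yield True if the lock was acquired, False if it is already held."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        os.close(fd)
        yield True
    finally:
        # Release the lock by removing the lock file.
        os.remove(lock_path)
```

A worker would wrap the processing of each page in `with file_lock(path) as acquired:` and skip the page when `acquired` is false.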

Modifying Results

It is possible to replace Origami pipeline stages (batches) with custom implementations that read and write Origami's artifacts using the documented file formats.

It is also possible to run Origami stages and then postprocess the generated artifacts before continuing with later stages.
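
A first step towards postprocessing is looking at what an artifact contains. The sketch below lists the members of a zip artifact using Python's standard zipfile module. The internal layout of Origami's artifacts is not described in this README, so the code makes no assumption about member names; `list_artifact_members` is a hypothetical helper.

```python
import zipfile


def list_artifact_members(artifact_path):
    """Return (name, uncompressed size in bytes) for each member of a zip artifact."""
    with zipfile.ZipFile(artifact_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Once the member names and formats are known, the same module can be used to read members, modify them, and write a replacement artifact before running later stages.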

The Detection Batches

segment

origami.batch.detect.segment
Performs segmentation (e.g. separation into text and background) on all images using a neural network model. By default, this uses Origami's own model. The predicted classes and labels are embedded in the downloaded model.

contours

origami.batch.detect.contours
From the pixelwise segmentation information, detects connected components to produce vectorized polygonal contours for blocks and separator lines.
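
The core idea of connected component detection can be sketched in plain Python. This is a simplified illustration, not Origami's code: it labels 4-connected regions in a binary mask and returns one bounding box per region, whereas the contours batch produces polygonal contours. The function name `connected_components` and the nested-list mask format are assumptions for this example.

```python
from collections import deque


def connected_components(mask):
    """Return one inclusive bounding box (x0, y0, x1, y1) per 4-connected region."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```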

flow

origami.batch.detect.flow
Detects baselines and warping in separators to produce an overall description of page curvature.

dewarp

origami.batch.detect.dewarp
Creates a dewarping transformation that is used in subsequent stages.

layout

origami.batch.detect.layout
Refines regions by fixing over- and under-segmentation via heuristic rules.

lines

origami.batch.detect.lines
Detects baselines and line boundaries for each text line.

order

origami.batch.detect.order
Finds a reading order using a variant of the XY Cut algorithm.
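
For orientation, here is a minimal sketch of the classic recursive XY cut on bounding boxes. It is not Origami's variant, whose details are not described here. Boxes are assumed to be tuples (x0, y0, x1, y1) with y growing downwards, and the names `xy_cut` and `_split` are hypothetical. The sketch first tries to cut along whitespace gaps between rows, then between columns, and recurses into each part.

```python
def _split(boxes, lo, hi):
    """Group boxes into runs separated by gaps along one axis.

    lo and hi are the tuple indices of the start and end coordinate.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] >= reach:
            # Gap found: no earlier box extends past this coordinate.
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes):
    """Return boxes in reading order (top to bottom, columns left to right)."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try cuts along y (separating rows) first, then along x (separating columns).
    for lo, hi in ((1, 3), (0, 2)):
        groups = _split(boxes, lo, hi)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g)]
    # No cut possible: fall back to sorting by top edge, then left edge.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

For a headline spanning the page above two columns, this yields the headline, then the left column, then the blocks of the right column from top to bottom.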

ocr

origami.batch.detect.ocr
Performs OCR on each detected line using the specified Calamari OCR model. Note that the binarization you can specify here is independent of the one performed in origami.batch.detect.binarize.

compose

origami.batch.detect.compose
Composes text into one file using the detected reading order. Can also produce PageXML output.

Debugging

origami.batch.detect.stats
Prints out statistics on computed artifacts and errors. This is useful for understanding how many pages were processed, and for which stages this processing is finished.
origami.batch.annotate.contours
Produces debug images for understanding the result of the contours batch stage.
origami.batch.annotate.lines
Produces debug images for understanding the line detection stage.
origami.batch.annotate.layout
Produces debug images for understanding the results of the layout and order batch stages.

Tools for Ground Truth and Evaluation

Tools

origami.tool.annotate
Tool for annotating, viewing and searching for ground truth.
origami.tool.pick
Tool for adding or removing single lines from the ground truth for fine tuning.
origami.tool.sample
Create a new annotation database by randomly sampling lines from a corpus. The details of sampling (numbers of items for each segmentation label type per page) can be specified. Allows import of transcriptions stored in accompanying PageXML. See command line help for more details.
origami.tool.schema
Run an annotation normalization schema on the given ground truth text files.
origami.tool.export
From the given annotation database, export line images of the specified height and binarization together with accompanying ground truth text files. Annotation normalization through a schema is supported. Use this command to generate training data for Calamari. See command line help for details.
origami.tool.xycut
Debug internal X-Y cut implementation.
origami.batch.export.lines (debugging only)
Export images of lines detected during the lines batch.
origami.batch.export.pagexml (debugging only)
Export polygons of lines detected during the lines batch as PageXML.

How to create ground truth

For generating ground truth for training an OCR engine from a corpus, we suggest this general process:

  • Run batches up to lines on your page images.
  • Sample random lines using origami.tool.sample.
  • Fine tune your training corpus using origami.tool.pick (optional).
  • Annotate using origami.tool.annotate.
  • Export annotations using origami.tool.export.
  • Train your OCR model.

Origami Models

For line-based OCR, Origami uses Calamari internally and therefore can be used with any Calamari model. However, Origami's way of segmenting lines is slightly different from other pipelines: lines are not binarized and they are not scaled horizontally (therefore they might be wider than what some models are trained on).

One model specifically trained for Origami is the model used to perform OCR on the Berliner Börsen-Zeitung. More information can be found under https://github.com/poke1024/origami_models

Evaluation via Dinglehopper

To evaluate performance using Dinglehopper, you probably want to use:

python -m origami.batch.utils.evaluate DATA_PATH

Alternatively, you can create PAGE XMLs manually:

python -m origami.batch.detect.compose DATA_PATH \
    --page-xml --only-page-xml-regions \
    --regions regions/TEXT \
    --ignore-letters "{}[]"
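
OCR evaluation commonly reports a character error rate (CER): the edit distance between ground truth and OCR output, divided by the length of the ground truth. The sketch below shows that computation in plain Python. It is not Dinglehopper's implementation, which may treat Unicode normalization and other details differently; the function names are assumptions for this example.

```python
def levenshtein(a, b):
    """Return the minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_text):
    """Return edit distance divided by the ground truth length."""
    if not ground_truth:
        # Convention chosen for this sketch: empty ground truth gives 0 or infinity.
        return 0.0 if not ocr_text else float("inf")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)
```

For example, comparing the ground truth "Zeitung" with the OCR output "Zeitnng" gives one substitution over seven characters.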
