Giter VIP home page Giter VIP logo

docemul's Introduction

DocEmul

A Toolkit to Generate Structured Historical Documents (This repository is still under construction. Please let us know if you find any issues...)

This toolkit is presented in the paper "DocEmul: a Toolkit to Generate Structured Historical Documents" (https://arxiv.org/abs/1710.03474) in Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), 2017. It has been proposed to generate structured synthetic documents emulating the actual document production process. Synthetic documents can be used to train systems to perform document analysis tasks. The toolkit is able to generate synthetic collections and also perform data augmentation to create a larger trainable dataset. It includes one method to extract the page background from real pages which can be used as a substrate where records can be written on the basis of variable structures and using cursive fonts. Moreover, it is possible to extend the synthetic collection by adding random noise, page rotations, and other visual variations. In the experiments we address the record counting task on handwritten structured collections containing a limited number of examples. More detailed experiments and results for the record counting task are presented in the paper "Deep neural networks for record counting in historical handwritten documents" (https://doi.org/10.1016/j.patrec.2017.10.023 ) published on Pattern Recognition Letters journal.

Please, contact us if you want to collaborate to our project in order to extend the software capability. In the future works we will add also a structural ground truth to train a model for record detection. Keep in touch to help us in this tasks.

Data repo

It is possible to find more data at a dropbox link (https://www.dropbox.com/sh/gx93qumgbvp2ipe/AABUbsexdJg-GuyJbRbNXGdsa?dl=0). In this <repo> there are some generated examples by using the proposed toolkit. In particular, there are examples from two collection used in the experiments. in <repo>/Data you can find some generated files for two different datasets (Esposalles, Brandenburg). In particular, you can find the extracted background (<repo>/Data/<dataset>/Background.zip) used to generate synthetic files; the generated text files over a transparent background (<repo>/Data/<dataset>/Text); the genereted synthetic documents (<repo>/Data/<dataset>/Generated).

Repository description

Document structure

There are several XML files to describe the document structure. This files are used also in the experiments proposed in the research.

Text data

In the text files there are the data used to write text during the production process.

Fonts

Before to start the generation process, you need to download from http://www.dafont.com/ the used fonts for the experiments. Run the script python download_font.py which creates the directory fonts. Here you can find some fonts downloaded from http://www.dafont.com/ and used to generate the synthetic data for the experiments.

Generation process

Pre-requisites

  • Python 3
  • pip install -r py-requirements.txt

Esposalles

It is possible to emulate the generation of documents for the dataset Esposalles.

Download background images

Run the script to download images from our repository.

sh Data/Esposalles/download_extract_background_files.sh

After that, you can find some background images into Data/Esposalles/Background

Generate synthetic pages

Run the script to generate some synthetic image with data augmentation using:

python generate_esposalles_images.py

Try to modify the script to generate more pages.

If you need more instructions to define the document structure, please, contact us and we will glad to help you..

Brandenburg

It is possible to emulate the generation of documents for the dataset Brandenburg. Running the script python generate_brandenburg.py it will be possible to generate documents following the Brandenburg model (branden2.xml). It will build the synthetic dataset (only text over a transparent background) into the local directory GENERATED/Brandenburg/test.

Citing DocEmul

If you find DocEmul useful in your research, please consider citing:

@article{CAPOBIANCO2017, title = "Deep neural networks for record counting in historical handwritten documents", journal = "Pattern Recognition Letters", year = "2017", issn = "0167-8655", doi = "https://doi.org/10.1016/j.patrec.2017.10.023", url = "http://www.sciencedirect.com/science/article/pii/S0167865517303914", author = "Samuele Capobianco and Simone Marinai", keywords = "Handwritten documents, Convolutional neural networks, Synthetic document generation, Record counting" }

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.