
adf2pdf's Introduction

adf2pdf is a tool that turns a batch of paper pages into a PDF with a text layer. By default, it detects empty pages (which easily occur during duplex scanning) and excludes them from the OCR and the resulting PDF.

For that, it uses SANE's scanimage for the scanning, Tesseract for the optical character recognition (OCR), and the Python packages img2pdf, Pillow (PIL) and PyPDF2 for some image-processing tasks and PDF manipulation.

Example:

$ adf2pdf contract-xyz.pdf
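For orientation, a run like this can be pictured as a few external commands. The sketch below only builds argument lists and is not taken from adf2pdf's source: the helper names are hypothetical, and the `--source`, `--mode` and `--resolution` options and their values depend on the SANE backend (the ones shown are assumptions for a duplex-capable ADF scanner). `textonly_pdf` is a Tesseract config variable that asks for a PDF holding only the invisible text layer.

```python
def scanimage_cmd(out_pattern="page-%04d.png", resolution=600,
                  mode="Lineart", source="ADF Duplex"):
    """Build a scanimage batch command (hypothetical helper).

    --batch and --format are generic scanimage options; --source, --mode
    and --resolution are backend-specific, so the values are assumptions.
    """
    return ["scanimage", f"--batch={out_pattern}", "--format=png",
            "--source", source, "--mode", mode,
            "--resolution", str(resolution)]


def tesseract_cmd(image, out_base, lang="eng"):
    """Build a Tesseract command that writes <out_base>.pdf (hypothetical helper)."""
    return ["tesseract", image, out_base, "-l", lang,
            "-c", "textonly_pdf=1", "pdf"]


if __name__ == "__main__":
    print(" ".join(scanimage_cmd()))
    print(" ".join(tesseract_cmd("page-0001.png", "page-0001")))
```

Each list could be handed to `subprocess.run(cmd, check=True)`; keeping the commands as lists avoids shell quoting problems with values such as `ADF Duplex`.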

2017, Georg Sauthoff

Features

  • Automatic document feed (ADF) support
  • Fast empty page detection
  • Overlapping (pipelining) of scanning, image processing, OCR and PDF creation to minimize the total runtime
  • Fast creation of small PDFs using the fine img2pdf package
  • Uses only safe compression methods, i.e. no error-prone symbol-segmentation-style compression such as JBIG2 or JB2, which are used in Xerox photocopiers and the DjVu format.
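Two of these features, empty page detection and the overlap of scanning with OCR, can be illustrated with a small standard-library sketch. It is not adf2pdf's implementation: the ink-fraction threshold, the function names and the page representation (rows of 0/1 pixels, as in a lineart scan) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def is_empty_page(pixels, max_ink_fraction=0.001):
    """Treat a page as empty if at most max_ink_fraction of it is black.

    pixels: rows of 0 (white) / 1 (black) values. The threshold is an
    assumed value that tolerates a few specks of dust.
    """
    total = ink = 0
    for row in pixels:
        for px in row:
            total += 1
            ink += px
    return total == 0 or ink / total <= max_ink_fraction


def process_overlapped(scanned_pages, ocr, workers=4):
    """Submit each non-empty page for OCR as soon as the scanner yields it.

    scanned_pages may be a generator that blocks while the scanner works;
    pages submitted earlier are processed in worker threads in the meantime.
    Results come back in scan order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(ocr, page)
                   for page in scanned_pages
                   if not is_empty_page(page)]
        return [f.result() for f in futures]
```

Threads are sufficient here if `ocr` mostly waits on an external process such as Tesseract; a pure-Python `ocr` function would call for processes instead.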

Install Instructions

Adf2pdf can be directly installed with pip, e.g.

$ pip3 install --user adf2pdf

or

$ pip3 install adf2pdf

See also the PyPI adf2pdf project page.

Alternatively, the Python file adf2pdf.py can be directly executed in a cloned repository, e.g.:

$ ./adf2pdf.py report.pdf

In addition to that, one can install the development version from a cloned work-tree like this:

$ pip3 install --user .

Hardware Requirements

A scanner with automatic document feed (ADF) that is supported by SANE. For example, the Fujitsu ScanSnap S1500 works well. That model supports duplex scanning, which is quite convenient.

Example continued

Running adf2pdf on a 7-sheet example document (14 page sides in duplex mode) takes 150 seconds on an i7-6600U (Intel Skylake, 2 cores, 4 threads) CPU, using the ADF of the Fujitsu ScanSnap S1500. With the defaults, adf2pdf calls scanimage for duplex scanning into 600 dpi lineart (black and white) images. In this example, 6 page sides are empty and thus automatically excluded, i.e. the resulting PDF contains just 8 pages.

The resulting PDF contains a text layer from the OCR, so one can search it and copy and paste text. It is 1.1 MiB in size, i.e. a page is stored in 132 KiB, on average.
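The page count in this example can be checked with a few lines of arithmetic. The snippet reads the example document as 7 sheets scanned on both sides, which is an interpretation: it is the reading under which 6 empty sides and 8 remaining pages add up.

```python
sheets = 7                  # assumed: the example document has 7 sheets
sides = sheets * 2          # duplex scanning produces two sides per sheet
empty_sides = 6             # detected as empty and dropped
kept_pages = sides - empty_sides

total_seconds = 150
seconds_per_side = total_seconds / sides

print(f"{sides} sides scanned, {kept_pages} pages kept")
print(f"about {seconds_per_side:.1f} s per scanned side")
```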

Software Requirements

The script assumes Tesseract version 4 by default. Version 3 can be used as well, but the neural network (LSTM) engine introduced in Tesseract 4 performs much better than the old OCR model. Tesseract 4.0.0 was released in late 2018, so distributions released around that time may still ship only version 3 in their repositories (e.g. Fedora 29, whereas Fedora 30 features version 4). Since version 4 is so much better at OCR, I can't recommend it enough over version 3.

Tesseract 4 notes (in case you need to build it from the sources):

  • Build instructions - warning: if you miss the autoconf-archive dependency you'll get weird autoconf error messages
  • Data files - you need the training data for your languages of choice and the OSD data

Python packages:

  • img2pdf (Fedora package: python3-img2pdf)
  • Pillow (PIL) (Fedora package: python3-pillow-devel)
  • PyPDF2 (Fedora package: python3-PyPDF2)
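A quick way to see which of these requirements are missing on a given machine is sketched below. It is a hypothetical helper, not part of adf2pdf; it checks only that the two command-line tools are on the PATH and that the three Python packages are importable (Pillow is imported under the name `PIL`), not their versions.

```python
import importlib.util
import shutil

REQUIRED_TOOLS = ("scanimage", "tesseract")
REQUIRED_MODULES = ("img2pdf", "PIL", "PyPDF2")  # import names, not package names


def missing_requirements(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return the names of tools not on PATH and modules that cannot be found."""
    missing = [t for t in tools if shutil.which(t) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


if __name__ == "__main__":
    missing = missing_requirements()
    print("missing: " + ", ".join(missing) if missing else "all requirements found")
```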


adf2pdf's Issues

missing deps

I needed to run

pip install configargparse img2pdf pypdf2

before the package could be installed. Could you publish a version that declares those deps, or make them optional, so that others don't have to guess?
