Giter VIP home page Giter VIP logo

ocr2text's Introduction

PDF to TXT (with OCR)

Given one or more PDFs that may include text-as-image content, use OCR (Optical Character Recognition) to convert the content to TXT files (in UTF-8 encoding).

Rationale

A survey of existing PDF-to-TXT solutions found no extant solutions that meet all of the following criteria:

  • is an offline tool (to keep secure human-subject information)
  • provides conversion from PDF to TXT (most existing OCR integrations assume an image as input)
  • supports batch processing of multiple files

Assumptions

  • This is (currently) a command-line tool, written in Python. Basic familiarity with executing commands in a terminal, as well as directory structure, is assumed.
  • It is assumed that you have Python version 3.x installed, as well as Pip.
  • This script relies on an industry-standard OCR library managed by Google, called Tesseract. Since it is written in C++, for Python to be able to use it, it needs to be installed separately (instructions below). Similarly, a PDF-to-image library, Poppler, will need to be installed on Windows and Mac systems.

Setup

Windows

  1. Make a new folder on your Desktop called ocr (e.g., C:\Users\mark\Desktop\ocr)
  2. Download and install the Tesseract 4 OCR library from Tesseract at UB Mannheim
  3. The installation should indicate which directory Tesseract-OCR was installed. Most likely, this will either be C:\Program Files (x86)\Tesseract-OCR or C:\Program Files\Tesseract-OCR. Move this folder into your equivalent of C:\Users\mark\Desktop\ocr, so that it is now located at Desktop\ocr\Tesseract-OCR.
  4. Download poppler for Windows.
  5. You may need to install 7Zip to unzip the executable, as well.
  6. Place the unzipped files in Desktop\ocr\poppler-0.68.0_x86).
  7. From your start menu, navigate to Control Panel > System and Security > System > Advanced System Settings
  8. Then click Environment Variables.
  9. In the System Variables window, highlight Path, and click Edit.
  10. Click New to add an additional path.
  11. Paste the full path to the location of Tesseract (e.g., C:\Users\mark\Desktop\\ocr\Tesseract-OCR) and press OK.
  12. Again, click New to add an additional path.
  13. Paste your equivalent of C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin and press OK.
  14. Press OK on any remaining control panel windows.
  15. Download OCR2Text to Desktop\ocr).
  16. Unzip the project.
  17. Open a cmd.exe terminal, and navigate to the folder via the command line (e.g., cd Desktop\ocr\ocr2text-master)
  18. Run pip install --user --requirement requirements.txt
  19. Optionally, you can check that you set up the PATH variable correctly in steps 6-10 by typing echo %PATH%. The output must include your equivalent of C:\Users\mark\Desktop\ocr\Tesseract-OCR and C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin for the script to work.

macOS

  1. Make a new folder on your Desktop called ocr (i.e., /Users/mark/Desktop/ocr)
  2. Install Tesseract-OCR using either MacPorts (sudo port install tesseract) or Homebrew (brew install tesseract
  3. Install poppler for Mac.
  4. Download this Github project to /Users/mark/Desktop/ocr).
  5. Unzip the project.
  6. Open a terminal and navigate to the folder via the command line (e.g., cd /Users/mark/Desktop/ocr/ocr2text)
  7. Run pip install --user --requirement requirements.txt

Linux

  1. sudo apt-get install tesseract-ocr
  2. Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils
  3. Download this Github project.
  4. Unzip the project.
  5. Open a terminal and navigate to the folder
  6. Run pip install --user --requirement requirements.txt

Usage

If you have successfully completed the setup steps and are using Python version 3, usage should now be a breeze:

On the command line, navigate to the directory where you downloaded the script and run:

python ocr2text.py

You will see the following:

********************************
*** PDF to TXT file, via OCR ***
********************************

Indicate file or folder of source PDF(s) []:
(Press [Enter] for current working directory)

Enter the full path to the file or directory to convert.

Destination folder for TXT []:
(Press [Enter] for current working directory)

Enter the full path to the directory where the result file(s) should be outputted.

The script will now covert the PDF via OCR into a plaintext file:

Testing the installation

For testing purposes, a test_files directory is included. You can press [Enter] for the source and destination directories & verify that the image.pdf file is converted. It will also be located in the test_files directory:

Converted C:\Users\mark\ocr2text\image.pdf
Percent: [##########] 100%
1 file converted

ocr2text's People

Contributors

markfullmer avatar dependabot[bot] avatar

Stargazers

Serhii Shyran avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.