A tutorial on extracting text from PDFs and optical character recognition using tesseract, ImageMagick and other open source tools.
This is based on the tutorial by Chad Day and updated for the Windows PC labs at NICAR 2020.
This class seeks to help you solve a common problem in journalism: data stored in a computer-generated PDF or, even worse, an image PDF. We'll first walk through how to do some quick text extraction using a command line tool. Then we'll step up to Optical Character Recognition, or OCR, to work on image files.
We'll use emoji to indicate different kinds of information.
- ๐: You should run the command in this code cell as part of the tutorial.
- ๐: Run this version of the command if you're on a Mac.
- ๐ฅ: Run this version of the command if you're on a Windows machine.
- ๐: Don't run the command in this cell. It's just an example for reference.
We're using a number of open source software tools to process our PDFs:
- Xpdf is an open source toolkit for working with PDFs. We'll be using its tool, pdftotext.
- tesseract is our OCR engine. It was first developed by HP but for the last decade or so it's been maintained by Google.
- ImageMagick is an open source image processing and conversion power tool.
- Ghostscript is an interpreter for PDFs and Adobe's PostScript language.
We'll be using a number of files for our examples. You can find them in folders corresponding to each of the three scenarios we'll be walking through together:
scenario_one
scenario_two
scenario_three
This is probably the easiest PDF problem to solve. We want to extract the text from a searchable PDF so we can analyze it.
There are many GUI software programs you can use to do this. They all have strengths and weaknesses.
- Cometdocs
- Tabula (free and great for tabular data!)
- Adobe Acrobat Pro ($$)
- Abbyy Finereader ($$ but also very accurate)
For this tutorial, we're going to use an open source powertool from Xpdf called pdftotext. The construction of the command is pretty intuitive. You point it at a file and it outputs a text file.
I often use this tool to check for hidden text, particularly in documents that are redacted. Our example is from just a few months ago when lawyers for Paul Manafort accidentally filed a document that wasn't properly redacted. Reporters, including my colleague Michael Balsamo, quickly realized that even though the document contained blacked out sections, the text of those passages was still present. That text revealed Manafort had shared polling data with a Russian associate during the 2016 election.
One way to get to this text is just to copy and paste the sections out. But this can be tedious, particularly if there are a lot of sections or you have a large document. A faster method, with output that is easier to read, is to use Xpdf's pdftotext.
Our document has several sections like this.
But since we can tell that there's text underneath there, let's run it through pdftotext and see what comes out.
๐ To get started, let's change to the directory for this scenario:
cd scenario_one
๐ pdftotext needs to know the path of the PDF file and the path of the output text file:
pdftotext /path/to/my/file.pdf name-of-my-text-file.txt
๐ It's pretty simple to extract all the text in a text PDF to a text file:
pdftotext Manafort_filing.pdf manafort_filing.txt
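When there are many PDFs to check, the same command can be looped over a whole folder. What follows is a minimal sketch rather than part of the original class: txt_name is a hypothetical helper introduced here, and the loop assumes that pdftotext is on your PATH and that you're in a POSIX-style shell (macOS, Linux, or something like Git Bash on Windows).

```shell
# Hypothetical helper: swap a trailing .pdf extension for .txt.
txt_name() {
  printf '%s\n' "${1%.pdf}.txt"
}

# Extract the text of every PDF in the current directory.
for pdf in *.pdf; do
  [ -e "$pdf" ] || continue            # no PDFs here: the glob stays literal
  if command -v pdftotext >/dev/null 2>&1; then
    pdftotext "$pdf" "$(txt_name "$pdf")"
  else
    echo "pdftotext not found; skipping $pdf" >&2
  fi
done
```

Each text file lands next to its PDF under the same base name, so existing .txt files with matching names would be overwritten.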
But that's just one limited use case. The extracted text can also be fed into databases or used for visualizations.
Let's take a look at another one of our files involving tabular data, found here. This is a salary roster of Trump White House employees. We'll be using a single image page of this file for a later example.
As mentioned before, Tabula is a great tool for getting tabular data out of pdf files, but I wanted to give you another option using pdftotext that works well with fixed-width data files like this White House salaries listing. It also has the added benefit of being easily scriptable.
๐ The -table option tells pdftotext to try to extract tabular text from a PDF and maintain the rows/columns:
pdftotext -table /path/to/my/file name-of-my-text-file.txt
We'll test it out on the file.
๐ Run pdftotext with the -table option to extract the table.
pdftotext -table 07012018-report-final.pdf tabular-test.txt
You should get something like this:
๐ For comparison, try using just pdftotext without the -table option.
pdftotext 07012018-report-final.pdf test.txt
You should get something like this (very bad stuff):
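The -table output in tabular-test.txt is fixed-width text, so it usually needs one more step before it can go into a spreadsheet or database. Here is a rough sketch of that step, assuming columns are separated by at least two spaces and no cell contains a double space. The columns_to_tsv helper and the sample row are made up for illustration; they are not part of the original tutorial or the real salary report.

```shell
# Hypothetical helper: turn fixed-width rows into tab-separated rows by
# treating every run of two or more spaces as a column break.
columns_to_tsv() {
  awk '{
    sub(/^ +/, "")          # drop leading indentation
    sub(/ +$/, "")          # drop trailing spaces
    gsub(/  +/, "\t")       # 2+ spaces -> one tab
    print
  }'
}

# Made-up sample row, not taken from the real salary report.
printf 'Doe, Jane     Employee     $50,000.00     Per Annum\n' | columns_to_tsv

# On the real output you might run:
#   columns_to_tsv < tabular-test.txt > tabular-test.tsv
```

Rows with empty cells will come out with too few columns under this approach, so spot-check the result against the PDF before analyzing it.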
Now that we've walked through the basics of text extraction with computer-generated (nice) PDFs, let's move on to the harder use cases.
๐ Let's change out of the scenario directory so we're ready to move on to the next scenario.
cd ..
Extracting text from image files is perhaps one of the most common problems reporters face when they get data from government agencies or are trying to build their own databases from scratch (paper records, the dreaded image PDF of an Excel spreadsheet, etc.). To do this, we use OCR and, in this example, Tesseract.
Before we get started, change into the directory for this scenario:
cd scenario_two
๐ Tesseract has many options. You can see them by typing:
tesseract -h
We're not going to go into detail on many of these options, but you can read more here.
๐ The basic command structure looks like this:
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
Let's look at a single image file. In this case, that's the wh_salaries.png file in our imgs folder. This is the first page of our White House salaries PDF, but notice that it is not searchable.
This is perhaps the simplest use of tesseract. We will feed in our image file and have it output a searchable PDF.
๐ Run the tesseract command to create a searchable PDF from an image:
tesseract wh_salaries.png out pdf
You start with a file like this:
You should get a file named out.pdf, and you can see that it's searchable.
๐ As with the previous scenario, change out of the scenario directory so we're ready to move on to the next scenario:
cd ..
So far, we've covered extracting text from computer-generated files and doing some basic OCR. Now, we'll turn to creating searchable PDFs out of image files. To do this, we'll be adding another command line tool called ImageMagick, an image editing and manipulation program.
๐ Before we get started, change directory to the one for this scenario:
cd scenario_three
We will be using the convert tool from ImageMagick.
ImageMagick has some great documentation that explains all of its many options. You can find it here.
๐๐ฅ The general syntax of the convert command on Windows is:
magick convert [options ...] file [ [options ...] file ...] [options ...] file
๐๐ On a Mac or a Linux system, you can omit the magick supercommand:
convert [options ...] file [ [options ...] file ...] [options ...] file
If you're familiar with photography or document scanning, you know that the proper image resolution is essential for electronic imaging. When it comes to OCR, this is even more true.
The general standard for OCR is 300 dpi, or 300 dots per inch, though ABBYY recommends using 400-600 for font sizes smaller than 10 point. In ImageMagick, this is specified using the -density flag. Further below, we'll tell ImageMagick to take our PDF document and convert it to an image at 300 dpi.
First, we have to convert it to an image so we can run it through tesseract.
๐๐ฅ We'll use ImageMagick's convert tool.
magick convert image_pdf_2.pdf russia_findings.tiff
๐๐ On a Mac:
convert image_pdf_2.pdf russia_findings.tiff
So let's take a look at the image we just created and check its dpi.
On Windows, the dpi can be found through the file properties window:
- Right-click on the file name in Explorer to get the context menu.
- Click Properties from the context menu.
- Click on the Details tab.
- Scroll down to the Image section.
- Look for the Horizontal resolution and Vertical resolution labels.
On a Mac, an easy way to find the dpi of an image is to use Preview:
- Open the image in Preview.
- Go to Tools and click Show Inspector.
- Look in the Inspector pane for the dpi.
So our dpi is 72, which is likely fine for this document, but let's go ahead and increase it using convert. This will increase the size of the TIFF we create (so be warned about file bloat), but it's only a temporary file that we're using to get the best text recognition.
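The dpi can also be read from the command line with ImageMagick's identify tool, which is handy once you're checking more than a few files. This is only a sketch: below_ocr_dpi is a hypothetical helper introduced here, the 300 dpi threshold is the general standard mentioned above, and the block assumes identify is on your PATH (on Windows with ImageMagick 7, the equivalent call would be magick identify).

```shell
# Hypothetical helper: succeed (exit 0) when a dpi value is below a minimum
# (300 unless a second argument is given). awk handles the comparison
# because the value may be a decimal or carry a unit suffix such as
# "72 PixelsPerInch".
below_ocr_dpi() {
  awk -v dpi="$1" -v min="${2:-300}" 'BEGIN { exit !((dpi + 0) < (min + 0)) }'
}

img=russia_findings.tiff
if command -v identify >/dev/null 2>&1 && [ -f "$img" ]; then
  # %x is the horizontal resolution; a multi-page TIFF prints one line per page.
  dpi=$(identify -units PixelsPerInch -format '%x\n' "$img" | head -n 1)
  if below_ocr_dpi "$dpi"; then
    echo "$img is $dpi dpi: consider re-converting with -density 300"
  else
    echo "$img is $dpi dpi: fine for OCR"
  fi
fi
```

The block prints nothing when identify or the image is missing, so it is safe to paste into a longer script.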
๐๐ฅ Let's do this with our Russia document.
magick convert -density 300 image_pdf_2.pdf -depth 8 -strip -background white -alpha off russia_findings.tiff
๐๐ On a Mac, this would be:
convert -density 300 image_pdf_2.pdf -depth 8 -strip -background white -alpha off russia_findings.tiff
So let's break this down.
convert: invokes ImageMagick's convert tool
-density 300: sets the resolution used to render our PDF to 300 dpi
image_pdf_2.pdf: our file that we're converting to an image
-depth 8: "This the number of bits in a color sample within a pixel. Use this option to specify the depth of raw images whose depth is unknown such as GRAY, RGB, or CMYK, or to change the depth of any image after it has been read", according to ImageMagick documentation.
-strip: strips off any junk on the file (profiles, comments, etc.)
-background white: sets the background to white to help with contrasting our text
-alpha off: turns off the alpha channel, which generally controls the transparency of the image. A great explanation here.
๐ Run the extracted image through tesseract to create a searchable text PDF.
tesseract russia_findings.tiff -l eng russia_findings_enh pdf
And you've got a searchable PDF!
๐ Let's take a look at the underlying text now.
pdftotext russia_findings_enh.pdf russia_text.txt
๐ We also could have output directly to a text file like this.
tesseract russia_findings.tiff -l eng russia_findings_enh txt
OCRing is not a perfect science and, most of the time, it isn't simple. One recent example: public financial disclosures of federal judges are multi-page documents, but they are released as extremely long, single TIFF files. You can find a similar test file here.
And you'll notice that the pages need to be split.
The workflow below walks through one example of how to solve the problem using ImageMagick and Tesseract.
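๐ This blows up the image, adjusts the image resolution and slightly raises the brightness to help bring out the text (in -brightness-contrast 5x0, the first number is brightness and the second is contrast, so the contrast itself is left unchanged). It then outputs a grayscale version, set at 8-bit depth, named Walker16_enh.tiff.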
convert -resize 400% -density 450 -brightness-contrast 5x0 Walker16.tiff -set colorspace Gray -separate -average -depth 8 -strip Walker16_enh.tiff
Next we use ImageMagick's crop to split it up into a multi-page TIFF.
To find the dimensions, first use Preview's Inspector tool. You'll see the dimensions of the entire image file. (NOTE: This screenshot is from a different file since I added this later.)
The first value is the width and the second value is the length (height). To get the pixel length of each page, divide the total length by the number of pages you should have in the final file.
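The division can be done in the shell itself. The numbers below are illustrative only: a hypothetical strip 3172 pixels wide and 25200 pixels long that holds 6 pages, which are not the measured dimensions of the Walker file. The page_height helper is also just a name introduced for this sketch.

```shell
# Hypothetical helper: height of one page in pixels, given the height of the
# whole strip and the number of pages it contains (integer division).
page_height() {
  echo $(( $1 / $2 ))
}

# Illustrative numbers only: a strip 25200 pixels long holding 6 pages.
width=3172
height=$(page_height 25200 6)
echo "-crop ${width}x${height}"
```

With these example values, the script prints -crop 3172x4200. If you'd rather not open Preview, ImageMagick's identify -format '%w %h' should report the width and height of an image in pixels.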
๐
convert Walker16_enh.tiff -crop 3172x4200 Walker16_to_ocr.tiff
๐ Then we convert that image into a searchable pdf.
tesseract Walker16_to_ocr.tiff -l eng Walker16 pdf
Exploring the various options and fine-tuning your skills with ImageMagick can help prepare you for the next big step: Batch processing of documents, which you can hear more about here at NICAR.
If you want to run the tutorial on your machine, you'll need to install Xpdf, tesseract, ImageMagick and possibly Ghostscript on your computer.
For Mac, we'll be using the Homebrew package manager.
๐๐ To install tesseract, you will use the following command.
brew install tesseract
๐๐ For Xpdf, you will use this.
brew install xpdf
๐๐ We will also install libtiff, a dependency for ImageMagick that we will need.
brew install libtiff
๐๐ Then we'll install Ghostscript.
brew install ghostscript
๐๐ And for ImageMagick you will use this.
brew install imagemagick
๐๐ Before we go on from here, let's make sure we have the tiff delegate installed. You can check like this:
convert -list configure
Scroll down to DELEGATES and make sure it includes tiff. For example:
DELEGATES bzlib mpeg freetype jng jpeg lzma png tiff xml zlib
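Instead of scrolling, you can let grep do the looking. This is a small sketch for Mac or Linux, where the command is plain convert. The has_tiff_delegate helper is a hypothetical name introduced here, and it assumes the configure listing contains a line starting with DELEGATES, as in the example above.

```shell
# Hypothetical helper: read `convert -list configure` output on stdin and
# succeed only if the DELEGATES line lists tiff as a whole word.
has_tiff_delegate() {
  grep '^DELEGATES' | grep -qw 'tiff'
}

if command -v convert >/dev/null 2>&1; then
  if convert -list configure | has_tiff_delegate; then
    echo "tiff delegate found"
  else
    echo "tiff delegate missing: try brew install libtiff" >&2
  fi
fi
```

Matching on the whole word avoids a false positive from some other delegate whose name merely contains "tiff".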
๐๐ First check to make sure that libtiff is installed. You can do this by running
brew list
๐๐ If libtiff is not in the list, then install it using brew.
brew install libtiff
๐๐ Now check to make sure that imagemagick is recognizing libtiff is installed as a dependency.
brew info imagemagick
If you're good to go, it should look something like this:
==> Dependencies
Build: pkg-config โ
Required: freetype โ, jpeg โ, libheif โ, libomp โ, libpng โ, libtiff โ, libtool โ, little-cms2 โ, openexr โ, openjpeg โ, webp โ, xz โ
Now that we've installed ghostscript and the tiff delegate, let's continue on with our example.
See the installation instructions in the documentation for these packages to find and install the software on Windows or Linux.
- Xpdf documentation
- tesseract documentation
- ImageMagick documentation
- Ghostscript documentation
I created this tutorial for NICAR 2019 but it relies on many helpful open source resources that deserve credit. They are listed below. Thanks for sharing your work with the rest of the world.
Tesseract documentation
ImageMagick documentation
pdftotext documentation