
pdf2image's Introduction

pdf2image

A Python 3 module that wraps the pdftoppm utility to convert PDF pages into PIL Image objects.

How to install

pip3 install pdf2image

If you don't already have Pillow, install it with pip3 install pillow

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a larger package called poppler.

Windows users will have to install poppler for Windows.

Mac users will have to install poppler for Mac.

Linux users will usually have pdftoppm pre-installed with the distro (tested on Ubuntu and Arch Linux). If it is not, install it with your package manager, for example sudo apt install poppler-utils on Ubuntu.
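Since everything depends on pdftoppm being reachable, it can be worth checking for it before converting anything. The sketch below only looks the binary up on your PATH with the standard library; the helper names find_pdftoppm and pdftoppm_available are made up for this example and are not part of pdf2image.

```python
import shutil


def find_pdftoppm(binary_name="pdftoppm"):
    """Return the full path to the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


def pdftoppm_available():
    """True when a pdftoppm executable can be found on PATH."""
    return find_pdftoppm() is not None
```

If pdftoppm_available() returns False, poppler is most likely missing or not on your PATH.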

How does it work?

from pdf2image import convert_from_path, convert_from_bytes

Then simply do:

images = convert_from_path('/home/kankroc/example.pdf')

OR

images = convert_from_bytes(open('/home/kankroc/example.pdf', 'rb').read())

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/kankroc/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.
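A common next step is writing every page to disk. Here is a minimal sketch, assuming each item in the list has a PIL-style save(path) method; save_pages and its file naming scheme are hypothetical, not something pdf2image provides.

```python
import os


def save_pages(images, out_dir, stem="page", extension="png"):
    """Save each image as <stem>-<n>.<extension> and return the paths.

    `images` is any iterable of objects with a PIL-style save(path) method,
    such as the list returned by convert_from_path.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, image in enumerate(images, start=1):
        path = os.path.join(out_dir, f"{stem}-{number:03d}.{extension}")
        image.save(path)  # Pillow infers the file format from the extension
        paths.append(path)
    return paths
```

With the images list from above, save_pages(images, 'pages') would write pages/page-001.png, pages/page-002.png and so on.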

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm')

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm')

What's new?

  • userpw parameter allows you to set a password to unlock the converted PDF (-upw in the cli of pdftoppm)
  • thread_count parameter allows you to set how many threads will be used for conversion.
  • first_page parameter allows you to set a first page to be processed by pdftoppm (-f in the cli of pdftoppm)
  • last_page parameter allows you to set a last page to be processed by pdftoppm (-l in the cli of pdftoppm)
  • fmt parameter allows you to specify an output format. Currently supported formats are jpg, png, and ppm
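To catch typos before pdftoppm is even started, the parameters above can be checked up front. This is only a sketch: build_options is a hypothetical helper, the default of 1 for thread_count is an assumption, and pages are treated as 1-based because that is how pdftoppm's -f and -l count them.

```python
SUPPORTED_FORMATS = ("jpg", "png", "ppm")  # formats listed above


def build_options(first_page=None, last_page=None, fmt="ppm",
                  thread_count=1, userpw=None):
    """Validate the arguments and return them as a keyword dictionary."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    if first_page is not None and first_page < 1:
        raise ValueError("first_page must be at least 1")
    if (first_page is not None and last_page is not None
            and last_page < first_page):
        raise ValueError("last_page must not be smaller than first_page")
    options = {"first_page": first_page, "last_page": last_page,
               "fmt": fmt, "thread_count": thread_count}
    if userpw is not None:
        options["userpw"] = userpw  # only passed for protected PDFs
    return options
```

The result can then be unpacked into the call, e.g. convert_from_path('/home/kankroc/example.pdf', **build_options(first_page=1, last_page=3, fmt='jpg')).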

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
  • Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
  • If I/O is your bottleneck, using the JPEG format can lead to significant gains.
  • PNG format is pretty slow; I am investigating the issue.
  • If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
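If you would rather time your own documents than the bundled tests, a small harness like the one below is enough. It is a sketch using only the standard library; time_call and compare_settings are hypothetical names, and the function being timed is whatever you pass in (for example convert_from_path).

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func repeatedly and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def compare_settings(func, target, settings):
    """Time func(target, **options) for each named option set."""
    return {name: time_call(func, target, **options)
            for name, options in settings.items()}
```

For instance, compare_settings(convert_from_path, '/home/kankroc/example.pdf', {'ppm': {'fmt': 'ppm'}, 'jpg': {'fmt': 'jpg'}}) returns a dictionary mapping each label to its best time in seconds.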

Exception handling

pdftoppm does not throw any exceptions, so any file that could not be converted or processed will return an empty list of images. The philosophy behind this choice is simple: if the file is corrupted or not found, no image can be extracted, and returning an empty list makes sense. (This is up for discussion.)
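If you prefer a failed conversion to be loud rather than silent, you can wrap the result yourself. The following is a minimal sketch; EmptyConversionError and require_pages are hypothetical names and not part of pdf2image.

```python
class EmptyConversionError(ValueError):
    """Raised when a conversion produced no pages (hypothetical helper)."""


def require_pages(images, source="<unknown>"):
    """Return images unchanged, or raise if the list is empty."""
    if not images:
        raise EmptyConversionError(
            f"no pages could be extracted from {source!r}")
    return images
```

Usage would look like images = require_pages(convert_from_path(path), source=path).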

Limitations / known issues

  • A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
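Besides using an output folder, another way to keep memory bounded is to convert a few pages at a time with first_page and last_page. The sketch below only computes the page ranges; page_ranges is a hypothetical helper, and it assumes you already know the total page count from somewhere else.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)
```

Each pair can be passed straight to convert_from_path(pdf_path, first_page=first, last_page=last), so only chunk_size pages are held in memory at once, provided you process or save each batch before moving on.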

Contributors

belval, minarth, josephernest
