pdfextract

A wrapper around tabula that turns PDFs into document trees. Useful for data scraping.

Usage

lein dependency: [pdfextract "0.2.0"]

(require '[pdfextract.core :as ex]) ;; the libspec must be quoted when calling require directly

(def tree (ex/extract-content (java.io.File. "example.pdf"))) ;; extract-content takes anything that byte-streams can convert to an InputStream

;; value of tree:
({:min-x 140.4600067138672,
  :max-x 454.24000549316406,
  :min-y 58.7599983215332,
  :max-y 68.45999813079834,
  :text
  ([{:text "1.  PRODUCT AND COMPANY IDENTIFICATION",
     :direction 0.0,
     :space-width 3.8864403,
     :font {:font-family nil, :font-name nil, :font-stretch nil},
     :font-size 1.0}])}
 {:min-x 28.3799991607666,
  :max-x 445.4099979400635,
  :min-y 78.06999969482422,
  :max-y 85.01999950408936,
  :text
  ([{:text "Product Name: ",
     :direction 0.0,
     :space-width 2.7855604,
     :font {:font-family nil, :font-name nil, :font-stretch nil},
     :font-size 1.0}]
   [{:text
     "Klean Strip Adhesive Remover / Klean Strip Premium Stripper",
     :direction 0.0,
     :space-width 2.7855604,
     :font {:font-family nil, :font-name nil, :font-stretch nil},
     :font-size 1.0}])})

(def text-nodes (ex/text-tree tree))

;; value of text-nodes:
((("1.  PRODUCT AND COMPANY IDENTIFICATION"))
 (("Product Name: ")
  ("Klean Strip Adhesive Remover / Klean Strip Premium Stripper")))

License

Copyright © 2017 Bob Poekert

Distributed under the MIT license.
