pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js, with several output formats.

Install

npm install -g pdf-gold-digger

Usage

pdfdig -i some_file.pdf

Available commands

pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json
  
  --input    or  -i   pdf file location (required)
  --output   or  -o   output directory (optional - default "out")
  --debug    or  -d   show debug information (optional - default "false")
  --format   or  -f   format (optional - default "text") - ("text,json,xml,html") 
  --font     or  -t   extract fonts as ttf files (optional)
  --password or  -p   password
  --help     or  -h   display this help message
  --version  or  -v   display version information
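
For scripted use, the flags above can be assembled from Node.js. The following is a minimal sketch, assuming `pdfdig` is installed globally and available on `PATH`. The helper names `buildPdfdigArgs` and `runPdfdig` are hypothetical and not part of pdf-gold-digger; only the `-i`, `-o` and `-f` flags shown in the example line are used.

```javascript
const { spawnSync } = require('child_process');

// Formats listed in the help text above.
const FORMATS = ['text', 'json', 'xml', 'html'];

// Hypothetical helper: turns an options object into a pdfdig argument list.
function buildPdfdigArgs(options) {
  if (!options || !options.input) {
    throw new Error('input is required (-i)');
  }
  const args = ['-i', options.input];
  if (options.output) {
    args.push('-o', options.output);
  }
  if (options.format) {
    if (!FORMATS.includes(options.format)) {
      throw new Error(`unknown format: ${options.format}`);
    }
    args.push('-f', options.format);
  }
  return args;
}

// Hypothetical helper: runs pdfdig synchronously and returns the spawn result
// (status, stdout, stderr). Assumes the pdfdig binary is on PATH.
function runPdfdig(options) {
  return spawnSync('pdfdig', buildPdfdigArgs(options), { encoding: 'utf8' });
}
```

Calling `runPdfdig({ input: 'some_file.pdf', output: 'out', format: 'json' })` would then be equivalent to `pdfdig -i some_file.pdf -o out -f json`.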

Advanced usage

git clone https://github.com/vane/pdf-gold-digger
sh demo.sh

and see the results in the out directory

Documentation

pdf-gold-digger

Features:

  • extract text
    • separate each page
    • separate each line
    • separate font information
  • extract images
  • output formats
    • text -f text (default)
    • json -f json
    • xml -f xml
    • html -f html
  • specify output directory

TODO:

  • load pdf from remote location
    • from url
  • output to markdown format
  • pack output to zip
  • extract tables
  • extract forms
  • extract drawings
  • extract text from glyphs
    • ability to provide input file for glyph path to letter
    • detect when unicode is not provided or mangled
    • get bounding box from text and draw it on canvas
    • use tesseract.js as optional fallback

Contributors

vane

pdf-gold-digger's Issues

Extract Images

Extract images using the file name format
page_{page_number}image{image_number}.{image_format}

pdf.OPS.paintJpegXObject
pdf.OPS.paintImageXObject
pdf.OPS.paintInlineImageXObject
pdf.OPS.paintImageMaskXObject
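
One way to approach this issue is to scan a page's operator list for the image operators named above. The following is only a sketch, not the project's implementation. It assumes an operator list of the shape `{ fnArray, argsArray }`, as returned by pdf.js `page.getOperatorList()`, and takes the `OPS` table as a parameter, since not every listed operator may be defined in every pdf.js version. The function names `imageFileName` and `findImageOps` are hypothetical.

```javascript
// Operator names taken from the issue text above.
const IMAGE_OP_NAMES = [
  'paintJpegXObject',
  'paintImageXObject',
  'paintInlineImageXObject',
  'paintImageMaskXObject',
];

// Builds a file name following the format stated in the issue.
function imageFileName(pageNumber, imageNumber, imageFormat) {
  return `page_${pageNumber}image${imageNumber}.${imageFormat}`;
}

// Returns every image-painting operation found in an operator list.
// OPS is the pdf.js operator table; names missing from it are skipped.
function findImageOps(OPS, operatorList) {
  const codeToName = new Map();
  for (const name of IMAGE_OP_NAMES) {
    if (OPS[name] !== undefined) {
      codeToName.set(OPS[name], name);
    }
  }
  const found = [];
  operatorList.fnArray.forEach((code, index) => {
    if (codeToName.has(code)) {
      found.push({
        index,
        op: codeToName.get(code),
        args: operatorList.argsArray[index],
      });
    }
  });
  return found;
}
```

Each entry in the result carries the operator's position and arguments, so a later step can resolve the image data and write it under the name produced by `imageFileName`.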

Draw basic svg data from pdf

pdf.OPS.setFillRGBColor
pdf.OPS.setStrokeRGBColor
pdf.OPS.setStrokeColorN
pdf.OPS.setFillColorN
pdf.OPS.shadingFill
pdf.OPS.setDash
pdf.OPS.setGState
pdf.OPS.fill
pdf.OPS.eoFill
pdf.OPS.stroke
pdf.OPS.fillStroke
pdf.OPS.eoFillStroke
pdf.OPS.clip
pdf.OPS.eoClip

XML Output

Implement class src/pdf/Formatter/FormatterXML

Too many spaces between characters

Some sentences have spaces inside words.
Solution:
Probably need to measure every glyph and, based on the space between two glyphs, add a generic space between words.
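
The idea above could be sketched as follows. This is an illustration under stated assumptions, not the project's code: each glyph is assumed to be an object with a hypothetical shape `{ text, x, width }`, glyphs are assumed to be sorted left to right on a single line, and the threshold (`gapRatio` times the average glyph width) is an arbitrary starting point that would need tuning per font.

```javascript
// Joins glyphs into a line of text, inserting a space only where the
// horizontal gap between two neighbouring glyphs exceeds a threshold.
function joinGlyphs(glyphs, gapRatio = 0.3) {
  if (glyphs.length === 0) {
    return '';
  }
  // Threshold is relative to the average glyph width on this line.
  const avgWidth = glyphs.reduce((sum, g) => sum + g.width, 0) / glyphs.length;
  const threshold = avgWidth * gapRatio;

  let line = glyphs[0].text;
  for (let i = 1; i < glyphs.length; i++) {
    const prev = glyphs[i - 1];
    // Gap between the right edge of the previous glyph and the left edge of this one.
    const gap = glyphs[i].x - (prev.x + prev.width);
    if (gap > threshold) {
      line += ' ';
    }
    line += glyphs[i].text;
  }
  return line;
}

const sample = [
  { text: 'p', x: 0, width: 5 },
  { text: 'd', x: 5.2, width: 5 },
  { text: 'f', x: 10.4, width: 5 },
  { text: 'd', x: 20, width: 5 },
  { text: 'i', x: 25.1, width: 3 },
  { text: 'g', x: 28.2, width: 5 },
];
console.log(joinGlyphs(sample)); // prints "pdf dig"
```

Small kerning gaps inside a word stay below the threshold and are ignored, while the wider gap between "f" and "d" produces a single space.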
