pdf-extract

A tool and library that can extract various areas of text from a PDF, especially a scholarly article PDF. It performs structural analysis to determine column bounds, headers, footers, sections, titles and so on. It can analyse and categorise sections into reference and non-reference sections and can split reference sections into individual references.

The latest version is 0.1.1. Earlier versions are far less reliable.

pdf-extract requires Ruby 1.9.1 or above.

Quick start

Install the latest version with:

$ gem install pdf-extract

Quick examples

Extract references from a PDF:

$ pdf-extract extract --references myfile.pdf

Extract references and a title from a PDF:

$ pdf-extract extract --references --titles myfile.pdf

Mark the locations of headers, footers and columns in a new PDF:

$ pdf-extract mark --columns --headers --footers myfile.pdf

Extract regions of text from a PDF, preserving line information (offsets from region origin):

$ pdf-extract extract --regions myfile.pdf

Extract regions of text from a PDF without line information (prettier and easier to read):

$ pdf-extract extract --regions --no-lines myfile.pdf

Resolve references to DOIs and output related metadata as BibTeX:

$ pdf-extract extract-bib --resolved_references myfile.pdf
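To run the reference extraction over several PDFs, the single-file command can be wrapped in a loop. This is only a sketch. It assumes pdf-extract writes its result to standard output, and the out_name helper and its ".refs.txt" suffix are made up for illustration, not part of pdf-extract.

```shell
# Hypothetical helper: derive an output file name from a PDF path.
out_name() {
  printf '%s.refs.txt\n' "${1%.pdf}"
}

for pdf in ./*.pdf; do
  # Skip the literal pattern when no PDF is present in the directory.
  [ -e "$pdf" ] || continue
  # Assumption: the extracted references are printed to standard output.
  pdf-extract extract --references "$pdf" > "$(out_name "$pdf")"
done
```

Adjust the suffix to match whatever format the output turns out to be.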

Problems

pdf-extract mistakes normal text for references when attempting to extract references.

pdf-extract attempts to identify reference sections by comparing section features to an idealised model of a reference section. Sometimes this can go wrong. If pdf-extract is producing reference output that clearly includes something that is not a reference, try reducing the reference_flex slightly:

$ pdf-extract extract --references --set reference_flex:0.18 myfile.pdf

The default for reference_flex is 0.2. Make small decrements.

pdf-extract extracts no references.

As above, but try increasing the reference_flex a bit at a time:

$ pdf-extract extract --references --set reference_flex:0.25 myfile.pdf

Keep trying with small increments to reference_flex. Note that a reference_flex of 1 means pdf-extract will identify all sections as reference sections.
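The trial-and-error adjustment of reference_flex can be sketched as a small sweep. The flex_steps helper below is hypothetical, not part of pdf-extract; it only prints candidate values. The loop is a dry run that echoes the commands so they can be checked before running any of them. Use a negative step to produce decrements instead.

```shell
# Hypothetical helper: print COUNT reference_flex values,
# starting at START and moving by STEP each time.
# Usage: flex_steps START STEP COUNT
flex_steps() {
  LC_ALL=C awk -v start="$1" -v step="$2" -v n="$3" \
    'BEGIN { for (i = 0; i < n; i++) printf "%.2f\n", start + i * step }'
}

# Dry run: print one command per candidate value instead of executing it.
for flex in $(flex_steps 0.25 0.05 4); do
  echo pdf-extract extract --references --set "reference_flex:$flex" myfile.pdf
done
```

Remove the echo to run each command, and stop at the first value that gives sensible reference output.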

pdf-extract is still producing weird output after fiddling with reference_flex.

Have a look at pdf-extract's settings:

$ pdf-extract settings

This command will produce a list of settings along with descriptions of what they affect. They can be set by passing a --set key:value argument to pdf-extract.

Contributors

jdherman, kjw, pwnall
