Giter VIP home page Giter VIP logo

carpenter's Introduction

Carpenter

Creates tables out of images

Carpenter takes images that contain tables as pixels and tries to convert them to HTML tables (e.g. HTML tables).

If you need to extract tables out of text PDFs, have a look at Tabula.

Installation

You need OpenCV and the OpenCV python bindings. On OS X a brew install opencv installs OpenCV, but you have to make the Python library available (look at the last lines of brew's output). Copying the cv*.(so|py) to you virtual env's site-packages folder is enough.

Carpenter Make Tables (Command line)

python make_tables.py [options] imagefile.png > output.html

Carpenter Workshop (Web Interface)

The Carpenter Workshop has the goal to make it easy to extract the same kinds of tables out of multiple PDF files by detecting table layouts and applying predefined extraction steps. It's not yet there, though.

It requires libpoppler with the pdftohtml/pdfimages commands and ImageMagick with convert. You also have to install the web dependencies with

pip install -r pip-requirements.txt

Start the workshop with

python open_workshop.py

Also run the Carpenter task worker with a Celery worker queue of your choosing:

rabbitmq-server &
celeryd -l INFO -I carpenter.tasks

Workshop configuration's defaults are in carpenter.default_settings and can be overridden with something like

export CARPENTER_SETTINGS=my_settings.cfg

Carpenter Tools (Python library)

There are a couple of modules inside carpenter that can be used more or less independently to accomplish some carpentry tasks:

  • carpenter.bench: Takes a PDF file and extracts pages and images
  • carpenter.ruler: Detects horizontal and vertical lines in an image
  • carpenter.paper: Takes horizontal and vertical lines and creates a table structure out of them
  • carpenter.cutter: Cuts table cells out of images
  • carpenter.plane: Runs OCR on extracted table cell images

For usage see make_tables.py, carpenter.tasks and the code itself.

carpenter's People

Contributors

stefanw avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.