
camelot's Introduction

Camelot: PDF Table Extraction for Humans


Camelot is a Python library that can help you extract tables from PDFs!

Note: You can also check out Excalibur, the web interface to Camelot!


Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

Camelot also comes packaged with a command-line interface!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

You can check out some frequently asked questions here.

Why Camelot?

  • Configurability: Camelot gives you control over the table extraction process with tweakable settings.
  • Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
  • Output: Each table is extracted into a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
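The Metrics point above can be sketched as a small filter over each table's parsing_report. This is a minimal sketch, not part of Camelot: the helper name keep_good_tables and both thresholds are assumptions chosen for illustration; only the 'accuracy' and 'whitespace' keys come from the parsing_report shown earlier.

```python
def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing_report passes both thresholds.

    The threshold defaults are illustrative assumptions, not Camelot defaults.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            good.append(table)
    return good
```

With a real PDF, the input would be the TableList returned by camelot.read_pdf('foo.pdf').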

See comparison with similar libraries and tools.

Support the development

If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.

Installation

Using conda

The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:

$ pip install "camelot-py[base]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[base]"

Documentation

The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

Contributing

The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.

camelot's People

Contributors

anakin87, christinegarcia, davidkong0987, dependabot[bot], dimitern, eamanu, gison93, jedie, jonathanlloyd, kolanich, kshitiz305, lucas-c, martinthoma, miltonarango, oshawk, pecey, pevisscher, pqrth, pravarag, stevestock, suyashb95, symroe, tchx84, tiagosamaha, tksumanth1994, vaibhavmule, vasantvohra, vinayak-mehta, vmesel, yatintaluja


camelot's Issues

AttributeError from PDFMiner

@igormp

Although I can't upload the bad PDF due to NDA reasons, this issue is well documented here, along with some solutions to it, and there's even a PR in place to fix that, but there seems to be no maintainer available to merge it.

I'm not sure how this should be handled, since it's a PDFMiner problem, which seems to be unmaintained, and that reflects directly on camelot.

ImportError: cannot import name 'PDFObjectNotFound'

I got this error when I ran the piece of code below. The failing import is "from .pdftypes import PDFObjectNotFound".

I have installed both dependencies (ghostscript and tkinter), so I have no idea why it's throwing an import error.

import camelot

table = camelot.read_pdf('data/esign.pdf', suppress_stdout=True)
print(table)

Hybrid flavor combining lattice and stream

Shift text up based on the presence of horizontal lines and some metric based on blank rows. If the vertical lines are not present then, Stream generated columns/user given separators should be used.

For example:
image

Error: openpyxl.utils.exceptions.IllegalCharacterError

ERROR:root:
Traceback (most recent call last):
  File "/home/myusername/.local/lib/python3.7/site-packages/excalibur/tasks.py", line 123, in extract
    tables.export(f_datapath, f=f, compress=True)
  File "/home/myusername/.local/lib/python3.7/site-packages/camelot/core.py", line 745, in export
    table.df.to_excel(writer, sheet_name=sheet_name, encoding="utf-8")
  File "/home/myusername/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 2257, in to_excel
    engine=engine,
  File "/home/myusername/.local/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 739, in write
    freeze_panes=freeze_panes,
  File "/home/myusername/.local/lib/python3.7/site-packages/pandas/io/excel/_openpyxl.py", line 416, in write_cells
    xcell.value, fmt = self._value_with_fmt(cell.val)
  File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 252, in value
    self._bind_value(value)
  File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 205, in _bind_value
    value = self.check_string(value)
  File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 169, in check_string
    raise IllegalCharacterError
openpyxl.utils.exceptions.IllegalCharacterError

By the way, which is the official issue tracker? This one or https://github.com/atlanhq/camelot/issues
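One possible workaround for the IllegalCharacterError above is to strip control characters from the cell text before exporting to Excel. The sketch below is not from Camelot or Excalibur: the helper name is hypothetical, and the character set is an assumption meant to match the control characters that openpyxl rejects (tab, newline and carriage return are kept).

```python
import re

# Control characters that an XLSX cell cannot hold. Assumption: this mirrors
# the set openpyxl rejects; tab (\x09), newline (\x0a) and CR (\x0d) are kept.
ILLEGAL_XLSX_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_chars(value):
    """Remove illegal control characters from strings; pass other values through."""
    if isinstance(value, str):
        return ILLEGAL_XLSX_CHARS.sub("", value)
    return value
```

The function would be applied cell by cell to table.df before calling to_excel, for example with DataFrame.applymap in the pandas version shown in the traceback.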

Print table to stdout

@crotoc

Hi there,
Thanks very much for your great project; it has saved me a lot of time! Now I want to build a pipeline using camelot and need to know how to print the output to stdout. Please let me know if there is a way!

Thanks,
Rui
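Since each table exposes a pandas DataFrame as table.df, one way to do this is to write the CSV to sys.stdout instead of a file. This is a minimal sketch: print_tables is a hypothetical helper, not a Camelot function, and it relies only on DataFrame.to_csv accepting an open text stream.

```python
import sys


def print_tables(tables, out=sys.stdout):
    """Write every table to `out` as CSV, with a blank line between tables."""
    for i, table in enumerate(tables):
        if i:
            out.write("\n")
        # table.df is a pandas DataFrame; to_csv accepts an open text stream.
        table.df.to_csv(out, index=False)
```

In a pipeline this could be called as print_tables(camelot.read_pdf('foo.pdf')) and the output piped to the next command.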

Useful stats in parsing_report

Use-case:

Help the user drop tables in an ETL workflow based on parsing accuracy, whitespace in table cells.

More stat ideas:

  • A boolean that tells if there might be encoding errors in the output.
  • Distribution of font sizes?

Great library, but dependencies ??!!

Note: This is not an issue, but there is no better place to discuss this.

Stats below are pulled from PyPI downloads.
Camelot does a better job than the other libraries, so what do you think explains its lower usage?

image

Remove dependency on ghostscript and opencv

Something to think about for the future:

  • OpenCV: maybe implement morph transform within the library itself/vendorize the code (not sure about dependency on C extensions)?
  • tk: Required for matplotlib.
  • ghostscript: maybe use some Python library to convert PDF to image (same quality as ghostscript).

Some questions:
[1] Can pdftoppm be an alternative to ghostscript?
[2] Are poppler-utils more widely available (pre-installed) than ghostscript?


@tkelman wrote:

Could the matplotlib dependency be made optional? The plotting features here look like not a lot of code, and it's a pretty complicated dependency to pull in.

Similarly might pillow be a viable smaller alternative to the use of opencv here?


Hello @tkelman! I think making matplotlib optional makes sense. Let me look into it as I go on to adding more tests for the plotting code atlanhq/camelot#127.

Camelot uses adaptive threshold and morphological transformations from opencv. I haven't worked with pillow in the past but a quick google search got me this morph transform equivalent in pillow. I think removing opencv as a dependency would mean replacing the current image processing code with a combination of pillow + adaptive threshold / morph transform implementations. Let me explore this a bit further. Meanwhile if you have any other alternatives or suggestions on how we could do this, would love if you could share them on this thread!


matplotlib is now an optional requirement!


@sweco-sekrsv wrote:

I'm not exactly sure what you are using Ghostscript for, but I switched to pdftoppm for rasterizing PDFs to images. I'm using the CLI tool and calling it from Python.
For my scenarios, it's stable and generates images quicker than Ghostscript. I have had better success with fonts using pdftoppm as well.

I'm on Windows and am using the latest binaries from here:
http://blog.alivate.com.au/poppler-windows

On a side note, it can also fix "broken" PDF files, such as the ones in this ticket:
atlanhq/camelot#306
Resaving them with pdftocairo from the poppler tools makes the files load OK with pdf-miner.

On another side note, I tried making Ghostscript run using multiprocessing (to speed things up), but that did not seem to work very well. I'm not sure Ghostscript is designed to run using several threads.
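Calling pdftoppm from Python, as described above, could look roughly like this. It is a sketch under assumptions: pdftoppm must be on the PATH, the 300 dpi default is arbitrary, and both function names are made up for illustration.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command line that renders each page to a PNG."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, out_prefix]


def rasterize(pdf_path, out_prefix, dpi=300):
    """Run pdftoppm; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, dpi), check=True)
```

pdftoppm writes one image per page, using out_prefix followed by the page number as the file name.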

why module 'camelot' has no attribute 'read_pdf'

import camelot

tables = camelot.read_pdf("foo.pdf")
tables[0].df
tables.export("foo.csv", f="csv", compress=True)
tables[0].to_csv("foo.csv")

Why?

File "C:/Users/jiuyang.wei/Desktop/ocr/another/camelot.py", line 8, in <module>
import camelot

File "C:\Users\jiuyang.wei\Desktop\ocr\another\camelot.py", line 10, in <module>
tables=camelot.read_pdf("foo.pdf")

AttributeError: module 'camelot' has no attribute 'read_pdf'
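A likely reading of the traceback above: the script itself is saved as camelot.py, so "import camelot" finds the script instead of the installed library, and the script has no read_pdf. Renaming the script is the usual remedy. The check below is a small illustrative helper (the name shadows_package is hypothetical) that flags this situation from a file path.

```python
import os


def shadows_package(script_path, package_name):
    """True if the script's file name would shadow `package_name` on import."""
    # Accept both Windows and POSIX separators, whatever the host OS is.
    file_name = script_path.replace("\\", "/").rsplit("/", 1)[-1]
    stem, ext = os.path.splitext(file_name)
    return ext == ".py" and stem == package_name
```

Printing camelot.__file__ right after the import also shows which file was actually loaded.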

winreg fails to find dll for ghostscript in 64bit Windows 10

I got the following error using camelot.read_pdf('some.pdf'). Tracing back the RuntimeError, it looks like winreg is looking at the wrong view of the registry:

image

I don't know if my setup is odd (my system path may be a bit of a mess), but it may be good to detect which version of Windows is in use and then add the flags access=winreg.KEY_READ | winreg.KEY_WOW64_64KEY.

Error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-42c019f21b87> in <module>
----> 1 tables = camelot.read_pdf('PROCES-RM003D (Diagnostic Objects).pdf', pages='47-49', flavor='lattice')

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    115             suppress_stdout=suppress_stdout,
    116             layout_kwargs=layout_kwargs,
--> 117             **kwargs
    118         )
    119         return tables

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    170             for p in pages:
    171                 t = parser.extract_tables(
--> 172                     p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
    173                 )
    174                 tables.extend(t)

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\parsers\lattice.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
    401             return []
    402 
--> 403         self._generate_image()
    404         self._generate_table_bbox()
    405 

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\parsers\lattice.py in _generate_image(self)
    210 
    211     def _generate_image(self):
--> 212         from ..ext.ghostscript import Ghostscript
    213 
    214         self.imagename = "".join([self.rootname, ".png"])

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py in <module>
     22 #
     23 
---> 24 from . import _gsprint as gs
     25 
     26 

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py in <module>
    245     libgs = __win32_finddll()
    246     if not libgs:
--> 247         raise RuntimeError("Please make sure that Ghostscript is installed")
    248     libgs = windll.LoadLibrary(libgs)
    249 else:

RuntimeError: Please make sure that Ghostscript is installed
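The registry-view suggestion above can be sketched as follows. This is not Camelot's lookup code: the function name is hypothetical and the subkey "SOFTWARE\GPL Ghostscript" is an assumption about where Ghostscript registers itself. Only the access flags come from the report.

```python
import sys


def find_ghostscript_key(subkey=r"SOFTWARE\GPL Ghostscript"):
    """Open the Ghostscript registry key in the 64-bit view.

    Returns None on non-Windows systems or when the key does not exist.
    The default subkey is an assumption, not taken from Camelot.
    """
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    try:
        return winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE,
            subkey,
            0,
            # KEY_WOW64_64KEY makes a 32-bit Python read the 64-bit view.
            winreg.KEY_READ | winreg.KEY_WOW64_64KEY,
        )
    except OSError:
        return None
```

A 32-bit Python on 64-bit Windows otherwise sees the redirected 32-bit view, which would explain a missing entry for a 64-bit Ghostscript install.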

Make PDFHandler more efficient

Every time read_pdf is called, a new PDFHandler object is created and parse is called (which splits the PDF into multiple single-page PDFs). This is inefficient. Instead:

  • Split and store single page PDFs into a temp directory named after the md5 hash of the master PDF file. And then calculate the actual new single page PDFs that are needed, based on the page numbers provided by the user (which can change).
  • In Lattice, convert a single page PDF into an image, if and only if the PNG doesn't exist?
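The caching idea in the first bullet could be sketched like this. Everything here is illustrative: the function names, the "camelot-" directory prefix and the page-N.pdf naming scheme are assumptions, not existing Camelot behaviour.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cache_dir_for(pdf_path, root=None):
    """Create (if needed) and return a cache directory named after the PDF's hash."""
    root = root or tempfile.gettempdir()
    path = os.path.join(root, "camelot-" + file_md5(pdf_path))
    os.makedirs(path, exist_ok=True)
    return path


def missing_pages(cache_dir, pages):
    """Pages whose single-page PDF (page-N.pdf) is not yet in the cache."""
    return [
        p for p in pages
        if not os.path.exists(os.path.join(cache_dir, f"page-{p}.pdf"))
    ]
```

Only the pages returned by missing_pages would need to be split out again on a later call with different page numbers.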

access violation writing 0x076ED670

rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x076ED670

32-bit python
latest ghostscript

Negative value as accuracy of table.

While testing, I came across a case where table.accuracy is a negative number.

PDF:page-3.pdf
Code:

import camelot

tables = camelot.read_pdf('/Users/skatipomu/Table_Extraction_Camelot/page3.pdf', pages="all")
[table.accuracy for table in tables]

Output:
[99.99999999999997, -20.852716930856104]

I think the reason is that the compute_accuracy method in utils.py subtracts the error percentage from 1 when calculating accuracy. The error is supposed to be in the range [0.0, 1.0], but the errors passed to this method are percentages in the range [0, 100], which in turn come from the get_table_index method. Dividing the error by 100 solved the issue for me.

def compute_accuracy(error_weights):
    """Calculates a score based on weights assigned to various
    parameters and their error percentages.

    Parameters
    ----------
    error_weights : list
        Two-dimensional list of the form [[p1, e1], [p2, e2], ...]
        where pn is the weight assigned to list of errors en.
        Sum of pn should be equal to 100.

    Returns
    -------
    score : float

    """
    SCORE_VAL = 100
    try:
        score = 0
        if sum([ew[0] for ew in error_weights]) != SCORE_VAL:
            raise ValueError("Sum of weights should be equal to 100.")
        for ew in error_weights:
            weight = ew[0] / len(ew[1])
            for error_percentage in ew[1]:
                score += weight * (1 - error_percentage)  # the line in question
    except ZeroDivisionError:
        score = 0
    return score

The change is from score += weight * (1 - error_percentage) to score += weight * (1 - error_percentage/100.0).
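Applying the proposed change gives the variant below. It is a sketch of the reporter's fix, assuming the errors reach the function as percentages in [0, 100]; it has not been checked against the rest of Camelot.

```python
def compute_accuracy(error_weights):
    """Score in [0, 100] from [[weight, [error_pct, ...]], ...].

    The weights must sum to 100; each error is a percentage in [0, 100].
    """
    if sum(ew[0] for ew in error_weights) != 100:
        raise ValueError("Sum of weights should be equal to 100.")
    score = 0.0
    try:
        for weight, errors in error_weights:
            share = weight / len(errors)
            for error_percentage in errors:
                # Errors arrive as percentages, so rescale them to [0, 1].
                score += share * (1 - error_percentage / 100.0)
    except ZeroDivisionError:
        # An empty error list means there is nothing to score.
        score = 0.0
    return score
```

With errors of 0% and 20% weighted equally, this yields a score of about 90, and a 100% error can no longer push the score below zero.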

TableList error in Camelot and in Excalibur

Occurs when running the camelot example (or when uploading a pdf in excalibur):

Traceback (most recent call last):
  File "camelottest.py", line 1, in <module>
    import camelot
  File "C:\Program Files\Python37\lib\site-packages\camelot\__init__.py", line 6, in <module>
    from .io import read_pdf
  File "C:\Program Files\Python37\lib\site-packages\camelot\io.py", line 5, in <module>
    from .handlers import PDFHandler
  File "C:\Program Files\Python37\lib\site-packages\camelot\handlers.py", line 8, in <module>
    from .core import TableList
ImportError: cannot import name 'TableList' from 'camelot.core' (C:\Program Files\Python37\lib\site-packages\camelot\core\__init__.py)

Add OCR support

The experimental version exists before this commit 9753889. It uses Tesseract (using pyocr). ocropy looked promising the last time I checked, opening this issue for discussion and experiments around OCR.

Reduce file reads in camelot.handlers._save_page

@niazangels

camelot.handlers._save_page is called as many times as there are pages passed to camelot.read_pdf. Each time this function is invoked, the source PDF is read from disk, parsed using PdfFileReader, and decrypted. This repeated work adds significantly to the run time and can be avoided.

A great way to avoid this is to accept a list of pages instead of a single page and run the _save_pages function only once. The PdfFileReader object can be created once and we can loop over pages to save the pages separately.

I have this already working on a private fork, with one hiccup: the PdfFileReader object gets modified for certain files after successfully looping and extracting ~80 pages in some of my sample PDFs. I create a copy of the original object to work around this, but it's still a whole lot faster than the current approach as it completely avoids the 80+ file reads.

Let me know if this is something you'd like to incorporate, and I'd be happy to raise a pull request.

Cheers, and thanks for all the great work! smile

There's an associated PR atlanhq/camelot#311
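The shape of the change described in this issue, opening the PDF once and looping over the requested pages, can be sketched without any PDF library. The names below are hypothetical and this is not the code from the PR: open_pdf stands in for whatever reads, parses and decrypts the file and is assumed to return an indexable sequence of pages, and write_page stands in for saving one page.

```python
def save_pages(filepath, pages, open_pdf, write_page):
    """Open the PDF once, then hand each requested page to write_page.

    open_pdf(filepath) is assumed to return an indexable sequence of pages;
    write_page(page, page_number) is assumed to save a single page.
    """
    reader = open_pdf(filepath)  # read, parse and decrypt a single time
    for page_number in pages:
        # Page numbers are 1-based for callers, 0-based inside the reader.
        write_page(reader[page_number - 1], page_number)
    return len(pages)
```

The file is read once regardless of how many pages are requested, which is where the saving over one read per page comes from.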

Assuming whole page as one table in stream flavour

Camelot treats the whole page as one table even when there is sufficient space before and after the table.
The only setting I could find is column_tol, which defaults to zero. It doesn't make any difference.
Is there any other setting for this?
One more question:
How are your coordinates different from pdfplumber's?

pdf

Page splitting is very slow for some PDFs

The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such pdfs. Examples: the RNTB pdfs from un-sdg.

Adding a kwarg that lets the user specify the rotation is a minor optimization that can fix this.

Automatically choose flavor based on type of table in PDF

Continuing the conversation from #102.

@imri:

When you say that lattice should work perfectly - I sort of wish to create a generic way to detect and extract tables without having to know which detection method (lattice / stream) is best for a given document - I want to decouple them as much as possible.

@vinayak-mehta

I get your use-case and it is not possible currently through the library itself. But I see two possibilities which can be implemented (both heuristics):

  1. As far as I can tell from NurminenDetectionAlgorithm.java, Tabula first filters out all Lattice-type tables from the document and then looks for Stream-type tables, till it cannot find any more tables. Similarly, we can "couple" both flavors into a single one inside Camelot.
  2. We can create a flavor called guess which automatically chooses between Lattice and Stream.
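A very rough version of the second possibility is to try Lattice first and fall back to Stream when it finds nothing. This sketch is an assumption-laden illustration, not an existing Camelot flavor and not Tabula's algorithm: the function name is hypothetical, and read_pdf is passed in so the fallback logic is independent of Camelot itself.

```python
def read_with_guess(filepath, read_pdf, **kwargs):
    """Try lattice first; fall back to stream when lattice finds no tables.

    Returns a (flavor, tables) pair so the caller can see which one was used.
    Only kwargs accepted by both flavors (such as pages) should be passed.
    """
    tables = read_pdf(filepath, flavor="lattice", **kwargs)
    if len(tables) > 0:
        return "lattice", tables
    return "stream", read_pdf(filepath, flavor="stream", **kwargs)
```

With Camelot installed, this would be called as read_with_guess('foo.pdf', camelot.read_pdf, pages='1').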

Conversion to csv for Hindi pdf is not correct

I am using Camelot to convert this document to csv. The csv file is created however, the issue is it is not correct.

As can be seen from the shared original file and the converted csv file, the Hindi characters are not converted properly.

import camelot
tables = camelot.read_pdf('Demand_ Estimate.pdf', flavor='stream')
tables[0].to_csv('demand_estimate.csv')

This is my code.

Add more pdf-to-image engines?

Ghostscript currently does this job, but it is a pain to install and debug and does not have a friendly license. Before we can do #13, does it make sense to use python-pdfbox? Then again, it downloads the pdfbox jar file and would need Java to be installed on user systems.

atlanhq/camelot#346

Optimize memory usage for long PDFs

Using Camelot for some very long PDFs (>500 pages), I noticed that memory usage can grow significantly (in my experience, it can reach 30 GB and more).

I don't know if I'm doing something wrong.

Anyway, I found this solution: divide the extraction into chunks (for example, chunks of 50 pages) and save the data to disk at the end of every chunk extraction.
Doing so, I succeeded in limiting memory usage to a maximum of 4 GB, even for PDFs of about 3000 pages.

@vinayak-mehta: what do you think about this approach? Could it be useful? Are there better ways to limit memory usage?

(obviously, if the data saved on disk, later are all loaded into memory, the problem persists)
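The chunked approach described above might be sketched as follows. The helper names and the output file naming are assumptions; the 50-page chunk size comes from the description, and the Camelot calls are the read_pdf and export calls shown in the README.

```python
def page_chunks(total_pages, chunk_size=50):
    """Yield page ranges such as '1-50', '51-100' covering 1..total_pages."""
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        yield f"{start}-{end}"


def extract_in_chunks(pdf_path, total_pages, out_prefix, chunk_size=50):
    """Extract one chunk at a time and export it before reading the next."""
    import camelot  # imported lazily so page_chunks works without Camelot

    for pages in page_chunks(total_pages, chunk_size):
        tables = camelot.read_pdf(pdf_path, pages=pages)
        tables.export(f"{out_prefix}-{pages}.csv", f="csv")
        del tables  # drop the reference so the chunk can be garbage-collected
```

Each chunk's tables go out of scope before the next chunk is read, so peak memory depends on the chunk size rather than on the length of the PDF.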

ValueError: max() arg is an empty sequence

When running on this document (https://www.qao.qld.gov.au/sites/qao/files/annual-reports/annual_report_2016-17.pdf), when it reaches page 4, it throws the following ValueError:

import camelot
camelot.read_pdf(path, pages='3', flavor='stream')

Traceback (most recent call last):
File "", line 2, in
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\io.py", line 117, in read_pdf
**kwargs
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\handlers.py", line 172, in parse
p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 458, in extract_tables
cols, rows = self._generate_columns_and_rows(table_idx, tk)
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 349, in _generate_columns_and_rows
ncols = max(set(elements), key=elements.count)
ValueError: max() arg is an empty sequence

Easy enough to capture with a try/except but thought I would pop it up here to let you know
Thanks for writing this package, excellent work!
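The failing expression in the traceback is max(set(elements), key=elements.count), which raises ValueError when elements is empty. A guarded version could look like the sketch below; the function name and the default of 0 are assumptions, and whether zero columns is the right fallback for Camelot is untested.

```python
def most_common(elements, default=0):
    """Most frequent value in `elements`, or `default` when the list is empty."""
    if not elements:
        return default
    return max(set(elements), key=elements.count)
```

This avoids the exception at its source instead of wrapping read_pdf in try/except.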
