
pyxpdf's Introduction

pyxpdf

pyxpdf is a fast and memory-efficient Python module for parsing PDF documents, based on the xpdf reader sources.

Features

  • Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison)
  • Extract text while preserving the original document layout as closely as possible
  • Supports almost all PDF encodings, CMaps and predefined CMaps
  • Extract LZW, RLE, CCITTFax, DCT, JBIG2 and JPX compressed images and image masks along with their bounding boxes (BBox)
  • Render PDF pages as images, with support for the '1', 'L', 'LA', 'RGB', 'RGBA' and 'CMYK' color modes
  • No explicit dependencies (except optional ones, see Installation)
  • Thread safe

More Information

License

pyxpdf is licensed under the GNU General Public License (GPL), version 2 or 3. See the LICENSE

Credits

pyxpdf's People

Contributors

ashutoshvarma, dependabot-preview[bot], pb-jeff-oneill, prithvitewatia

pyxpdf's Issues

inconsistent behavior of `get()` methods of `PDFOutputDevice`s

When get() is called with an index outside the page range, RawImageOutput returns the last page's image, whereas TextOutput raises an IndexError.

Steps To Reproduce:-

from pyxpdf import xpdf as x

d = x.Document("samples/simple1.pdf")
iout = x.RawImageOutput(d)
tout = x.TextOutput(d)

print(len(d))                  # number of pages in the document
print(iout.get(10))            # returns the same image as iout.get(0)
print(tout.get(10))            # raises IndexError

Output:-

1
<PIL.Image.Image image mode=RGB size=1275x1651 at 0x7F179F573370>
Traceback (most recent call last):
  File "_test.py", line 15, in <module>
    tout.get(10)
  File "src/pyxpdf/textoutput.pxi", line 268, in pyxpdf.xpdf.TextOutput.get
    cpdef object get(self, int page_no):
  File "src/pyxpdf/textoutput.pxi", line 286, in pyxpdf.xpdf.TextOutput.get
    return self._get_bytes(page_no).decode('UTF-8', errors='ignore')
  File "src/pyxpdf/textoutput.pxi", line 209, in pyxpdf.xpdf.TextOutput._get_bytes
    if self._cache_texts[page_no] == None:
IndexError: list index out of range
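
Until both devices validate the index themselves, a caller could check it before delegating. The following is a minimal sketch, not pyxpdf code: `checked_get` is a hypothetical helper, and it assumes only what the report shows, namely that `len(d)` gives the page count and that each output device exposes `get(page_no)`. Rejecting negative indexes is a design choice of this sketch.

```python
def checked_get(output, page_no, page_count):
    """Call output.get(page_no) only if page_no is a valid zero-based index.

    Hypothetical helper, not part of pyxpdf.
    """
    if not 0 <= page_no < page_count:
        raise IndexError(
            f"page index {page_no} out of range for a document "
            f"with {page_count} page(s)"
        )
    return output.get(page_no)


# Intended use with the objects from the report (not executed here):
#   checked_get(iout, 10, len(d))   # raises IndexError
#   checked_get(tout, 10, len(d))   # raises IndexError
```

With this wrapper both devices fail the same way for an out-of-range page, instead of one silently returning an image.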

PDF file is not closed correctly

When I use the following command, the file is not closed correctly. For example, I cannot delete the file afterwards because the PDF file is still being used by a process.

doc = Document("samples/nonfree/mandarin.pdf")

If I write the code as follows instead, the PDF file will be closed correctly.

with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
    doc = Document(fp)
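
The workaround above can be wrapped in a small context manager so that the file handle is always released. This is a sketch under assumptions: `opened_document` is a hypothetical helper, and `document_factory` stands in for pyxpdf's `Document`, which the report shows accepting an open binary file object. The document is only used inside the `with` block here; whether it remains usable after the file is closed is not verified.

```python
from contextlib import contextmanager


@contextmanager
def opened_document(path, document_factory):
    """Open `path` in binary mode and yield document_factory(file object).

    Hypothetical helper, not part of pyxpdf. The file is closed when the
    `with` block ends, even if an exception is raised inside it.
    """
    with open(path, "rb") as fp:
        yield document_factory(fp)


# Intended use (not executed here):
#   with opened_document("samples/nonfree/mandarin.pdf", Document) as doc:
#       ...  # work with doc
#   # the PDF file is closed at this point and can be deleted
```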

Text Breaking when used for Gurmukhi(punjabi) script

I want to extract text from a PDF in Gurmukhi script (the Punjabi language),
but characters are read incorrectly when the text is extracted from the PDF.

from pyxpdf import Document
from pyxpdf.xpdf import TextControl

pdf_path = '/content/Punjab2_new.pdf'
doc = Document(pdf_path)

text_control = TextControl("physical", insert_bom=True)
for page in range(len(doc)):
    # extract the text inside the given box on each page
    out_res = doc[page].text((0, 90, 155, 700), text_control)
    print('\n_______________New_page_output_________________________\n')
    print(out_res)

Here are my expected and actual results.
The expected image is a sample of my input:

expected_text

With the text function I get incorrectly recognized characters:

actual_output

PDF
download.pdf

It would be a great help if any pyxpdf parameters could solve this issue.

Config.load_file() removes encodings from pyxpdf_data

Config.load_file() removes the xpdfrc settings from pyxpdf_data.
pyxpdf_data introduces new encodings through an automatically generated xpdfrc, and loading another xpdfrc discards them.

This could be solved by appending the user-provided xpdfrc to pyxpdf_data's xpdfrc.
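
The appending idea could look roughly like the sketch below. `merged_xpdfrc` is a hypothetical helper, not pyxpdf code, and both paths are supplied by the caller. It concatenates the two files as raw bytes into a temporary file; how xpdf resolves a setting that appears in both files is not verified here.

```python
import tempfile


def merged_xpdfrc(base_path, user_path):
    """Return the path of a new temp file: base settings, then user settings.

    Hypothetical helper, not part of pyxpdf.
    """
    with open(base_path, "rb") as f:
        base = f.read()
    with open(user_path, "rb") as f:
        user = f.read()
    with tempfile.NamedTemporaryFile("wb", suffix=".xpdfrc", delete=False) as out:
        out.write(base)
        if base and not base.endswith(b"\n"):
            # keep the last base line separate from the first user line
            out.write(b"\n")
        out.write(user)
    return out.name
```

The returned path could then be passed to Config.load_file() in place of the user's file, so that the pyxpdf_data settings are loaded together with the user's.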

image

The config file loading function can be found here:

def load_file(self, cfg_path=None):

Python 3.10 and 3.11 Compatibility

Hello @ashutoshvarma

What is the maintenance status of pyxpdf? It seems that the library is effectively unmaintained!

When trying to install (or build) pyxpdf for Python 3.10 or 3.11, a large number of errors occur and the build is terminated.

Please try to update this package to work with the latest Python versions.

Best
Musharraf

Dual license with GPL2?

xpdf is dual licensed under either GPL2 or GPL3.

Could you do the same for this project?

Many people prefer GPL2 over GPL3.

Mac support?

Hi,
I love this library. It works well, and reads everything I could possibly want it to. However, when running

pip3.9 install pyxpdf

on Mac, it throws an error. Is Mac support planned, or easy to add? If not, can you recommend other libraries that are almost as good?
Thanks.

add proper thread safe logging mechanism

At present, xpdf errors, warnings and logs are sent to stdout, and the only way to control them, and only partially, is Config.error_quiet. In most cases pyxpdf also cannot detect errors raised in the xpdf sources, because no error callback or logging mechanism is implemented.

Ideas:-
libxpdf has an error callback that can be used for log and error reporting.

# https://github.com/ashutoshvarma/libxpdf/blob/bc061b4e3da53b08a74e81706c6d8721af6b6094/xpdf-4.02/xpdf/Error.h#L39
extern void setErrorCallback(void (*cbk)(void *data, ErrorCategory category,

Also take a look at lxml's logging mechanism; its use case is similar to ours:
https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi
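
On the Python side, the receiving end of such a callback could forward messages to the standard logging module, which is designed to be thread safe. This is only a sketch: the logger name, the category strings and the `on_xpdf_error` signature are all assumptions made for illustration, and the Cython glue that would register a C callback with libxpdf is not shown.

```python
import logging

# Hypothetical logger name; pyxpdf does not define it.
logger = logging.getLogger("pyxpdf.xpdf")

# Hypothetical category names mapped to logging levels.
_LEVELS = {
    "syntax_warning": logging.WARNING,
    "syntax_error": logging.ERROR,
    "io": logging.ERROR,
    "internal": logging.CRITICAL,
}


def on_xpdf_error(category, position, message):
    """Forward one xpdf message to Python's logging system.

    Unknown categories are logged at ERROR level.
    """
    level = _LEVELS.get(category, logging.ERROR)
    logger.log(level, "%s (pos %s): %s", category, position, message)
```

Applications could then filter, silence or redirect xpdf messages with ordinary logging handlers instead of relying on stdout.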

Error when running tests for site-packages install

See the CI run - https://github.com/ashutoshvarma/pyxpdf/runs/990111561
Importing test_segfault_image_text_image.py causes the problem:

Using package from site-packages.
Traceback (most recent call last):
  File "runtests.py", line 213, in import_module
    mod = __import__(modname)
ModuleNotFoundError: No module named 'tests.test_segfault_image_text_image'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "runtests.py", line 661, in <module>
    exitcode = main(sys.argv)
  File "runtests.py", line 598, in main
    test_cases = get_test_cases(test_files, cfg, cov=cov)
  File "runtests.py", line 281, in get_test_cases
    module = import_module(file, cfg, cov=cov)
  File "runtests.py", line 221, in import_module
    mod = __import__(modname)
ModuleNotFoundError: No module named 'tests.test_segfault_image_text_image'

xpdf 4.04 support

Is it possible to compile the sources with xpdf 4.04? (I am using Visual Studio 2022 on Windows.)
I have successfully compiled libxpdf with the xpdf 4.04 sources, but when I compile the pyxpdf sources I get an error.
(I had to alter get_libxpdf.py to use the 4.04 libxpdf sources.)

$ setup.py --build-libxpdf build
...
src\pyxpdf\xpdf.cpp(836): fatal error C1083: Cannot open include file: 'Form.h': No such file or directory

It appears that Form.h/.cc are missing from the new sources and that XFAForm.h/.cc have been renamed to XFAScanner.h/.cc.

Python Support for python 3.11

Installing on Ubuntu 22.04 in a venv using Python 3.11 failed with the output below.
I just noticed that the Python version listed on PyPI is 3.8. If versions above 3.8 are not supported, will support for them be added anytime soon?

Output from pip install pyxpdf:

Using cached pyxpdf-0.2.3.tar.gz (1.9 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/setup.py", line 165, in
**setup_extra_options()
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/setup.py", line 98, in setup_extra_options
ext_modules = setupinfo.ext_modules(
^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/setupinfo.py", line 100, in ext_modules
get_prebuilt_libxpdf(
File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/get_libxpdf.py", line 91, in get_prebuilt_libxpdf
lib_dest_path = download_and_extract_libxpdf(download_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/get_libxpdf.py", line 60, in download_and_extract_libxpdf
libname = [name for name in filenames if "linux" in name and arch in name][0]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Building pyxpdf version 0.2.3.
Latest version of libxpdf is 0.1.3
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
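
The traceback ends at a list comprehension indexed with [0], which raises a bare IndexError when no prebuilt libxpdf archive name matches the platform and architecture. A more descriptive failure could be produced along these lines. This is a sketch, not the project's code: `pick_libxpdf_archive` and its parameters are hypothetical, and only the matching rule (substring tests on the file name) is taken from the line shown in the traceback.

```python
def pick_libxpdf_archive(filenames, platform_tag, arch):
    """Return the first archive name containing both platform_tag and arch.

    Hypothetical helper; raises a descriptive error instead of IndexError
    when nothing matches.
    """
    matches = [name for name in filenames if platform_tag in name and arch in name]
    if not matches:
        raise RuntimeError(
            f"no prebuilt libxpdf archive found for platform {platform_tag!r} "
            f"and architecture {arch!r}; available: {sorted(filenames)}"
        )
    return matches[0]
```

With a message like this, users would see which platform and architecture were looked up and which archives were available.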

Config.add_font_file() does not check for font file

When adding a font with the new Config.add_font_file() method, no error is raised if font_path is incorrect (the file does not exist).

Steps to Reproduce:-

from pyxpdf import xpdf as x
doc = x.Document('samples/simple1.pdf')
iout = x.RawImageOutput(doc)
iout.get(0).show()                # Output: Syntax Error: Couldn't find a font for 'Helvetica'

x.Config.add_font_file('Helvetica', 'asbajd')   # 'asbajd' is not an existing file
iout.get(0).show()                # Output: Syntax Error: Couldn't find a font for 'Helvetica' (still the same error)

xpdf's error logs are not very clear. Config.add_font_file() should check whether the file exists and is readable, and raise an error with an appropriate message.
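
The proposed check could be as small as the following sketch. `check_font_file` is a hypothetical helper, not part of pyxpdf; it uses only the standard library and would run before the path is handed to xpdf.

```python
import os


def check_font_file(font_path):
    """Return font_path if it names an existing, readable file.

    Hypothetical helper; raises FileNotFoundError or PermissionError
    with a clear message otherwise.
    """
    if not os.path.isfile(font_path):
        raise FileNotFoundError(f"font file does not exist: {font_path!r}")
    if not os.access(font_path, os.R_OK):
        raise PermissionError(f"font file is not readable: {font_path!r}")
    return font_path
```

Called at the top of Config.add_font_file(), it would turn the silent failure from the reproduction steps into an immediate exception that names the bad path.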

not able to read japanese text from pdf

I was trying to write some tests for pyxpdf and came across a PDF in the pdfminer PDF samples.

Using pdftotext or the Page.text function on it results in distorted output, with the error:
Syntax Error: Unknown character collection 'Adobe-Japan1'

At first I thought it was an xpdf limitation, but to my surprise the pdftotext binary on my system was able to extract text from it.

Steps to reproduce :-

from pyxpdf import pdftotext
print(pdftotext("samples/jo.pdf"))

PDF file: jo.pdf
(replace "samples/jo.pdf" with the path to the provided PDF)

  • with pyxpdf
    image

  • with pdftotext binary
    image

System: 5.5.11-1-MANJARO
Python Version: 3.8.2
pyxpdf: up-to-date dev

Better font detection [Enable fontconfig support]

For PDFs that do not embed their fonts (mostly the Adobe Base-14 fonts, which the PDF specification does not require to be embedded), we have to search for fonts on the user's system.
The best way to do that is probably fontconfig: xpdf-4.02 already has fontconfig support, but it is disabled in libxpdf.

Why is fontconfig support disabled in libxpdf?
Because the goal of libxpdf was to provide a static library with minimal dependencies by including them statically. For fontconfig, however, including it in a static library is pretty difficult (at least for me), because:

  • It requires a font.conf configuration file, and without it spams stdout with warnings.
  • The search location for this font.conf file must be hardcoded at build time, which would make pyxpdf wheels error prone.

Probable solutions:-

  • Include fontconfig statically in libxpdf and ship pyxpdf with its own custom font.conf, which should work on all supported OSes.
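
One possible way to avoid a hardcoded search location is fontconfig's FONTCONFIG_FILE environment variable, which overrides the default configuration file. The sketch below is an assumption-laden illustration: `use_bundled_fontconfig` is a hypothetical helper, the path to the shipped configuration file is supplied by the caller, and it has not been verified that a fontconfig linked statically into libxpdf honors the variable. The variable would have to be set before fontconfig is initialized.

```python
import os


def use_bundled_fontconfig(conf_path, environ=os.environ):
    """Point fontconfig at conf_path unless FONTCONFIG_FILE is already set.

    Hypothetical helper; returns the value that is in effect afterwards.
    """
    if not os.path.isfile(conf_path):
        raise FileNotFoundError(f"fontconfig file does not exist: {conf_path!r}")
    # setdefault keeps a value the user exported themselves
    environ.setdefault("FONTCONFIG_FILE", conf_path)
    return environ["FONTCONFIG_FILE"]
```

Using setdefault means a user who already exported FONTCONFIG_FILE keeps their own configuration.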

Segmentation Fault when using `Document.text()` between `RawImageOutput.get()` calls

Steps to Reproduce:-

_test.py

from pyxpdf import xpdf as x

doc = x.Document("samples/nonfree/mandarin.pdf")
iout = x.RawImageOutput(doc)
iout.get(0)
doc.text()      # text extraction between the two image renders
iout.get(0)

$ python _test.py

Output :-

Syntax Error: Couldn't find a font for 'TimesNewRomanPS-ItalicMT'
[1]    6040 segmentation fault (core dumped)  python _test.py

System:-

  • OS : Clear Linux OS x86_64
  • Python : 3.8.5 (debug)
  • pyxpdf version : v0.2.2
  • pyxpdf_data : v1.0.1
  • Pillow : v7.2.0

Error on Chinese Characters text fetching!!! any idea to resolve it?

Hi, I am using the module to fetch text (a combination of English and Chinese) from PDF files, and I get the following error:

from pyxpdf import Document, Page, Config
from pyxpdf.xpdf import TextOutput, TextControl, page_iterator
with open(pdf_file, 'rb') as fp:
    doc = Document(fp)
for page in doc:
    res_box =page.find_text('Cornerstone', search_box=[0, 0, 400, 400], case_sensitive=True)
    if res_box:
        print(page.label,res_box)

results:

Syntax Error: Unknown character collection 'Adobe-CNS1'
278 (406.8096, 94.85200000000002, 465.46160000000003, 104.47700000000002)
Syntax Error: Unknown character collection 'Adobe-CNS1'
279 (69.6101, 103.50040000000014, 106.93410000000002, 109.62540000000014)
280 (230.7095, 348.65500000000003, 284.4775, 358.28000000000003)
Syntax Error: Unknown character collection 'Adobe-CNS1'
Syntax Error: Unknown character collection 'Adobe-CNS1'
Syntax Error: Unknown character collection 'Adobe-CNS1'
