ashutoshvarma / pyxpdf

Fast and memory-efficient Python PDF Parser based on xpdf sources

Home Page: https://pyxpdf.readthedocs.io/

License: Other

C 0.02% Python 22.76% Makefile 1.26% Shell 1.14% C++ 5.08% Cython 69.74%

pdf python cython pdf-converter pdftotext pdf-parser pdfparser pdftohtml pdftopng xpdf

pyxpdf's Introduction

pyxpdf

pyxpdf is a fast and memory-efficient Python module for parsing PDF documents, based on the xpdf reader sources.

Features

Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison)
Extracts text while preserving the original document layout as closely as possible
Supports almost all PDF encodings, CMaps and predefined CMaps
Extracts LZW, RLE, CCITTFax, DCT, JBIG2 and JPX compressed images and image masks along with their BBox
Renders PDF pages as images, with support for the '1', 'L', 'LA', 'RGB', 'RGBA' and 'CMYK' color modes
No explicit dependencies (except optional ones, see Installation)
Thread safe

More Information

License

pyxpdf is licensed under the GNU General Public License (GPL), version 2 or 3. See the LICENSE

Credits

xpdf reader by Derek Noonburg
lxml - project structure and build adapted from lxml
poppler project

pyxpdf's Issues

inconsistent behavior of `get()` methods of `PDFOutputDevice`s

When get() is called with an index outside the page range, RawImageOutput returns the last page's image, whereas TextOutput raises an IndexError.

Steps To Reproduce:-

from pyxpdf import xpdf as x

d = x.Document("samples/simple1.pdf")
iout = x.RawImageOutput(d)
tout = x.TextOutput(d)

print(len(d))
print(iout.get(10))            # returns the same image as iout.get(0)
print(tout.get(10))            # raises IndexError

Output:-

1
<PIL.Image.Image image mode=RGB size=1275x1651 at 0x7F179F573370>
Traceback (most recent call last):
  File "_test.py", line 15, in <module>
    tout.get(10)
  File "src/pyxpdf/textoutput.pxi", line 268, in pyxpdf.xpdf.TextOutput.get
    cpdef object get(self, int page_no):
  File "src/pyxpdf/textoutput.pxi", line 286, in pyxpdf.xpdf.TextOutput.get
    return self._get_bytes(page_no).decode('UTF-8', errors='ignore')
  File "src/pyxpdf/textoutput.pxi", line 209, in pyxpdf.xpdf.TextOutput._get_bytes
    if self._cache_texts[page_no] == None:
IndexError: list index out of range
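
One way to make the two outputs agree would be to validate the page index before delegating to get(). The sketch below is a hypothetical wrapper, not part of pyxpdf: the name checked_get and its parameters are illustrative, and it assumes only that an output object exposes get(page_no) with 0-based indices and that len(doc) gives the page count, as in the reproduction above. Rejecting negative indices is a choice made here, not documented library behavior.

```python
def checked_get(output, page_no, num_pages):
    """Call output.get(page_no) only if page_no is a valid 0-based index.

    `output` is any object with a get(page_no) method (for example a
    RawImageOutput or a TextOutput); `num_pages` would be len(doc).
    Both names are illustrative and not part of pyxpdf.
    """
    if not 0 <= page_no < num_pages:
        raise IndexError(
            f"page index {page_no} out of range for {num_pages} page(s)"
        )
    return output.get(page_no)
```

With this guard in front of both outputs, checked_get(iout, 10, len(d)) and checked_get(tout, 10, len(d)) would fail the same way.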

wrong spelling of ownerpass in the example mentioned which leads to the error.

Hi,
I think the example in the documentation misspells ownerpass. It should be ownerpass, matching userpass, not ownerpss. The misspelling leads to the error TypeError: pdftotext() got an unexpected keyword argument 'ownerpss'.

PDF file is not closed correctly

When I use the following command, the file is not closed correctly. For example, I cannot delete the file afterwards because the PDF file is still being used by a process.

doc = Document("samples/nonfree/mandarin.pdf")

If I write the code as follows instead, the PDF file will be closed correctly.

with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
    doc = Document(fp)
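
The workaround above can be wrapped in a small context manager so the file handle is always released. This is a minimal sketch under stated assumptions: opened_document and document_factory are hypothetical names, and the only behavior relied on is the one reported in this issue, namely that Document accepts an open binary file object. The document should be used inside the with block, because whether it remains usable after the file is closed is not established here.

```python
from contextlib import contextmanager


@contextmanager
def opened_document(path, document_factory):
    """Open `path` in binary mode and yield document_factory(file_object).

    `document_factory` would be pyxpdf's Document class; it is passed in
    as a parameter so this helper makes no assumption about imports.
    The file is closed when the with block exits, even on error.
    """
    with open(path, "rb") as fp:
        yield document_factory(fp)
```

Usage would look like: with opened_document("samples/nonfree/mandarin.pdf", Document) as doc: followed by the work on doc inside the block.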

Text Breaking when used for Gurmukhi(punjabi) script

I want to extract text from a PDF in Gurmukhi script (the Punjabi language),
but characters are read incorrectly when the text is extracted.

from pyxpdf import Document
from pyxpdf.xpdf import TextControl

pdf_path = '/content/Punjab2_new.pdf'
doc = Document(pdf_path)

text_control = TextControl("physical", insert_bom=True)
for page in range(len(doc)):
    # extract text from the given box on each page
    out_res = doc[page].text((0, 90, 155, 700), text_control)
    print('\n_______________New_page_output_________________________\n')
    print(out_res)

Here are my expected and actual results.
The expected image is a sample of my input:

With the text function, characters are recognized incorrectly:

PDF
download.pdf

It would be a great help if any pyxpdf parameter could solve this issue.

Config.load_file() removes encodings from pyxpdf_data

Config.load_file() removes the xpdfrc settings provided by pyxpdf_data.
pyxpdf_data introduces new encodings through an automatically generated xpdfrc, and loading another xpdfrc discards them.

This can be solved by appending the user-provided xpdfrc to pyxpdf_data's xpdfrc.

Config file loading function can be found here:-

pyxpdf/src/pyxpdf/globalconfig.pxi

Line 35 in 40e2969

def load_file(self, cfg_path=None):
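
The appending idea could be sketched as a plain file merge done before the combined file is handed to the loader. This is an illustration, not the project's fix: merge_xpdfrc and its three path parameters are hypothetical names, how pyxpdf_data exposes the path of its generated xpdfrc is not shown in this issue, and the sketch assumes that settings appearing later in an xpdfrc either add to or take precedence over earlier ones.

```python
from pathlib import Path


def merge_xpdfrc(base_cfg, user_cfg, merged_cfg):
    """Write base_cfg followed by user_cfg into merged_cfg.

    base_cfg   -- xpdfrc generated by pyxpdf_data (path is an assumption)
    user_cfg   -- xpdfrc supplied by the user
    merged_cfg -- destination file; returned as a Path
    """
    base_text = Path(base_cfg).read_text(encoding="utf-8")
    user_text = Path(user_cfg).read_text(encoding="utf-8")
    # Make sure the last base line and the first user line stay separate.
    if base_text and not base_text.endswith("\n"):
        base_text += "\n"
    merged = Path(merged_cfg)
    merged.write_text(base_text + user_text, encoding="utf-8")
    return merged
```

The merged file, rather than the user's file alone, would then be passed to Config.load_file().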

Python 3.10 and 3.11 Compatibility

Hello @ashutoshvarma

What is the maintenance status of pyxpdf? It seems that the library is effectively unmaintained!

When trying to install (or build) pyxpdf for Python 3.10 or 3.11, a large number of errors occur and the build is terminated.

Please try to update this package to work with the latest Python versions.

Best
Musharraf

Dual license with GPL2?

xpdf is dual licensed under either GPL2 or GPL3.

Could you do the same for this project?

Many people prefer GPL2 over GPL3.

Mac support?

Hi,
I love this library. It works well, and reads everything I could possibly want it to. However, when running

pip3.9 install pyxpdf

On Mac, it throws an error. Is Mac support planned, or easy to add? If not, can you recommend other libraries that are almost as good?
Thanks.

add proper thread safe logging mechanism

As of now, xpdf errors, warnings and logs are sent to stdout, and the only way to partially control this is Config.error_quiet. Also, in most cases pyxpdf cannot detect errors raised in the xpdf sources, because no error callback or logging mechanism is implemented.

Ideas:-
libxpdf has an error callback which can be utilized for log/error reporting.

# https://github.com/ashutoshvarma/libxpdf/blob/bc061b4e3da53b08a74e81706c6d8721af6b6094/xpdf-4.02/xpdf/Error.h#L39
extern void setErrorCallback(void (*cbk)(void *data, ErrorCategory category,

Also take a look at lxml's logging mechanism; their use case is similar to ours.
https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi
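
On the Python side, the callback idea could forward each xpdf message to the standard logging module, whose handlers are already thread safe. The sketch below is an assumption-laden illustration: forward_xpdf_error, its parameters (category, pos, message), the logger name and the category-to-level mapping are all hypothetical, and the Cython glue that would register such a function through setErrorCallback is not shown.

```python
import logging

# Hypothetical logger name; applications could attach their own handlers.
logger = logging.getLogger("pyxpdf.xpdf")

# Assumed mapping from xpdf error category names to logging levels.
_LEVELS = {
    "errSyntaxWarning": logging.WARNING,
    "errSyntaxError": logging.ERROR,
    "errConfig": logging.ERROR,
    "errIO": logging.ERROR,
    "errInternal": logging.CRITICAL,
}


def forward_xpdf_error(category, pos, message):
    """Log one xpdf message; `pos` < 0 means no file offset is known."""
    level = _LEVELS.get(category, logging.ERROR)  # unknown -> ERROR
    if pos >= 0:
        logger.log(level, "%s (offset %d): %s", category, pos, message)
    else:
        logger.log(level, "%s: %s", category, message)
```

Routing messages through a named logger would also let callers silence or redirect xpdf output with ordinary logging configuration instead of Config.error_quiet alone.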

Dependabot couldn't authenticate with https://pypi.python.org/simple/

Dependabot couldn't authenticate with https://pypi.python.org/simple/.

You can provide authentication details in your Dependabot dashboard by clicking into the account menu (in the top right) and selecting 'Config variables'.

View the update logs.

Error when running tests for site-packages install

See the CI run - https://github.com/ashutoshvarma/pyxpdf/runs/990111561
Importing test_segfault_image_text_image.py causes the problem.

Using package from site-packages.
Traceback (most recent call last):
  File "runtests.py", line 213, in import_module
    mod = __import__(modname)
ModuleNotFoundError: No module named 'tests.test_segfault_image_text_image'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "runtests.py", line 661, in <module>
    exitcode = main(sys.argv)
  File "runtests.py", line 598, in main
    test_cases = get_test_cases(test_files, cfg, cov=cov)
  File "runtests.py", line 281, in get_test_cases
    module = import_module(file, cfg, cov=cov)
  File "runtests.py", line 221, in import_module
    mod = __import__(modname)
ModuleNotFoundError: No module named 'tests.test_segfault_image_text_image'

xpdf 4.04 support

Is it possible to compile the sources with xpdf 4.04? (I am using Visual Studio 2022 on Windows.)
I've successfully compiled libxpdf with the xpdf 4.04 sources, but when I compile the pyxpdf sources I get an error.
(I had to alter get_libxpdf.py to use the 4.04 libxpdf sources.)

$ setup.py --build-libxpdf build
...
src\pyxpdf\xpdf.cpp(836): fatal error C1083: Cannot open include file: 'Form.h': No such file or directory

It appears that Form.h/.cc are missing from the new sources and that XFAForm.h/.cc have been renamed to XFAScanner.h/.cc.

Python Support for python 3.11

Installing on Ubuntu 22.04 in a venv using Python 3.11 failed with the output below.
I just noticed that the Python version listed on PyPI is 3.8. If versions above 3.8 are not supported, will support for them be added anytime soon?

Output from pip install pyxpdf:

Using cached pyxpdf-0.2.3.tar.gz (1.9 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<pip-setuptools-caller>", line 34, in <module>
  File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/setup.py", line 165, in <module>
    **setup_extra_options()
      ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/setup.py", line 98, in setup_extra_options
    ext_modules = setupinfo.ext_modules(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/setupinfo.py", line 100, in ext_modules
    get_prebuilt_libxpdf(
  File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/get_libxpdf.py", line 91, in get_prebuilt_libxpdf
    lib_dest_path = download_and_extract_libxpdf(download_dir)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/pip-install-6g6ib8ah/pyxpdf_b8e0b2a4c698462d923e318285dcbc38/get_libxpdf.py", line 60, in download_and_extract_libxpdf
    libname = [name for name in filenames if "linux" in name and arch in name][0]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Building pyxpdf version 0.2.3.
Latest version of libxpdf is 0.1.3
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
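
The traceback ends in a bare IndexError because the list comprehension in get_libxpdf.py found no matching file name. Why the list is empty here is not established by this output. The sketch below only shows how that lookup could fail with a descriptive message instead: pick_libxpdf_asset and its parameters are hypothetical names, not the project's code, and the matching rule is copied from the line shown in the traceback.

```python
def pick_libxpdf_asset(filenames, platform_tag, arch):
    """Return the first file name containing both platform_tag and arch.

    Raises RuntimeError with the available names when nothing matches,
    instead of letting `[...][0]` raise a bare IndexError.
    """
    matches = [
        name for name in filenames if platform_tag in name and arch in name
    ]
    if not matches:
        raise RuntimeError(
            f"no prebuilt libxpdf found for platform {platform_tag!r} and "
            f"architecture {arch!r}; available files: {sorted(filenames)}"
        )
    return matches[0]
```

A message like this would at least tell the user which platform and architecture strings were searched for, and whether building libxpdf from source is the remaining option.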

Config.add_font_file() does not check for font file

When a font is added with the new Config.add_font_file() method and font_path is incorrect (the file does not exist), no error is raised.

Steps to Reproduce:-

from pyxpdf import xpdf as x
doc = x.Document('samples/simple1.pdf')
iout = x.RawImageOutput(doc)
iout.get(0).show()                # Output: Syntax Error: Couldn't find a font for 'Helvetica'

x.Config.add_font_file('Helvetica', 'asbajd')
iout.get(0).show()                # Output: Syntax Error: Couldn't find a font for 'Helvetica' (still the same error)

The xpdf error logs are not very clear. Config.add_font_file() should check whether the file exists and is readable, and raise an error with an appropriate message.
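
The proposed check could look roughly like the following. It is a sketch using only the standard library: validate_font_path is a hypothetical helper name, and where exactly it would be called inside Config.add_font_file() is left open.

```python
import os


def validate_font_path(font_path):
    """Return the absolute path of font_path if it is a readable file.

    Raises FileNotFoundError when the path is not an existing file and
    PermissionError when the file exists but cannot be read.
    """
    if not os.path.isfile(font_path):
        raise FileNotFoundError(f"font file not found: {font_path!r}")
    if not os.access(font_path, os.R_OK):
        raise PermissionError(f"font file is not readable: {font_path!r}")
    return os.path.abspath(font_path)
```

With such a check, the call with 'asbajd' from the reproduction above would fail immediately with a FileNotFoundError instead of silently keeping the old font lookup.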

import error: Symbol not found

Env: OSX10.12, Python 3.6.1
Code:

import pyxpdf

Error:

How can I fix this error?
Thanks.

not able to read japanese text from pdf

I was trying to write some tests for pyxpdf and came across a PDF in the pdfminer PDF samples.

Using pdftotext or the Page.text function results in distorted output, with the error:
Syntax Error: Unknown character collection 'Adobe-Japan1'

At first I thought it was an xpdf limitation, but to my surprise the pdftotext binary on my system was able to extract text from it.

Steps to reproduce :-

from pyxpdf import pdftotext
print(pdftotext("samples/jo.pdf"))

PDF File - jo.pdf
replace "samples/jo.pdf" with provided pdf path

with pyxpdf
with pdftotext binary

System: 5.5.11-1-MANJARO
Python Version: 3.8.2
pyxpdf: up-to-date dev

How do I know if the layout of the current page is one or two columns

Better font detection [Enable fontconfig support]

For PDFs that do not embed their fonts (mostly the Adobe Base-14 fonts, which the PDF specification does not require to be embedded), we have to search for fonts on the user's system.
The best way to do that is probably fontconfig: xpdf-4.02 already has fontconfig support, but it is disabled in libxpdf.

Why is fontconfig support disabled in libxpdf?
Because the goal of libxpdf was to provide a static library with minimal dependencies by including them statically. For fontconfig, including it in a static library is pretty difficult (at least for me) because:

It requires a fonts.conf configuration file, and without it spams stdout with warnings.
The search location for this fonts.conf file must also be hardcoded at build time, which would make pyxpdf wheels error-prone.

Probable solutions:-

Include fontconfig statically in libxpdf and ship pyxpdf with its own custom fonts.conf, which should work on all supported OSes.

Segmentation Fault when using `Document.text()` between `RawImageOutput.get()` calls

Steps to Reproduce:-

_test.py

doc = x.Document("samples/nonfree/mandarin.pdf")
iout = x.RawImageOutput(doc)
iout.get(0)
doc.text()
iout.get(0)

python _test.py

Output :-

Syntax Error: Couldn't find a font for 'TimesNewRomanPS-ItalicMT'
[1]    6040 segmentation fault (core dumped)  python _test.py

System:-

OS : Clear Linux OS x86_64
Python : 3.8.5 (debug)
pyxpdf version : v0.2.2
pyxpdf_data : v1.0.1
Pillow : v7.2.0

Error on Chinese Characters text fetching!!! any idea to resolve it?

Hi, I am using the module to fetch text (a combination of English and Chinese) from PDF files, and I get the errors shown in the results below:

from pyxpdf import Document, Page, Config
from pyxpdf.xpdf import TextOutput, TextControl, page_iterator
with open(pdf_file, 'rb') as fp:
    doc = Document(fp)
for page in doc:
    res_box =page.find_text('Cornerstone', search_box=[0, 0, 400, 400], case_sensitive=True)
    if res_box:
        print(page.label,res_box)

results:

Syntax Error: Unknown character collection 'Adobe-CNS1'
278 (406.8096, 94.85200000000002, 465.46160000000003, 104.47700000000002)
Syntax Error: Unknown character collection 'Adobe-CNS1'
279 (69.6101, 103.50040000000014, 106.93410000000002, 109.62540000000014)
280 (230.7095, 348.65500000000003, 284.4775, 358.28000000000003)
Syntax Error: Unknown character collection 'Adobe-CNS1'
Syntax Error: Unknown character collection 'Adobe-CNS1'
Syntax Error: Unknown character collection 'Adobe-CNS1'

Maintenance Status

Hi @ashutoshvarma

Is this repo still maintained?

What about builds for Python 3.10 and 3.11?