pymupdf / pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Home Page: https://pymupdf.readthedocs.io

License: GNU Affero General Public License v3.0

Python 62.04% C 0.19% SWIG 37.77%

mupdf xps pdf-documents epub ocr pdf font python data-science extract-data table-extraction pymupdf tesseract text-processing text-shaping

pymupdf's Introduction

Installation

PyMuPDF requires Python 3.8 or later. Install it using pip:

pip install PyMuPDF

There are no mandatory external dependencies. However, some optional features become available only if additional packages are installed.

You can also try it without installing by visiting PyMuPDF.io.

Usage

Basic usage is as follows:

import fitz # imports the pymupdf library
doc = fitz.open("example.pdf") # open a document
for page in doc: # iterate the document pages
  text = page.get_text() # get plain text encoded as UTF-8

Documentation

Full documentation can be found on pymupdf.readthedocs.io.

Optional Features

fontTools for creating font subsets.
pymupdf-fonts contains some nice fonts for your text output.
Tesseract-OCR for optical character recognition in images and document pages.

About

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF was originally written by Jorj X. McKie.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

Contact

Join us on Discord here: #pymupdf

pymupdf's Issues

Document.close()

Hi @rk700,
it seems that this function works in simpler situations only ...
When testing my MuPDF_OLedit.py my python session abended altogether.
I then compiled fitz.i with MEMDEBUG. For the simple program

import fitz
f="C:/Users/.../test.pdf"
doc = fitz.Document(f)
print doc.pageCount
print doc.ToC()
page = doc.loadPage(0)
print doc.close()

everything seems to work fine with the following output:

40 ==> the number of pages
[[...]...] ==> the table of contents
free outline
free doc
None
free page

When I use MuPDF_OLedit.py to open the same file, the following happens:
For each page displayed (function pdf_show()), the following lines appear:
free rect
free colorspace
free page
free irect
free pixmap
free device

When I then invoke the doc.close() function, the following 2 lines appear and then the Python interpreter abends:
free outline
free doc

variable names in PDF_display.py

@JorjMcKie, PDF_display.py contains two variable names that may not be obvious to everyone?

Would you agree with the replacement of szr10 and szr20 with BoxSizerVertical and BoxSizerHorizontal?

I would make the merge request myself, but first I need to be sure I’m not missing something (I don’t code 😟).

extractText(): non-natural ordering of text output

For complex documents, the sequence of text extracted with extractText() often is not as you naturally would expect, i.e. from top-left to bottom-right.
When I experimented with the fitz function fz_print_text_page_xml, it became clear, that fitz applies some heuristics to determine the sequence of page blocks. The XML printout correctly specifies the values of bboxes and other coordinates for everything, but not in the "natural" sequence.
The text sequence in fz_print_text_page is exactly the same, but without any coordinate information the text appears sometimes totally confusing.
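
Until the extraction order itself changes, a caller who has the block coordinates can re-sort the blocks. The following is a minimal sketch only, assuming each block is available as a tuple (x0, y0, x1, y1, text); the helper name sort_reading_order and the y_tolerance heuristic are hypothetical and not part of the bindings.

```python
from typing import List, Tuple

# Assumed block layout: (x0, y0, x1, y1, text).
Block = Tuple[float, float, float, float, str]


def sort_reading_order(blocks: List[Block], y_tolerance: float = 3.0) -> List[Block]:
    """Return blocks ordered top-to-bottom, then left-to-right.

    Blocks whose top edges differ by less than y_tolerance from the first
    block of the current line are treated as belonging to that line.
    """
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    lines: List[List[Block]] = []
    for block in ordered:
        if lines and abs(block[1] - lines[-1][0][1]) < y_tolerance:
            lines[-1].append(block)  # same visual line
        else:
            lines.append([block])    # start a new line
    result: List[Block] = []
    for line in lines:
        result.extend(sorted(line, key=lambda b: b[0]))
    return result


sample = [
    (300.0, 50.5, 500.0, 70.0, "top right"),
    (50.0, 200.0, 250.0, 220.0, "bottom left"),
    (50.0, 50.0, 250.0, 70.0, "top left"),
]
print([b[4] for b in sort_reading_order(sample)])
```

This simple ordering interleaves the columns of multi-column layouts, so it is not a general solution.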

Example zoom factor in Tkinter

Would this be the correct way to use a zoom factor with Tkinter?

PyMuPDF

doc_pdf = fitz.Document( "example.pdf" )
matrix = fitz.Matrix ( 1 , 1 ).preScale( 1.2 , 1.2 )

page = doc_pdf.loadPage( 1 )
pix = page.getPixmap( matrix = matrix , colorspace = 'RGB' )
data = str( pix.samples )

data2 = "".join( [ data[4*i:4*i+3] for i in range( len( data ) / 4 ) ] )

Tkinter - canvas

from PIL import Image , ImageTk
im = Image.frombytes( "RGBA" , [pix.width , pix.height] , data )
canvas = Canvas( root , relief=SUNKEN )
canvas.config( width=400 , height=200 )
canvas.config( highlightthickness=0 )

sbarV = Scrollbar( root , orient=VERTICAL )
sbarH = Scrollbar( root , orient=HORIZONTAL )

sbarV.config( command=canvas.yview )
sbarH.config( command=canvas.xview )

canvas.config( yscrollcommand=sbarV.set )
canvas.config( xscrollcommand=sbarH.set )
sbarV.pack( side=RIGHT , fill=Y )
sbarH.pack( side=BOTTOM , fill=X )

canvas.pack( side=LEFT, expand=YES, fill=BOTH )

width , height = im.size
canvas.config( scrollregion = ( 0 , 0 , width , height ) )
im2 = ImageTk.PhotoImage( im )

nWid = ( root.winfo_screenwidth() / 2 ) - ( width / 2 )

imgtag = canvas.create_image( nWid , 10 , anchor="nw" , image=im2 )
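
The data2 line in the snippet above drops every fourth byte (the alpha channel) from the RGBA samples. A minimal Python 3 sketch of the same idea, assuming the samples are a bytes object with 4 bytes per pixel; rgba_to_rgb is a hypothetical helper name, not part of the bindings.

```python
def rgba_to_rgb(samples: bytes) -> bytes:
    """Drop the alpha byte of every RGBA pixel and return packed RGB bytes."""
    if len(samples) % 4 != 0:
        raise ValueError("expected 4 bytes per pixel")
    rgb = bytearray()
    for i in range(0, len(samples), 4):
        rgb += samples[i:i + 3]  # keep R, G, B; skip A
    return bytes(rgb)


two_pixels = bytes([1, 2, 3, 255, 4, 5, 6, 255])
print(list(rgba_to_rgb(two_pixels)))
```

The result would be passed to the image constructor with mode "RGB" instead of "RGBA".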

Python list of outline entries?

As new print functions have now been added to outline: would it be a big effort to also add a function that creates a Python list? Like the one in this demo program?
I guess that many future users will not really care for following the single entries with outline.down / outline.next, but rather be interested in a complete list in one shot.
In addition: people with non-standard encodings (like myself with German) want to get their characters printed correctly, so they would need to catch the output of outline.saveText / outline.saveXML somehow and decode it appropriately.
A potentially quick solution: could the output of the save functions be optionally redirected (into a string)?
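
A sketch of what such a list function could do, assuming each outline entry exposes down (first child) and next (next sibling) as described above. The Node class is a stand-in for illustration, and the title attribute and the function name outline_to_list are assumptions.

```python
class Node:
    """Stand-in for an outline entry (hypothetical; not the real class)."""

    def __init__(self, title, down=None, next=None):
        self.title = title
        self.down = down  # first child entry, or None
        self.next = next  # next sibling entry, or None


def outline_to_list(entry, level=1):
    """Follow the down/next links and return [[level, title], ...]."""
    result = []
    while entry is not None:
        result.append([level, entry.title])
        if entry.down is not None:
            result.extend(outline_to_list(entry.down, level + 1))
        entry = entry.next
    return result


root = Node(
    "Chapter 1",
    down=Node("Section 1.1", next=Node("Section 1.2")),
    next=Node("Kapitel 2: Übersicht"),
)
for level, title in outline_to_list(root):
    print("  " * (level - 1) + title)
```

Because the titles end up as ordinary Python strings, the caller can decode or encode them as needed before printing.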

MuPDF 1.8 Performance

@rk700, @deepgully, @ousia and friends:
I have been investigating performance changes in MuPDF 1.8. You will be pleased to hear about improvements in the area of text extraction. There may be other improvements; they will be added here as they are observed.

Text extraction speed of our bindings seems to have improved by a factor of more than 2. @rk700 - remember when I measured 7 seconds on my machine to extract simple text from a large PDF? This value is now less than 3 seconds for the exact same file.
The relative gap between the more complex methods extractJSON()/extractXML() and the simple extractText() has shrunk significantly: instead of processing time ratios (TEXT <> HTML <> JSON <> XML) of about (1 <> 2 <> 145 <> 4120), we now have (1 <> 1 <> 3 <> 52)!

Please do inform me if you have any questions on this, or information to share ...

Document.save(...) only for PDF?

@rk700 - are you aware whether the save function of fitz might only work for PDF files?
I tested with an XPS file, tried to save it and got return code 0.
But no output file was created.

Prevent Document.save() for input file

I included in documentation that filename in Document.save(filename) must be different from the opened file yielding the Document object.
However, nothing technically prevents me from still doing it. If I do it, the input file is being destroyed, meaning I see a document with empty pages and return code is zero.

The method should instead reject the request with an exception.
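
One way such a check could look, as a sketch only: compare the resolved paths before writing and raise if they point to the same file. The function name check_save_target is hypothetical and this is not the binding's actual implementation.

```python
import os


def _canonical(path: str) -> str:
    """Resolve symlinks and relative parts, and normalise case where relevant."""
    return os.path.normcase(os.path.realpath(path))


def check_save_target(input_path: str, output_path: str) -> None:
    """Raise ValueError if output_path refers to the same file as input_path."""
    same = _canonical(input_path) == _canonical(output_path)
    if not same and os.path.exists(input_path) and os.path.exists(output_path):
        # Also catches hard links that have different names.
        same = os.path.samefile(input_path, output_path)
    if same:
        raise ValueError("output file must differ from the opened input file")


try:
    check_save_target("example.pdf", "./example.pdf")
except ValueError as exc:
    print("rejected:", exc)
```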

Access to plain text of a page?

Is there a way to access the text contained in a page and e.g. analyze it inside Python?

building on windows

MSVC only supports c89, while our wrapper is in c99 standard.

We may need to do some conversion on the code.

Decrypt does not work on outlines

In the process of establishing a more formal testing approach for PyMuPDF, I ran into the following problem:

I password-protected a test PDF. Then I used the demo program removePass.py to create a non-protected version. This version was readable all right, but all outline data were still encrypted!
If I open the encrypted pdf with a standard viewer like Nitro, the outline looks fine.
Also no problem occurs with SumatraPDF.exe (which is based on MuPDF).

building on linux

With the setup.py, the wrapper can be compiled without error:

$ python2 ./setup.py build
running build
running build_py
copying fitz/fitz.py -> build/lib.linux-x86_64-2.7/fitz
running build_ext
building '_fitz' extension
gcc -pthread -fno-strict-aliasing -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4 -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4 -fPIC -I/usr/include/mupdf -I/usr/include/python2.7 -c fitz/fitz_wrap.c -o build/temp.linux-x86_64-2.7/fitz/fitz_wrap.o
gcc -pthread -shared -Wl,-O1,--sort-common,--as-needed,-z,relro build/temp.linux-x86_64-2.7/fitz/fitz_wrap.o -L/usr/lib -lmupdf -lmujs -lssl -ljbig2dec -lopenjp2 -ljpeg -lfreetype -lpython2.7 -o build/lib.linux-x86_64-2.7/_fitz.so

Notice that openssl lib is included when linking -lssl, but not in the output _fitz.so:

$ readelf -d build/lib.linux-x86_64-2.7/_fitz.so

Dynamic section at offset 0x945a88 contains 30 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libjbig2dec.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libopenjp2.so.7]
 0x0000000000000001 (NEEDED)             Shared library: [libjpeg.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [libfreetype.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpython2.7.so.1.0]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x4dbc0
...

And as a result, it would complain that something is not defined, which is in openssl lib.

>>> import fitz
...
ImportError: /home/lrk/src/python-fitz/build/lib.linux-x86_64-2.7/_fitz.so: undefined symbol: PKCS12_SAFEBAG_free

I think the reason is the linker flags --as-needed and --sort-common. It actually works when I link by hand without those two.

We'll try to figure out why the openssl symbol is not determined to be needed.

Env I'm using: Arch Linux, gcc 4.9.2, python2.7.9, mupdf 1.7a

Python3 compatibility problem

The Page.getPixmap() in the tutorial and API doc does not exist in current master.

Besides, the object.__getattr__ in fitz.py does not exist in Python3.

Setup.py compile on Win 7 without Visual Studio installed.

I followed the directions:

if you have not installed Visual Studio or if you do not want to generate MuPDF, you must download PyMuPDF Optional Material now and unzip / decompress its content in directory ./PyFitz/PyMuPDF-optional-material. This optional material contains the lib files needed for PyMuPDF generation, and the MuPDF header files. Update setup.py, parameter include_dirs, to point to these header files.

I don't have Python 2.7 installed on the computer either if that has anything to do with it.

Here is what I get:
G:\pyfitz>python setup.py install
running install
running build
running build_py
running build_ext
building 'fitz._fitz' extension
error: Unable to find vcvarsall.bat

I do end up with g:\pyfitz\build\__init__.py, fitz.py, utils.py but no _fitz.pyd

Release?

Hi @rk700,
now that no more issues are popping up and the documentation is done, I think it is time to release ...
I can do it if you do not have the time.
We should also think about making it more "searchable", meaning:
People search for "PDF" or even "MuPDF", but not for "fitz", when they look for a way to access PDFs via Python.
I am not sure what would be best ... maybe renaming it to something like "PyMuPDF"?
And how about adding it to the Python Package Index?

fitz.Document(filename)

will not work if type(filename) is unicode.
Possible fix: internally convert to str(filename).

linkDest._getFileSpec cause segfault

According to http://mupdf.com/docs/browse/include/mupdf/fitz/link.h.html:

gotor.file_spec: If set, this destination should cause a new
file to be opened; this field holds a pointer to a remote
file specification (UTF-8). Always NULL in the FZ_LINK_GOTO
case.

When kind == FZ_LINK_GOTO, file_spec should be NULL. However, this is not always the case, and file_spec can hold an invalid pointer (so this could be a MuPDF code bug or a documentation bug). When accessing file_spec with kind == FZ_LINK_GOTO in some PDFs, I got a segfault.

The code should exclude FZ_LINK_GOTO:

        char *_getFileSpec() {
            return ($self->kind == FZ_LINK_GOTOR) ? $self->ld.gotor.file_spec : ($self->kind==FZ_LINK_LAUNCH ? $self->ld.launch.file_spec : NULL);
        }

The same could happen with linkDest._getDest: FZ_LINK_GOTO should be excluded.

compatible with 1.8?

@rk700,

sorry for asking that after the recent release of PyMuPDF, but what about compatibility with mupdf-1.8?

Many thanks for your excellent work.

Simplified installation

@rk700, I have simplified the installation process as planned:
We now have new directories containing the generated MuPDF libraries for each platform python-fitz will support. For now, these are LibLinux and LibWin32.
For example, LibWin32 contains libmupdf.lib and libthirdparty.lib.
That should be the case for LibLinux as well ... could you please upload your Linux libraries to the Linux library?
The setup.py script has already been updated accordingly, the documentation is updated as well.
Thanks and best regards

new metadata attribute: 'encryption' error

The metadata key 'encryption' shows 'None' (the string) instead of None.
The other key / value entries are correct.

Issue w/ repeated extractText() executions

When I try to extract the text of all pages of a document, the interpreter terminates abnormally after some successful page text extractions (around 10, it seems). Here is the code, a modified version of extract.py:

#-----------------------------------------------------------------------------------------------------------
#!/usr/bin/env python
import fitz

#get the page
d = fitz.Document("pymupdf.pdf")
numPages = d.pageCount

for i in range(numPages):
    print "working on page", i
    pg = d.loadPage(i)

    #setup the display list
    dl = fitz.DisplayList()
    dv = fitz.Device(dl)
    pg.run(dv, fitz.Identity)

    #setup the text page
    ts = fitz.TextSheet()
    tp = fitz.TextPage()
    rect = pg.bound()
    dl.run(fitz.Device(ts, tp), fitz.Identity, rect)

    #get the text content
    text = tp.extractText()
    print text

------------------------------------------------------------------------------------------

My guess is, that we get some type of a buffer issue: the string result of extractText() is represented by a buffer allocated in MuPDF / fitz, which never gets de-allocated ...

Doc for type of Link.dest is wrong

The API document for Link class has the wrong type for Link.dest.

the type of Link.dest should be a LinkDest instead of int.
the type of LinkDest.dest should be string since it will point to fileSpec if valid.
the type of LinkDest.named should be string too.

Document.close() crashes after successful decryption

@rk700 - If a document is successfully decrypted, a following close() method will reproducibly crash the interpreter - either immediately, or at the latest when the document is dropped (doc = None).

Can you confirm this for Linux as well? This may be "works as designed", which is quite imaginable; in that case I would prohibit close() after decryption so that it raises a graceful exception.

fitz.Identity is not hashable in 1.9

While fitz.Matrix(1,1) is hashable and can be used as a key in dict, fitz.Identity is not hashable.

>>> import fitz
>>> a = fitz.Matrix(1, 1)
>>> hash(a)
8757058983372
>>> hash(fitz.Identity)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'SwigPyObject'

minimal sample?

I wonder whether it would be possible to include a minimal demo that displays a given PDF file (selected from the command line) and that goes page up and page down.

I mean something similar to the demo from poppler-python.

git repository missing fitz.h file

The line

#include <fitz.h>

is in fitz/ctx.h and fitz/fitz_wrap.c

The file fitz.h is missing from the git repository

compatible with latest mupdf?

Are your bindings compatible with the latest mupdf-1.3?

Many thanks for your work,

Pablo

Linux Setup

@rk700 - trying to set-up my own Linux version, but I am a newbie there. Therefore a question: how do I generate the required MuPDF third-party libraries? That's the only thing I am missing right now.

doc.outline is a read-only attribute, right?

Looking at your remove password example motivated me to ask this question.
Any other attributes changeable in the way you did it with remove password?

invalid escape character in extractJSON() output

If one wants to decode output of TextPage.extractJSON() with Python's standard json module with an instruction like text = json.loads(tp.extractJSON()), the following exception occurs:

File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 75 column 58 (char 2440)

The escape character being referred to is ' (i.e. the apostrophe). Obviously, apostrophes should not be regarded as escape characters.
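
Until the output itself is fixed, a caller could strip the offending backslash before decoding. This is a sketch of a workaround under that assumption, not a fix in the bindings; fix_apostrophe_escapes is a hypothetical helper name. The regular expression consumes escape sequences pairwise, so an escaped backslash that happens to precede an apostrophe is left alone.

```python
import json
import re

# Matches a backslash together with the character it escapes.
_ESCAPE = re.compile(r"\\(.)", re.DOTALL)


def fix_apostrophe_escapes(raw: str) -> str:
    """Replace the invalid JSON escape \\' with a plain apostrophe."""
    def repl(match):
        return "'" if match.group(1) == "'" else match.group(0)
    return _ESCAPE.sub(repl, raw)


# The string content is {"text": "it\'s"}, which json.loads() rejects.
raw = '{"text": "it\\\'s"}'
print(json.loads(fix_apostrophe_escapes(raw)))
```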

Documentation review

Hi @rk700 and @ousia,
hope you are well.
I have made an effort to complete the python-fitz documentation. Obviously, I have taken version 1.2 as a starting point, but a lot has changed in 1.7. Therefore a lot had to be adjusted in the docu as well ... I do believe however, that what we have now can be called a close-to-final version 0.9.

If you could be so kind and give it a critical review - time permitting - and let me know of any required or nice-to-have changes?

The doc directory contains a sub dir html, from which the html version can be invoked via index.html as usual.
As a convenience, the doc directory also contains other technical versions, namely CHM (Windows help file), EPUB (e-book) and PDF. The contents are all in sync, they all have been created with sphinx from the same source.

I am working on the new documentation ...

Having the 1.2 stuff I am trying to find a way to present the new python-fitz, esp. to new users.

In order to put versions online we need a doc directory again. I obviously cannot create one.
Apart from documenting the many changed details (renamed functions, variables, deleted / new objects), I have decided to create a tutorial for using fitz. It should explain the main functions in a simple step-by-step manner, and not try to be complete, but definitely be easy to understand.

Side questions:
A) There is no way, MuPDF can access meta information from a PDF (like author, subject, ...) - right?
B) MuPDF cannot write out PDFs in a selected manner, like "take only page 3-7, and 11-21", right?
C) MuPDF cannot change a PDF's meta data (like above, or even a changed outline tree), right?

Buffer Overrun?

While rendering many pages of a larger, complex PDF I get an exception with the message "cannot render page".
In my case this happens after around 50 pages rendering a PDF with 100 pages and a total size of 14 MB.
I have sent @rk700 a mail with the attached ZIP of the PDF.
Hope his mailbox can deal with attachments of this size (it's still 11 MB).

Should be able to close document explicitly

Currently we cannot close an opened doc explicitly, and it blocks us from removing the file. Should make the document destruction available to user.

more matrix methods

Since the matrix has been wrapped, most other matrix operations should be included as well.

Once a user starts using transformation matrices, all kinds of matrix operations are likely to be needed:

matrix.invert
matrix.translate
matrix.compose (is it called concat in mupdf?)
transform point
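
For reference, the requested operations can be sketched in plain Python. This assumes the usual six-coefficient representation (a, b, c, d, e, f), where a point (x, y) maps to (x*a + y*c + e, x*b + y*d + f). The function names and the argument order of concat are choices made for this illustration and do not describe the wrapped API.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)
Point = Tuple[float, float]
IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def transform_point(p: Point, m: Matrix) -> Point:
    """Apply matrix m to point p."""
    x, y = p
    a, b, c, d, e, f = m
    return (x * a + y * c + e, x * b + y * d + f)


def concat(m1: Matrix, m2: Matrix) -> Matrix:
    """Return the matrix that applies m1 first and m2 second."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def translate(m: Matrix, tx: float, ty: float) -> Matrix:
    """Return m followed by a shift of (tx, ty)."""
    return concat(m, (1.0, 0.0, 0.0, 1.0, tx, ty))


def invert(m: Matrix) -> Matrix:
    """Return the inverse of m; raise ValueError if m is singular."""
    a, b, c, d, e, f = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return (ia, ib, ic, id_, -(e * ia + f * ic), -(e * ib + f * id_))


zoom = (2.0, 0.0, 0.0, 2.0, 10.0, 5.0)
print(transform_point((1.0, 1.0), zoom))
```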

Install failed on OSX 10.11.3

Got a bunch of warnings and a final error. Note I did have difficulty getting mupdf to compile but was finally able to do it by changing all the openssl includes with a hard path to where they were located on my system. Also had to use the following on the make: make HAVE_X11=no

Here is the terminal output:

Georges-MBP:PyMuPDF gbarnabic$ python3 setup.py install
running install
running build
running build_py
creating build/lib.macosx-10.6-intel-3.4
creating build/lib.macosx-10.6-intel-3.4/fitz
copying fitz/__init__.py -> build/lib.macosx-10.6-intel-3.4/fitz
copying fitz/fitz.py -> build/lib.macosx-10.6-intel-3.4/fitz
copying fitz/utils.py -> build/lib.macosx-10.6-intel-3.4/fitz
running build_ext
building 'fitz._fitz' extension
creating build/temp.macosx-10.6-intel-3.4
creating build/temp.macosx-10.6-intel-3.4/fitz
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I./mupdf/include -I./mupdf/include/mupdf -I/Library/Frameworks/Python.framework/Versions/3.4/include/python3.4m -c ./fitz/fitz_wrap.c -o build/temp.macosx-10.6-intel-3.4/./fitz/fitz_wrap.o
./fitz/fitz_wrap.c:3316:45: warning: passing 'char *' to parameter of type
      'unsigned char *' converts between pointers to integer types with
      different sign [-Wpointer-sign]
                data = fz_open_memory(gctx, stream, streamlen);
                                            ^~~~~~
./mupdf/include/mupdf/fitz/stream.h:66:59: note: passing argument to parameter
      'data' here
fz_stream *fz_open_memory(fz_context *ctx, unsigned char *data, int len);
                                                          ^
./fitz/fitz_wrap.c:3555:62: warning: passing 'char *' to parameter of type
      'unsigned char *' converts between pointers to integer types with
      different sign [-Wpointer-sign]
                pm = fz_new_pixmap_with_data(gctx, cs, w, h, samples);
                                                             ^~~~~~~
./mupdf/include/mupdf/fitz/pixmap.h:83:109: note: passing argument to parameter
      'samples' here
  ...*ctx, fz_colorspace *colorspace, int w, int h, unsigned char *samples);
                                                                   ^
./fitz/fitz_wrap.c:3563:40: warning: passing 'char *' to parameter of type
      'unsigned char *' converts between pointers to integer types with
      different sign [-Wpointer-sign]
                pm = fz_load_png(gctx, data, size);
                                       ^~~~
./mupdf/include/mupdf/fitz/image.h:87:56: note: passing argument to parameter
      'data' here
fz_pixmap *fz_load_png(fz_context *ctx, unsigned char *data, int size);
                                                       ^
3 warnings generated.
./fitz/fitz_wrap.c:3316:45: warning: passing 'char *' to parameter of type
      'unsigned char *' converts between pointers to integer types with
      different sign [-Wpointer-sign]
                data = fz_open_memory(gctx, stream, streamlen);
                                            ^~~~~~
./mupdf/include/mupdf/fitz/stream.h:66:59: note: passing argument to parameter
      'data' here
fz_stream *fz_open_memory(fz_context *ctx, unsigned char *data, int len);
                                                          ^
./fitz/fitz_wrap.c:3555:62: warning: passing 'char *' to parameter of type
      'unsigned char *' converts between pointers to integer types with
      different sign [-Wpointer-sign]
                pm = fz_new_pixmap_with_data(gctx, cs, w, h, samples);
                                                             ^~~~~~~
./mupdf/include/mupdf/fitz/pixmap.h:83:109: note: passing argument to parameter
      'samples' here
  ...*ctx, fz_colorspace *colorspace, int w, int h, unsigned char *samples);
                                                                   ^
./fitz/fitz_wrap.c:3563:40: warning: passing 'char *' to parameter of type
      'unsigned char *' converts between pointers to integer types with
      different sign [-Wpointer-sign]
                pm = fz_load_png(gctx, data, size);
                                       ^~~~
./mupdf/include/mupdf/fitz/image.h:87:56: note: passing argument to parameter
      'data' here
fz_pixmap *fz_load_png(fz_context *ctx, unsigned char *data, int size);
                                                       ^
3 warnings generated.
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-3.4/./fitz/fitz_wrap.o -L./PyMuPDF-optional-material/LibWin32 -llibmupdf -llibthirdparty -o build/lib.macosx-10.6-intel-3.4/fitz/_fitz.so /NODEFAULTLIB:MSVCRT
clang: error: no such file or directory: '/NODEFAULTLIB:MSVCRT'
error: command '/usr/bin/clang' failed with exit status 1
Georges-MBP:PyMuPDF gbarnabic$

Windows installation doesn't work if 64-bit Python is installed

When installing PyMuPDF using setup.py, the installation will only work if the 32-bit version of Python is installed. If the 64-bit version is installed, there is an error running link.exe from Microsoft's Visual C++ for Python.
This issue occurs on Windows 8.

Demo - sierpinski.py error

Now that I have it installed, I'm experimenting. So I found this demo is not working. Here is the error:

Georges-MBP:demo gbarnabic$ python3 sierpinski.py
Traceback (most recent call last):
  File "sierpinski.py", line 58, in <module>
    punch(pm, 0, 0, d, d)
  File "sierpinski.py", line 31, in punch
    ir = fitz.IRect(x01, y01, x02, y02) # rectangle of middle square
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/fitz/fitz.py", line 432, in __init__
    this = _fitz.new_IRect(*args)
NotImplementedError: Wrong number or type of arguments for overloaded function 'new_IRect'.
Possible C/C++ prototypes are:
fz_irect_s::fz_irect_s()
fz_irect_s::fz_irect_s(struct fz_irect_s const *)
fz_irect_s::fz_irect_s(int,int,int,int)

Georges-MBP:demo gbarnabic$

PyMuPDF binary file size

At least under Windows, the size of the binary file (_fitz.pyd, I presume _fitz.so in Linux) has increased significantly, practically doubled from 9.4 MB to now 17.5 MB.

This is due to hundreds of new font definitions now incorporated. Most notable the Google NOTO (= "no tofu") fonts. In Windows, we now need to build _fitz.pyd out of 3 library files: libmupdf, libthirdparty and now also libfonts.

In our bindings, we actually do not need fonts, or do we? Maybe we do during rendering?

If we don't need them, it may be worthwhile to find a way to exclude them during MuPDF generation.

Just omitting the libfonts won't work, because libmupdf references the object file NOTO (which is generated by compiling noto.c) and the linker step will generate hundreds of unresolved references ....

Files for V1.9 & a text extraction issue

@rk700 - created branch 1.9 and populated fitz sub dir with updated files (fitz.i etc.)

They passed the tests on my Python 2.7 32-bit installation and thus are ready to be tested on other platforms (Linux, OSX).

Interesting issue:

If you use fitz.Device(ts, tp) in a certain way, then the TextPage's text extraction functions will return no text! More specifically:

This will work (dl is a DisplayList):

ts = fitz.TextSheet()
tp = fitz.TextPage()
dl.run(fitz.Device(ts, tp), fitz.Identity, rect)
txt = tp.extractXML()

This will not work (i.e. will deliver no text):

ts = fitz.TextSheet()
tp = fitz.TextPage()
dv = fitz.Device(ts, tp)
dl.run(dv, fitz.Identity, rect)
txt = tp.extractXML()

However, tp.search() will work correctly either way.

PyMuPDF / fitz Performance

@rk700 , @ousia , @deepgully :
Have you seen my performance investigations? It is contained in the documentation, and a WIKI page is pointing to the chapter.
I hope these data underpin what has always been said: It is very hard to beat MuPDF in terms of speed ...

Porting and Tests for MuPDF 1.9a concluded

@rk700 - I have finished porting and testing our stuff to / with MuPDF 1.9a. This sub-version did not impact our code for version 1.9 in any way.

As discussed recently, text extraction is now also represented by methods directly under the Document and Page classes. Users can be ignorant of DisplayList, TextSheet, TextPage and text Device classes altogether if they are just dealing with text extraction.

More or less the same is true for the Pixmap class: it too now has new methods directly connected to the Document and Page classes.

I dare say that this approach has no negative performance impact for "normal" users. This is certainly true for text extraction.

Only for rendering pages can explicitly using the DisplayList yield a performance advantage, e.g. when a user displaying a document decides to zoom in or out of the currently displayed page.

Of course, all these classes (DisplayList, Device, TextSheet, TextPage) are still there and available for use to those who dare ...

Question:
Do you want to publish a new release given the current status?

Some new functionality

@rk700 , @ousia , @deepgully
You may want to take note of new functionality added just recently. Text extraction and pixmap creation have both been improved and simplified; I believe some of these changes are significant.

For example, you can now extract text and create pixmaps directly for a page - no need to mess around with devices, textsheets and textpages (while this is certainly still possible when e.g. performance considerations demand it).

Another example is Pixmap. This class now has several new methods, which extend and improve its usability, even beyond the context of dealing with PDF files.

Have a look at our Wiki pages to find more details.

Invalid JSON text feedback of method extractJSON()

@deepgully , @rk700

On very rare occasions, the text string returned by this method is too long: it then returns data with arbitrary trailers.
This causes the JSON string to be invalid (exception generated by json.loads()).

I have found this issue with only one file so far: the Adobe Manual "Adobe PDF Reference 1-7.pdf".
It is peculiar, that I have seen this error only on pages 660 and 1138 (zero-based).

Here is an output of the string returned for page 660:
jsonerr.txt
Another example is
jsonerr2.txt

Observations

It only occurs on the mentioned pages, but not always: sometimes on both, sometimes on only one
An invalid text string apparently is always longer - never shorter than is legal
The extra length varies: I have seen 5, 7, 10 or even over 100 extra bytes
If I run the program on only those pages (e.g. "JSON-izing" only page 660), the error occurs, too. Obviously we do not have a buffer overrun because of the many (1310) pages of that manual.
As a workaround, I look for the last occurrence of \n} and shorten the method's output string accordingly; this has worked so far. Maybe this is a basis for a fix on the C level.
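
The workaround described in the last observation can be sketched as follows. trim_json_trailer is a hypothetical helper name, and the damaged sample string is made up for illustration; it is not taken from the attached files.

```python
import json


def trim_json_trailer(raw: str) -> str:
    """Cut everything after the last newline + closing brace."""
    end = raw.rfind("\n}")
    return raw if end == -1 else raw[: end + 2]


# Made-up example: valid JSON followed by a few stray trailing characters.
damaged = '{\n "len": 3,\n "blocks": []\n}\x07xyz'
page = json.loads(trim_json_trailer(damaged))
print(page)
```

This would fail if the stray trailer itself happened to contain a newline followed by a closing brace.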

Cannot make version 1.9 work with Gtk, wx

I don't know what the problem is, but when I compile pymupdf git master HEAD against mupdf 1.9a, it does not work well with Gtk or wx.

When I run the demo/PDF_display, the file chooser dialog pops up with most of the widgets only half drawn.

.....

After a few tests, it seems that simply placing import fitz after import wx makes the problem go away.

Actually, I had the opposite experience with pymupdf version 1.8 and Gtk. With pymupdf 1.8, some segfault problems went away when import fitz went before from gi.repository import Gtk. :(

release 1.8: external libraries

I’m not a programmer myself, but this is the first time I have seen a binding library that ships already compiled third-party binaries.

mupdf itself contains a directory named thirdparty, but it contains only source code (Fedora removes this directory before compilation).

I will ask the Fedora developers to add this library to the repositories for the current development version. Would it be possible to move both the LibLinux and LibWin32 directories from the default master branch to a different branch?

In principle, any Linux distribution should compile any binary using the already installed libraries and not using binaries provided with the source code to be compiled.

Maybe in Windows this is harder to achieve. But in any case, would it be possible to add those binaries to another branch (such as external-libraries)?

Sorry for being so assertive, but I guess this is mandatory for many distributions to include PyMuPDF in their repositories.

getPixmap should take a clip parameter.

The getPixmap methods should take a keyword parameter clip, which can default to None.

There is a fz_new_draw_device_with_bbox function that takes a clip parameter.

Clipping is useful when cropping the white borders of a page or displaying a single column of a two-column document. Of course we can crop the returned pixmap. But hopefully, with clipping, the rendering speed and the data conversion speed would improve.
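
For comparison, the "crop the returned pixmap" alternative can be sketched over a raw pixel buffer. This is a minimal illustration, not PyMuPDF API: crop_samples and its parameters are hypothetical names, and the buffer is assumed to be row-major with n bytes per pixel. It shows that the whole page must be rendered and held in memory before any pixel is discarded, which is the cost a clip parameter at render time would avoid.

```python
def crop_samples(samples, width, height, n, clip):
    """Copy the pixels inside clip = (x0, y0, x1, y1) out of a raw buffer.

    Hypothetical helper: samples is a row-major bytes object holding
    width * height pixels with n bytes each.
    """
    x0, y0, x1, y1 = clip
    # Clamp the rectangle to the image bounds.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        return b""
    stride = width * n  # bytes per full image row
    rows = []
    for y in range(y0, y1):
        start = y * stride + x0 * n
        rows.append(samples[start : start + (x1 - x0) * n])
    return b"".join(rows)


# A made-up 4x2 grayscale image (n = 1) with pixel values 0..7.
image = bytes(range(8))
right_half = crop_samples(image, 4, 2, 1, (2, 0, 4, 2))
```

The right half keeps columns 2 and 3 of both rows.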

Link.dest is freed with its Link object

I was trying to cache the dest of links in a page. The cache was a dict from link.rect -> link.dest. The problem is that when the dest is accessed later on, it sometimes gives random values.

The code looks like this:

    @functools.lru_cache(maxsize=PAGE_CACHE_SIZE)
    def get_link(self, page_num):
        """return a dict of {rect: fitz.linkDest}"""
        page = self.get_page(page_num)
        if page is None:
            return
        link = page.loadLinks()
        res = {}
        while link is not None:
            # link.dest, link.rect are not properly refcounted
            dest = link.dest
            rect = link.rect
            res[rect] = dest
            link = link.next
        return res

It seems that the dest is freed when its containing Link is freed, presumably to make sure the string values in dest are released when links get "dropped". This behavior is very un-Pythonic when writing Python code.

I have to clone the dest and the rect by value before caching them. I don't know the standard SWIG way of handling this struct-value-in-struct situation, but I think the dest and rect should be cloned somehow when the Link object is created in the wrapper.
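
The clone-by-value workaround could look roughly like the sketch below. It copies the values into plain Python objects while the Link is still alive, so the cache no longer refers to memory owned by the wrapper. The helper names are hypothetical, the rect attributes x0, y0, x1, y1 and the dest field names kind, page and uri are assumptions, and the links here are simple stand-in objects rather than real fitz objects.

```python
from types import SimpleNamespace


def snapshot_link(link, dest_fields=("kind", "page", "uri")):
    """Copy a link's rect and dest into plain Python values.

    The field names are assumptions; adjust them to what the wrapper exposes.
    """
    rect = (link.rect.x0, link.rect.y0, link.rect.x1, link.rect.y1)
    dest = {name: getattr(link.dest, name, None) for name in dest_fields}
    return rect, dest


def collect_links(first_link):
    """Walk the link chain and return {rect tuple: dest dict}."""
    res = {}
    link = first_link
    while link is not None:
        rect, dest = snapshot_link(link)
        res[rect] = dest  # only copies are stored, never the wrapped structs
        link = link.next
    return res


# Stand-in objects that mimic a chain of two links.
second = SimpleNamespace(
    rect=SimpleNamespace(x0=0, y0=50, x1=100, y1=60),
    dest=SimpleNamespace(kind=1, page=3, uri=None),
    next=None,
)
first = SimpleNamespace(
    rect=SimpleNamespace(x0=0, y0=10, x1=100, y1=20),
    dest=SimpleNamespace(kind=2, page=-1, uri="http://example.com"),
    next=second,
)
cached = collect_links(first)
```

Because tuples are hashable, the copied rect can still serve as the dictionary key.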

The Outline class presumably has a similar problem.

Wrong JSON format in TextPage.extractJSON() method

Testing text extraction with the following 1-page XPS example file produced an error.
The input file and the corresponding text file (with .JSON extension) can be found in the test directory.
In essence, the decimal values provided for one bbox are invalid; see lines 562 and following in the JSON file. I am extracting them here for clarity:

Simplifications

@rk700 - I have been thinking about further simplifying PyMuPDF's use.
Before changing anything, I would like to hear your thoughts ...

An average user would probably never voluntarily use objects like TextSheet, TextPage or Device.
But they have to if they want to do simple things like accessing the contents of pages, which is counter-intuitive.
But of course we need to keep those objects for users who need to do heavy optimization.
I dislike bloating fitz.i with Python-specific code. One example is the code of Document.getToc() - it's an alien thing inside fitz.i, and I would like to get rid of it there.
Compared with other similar tools, PyMuPDF still lacks some features - particularly in the area of output ...

So my idea in essence is to add functions / features outside fitz.i whenever possible by extending __init__.py (see provided example).
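
As a rough sketch of this idea, a helper can be written in pure Python and attached to a class after the fact, which is something a package's __init__.py can do once the generated module has been imported. Everything below is illustrative: Document is a stand-in class rather than the SWIG-generated one, flat_titles is a hypothetical helper and not the actual getToc() code, and the outline attributes title, down and next are assumptions.

```python
from types import SimpleNamespace


class Document:
    """Stand-in for a generated wrapper class; it only holds an outline chain."""

    def __init__(self, outline=None):
        self.outline = outline


def flat_titles(self):
    """Return the outline as a flat list of [level, title] entries."""
    result = []

    def walk(node, level):
        while node is not None:
            result.append([level, node.title])
            walk(node.down, level + 1)  # children first, one level deeper
            node = node.next            # then the next sibling

    walk(self.outline, 1)
    return result


# Attach the pure-Python helper as a method; no interface file is touched.
Document.flat_titles = flat_titles

# Made-up outline data for demonstration.
child = SimpleNamespace(title="1.1 Scope", down=None, next=None)
chapter2 = SimpleNamespace(title="2 Syntax", down=None, next=None)
chapter1 = SimpleNamespace(title="1 Introduction", down=child, next=chapter2)
toc = Document(chapter1).flat_titles()
```

The helper then behaves like any other method of the class, while its source lives entirely on the Python side.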

When you look at dir(fitz) after the import, lots of things pop up which have no meaning or importance to the (normal) developer, or which are even dangerous to touch (e.g. cvar, or <class>_swigregister).
I don't know yet how to get rid of such entries, but I would like to try ...
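
One possible starting point, sketched under assumptions: compute the list of names worth showing by filtering out the SWIG helpers mentioned above. The function public_names is hypothetical, and the demo module is a stand-in built with types.ModuleType, not the real fitz module.

```python
import types


def public_names(module):
    """List a module's names without private entries and SWIG helper entries."""
    hidden_exact = {"cvar"}
    hidden_suffix = "_swigregister"
    return sorted(
        name
        for name in dir(module)
        if not name.startswith("_")
        and name not in hidden_exact
        and not name.endswith(hidden_suffix)
    )


# Stand-in module that mimics the kind of entries described above.
demo = types.ModuleType("demo")
demo.Document = object
demo.Document_swigregister = lambda cls: None
demo.cvar = object()

names = public_names(demo)
```

Assigning such a list to __all__ in __init__.py only limits what "from fitz import *" brings in; dir(fitz) itself is unaffected unless the module also defines a module-level __dir__ function, which requires Python 3.7 or later.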

Please let me know what you think.

__init__ .py.txt


