Comments (7)
An option is that the `dest` and the `rect` each hold a ref to the parent `link` object. Then the `link` will not be freed while either `dest` or `rect` is alive.
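A minimal sketch of that idea in plain Python, not PyMuPDF's actual code: the `Link` and `Dest` classes below are hypothetical stand-ins, and the `weakref` probe is only there to observe when the parent really goes away.

```python
import gc
import weakref


class Link:
    """Hypothetical stand-in for a wrapped MuPDF link."""

    def __init__(self, uri, rect):
        self.uri = uri
        self._rect = rect

    @property
    def dest(self):
        # The child receives a strong reference to its parent link.
        return Dest(self)


class Dest:
    """Hypothetical child object that reads data owned by its link."""

    def __init__(self, parent):
        self._parent = parent  # keeps the parent link alive

    @property
    def uri(self):
        return self._parent.uri


link = Link("https://example.com", (0, 0, 10, 10))
probe = weakref.ref(link)  # lets us observe whether the link still exists
dest = link.dest

del link                   # drop the caller's own reference
parent_alive = probe() is not None
uri_via_dest = dest.uri    # still safe: dest holds the parent

del dest                   # last holder gone
gc.collect()               # not needed on CPython, harmless elsewhere
parent_freed = probe() is None
```

The parent does not hold a reference back to the child here, so no reference cycle is created and the link is released as soon as the last child disappears.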
from pymupdf.
We'll have to wait and see what @rk700 has to say about this.
But as an interim solution: how about extracting all link-related information from the page in one go (like the `getLinks()` method does)? After all, it's not that much information ... `getLinks()` delivers a list of dictionaries, each of which completely describes a Link and its dependent linkDest.
As for the `Outline`, a solution like the one you suggest is already implemented. In addition, the `getToc()` method extracts all bookmarks at once as a Python list, optionally together with the associated `linkDest` information (option `simple=False`).
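The "extract everything in one go" approach can be illustrated with a small sketch. It does not call PyMuPDF: `NativePage`, `iter_links` and `get_links_snapshot` are hypothetical names that only mimic a page whose link data becomes invalid once the page is freed, and the dictionary keys are chosen for illustration.

```python
class NativePage:
    """Hypothetical stand-in for a page whose link data dies with the page."""

    def __init__(self, links):
        self._links = list(links)
        self._closed = False

    def close(self):
        # Simulates MuPDF freeing the page struct and everything inside it.
        self._links.clear()
        self._closed = True

    def iter_links(self):
        if self._closed:
            raise RuntimeError("page already freed")
        return iter(self._links)


def get_links_snapshot(page):
    """Copy every link into an independent dict in a single pass."""
    return [
        {"kind": kind, "from": tuple(rect), "uri": uri}
        for kind, rect, uri in page.iter_links()
    ]


page = NativePage([
    ("uri", [10, 20, 110, 40], "https://example.com"),
    ("uri", [10, 60, 110, 80], "https://example.org"),
])
links = get_links_snapshot(page)
page.close()  # the snapshot no longer depends on the page
```

Because the dictionaries contain only plain Python values, they stay valid regardless of when the page is released.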
Please take a look at the screenshot. The `dest` and `rect` in the `res` dict are corrupted not when the `Link` object and its members are destroyed, but when the `Page` object is freed.
I think that's the result of object dependency in MuPDF. The document structure objects are closely related, and we must make sure that destruction happens in the right order. For example, if the `Outline` is freed after the `Document`, there would be a segfault.
MuPDF is written in C, and its users have to take responsibility for using it correctly. As a Python wrapper, PyMuPDF aims to be simpler, so that users do not have to take care of object management. But this convenience comes at a price. For example, you might not be able to keep a `Page` as a local variable and return its descendants, as in @mozbugbox's code.
Actually, I've been thinking about how to manage the objects gracefully for quite a while. Please feel free to tell us if you have any ideas.
@JorjMcKie maybe we can increase the ref count of the `Page` when calling `loadLinks()` to ensure that the `Page` is freed only after the `Link`s it contains are freed.
@rk700 - I like your idea of reconsidering the relationship between the set of "objects" in MuPDF and the set of objects in PyMuPDF.
Today, the analogy between the two is very close.
This is not a MUST, but it is convenient (as the example of text extraction shows: now, TextSheet, TextPage and text device actually would not need to exist in PyMuPDF any longer).
But this convenience means that we have to synchronize our object creation / destruction process with that of MuPDF, as you pointed out. I am afraid that if we continue to follow this path, our code will become more and more complex and would still contain gaps. And of course we do not want to burden the Python programmer with the duty of explicitly / manually destroying objects.
If I mentally take a step back: why would I, as a Python programmer, need to know an animal like `linkDest` or `device`? When I deal with the links of a page, I would be happy to get a list of link objects that each already contain all the information in one shot, including stuff that is now spread over the sub-objects `fitz.Rect` and `fitz.linkDest`.
Why would I need to know that there even exists something like an outline object? If I need bookmarks, then surely I want a complete table of contents with all the link information for every bookmark item, don't I?
Whether or not these examples are already well thought through: I believe that if we want to attack the problem, we should take a user's (= Python programmer's) point of view. The objects he sees and the objects MuPDF handles would no longer be in a 1:1, but rather in a 1:n (or even m:n?) correspondence. Our code would then have better control over MuPDF object management. Again, have a look at file `selecthelper.i`, function `readPageText`: every necessary MuPDF object is created and deleted again there.
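The pattern of creating and deleting every helper object inside one call might be sketched as follows. This is not the code in `selecthelper.i`: the `native` context manager, the `events` log and the `read_page_text` function are hypothetical, and the helper names only echo the text extraction objects mentioned above.

```python
from contextlib import ExitStack, contextmanager

events = []  # records creation and destruction order for illustration


@contextmanager
def native(name):
    """Hypothetical stand-in for creating and later freeing a MuPDF object."""
    events.append(f"create {name}")
    try:
        yield name
    finally:
        events.append(f"free {name}")


def read_page_text(page_text):
    """Return plain Python data; no intermediate object escapes this call."""
    with ExitStack() as stack:
        stack.enter_context(native("text sheet"))
        stack.enter_context(native("text page"))
        stack.enter_context(native("text device"))
        result = str(page_text)  # take the result while the helpers exist
    # Leaving the block released the helpers in reverse order of creation.
    return result


text = read_page_text("Hello, PDF")
```

The caller only ever sees the returned string, so the order in which the helpers are released is decided entirely inside the function.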
A first sketch of the Python programmer's view could look like this:
- Document level: meta information, authentication, permissions, bookmarks, and more (maybe later: lists of objects like images and fonts used, file attachments, ...)
- Page level: text extraction (presented in various formats), links / annotations, pixmaps, and more (maybe later: form fields, lists of document level objects used here, ...)
- Graphics: the various options to create / manipulate images are loosely connected to page level processing, but can also be used independently. All the matrix / transformation stuff falls under this topic, too.
Please let me know what you think.
When loading a `Page` from a `Document`, some objects like the `Link`s in this page are already built and contained in the large struct of the `Page`. And when we call functions like `loadLinks()`, they just return the pointers from the `Page` struct without creating anything. The result is that when the `Page` is destroyed, the things it contains are corrupted as well.
Taking this into account, we'll have to dump objects like `Link` if we'd rather not use ref counting for management. Actually, the constructor of `Link` doesn't make any sense: we just load them from existing pages.
The refactoring would be such a big one that I think we would encounter many complex issues, including memory leaks, segmentation faults, etc. But we can try starting a totally new branch and working on it.
@mozbugbox - class `linkDest` is no longer based on MuPDF code and now resides solely in the Python layer. So this issue should no longer exist ...