Comments (7)

mozbugbox commented on May 12, 2024

One option is for the dest and the rect to each hold a reference to the parent link object. Then the link will not be freed while either the dest or the rect is alive.
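
A minimal sketch of this suggestion in plain Python, not PyMuPDF's actual code. The classes `Link` and `LinkDest` are hypothetical stand-ins, and the weak reference is only a probe used to observe whether the parent is still alive.

```python
import weakref


class Link:
    """Hypothetical stand-in for a wrapped link object."""

    def __init__(self, uri):
        self.uri = uri


class LinkDest:
    """Hypothetical child object that keeps its parent Link alive."""

    def __init__(self, link):
        # Strong reference: the Link cannot be freed while this object exists.
        self._link = link

    @property
    def uri(self):
        return self._link.uri


def first_dest():
    # The Link is only a local variable here, yet it survives the return
    # because the LinkDest that is handed out still references it.
    link = Link("https://example.org")
    probe = weakref.ref(link)  # does not keep the Link alive by itself
    return LinkDest(link), probe
```

Calling `first_dest()` returns a dest whose `uri` is still readable, and the probe still resolves to the parent for as long as the dest is kept around.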

from pymupdf.

JorjMcKie commented on May 12, 2024

We'll have to wait and see what @rk700 has to say about this.
But as an interim solution: how about extracting all link-related information from the page in one go (as the getLinks() method does)? After all, it is not that much information ... getLinks() delivers a list of dictionaries, each of which completely describes a Link and its dependent linkDest.
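
The "everything in one go" idea can be sketched as follows. This is an illustration under assumptions, not the real getLinks() implementation: the function `snapshot_links`, the dictionary keys, and the attribute names on the fake page are all hypothetical.

```python
from types import SimpleNamespace


def snapshot_links(page):
    """Copy each link's data into plain dicts that own their values.

    The keys "rect", "uri" and "page" are illustrative assumptions,
    not the documented getLinks() format.
    """
    return [
        {
            "rect": tuple(link.rect),  # copy the coordinates, keep no wrapper
            "uri": link.dest.uri,
            "page": link.dest.page,
        }
        for link in page.links
    ]


def make_fake_page():
    # Hypothetical stand-in for a page with a single link.
    dest = SimpleNamespace(uri="https://example.org", page=-1)
    link = SimpleNamespace(rect=[0, 0, 10, 10], dest=dest)
    return SimpleNamespace(links=[link])
```

Because the result holds only tuples, strings and integers, it stays valid after the page and its link objects are gone.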

As for the Outline, a solution like the one you suggest is already implemented. In addition, the getToc() method extracts all bookmarks at once as a Python list, optionally together with the associated linkDest information (option simple=False).

rk700 commented on May 12, 2024

[Screenshot, 2016-05-20 9:51:27 AM]

Please take a look at the screenshot. The dest and rect in the res dict are corrupted not when the Link object and its members are destroyed, but when the Page object is freed.

I think that's the result of object dependencies in MuPDF. The document structure objects are closely related, and we must make sure that they are destroyed in the right order. For example, if the Outline is freed after the Document, there will be a segfault.

MuPDF is written in C, and its users have to take responsibility for using it correctly. As a Python wrapper, however, PyMuPDF aims to be less complex, so that users do not have to take care of object management. But the convenience comes at a price. For example, you might not be able to keep Page as a local variable and return its descendants, as in @mozbugbox's code.

Actually, I've been thinking about how to manage the objects gracefully for quite a while. Please feel free to tell us if you have any ideas.

rk700 commented on May 12, 2024

@JorjMcKie maybe we can increase the ref count of Page when calling loadLinks() to ensure that the Page is freed only after the Links it contains are freed.

JorjMcKie commented on May 12, 2024

@rk700 - I favor your idea of reconsidering the relationship between the set of "objects" in MuPDF and the set of objects in PyMuPDF.
Today, the analogy between the two is very close.
This is not a MUST, but it is convenient (as the example of text extraction shows: TextSheet, TextPage and text device would actually no longer need to exist in PyMuPDF).
But this convenience means that we have to synchronize our object creation / destruction process with that of MuPDF, as you pointed out. I am afraid that if we continue to follow this path, our code will become more and more complex and will still contain gaps. And of course we do not want to burden the Python programmer with the duty of explicitly / manually destroying objects.
If I mentally take a step back: why would I, as a Python programmer, need to know about an animal like linkDest or device? When I deal with the links of a page, I would be happy to get a list of link objects that each already contain all the information in one shot, including stuff that is now spread over the sub-objects fitz.Rect and fitz.linkDest.
Why would I need to know that something like an outline object even exists? If I need bookmarks, then surely I want a complete table of contents with all the link information for every bookmark item, don't I?
Whether or not these examples are already well thought through:
I believe that if we want to attack the problem, we should take the user's (= Python programmer's) point of view. The objects the user sees and the objects MuPDF handles would no longer be in a 1:1, but rather in a 1:n (or even m:n?) correspondence. Our code would then have better control over MuPDF object management. Again, have a look at file selecthelper.i, function readPageText: every necessary MuPDF object is created and deleted again there.
A first sketch of the Python programmer's view could look like this:

  • Document level: meta information, authentication, permissions, bookmarks, and more (maybe later: lists of objects like images and fonts used, file attachments, ...)
  • Page level: text extraction (presented in various formats), links / annotations, pixmaps, and more (maybe later: form fields, lists of document level objects used here, ...)
  • Graphics: the various options to create / manipulate images are loosely connected to page level processing, but can also be used independently. All the matrix / transformation stuff falls under this topic, too.
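
The 1:n idea mentioned above, where one user-facing call creates and deletes every low-level object it needs, might be sketched like this. It is not the code from selecthelper.i: `NativeHandle` and `read_page_text` are hypothetical names, and the handles only count themselves instead of wrapping real MuPDF objects.

```python
from contextlib import ExitStack


class NativeHandle:
    """Hypothetical stand-in for a low-level object that needs explicit release."""

    live = 0  # number of handles currently not released

    def __init__(self, name):
        self.name = name
        NativeHandle.live += 1

    def close(self):
        NativeHandle.live -= 1


def read_page_text(words):
    """Hypothetical one-shot call: returns a plain str and leaves no handles behind."""
    with ExitStack() as stack:
        # Stand-ins for the intermediate objects named in the discussion.
        for name in ("text sheet", "text page", "text device"):
            handle = NativeHandle(name)
            stack.callback(handle.close)  # released even if an error occurs
        # Only plain Python data leaves the function.
        return " ".join(words)
```

The caller never sees the intermediate objects, so there is nothing left for the caller to free in the wrong order.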

Please let me know what you think.

rk700 commented on May 12, 2024

When a Page is loaded from a Document, some objects, such as the Links on that page, are already built and contained in the large Page struct. When we call functions like loadLinks(), they just return pointers into the Page struct without creating anything. The result is that when the Page is destroyed, the things it contains are corrupted as well.
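
The hazard described here can be imitated in pure Python, as a model only: the classes `Page` and `BorrowedLink` are hypothetical, and a weak reference stands in for the raw C pointer. Unlike a real dangling pointer, the model raises an error instead of returning corrupted data.

```python
import weakref


class Page:
    """Hypothetical model of a page that owns all of its link data."""

    def __init__(self, uris):
        self.uris = list(uris)

    def load_links(self):
        # Nothing is copied: each link only borrows from this page.
        return [BorrowedLink(weakref.ref(self), i) for i in range(len(self.uris))]


class BorrowedLink:
    """Hypothetical link that owns no data of its own."""

    def __init__(self, page_ref, index):
        self._page_ref = page_ref
        self._index = index

    @property
    def uri(self):
        page = self._page_ref()
        if page is None:
            raise ReferenceError("page was freed; its link data is gone")
        return page.uris[self._index]


def links_of_temporary_page():
    # The page is a local variable; nothing keeps it alive after the return.
    page = Page(["https://example.org/a"])
    return page.load_links()
```

Reading `uri` from the links returned by `links_of_temporary_page()` fails in this model once the page has been collected, while links loaded from a page that is still referenced keep working.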

Taking this into account, we'll have to drop objects like Link if we don't want to use ref counting for management. Actually, the constructor of Link doesn't make any sense: we just load links from existing pages.

The refactoring would be so big that I think we would encounter many complex issues, including memory leaks, segmentation faults, etc. But we can try starting a totally new branch and working on it.

JorjMcKie commented on May 12, 2024

@mozbugbox - class linkDest is no longer based on MuPDF code and now resides solely in the Python layer. So this issue should no longer exist ...
