This function has become much slower in version 0.10.0 than it was in 0.9.x Append

PDFDocument:AppendDocumentPages performance regression,about podofo/podofo

Comments (33)

kovidgoyal commented on June 12, 2024

The main difference I see is in 0.9.7 objects are copied only if they are dictionaries and have a stream. 0.9.7 code

        if( (*it)->IsDictionary() && (*it)->HasStream() )
            *(pObj->GetStream()) = *(static_cast<const PdfObject*>(*it)->GetStream());

0.10.0 code:

        auto newObj = new PdfObject(PdfDictionary());
        newObj->setDirty();
        newObj->SetIndirectReference(ref);
        m_Objects.PushObject(newObj);
        *newObj = *obj;

I dont know enough about PoDoFo internals to tell if the 0.10.0 version is correct or is doing too much.

from podofo.

kovidgoyal commented on June 12, 2024

One n^2 algorithm is that For each call to appendDocumentpages it call addfreeobject for every free objet in the document being appended. And addfreeobjects itself scans through all free objects looking for every existing equal reference. This is O(n*m). Instead it would be better to just find the highest existing number once and use that as the difference to add create the new empty free objects.

from podofo.

kovidgoyal commented on June 12, 2024

I am trying to write my own implementation of appendPages that uses a std::unordered_map to do the reference number mapping efficiently. However, I cannot find API to insert a page into an existing document using the existing page object. InsertPageAt is private. Trying to copy the dictionary of the page object manually to a newly created page, leaves the PdfResources and PdfContents objects of the newly created page broken

I am pretty frustrated at this point. The API is preventing me from doing what I need for no good reason. I have spent over a man week rewriting my code against this new API and there are still regressions and workarounds needed.

Can you please either:

Fix AppendDocumentPages here is a non brain-dead implementation posted years ago: https://stackoverflow.com/questions/55283985/copy-only-necessary-objects-from-pdf-file
Provide some API that I can use to implement it myself. Maybe just make InsertPageAt public. Or tell me how I can manually copy the PdfResources and PdfContents objects over, I think I can use PdfContents::Reset() but PdfResources has no reset, no way to reload from a dictionary that I can see.

And please make a release so I can actually use the new code.

from podofo.

kovidgoyal commented on June 12, 2024

I figured out how to manually copy the resources, the trick is copy the object then call GetOrCreateResources() since the newly created page has no resources to start with it will read the resources from the copied page

from podofo.

ceztko commented on June 12, 2024

Hello, I understand your frustration with the code in Append/Insert, but the code was taken from 0.9.x trying to preserve it unchanged, it wasn't unit tested and I never used it my self. As said, the code there smells bad, as it looks like quickly hacked together without wanting to get deep into the library, possibly because at the time the library lacked many facilities to do a better job. In particular the approach to append objects using reference numbers with an offset looks very naive and poorly conceived to me. With the last message of yours I understand you are writing your own method to insert/append, which is the approach I recommend if you really can't wait for another implementation to appear. I describe here what would be my approach in fixing it so you can understand what I would do instead, which is not trivial but still quite doable today:

Parse all the content streams of the page and collect all the resources really used, using the PdfContentStreamReader. This really requires knowing the syntax of all PDF operators where they refer to page resources, which requires a good read of the PDF specification;
Create an indirect object of the page in the target document with the shallow copy of the object from the source document, deleting the /Resources dictionary;
Traverse both the source and newly created page object recursively, recreating each indirect object from the source document to the new document and persist the mapping in a std::unordered_map, and update the shallow reference in the newly created objects in the target document objects;
Recreate the /Resource dictionary in the newly created page object by selecting only used resources as found in 1. and using the same approach as in 3.

If I am not missing anything this will not mess with free objects handling in PdfIndirectObjectList so it should not suffer the regression penalty you are reporting here.

from podofo.

kovidgoyal commented on June 12, 2024

I mostly understand how to do it from a PDF spec perspective. My use case is simpler -- I am merging whole files, so I can just copy all objects I dont actually need to check which objects are needed.

The problem is the new PoDoFo API is making this unnecessarily difficult. For example, PdfContents::Reset() exists and can be used to change the /Contents reference for an existing page (it leaks the original object, but that can be worked around).

But, PdfResource has no API to set the resource dictionary , one has to manually copy individual resources. Using PdfPage::GetOrCreateResources() deletes the existing /Resource key and replaces it with an empty resource when the PdfResources object does not yet exist.

One also has to deal with /Parent when shallow copying page objects. And also inherited page attributes.

from podofo.

ceztko commented on June 12, 2024

Actually you make me notice the PdfContents::Reset(obj) API with given obj pointer parameter is actually quite bad. I probably added it and I'm not proud of it, so at some point I will just remove the parameter and keep the parameter less version.

It's ok if you investigate if PdfResources has some nice accessibility method that may be handy for you but don't expect something like PdfResources::Reset(objptr) will be added because as I told you I believe it's not a sane API. PdfResources (and in general all classes inheriting PdfElement) is basically adding convenience methods to the underlying pdf object dictionary: if a convenience method that you need is missing you can always play the card to retrieve the underlying dictionary with PdfElement::GetDictionary() and you do all the stuff you need on the underlying object with no imposed limit. You can also temporarily copy/strip a PdfResources and add your convenience methods there. Good point about looking for inherited page attributes.

The approach I suggested in my previous post is optimized for appending/inserting a selection of pages, and can also handle unfortunate cases of documents sharing a single /Resource object for all the pages (merging a single page from these documents will always mean merging the whole document if the content streams are not inspected). I understand it's not your case, but such documents exist so I would take care of this in my implementation, when I will have time to do it. Maybe I will have a look at it in 2-3 weeks, but no promise.

from podofo.

kovidgoyal commented on June 12, 2024

On Fri, Jun 16, 2023 at 03:20:55AM -0700, Francesco Pretto wrote: Actually you make me notice the `PdfContents::Reset(obj)` API with given `obj` pointer parameter is actually quite bad. I probably added it and I'm not proud of it, so at some point I will just remove the parameter and keep the parameter less version. It's ok if you investigate if `PdfResources` has some nice accessibility method that may be handy for you but don't expect something like `PdfResources::Reset(objptr)` will be added because as I told you I believe it's not a sane API. `PdfResources`, and in general all classes inheriting `PdfElement`, is basically adding convenience methods to the underlying pdf object dictionary: if a convenience method that you need is missing you can always play the card to retrieve the underlying dictionary with `PdfElement::GetDictionary()` and you do all the stuff you need on the underlying object with no imposed limit. You can also temporarily copy/strip a `PdfResources` and add your convenience methods there. Good point about looking for inherited page attributes.

The problem is that PdfContents, for example, has a private m_object field that stores a pointer to the contents object. If I directly change the /Contents key in the underlying dictionary without Reset() there is no way to change m_object, so any other code that use PdfContents is now getting a different contents than what's in the dictionary. If you intended PdfContents to be a convenience view, then it needs to not store a reference to the contents object but instead fetch it whenever it is needed. This is how PdfResources works, but it is *not* how PdfContents works.

The approach I suggested in my previous post is optimized for appending/inserting a selection of pages, and can also handle unfortunate cases of documents sharing a single `/Resource` object for all the pages (merging a single page from these documents will always mean merging the whole document). I understand it's not your case, but such documents exist so I would take care of this case in my implementation, when I will have time to do it. Maybe I will have time to have a look at it in 2-3 weeks.

That's fine, I am not suggesting my approach is good as a general one. I have written the code and it works, but it is still slow, so the slow down is elsewhere. Not in the object number assignment. Probably in the object copying, maybe its decoding/encoding multiple times, or something. I will need to investigate further.

from podofo.

ceztko commented on June 12, 2024

The problem is that PdfContents, for example, has a private m_object
field that stores a pointer to the contents object

Sorry, I can't follow you precisely in what you are doing, but I give you more hints: PdfContents it's no more inheriting PdfElement: It should have private constructors and should be not copyable, so it will not leak pointers. This was not done in the API review but I am taking notes to do it in the future. PdfContents helps in handling the fact that page /Contents can be either a stream object or an array. If you are just appending whole document to another, I don't see why this should cause you troubles: create new empty pages, get the PdfContents instance from the newly created pages, get the stream with GetStreamForAppending(), get an output stream with GetOutputStream() and then copy from source pages with PdfContents::CopyTo(outputstream) , then adjust the pages attributes and /Resources dictionary.

from podofo.

kovidgoyal commented on June 12, 2024

On Fri, Jun 16, 2023 at 05:50:08AM -0700, Francesco Pretto wrote: > The problem is that PdfContents, for example, has a private m_object field that stores a pointer to the contents object Sorry, I can't follow you precisely in what you are doing, but I give you more hints: `PdfContents` it's no more inheriting `PdfElement`: It should have private constructors and should be not copyable, so it will not leak pointers. This was not done in the API review but I am taking notes to do it in the future. `PdfContents` helps in handling the fact that page `/Contents` can be either a stream object or an array. If you are just appending whole document to another, I don't see why this should cause you troubles: create new empty pages, get the `PdfContents` instance from the newly created pages, get the stream with `GetStreamForAppending()`, get an output stream with `GetOutputStream()` and then copy from source pages with `PdfContents::CopyTo(outputstream)` , then adjust the pages dictionaries attributes and `/Resources`.

I dont want to copy the stream, the stream is already copied along with all the rest of the objects in the source document. I want to change the array to which the /Contents key points in the destination page object's dictionary. If I do that by accessing the page dictionary directly, then the PdfContents object points to the wrong array, without using Reset(). And Reset() does leak a pointer. If you remove Reset, there is *no way* to change the m_object field of PdfContents once it is constructed. And there is also no way to replace the PdfContents object inside a PdfPage object once it is instantiated.

from podofo.

kovidgoyal commented on June 12, 2024

Even with the rewritten append function, time to append is rising with number of objects in the destination document. The rewritten function is only calling CreateObject() and CreatePage(). So there is some scanning over all objects going on in those code paths as well. Maybe I should just profile this. The annoying part is its in python module that wraps Podofo so setting up profiling is a pain.

from podofo.

kovidgoyal commented on June 12, 2024

Staring some more at PdfIndirectObjectList.cpp I see that PushObject searches through the entire list of objects to check if the object is already there. addNewObject calls PushObject(). CreateObject calls addnewObject(). This happens unconditionally for every new object we create. So we have another N^2 killing our performance.

addNewObject() needs to be re-written to not check for an existing object. That will almost certainly solve this bug.

from podofo.

kovidgoyal commented on June 12, 2024

m_Objects is std::set so the entire list of existing objects is scanned twice per CreateObject call. Supposedly this should be 2* log(n) since std::set is a red-black tree, but it looks quadratic to me.

from podofo.

ceztko commented on June 12, 2024

True, m_Objects is looked up twice. One to check for existent node here and one for actual insertion here. We could try remove one lookup by using the insert overloads that take the iterator hint. I coded up a quick patch, and I tested it only by running the unit tests. Please have a look it and tell if if helps and if you feel it's correct.

create_object_onelookup.diff.txt

This will not change the complexity, though. So your guess still needs more substantiation.

from podofo.

kovidgoyal commented on June 12, 2024

The patch looks incomplete? You didnt change the body of pushObject(). Anyway let me work on setting up some profiling rather than this guess work.

from podofo.

ceztko commented on June 12, 2024

You are right. Sorry: other work has my priority and focus. I fixed the patch (tests still work). You can still evaluate it if you do some serious profiling.

from podofo.

kovidgoyal commented on June 12, 2024

I did some profiling using gperftools, attached. 99% of the runtime is spent in getPageNode. Trace from pprof is attached.
perf.txt

from podofo.

kovidgoyal commented on June 12, 2024

This is called by InsertPagesAt which is called by CreatePage. It does a complete, linear scan of the entire pages tree for every call to CreatePage when appending a page at the end. There are several approaches one can use to mitigate this:

Simplest:
Add a bulk creation API CreatePages which allows creation of a large number of pages.

Most involved:
Store the parents of every page either in the cache or in the page object itself and use this in getPageNode.

from podofo.

ceztko commented on June 12, 2024

I just implemented all the proposed changes as described in the PR comment in commit af034bf. I preserved CreatePagesAt but I changed the order or the arguments. Investigating performance issues in PdfPageCollection is really something I wanted to do for months but it's always a shock when the solution to some bugs/performance issues is taking a big amount of apparently good looking code written by other people and junk it completely. I ask you if you can support me with testing as this is a very invasive change and I am doing it in a period where I can't spend too much time on PDF related tasks.

from podofo.

kovidgoyal commented on June 12, 2024

Running from master and calling PdfPagesCollection::AppendDocumentPages(doc) now results in a bunch of empty pages being appended. Looks like the /Pages nodes are broken. This is what the /Pages looks like in the generated PDF

<</Type/Pages/Count 15/Kids[ 2 0 R 8 0 R]/Pages 28 0 R>>
28 0 obj
[ 65 1 R 2 0 R 30 0 R 44 0 R 111 0 R 118 0 R 146 0 R 155 0 R 174 0 R 211 0 R 244 0 R 252 0 R 281 0 R 306 0 R 330 0 R]
endobj

IIRC /Pages objects should just have /Kids, /Count, /Type and /Parent (if not root node). Instead PoDoFo now seems to be adding a /Pages key.

from podofo.

ceztko commented on June 12, 2024

I found something not working in the intended flattening. I'm working on it right now.

from podofo.

ceztko commented on June 12, 2024

Ok, flattening was not working, fixed in 8f51733. I added a unit test flattening a simple structure and verifying it subsequently. I also made public the method to issue flattening voluntarily. If you can continue testing I would be glad. Please note I don't use/test PdfPagesCollection::AppendDocumentPages(doc) and I don't want to look at that code at all at the present time (unless I'm forced to): if you find obvious issues there that can be fixed with few lines of code I will wait for your patches.

from podofo.

kovidgoyal commented on June 12, 2024

Thanks, working now. Even without CreatePagesAt time to merge 3000 PDF
files is ~6 secs down from days.

All other functionality that my application uses from podofo also seems
to be working.

Please consider making a release soon, as this is a pretty serious
regression for my users. I have fielded multiple bug reports about it in
the last few days alone.

And just for the record, I dont use AppendPages anymore either, I have written
my own replacement for it, I mentioned it simply as an easy way for
you to reproduce the issue.

from podofo.

ceztko commented on June 12, 2024

I would like to do a release soon but I was waiting hopelessly someone help with a couple of security issues reported related to encryption. If I can't find time to fix them as well I will do a 0.10.1 release begin of next week. In that release I will probably strip away a breaking API change I did recently.

from podofo.

kovidgoyal commented on June 12, 2024

Thanks, much appreciated.

from podofo.

ceztko commented on June 12, 2024

I also pushed ae880a1 with the removed lookup when replacing/pushing objects. I studied the set<T>::insert(pos, ..) overload and the way I did in the commit by copying and incrementing the iterator before node extraction should be correct.

from podofo.

ceztko commented on June 12, 2024

I'm closing this one now. Please open other issues if you need.

from podofo.

kovidgoyal commented on June 12, 2024

I would like to do a release soon but I was waiting hopelessly someone help with a couple of security issues reported related to encryption. If I can't find time to fix them as well I will do a 0.10.1 release begin of next week. In that release I will probably strip away a breaking API change I did recently.

Just a gentle reminder, as I need to build and test the new release into my own software.

from podofo.

ceztko commented on June 12, 2024

Hello, yesterday night I analyzed a security report (which required no intervention) out of 3. Two reports to go, but I can work on them only in my free time.

from podofo.

ceztko commented on June 12, 2024

I cleared the vulnerabilities reports list, I should be able to release tonight. I ask you again if you can give me more feedback on the PdfPageCollection rewrite, since it was quite an invasive change. The tests so far have currently been done by me, you, and a small private test group that I interface with.

from podofo.

kovidgoyal commented on June 12, 2024

I have run every single test I can think of with the new code, it works for me. I use PoDoFo to merge pdf files, do some basic imposition of one page over another and iterate over the objects in the file to dedup fonts/images and also to draw JPEG images and update metadata. All that seem to be working.

Dont bother rushing the release, I have already bundled a podofo based on master with my software. So other than linux users who use distro packages, no one will be affected.

from podofo.

ceztko commented on June 12, 2024

I released 0.10.1. It's like master with one reverted commit that was API incompatible and will be released in future versions.

from podofo.

kovidgoyal commented on June 12, 2024

thanks

from podofo.

PDFDocument:AppendDocumentPages performance regression about podofo HOT 33 CLOSED

Comments (33)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent