Comments (38)
And just for context, what I am trying to do is I have one pdf file with a cover page and TOC and a group of other "body" PDF's that are the documents referenced in the TOC.
Im trying to merge the cover/toc pdf with the "body" pdf's, and have links in the TOC point to the appropriate page in the final PDF as well as having proper bookmarks in the final PDF.
I can think of two approaches:
- I add blank pages to the cover/toc pdf for every "body" pdf page that i will be appending, with a named destination at the very top of the page. Then stamping the "body" pdf pages into their appropriate pages.
- Just appending the "body" pdf's and inserting the named destinations on the fly.
The issue stated above refered to approach # 1, I can join the pdfs fine, but the links/bookmarks from the toc/cover pdf are lost in the final pdf.
from combine_pdf.
Hi Andrew,
Thank you for bringing these issues to my attention.
Allow me to point out that CombinePDF creates a new PDF file, so I doubt if Bookmarks are transferrable, although that could be an interesting feature to have. I will try looking into bookmarks when done with the TOC.
As for keeping named destinations intact, it's definitely on the roadmap.
The issue, at the moment is that ad-hock links within the PDF file (the links using the PDF's "name tree") are preserved, but TOC links (links based on the entries in the table of contents) break since the original table of contents is discarded.
I definitely want to merge the TOC of different PDF files as well, perhaps adding a new "root" for each file (since the TOC is a semi-complex binary tree were each parent must provide data about it's children), but this requires me to dig deeper into the standard again and I'm in the middle of a few projects that take most of my time...
I'm keeping this open. If you have any further ideas about implementing these features, let me know.
I'll keep you posted when I get to this.
Bo.
from combine_pdf.
I definitely want to merge the TOC of different PDF files as well
With ToC, do you mean the outline 12.3.3 or named destinations 8.2.1?
We're currently facing the problem that our outline went missing.
from combine_pdf.
When talking about the TOC, I was referring to the outline.
Named destinations are used for links inside a page. The outline is used as an "external" navigation tool.
I'm assuming the outline will need to be extended when merging outlines, so that a root tree node is created and that this root tree node expands each "branch" to include all the data in the original document and any imported document.
On the other hand, this root node would need to recognize CombinePDF root nodes in order to merge with them rather then swallow them.
from combine_pdf.
Is there a workaround? At least to not lose the outline of the main document?
from combine_pdf.
There's no workaround at the moment, as preserving the data and moving it around isn't as simple as a single line of code.
At the moment, the ToC is lost during the parsing stage, unlike the way the named destinations are preserved.
After the parsing, the data will need to be saved to the PDF object.
Then, the ToC would need to get updated during the import of new objects, a similar process happens with the names object and the form data.
Also, the ToC might reference named destinations and it might use conflicting keywords... i.e. names are updated to prevent conflicts and the ToC might need to go through a similar process.
The last step to write would be to get the ToC into the new catalog/PDF object, similar to the way the names object is added for the rendering.
I had tons of stuff on my plate, so I didn't get to it, but it is definitely possible to code this... it's just more complex of an undertaking.
from combine_pdf.
Got kind of stuck.. or lost, somewhere in the outlines hash
Merging 2 Outline hashes seems not to work by using
actual_value(@outlines).update actual_value(data.outlines_object), &self.class.method(:hash_merge_new_no_page)
"seems not to work" as in SystemStackError - stack level too deep from calling merge
in hash_merge_new_no_page
over and over.
This is making me believe that I made a mistake somewhere in the steps 1-3, you described (see 6d16830), especially because of the fact that a pdf with 7 bookmarks needed tens of thousands of lines (literally) on a fullscreened console to print the outlines hash.
I'd appreciate a little nudge in the right direction if you don't mind :)
This is what the beginning on the outlines hash looks like, starts with obvious and usefull data, but deeper in is stuff like fonts, annots, meadiaboxes and so on.
{:Type => :Outlines,
:Count => 7,
:First => {:is_reference_only => true, :referenced_object => {:Title => "\xFE\xFF\x00I\x00N\x00H\x00A\x00L\x00T\x00S\x00V\x00E\x00R\x00Z\x00E\x00I\x00C\x00H\x00N\x00I\x00S",
:Parent => {:is_reference_only => true,
:referenced_object => {:Type => :Outlines,
:Count => 7,
:First => {...},
:Last => {:is_reference_only => true,
:referenced_object => {:Title => "\xFE\xFF\x00A\x00U\x00F\x00G\x00A\x00B\x00E\x00N",
:Parent => {:is_reference_only => true,
:referenced_object => {...}},
:Count => 2,
:First => {:is_reference_only => true,
:referenced_object => {:Title => "\xFE\xFF\x00E\x00v\x00a\x00l\x00u\x00i\x00e\x00r\x00u\x00n\x00g\x00 \x00v\x00o\x00m\x00 \x000\x009\x00.\x000\x006\x00.\x002\x000\x001\x006",
:Parent => {:is_reference_only => true,
:referenced_object => {...}},
:Count => 0,
:Next => {:is_reference_only => true,
:referenced_object => {:Title => "\xFE\xFF\x00R\x00e\x00e\x00v\x00a\x00l\x00u\x00i\x00e\x00r\x00u\x00n\x00g\x00 \x00v\x00o\x00m\x00 \x000\x009\x00.\x000\x006\x00.\x002\x000\x001\x006",
:Parent => {:is_reference_only => true, :referenced_object => {...}},
:Count => 0,
:Prev => {:is_reference_only => true,
:referenced_object => {...}},
:Dest => [{:is_reference_only => true,
:referenced_object => {:Type => :Page,
:Parent => {:is_reference_only => true,
:referenced_object => {:Type => :Pages,
:Count => 9,
:Kids => [{:is_reference_only => true,
:referenced_object => {:Type => :Page,
:Parent => {:is_reference_only => true, :referenced_object => {...}},
:MediaBox => [0, 0, 595.28, 841.89],
:Contents => {:is_reference_only => true,
:referenced_object => {:Length => 1423,
:Filter => [:FlateDecode],
:raw_stream_content => RAW STRUM}},
from combine_pdf.
I don't think you're lost... yet :-)
I think the issue is the recessive nature of the update function.
The Outline hash references the pages which, in turn, reference the catalog, which brings us back to the outline... But this isn't the issue because the hash_merge_new_no_page
prevents this recursive branch during it's "deep update" implementation.
There must be another branch of data that causes recursive updating.
It seems that each node in the outline tree references it's parent node (:Parent
)... these references might be the one leading to the recursive update.
I'm guessing a manual update scheme should be implemented or a different hash update function should be used to resolve data conflicts (instead of hash_merge_new_no_page
). maybe the hash_merge_new_no_page
function should ignore and :Parent
key names...
from combine_pdf.
Seems like that's the last piece missing.. How to merge the outlines
In theory, it's just about inserting into a two-way list, but I'm having trouble understanding the logic behind the way the object gets build in the parser, or rather the result, where the outlines hash has the first and the last outline node and everything else about the outline is "stuffed" into those 2 entries and some information about the bookmark order can be found multiple times within the hash.
I'm not sure, how to insert the new bookmarks into that, yet.
from combine_pdf.
I'm sorry it's taking me a while to answer, it's a busy week for me.
To resolve this we need to read through the PDF specification.
Section 12.3 to the specs deals with outlines and I think this is the part we should be looking at.
When merging the data, we should make it fit the specs, either by:
- merging existing outlines as tree branches (no need to unpack each outline, we simply create a master outline that treats these outlines as children; OR
- merging the actual outline data (probably preferable but might be complicated) in accordance with location at which the incoming PDF object is injected.
At this point, named destinations (see 12.3.2.3 in the specs) need to be reviewed, to make sure we don't have naming conflicts. Similar code already exists and could probably be adapted for the outline.
It seems to me that in order to combine the outline (which is a linked list), the code could "walk" the list, updating the :First
, :Next
, :Last
and :Count
with each "step".
The code will probably need to walk only the root of the outline (assuming branches of the outline remain unchanged).
If the PDF is injected in the middle, the :First
and :Last
entries might not need to be updated, but in both other cases (injected at the beginning or the end) these entries will need to be updated.
I have no idea how to handle the middle case, as I have no idea how to discover which "bookmarks" reference the page after which the PDF is injected.
from combine_pdf.
Well, I was able to append the second outline to the original one so far (here: ee14937).
As far as the other (probably preferable option goes, I assume we'd have to use the :destination
keys in the outline nodes to identify the page it links to and then insert them accordingly, but the insertion part gets kind of tricky, because the outline is full of circular references, of course there are the :Page
and the :Parent
as you pointed out earlier, but every node can have up to 5 references to other nodes (:Parent
:Prev
:Next
:First
:Last
) so a node that has a :Next
node it points to will also have that node pointing to it with : Prev
, which makes "walking" through the hash very difficult, Which is also why I didn't do that in the commit I wrote about at the beginning. I updated the outline :Count
and then all the :Parent
's for nodes that have the outline as parent, then changed the :Last
key of the main outlinedictionary to the new outline and finally appended the new outline to the old one by going to the last node of the old outlines :First
key and inserting it there.
Tomorrow I'm going to test it some more and figure out if the result is according to the specifications and then hopefully open a pull request.
Also, I might be wrong here, but I am pretty sure that we don't need to do the same thing that has to be done for the named destinations, named destinations literally have a 'word' as link/anchor while the bookmarks use page links, in my tests so far, even merging 2 identical Pdfs didn't cause any problems with names or page links.
Edit/Add: Depending on what I find in the specifications, I might try to add a parent node to the new outline, so it is easier to identify an inserted part.
from combine_pdf.
Hi Stefan,
I've been working on a lot of re-writes and bug corrections for resolving lost form data, decryption issues and the like... It actually brought some in-depth changes, so things might be easier now.
For instance, the reference resolution should be both faster and easier... but it is implemented differently, so (I'm sorry) some of the code you wrote will need to be revised.
For instance, you have this line:
@objects << catalog
# [...]
add_referenced catalog[:Outlines], false
which will break under the new implementation (and isn't required, everything in the catalog is automatically processed).
This includes a new recursion (or repetition) protection helper at CombinePDF::PDF::RECORSIVE_PROTECTION
I hope this will help to "walk" the outline.
After reading the code for the update
, I thought I would give it a try myself. Your code looked concise and thoughtful, however, I though a destructive pattern in a single function would perform better and be easier to maintain over time.
I came up with this untested (and probably broken) destructive update pattern (updates the data in place).
def merge_outlines(old_data, new_data, position)
old_data = actual_object(old_data)
new_data = actual_object(new_data).dup
if old_data.empty?
# old_data is a reference to the actual object,
# so if we update old_data, we're done, no need to take any further action
old_data.update new_data
else
old_data[:Count] += new_data[:Count]
# walk the Hash here ...
# I'm just using the start / finish position for now...
# FIXME to implement an insert in the middle of the file?
prev = nil
pos = first = actual_object(((position < 0) ? new_data : old_data)[:First])
median = actual_object(((position < 0) ? old_data : new_data)[:First])
last = actual_object(((position < 0) ? old_data : new_data)[:Last])
first = old_data[:First] = {is_reference_only: true, referenced_object: first}
last = old_data[:Last] = {is_reference_only: true, referenced_object: last}
parent = {is_reference_only: true, referenced_object: old_data}
# the walking
while(pos)
pos[:First] = first # already a reference
pos[:Last] = last # already a reference
pos[:Parent] = parent # already a reference
if(prev)
pos[:Prev] = {is_reference_only: true, referenced_object: prev}
else
pos.delete :Prev
end
# connect the two outlines
if(pos[:Next].nil?)
pos[:Next] = median
median = nil
end
prev = pos
pos = actual_object(pos[:Next])
end
# make sure the last object doesn't have the :Next property
prev.delete :Next
end
# print_dat_outline(old_data)
return nil # no need to return the data, the update had taken place destructively.
end
I wonder if you tried a similar approach...?
I know you already have something working, but could you test it out and see if there's a performance difference?
I know there are a few things missing:
- This code updates the outline directly, I don't think it's corrupting the "incoming" PDF object because I think everything was duplicated, but this will need to be tested.
- This code doesn't deal with nested outlines... but does it need to? is there any change required for the nested data?
- The code doesn't deal with cases where the insert
position
isn't0
(beginning) nor-1
(end), but rather a relative value (i.e.-3
, three pages from the end)... but I think this might be as close as we get for now.
The uptake of this approach is that it is minimizing temporary objects and function calls (which are relatively expensive in Ruby), editing the data "in place".
I also think it's easier to maintain, but it might be because my mind is slow at computing function call jumps.
Named destinations?
I understand the PDF files you tested with all used page references as TOC destinations. This is great news.
However, I think the standard states that named destinations (using either Name objects (Ruby symbols) or Strings) are allowed (see sec. 12.3.2.3).
I'm not sure, but it could be that we're be covered either way, we just need to check that the existing name conflict resolution logic will be applied also on the names in the TOC (which it probably would).
Edit/Add:
At first I thought that would be cool, but then there is the question of naming that header position (we could probably use the PDF#title
property). This could actually make the outline merge easier, though it would require some thought...
On one hand, I'm not sure this is what I would want to see when merging a file.
On the other hand, this could also be used to create an outline for files without an outline, so all inserted PDF files appear on the outline and the reader can jump directly to this file or the other.
But this will probably only work if files are inserted at the beginning or the end. I don't see how this can be implemented for PDF files inserted in the middle of a page stream.
In practice, I think this idea should be scratched in favor of practicality.
from combine_pdf.
I came up with this untested (and probably broken) destructive update pattern (updates the data in place).
I appreciate it, I see where you are going with that and made it work with your updates the data in place approach.
I know there are a few things missing:
This code doesn't deal with nested outlines... but does it need to? is there any change required for the nested data?
This code updates the outline directly, I don't think it's corrupting the "incoming" PDF object because I think everything was duplicated, but this will need to be tested.
(Edit: Messed up copy-pasting this)
I assume that you are talking about the ToC's subsections. The answer is no. As long as we only enable insertion at the start or end and not somewhere in the middle (within a subsection) it doesn't matter. I would have explained the :First
and :Last
key a little better if I knew that you were going to come up with an update method. Those keys are only present in a subsection parent node and they point to the start and end of that subsection (the main nodes we see in the outline are basically a subsection of the main outline node (by which I mean the tree base, which only has { Type: Outline, Count: #, First{..}, Last{..}}
) which is not visible. Which is why we don't need to touch them, except in the main outline node.
After tweaking your merge method to look like this:
def merge_outlines(old_data, new_data, position)
old_data = actual_object(old_data)
new_data = actual_object(new_data).dup
if old_data.empty?
# old_data is a reference to the actual object,
# so if we update old_data, we're done, no need to take any further action
old_data.update new_data
else
old_data[:Count] += new_data[:Count]
# walk the Hash here ...
# I'm just using the start / finish position for now...
# FIXME to implement an insert in the middle of the file?
prev = nil
pos = first = actual_object(((position < 0) ? new_data : old_data)[:First])
last = actual_object(((position < 0) ? old_data : new_data)[:Last])
median = {is_reference_only: true, referenced_object: actual_object(((position < 0) ? old_data : new_data)[:First])}
old_data[:First] = {is_reference_only: true, referenced_object: first}
old_data[:Last] = {is_reference_only: true, referenced_object: last}
parent = {is_reference_only: true, referenced_object: old_data}
# the walking
while(pos)
# pos[:First] = first # already a reference
# pos[:Last] = last # already a reference
pos[:Parent] = parent if pos[:Parent] # already a reference
if(prev)
pos[:Prev] = {is_reference_only: true, referenced_object: prev}
else
pos.delete :Prev
end
# connect the two outlines
if(pos[:Next].nil?)
pos[:Next] = median
median = nil
end
prev = pos
pos = actual_object(pos[:Next])
end
# make sure the last object doesn't have the :Next property
prev.delete :Next
end
# print_dat_outline(old_data)
return nil # no need to return the data, the update had taken place destructively.
end
its performance was the following:
SMALL PDFS:
user system total real
0.000000 0.000000 0.000000 ( 0.000007)
0.000000 0.000000 0.000000 ( 0.000027)
0.000000 0.000000 0.000000 ( 0.000013)
0.000000 0.000000 0.000000 ( 0.000014)
LARGE PDFS:
user system total real
0.000000 0.000000 0.000000 ( 0.000009)
0.000000 0.000000 0.000000 ( 0.000054)
0.000000 0.000000 0.000000 ( 0.000089)
0.000000 0.000000 0.000000 ( 0.000129)
While my version had this performance:
SMALL PDFS:
user system total real
0.000000 0.000000 0.000000 ( 0.000019)
0.000000 0.000000 0.000000 ( 0.000016)
0.000000 0.000000 0.000000 ( 0.000016)
0.000000 0.000000 0.000000 ( 0.000028)
LARGE PDFS:
user system total real
0.000000 0.000000 0.000000 ( 0.000006)
0.000000 0.000000 0.000000 ( 0.000074)
0.000000 0.000000 0.000000 ( 0.000087)
0.000000 0.000000 0.000000 ( 0.000112)
Merging 4 identical pdfs,
one of them 4 pages with 5 outline nodes (SMALL PDFS),
one 214 pages with 100+ outline nodes (LARGE PDFS),
Test called in pdf_public.rb
, by doing this instead of a normal call to the method:
Benchmark.bm do |bm|
bm.report { merge_outlines(@outlines, data.outlines_object, 0) }
# bm.report { actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), actual_value(data.outlines_object)) unless actual_value(data.outlines_object).empty? }
end
using the according call for the new and the old method testing (new being your approach)
I don't think it's corrupting the "incoming" PDF object because I think everything was duplicated, but this will need to be tested.
I'd have to test more specifically, all I can say at this point is that everything looks fine in the result.
The code doesn't deal with cases where the insert
position
isn't0
(beginning) nor-1
(end)
I think that's sufficient for now aswell. It would be quite some effort to figure out where to put it, especially since there's even multiple ways to define an outline nodes destination, as you pointed out.
However, I think the standard states that named destinations (using either Name objects (Ruby symbols) or Strings) are allowed (see sec. 12.3.2.3).
Yup, you're right, I'm looking into that now.
In practice, I think this idea should be scratched in favour of practicality.
Alright, I see your point.
Keep in mind that everything I wrote so far is from an unmerged upstream standpoint.
I just started merging your changes in, but I'm not clear on how to do this add_referenced catalog[:Outlines], false
now. Just skipping it doesn't work.
from combine_pdf.
Wow, that's a lot of work... nice. I'm super thankful to you for this... this feature should be named after you 👍
Two questions about performance and the merge function.
-
Is it easy to move the
if
statements out of the loop?1.1 I think calling Hash#delete will be faster then calling
if(prev)
for every rotation... we know it's always the last object, or am I wrong?1.2 I know it's my bad, I preferred DRY over performance, but we might consider writing two loops instead of the
if(pos[:Next].nil?)
, but two loops should (in theory) be faster. -
The benchmarks had wide statistical variance... they both seem fast, but I fear context switches and benchmarking overhead might have to much of an effect to learn much.
I'm sure my idea will corrupt any TOC, but maybe we can keep adding it to itself a few times (maybe 1000 times), just to see how these functions behave under heavier load. My goal is that the benchmark will spend about a second actually performing the task, so some of the overhead and context switching effects could be drowned out we might need to try more repetitions, but I think they grow exponentially.
Benchmark.bm do |bm|
merge_outlines(@outlines, data.outlines_object, 0)
bm.report { 1000.times { merge_outlines(@outlines, @outlines, 0) } }
# actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), actual_value(data.outlines_object))
# bm.report { actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), @outlines) }
end
Merging with upstreamed changes
I'm sorry I made you work harder... But I hope no big changes will pop up.
I think I might have messed up the RECORSIVE_PROTECTION
Hash in a way that messes up the reference collection.
The PDF::RECORSIVE_PROTECTION
hash excludes certain keys / objects from being searched.
When I wrote it I assumed there is a :Kids
array listing all the outlines, but now that I think of it, there wasn't one... We probably need to remove the First: true, Next: true, Prev: true
from the RECORSIVE_PROTECTION
Hash... the whole thing should probably be removed and replaced with the simple k == :Parent
it was before... Sorry
If this isn't the reason it fails, I'm not sure what it might be, so let me explain the new logic, maybe it will help.
The new reference collector looks at any existing objects and adds any references they hold into the @objects
array.
Since this is done after the catalog
is added to the (now empty) @objects
array, the new reference collector should have automatically added any missing referenced objects from the @outline (referenced by the catalog object).
In other words, it should have worked automatically once you had this in the code in the rebuild_catalog
function (and saved the file):
catalog_object = { Type: :Catalog,
Pages: { referenced_object: pages_object, is_reference_only: true },
Names: { referenced_object: @names, is_reference_only: true },
Outlines: { referenced_object: @outlines, is_reference_only: true } }
However, for performance reasons, the reference collector (add_referenced
) is only called when there's risk to the data structure's integrity or there's a need to format the PDF data (i.e. saving).
Any tests performed after rebuilding the catalog will fail unless the reference collector is invoked. I'm hoping this is why it failed...
If this doesn't work I might have a bug in my new reference collector... that would definitely suck, but i's possible.
A simple workaround would be to add the @outlines
directly to the @objects
array before invoking the reference collector... but I would need to search for the bug and fix it anyway.
from combine_pdf.
Happy to help, also you did a lot by coming up with the base of how to update the outlines (:
performance and the merge function.
- Is it easy to move the if statements out of the loop?
I'm not sure if you are deleting pos[:Prev]
for some performance reasons I can't see? If not, it wouldn't be needed at all.
while(pos)
pos[:Parent] = parent if pos[:Parent] # already a reference
# connect the two outlines
if(pos[:Next].nil?)
pos[:Prev] = {is_reference_only: true, referenced_object: prev} if prev
median[:referenced_object][:Prev] = {is_reference_only: true, referenced_object: prev} if median
pos[:Next] = median
median = nil
end
prev = pos
pos = actual_object(pos[:Next])
end
this would work aswell. It would kind of combine your 1.1 and 1.2, if the if(prev)
was in it's own loop, that loop would have to run through one whole branch checking [:Next].nil?
every rotation while it's trying to get to the end. This way everything happens in this 1 if
, I'm not sure though if I overlooked anything performance wise.
2. Performance
Let's just say the old way explodes with big data.
Benchmarking the way you suggested didn't quite work for reasons I haven't fully figured out yet (calling the method more than once results in ruby getting stuck like it's in an infinite loop). I worked around that by creating pdfs with huge ToC's using the prawn gem and merging those pdfs. And this is where my method fell short.. when working with multiple really big ToC's. I know it's not very specific, but it was really just a way to confirm what we already knew, so I didn't have to waste time on those methods.
Merging with upstreamed changes
Everything worked out after reverting to k == :Parent
I'm out of time for today, but next up is rebuilding the named destinations used in the outline.
from combine_pdf.
Cool 👍
1. The loop
I love the new loop and unified if
statement, nice work.
I think deleting the :Prev
property was my bad, I didn't think it through and it isn't required.
The only thing is, if we have a nil
value, it will be a NULL object in the PDF file, which just takes up space for no reason and might cause issues in the future.
The first object probably shouldn't have the :Prev
key and the last once probably shouldn't have the :Last
key (even if the key value is nil
)... but this considerations definitely shouldn't be part of the loop.
Performance
Kudos for taking the time to create PDF files with huge ToCs... If you can post any of them (unless they contain sensitive data), I could use them for some of the tests I perform before releasing updates.
I don't add the PDFs used for testing to the online repo, since some of them might contain sensitive data... and I should probably write automated tests, though... which I haven't gotten around to. I keep using the demo app for testing half the features (merge PDF files, text boxes, font importing, tables and page numbering) and then do some manual work for testing other features (i.e. page stumping)... this isn't very effective...
... I digress.
Upstream
I'm happy this is resolved and that the new reference collection works, as I think the new reference collection will make future updates that much easier.
Named destinations
If the names already exist in the name dictionary, this might be handled by the existing name resolution that is automatically applied whenever the object catalog is rebuilt.
If this suspicion I have is correct, then this feature is practically done 👍
from combine_pdf.
The Loop
if we have a
nil
value, it will be a NULL object in the PDF file
"if we have a nil
value" - where in outline would that nil
, you are talking about, be?
The first object probably shouldn't have the :Prev key and the last once probably shouldn't have
the :Last key (even if the key value is nil)... but this considerations definitely shouldn't be part of the loop.
I can definitely add a deletion of those keys after the loop, since those nodes are accessible without looping
Performance
Well, I'm testing my work in the very rails app that brought me here. The app already generates PDFs using Prawn, but for the tests I generated different simplistic PDFs. The code I use are just things like this:
report = CombinePDF.new
b_pdf = Prawn::Document.new
b_pdf.text("Hi prawn")
b_pdf.outline.page title: 'First pdf'
b_pdf.start_new_page
b_pdf.text("Heeeeelloh!")
b_pdf.outline.page title: '1st pdf page 2'
b_pdf.start_new_page
b_pdf.text("aloha!")
b_pdf.outline.define do
100.times do |t|
section "Chapter #{t}", closed: false do
page :title => 'Page 1', destination: 1
page :title => 'Page 2', destination: 2
page :title => 'Page 3', destination: 1
page :title => 'Page 4', destination: 2
end
end
end
report << CombinePDF.parse(b_pdf.render)
report << CombinePDF.parse(a_pdf.render) # a_pdf being a similarly created pdf
final_report = report.to_pdf
send_data final_report,
filename: "test.pdf",
type: 'application/pdf',
disposition: 'inline'
Adding multipliers like this 100.times
where I see fit, this b_pdf
for example has 502 outline nodes + the base node
Named destinations
Well, I'm having a hard time creating an outline that uses this kind of link at the moment.
But I tried merging 2 Prawn manual PDFs and that didn't have the desired result. The merge itself went fine, but both the outlines linked to the same pages of the first PDF. To make things harder my little outline hash print method (don't ask me what happened to the indentation there) also isn't able to print the hash in a readable manner, thanks to the huge raw streams.
So, I'm still trying to figure out what the problem with that PDF is.
Another thing I noticed, by the way, is that the way the outline hash gets build, results in some cases, in the outline :dest
s linking to the pages dictionary, which bloats the outline up greatly, since potentially every node can be a link to a page, which means that every node's :dest
contains all the pages?
inserting pages
When iterating over a PDFs pages and inserting them one by one, that PDFs outline gets left out.
Just letting you know, so you don't get surprised by it later on.
tested like this:
CombinePDF.parse(b_pdf.render).pages.each_with_index do |page, index|
if index < 2 # b_pdf only has 3 pages here
report << page
else
report << CombinePDF.parse(a_pdf.render)
report << page
end
end
Resulting PDF contains only the outline from a_pdf
from combine_pdf.
Hi Stefan,
Named destinations
First, please let me apologize, and hopefully make you happy at the same time.
Last night an Issue was posted that exposed a vulnerability in the named destinations algorithm that was in place.
I had to rewrite the whole thing... but it now works better (although it might be glitchy, as I put it together late at night).
It could be that some of the issues you experienced will self resolve (I doubt, but I hope).
Duplicate objects / bloating
As to having the pages in the named destination hash, it's supposed to be this way. These are the same Ruby objects (same object_id
) linked to from different places. They have a container that wraps them, so whenever the PDF is formatted, a reference is printed instead of the actual data.
The objects should only print once, which is what the add_referenced
algorithm is making sure of... it's both making sure that they actually print and it's making sure the same object doesn't appear more then once.
which is also where we get this funny issue...
"both the outlines linked to the same pages of the first PDF"
I think this is why you get this funny behavior where both the outlines point to pages of the first PDF instead of their own original PDF.
Except for page objects, all objects that contain the same data are reduced to a single object.
For example, when we add pages that use the same font over and over again, we don't want to have multiple copies of the font... so the reference is resolved to the first copy and the rest of the font objects are discarded.
Pages, because of an issue with an older adobe acrobat reader, must have unique references. For this reason, Pages in the page catalog that contain the same data are have a shallow copy made specifically for the catalog. Only the Page's dictionary is duplicated, but not any of the resources they use (fonts etc' are still reduced to a single copy of each).
Th reason I think this might cause the issue where both the outlines point to pages of the first PDF is since the data in the pages is the same, the reference is forced to resolved to the original page (reduction takes place).
I don't think there's a way around it except (maybe) numbering the pages or editing them, because I can think of too many things that might go wrong if we don't enforce this behavior (like the references pointing to the pages in the other PDF object, and then formatting and managing the PDF can start doing weird things)...
...However, I'm not sure that's a contingency we need to explore. If people are just bloating their PDF files by duplicating their own data many times, I doubt if the ToC will be their source of disappointment.
nil value
I'm sorry, I think you answered me, but I'm just making sure, because sometimes I write too much and then I'm not clear.
"if we have a nil value" - where in outline would that nil, you are talking about, be?
In the :Prev
key of the first element and the Next
key of the last (which you answered me about)... but it's probably not really important. and
Upstream changes...
I hope it's easy to merge the new named destination function... sorry for the extra work.
Bo.
from combine_pdf.
Named Destinations
.. but it now works better (although it might be glitchy, as I put it together late at night).
I just realized that my named destinations (generated with prawn), don't work anymore at all. Also tested without using the outline merge, so I assume that it is caused by the your new name rebuilding method.
If I'm not mistaken, the link to a page and the anchor on that page should be renamed to contain something like 'CombinePDF_0000006' as well as the original link/anchor name, which the links currently don't have (haven't checked the anchors).
from combine_pdf.
I'm far from my computer at the moment (hurray for smartphones), but I found a small issue with the renaming logic (I have an extra dic <<
in the first line of the renaming block.
I'll fix this when I get home. Could you please send me the test files? It would save me the time of creating my own test files.
from combine_pdf.
Sure, here are some:
testpdf_1_named_destinations.pdf
testpdf_2_named_destinations.pdf
testpdf_3_named_destinations_no_outline.pdf
testpdf_4_named_destinations_no_outline.pdf
testpdf_5_big_outline.pdf
testpdf_6_bigger_outline.pdf
It doesn't work with any combination of those, actually not even using a single one without merging, would keep working named destination links.
from combine_pdf.
I found my two mistakes:
- I was a spelling mistake (I spelled
:Dests
as:Dest
singular, so it was ignored). - I left the
is_reference_only: true
key-value pair out... so the linking to the data was all broken.
I think it should be working now... could you see if you're getting better results?
I think the outlines in the test files didn't all work (the ones not under a chapter didn't seem to work even on the originals).
There isn't a need to unpack the page stream and edit it, since the named destinations don't make it into the page stream. It seems that named destinations are a type of annotation that is overlaid over the page data.
I did find a dictionary we might be missing when supporting these page url jumps, and this is the Pageable data in the catalog (section 12.4.2 to the standard).
However, I think we should finish this one (and celebrate it) before tackling a new feature extension.
from combine_pdf.
The test files
They are woking like they are supposed to, only the entries within the chapters are supposed to link to pages 1 or 2, I didn't see the need to make every entry link to any page, after all I'm not talking about named destination links in the outline (which this files don't have), I'm talking about the normal named destinations found on different pages, linking to different other pages. (I'm not sure if that was clear before, so I'm just making sure, in case it wasn't).
The problem with renaming
What you did is definitely going in the right direction.
When not merging, but only having 1 PDF run through the process of the gem, the named destinations work correctly.
When merging 2 PDFs, however, the named destinations of the first PDF don't work correctly anymore (they are still links, but when clicking them it's not jumping to any other page / nothing happens), but the named destination links found in the second PDF work like they are supposed to.
So, to sum this up, before your latest change, none of the named destinations links, found in the files I uploaded above, worked (nothing happened when clicking the link). Now, after your latest changes, the links work if it's only 1 PDF, and if it's 2 PDFs that are being merged, the links in the second PDF work, but the ones in the first PDF don't.
What I noticed:
In the first example, where I only used 1 PDF, the named destinations didn't get renamed to have something like 'CombinePDF_0000006', but obviously neither did the anchors, since these links worked.
In the second example, when merging 2 PDFs, the links of the first PDF (this links did not work!) didn't have the 'CombinePDF_0000006' string. The second PDF (this links worked!) did have a 'CombinePDF_0000006' string!
Sorry I can't tell you what the page anchors contained, since I haven't found an easy way of checking them.
I suspect that the problem with the links that don't work is that the links and anchors don't match at the moment, maybe you are updating all the anchors (to contain a 'CombinePDF_0000006' string), but the links of the first PDF get "forgotten"?
from combine_pdf.
I suspect that the problem with the links that don't work is that the links and anchors don't match at the moment, maybe you are updating all the anchors (to contain a 'CombinePDF_0000006' string), but the links of the first PDF get "forgotten"?
I'll try and work out an answer and review the code at the same time, this way if I skip something in the logic you can point it out.
The logic is this:
- The parser unifies all strings - Within a single PDF object, all strings that are equal in content are the same object. This happens in the parser:
def _parse_
#[...]
##########################################
## parse a Hex String
##########################################
elsif str = @scanner.scan(/<[0-9a-fA-F]+>/)
# warn "Found a hex string"
out << unify_string([str[1..-2]].pack('H*').force_encoding(Encoding::ASCII_8BIT))
# [...]
##########################################
## parse a Literal String
##########################################
elsif @scanner.scan(/\(/)
# [...]
out << unify_string(str.pack('C*').force_encoding(Encoding::ASCII_8BIT))
# [...]
end
# [...]
def unify_string(str)
@strings_dictionary[str] ||= str
end
- Since all strings are one string, editing a string in place (v.s replacing the string) should updated the information everywhere this sting was referenced in the PDF. i.e.
a = "test"
b = a
a << " this"
b == "test this" # => true
- When renaming we update the strings in place, so all instances are updated.
dic << (pos[i * 2].clear << base.next!)
- Since all strings are one string, editing the string in place should negate the possibility of "forgetting" to update any links.
Caveats and possible issues:
- If any strings are parsed but not unified, the whole premise is broken. I don't think is the issue.
- If, after parsing, strings are copied (
str.dup
) they will no longer be unified. This might happen when copying PDF data and it could be a good idea to review the copying (and secure copying) logic to make sure string aren't duplicated or (if they are duplicated) that they are re-unified. - Strings that are used as Hash keys are automatically duplicated and frozen by the underlying Ruby implementation, which means that if named destination names where placed as hash keys (not values), they wouldn't be updated and the information would be lost.
when merging 2 PDFs, the links of the first PDF (this links did not work!) didn't have the 'CombinePDF_0000006' string. The second PDF (this links worked!) did have a 'CombinePDF_0000006' string!
I'm very happy I asked you to test this also on your system, because I can't seem to duplicate the issue.
Here is the code I used for testing, the list is a list with all the files you sent me and their paths on my machine, I use the list to:
- For each file: load and save.
- For each file: load, merge with the same file (reloaded, not same object).
- Merge all files to review total merger.
lists = %w{./Ruby/test\ pdfs/outlines/big_toc.pdf ./Ruby/test\ pdfs/outlines/bigger_toc.pdf ./Ruby/test\ pdfs/outlines/named_dest_no_toc.pdf ./Ruby/test\ pdfs/outlines/named_dest_no_toc2.pdf ./Ruby/test\ pdfs/outlines/named_dest.pdf ./Ruby/test\ pdfs/outlines/named_dest2.pdf};
i = 0
lists.each{|n| CombinePDF.load(n).save("07_#{(i+=1).to_s}_#{n.split('/')[-1]}"); (CombinePDF.load(n) << CombinePDF.load(n)).save("07_#{(i).to_s}x2_#{n.split('/')[-1]}") }
pdf = CombinePDF.new
lists.each{|n| pdf << CombinePDF.load(n) }
pdf.save("07_named destinations.pdf")
In my test files all the links work... Can you post your test code, so I can try and replicate the issue?
from combine_pdf.
Stefan,
Happy Weekend 🎉
I hope I'll be able to replicate and find the issue and release a fixed version this weekend.
I'm thankful for all you've done.
B.
from combine_pdf.
Hi Stefan, I think @Kagetsuki exposed the source of the issue - see issue #71 .
from combine_pdf.
Nope.. probably not the issue... this shouldn't effect page links, only outlines.
from combine_pdf.
Hi Bo,
This is getting weirder with every test I do..
Honestly I have no clue what's going on here, I can't reproduce the failed examples I told you about anymore.
The one example I was able to reproduce earlier today, used a specific version of a report generated by prawn with loads of images, but after resetting my seeded data that suddenly seems to work aswell now... 😕
If it happens again I'll put together a more detailed report about how it happened and how I'm testing for it. As for the previous examples, the difference between our tests is that I used the data generated by prawn with CombinePDF.parse
while you had the actual PDFs, but since I can't even reproduce the behaviour anymore, it doesn't seem that relevant at the moment.
from combine_pdf.
@sLe1tner Did you update your gems? @boazsegev released like two patch versions in the last day 🤘🏼. If you updated your gems/bundle the patch probably got in there. You could peg the gem at .21 and then .23 and see if you get the error again - in which case your issue was also fixed in .25.
from combine_pdf.
🎉
I take this as good news - maybe the issues were related to the fixes in the latest releases (i.e. empty strings being ignored / catalog inheritance)...
...I think this means we have a (probably) stable outline merging feature. I'm very happy about this 👍
I'll close this issue, but if anything pops up or you find something, please open an issue and we'll work on it.
B.
from combine_pdf.
Ah that might have had something to do with it, i'll check it out, thanks @Kagetsuki
...I think this means we have a (probably) stable outline merging feature. I'm very happy about this 👍
Yea that sounds great :)
from combine_pdf.
I'm attempting to prepend a cover page to my pdf which has a table of contents. I would like the table of contents to remain unchanged.
Currently the TOC is unchanged but links are dropped. Is this expected to work and is there anything I can try to fix the links?
Thanks for all your work on this.
from combine_pdf.
@daymun How are the links formed? Are they to named anchors? Can you provide sources/samples?
from combine_pdf.
@Kagetsuki the TOC links are generated with wkhtmltopdf. Here are some sample files: https://drive.google.com/drive/folders/0B0vK1fMzDbarMndpdExVUFlCbE0
As you can see, links work in main.pdf
but not in combined.pdf
.
from combine_pdf.
@daymun ,
Thank you for opening this issue.
I've started looking in to this and it might have something to do with the fact that wkhtmltopdf
uses an old way for naming destinations (this format was updated in PDF v.1.2), where the Dest
dictionary is referenced directly within the Catalog
object, instead of using a Names
object.
This was a bigger problem than I thought at first glance, because wkhtmltopdf
uses the old format with page numbers instead of page references - which means that it breaks when CombinePDF tries to parse it (page numbers are meaningless before parsing is complete, and even afterwards, this becomes a performance issue).
Anyway...
I've uploaded an updated version to GitHub, to deal with wkhtmltopdf
compatibility.
This update might have a negative impact on performance in some cases... but this might be the price we pay when we need to write work arounds.
Also, I didn't finish testing this version for any negative impact on the actual PDF data... I don't promise this change will make it to the release version until I test it some more.
Could you possibly test that this fix actually helps you before I push this through?
from combine_pdf.
@boazsegev, many thanks for your quick resolution of this issue. Everything seems to be working with the fix you pushed yesterday and I haven't noticed any negative impact on PDF data.
from combine_pdf.
same issue, isn't working here.
from combine_pdf.
Could you open a new issue, explaining exactly what isn't working and adding a short example that would allow me to replicate any issue you might be experiencing?
Kindly,
Bo.
P.S.,
I'm mostly offline for the next two-three weeks. I might be slower than usual.
from combine_pdf.
Related Issues (20)
- LIbreOffice / Scribus & embedding fonts HOT 2
- Line break is not working HOT 1
- Parsing specific PDF in 1.0.21 - RangeError: index out of range (works in 1.0.20) HOT 6
- General PDF error - Combine PDF fails on specific PDF file HOT 1
- Seeking maintainers?
- How to use Arial black font in pdf?
- Fuzzer + various crashes
- SystemStackError: stack level too deep for docx file HOT 1
- Broken when parsing PDF with empty object
- Element missing error when rendering
- CombinePDF.parse is unable to process some PDF files
- Smart links are not updating or working in the combine process
- no implicit conversion of Symbol into Integer when using fonts()
- Combine many files at the same time and memory usage
- Web optimized PDFs improperly parsed
- center watermark HOT 1
- ParsingError: Optional Content PDF - with some PDF files uploaded HOT 5
- Couldn't connect reference
- paragraph text is collapsing when trying to combine file HOT 1
- combine_pdf/parser.rb:188:in `[]': no implicit conversion of Symbol into Integer (TypeError)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from combine_pdf.