Comments (14)

JorjMcKie avatar JorjMcKie commented on June 20, 2024 1

I've tracked this down a little more specifically. Looks like the delete method deletes the widget from the annotations list of the PDF, but it remains in the fields list as an object. When pyHanko and the eSign platforms iterate, they are using the fields list and not the annotations list, so this is why it seems like the fields still remain in the PDF, but they don't appear in regular viewers.

Would it not make sense to remove all references to a widget from the PDF entirely if it's meant to be deleted? Or is that operation not possible in PDF for some reason?

I was beginning to suspect something like this. My own impression is that about half of the PDF viewers I use look only at the page's /Annots list and disregard anything else.
The others do indeed seem to still look at the central array of fields (located in the PDF catalog).

We should probably also either remove the entry there (perfectly possible) or empty the PDF object definition.
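
To make the two reference lists concrete, here is a toy model in plain Python. It is not real PDF parsing: the dictionary layout, the object number 12 and the helper names are all made up for illustration. It only assumes what the thread describes, namely that a widget is referenced both from the page's /Annots array and from the form field array in the catalog (/AcroForm /Fields in PDF terms).

```python
# Toy model, not real PDF parsing: one widget object referenced from two lists.
def make_toy_pdf():
    widget_xref = 12  # hypothetical object number
    return {
        "objects": {widget_xref: {"T": "sigPrimary1", "FT": "Sig"}},
        "page_annots": [widget_xref],      # stands in for the page's /Annots
        "acroform_fields": [widget_xref],  # stands in for /AcroForm /Fields
    }


def names_seen_via(pdf, list_key):
    """Field names a reader finds when it walks the given reference list."""
    found = []
    for xref in pdf[list_key]:
        obj = pdf["objects"][xref]
        if obj:  # an emptied object carries no field information
            found.append(obj.get("T"))
    return found


pdf = make_toy_pdf()

# What the thread reports: deletion drops the reference from /Annots only.
pdf["page_annots"].remove(12)
annots_view = names_seen_via(pdf, "page_annots")
fields_view = names_seen_via(pdf, "acroform_fields")

# Emptying the object definition (the "<<>>" idea) hides it from both views.
pdf["objects"][12] = {}
fields_view_after = names_seen_via(pdf, "acroform_fields")
```

A reader that walks only the annotation list sees nothing after the deletion, while one that walks the field list still finds the name until the object itself is emptied.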

Let me re-open this issue as an enhancement request.

from pymupdf.

JorjMcKie avatar JorjMcKie commented on June 20, 2024 1

My internet connection is terrible at the moment, so the ZIP upload was interrupted and is incomplete. You actually only need to add two statements:

  • Before deleting the field, do xref = field.xref to get the xref.
  • After deleting the field, do _fdoc.update_object(xref, "<<>>"). This empties the field's object definition. It will effectively no longer be a field, and anything reading it will gain no information from it.
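
A minimal sketch of the two statements above, wrapped in a helper. The name `delete_widget_completely` is hypothetical and not part of PyMuPDF. It relies only on `Widget.xref`, `Page.delete_widget()` and `Document.update_object()` as they are used in this thread, and it takes an already opened document, so the block itself needs no import.

```python
def delete_widget_completely(doc, page, widget):
    """Delete a widget, then blank its object so field-list readers ignore it.

    `doc`, `page` and `widget` are expected to be an open PyMuPDF document,
    one of its pages, and a widget on that page.
    """
    xref = widget.xref                        # 1) remember the xref before deleting
    next_widget = page.delete_widget(widget)  # removes it from the page
    doc.update_object(xref, "<<>>")           # 2) empty the object definition
    return next_widget                        # lets a while-loop continue
```

Inside the deletion loops shown in this thread, this would stand in for the bare delete call, for example `field = delete_widget_completely(_fdoc, page, field)`.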

ag-gaphp avatar ag-gaphp commented on June 20, 2024 1

Yes! Absolutely perfect. When I first started hunting this issue down, I thought this exact process might be needed; I just didn't do enough reading to connect all the dots, I guess.

At least in my use case, adding these two lines removes the widgets entirely, and pyHanko and the eSign platforms no longer see the erroneous fields.

JorjMcKie avatar JorjMcKie commented on June 20, 2024

This is not a bug!
As always with iteration, it is error-prone to change the collection while iterating over it.
The following modification works:

import pymupdf as fitz, os

OLD_FILE = "exported_from_libreoffice.pdf"
NEW_FILE = "deleted_signatures2.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

_fdoc = fitz.open(OLD_FILE)

# iterate the pages
for page in _fdoc:
    # iterate the fields on this page
    # BUT: use widget xrefs for iteration!!!
    xrefs = [w.xref for w in page.widgets()]
    for xref in xrefs:
        field = page.load_widget(xref)
        n = field.field_name
        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            page.delete_widget(field)

# save the document updates
_fdoc.ez_save(NEW_FILE)
_fdoc.close()

# now re-opening the document to check if the fields I removed are still there or not according to PyMuPDF
_check_doc = fitz.open(NEW_FILE)

# iterate the pages again
for page in _check_doc:
    # iterate the fields on this page
    for field in page.widgets():
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            print(f"...'{n}' is still present")

_check_doc.close()

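Another safe way to iterate:

# for each page: walk the widget chain instead of a Python iterator
field = page.first_widget
while field:
    n = field.field_name  # name of the current widget
    if n.startswith("sig") or n.startswith("init"):
        # delete_widget returns the widget following the deleted one
        field = page.delete_widget(field)
    else:
        field = field.next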
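
Why changing a collection during iteration is error-prone can be shown with a plain Python list. This is only an analogy and says nothing about how PyMuPDF's widget iterator is implemented; the names and the sample data are made up.

```python
names = ["sig0", "sig1", "text0", "init0", "init1"]


def is_signature(name):
    return name.startswith(("sig", "init"))


# Unsafe: removing from the list being iterated shifts later items left,
# so the element right after each removed one is skipped.
unsafe = list(names)
for name in unsafe:
    if is_signature(name):
        unsafe.remove(name)

# Safe: iterate over a snapshot and mutate the original, which is the same
# idea as collecting the widget xrefs first and deleting afterwards.
safe = list(names)
for name in list(safe):
    if is_signature(name):
        safe.remove(name)
```

The unsafe loop leaves some matching entries behind, while the snapshot loop removes all of them.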

ag-gaphp avatar ag-gaphp commented on June 20, 2024

Unfortunately, both of the presented solutions give me the same end result as my original bug post. Even when I use the xrefs to iterate as you suggested in the modification, the fields that I mark for deletion are still present in some way in the file. The eSign platforms and pyHanko are still able to find them and try to utilize them as form fields, even when Acrobat doesn't display them.

Is there something extra I can do while saving the file to make sure that any lingering references are removed from the xref table?

JorjMcKie avatar JorjMcKie commented on June 20, 2024

But the file check iteration shows that they are gone!
A signature is often associated with an image. Are these images what is bothering you?

JorjMcKie avatar JorjMcKie commented on June 20, 2024

I see that the result PDF has some yellow and red rectangle graphics where there were fields before. These have nothing to do with widgets / fields - they are just vector graphics.
If you don't want them either, you must remove them separately.

ag-gaphp avatar ag-gaphp commented on June 20, 2024

The graphic boxes are meant to stay, it's only the widgets/fields that I am trying to remove.

I understand that PyMuPDF says they are gone. I'm just confused: if they are gone, how are the other programs able to see the removed widgets and their names and dimensions? In pyHanko, I can still retrieve that info even if PyMuPDF says they are not there, and the eSign platforms do the same. I must be misunderstanding something, or there is something off about the way LibreOffice is setting up the widgets in the first place. I'm not sure how to troubleshoot that at the moment.

JorjMcKie avatar JorjMcKie commented on June 20, 2024

Attached is my output:
deleted_signatures2.pdf
On which page do you still see deleted widgets?

ag-gaphp avatar ag-gaphp commented on June 20, 2024

First, my apologies for not doing this initially. There might be a slight language barrier and I didn't give the most complete example that I could have given.

I just tried your PDF and I get the same results. When I upload to an eSign platform, the removed fields are detected during the import. When I attempt to add signatures in those same locations with pyHanko, it also complains that the fields already exist in the PDF.

Here is my complete testing script for this, which prints out some data along the way. It is a watered-down version of what my tool does: first it stores the coordinates of the signature/initial fields, then deletes them, then uses pyHanko to add proper signature fields.

requirements.txt

pymupdf
pyhanko

test.py

import fitz, os
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field

OLD_FILE="exported_from_libreoffice.pdf"
NEW_FILE="deleted_signatures.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

boxes = {}

doc = fitz.open(OLD_FILE)

# iterate the pages
print("Removing fields with PyMuPDF")
for page in doc:
    # store the page's height for placement
    _page_rect = page.bound()
    _page_height = _page_rect.y1

    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name

        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            # PyMuPDF y coords go top-to-bottom, but pyHanko goes bottom-to-top
            # Subtract the y coords from the current page height for pyHanko
            boxes[n] = {
                "page": page.number,
                "box": (
                    field.rect.x0,
                    _page_height-field.rect.y0,
                    field.rect.x1,
                    _page_height-field.rect.y1
                )
            }
            print("Removing field: ", n)
            field = page.delete_widget(field)

        else:
            field = field.next

# save the document updates
doc.ez_save(NEW_FILE)
doc.close()

# now re-opening the document to check if the fields I removed are still there or not
check_doc = fitz.open(NEW_FILE)

# iterate the pages again
print("Checking PDF for removed fields with PyMuPDF")
found = 0
for page in check_doc:
    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            found += 1
            print(f"...'{n}' is still present")
        field = field.next

if found == 0:
    print("PyMuPDF did not find any fields")

check_doc.close()

# now let's try to use pyHanko to add new signatures
# if we find that a field already exists, print the error
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, 'rb+') as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                                        sig_field_name=name,
                                        on_page=_dict["page"],
                                        box=_dict["box"]
                                    ))

        except Exception as e:
            found += 1
            print("ERROR: ", e)

    writer.write_in_place()

if found > 0:
    print(f"pyHanko found {found} fields")
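
A side note on the coordinate flip in the script above, as a small sketch. The helper name `to_bottom_left_box` is hypothetical. It assumes an unrotated page whose origin is at (0, 0), and it assumes the consumer wants the box ordered as lower-left corner followed by upper-right corner. The script above emits the upper y value first; whether that ordering matters to pyHanko is not established in this thread.

```python
def to_bottom_left_box(rect, page_height):
    """Convert (x0, y0, x1, y1) from a top-left origin to a bottom-left origin.

    The result is ordered lower-left corner first, then upper-right corner.
    """
    x0, y0, x1, y1 = rect
    return (x0, page_height - y1, x1, page_height - y0)


# Example with made-up numbers: a field near the bottom of a 792 pt high page.
box = to_bottom_left_box((100.0, 700.0, 250.0, 740.0), 792.0)
```

None of this affects the leftover-field problem discussed here; it only concerns where the new signature boxes would be placed.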

Even when I run only the last for loop on the PDF you uploaded, I get the same results where pyHanko can still see the fields.

Example of the full output that I see when I run this against my originally uploaded exported_from_libreoffice.pdf:

Removing fields with PyMuPDF
Removing field:  init0
Removing field:  init1
Removing field:  init2
Removing field:  init3
Removing field:  init4
Removing field:  init5
Removing field:  init6
Removing field:  init7
Removing field:  init8
Removing field:  init9
Removing field:  init10
Removing field:  init11
Removing field:  init12
Removing field:  init13
Removing field:  init14
Removing field:  init15
Removing field:  init16
Removing field:  init17
Removing field:  init18
Removing field:  init19
Removing field:  init20
Removing field:  init21
Removing field:  init22
Removing field:  init23
Removing field:  init24
Removing field:  init25
Removing field:  init26
Removing field:  init27
Removing field:  init28
Removing field:  sigPrimary1
Removing field:  init29
Removing field:  sigSecondary1
Checking PDF for removed fields with PyMuPDF
PyMuPDF did not find any fields
Adding signatures to new PDF with pyHanko
ERROR:  Field with name init0 exists but is not a signature field
ERROR:  Field with name init1 exists but is not a signature field
ERROR:  Field with name init2 exists but is not a signature field
ERROR:  Field with name init3 exists but is not a signature field
ERROR:  Field with name init4 exists but is not a signature field
ERROR:  Field with name init5 exists but is not a signature field
ERROR:  Field with name init6 exists but is not a signature field
ERROR:  Field with name init7 exists but is not a signature field
ERROR:  Field with name init8 exists but is not a signature field
ERROR:  Field with name init9 exists but is not a signature field
ERROR:  Field with name init10 exists but is not a signature field
ERROR:  Field with name init11 exists but is not a signature field
ERROR:  Field with name init12 exists but is not a signature field
ERROR:  Field with name init13 exists but is not a signature field
ERROR:  Field with name init14 exists but is not a signature field
ERROR:  Field with name init15 exists but is not a signature field
ERROR:  Field with name init16 exists but is not a signature field
ERROR:  Field with name init17 exists but is not a signature field
ERROR:  Field with name init18 exists but is not a signature field
ERROR:  Field with name init19 exists but is not a signature field
ERROR:  Field with name init20 exists but is not a signature field
ERROR:  Field with name init21 exists but is not a signature field
ERROR:  Field with name init22 exists but is not a signature field
ERROR:  Field with name init23 exists but is not a signature field
ERROR:  Field with name init24 exists but is not a signature field
ERROR:  Field with name init25 exists but is not a signature field
ERROR:  Field with name init26 exists but is not a signature field
ERROR:  Field with name init27 exists but is not a signature field
ERROR:  Field with name init28 exists but is not a signature field
ERROR:  Field with name sigPrimary1 exists but is not a signature field
ERROR:  Field with name init29 exists but is not a signature field
ERROR:  Field with name sigSecondary1 exists but is not a signature field
pyHanko found 32 fields

I mentioned this before, but it's possible that something in LibreOffice is part of my issue. I'm trying to do some more testing on that today when I can to see if I can figure anything out.

My bad if I'm annoying you, but I'm just baffled at how the other apps are still able to detect the fields if they have been removed from the PDF. I will admit it could be entirely a misunderstanding on my part about how things are done in PDFs, but from a basic logic standpoint it seems like some sort of reference to the deleted fields remains in the PDF.

ag-gaphp avatar ag-gaphp commented on June 20, 2024

I've tracked this down a little more specifically. Looks like the delete method deletes the widget from the annotations list of the PDF, but it remains in the fields list as an object. When pyHanko and the eSign platforms iterate, they are using the fields list and not the annotations list, so this is why it seems like the fields still remain in the PDF, but they don't appear in regular viewers.

Would it not make sense to remove all references to a widget from the PDF entirely if it's meant to be deleted? Or is that operation not possible in PDF for some reason?

JorjMcKie avatar JorjMcKie commented on June 20, 2024

Here is an upfront solution, I hope.
The trick is to empty the object definition of the field.
None of the viewers I tried still sees a field treated like this.
Uploading test2.zip…

ag-gaphp avatar ag-gaphp commented on June 20, 2024

There might have been a GitHub malfunction when you posted the comment; it looks like it links back to this issue instead of to a zip file. I'll definitely check it out when I can download it.

Is it a dev version of the module, or code examples on how to empty an object definition using existing methods?

And thanks for bearing with me on this. I know my terminology isn't completely accurate, I'm still learning PDF/python and clearly have a ways to go.

JorjMcKie avatar JorjMcKie commented on June 20, 2024

Thanks for the feedback!
I will add these instructions to the standard behavior ...
