
Comments (19)

FernandoJCabral avatar FernandoJCabral commented on June 15, 2024 2

Here is the Python macro that converts an ODT file to PDF and signs it using pyhanko. I've added a lot of comments in the hope that they make it easier for you to understand what each step does.

The file below is a sample pyhanko.yml. It must be renamed to pyhanko.yml and updated accordingly.

[pyhanko.txt](https://github.com/MatthiasValvekens/pyHanko/files/7262854/pyhanko.txt)

This is the macro file. The .txt extension was added because GitHub does not accept files with
the .py extension.

signFile.py.txt

from pyhanko.

ofcaah avatar ofcaah commented on June 15, 2024 1

There are some open requests about this in the wild, but as of two months ago there was no way to export empty signature fields from LibreOffice. One such request is here: https://bugs.documentfoundation.org/show_bug.cgi?id=126207 (my message from last year is the most recent one).

To my knowledge that's actually quite a rare feature in the whole PDF ecosystem, especially when only open solutions are taken into account. I'm using this to generate contracts that are pre-stamped with a certification signature to prevent unwelcome alterations, and that have predefined empty signature fields. It works as a PoC, and the only problem I have found so far is adding the new fields in a predictable and visually appealing fashion. Currently I add a plain rectangle in LO and then fill it with a signature field using pyHanko. I'm in full control of the template documents, so I can modify them in whatever way is most sensible and approachable, once you find the time to suggest one.


tuelle avatar tuelle commented on June 15, 2024 1

Hi,

I've also been working on a method to replace placeholders in PDFs generated from a docx file. The following code works fine for me, but in my use case the documents are more or less standardized. I am using pdfplumber to search the PDF for table cells or rectangles that contain a certain string, e.g. "signature field".

Perhaps it gives you some inspiration. I haven't yet checked it with PDFs generated by programs other than MS Word.

You can find the code attached.

Best regards
Thorsten

extract.zip
example.pdf


MatthiasValvekens avatar MatthiasValvekens commented on June 15, 2024

Hi there!

You're right that targeting a region on the page by trial and error is not really a good way to go about this. That being said, going about it any other way is generally pretty hard. As you may or may not know, a content stream in PDF is just a concatenation of a bunch of graphical operators that draw stuff, so even something as simple as "find the location of all rectangles on the page" is a nontrivial task.

Since PDF 1.4, it's possible to tag content & structure in a PDF file. Essentially, this works by inserting content markers into the content streams in the file, and then arranging these markers into a DOM-like tree. If the document you're operating on is decently tagged, you would then (for example) be able to do things such as finding a particular table cell, grab its bounding box, and put in a signature field that fills the table cell. I know that there's an option in the LibreOffice GUI to output tagged PDF files, so probably the same is true for headless LibreOffice as well.

Right now, only some parts of the validation code in pyHanko are tag-aware at all (and only minimally so). I have plans to expand that functionality in the future, but I'm not sure when that is going to happen. I'll definitely keep your use case in mind, though, and try to think of a sensible API to "convert" PDF structure elements to signature fields that doesn't make too many assumptions about the input.

In the meantime: I thought LibreOffice also had some basic signing functionality. Couldn't you use that to set up the fields? I haven't really used it much myself, though, so I wouldn't be able to tell you how it works.


MatthiasValvekens avatar MatthiasValvekens commented on June 15, 2024

OK, I'll keep you posted! With some luck, I might be able to get this into 0.6.0 in some form, but I can't promise anything quite yet :)


MatthiasValvekens avatar MatthiasValvekens commented on June 15, 2024

I did some exploratory testing using LibreOffice, and it seems that this is going to be even harder than I thought... I tried adding a text box (to serve as a signature placeholder) to a one-page, fairly well-structured ODF file. The text box was anchored to a paragraph in a table. I also gave the text box a visible red border, just to make it easier to find when scrolling through the graphics operators in the content stream. I then exported the file to tagged PDF.

This is what happened structurally:

  • LibreOffice ignored the fact that the text box "belonged" to the table. In the structure tree, it was represented as a floating paragraph object dangling from the top-level document object.
  • None of the structure elements have bounding boxes. Given the definition of the BBox attribute in the PDF spec, I expected this to be the case for "reflowable" structure elements such as tables and paragraphs in the documents, but I was hoping that the text box would have one...
  • The text box is not tagged in a way that makes it easy to identify as a text box, nor does the structure element dictionary have an ActualText key that we could leverage instead.

Graphically, it's even more of a mess. This is the PDF graphics code generated by LibreOffice to render the text box:

1 0 0 RG
q
1.4 w
0 J
1 j
325.55 468.139 m
214.4 468.139 l
214.4 504.339 l
436.65 504.339 l
436.65 468.139 l
325.55 468.139 l
h
S
Q
/P <</MCID 13 >> BDC
q
0 0 0 rg
BT
215.1 493.589 Td
/F3 12 Tf
<0102030405060708090a0b0c0d060e0a0f100d110a0912> Tj
ET
Q
EMC

As you can see, the marked content sequence (MCID 13) that identifies the text box doesn't even cover the border of the text box. Hence, even if you have the right structure element, figuring out where it lives on the page is hard. The graphics operators drawing the border happen to be positioned close by in this case, but that's obviously not something we can rely on in general. Also, instead of using the rectangle operator, LibreOffice draws the text box's border as a piecewise linear path (in fact, with dotted line styles, it gets even more ridiculous).
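As a rough illustration of what "parsing the graphics operators" involves, here is a minimal Python sketch that recovers the bounding box of the red border from the content stream above. It is not part of pyHanko: the function name `stroked_path_bboxes` is made up for this example, and it only understands the `RG`, `m`, `l` and `S` operators. It ignores `cm` transformations, `q`/`Q` state saving, curves and the `re` operator, all of which a general solution would have to handle.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]

# The border-drawing part of the content stream quoted above.
BORDER_STREAM = """
1 0 0 RG
q
1.4 w
0 J
1 j
325.55 468.139 m
214.4 468.139 l
214.4 504.339 l
436.65 504.339 l
436.65 468.139 l
325.55 468.139 l
h
S
Q
"""


def stroked_path_bboxes(content: str,
                        colour: Tuple[float, float, float] = (1.0, 0.0, 0.0)) -> List[BBox]:
    """Return (x1, y1, x2, y2) boxes of m/l paths stroked in the given RGB colour.

    Hypothetical helper: whitespace tokenisation only, no coordinate
    transformations, no graphics state stack.
    """
    operands: List[float] = []
    points: List[Tuple[float, float]] = []
    stroke = (0.0, 0.0, 0.0)  # PDF's default stroking colour is black
    boxes: List[BBox] = []
    for token in content.split():
        try:
            operands.append(float(token))  # numeric operand, keep collecting
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "RG" and len(operands) >= 3:
            stroke = (operands[-3], operands[-2], operands[-1])
        elif token in ("m", "l") and len(operands) >= 2:
            points.append((operands[-2], operands[-1]))
        elif token == "S":
            if points and stroke == colour:
                xs = [p[0] for p in points]
                ys = [p[1] for p in points]
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
            points = []  # stroking ends the current path
        operands = []  # every operator consumes its operands
    return boxes


print(stroked_path_bboxes(BORDER_STREAM))
```

Even this toy version only works because the border colour is set directly with `RG` and the page has no transformation matrix in effect, which is exactly the kind of assumption that does not hold across PDF producers.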

All this doesn't even begin to touch the issue of compatibility between rendering applications... So yeah, this is going to be a tricky one to get right ;)


ofcaah avatar ofcaah commented on June 15, 2024

Yup, that pretty much sums up the research I did with pdf2py some time ago. From my point of view, I'd accept conversion of pretty much any object that doesn't collide with the "normal" contract text, such as buttons or other objects. Help! :)


MatthiasValvekens avatar MatthiasValvekens commented on June 15, 2024

Here's an idle thought (haven't had coffee yet, so take this with a grain of salt). Any solution that relies on the actual content stream of the page to identify a region to be signed will be tricky to implement, since it requires a parser that understands the various geometric operators in PDF with a fairly high degree of generality.

Part of the problem here is that form fields are fundamentally different from page content in a way: they are rendered as widget annotations that live "outside" the page's main content, both in terms of file structure and how they're rendered graphically. So one possible way to get around the problem might be to rely on placeholder annotations instead. These require much less effort to manipulate into a form field widget.
Of course, it assumes that whatever you're using to produce your PDFs has the ability to output annotations, but if I recall correctly, that's the case for LibreOffice, right?

Anyway, you could then convert LibreOffice output (potentially with form fields already embedded) into a "signable" form by (for example) replacing all text annotations with a particular content string with form fields.

I'll toy around with that idea a bit when I find the time...


ofcaah avatar ofcaah commented on June 15, 2024

Thanks Matthias and Thorsten, I'll take another deep look at this tomorrow and figure something out from your suggestions. In the meantime I've moved the signature boxes to a static location relative to the bottom of the last page, which seems like a workable workaround for the time being.


MatthiasValvekens avatar MatthiasValvekens commented on June 15, 2024

It doesn't really solve the exact issue that you're having, but if you're willing to migrate your templating work to TeX, there are LaTeX packages that are capable of producing forms (including signature fields) out there: https://tex.stackexchange.com/questions/51090/how-do-i-create-a-pdf-file-that-can-be-digitally-signed.

(I still haven't found a good way to solve the general problem, though)


FernandoJCabral avatar FernandoJCabral commented on June 15, 2024

It doesn't really solve the exact issue that you're having, but if you're willing to migrate your templating work to TeX, there are LaTeX packages that are capable of producing forms (including signature fields) out there: https://tex.stackexchange.com/questions/51090/how-do-i-create-a-pdf-file-that-can-be-digitally-signed.

(I still haven't found a good way to solve the general problem, though)

Well, I have been working on a similar problem. I want to position the visible signature somewhere on the last page of an A4 document. Positioning on the last page is not a problem. Tick. Positioning horizontally is not a problem. Tick. The problem is finding the vertical displacement. It varies from document to document. A page may have a single short paragraph, or several short and long paragraphs. So the visible signature should appear perhaps 2 cm or so after the last line of the last paragraph. Finding this position automagically has not been easy.

What I have tried to do is to get the vertical displacement using a macro in Basic (but called by a function in python) that returns the visual cursor position (x, y).

Now, before calling pyhanko, I convert the LO cursor position X, Y into "--field -1/x,y,x+a,y+a/Sig". Not particularly elegant, but it works.
I am still working to perfect this solution and make it foolproof.
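The conversion described here could look roughly like the sketch below. It is not the macro from this thread. It assumes the cursor position is given in 1/100 mm measured from the top-left corner of the page it is on, that the page is A4, and that the field box should hang below the cursor. The function names and the default box size are hypothetical. Coordinates are rounded to integers on the assumption that the command line expects whole PDF points measured from the bottom-left corner.

```python
A4_HEIGHT_PT = 297 / 25.4 * 72  # A4 is 297 mm tall; 1 inch = 25.4 mm = 72 pt


def hmm_to_pt(value_hmm: float) -> float:
    """Convert 1/100 mm (assumed LibreOffice unit) to PDF points."""
    return value_hmm / 100 / 25.4 * 72


def cursor_to_field_spec(x_hmm: float, y_hmm: float,
                         width_pt: float = 150, height_pt: float = 50,
                         page: int = -1, name: str = "Sig",
                         page_height_pt: float = A4_HEIGHT_PT) -> str:
    """Build a PAGE/X1,Y1,X2,Y2/NAME string for a box hanging below the cursor.

    Hypothetical helper: assumes a page-relative cursor position with the
    origin at the top-left corner, so the y axis has to be flipped for PDF.
    """
    x1 = hmm_to_pt(x_hmm)
    y2 = page_height_pt - hmm_to_pt(y_hmm)  # top edge of the box, PDF y axis
    y1 = y2 - height_pt                     # bottom edge of the box
    x2 = x1 + width_pt
    return f"{page}/{round(x1)},{round(y1)},{round(x2)},{round(y2)}/{name}"


# Cursor 25.4 mm from the left edge and halfway down an A4 page.
print(cursor_to_field_spec(2540, 14850))
```

The resulting string is what would be passed after `--field`. If the cursor position is reported relative to the start of the whole document rather than the current page, the offset of the preceding pages has to be subtracted first.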

I would guess you could apply a similar trick but using your template as reference. Perhaps counting the number of lines in the page where you want the signature to be placed.


MatthiasValvekens avatar MatthiasValvekens commented on June 15, 2024

Thanks; interesting approach! I can imagine that idea working quite well if you have access to the LO template. That said, I don't think it works out of the box if all you have is a PDF file. The reason is that there is no (universal) concept of paragraphs or lines of text in raw PDF graphics, so you'd have to implement a line detector first. That still requires parsing PDF graphics operators. Perhaps there's a better way if the input document has particularly good tagging, but that's a very unreliable assumption.

Anyway, once you're at the point where you have to parse content streams in a PDF file, I actually think that finding rectangular shapes with a particular colour in the page content is easier than trying to count lines of text.

This is a tricky one for sure....


FernandoJCabral avatar FernandoJCabral commented on June 15, 2024

To the point: I'm currently using a headless LibreOffice to convert a template to final PDF file with some replacements along the way. Problem is, I have to manually "target" coordinates for signature fields with trial and error. Do you have an idea how would I go about automating the process in pyHanko? Number of required signatures is dynamic, so I'm thinking some kind of search & replace type solution here.

Anyway, once you're at the point where you have to parse content streams in a PDF file, I actually think that finding rectangular shapes with a particular colour in the page content is easier than trying to count lines of text.

Right. My suggestion is to resort to the LO template (or to the ODT itself) and leave the PDF alone. It is too messy to work with for this purpose. Since an ODT is in fact a ZIP archive of XML files, reading it could work too. (I have not tried this approach because I sign the document immediately after finishing it, so I have it open in front of me. I see where I need the signature to be, put the cursor there and get the coordinates. But reading the ODT file or using the template as a reference should work better.)


ofcaah avatar ofcaah commented on June 15, 2024

My suggestion is to resort to the LO template (or to the ODT itself) and leave PDF alone.

Unfortunately this won't help me much, as the end result has to be a PDF file, and working out coordinates for pyHanko from the ODT automatically is definitely beyond my expertise. That being said, I could remove the signature bounding boxes and the signer's name below each box from the template itself, and leave adding them to pyHanko. Is this something that could be easily accomplished from the command line?

i.e.
pyhanko sign addfields --field 4/41,141,293,178/FieldName=DigSig1/FieldCaption="John Smith"/FieldStyle=box
with alternative FieldStyle being for example 'line'.

This would only require making sure that there's enough room for signatures at the bottom of the desired page, and it could possibly be useful in many more cases.


FernandoJCabral avatar FernandoJCabral commented on June 15, 2024

Right. My suggestion is to resort to the LO template (or to the ODT itself) and leave the PDF alone. It is too messy to work with for this purpose. Since an ODT is in fact a ZIP archive of XML files, reading it could work too. (I have not tried this approach because I sign the document immediately after finishing it, so I have it open in front of me. I see where I need the signature to be, put the cursor there and get the coordinates. But reading the ODT file or using the template as a reference should work better.)

Well, I have finally put together a working solution for my own problem that is similar to yours. Similar, but not equal. What I do is to put the cursor where I want the signature to appear (if visible) and call my "signPdf" macro. It grabs the cursor position on the document and calculates the rectangle coordinates where to place the signature. If not visible, it just calls pyhanko with the field name, without any coordinates. The macro starts pyhanko with all command line options.

I've tested it on many different documents and in many different ways. So far, it has worked to my satisfaction.

But, it seems you are running LO in the headless mode, so, figuring out where to place the signature has to resort to a different method. This might be a mark in the document. This way your program could open the document in the background, search for the mark, save the position, remove the mark, generate the PDF, call pyhanko. pyhanko will sign the PDF and place the signed copy where you want it. You will end up with three documents: ODT/PDF/signedPDF (unless you also remove some of them after signing).
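The "search for the mark" step can be tried without starting LibreOffice at all, because an ODT file is a ZIP archive whose `content.xml` holds the text. The sketch below is only a starting point: `find_marker_paragraphs` is a hypothetical helper, and it reports which paragraphs contain the mark, not where they end up on the page, since layout is only known once LibreOffice has rendered the document.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of text:p elements in OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def find_marker_paragraphs(odt, marker: str) -> list:
    """Return the indices of paragraphs whose text contains the marker.

    `odt` may be a file path or a binary file-like object. Paragraph text is
    joined across nested spans, so a mark that LibreOffice split into several
    formatting runs is still found.
    """
    with zipfile.ZipFile(odt) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    hits = []
    for index, paragraph in enumerate(root.iter(f"{{{TEXT_NS}}}p")):
        if marker in "".join(paragraph.itertext()):
            hits.append(index)
    return hits
```

Usage would be something like `find_marker_paragraphs("contract.odt", "#SIGN#")`, where both the file name and the mark are placeholders. Turning a paragraph index into page coordinates still needs either the running LibreOffice instance or an estimate based on the template.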

The difference between this solution and my solution is that, in my case, signature is placed where the cursor is; in your case, it would be placed where a certain mark is.

My code is in Python. If you want to check it out, I can provide you with the source code. Comments, variable names and function names are in Portuguese, but the code is simple enough to be understood even without understanding the docstrings and variable names.

EDIT: Perhaps I am wrong about you using the headless mode. If you have the document before your eyes, then you can use the same macro I am using.


ofcaah avatar ofcaah commented on June 15, 2024

EDIT: Perhaps I am wrong about you using the headless mode. If you have the document before your eyes, then you can use the same macro I am using.

No, you are not wrong; it's full auto headless. I'm already using python in the pipeline, so extending it a bit shouldn't be a problem. If your script/macro isn't overly complicated, then please do share it.


FernandoJCabral avatar FernandoJCabral commented on June 15, 2024

EDIT: Perhaps I am wrong about you using the headless mode. If you have the document before your eyes, then you can use the same macro I am using.

No, you are not wrong; it's full auto headless. I'm already using python in the pipeline, so extending it a bit shouldn't be a problem. If your script/macro isn't overly complicated, then please do share it.

I will share it with you. I'll do it a little bit later because I am busy now and also because I think I should translate into English at least the docstrings and the most important variable names. This will make it easier for you to understand the code. But, rest assured it is quite simple. It has some magic numbers, but on the macro itself I'll explain how I found them and why they are there.


ofcaah avatar ofcaah commented on June 15, 2024

Thank you, at first glance it should integrate nicely with what I'm already doing for a headless conversion from template to pdf - I'll just need to figure out finding coordinates of rectangles instead of non-existent mouse cursor, but it shouldn't be that hard.


FernandoJCabral avatar FernandoJCabral commented on June 15, 2024

Thank you, at first glance it should integrate nicely with what I'm already doing for a headless conversion from template to pdf - I'll just need to figure out finding coordinates of rectangles instead of non-existent mouse cursor, but it shouldn't be that hard.

I'd guess that if you are working with the XML file it may be harder. On the other hand, if you are working with the ODT file open in the background, it should be easy to move to the last page, find the last line and place the signature a certain distance below it. If it is on a different page (not the last one), you could place a well-chosen string where you want the signature to be (say: "#PutSignatureHere#"). Then your macro could search for this string and replace it with the visible signature. In this case it will probably be much easier to use the line number to find the vertical position (line number times character height + spacing + top margin + etc.).
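The line-number estimate mentioned above amounts to simple arithmetic, sketched below. The function name and every parameter are illustrative: real documents mix font sizes, paragraph spacing and page breaks, so the figures would have to be measured on the actual template. The result is expressed in PDF points from the bottom of the page, which is the direction in which PDF coordinates grow.

```python
def signature_top_pt(line_number: int, line_height_pt: float,
                     top_margin_pt: float, gap_pt: float = 0.0,
                     page_height_pt: float = 841.89) -> float:
    """Estimate the PDF y coordinate of the top edge of a signature box.

    Hypothetical helper: assumes every line above the signature has the same
    height (character height plus line spacing) and that the default page
    height is that of A4 in points.
    """
    # Distance from the top of the page down to the signature.
    offset_from_top = top_margin_pt + line_number * line_height_pt + gap_pt
    # PDF measures y upwards from the bottom edge, so flip the axis.
    return page_height_pt - offset_from_top


# 10 lines of 14 pt each below a 72 pt margin, plus a 20 pt gap.
print(signature_top_pt(10, 14, 72, gap_pt=20, page_height_pt=842))
```

The returned value would be the upper y coordinate of the field box; subtracting the box height gives the lower one.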

