jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.

Home Page: https://borbpdf.com/

License: Other

Python 99.37% HTML 0.58% CSS 0.06%

pdf pdf-generation pdf-converter pdf-conversion pdf-library python python3 typesetting library sdk

borb's Issues

Increase the usage of augmented assignment statements

👀 Some source code analysis tools can help to find opportunities for improving software components.
💭 I propose to increase the usage of augmented assignment statements accordingly.

Would you like to integrate anything from a transformation result which can be generated by a command like the following?
(👉 Please also check for questionable change suggestions, since the search pattern is still evolving.)

[Markus_Elfring@fedora lokal]$ perl -p -i.orig -0777 -e 's/^(?<indentation>\s+)(?<target>\S+)\s*=\s*\k<target>[ \t]*(?<operator>[+\-%&|^@]|\*\*?|\/\/?|<<|>>)/$+{indentation}$+{target} $+{operator}=/gm' $(find ~/Projekte/borb/lokal -name '*.py')
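
To try the idea from Python, the search pattern can be approximated with the standard `re` module. This is a rough sketch, not the proposer's tool: the function name `to_augmented` and the sample source are made up for illustration, and whitespace matching is restricted to spaces and tabs so that a match cannot span lines.

```python
import re

# Approximation of the Perl pattern above, written with Python named groups.
PATTERN = re.compile(
    r"^(?P<indentation>[ \t]+)(?P<target>\S+)[ \t]*=[ \t]*(?P=target)[ \t]*"
    r"(?P<operator>[+\-%&|^@]|\*\*?|//?|<<|>>)",
    re.MULTILINE,
)


def to_augmented(source: str) -> str:
    """Rewrite 'x = x <op> ...' as 'x <op>= ...' on indented lines."""
    return PATTERN.sub(r"\g<indentation>\g<target> \g<operator>=", source)


# Hypothetical sample input, only for demonstration.
sample = "def f(x):\n    x = x + 1\n    y = x * 2\n    return x\n"
converted = to_augmented(sample)
print(converted)

# A questionable suggestion: operator precedence changes the meaning here,
# because 'x * y + 1' is (x * y) + 1 while 'x *= y + 1' is x * (y + 1).
questionable = to_augmented("    x = x * y + 1\n")
print(questionable)
```

The second call shows why every suggested change still needs a manual review before it is applied.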

Support for nested tables?

Hello, I am trying to create a PDF file where I need some nested tables.

As I read through the documentation, a table is a container for other layout elements. My question is: can I use a table inside a table cell?

The following code:

def _build_header():
    company_header = Table(number_of_rows=5, number_of_columns=1)
    company_header.add(Paragraph("Tecla Exim"))
    return company_header


def _build_invoice_information():
    table_001 = Table(number_of_rows=1, number_of_columns=3)
    table_001.add(TableCell(_build_header()))
    table_001.add(Paragraph("Date", font="Helvetica-Bold", horizontal_alignment=Alignment.RIGHT))
    return table_001

I get an error: TableCell should not contain Table LayoutElement(s).

Is there any workaround for this? BTW, I am trying to replicate the following (ugly) invoice.

Justified alignment not working?

I just had a look at an example test output, the alignment of the paragraph didn't appear to be justified:

Thought I'd mention it in case you weren't aware.

Extra dir installing

/usr/lib/python*/site-packages/tests/ is redundant in a normal system-wide installation.

Support for text/image links?

Is there any way to add text / image links into a PDF?

Ideally I'd like to be able to do something like the following:

table.add(Paragraph("Click this to go to Github", link="https://github.com"))
OR
table.add(Image("http://image.com", link="https://github.com"))

Please let me know if there is some method of performing this functionality that I am missing.

Can it export to Markdown?

I was wondering how the extracted content is exported. Can it export to Markdown?

Long table - how to write on multiple pages

Hi,

I'm having difficulty creating a PDF with a long table.
I get an assertion error: assert height >= 0

I couldn't find this information in the documentation.
What is the best way to manage multiple pages?

Thanks for the help.

space issue

Hi Joris

I just wanted to give borb a try. I installed v2.0.12.

When I read a PDF, I find that some space characters are ignored. This makes parsing, regex matching, and language detection a little bit tricky. Any idea how to overcome this? The PDF I am trying to parse is machine generated.

KR & thank you for your work!

No support for Chinese

I found that borb doesn't support Chinese characters when adding Chinese text to a PDF page. Any idea how to solve this?

AttributeError: 'SimpleImageExtraction' object has no attribute 'get_images_per_page'

Greetings,

I'm trying to test Borb, but I am getting an error message that 'SimpleImageExtraction' object has no attribute 'get_images_per_page'.

The code is pretty straightforward:

# Import borb
import borb

from borb.toolkit.image.simple_image_extraction import SimpleImageExtraction
from borb.pdf.pdf import PDF

from pathlib import Path

# File-wide Variables
source_file_path = Path('python-basics-sample-chapters.pdf')

# Read 'source_file_path' and its info (see 'doc_info')
with open(source_file_path, "rb") as pdf_file_handle:
    l = SimpleImageExtraction()
    doc = PDF.loads(pdf_file_handle, [l])
    doc_info = doc.get_document_info()
    number_of_pages = doc_info.get_number_of_pages()

# Destination folder path
img_path = f'{source_file_path.parent}/{source_file_path.stem}'

# Create a path for 'img_path' if it doesn't already exist
Path(img_path).mkdir(parents=True, exist_ok=True)


print(f'number of pages: {int(number_of_pages)}')

# Serial
serial_results = {}
for page_number in range(int(number_of_pages)):
    serial_results[page_number] = l.get_images_per_page(page_number)
print(serial_results)

I usually prototype in Colab, and I usually work with the latest release. This issue appeared after version 2.0.7.

Link to Colab with Borb 2.0.6: Borb works, but it misses a couple of images in pages
Link to Colab with Borb 2.0.6: Borb throws the error AttributeError: 'SimpleImageExtraction' object has no attribute 'get_images_per_page', and now pip also shows compatibility conflicts between Colab and Borb.

[Suggestion] Use the package isort to sort imports

I suggest using the isort package to make the imports consistent across all files.
Here are some steps to get started:

Install isort : pip install isort
See isort errors without applying fix : isort --check --diff .
Run isort, applying fixes : isort .

Add requirements to setup.py

It would be nice if the contents of requirements.txt were added to setup.py so that pip would pick them up automatically when doing a pip install. A pip install without requirements is only half complete :D

I often use something like this to avoid having requirements duplicated in two places: https://stackoverflow.com/a/14399775

Otherwise, seems like a fun project so far!
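
One way to avoid the duplication is to have setup.py read requirements.txt itself. The following is a minimal sketch of that idea, not the project's actual setup.py: the helper names `parse_requirements` and `read_requirements` are hypothetical, and the requirement lines in the example are illustrative only.

```python
from pathlib import Path
from typing import List


def parse_requirements(text: str) -> List[str]:
    """Return requirement specifiers, skipping blanks, comments and pip options."""
    requirements = []
    for line in text.splitlines():
        line = line.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # blank line, full-line comment, or an option such as '-r other.txt'
        requirements.append(line)
    return requirements


def read_requirements(path: str = "requirements.txt") -> List[str]:
    """Read and parse a requirements file located next to setup.py."""
    return parse_requirements(Path(path).read_text(encoding="utf-8"))


# Illustrative input only; these are not borb's real pinned versions.
example = "# deps\nPillow>=7.1.0\n\nfonttools  # font parsing\n-r other.txt\n"
print(parse_requirements(example))
```

In setup.py the result would be passed to setuptools as `install_requires=read_requirements()`, so the list lives in exactly one file.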

How to create a footer

I can't seem to find it in the documentation.
But I would like to have a footer on every PDF Page.

How can I add this?

Consider simplifying import path structure

I'm really excited by the library, the API looks lovely! 😄 Today I'm giving it a test drive.

One thing I noticed was that to perform basic actions a lot of deep paths are being called. From the basic example:

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

Python allows for easy control of path structures. With that in mind, I propose the following API:

from borb.pdf import SingleColumnLayout
from borb.pdf import Paragraph
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import PDF

Sources too thick

60MB of sources is a very, very big tarball.
Including 20MB of well-known specifications and 15MB of nice pictures in the sources is not a good idea either.

Convert and replace JPEG2000 images in a PDF with JPEG?

Hi
I just discovered your library for interacting with PDFs and it looks very powerful. But I have a use case that I could not figure out from the examples. A PDF I have contains JPEG2000 images. To convert the PDF file to web pages using pdf2htmlEX, I need the images to be plain JPEG, since pdf2htmlEX does not work with JPEG2000 images. So is there a way to extract a JPEG2000 image from a PDF file, convert it to JPEG, and put the converted image back in the PDF file?
Thanks!

Post completed forms to a URL

Ability to have a Submit button to POST the completed form over HTTPS to a URL.

I realize this probably opens up an entire set of Javascript functionality that is not yet being implemented, as well as requests for other buttons such as form Reset.

Hoping this can get on an eventual project roadmap.

In which units are the values of `get_bounding_boxes` function

In example 1.3, which gets the coordinates of a regular expression match, what are the units of those coordinates?

Ex:

import json
import re

from ptext.pdf.pdf import PDF
from ptext.toolkit.text.regular_expression_text_extraction import (
    RegularExpressionTextExtraction,
)


# regex = r"\${sign-[^}]+}"
regex = r"signhere"
doc = None
l = RegularExpressionTextExtraction(regex)
with open("./MSA-contract-template.pdf", "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [l])

    # export matches
    with open("sign_matches.json", "w") as json_file_handle:
        obj = []
        for m in l.get_all_matches(0):
            for bb in m.get_bounding_boxes():
                obj.append(
                    {
                        "text": m.string,
                        "x": int(bb.x),
                        "y": int(bb.y),
                        "width": int(bb.width),
                        "height": int(bb.height),
                    }
                )
        json_file_handle.write(json.dumps(obj, indent=4))

What are the units of bb.x and bb.y: pixels, points, feet?
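
For background: in the PDF specification, the default user space unit is 1/72 of an inch, commonly called a point. Assuming the bounding boxes are reported in that default user space (an assumption, since the thread does not confirm it), a conversion to physical units could be sketched as follows. The helper names are hypothetical and not part of the library.

```python
from decimal import Decimal

POINTS_PER_INCH = Decimal(72)   # default PDF user space: 72 units per inch
MM_PER_INCH = Decimal("25.4")


def points_to_inches(value) -> Decimal:
    """Convert a length in default user space units (points) to inches."""
    return Decimal(str(value)) / POINTS_PER_INCH


def points_to_mm(value) -> Decimal:
    """Convert a length in default user space units (points) to millimetres."""
    return points_to_inches(value) * MM_PER_INCH


print(points_to_inches(72))  # one inch
print(points_to_mm(36))      # half an inch, in millimetres
```

If a page sets a different unit or transformation, these factors would no longer apply.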

font spacing config

A custom font loaded as in this example shows a font spacing issue. Is there a way to fix that?

Here is an example:

#%%
from pathlib import Path

from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
from borb.pdf.canvas.font.font import Font


def main():
    doc: Document = Document()

    page: Page = Page()
    doc.append_page(page)

    layout: PageLayout = SingleColumnLayout(page)

    # construct the Font object
    font_path: Path = Path(__file__).parent / "Monaco Regular.ttf"
    font: Font = TrueTypeFont.true_type_font_from_file(font_path)

    layout.add(Paragraph("Hello World!", font=font))
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)

Thanks

[Feature Request] Add margins to layout initialization parameters

Because previous_y is calculated in __init__ and later updates to vertical_margin do not seem to trigger a recalculation, this can lead to unexpected behavior. It would be nice to be able to set the margins and column_width as part of the call to SingleColumnLayout.

For example, if I want to have a full page image, this is the current solution I came up with:

width, height = Decimal(2550), Decimal(3509)

# create an empty Document
pdf = Document()

# add an empty Page
page = Page(width=width, height=height)
pdf.append_page(page)

# create full page layout
layout = SingleColumnLayout(page)
layout.previous_y += layout.vertical_margin
layout.horizontal_margin = 0
layout.vertical_margin = 0
layout.inter_column_margin = 0
layout.column_width = width

layout.add(pImage(Image.open("image.tif"), width=width, height=height))

I could just be approaching this use case in entirely the wrong way, but it does seem hard to manipulate margins in any way currently.

Thanks!

Why is `borb` so (incredibly) slow?

Hi!

I am working on a PDF text mining project, for which I decided to benchmark & compare various Python PDF libraries for reading PDF files. For a random sample of 10 PDF files (270 pages, 17.5 MiB in total), I get the following results:

Summary statistics for the sample of 10 PDFs
File 1/10        30 pages       2.53 MiB
File 2/10        36 pages       2.55 MiB
File 3/10        19 pages       0.85 MiB
File 4/10        30 pages       1.89 MiB
File 5/10        20 pages       1.15 MiB
File 6/10        29 pages       1.89 MiB
File 7/10        32 pages       2.14 MiB
File 8/10        19 pages       0.85 MiB
File 9/10        19 pages       0.95 MiB
File 10/10       36 pages       2.75 MiB
Total size of all 10 PDF-files: 17.54 MiB

---------- Benchmarking pdfrw ----------
Reading PDF-file 1/10 took 0.006 seconds
Reading PDF-file 2/10 took 0.006 seconds
Reading PDF-file 3/10 took 0.004 seconds
Reading PDF-file 4/10 took 0.005 seconds
Reading PDF-file 5/10 took 0.004 seconds
Reading PDF-file 6/10 took 0.005 seconds
Reading PDF-file 7/10 took 0.006 seconds
Reading PDF-file 8/10 took 0.004 seconds
Reading PDF-file 9/10 took 0.003 seconds
Reading PDF-file 10/10 took 0.006 seconds

Reading all 10 PDF-files w/ `pdfrw` took 0.051 seconds

---------- Benchmarking PyPDF2 ----------
Reading PDF-file 1/10 took 0.005 seconds
Reading PDF-file 2/10 took 0.007 seconds
Reading PDF-file 3/10 took 0.003 seconds
Reading PDF-file 4/10 took 0.005 seconds
Reading PDF-file 5/10 took 0.004 seconds
Reading PDF-file 6/10 took 0.005 seconds
Reading PDF-file 7/10 took 0.006 seconds
Reading PDF-file 8/10 took 0.004 seconds
Reading PDF-file 9/10 took 0.003 seconds
Reading PDF-file 10/10 took 0.007 seconds

Reading all 10 PDF-files w/ `PyPDF2` took 0.050 seconds

--------- Benchmarking PyMuPDF ---------
Reading PDF-file 1/10 took 0.002 seconds
Reading PDF-file 2/10 took 0.001 seconds
Reading PDF-file 3/10 took 0.000 seconds
Reading PDF-file 4/10 took 0.002 seconds
Reading PDF-file 5/10 took 0.001 seconds
Reading PDF-file 6/10 took 0.001 seconds
Reading PDF-file 7/10 took 0.001 seconds
Reading PDF-file 8/10 took 0.001 seconds
Reading PDF-file 9/10 took 0.001 seconds
Reading PDF-file 10/10 took 0.001 seconds

Reading all 10 PDF-files w/ `PyMuPDF` took 0.011 seconds

------- Benchmarking pdfminer.six -------
Reading PDF-file 1/10 took 0.139 seconds
Reading PDF-file 2/10 took 0.151 seconds
Reading PDF-file 3/10 took 0.070 seconds
Reading PDF-file 4/10 took 0.127 seconds
Reading PDF-file 5/10 took 0.081 seconds
Reading PDF-file 6/10 took 0.123 seconds
Reading PDF-file 7/10 took 0.137 seconds
Reading PDF-file 8/10 took 0.069 seconds
Reading PDF-file 9/10 took 0.070 seconds
Reading PDF-file 10/10 took 0.152 seconds

Reading all 10 PDF-files w/ `pdfminer.six` took 1.118 seconds

-------- Benchmarking pdfplumber --------
Reading PDF-file 1/10 took 0.152 seconds
Reading PDF-file 2/10 took 0.169 seconds
Reading PDF-file 3/10 took 0.078 seconds
Reading PDF-file 4/10 took 0.144 seconds
Reading PDF-file 5/10 took 0.088 seconds
Reading PDF-file 6/10 took 0.138 seconds
Reading PDF-file 7/10 took 0.148 seconds
Reading PDF-file 8/10 took 0.078 seconds
Reading PDF-file 9/10 took 0.081 seconds
Reading PDF-file 10/10 took 0.170 seconds

Reading all 10 PDF-files w/ `pdfplumber` took 1.247 seconds

--------- Benchmarking pikepdf ---------
Reading PDF-file 1/10 took 0.022 seconds
Reading PDF-file 2/10 took 0.025 seconds
Reading PDF-file 3/10 took 0.014 seconds
Reading PDF-file 4/10 took 0.023 seconds
Reading PDF-file 5/10 took 0.026 seconds
Reading PDF-file 6/10 took 0.021 seconds
Reading PDF-file 7/10 took 0.023 seconds
Reading PDF-file 8/10 took 0.015 seconds
Reading PDF-file 9/10 took 0.013 seconds
Reading PDF-file 10/10 took 0.025 seconds

Reading all 10 PDF-files w/ `pikepdf` took 0.207 seconds

----------- Benchmarking tika -----------
Reading PDF-file 1/10 took 1.263 seconds
Reading PDF-file 2/10 took 1.467 seconds
Reading PDF-file 3/10 took 1.286 seconds
Reading PDF-file 4/10 took 1.242 seconds
Reading PDF-file 5/10 took 1.887 seconds
Reading PDF-file 6/10 took 1.117 seconds
Reading PDF-file 7/10 took 1.274 seconds
Reading PDF-file 8/10 took 1.289 seconds
Reading PDF-file 9/10 took 1.418 seconds
Reading PDF-file 10/10 took 1.402 seconds

Reading all 10 PDF-files w/ `tika` took 13.645 seconds

----------- Benchmarking borb -----------
Reading PDF-file 1/10 took 273.480 seconds
Reading PDF-file 2/10 took 322.414 seconds
Reading PDF-file 3/10 took 298.064 seconds
Reading PDF-file 4/10 took 275.535 seconds
Reading PDF-file 5/10 took 411.225 seconds
Reading PDF-file 6/10 took 246.551 seconds
Reading PDF-file 7/10 took 269.851 seconds
Reading PDF-file 8/10 took 292.939 seconds
Reading PDF-file 9/10 took 318.867 seconds
Reading PDF-file 10/10 took 318.921 seconds

Reading all 10 PDF-files w/ `borb` took 3027.847 seconds

Even compared to tika, which makes calls to a RESTful API, borb is 200+ times slower. Compared to the fastest "Pure Python" library in this little benchmarking test (PyPDF2), borb is 60k+ times slower.

I really like borb's API: I find it to be very intuitive and Pythonic. As such, I would love to use it in this project and similar. So I guess my question is: what gives?
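
The slowdown factors can be checked directly from the totals reported in the benchmark output. This small sketch only redoes the arithmetic on those figures; the variable names are made up here.

```python
# Total seconds for all 10 files, copied from the benchmark output above.
totals_seconds = {
    "pdfrw": 0.051,
    "PyPDF2": 0.050,
    "PyMuPDF": 0.011,
    "pdfminer.six": 1.118,
    "pdfplumber": 1.247,
    "pikepdf": 0.207,
    "tika": 13.645,
    "borb": 3027.847,
}

borb_total = totals_seconds["borb"]

# How many times slower borb was than each other library.
slowdown = {
    name: borb_total / seconds
    for name, seconds in totals_seconds.items()
    if name != "borb"
}

for name, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name:>12}: borb is about {factor:,.0f}x slower")
```

On these numbers the factor is roughly 222 for tika and roughly 60,000 for PyPDF2.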

How to set the subscript of the inserted image

Hi, thank you for answering my question. I have two more: how do I set the subscript of an inserted image, and how do I put two images side by side?

Add installation instructions to documentation

To improve the documentation, consider adding installation instructions. I found this package on PyPI, but that does not seem to be stated anywhere on the repo front page or in the Hello World example.

Can't import 'image'

Hi,

The line:
from borb.pdf.canvas.layout.image import Image

gives an error:
ImportError: cannot import name 'Image' from 'borb.pdf.canvas.layout.image' (/home/user/.local/lib/python3.7/site-packages/borb/pdf/canvas/layout/image/__init__.py)

Am I doing something wrong?

Can `borb` create bookmarks?

Sorry, I don't see any similar examples.

Like the picture above.

Not all required packages installed on Windows from pypi

I installed this from PyPI and it seems some packages are missing by default. I needed to install windows-curses and fonttools to get it going.

ModuleNotFoundError: No module named '_curses'
...
ModuleNotFoundError: No module named 'fontTools'
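
A quick way to see which of these modules are absent before importing the library is to ask the import system directly. This is a generic standard-library sketch; the helper name `missing_modules` is hypothetical, and the two module names are simply the ones from the errors above.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the ModuleNotFoundError messages above.
print(missing_modules(["_curses", "fontTools"]))
```

An empty list means both modules are importable in the current environment.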

error importing library

Solved. Please remove this thread. :-)

Ambiguous License

README says borb is dual licensed
https://github.com/jorisschellekens/borb/blob/master/README.md#2-license

borb is dual licensed as AGPL/Commercial software.

But LICENSE says it's GPL.
https://github.com/jorisschellekens/borb/blob/master/LICENSE

The GNU General Public License is a free, copyleft license for
software and other kinds of works.

PDF.dumps saves PDF with pages out of order

Using a 20-page PDF, loading and saving the file results in pages out of order:

from ptext.pdf.pdf import PDF

# pdf_file is the path to the 20-page input PDF
doc = None
with open(pdf_file, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle)

with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, doc)

Is there any way to control this?

What coding style do you use?

Hi,
I would like to know which coding style you stick to in this project, and which linting tool you use to autoformat your code:

Black
Autopep8
flake8

It could be interesting to automate the autoformatting using a tool like tox. Basically, tox is a tool that helps with CI/CD tasks, so you could automate the formatting during the CI/CD stage.

Adding Grayscale images creates broken PDFs

from PIL import Image
from decimal import Decimal

from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.image.image import Image as pImage

img = Image.open("test.jpg")
width, height = Decimal(img.width), Decimal(img.height)

pdf = Document()

page = Page(width=width, height=height)
pdf.append_page(page)

layout = SingleColumnLayout(page)

# Conversion to grayscale here as an example
# can also use 8-bit grayscale TIFF file or similar
layout.add(pImage(img.convert("L"), width=width, height=height))

# store the PDF
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)

Sometimes it will work properly without an error but opening the PDF gives the error above.

Other times it will hit a recursion error.

Traceback (most recent call last):
  File "C:/Users/Chris/PycharmProjects/scanfix/create_pdf_bad.py", line 22, in <module>
    layout.add(pImage(img.convert("L"), width=width, height=height))
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\page_layout.py", line 144, in add
    return self.add(layout_element)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\page_layout.py", line 144, in add
    return self.add(layout_element)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\page_layout.py", line 144, in add
    return self.add(layout_element)
  [Previous line repeated 984 more times]
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\page_layout.py", line 135, in add
    layout_rect = layout_element.layout(self.page, bounding_box=next_available_rect)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\layout_element.py", line 219, in layout
    return self.calculate_layout_box_and_do_layout(page, bounding_box)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\layout_element.py", line 271, in calculate_layout_box_and_do_layout
    final_layout_box = self._do_layout(page, bounding_box)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\layout_element.py", line 197, in _do_layout
    output_box = self._do_layout_without_padding(page, modified_bounding_box)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\image\image.py", line 106, in _do_layout_without_padding
    image_resource_name = self._get_image_resource_name(self.image, page)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\pdf\canvas\layout\image\image.py", line 52, in _get_image_resource_name
    page[Name("Resources")] = Dictionary().set_parent(page)  # type: ignore [attr-defined]
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\io\read\types.py", line 330, in __init__
    add_base_methods(self)
  File "C:\Users\Chris\PycharmProjects\scanfix\venv\lib\site-packages\ptext\io\read\types.py", line 138, in add_base_methods
    def get_event_listeners(self) -> typing.List["EventListener"]:
  File "C:\Program Files\Python38\lib\typing.py", line 258, in inner
    return cached(*args, **kwds)
  File "C:\Program Files\Python38\lib\typing.py", line 723, in __hash__
    return hash((self.__origin__, self.__args__))
RecursionError: maximum recursion depth exceeded while calling a Python object

Temporary solution is to just do a .convert("RGB") on all the images, but that increases the size and does a needless conversion.

Slow font loading

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
from borb.pdf.pdf import PDF
from pathlib import Path


def main():
    doc = Document()
    page = Page()
    doc.append_page(page)
    layout = SingleColumnLayout(page)
    font = TrueTypeFont.true_type_font_from_file(Path('/Users/aka/Software/fonts/SimHei.ttf'))
    layout.add(Paragraph('你好世界', font=font))

    with open('a.pdf', 'wb') as f:
        PDF.dumps(f, doc)


if __name__ == '__main__':
    main()

It ran for 40 seconds, and the words are crowded together. The output file (a.pdf) is 5.3MB.
For SimHei.ttf I used this font.

AttributeError: 'Stream' object has no attribute 'add_event_listener'

Hi 👋 . I have a problematic PDF file that triggers the error in the title of this issue. Here's a minimal reproducible example:

#!/usr/bin/env python3
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction
l = SimpleTextExtraction()
f = open('tmp.pdf', 'rb')
doc = PDF.loads(f, [l])

The file that triggers it: tmp.pdf

The stack trace:

Traceback (most recent call last):
  File "/Users/adaszko/repos/prawo/./bug.py", line 6, in <module>
    doc = PDF.loads(f, [l])
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/pdf/pdf.py", line 49, in loads
    return ReadAnyObjectTransformer().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/any_object_transformer.py", line 93, in transform
    return super().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/transformer.py", line 120, in transform
    out = h.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/reference/xref_transformer.py", line 104, in transform
    trailer = self.get_root_transformer().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/transformer.py", line 120, in transform
    out = h.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/object/dictionary_transformer.py", line 46, in transform
    v = self.get_root_transformer().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/transformer.py", line 120, in transform
    out = h.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/reference/reference_transformer.py", line 103, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/transformer.py", line 120, in transform
    out = h.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/page/root_dictionary_transformer.py", line 53, in transform
    transformed_root_dictionary = t.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/object/dictionary_transformer.py", line 46, in transform
    v = self.get_root_transformer().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/transformer.py", line 120, in transform
    out = h.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/reference/reference_transformer.py", line 103, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/transformer.py", line 120, in transform
    out = h.transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/metadata/xmp_metadata_transformer.py", line 68, in transform
    out_value = super(XMPMetadataTransformer, self).transform(
  File "/Users/adaszko/repos/prawo/__pypackages__/3.9/lib/borb/io/read/object/stream_transformer.py", line 45, in transform
    object_to_transform.add_event_listener(l)  # type: ignore [attr-defined]
AttributeError: 'Stream' object has no attribute 'add_event_listener'

HeterogeneousParagraph does not wrap

It doesn't seem that the HeterogeneousParagraph has a way of wrapping a long passage of text and working similar to Paragraph.

I am using a list of (3) ChunkOfText() with the first being before the bold text, the second the bold text, and the third the text following the bold word to try to create this:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

I'm getting a: ValueError: max() arg is an empty sequence

I am using layout.add() to add the HP to the layout so that I don't have to provide a bounding box similar to how I would use a regular paragraph.

It seems the HeterogeneousParagraph does not have a bounding box.

Any suggestions?

Problem with JSON example

There is a bug in https://github.com/jorisschellekens/ptext-release/blob/master/EXAMPLES.md#181-exporting-a-pdf-as-json with doc.to_json_serializable(doc) which should be doc.to_json_serializable()

Errors with accented characters

Hello,

first of all, sorry if the solution to my question is too simple, as I am not a big expert and have been trying several things with no success.

I am just trying to write accented characters to my pdf, as simple as this:

table_001.add(Paragraph("Nombre", font="Helvetica-Bold"))
table_001.add(Paragraph("Dirección", font="Helvetica-Bold"))
table_001.add(Paragraph("Código Postal", font="Helvetica-Bold"))
table_001.add(Paragraph("CIF", font="Helvetica-Bold"))
table_001.add(Paragraph("Fecha de Factura", font="Helvetica-Bold"))
table_001.add(Paragraph("Número de Factura", font="Helvetica-Bold"))

I use binary mode to dump the file to pdf:

with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

![Untitled](https://user-images.githubusercontent.com/7126425/126342074-0fc46ef0-2b66-49c9-8535-ac17b6a2b851.png)

As you can see, something goes wrong: the PDF is not built correctly, and accented characters are not shown correctly.

Any ideas?

Thank you

Regards

cannot import name 'SingleColumnLayout'

Just trying out the hello world from the README, I get this:

Traceback (most recent call last):
  File "/Users/moritz/code/python/csv2sheets/src/test.py", line 3, in <module>
    from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
ImportError: cannot import name 'SingleColumnLayout' from 'ptext.pdf.canvas.layout.page_layout' (/Users/moritz/.local/share/virtualenvs/csv2sheets-vAENhi9u/lib/python3.9/site-packages/ptext/pdf/canvas/layout/page_layout/__init__.py)

This is the only part that doesn't import. All the other imports work (if I e.g. switch the order).

I am on macOS, I installed ptext (2.0.4) with :

❯ pipenv install ptext-joris-schellekens
Installing ptext-joris-schellekens...
Adding ptext-joris-schellekens to Pipfile's [packages]...
✔ Installation Succeeded
Pipfile.lock (c8522f) out of date, updating to (9af5ab)...
Locking [dev-packages] dependencies...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success!
Updated Pipfile.lock (9af5ab)!
Installing dependencies from Pipfile.lock (9af5ab)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0/0 — 00:00:00

❯ pipenv shell
Launching subshell in virtual environment...
 . /Users/moritz/.local/share/virtualenvs/csv2sheets-vAENhi9u/bin/activate
    ~/code/python/csv2sheets  loading 
❯  . /Users/moritz/.local/share/virtualenvs/csv2sheets-vAENhi9u/bin/activate

Fonts with 'space' character id = 0 display incorrectly

I have been modifying borb so that paragraphs can split across pages. That works. But while experimenting, I tried some different fonts that I had downloaded, and in the results the lines were too long. After some digging, I discovered that these fonts assign the 'space' character to character id 0. When the GlyphLine class encountered this, it treated it as the start of a two-byte character (0x00??): it retrieved the following character but discarded the space (glyph_line.py, line 66).

I patched this by changing
if i + 1 < len(text_bytes):
to
if i + 1 < len(text_bytes) and text_bytes[i]:

I don't know if this is generally applicable (i.e. everything I know about fonts I learned while trying to figure this out),
but it works for me. It also seems reasonable since a two-byte value starting with 0x00 returns the same value as a one-byte value.
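
The effect of the extra text_bytes[i] check can be illustrated with a small stand-alone decoder. This is only a sketch of the idea, not borb's actual GlyphLine code: the function name decode_char_codes, the known_codes set, and the guard_zero_lead flag are made up for illustration, and the real class may choose between one-byte and two-byte codes differently.

```python
def decode_char_codes(text_bytes: bytes, known_codes: set, guard_zero_lead: bool) -> list:
    """Split a byte string into character codes, preferring two-byte codes.

    Hypothetical helper: `known_codes` stands in for the codes the font defines.
    """
    codes = []
    i = 0
    while i < len(text_bytes):
        can_pair = i + 1 < len(text_bytes)
        if guard_zero_lead:
            # The patched condition: a 0x00 lead byte is never paired.
            can_pair = can_pair and text_bytes[i] != 0
        if can_pair:
            pair = text_bytes[i] * 256 + text_bytes[i + 1]
            if pair in known_codes:
                codes.append(pair)
                i += 2
                continue
        codes.append(text_bytes[i])
        i += 1
    return codes


# Assumed font: 'space' is code 0, 'A' is 0x41, 'B' is 0x42.
font_codes = {0x00, 0x41, 0x42}
data = bytes([0x41, 0x00, 0x42])  # "A", space, "B"

without_guard = decode_char_codes(data, font_codes, guard_zero_lead=False)
with_guard = decode_char_codes(data, font_codes, guard_zero_lead=True)
print(without_guard)  # the space (0) is absorbed into the pair 0x0042
print(with_guard)     # the space (0) survives as its own code
```

Without the guard, the pair 0x00 0x42 has the same numeric value as the single byte 0x42, so the space disappears, which matches the symptom described above.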

re.Match to re.match

borb/borb/toolkit/text/regular_expression_text_extraction.py

Line 33 in 9580de9

re_match: re.Match,
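
For context on the title: in the standard library, re.Match (capital M) is the class of match objects and is what a type annotation needs, while re.match (lowercase) is the function that produces them. The issue does not say which Python version was involved; re.Match has only been exposed under that name since Python 3.7, which is one possible reason the annotation failed. A minimal sketch, where describe_match is a hypothetical function and not borb code:

```python
import re


def describe_match(re_match: re.Match) -> str:
    # re.Match is a class, so it is valid as a type annotation.
    return f"{re_match.group(0)!r} at {re_match.start()}-{re_match.end()}"


# re.match is the function; it returns an re.Match instance (or None).
m = re.match(r"\d+", "2021 invoices")
print(isinstance(m, re.Match))
print(describe_match(m))
```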

How to import the package?

Hi, the package looks very interesting and I appreciate the various examples. However, I am having trouble figuring out how to do the imports. Any chance the examples could include the imports they need?

Support password-protected PDFs

Opening an encrypted PDF without a password (bank statement) results in:

NotImplementedError: password-protected PDFs are currently not supported

This is a TODO mentioned here

can you extract/parse tables from pdf?

Let's say a PDF file with many tables, with text/images between them?

How to avoid blank lines between Paragraphs added to layout

Is there a way to modify a Paragraph so that subsequent Paragraphs don't have blank lines between them? If not, is there a better class to use to achieve this effect while still having the PageLayout help with creating the bounding boxes for me?

Bit improvement of Examples.md

I really like what you're building here. I already saw a closed issue about importing things from the library, and if it weren't for autocompletion I would probably be stuck using it. Some functionality is deeply nested within the project, which could be by design and I'm OK with that, but the import details could also be included in Examples.md. I don't mind making a pull request for that; I'm just checking whether this makes sense.

How to replace text within a pre-existing PDF

Hi, this library looks really interesting. I'd like to learn it.
I couldn't see any documentation on how to replace existing text within an existing PDF, eg:

{{TITLE}} to Mr...

How would this be done?
Thank you :)

Manually set leading for Paragraph

Is it possible to manually set the leading for the text in a paragraph?

Simplify import paths

This library is great and I'm really excited to use it. However, my one complaint is the deeply nested import paths, similar to #25:

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

My opinion is that this is too verbose, and any future restructuring of the modules might break the API for existing users. It can be simplified to something like:

from borb.pdf.canvas import SingleColumnLayout
from borb.pdf.canvas import Paragraph
from borb.pdf import Document, Page, PDF

Relative imports can be added to a module's __init__.py for shorter import paths. However, instead of just placing all imports into borb's main __init__.py, larger sub-modules like canvas can remain separated to maintain a clear structure.

A proposed example

Using borb.pdf as an example, within borb/pdf/__init__.py, it can contain:

from .pdf import PDF
from .document import Document
from .page.page import Page
from .page.page_info import PageInfo
from .page.page_size import PageSize
from .trailer.document_info import DocumentInfo
from .xref.xref import XREF
from .xref.stream_xref import StreamXREF
from .xref.plaintext_xref import PlainTextXREF

Because canvas contains so many of its own modules, it can be treated as a key sub-module, and relative imports from color, event, font, geometry, layout, line_art and operator can be added to canvas/__init__.py. I tested this out on some of the unit tests and they ran successfully.

Current users can continue to use the current imports if they want but new users can choose to use the simplified imports.
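
The re-export idea can be tried in isolation. The sketch below builds a throwaway package on disk and shows that the short and the long import path resolve to the same class. The package name demo_pkg and its layout are made up for the demonstration; this is not borb's real module tree.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create demo_pkg/page/page.py defining Page, plus a demo_pkg/__init__.py
# that re-exports Page through a relative import.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
(pkg / "page").mkdir(parents=True)
(pkg / "page" / "__init__.py").write_text("")
(pkg / "page" / "page.py").write_text("class Page:\n    pass\n")
(pkg / "__init__.py").write_text("from .page.page import Page\n")

# Make the temporary directory importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_pkg import Page as ShortPage            # new, shorter path
from demo_pkg.page.page import Page as LongPage   # existing, nested path

# Both names refer to the very same class object.
print(ShortPage is LongPage)
```

Because the short path is only an additional name for the same object, existing code that uses the nested path keeps working unchanged.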

This is just a proposed example that I feel would help to significantly simplify imports, and am happy to hear alternative solutions as well. Thank you for maintaining this btw :)

Strange digits placement in TTF.

Ubuntu Font provides strange digits placement.
https://fonts.google.com/specimen/Ubuntu

font_path: Path = Path(__file__).parent / "Ubuntu-Light.ttf"
font: Font = TrueTypeFont.true_type_font_from_file(font_path)


def add_item_content(self, content: str):
    # content_str = f"Content\n\n{content}"
    content_str = "String with latin, cyrillic, digits in Ubuntu Font:\nОдин, два, три: почали.\n One1 Two2 Three3"
    self.__table.add(
        TableCell(
            Paragraph(
                content_str,
                respect_newlines_in_text=True,
                font=font,
                # font_size=Decimal(8),
            ),
            col_span=4,
        )
    )

result

Originally posted by @fessua in #32 (comment)

Possible to add Japanese characters supports

Hey Joris,
While I was using your library to create some invoice PDFs, I ran into some errors with Japanese/Chinese characters.
Sample code:

pdf = Document()
page = Page()
pdf.append_page(page)
page_layout = SingleColumnLayout(page)
page_layout.vertical_margin = page.get_page_info().get_height() * Decimal(0.05)
table_001 = Table(number_of_rows=5, number_of_columns=3)
table_001.add(Paragraph("請求書.",font='Courier', horizontal_alignment=Alignment.CENTERED))

Error:
assert cid is not None, "Font %s can not represent '%s'"
I traced the error to _write_text_bytes_in_hex in borb/pdf/canvas/layout/text/chunk_of_text.py

請 = chr(0x8acb)
求 = chr(0x6c42)
書 = chr(0x66f8)
Do you mind taking a look, please?
Thanks
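
Courier is one of the standard PDF fonts, and those are generally limited to a Latin character set, so a CJK string like the one above has no glyphs to map to. As a rough pre-check before building a Paragraph, one could test whether the text fits a single-byte Latin encoding. This is a sketch under an assumption: the helper chars_outside_encoding is hypothetical, and cp1252 is only an approximation of what such a font covers; it does not query borb or the font itself.

```python
def chars_outside_encoding(text: str, encoding: str = "cp1252") -> list:
    """Return the characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# The string from the report: the three CJK characters are flagged.
print(chars_outside_encoding("請求書."))
# Accented Latin text fits the encoding, so nothing is flagged.
print(chars_outside_encoding("Dirección"))
```

If the returned list is not empty, the text probably needs a font that actually contains those glyphs, for example a TrueType font loaded from a file.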

No Space between words

Hi team,

We are trying to use the library to extract data from PDF files, but the extracted text has no spaces between words, so we cannot use it. Is there a way to fix this?

Regards
Mandeep