Hi! I am working on a PDF text mining project, for which I decided t

Why is `borb` so (incredibly) slow? (jorisschellekens/borb)

Comments (4)

jorisschellekens commented on May 9, 2024

Thanks for your feedback.
I'd love to know what takes up most of the time.
If you can dig deeper and perhaps perform a trace to see which functions are the most labor-intensive, that'd be great.

Kind regards,
Joris Schellekens


sgraaf commented on May 9, 2024

Thanks for your response.

For this little benchmark, for every PDF library, the goal was to see how long it would take to get a simple page count. For borb, I benchmarked the following code:

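from pathlib import Path
from borb.pdf import PDF

def get_page_count_borb(file: Path) -> int:
    # Load the whole document, then read the page count from its document info.
    with open(file, "rb") as f:
        doc = PDF.loads(f)
        return doc.get_document_info().get_number_of_pages()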

I'll see if I can narrow it down further soon.
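
One way to narrow it down would be to run the benchmark function under Python's built-in `cProfile` and sort the results by cumulative time. The following is only a minimal sketch: `profile_call` is a hypothetical helper name, not part of borb, and the commented usage assumes the `get_page_count_borb` function above and a placeholder file path.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Write the statistics into a string instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage with the benchmark function above:
# count, report = profile_call(get_page_count_borb, Path("some.pdf"))
# print(report)
```

Sorting by cumulative time tends to surface the high-level calls (for example, parsing page content) that dominate the run, which is usually what a maintainer needs to see first.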


jorisschellekens commented on May 9, 2024

I think the first thing to note is that, for borb, there is no difference between getting a page count and getting the text from all pages.

The entire binary stream is converted to the internal representation when you're loading a Document.

I imagine some libraries might optimize that, and only parse the needed things to get the page-count.


jorisschellekens commented on May 9, 2024

I changed the code around a bit.
Rather than always parsing the content of the page, a page is now only parsed if there is actually a registered EventListener. In other words, if the content of the page is not needed by anyone, it isn't parsed.

This still allows you to open / copy / modify documents. And of course to read their metadata (such as number of pages).
This should provide a significant speed-up.
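
The idea can be pictured roughly as follows. This is an illustrative sketch only, not borb's actual code: `Page`, `parse_content`, and `load_document` are hypothetical names, and decoding bytes stands in for the real (expensive) content-stream parsing.

```python
class Page:
    """Hypothetical page holding raw content bytes; not borb's real class."""

    def __init__(self, raw_content: bytes):
        self.raw_content = raw_content
        self.parse_count = 0  # how many times the expensive step ran

    def parse_content(self, listeners):
        # Stand-in for real content-stream parsing.
        self.parse_count += 1
        text = self.raw_content.decode("latin-1")
        for listener in listeners:
            listener(text)


def load_document(raw_pages, listeners=()):
    """Build pages; parse their content only if a listener is registered."""
    pages = [Page(raw) for raw in raw_pages]
    if listeners:  # nobody listening means nobody needs the content
        for page in pages:
            page.parse_content(listeners)
    return pages


# Page count without listeners: no content parsing happens.
pages = load_document([b"first page", b"second page"])
print(len(pages))
```

With no listeners, the page objects still exist and can be counted, but the costly per-page step is skipped entirely.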

These are my findings. The corpus I used can be found here.

output.pdf

Kind regards,
Joris Schellekens

