Comments (4)
Thanks for your feedback.
I'd love to know what takes up most of the time.
If you can dig deeper and perhaps perform a trace to see which functions are the most labor-intensive, that'd be great.
Kind regards,
Joris Schellekens
from borb.
Thanks for your response.
For this little benchmark, for every PDF library, the goal was to see how long it would take to get a simple page count. For borb, I benchmarked the following code:
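from pathlib import Path
from borb.pdf import PDF

def get_page_count_borb(file: Path) -> int:
    # Load the whole document, then read the page count from its metadata.
    with open(file, "rb") as f:
        doc = PDF.loads(f)
    return doc.get_document_info().get_number_of_pages()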
I'll see if I can narrow it down further soon.
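One way to narrow it down would be to run the benchmarked function under Python's built-in profiler. The following is only a sketch: `profile_call` is a hypothetical helper name, and the file `sample.pdf` in the usage comment is a placeholder, not part of the benchmark corpus.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 10) -> str:
    """Run func(*args) under cProfile and return a report of the
    `top` entries, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# Usage (assumes get_page_count_borb from above and a local placeholder file):
# print(profile_call(get_page_count_borb, Path("sample.pdf")))
```

Sorting by cumulative time tends to surface the high-level functions that dominate the run, which is probably the most useful starting point for a trace like the one requested.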
from borb.
I think one of the first things to remark there is that for borb, there is no difference between getting a page-count and getting the text from all pages.
The entire binary stream is converted to the internal representation when you're loading a Document.
I imagine some libraries might optimize that, and only parse the needed things to get the page-count.
from borb.
I changed the code around a bit.
Rather than always parsing the content of the page, a page is now only parsed if there is actually a registered EventListener. In other words, if the content of the page is not needed by anyone, it isn't parsed.
This still allows you to open / copy / modify documents. And of course to read their metadata (such as number of pages).
This should provide a significant speed-up.
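To illustrate the idea (this is not borb's actual code), here is a minimal sketch of parsing page content only when a listener is registered. `LazyPage`, `load_pages`, and the whitespace-splitting "parser" are all hypothetical stand-ins for illustration.

```python
from typing import Callable, List, Optional

Listener = Callable[[int, List[str]], None]


class LazyPage:
    """Hypothetical page wrapper: raw bytes are kept, content is parsed on demand."""

    def __init__(self, raw_content: bytes) -> None:
        self.raw_content = raw_content
        self.parse_count = 0  # how many times the expensive parse ran

    def parse(self) -> List[str]:
        # Stand-in for expensive content-stream parsing.
        self.parse_count += 1
        return self.raw_content.decode("latin-1").split()


def load_pages(
    raw_pages: List[bytes],
    listeners: Optional[List[Listener]] = None,
) -> List[LazyPage]:
    """Wrap every page, but parse page content only if someone is listening."""
    pages = [LazyPage(raw) for raw in raw_pages]
    if listeners:
        for index, page in enumerate(pages):
            tokens = page.parse()
            for listener in listeners:
                listener(index, tokens)
    return pages
```

With no listeners, `len(load_pages(raw_pages))` still gives the page count, but no page content is ever parsed.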
These are my findings. The corpus I used can be found here.
Kind regards,
Joris Schellekens
from borb.