If a PDF contains a large amount of text and a small amount of pictures, we only want

Improve efficiency on PDFs which contain large amounts of text (pdf2pdfocr issue, closed)

leofcardoso commented on May 12, 2024

Improve efficiency on PDFs which contain large amounts of text

from pdf2pdfocr.

Comments (5)

gabriel-v commented on May 12, 2024

Yes, I see a roughly 8x speed improvement with the new flag (wall-clock time drops from 1m53.9s to 14.0s) on a document of moderate text size and low image count (30 pages).

Thank you for the feature!

Testing on this document with 30 pages of text: https://raw.githubusercontent.com/liquidinvestigations/hoover-testdata/master/data/disk-files/pdf-doc-txt/stanley.ec02.pdf

time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 -i document.pdf
real    1m53.855s
user    1m48.059s

time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 --ignore-existing-text -i document.pdf
real    0m13.981s
user    0m12.642s
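
To make the comparison above concrete, here is a minimal sketch that converts the two `real` values reported by `time` into seconds and computes the ratio. The helper name `parse_time` is made up for this illustration and is not part of pdf2pdfocr.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '1m53.855s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") values copied from the two runs above.
baseline = parse_time("1m53.855s")   # without --ignore-existing-text
with_flag = parse_time("0m13.981s")  # with --ignore-existing-text
speedup = baseline / with_flag

print(f"{baseline:.3f}s -> {with_flag:.3f}s, speedup {speedup:.1f}x")
```

On these numbers the wall-clock ratio comes out a little above 8x; the `user` times give a slightly higher ratio.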

Thanks again!

LeoFCardoso commented on May 12, 2024

Hello, @gabriel-v.
Great use case. Thank you.
For now, I think it would be simpler if the script removed all existing text before starting OCR.
I'll check on this.

LeoFCardoso commented on May 12, 2024

I think a good try is here: https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file
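
As a rough illustration of the "remove all text first" idea, the sketch below builds a Ghostscript command that rewrites a PDF with the `-dFILTERTEXT` switch, which drops text while keeping images. This assumes a reasonably recent Ghostscript is installed as `gs`; the function names are invented for the example, and nothing in this thread confirms that pdf2pdfocr does it this way.

```python
import subprocess
from pathlib import Path


def build_strip_text_command(src: Path, dst: Path, gs_binary: str = "gs") -> list[str]:
    """Build a Ghostscript command that copies src to dst without its text."""
    return [
        gs_binary,
        "-o", str(dst),        # output file (implies batch mode, no pause)
        "-sDEVICE=pdfwrite",   # write a PDF again
        "-dFILTERTEXT",        # drop text objects, keep images
        str(src),
    ]


def strip_text(src: Path, dst: Path) -> None:
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(build_strip_text_command(src, dst), check=True)
```

Calling `strip_text(Path("input.pdf"), Path("no_text.pdf"))` would then produce a copy suitable for feeding to OCR, with the caveat that the extracted text layer of the original still has to be merged back afterwards.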

input.pdf

I tested manually with a 162-page text PDF that has a single image on page 1 (see attached file).

No text in PDF (first page with image and 161 blank pages)
[2022-05-13 09:09:22.666163] [LOG] Success in 71.173 seconds!

Normal PDF (162 pages with text)
[2022-05-13 09:21:12.134621] [LOG] Success in 607.781 seconds!

My first conclusion is: tesseract is fast with blank pages.

Maybe we can optimize even further by detecting blank pages and skipping the tesseract call for them. The method "do_check_img_greyscale" can serve as an example.
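
A minimal sketch of what such a blank-page check could look like, working on a flat sequence of 8-bit greyscale pixel values. This is not the code of "do_check_img_greyscale"; the function name and both thresholds are assumptions chosen for illustration, and in practice the pixels would come from an image library.

```python
from typing import Sequence


def is_blank_page(
    pixels: Sequence[int],
    dark_threshold: int = 200,         # assumed: values below this count as "ink"
    max_dark_fraction: float = 0.001,  # assumed: tolerate 0.1% of noise pixels
) -> bool:
    """Return True when almost no pixel is darker than dark_threshold."""
    if not pixels:
        return True  # nothing to OCR
    dark = sum(1 for value in pixels if value < dark_threshold)
    return dark / len(pixels) <= max_dark_fraction


white_page = [255] * 10_000              # all white
text_page = [255] * 9_500 + [0] * 500    # 5% black pixels

print(is_blank_page(white_page))  # True
print(is_blank_page(text_page))   # False
```

Pages for which the check returns True could bypass tesseract entirely, which is the saving suggested above.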

This use case is interesting. I'll code this! :-)

LeoFCardoso commented on May 12, 2024

Please let me know if the last commit works for you.

LeoFCardoso commented on May 12, 2024

[2022-05-15 09:24:48.847586] [LOG] Success in 16.423 seconds!
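
To put the three timings from this thread side by side, the sketch below extracts the elapsed seconds from each log line and compares them with the normal run. It assumes all three lines refer to the same 162-page test file, which the thread does not state explicitly; the labels and the helper name are made up for the example.

```python
import re

LOG_PATTERN = re.compile(r"\[LOG\] Success in (\d+(?:\.\d+)?) seconds!")


def elapsed_seconds(line: str) -> float:
    """Extract the elapsed time from a 'Success in N seconds!' log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a success log line: {line!r}")
    return float(match.group(1))


# Log lines copied verbatim from the comments above.
runs = {
    "normal PDF": "[2022-05-13 09:21:12.134621] [LOG] Success in 607.781 seconds!",
    "text removed manually": "[2022-05-13 09:09:22.666163] [LOG] Success in 71.173 seconds!",
    "after the commit": "[2022-05-15 09:24:48.847586] [LOG] Success in 16.423 seconds!",
}

baseline = elapsed_seconds(runs["normal PDF"])
for label, line in runs.items():
    seconds = elapsed_seconds(line)
    print(f"{label}: {seconds:.3f}s ({baseline / seconds:.1f}x vs normal)")
```

Under that assumption, removing the text by hand gives roughly an 8.5x improvement over the normal run, and the run after the commit roughly 37x.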
