Comments (5)

gabriel-v commented on May 12, 2024

Yes, I see a speed improvement of roughly 8x to 9x with the new flag for documents with a moderate amount of text and few images (30 pages).

Thank you for the feature!

Testing on this document with 30 pages of text: https://raw.githubusercontent.com/liquidinvestigations/hoover-testdata/master/data/disk-files/pdf-doc-txt/stanley.ec02.pdf

time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 -i document.pdf
real    1m53.855s
user    1m48.059s

time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 --ignore-existing-text -i document.pdf
real    0m13.981s
user    0m12.642s
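
For reference, the speedup implied by the two `real` times above can be worked out with a short script. This is only a sketch: `parse_time` is a hypothetical helper written for this comment, not part of pdf2pdfocr, and it assumes the `XmY.YYYs` format that bash's `time` prints.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` duration such as '1m53.855s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# Wall-clock ("real") times copied from the two runs above.
baseline = parse_time("1m53.855s")   # without --ignore-existing-text
with_flag = parse_time("0m13.981s")  # with --ignore-existing-text

speedup = baseline / with_flag
print(f"{baseline:.3f}s -> {with_flag:.3f}s, speedup {speedup:.1f}x")
```

On wall-clock time this comes out at about 8.1x for this single document, so treat the figure as a ballpark rather than a benchmark.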

Thanks again!

from pdf2pdfocr.

LeoFCardoso commented on May 12, 2024

Hello, @gabriel-v.
Great use case. Thank you.
For now, I think it would be simpler if the script removed all existing text before starting OCR.
I'll look into this.

LeoFCardoso commented on May 12, 2024

I think a good approach to try is described here: https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file
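
One way to strip text from a PDF is Ghostscript's `-dFILTERTEXT` switch, available in recent Ghostscript releases. The sketch below only shows the idea. It assumes a `gs` binary on the PATH; the helper names `build_strip_text_command` and `strip_text` are hypothetical, and this is not necessarily what pdf2pdfocr ends up doing.

```python
import subprocess


def build_strip_text_command(src: str, dst: str) -> list[str]:
    """Build a Ghostscript command that rewrites `src` to `dst` without text."""
    return [
        "gs",                 # assumption: Ghostscript is installed as "gs"
        "-o", dst,            # output file (also implies batch mode, no pause)
        "-sDEVICE=pdfwrite",  # write a new PDF
        "-dFILTERTEXT",       # drop text objects, keep images and vectors
        src,
    ]


def strip_text(src: str, dst: str) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_strip_text_command(src, dst), check=True)
```

Calling `strip_text("input.pdf", "no-text.pdf")` would then produce a copy that OCR can process from scratch.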

input.pdf

I tested manually with a 162-page text PDF that has a single image on page 1 (see attached file).

No text in PDF (first page with image and 161 blank pages)
[2022-05-13 09:09:22.666163] [LOG] Success in 71.173 seconds!

Normal PDF (162 pages with text)
[2022-05-13 09:21:12.134621] [LOG] Success in 607.781 seconds!

My first conclusion: Tesseract is fast on blank pages.

Maybe we can optimize even further by detecting blank pages and not calling Tesseract for them at all. The method "do_check_img_greyscale" can be used as an example.
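
A blank-page check could look something like the sketch below. It is not modelled on "do_check_img_greyscale"; the function name `is_blank_page` and both thresholds are assumptions chosen for illustration. It takes 8-bit greyscale pixel values (0 = black, 255 = white), which in practice might come from something like Pillow's `Image.open(path).convert("L").getdata()`.

```python
from typing import Iterable


def is_blank_page(
    pixels: Iterable[int],
    white_threshold: int = 245,    # assumption: values below this count as "ink"
    max_ink_ratio: float = 0.001,  # assumption: tolerate 0.1% noise/specks
) -> bool:
    """Return True if almost no pixel is darker than `white_threshold`."""
    total = 0
    ink = 0
    for value in pixels:
        total += 1
        if value < white_threshold:
            ink += 1
    if total == 0:
        return True  # nothing to OCR on an empty image
    return ink / total <= max_ink_ratio


# Synthetic pages: one fully white, one with 1% black pixels.
blank_page = [255] * 10_000
text_page = [255] * 9_900 + [0] * 100

print(is_blank_page(blank_page))  # True
print(is_blank_page(text_page))   # False
```

Pages that pass the check could be skipped instead of being sent to Tesseract. The thresholds would need tuning on real scans, since scanner noise and faint text pull in opposite directions.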

This use case is interesting. I'll code this! :-)

LeoFCardoso commented on May 12, 2024

Please let me know if the last commit works for you.

LeoFCardoso commented on May 12, 2024

[2022-05-15 09:24:48.847586] [LOG] Success in 16.423 seconds!
