Comments (5)
Yes, I see roughly an 8x speedup (about 1m54s down to 14s) with the new flag on documents of moderate text size and low image count (30 pages).
Thank you for the feature!
Testing on this document with 30 pages of text: https://raw.githubusercontent.com/liquidinvestigations/hoover-testdata/master/data/disk-files/pdf-doc-txt/stanley.ec02.pdf
time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 -i document.pdf
real 1m53.855s
user 1m48.059s
time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 --ignore-existing-text -i document.pdf
real 0m13.981s
user 0m12.642s
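The wall-clock ratio between the two runs can be checked directly from the `real` values above. This is only a small sketch; the helper name `parse_time` is made up for illustration and is not part of pdf2pdfocr.

```python
import re


def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '1m53.855s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# `real` timings copied from the two runs above.
baseline = parse_time("1m53.855s")   # without --ignore-existing-text
with_flag = parse_time("0m13.981s")  # with --ignore-existing-text

speedup = baseline / with_flag
print(f"{baseline:.3f}s -> {with_flag:.3f}s, speedup {speedup:.1f}x")
```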
Thanks again!
from pdf2pdfocr.
Hello, @gabriel-v.
Great use case. Thank you.
For now, I think it would be simpler if the script removed all existing text before starting OCR.
I'll check on this.
I think a good starting point is here: https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file
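One way to strip text along those lines is Ghostscript's -dFILTERTEXT switch with the pdfwrite device. The sketch below only assumes that a recent Ghostscript is installed as `gs`; the function names are hypothetical, and this is not necessarily how pdf2pdfocr does it.

```python
import subprocess


def build_strip_text_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command that rewrites `src` to `dst` without text.

    Assumption: `gs_binary` is a Ghostscript version that supports
    -dFILTERTEXT (images and vector graphics are kept, text is dropped).
    """
    return [
        gs_binary,
        "-o", str(dst),         # output file (implies batch mode, no pause)
        "-sDEVICE=pdfwrite",    # write a new PDF
        "-dFILTERTEXT",         # drop text objects while rewriting
        str(src),
    ]


def strip_text(src, dst):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(build_strip_text_command(src, dst), check=True)
```

Calling strip_text("input.pdf", "no-text.pdf") would then produce a copy that can be passed to the OCR step.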
I tested manually with a 162-page text PDF that has a single image on page 1 (see attached file).
No text in PDF (first page with image and 161 blank pages)
[2022-05-13 09:09:22.666163] [LOG] Success in 71.173 seconds!
Normal PDF (162 pages with text)
[2022-05-13 09:21:12.134621] [LOG] Success in 607.781 seconds!
My first conclusion: Tesseract is fast on blank pages.
Maybe we can optimize further by detecting blank pages and skipping the Tesseract call for them. The method "do_check_img_greyscale" can serve as an example.
This use case is interesting. I'll code this! :-)
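A blank-page check could look roughly like the sketch below. It is not the code from the commit and not "do_check_img_greyscale"; the function name and both thresholds are assumptions picked for illustration. It works on plain 8-bit greyscale values, which with Pillow could come from Image.open(path).convert("L").getdata().

```python
from typing import Iterable


def is_probably_blank(
    pixels: Iterable[int],
    white_threshold: int = 250,    # assumed: values below this count as "ink"
    max_ink_ratio: float = 0.001,  # assumed: tolerate 0.1% of noisy pixels
) -> bool:
    """Return True when almost every greyscale pixel (0-255) is near white."""
    total = 0
    ink = 0
    for value in pixels:
        total += 1
        if value < white_threshold:
            ink += 1
    if total == 0:
        # No pixels at all: nothing to OCR.
        return True
    return ink / total <= max_ink_ratio


# A page with 10% dark pixels is not blank; an all-white page is.
print(is_probably_blank([255] * 900 + [0] * 100))
print(is_probably_blank([255] * 1000))
```

Pages for which this returns True could skip the Tesseract call entirely. The thresholds would need tuning against real scans, since speckle and scanner shadows add stray dark pixels.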
Please let me know if the last commit works for you.
from pdf2pdfocr.
[2022-05-15 09:24:48.847586] [LOG] Success in 16.423 seconds!