Comments (13)
Hi there. Which Python version are you using? Can I have access to this PDF file?
from pdf2pdfocr.
$ python3 --version
Python 3.4.8
Thanks. I'm on vacation now. I will check when I'm back home. Please wait about 10 days...
Hi Eric. I was able to process your PDF file on my installation.
Can you please post your complete command line?
$ pdf2pdfocr.py -i S.pdf -l fra -v
[2018-07-31 11:20:23.557974] [DEBUG] Temp dir is /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
[2018-07-31 11:20:23.558115] [DEBUG] Prefix is H82ML
[2018-07-31 11:20:23.558165] [DEBUG] Script dir is /Users/emoret/bin/
[2018-07-31 11:20:23.558221] [DEBUG] Parallel operations will use 4 CPUs
[2018-07-31 11:20:23.558324] [LOG] Welcome to pdf2pdfocr version 1.2.3
[2018-07-31 11:20:23.564893] [LOG] Input file /Users/emoret/Downloads/S.pdf: type is application/pdf
[2018-07-31 11:20:23.568274] [DEBUG] Output file: /Users/emoret/Downloads/S-OCR.pdf for PDF and /Users/emoret/Downloads/S-OCR.pdf.txt for TXT
[2018-07-31 11:20:23.568386] [LOG] Converting input file to images...
[2018-07-31 11:20:25.147699] [LOG] Starting OCR...
[2018-07-31 11:20:25.161726] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2018-07-31 11:20:30.165266] [LOG] OCR completed
[2018-07-31 11:20:30.166331] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.
It looks like tesseract did not run successfully. Is French (-l fra) correctly installed? Can you run with "-k" and post a zip file with the generated temp files whose names contain the "prefix" (H82ML in your last execution)?
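For anyone gathering those temp files, here is a minimal Python sketch of the request above. It assumes the kept files live directly in the temp dir printed in the debug log and that their names contain the prefix; the function name `zip_temp_files` and the default zip name are made up for illustration and are not part of pdf2pdfocr.

```python
import glob
import os
import tempfile
import zipfile


def zip_temp_files(prefix, temp_dir=None, zip_path=None):
    """Zip every regular file in temp_dir whose name contains prefix.

    Returns (zip_path, list_of_archived_file_names).
    Hypothetical helper, not part of pdf2pdfocr.
    """
    # Assumption: the script's temp dir is the system temp dir unless given.
    temp_dir = temp_dir or tempfile.gettempdir()
    zip_path = zip_path or os.path.join(
        os.getcwd(), "pdf2pdfocr-temp-" + prefix + ".zip"
    )
    # glob.escape protects against special characters in the path or prefix.
    pattern = os.path.join(
        glob.escape(temp_dir), "*" + glob.escape(prefix) + "*"
    )
    matches = sorted(
        p for p in glob.glob(pattern)
        if os.path.isfile(p)
        and os.path.abspath(p) != os.path.abspath(zip_path)
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in matches:
            # Store only the file name, not the full temp path.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path, [os.path.basename(p) for p in matches]
```

Calling `zip_temp_files("H82ML")` after a run with "-k" would write the archive to the current directory; pass `temp_dir` explicitly if the log shows a different location.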
$ port installed | grep tesseract-fra
tesseract-fra @3.04_1 (active)
SJCMAC41CBFVH8:Downloads emoret$ pdf2pdfocr.py -i S.pdf -l fra -v -k
[2018-07-31 11:40:04.273624] [DEBUG] Temp dir is /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
[2018-07-31 11:40:04.273757] [DEBUG] Prefix is CIIXL
[2018-07-31 11:40:04.273806] [DEBUG] Script dir is /Users/emoret/bin/
[2018-07-31 11:40:04.273873] [DEBUG] Parallel operations will use 4 CPUs
[2018-07-31 11:40:04.273937] [LOG] Welcome to pdf2pdfocr version 1.2.3
[2018-07-31 11:40:04.280946] [LOG] Input file /Users/emoret/Downloads/S.pdf: type is application/pdf
[2018-07-31 11:40:04.283982] [DEBUG] Output file: /Users/emoret/Downloads/S-OCR.pdf for PDF and /Users/emoret/Downloads/S-OCR.pdf.txt for TXT
[2018-07-31 11:40:04.284095] [LOG] Converting input file to images...
[2018-07-31 11:40:05.873390] [LOG] Starting OCR...
[2018-07-31 11:40:05.883770] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2018-07-31 11:40:10.888903] [LOG] OCR completed
[2018-07-31 11:40:10.890272] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.
Temporary files kept in /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
What does "tesseract --list-langs" show?
Are there any "tess_err_*" files kept in your temp folder? Can you post their contents?
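A quick way to collect those logs, sketched in Python. It assumes the error logs sit directly in the temp dir and follow the "tess_err_" naming asked about above; `read_tesseract_error_logs` is a hypothetical helper name, not something pdf2pdfocr provides.

```python
import glob
import os
import tempfile


def read_tesseract_error_logs(temp_dir=None, prefix=""):
    """Return {file_name: contents} for tess_err_<prefix>* files in temp_dir.

    Hypothetical helper, not part of pdf2pdfocr.
    """
    # Assumption: the kept files are in the system temp dir unless given.
    temp_dir = temp_dir or tempfile.gettempdir()
    pattern = os.path.join(
        glob.escape(temp_dir), "tess_err_" + glob.escape(prefix) + "*"
    )
    logs = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps going if a log holds undecodable bytes.
        with open(path, "r", errors="replace") as handle:
            logs[os.path.basename(path)] = handle.read()
    return logs
```

Printing each entry of `read_tesseract_error_logs(prefix="CIIXL")` gives the text to paste into the issue.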
$ cat tess_err_CIIXL-1.log
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Can not open file "/opt/local/share/tessdata//pdf.ttf"!
Error during processing.
SJCMAC41CBFVH8:T emoret$ ls /opt/local/share/tessdata/
eng.traineddata fra.traineddata osd.traineddata pol.traineddata por.traineddata
Language is ok.
It looks like tesseract was upgraded to 3.05 in MacPorts. My installation still uses 3.04.
There is a bug in tesseract 3.05 with MacPorts (https://trac.macports.org/ticket/56226). :(
Can you try copying "pdf.ttf" to the correct folder manually?
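A rough Python sketch of that manual copy, under stated assumptions: the thread does not say where to obtain "pdf.ttf", so `source_ttf` is a hypothetical path to a copy you already have, and `ensure_pdf_ttf` is an illustrative name. Writing into /opt/local/share/tessdata/ may require elevated permissions.

```python
import os
import shutil


def ensure_pdf_ttf(tessdata_dir, source_ttf):
    """Copy source_ttf into tessdata_dir as pdf.ttf if it is missing.

    Returns True when a copy was made, False when pdf.ttf was already there.
    Hypothetical helper; source_ttf is a path you supply yourself.
    """
    target = os.path.join(tessdata_dir, "pdf.ttf")
    if os.path.isfile(target):
        # Nothing to do: tesseract should already find the font.
        return False
    if not os.path.isfile(source_ttf):
        raise FileNotFoundError(source_ttf)
    shutil.copyfile(source_ttf, target)
    return True
```

With the directory from the error log, the call would look like `ensure_pdf_ttf("/opt/local/share/tessdata", "/path/to/your/pdf.ttf")`, where the second argument is a placeholder.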
Thank you for your help. I reverted to tesseract 3.04 and it now works again. I followed these instructions:
https://trac.macports.org/wiki/howto/InstallingOlderPort
Great! I will keep this issue open to track the MacPorts bug with tesseract 3.05!
Fixed upstream