leofcardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!

License: Apache License 2.0

Python 94.98% Shell 0.24% Dockerfile 1.26% VBScript 3.53%

pdf ocr pdftk docker python tesseract

pdf2pdfocr's Issues

script hangs on windows and python 3.7.2

The script hangs forever with Python 3.7.2 on Windows.

join_ocred_pdf failing due to "cannot read an empty file"

The error I get is "PyPDF2.errors.PdfReadError: Cannot read an empty file". I experimented with the first 2 pages of this pdf; individually the two pages OCR'ed fine (neither page was empty, and the OCR'ed text was not empty either), but when I tried to do the 2 pages together, it gave me

Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1526, in <module>
pdf2ocr.ocr()
File "/usr/local/bin/pdf2pdfocr.py", line 717, in ocr
self.join_ocred_pdf()
File "/usr/local/bin/pdf2pdfocr.py", line 952, in join_ocred_pdf
pdf_merger.append(PyPDF2.PdfFileReader(text_pdf_file, strict=False))
File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1856, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 277, in __init__
self.read(stream)
File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1301, in read
raise PdfReadError("Cannot read an empty file")
PyPDF2.errors.PdfReadError: Cannot read an empty file.

Side note, on the successful runs it gave me the warnings

UserWarning: isString is deprecated and will be removed in PyPDF2 2.0.0. [_utils.py:76]
UserWarning: namedDestinations will be removed in PyPDF2 2.0.0. Use named_destinations instead. [_reader.py:519]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
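
One way to narrow down the empty-file error above is to check the per-page OCR outputs before they are handed to the merger. This is a minimal sketch, not the project's own code: `split_empty_files` is a hypothetical helper, and it assumes the per-page text PDFs are ordinary files on disk.

```python
import os


def split_empty_files(paths):
    """Separate usable files from missing or zero-byte ones.

    Hypothetical helper: returns (usable, rejected) so the caller can
    log which page outputs would make a PDF reader fail.
    """
    usable, rejected = [], []
    for path in paths:
        # A zero-byte file is what triggers "Cannot read an empty file".
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            usable.append(path)
        else:
            rejected.append(path)
    return usable, rejected
```

Printing the rejected list would show which page produced an empty OCR result when two pages are processed together.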

How can it be used on Windows 10? Can PaddleOCR be used as an OCR engine?

What does cmd_file mean?

The script sets cmd_file = 'file'. What is the purpose of this variable? Its resolved path always comes out as None, and I am not sure whether it needs to be hard-coded to some value or whether I missed something during installation.

I have installed everything on Windows.
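
A path of None is what Python's `shutil.which` returns when a program is not on PATH. The sketch below assumes the script resolves names such as `file` in that way, which is an assumption rather than something confirmed here; `find_tool` and `require_tool` are hypothetical helpers.

```python
import shutil


def find_tool(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # None when the program cannot be found
        if path is not None:
            return path
    return None


def require_tool(candidates):
    """Like find_tool, but fail early with a readable message."""
    path = find_tool(candidates)
    if path is None:
        raise FileNotFoundError(
            "None of these tools were found on PATH: " + ", ".join(candidates)
        )
    return path
```

If `find_tool(["file"])` returns None on Windows, the likely fix is to install the missing program or add its folder to PATH rather than to hard-code a value.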

Poor performance in docker container

Execution inside docker container takes too much time.

Do we have any parameter / flag for pdf compression here, to reduce pdf size after applying OCR?

file not found. Aborting...

python pdf2pdfocr.py -v -r 200 -i Dummy_IS_4.pdf

File: Dummy_IS_4.pdf
[2023-09-14 11:37:30.229202] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-09-14 11:37:30.247081] [DEBUG] Tesseract version: 5
[2023-09-14 11:37:30.247081] [DEBUG] cuneiform not available
file not found. Aborting...

In GUI:

Create language based Dockerimages

Hey, could you please create Dockerfiles for different languages and upload the tagged images to Docker Hub?

Alternatively you could add all tesseract ocr language packages to the Dockerfile, but this would nearly triple the image size:

larsk@MacBook-Pro pdf2pdfocr % docker image ls
REPOSITORY                           TAG                                              IMAGE ID            CREATED             SIZE
pdf2pdfocr                           all-lang                                         a74b8d22d02b        6 seconds ago       1.1GB
pdf2pdfocr                           latest                                           09eccd997dd3        6 minutes ago       417MB

Should I add a PR for this issue?

TypeError: expected str, bytes or os.PathLike object, not NoneType

]# python3 pdf2pdfocr.py -i /home/amuthuraman/NonOcrpdf/test.pdf
[2019-05-30 02:43:00.729260] [LOG] Tesseract can 'textonly_pdf': False
[2019-05-30 02:43:00.739282] [LOG] Tesseract version: 3
Traceback (most recent call last):
File "pdf2pdfocr.py", line 1214, in <module>
pdf2ocr = Pdf2PdfOcr(args)
File "pdf2pdfocr.py", line 450, in __init__
self.check_external_tools()
File "pdf2pdfocr.py", line 531, in check_external_tools
if not self.test_convert():
File "pdf2pdfocr.py", line 1031, in test_convert
stderr=subprocess.DEVNULL, shell=self.shell_mode)
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1278, in _execute_child
executable = os.fsencode(executable)
File "/usr/lib64/python3.6/os.py", line 800, in fsencode
filename = fspath(filename) # Does type-checking of filename.
TypeError: expected str, bytes or os.PathLike object, not NoneType

PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.

While applying OCR to a PDF, using the docker image of the repo "leofcardoso/pdf2pdfocr:latest", this error occurred:

[2023-09-05 10:35:58.939733] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-09-05 10:35:58.959460] [LOG] Input file /home/docker/Dummy_IS.pdf: type is application/pdf
[2023-09-05 10:35:59.047502] [LOG] Converting input file to images...
[2023-09-05 10:36:38.577186] [LOG] Checking blank pages
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/usr/local/bin/pdf2pdfocr.py", line 249, in do_check_img_colors_size
im = Image.open(param_image_file)
File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3172, in open
im = _open_core(fp, filename, prefix, formats)
File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3159, in _open_core
_decompression_bomb_check(im.size)
File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3068, in _decompression_bomb_check
raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1530, in <module>
pdf2ocr.ocr()
File "/usr/local/bin/pdf2pdfocr.py", line 712, in ocr
self.check_blank_pages(image_file_list)
File "/usr/local/bin/pdf2pdfocr.py", line 1010, in check_blank_pages
blank_map_values = colors_size_pool_map.get()
File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
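
The limit in the error above, 178956970 pixels, is exactly twice 89478485, which is Pillow's default `Image.MAX_IMAGE_PIXELS`: Pillow warns above that value and raises above twice that value. The sketch below only does the arithmetic, so that a rendering resolution can be chosen to stay under the limit. The page size in the usage note is a hypothetical example, not taken from the report.

```python
import math

PIL_WARNING_LIMIT = 89_478_485          # Pillow's default Image.MAX_IMAGE_PIXELS
PIL_ERROR_LIMIT = 2 * PIL_WARNING_LIMIT  # above this, DecompressionBombError


def pixel_count(width_in, height_in, dpi):
    """Pixels in a page of the given size (inches) rendered at dpi."""
    return round(width_in * dpi) * round(height_in * dpi)


def max_dpi_below(width_in, height_in, limit=PIL_WARNING_LIMIT):
    """Largest whole dpi at which width_in * height_in * dpi**2 stays within limit."""
    return math.floor(math.sqrt(limit / (width_in * height_in)))
```

For a hypothetical 8.5 x 11 inch page, `pixel_count(8.5, 11, 300)` is far below the limit, so an image of 235978454 pixels points to a very large page or a very high resolution. The alternative of raising `Image.MAX_IMAGE_PIXELS` in the script removes a safety check and should only be considered for trusted input.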

Characters are recognized twice when text is extracted from the PDF

After OCR processing, extracting text from these PDFs (whether with code or by copying directly from the PDF rendered in a browser) returns the entire text twice.

For instance, if the original text contains 5 characters, 10 characters are extracted after OCR, so the content is effectively duplicated.

result pdf file is blank

Text can be extracted, but all pages are blank.

A rectangular block is the only portion being selected from within a paragraph.

As you can see in the below image,
any solution to this problem?

merging multiple files into one pdf-file

At the moment, a PDF file is created for each file if the option "-i" is used for a directory. There should be an option that packs all the files into one single PDF file. This would be useful if the directory contains several "edited" image files after a scan (e.g. one per page, processed with scantailor). An option for sorting these pages "correctly" has to be kept in mind as well.
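
The "correct" sorting mentioned above is usually a natural sort, in which page2 comes before page10. A minimal sketch, assuming page order is encoded as numbers in the file names (the names in the usage note are hypothetical):

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as numbers and the rest as text."""
    # re.split with a capturing group alternates text and digit chunks.
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]
```

For example, `sorted(["page10.tif", "page2.tif", "page1.tif"], key=natural_key)` puts page2.tif before page10.tif, whereas a plain string sort would not.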

rebuild_and_merge fails on Windows with big files

Sometimes, execution fails with a "long command line" error on Windows when ImageMagick is called.
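
Windows limits a command line to 32,767 characters for a directly created process and 8,191 characters under cmd.exe. One generic way around such a limit is to split a long argument list into batches. This is a sketch of that idea only: whether the ImageMagick step can be run in several batches is an assumption, and `batch_by_length` is a hypothetical helper.

```python
def batch_by_length(args, max_chars):
    """Group args into batches whose joined length stays within max_chars."""
    batches, current, current_len = [], [], 0
    for arg in args:
        cost = len(arg) + 1  # the argument plus one separating space
        if cost > max_chars:
            raise ValueError("single argument exceeds max_chars: " + arg)
        if current and current_len + cost > max_chars:
            # Adding this argument would overflow, so start a new batch.
            batches.append(current)
            current, current_len = [], 0
        current.append(arg)
        current_len += cost
    if current:
        batches.append(current)
    return batches
```

The budget passed as `max_chars` should leave room for the program name and its fixed options.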

Missing space

When I use pdf2pdfocr, the text generated includes no space between the words recognized. As a result when I copy/paste the resulting text it is difficult to use as I have to manually reintroduce all missing spaces.

autorotation is broken with tesseract 4

With Tesseract 4, the script always returns an error when autorotate (-u) is used.

file already has text and check text mode is enabled. Exiting.

Hi, I got the following error while trying to add OCR to a PDF:
--> Errors/Warnings:
already has text and check text mode is enabled. Exiting.

The problematic PDF is available on Google Drive:
https://drive.google.com/open?id=0B4mLkzBXmYycQ2N5OGpneWd5dzQ

Font issue on Macos Catalina Dark Appearance

When running the gui I receive an error message as seen below. Note that this does not seem to have consequences when macos is configured in light mode, however in dark mode the non selected UI controls are displaying empty, see attached screenshot.

% pdf2pdfocr_gui.py
2020-04-11 08:16:46.702 Python[45232:697646] CoreText note: Client requested name ".SFNS-Regular", it will get Times-Roman rather than the intended font. All system UI font access should be through proper APIs such as CTFontCreateUIFontForLanguage() or +[NSFont systemFontOfSize:].
2020-04-11 08:16:46.702 Python[45232:697646] CoreText note: Set a breakpoint on CTFontLogSystemFontNameRequest to debug.

GUI

Would be nice to have a simple gui to this tool. Maybe a contextual menu in finder, a print driver, or even an icon that could accept dragged and dropped pdf files ready to OCR?

Error/Warning: Mogrify from ImageMagick not found. Aborting ...

Hi there,

I have long been looking for an open source tool suite like yours. I installed everything for a Windows 7 x64 system as explained in your README.

Right click on PDF -> Send To -> VBS Script gives the error above.

However, mogrify is installed: I can run it successfully from the command line with "magick mogrify".

Looking at the Python code, it seems that it should work. Can you help me out?

Thank you so much.

File doesn't pass PDF/A validation after OCR

File used PDF A-1b.pdf
Site for validation: VeraPDF Demo
Terminal output:

eduardo@000563-desk:~/Área\ de Trabalho/testepdf$ python3 ~/Área\ de\ Trabalho/pdf2pdfocr/pdf2pdfocr.py -w -o pdfabr.pdf -v -l por -i ~/Documentos/PDF\ A-1b.pdf
File: /home/eduardo/Documentos/PDF A-1b.pdf
[2023-02-13 10:25:11.326759] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-02-13 10:25:11.329506] [DEBUG] Tesseract version: 4
[2023-02-13 10:25:11.329598] [DEBUG] cuneiform not available
[2023-02-13 10:25:11.336831] [DEBUG] Pdftoppm version: 22.2.0
[2023-02-13 10:25:11.340303] [DEBUG] Qpdf version: 10.6.3
[2023-02-13 10:25:11.340382] [DEBUG] Temp dir is /tmp/pdf2pdfocr_O4M39/
[2023-02-13 10:25:11.340396] [DEBUG] Prefix is O4M39
[2023-02-13 10:25:11.340413] [DEBUG] Script dir is /home/eduardo/Área de Trabalho/pdf2pdfocr/
[2023-02-13 10:25:11.340442] [DEBUG] Parallel operations will use 8 CPUs
[2023-02-13 10:25:11.349594] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense  - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-02-13 10:25:11.350756] [LOG] Input file /home/eduardo/Documentos/PDF A-1b.pdf: type is application/pdf
[2023-02-13 10:25:11.352185] [DEBUG] User conversion params: 
[2023-02-13 10:25:11.352209] [DEBUG] Output file: pdfabr.pdf for PDF and pdfabr.pdf.txt for TXT
[2023-02-13 10:25:11.352249] [LOG] Converting input file to images
[2023-02-13 10:25:11.427845] [LOG] Checking blank pages
[2023-02-13 10:25:11.928593] [LOG] Starting OCR with tesseract...
[2023-02-13 10:25:13.430893] [LOG] OCR completed
[2023-02-13 10:25:13.431167] [DEBUG] We have 1 ocr'ed files
[2023-02-13 10:25:13.432973] [DEBUG] Joined ocr'ed PDF files
[2023-02-13 10:25:13.433220] [LOG] Created final text file
[2023-02-13 10:25:13.433241] [DEBUG] Merging with OCR
[2023-02-13 10:25:13.445310] [DEBUG] Autorotate skipped
[2023-02-13 10:25:13.445371] [DEBUG] Editing producer
[2023-02-13 10:25:13.458038] [DEBUG] Output file created
[2023-02-13 10:25:13.466554] [LOG] Success in 2.117 seconds!

Validation output:

Improve efficiency on PDFs which contain large amounts of text

If a PDF contains a large amount of text and a small amount of pictures, we only want to OCR the pictures. The script currently OCRs the whole pages, including any existing text, which is undesirable because of the CPU consumption, and degradation of existing text.

I want to implement a change (probably optional, enabled by a flag) to only run OCR on the images, not on any existing text. I would split the images away from the PDF using pdfimages, and then somehow re-create the layer sandwich using only the OCR generated for those images. The original text inside the files should be left untouched.

Do you have any pointers on doing this? I have a couple of ideas I want to investigate:

process everything page by page
- edit the PDF to make all text invisible (same color as background)
- run pipeline as it is -- OCR should be faster, since most of the page is blank
- recombine original PDF text & image layers with the new OCR layer overlay (still page by page)
- still inefficient -- OCR needs to scan through a lot of empty pages
process everything image by image
- run pdfimages to extract images from PDF (along with page number, img size and coordinates)
- maybe use pdf2html to get image location & position
- create PDF sandwiches for each image separately (using pdf2pdfocr, of course)
- re-combine them in the original PDF using pdfjam and pdftk
- more efficient -- we don't give blank images to the OCR engine
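
For the image-by-image idea above, `pdfimages -list` from poppler prints one row per embedded image with its page number and pixel size. The sketch below parses such a listing, assuming the column order printed by recent poppler versions (page, num, type, width, height first). The sample text in the test is illustrative, not real output from any file in this thread, and `parse_pdfimages_list` is a hypothetical helper.

```python
def parse_pdfimages_list(text):
    """Extract page, index, type and pixel size from `pdfimages -list` output."""
    images = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row, the dashed separator and blank lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        images.append(
            {
                "page": int(fields[0]),
                "num": int(fields[1]),
                "type": fields[2],
                "width": int(fields[3]),
                "height": int(fields[4]),
            }
        )
    return images
```

Pages that yield no rows contain no raster images and could be skipped by the OCR step. Note that this listing does not include image coordinates on the page, so placement would still need another source.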

Specify Output Folder using pdf2pdfocr.vbs

Dear Leo,

I love your project but would like to directly push the OCRed files to a new directory.
Therefore I tried to amend the default options: default_option = "-stp -j 0.9 -o %Userprofile%"
But no matter which directory I add, I always get a permission error:

Traceback (most recent call last):
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 1249, in <module>
pdf2ocr.ocr()
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 605, in ocr
self.initial_cleanup()
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 952, in initial_cleanup
Pdf2PdfOcr.best_effort_remove(self.output_file)
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 1154, in best_effort_remove
os.remove(filename)
PermissionError: [WinError 5] Zugriff verweigert: 'C:\Users\Christoph'

Any ideas how to fix it?

Thank you so much.

BR
Christoph
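
The traceback above shows os.remove being called on 'C:\Users\Christoph', which is a directory. That is consistent with "-o" expecting a full file path rather than a folder, although this is an inference from the traceback and not confirmed here. A minimal sketch of building such a path, where `output_file_in` is a hypothetical helper and the "-OCR" suffix mirrors the default output names seen in other logs:

```python
from pathlib import PureWindowsPath


def output_file_in(directory, input_file, suffix="-OCR"):
    """Build '<directory>\\<input stem><suffix><input extension>'."""
    source = PureWindowsPath(input_file)
    target = PureWindowsPath(directory) / (source.stem + suffix + source.suffix)
    return str(target)
```

The returned string would then be passed to "-o" for each input file instead of the bare directory.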

PyPDF2 moved PdfReadError from utils to errors

Followed the instruction guide for Windows and noticed an error when running the SendTo VBScript
"ModuleNotFoundError: No module named 'PyPDF2.utils'"

Looks like the latest version of PyPDF2 moved PdfReadError from utils to errors

Changing line 41 from

from PyPDF2.utils import PdfReadError

to

from PyPDF2.errors import PdfReadError

fixed the problem.
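
An import that works with both module layouts avoids pinning a PyPDF2 version. This is a sketch rather than the project's actual fix; the final placeholder class is only there so that the snippet still imports on a machine without PyPDF2.

```python
try:
    # Newer PyPDF2 releases, as described in this issue.
    from PyPDF2.errors import PdfReadError
except ImportError:
    try:
        # Older PyPDF2 releases.
        from PyPDF2.utils import PdfReadError
    except ImportError:
        class PdfReadError(Exception):
            """Placeholder used only when PyPDF2 is not installed."""
```

ModuleNotFoundError is a subclass of ImportError, so the error quoted above is caught by the same handler.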

Cheers

Error Message by OCR via GUI

Dear Leanardo
I get an error when I try to OCR a PDF file. Maybe you can help me?
I use Windows 10 21H1 in Virtualbox with 4 cores and 16GB Memory for this vm.
Message is:
[2022-02-19 18:44:06.020876] [DEBUG] Tesseract can 'textonly_pdf': True
[2022-02-19 18:44:06.050413] [DEBUG] Tesseract version: 5
[2022-02-19 18:44:06.050413] [DEBUG] cuneiform not available
[2022-02-19 18:44:06.282093] [DEBUG] Pdftoppm version: 22.01.0
[2022-02-19 18:44:06.391073] [DEBUG] Qpdf version: 10.6.2
[2022-02-19 18:44:06.391073] [DEBUG] Temp dir is C:\Users\Martin\AppData\Local\Temp\pdf2pdfocr_ONWZ5
[2022-02-19 18:44:06.391073] [DEBUG] Prefix is ONWZ5
[2022-02-19 18:44:06.391073] [DEBUG] Script dir is C:\Users\Martin\pdf2pdfocr-venv\Scripts
[2022-02-19 18:44:06.391073] [DEBUG] Parallel operations will use 4 CPUs
[2022-02-19 18:44:06.507230] [LOG] Welcome to pdf2pdfocr version 1.9.1 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2022-02-19 18:44:06.641453] [LOG] Input file C:\Users\Martin\Desktop\471685214.pdf: type is application/pdf
[2022-02-19 18:44:06.641453] [DEBUG] Conversion params:
[2022-02-19 18:44:06.641453] [DEBUG] Output file: C:\Users\Martin\Desktop\471685214-OCR.pdf for PDF and C:\Users\Martin\Desktop\471685214-OCR.pdf.txt for TXT
[2022-02-19 18:44:06.641453] [LOG] Converting input file to images...
[2022-02-19 18:44:06.903005] [LOG] Starting OCR with tesseract...
[2022-02-19 18:44:07.365611] [LOG] OCR completed
[2022-02-19 18:44:07.365611] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

Do I need cuneiform? I read your Windows install.txt file and understood it to be optional, but maybe I'm wrong.
It's an interesting tool and would fit my needs perfectly for creating a database of my private papers.
thx a lot
Martin

"-g grayscale" fail

The script aborts when this option is used.

Tesseract 4 LSTM (--oem 1)

Is there a flag to set --oem 1 for Tesseract 4, as documented here?

TypeError: can't concat str to ByteStringObject in edit_producer

pdf2pdfocr gui error when selecting output file

If the output file is selected in pdf2pdfocr gui, this file must currently already exist, which is obviously not reasonable.

pdf2pdfocr changing languages

This wasn't included in the README, but here is some info for anyone else who is lost.
You can change which language model is downloaded by editing this command:
aria2c "https://github.com/tesseract-ocr/tessdata/blob/main/por.traineddata?raw=true" --dir="%TESSDATA_PREFIX%"
Change the language prefix to the language you want, as long as it is available in the tesseract repo. For example, here is Swedish - "swe":

Further info here:
https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#LANGUAGES

To change the default language, edit line 548 of pdf2pdfocr.py, replacing Portuguese + English ("por+eng") with the languages you want. I use Swedish + English ("swe+eng"):
self.tess_langs = "por+eng" # Default
to
self.tess_langs = "swe+eng" # Default

For example to get Swedish
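
The URL pattern and the "+"-joined language string described above can be sketched as two small helpers. The function names are hypothetical; the URL template is the one quoted in this post with the language code substituted.

```python
TESSDATA_URL = (
    "https://github.com/tesseract-ocr/tessdata/blob/main/{lang}.traineddata?raw=true"
)


def traineddata_url(lang):
    """Download URL for one language code, e.g. 'swe' or 'por'."""
    return TESSDATA_URL.format(lang=lang)


def tess_langs(*codes):
    """Join language codes the way the default 'por+eng' is written."""
    return "+".join(codes)
```

Here `traineddata_url("swe")` gives the address to pass to aria2c, and `tess_langs("swe", "eng")` gives the value for the default language line.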

Documentation update

I found out that following the macports installation documentation did not properly install the modules with pip. In order to make it work, I had to use pip3 such as:

sudo pip3 install reportlab Gooey
sudo pip3 install https://github.com/mstamy2/PyPDF2/archive/master.zip
sudo pip3 install lxml beautifulsoup4

Multiple Files Together

Hey, is there any option to OCR multiple files together (or any script that I can use)? Doing them one by one takes a lot of time. Thanks for this awesome tool, btw!

Blank file

With some old poppler versions and specific PDFs, script is generating blank pages.

the right edge of the text is not fully highlighted

Hello

In the PDF, after recognition with pdf2pdfocr_gui.py, the right edge of the text is not fully highlighted if 'tesseract' is selected for the '-e' option.
If 'native' is selected, the entire text is highlighted correctly, but there are no Cyrillic characters.
see the screenshots...

if you say that this is a tesseract problem, then this is incorrect, because I recognize djvu with 'ocrodjvu' and the text is highlighted correctly after recognition.

can this be corrected?

Zero OCR'ed files

File: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf
[2023-01-14 19:20:35.717707] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-01-14 19:20:35.733704] [DEBUG] Tesseract version: 5
[2023-01-14 19:20:35.736704] [DEBUG] cuneiform not available
[2023-01-14 19:20:35.781705] [DEBUG] Pdftoppm version: 22.12.0
[2023-01-14 19:20:35.811712] [DEBUG] Qpdf version: 11.2.0
[2023-01-14 19:20:35.811712] [DEBUG] Temp dir is C:\Users\ADMINI~1\AppData\Local\Temp\pdf2pdfocr_L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Prefix is L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Script dir is c:\Users\Administrator\anaconda3\Scripts
[2023-01-14 19:20:35.812712] [DEBUG] Parallel operations will use 20 CPUs
[2023-01-14 19:20:35.861715] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-01-14 19:20:35.903716] [LOG] Input file D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf: type is application/pdf
[2023-01-14 19:20:35.918716] [DEBUG] User conversion params: best
[2023-01-14 19:20:35.918716] [DEBUG] Output file: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf for PDF and D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf.txt for TXT
[2023-01-14 19:20:35.918716] [LOG] Converting input file to images...
[2023-01-14 19:20:43.633767] [LOG] Checking blank pages
C:\Users\Administrator\anaconda3\lib\site-packages\PIL\Image.py:3074: DecompressionBombWarning: Image size (105023996 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
[2023-01-14 19:20:44.652767] [LOG] Starting OCR with tesseract...
[2023-01-14 19:20:45.154768] [LOG] OCR completed
[2023-01-14 19:20:45.155767] [DEBUG] We have 0 ocr'ed files
Error: No PDF files generated after OCR. This is not expected. Aborting.

Integration with Google Vision API

Hello,
I want to replace the Tesseract engine with the Google Vision API. Can you please suggest how to do this?
thanks

How to use this directly without docker on windows 11?

How can I use this tool directly on Windows 11 without Docker?
I'd like to utilize it as a python function API that accepts arguments and generates the OCR'd file.
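
Short of a real Python API, the command-line script can be wrapped in a function. This is a minimal sketch that uses only flags that appear in commands elsewhere in this listing (-i, -o, -l, -r, -v). The function names and the default script location are assumptions.

```python
import subprocess
import sys


def build_command(input_file, output_file=None, lang=None,
                  resolution=None, verbose=False, script="pdf2pdfocr.py"):
    """Assemble the argument list for one pdf2pdfocr.py run."""
    cmd = [sys.executable, script, "-i", input_file]
    if output_file is not None:
        cmd += ["-o", output_file]
    if lang is not None:
        cmd += ["-l", lang]
    if resolution is not None:
        cmd += ["-r", str(resolution)]
    if verbose:
        cmd.append("-v")
    return cmd


def run_ocr(input_file, **options):
    """Run the script and return its exit code (0 is assumed to mean success)."""
    return subprocess.run(build_command(input_file, **options)).returncode
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.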

Bad insertion text on PDF

PDF source :
Module 3 - .v2.pdf

I'm trying to OCR the text in my PDF for personal use.
I've checked the generated TXT file, and it's correct (I can see the proper text).
But when I open the OCR'ed PDF file and search for text, it does not work. If I copy/paste text from the PDF, I get garbled text:

22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa
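
Read backwards, the pasted fragment becomes ordinary English with the spaces missing, which suggests that the text layer holds the characters in reverse order. The sketch below only demonstrates that observation on part of the sample; it does not explain the cause.

```python
def reverse_text(text):
    """Return the characters of text in reverse order."""
    return text[::-1]


sample = "puorgrevresnoitacilppaehtotylnonepo"
print(reverse_text(sample))  # prints "openonlytotheapplicationservergroup"
```

A reversed text layer would also explain why searching the PDF fails while the separate TXT file looks correct.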

Output file could not be created

Any ideas why this would be failing? Unable to generate a final PDF

pdf2pdfocr.py -i test.pdf -o test2.pdf -v -k -r 200

[2020-11-03 11:16:18.697012] [LOG] Tesseract can 'textonly_pdf': True
[2020-11-03 11:16:18.702393] [LOG] Tesseract version: 4
[2020-11-03 11:16:18.702628] [DEBUG] cuneiform not available
[2020-11-03 11:16:18.716257] [DEBUG] Temp dir is /tmp/
[2020-11-03 11:16:18.716342] [DEBUG] Prefix is C6UIH
[2020-11-03 11:16:18.716374] [DEBUG] Script dir is /usr/local/bin/
[2020-11-03 11:16:18.716462] [DEBUG] Parallel operations will use 1 CPUs
[2020-11-03 11:16:18.716560] [LOG] Welcome to pdf2pdfocr version 1.6.1 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2020-11-03 11:16:18.719250] [LOG] Input file /home/john/test.pdf: type is application/pdf
[2020-11-03 11:16:18.720583] [DEBUG] Output file: test2.pdf for PDF and test2.pdf.txt for TXT
[2020-11-03 11:16:18.720644] [LOG] Converting input file to images...
[2020-11-03 11:16:19.544142] [LOG] Starting OCR with tesseract...
[2020-11-03 11:16:19.550422] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2020-11-03 11:16:24.553051] [LOG] OCR completed
[2020-11-03 11:16:24.553681] [DEBUG] We have 1 ocr'ed files
[2020-11-03 11:16:24.557630] [DEBUG] Joined ocr'ed PDF files
[2020-11-03 11:16:24.557677] [DEBUG] Merging with OCR
[2020-11-03 11:16:24.564783] [DEBUG] Fail to merge source PDF with extracted OCR text. Trying to fix source PDF to build final file...
[2020-11-03 11:16:25.222864] [DEBUG] Merging with OCR
Output file could not be created :( Exiting with error code.

No PDF files generated after OCR. This is not expected. Aborting.

Something seems to be wrong. I am running MacOS 10.13.6 with a fresh macports installation.

[2018-07-21 12:18:53.392458] [LOG]      Input file /Users/emoret/Downloads/01-19-2017.pdf: type is application/pdf
PdfReadWarning: Multiple definitions in dictionary at byte 0xa9769 for key /Outlines [generic.py:588]
[2018-07-21 12:18:53.400228] [DEBUG]    Output file: /Users/emoret/Downloads/01-19-2017-OCR.pdf for PDF and /Users/emoret/Downloads/01-19-2017-OCR.pdf.txt for TXT
[2018-07-21 12:18:53.400349] [LOG]      Converting input file to images...
[2018-07-21 12:18:54.256488] [LOG]      Starting OCR...
[2018-07-21 12:18:54.268721] [LOG]      Waiting for OCR to complete. 0/5 pages completed...
[2018-07-21 12:18:59.271505] [LOG]      OCR completed
[2018-07-21 12:18:59.273427] [DEBUG]    We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

Application icon

It would be nice for the installation script to create an icon for the gui that would appear as an Application. This would allow running without starting the terminal.
