Comments (12)
You are simply disallowing any whitespace character on your output with your whitelist.
In your direct call to Tesseract, you have
-c tessedit_char_whitelist='abcdefghijklm nopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '
while your pytesseract call effectively ends up with
-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.
(for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract). Using
print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "'))
(or any variation which properly quotes the allowed characters) will generate the correct Tesseract call and correct output accordingly.
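The effect of the quoting can be reproduced without Tesseract. The following is a minimal sketch, assuming pytesseract tokenizes the config string in a shell-like way. It uses shlex.split from the Python standard library in its default POSIX mode as a stand-in, and the shortened whitelist is made up for illustration. Tokenization on Windows may differ.

```python
import shlex

# Hypothetical, shortened whitelist: the letters a-c, a period and a space.
unquoted = "--psm 6 -c tessedit_char_whitelist=abc. "
quoted = '--psm 6 -c tessedit_char_whitelist="abc. "'

# Split both config strings the way a POSIX shell would.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(unquoted_args)
print(quoted_args)
```

With the unquoted string, the trailing space acts as a separator and never becomes part of the whitelist value. With the quotes, it stays inside the value.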
from pytesseract.
"I have run tox"
Just out of curiosity: Why? Where did you read that this would be required?
from pytesseract.
Thank you. You are correct. For me the issue was the use of single/double quotes. I used double quotes because I want to include a single quote in my whitelist, for names like O'Leary. So I guess I need to escape my single quote to include it.
I ran tox because somewhere in the install I was advised to. I thought it checked the system configuration.
I don't know the syntax for when you say "for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract". I haven't got that far in the manual yet.
You can close this issue or let me know if I should.
from pytesseract.
Sorry. You may have found some of my previous comments confusing.
Now that I have paid closer attention to what you have written, please consider the following:
- The issue of interword spaces not being preserved is non-existent based on your example producing the required spaces
- My misunderstanding of the syntax is in regard to the use of single and double quotation marks
- My reference to using double quotes around the literal single quote for the name O'Leary has come later and should not be considered a part of this discussion
Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.
I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract".
from pytesseract.
Thanks for the explanations and glad to see that this could be solved.
"Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard."
No problem.
"I'm sorry, even after reading the docs, I don't know what you mean when you write 'try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract'"
This should show which arguments are being passed to Tesseract; it requires modifying the installed pytesseract source file to include the corresponding print statement. In your case, this enabled me to see that parsing the parameters did indeed drop the desired whitespace characters due to the invalid/missing quoting/escaping. A possible debugging solution can be seen in #483.
"You can close this issue or let me know if I should."
As I am just a Tesseract/pytesseract user myself who just looks at the issues here and tries to debug usage errors, I cannot close this issue myself. If it is resolved for you, feel free to close it yourself to keep the issue list clean.
from pytesseract.
User error was the problem.
from pytesseract.
"I tried your code & included my own path\to\tesseract.exe (which seemed to be missing from your #483 example)"
This is correct, as I am on Linux where the system-wide installation of the binary is callable out-of-the-box.
"Received no warning level debug message - just the same output with missing spaces"
You should receive a debug-level message, not a warning-level one. The first logging configuration in my example code just sets the default logging level (of the root logger) to WARNING, so that debug-level messages are only received from pytesseract.
If the log messages are still not visible after applying the patch from #483, your execution environment might filter the output further, for example when used inside the interactive Python shell. Running plain python script.py should usually show these messages in this case.
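A logging setup along these lines could look as follows. This is a sketch, not the code from #483. It assumes that pytesseract emits its messages through a logger named pytesseract, and the helper name is hypothetical.

```python
import logging


def enable_pytesseract_debug_logging() -> logging.Logger:
    """Keep the root logger at WARNING, but let one logger emit DEBUG."""
    # Default level for everything that is not configured explicitly.
    logging.basicConfig(level=logging.WARNING)

    # Assumption: pytesseract logs through a logger with this name.
    logger = logging.getLogger("pytesseract")
    logger.setLevel(logging.DEBUG)
    return logger


logger = enable_pytesseract_debug_logging()
# Stand-in for the message the library itself would emit.
logger.debug("example debug message")
```

Save this in a script and run it with python script.py before calling any pytesseract function.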
"It seems to me that you are talking about recompiling the tesseract.exe with an extra print statement for debugging purposes? A little clarification would be appreciated."
You have to differentiate between the Tesseract CLI (tesseract.exe) and the pytesseract Python package files here. I am only talking about pytesseract - no need to recompile anything here.
With #483 being merged now, you have (at least) two options:
- Install the package from GitHub using the latest source (see https://github.com/madmaze/pytesseract#installation)
- Edit the installed Python package files manually. This is only recommended if you have at least some further knowledge on this topic. Determine the location of the installed package files (Location field of pip show pytesseract) and edit the pytesseract/pytesseract.py file in there.
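As an alternative to reading the pip show output, the location can also be looked up from Python itself. The sketch below uses importlib.util.find_spec from the standard library. The helper name is hypothetical, and the path it prints is the package's __init__.py, so pytesseract.py would be expected in the same directory.

```python
import importlib.util
from typing import Optional


def locate_package(name: str) -> Optional[str]:
    """Return the file a top-level package is loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints None if pytesseract is not installed in this environment.
print(locate_package("pytesseract"))
```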
from pytesseract.
"I don't see what part of this debug logging shows me which CLI arguments pytesseract passes to Tesseract itself. What I see is extra escaping of path components, a reference to tess_1zkcz_p9 and finally a reference to 'txt'?"
The log entry shows the parameters for the actual subprocess call. It consists of the following parts:
- C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract is the actual Tesseract binary, which you specify by tesseract_cmd. Because your value is a raw string (r prefix) that already contains doubled backslashes, the regular string representation escapes each backslash again, as a single \ would otherwise start a (possibly invalid) escape sequence.
- C:\\ocr\\target\\31832_226140__0001-00002b.jpg is the input file you have specified. As you already escaped your backslashes, no transformations have been done.
- C:\\Users\\david\\AppData\\Local\\Temp\\tess_1zkcz_p9 is a temporary file which acts as the reference name/basename for any output files Tesseract will generate.
- The DPI, PSM and configuration parameters (-c) are the parsed version of the config parameter you are passing.
- txt tells Tesseract which configuration file to use - in this case to generate a plain text .txt file (due to pytesseract.image_to_string). See the CONFIGFILE section of https://manpages.ubuntu.com/manpages/jammy/man1/tesseract.1.html for example to see common values.
This list is passed to subprocess.Popen (could be subprocess.run as well) and evaluated there. The final internal system call basically joins the components together using ' '.join(), while correctly adding quotes around components where necessary.
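To get a feeling for how such an argument list maps to a single command line, it can be rendered with shlex.join (Python 3.8+). This is only for display and uses POSIX shell quoting rules; subprocess does its own platform-specific handling. The file names and the shortened whitelist below are placeholders.

```python
import shlex

# Placeholder argument list, shaped like the logged one.
cmd_args = [
    "tesseract",
    "input.jpg",
    "tess_output",
    "--psm", "6",
    "-c", "tessedit_char_whitelist=abc. ",
    "txt",
]

# Join into one string, quoting only the items that need it.
command_line = shlex.join(cmd_args)
print(command_line)
```

Only the whitelist argument gets quoted, because it is the only item containing a space.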
from pytesseract.
Yes, for the tesseract_cmd, your double backslashes are not required due to the raw string. The translation essentially is:
>>> r'C:\Program Files\Tesseract-OCR\tesseract' == 'C:\\Program Files\\Tesseract-OCR\\tesseract'
True
The image path is not a raw string in this case, thus the \t and \31 escape sequences have to be escaped properly, with additional backslashes as you do. You could write this as a raw string as well, avoiding the escapes:
>>> 'C:\ocr\target\31832_226140__0001-00002b.jpg'
'C:\\ocr\target\x19832_226140__0001-00002b.jpg'
>>> print('C:\ocr\target\31832_226140__0001-00002b.jpg')
C:\ocr arget832_226140__0001-00002b.jpg
>>> r'C:\ocr\target\31832_226140__0001-00002b.jpg' == 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
True
from pytesseract.