pytesseract.image_to_string not preserving interword spaces (madmaze/pytesseract)

Comments (12)

stefan6419846 commented on July 30, 2024

You are simply disallowing any whitespace character on your output with your whitelist.

In your direct call to Tesseract, you have -c tessedit_char_whitelist='abcdefghijklm nopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. ', while you effectively have -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. (without the trailing space) in your pytesseract call (for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract). Using

print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "'))

(or any variation which properly quotes the allowed characters) will generate the correct Tesseract call and correct output accordingly.
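To illustrate why the quoting matters, here is a minimal sketch using only the standard library. It assumes the config string is tokenized with shell-like rules such as those of shlex.split (the shortened whitelist and the variable names are made up for the example):

```python
import shlex

# Trailing space in the whitelist is NOT protected by quotes:
unquoted_config = "--psm 6 -c tessedit_char_whitelist=abc. "
# Trailing space is protected by double quotes inside the config string:
quoted_config = '--psm 6 -c tessedit_char_whitelist="abc. "'
# Double quotes also allow a literal single quote (e.g. for O'Leary):
apostrophe_config = "--psm 6 -c tessedit_char_whitelist=\"abc.' \""

unquoted_args = shlex.split(unquoted_config)
quoted_args = shlex.split(quoted_config)
apostrophe_args = shlex.split(apostrophe_config)

# repr() makes the presence or absence of the trailing space visible.
print(repr(unquoted_args[-1]))    # whitespace treated as a separator and dropped
print(repr(quoted_args[-1]))      # whitespace kept as part of the whitelist
print(repr(apostrophe_args[-1]))  # single quote and whitespace both kept
```

Without quotes, the space only separates arguments and never reaches the whitelist, which matches the behaviour described above.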

from pytesseract.

stefan6419846 commented on July 30, 2024

> I have run tox

Just out of curiosity: Why? Where did you read that this would be required?

willdave865 commented on July 30, 2024

Thank you. You are correct. For me the issue was the use of single/double quotes. I used double because I want to include a single quote in my whitelist. For names like O'Leary. So I guess I need to escape my single quote to include it.

I ran tox because somewhere in the install I was advised to. I thought it checked the system configuration.

I don't know the syntax for when you say "for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract" I haven't got that far in the manual yet.

You can close this issue or let me know if I should.

willdave865 commented on July 30, 2024

Sorry. You may have found some of my previous comments confusing.

Now that I have paid closer attention to what you have written, please consider the following:

- The issue of interword spaces not being preserved is non-existent, since your example produces the required spaces.
- My misunderstanding of the syntax concerns the use of single and double quotation marks.
- My reference to using double quotes around the literal single quote for the name O'Leary came later and should not be considered a part of this discussion.

Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.

I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract"

stefan6419846 commented on July 30, 2024

Thanks for the explanations and glad to see that this could be solved.

> Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.

No problem.

> I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract"

This basically should show what commands are being passed to Tesseract and requires modifying the distribution file to include the corresponding print statement. In your case, this enabled me to see that parsing the parameters would indeed drop the desired whitespace characters due to the invalid/missing quoting/escaping. A possible debugging solution can be seen in #483.

> You can close this issue or let me know if I should.

As I am just a Tesseract/pytesseract user myself who just looks at the issues here and tries to debug usage errors, I cannot close this issue myself. If it is resolved for you, feel free to close it yourself to keep the issue list clean.

willdave865 commented on July 30, 2024

User error was the problem.

willdave865 commented on July 30, 2024

With regard to #483 please consider the following:

- I read https://docs.python.org/3/howto/logging.html and confirmed with examples that I had this module code working
- I tried your code & included my own path\to\tesseract.exe (which seemed to be missing from your #483 example)
- I included my faulty config line in the pytesseract call
- Received no warning level debug message, just the same output with missing spaces

With respect, you say "requires modifying the distribution file to include the corresponding print statement". I didn't understand this. It seems to me that you are talking about recompiling the tesseract.exe with an extra print statement for debugging purposes? A little clarification would be appreciated. Thank you.

stefan6419846 commented on July 30, 2024

> I tried your code & included my own path\to\tesseract.exe (which seemed to be missing from your #483 example)

This is correct, as I am on Linux where the system-wide installation of the binary is callable out-of-the-box.

> Received no warning level debug message - just the same output with missing spaces

You should receive a debug-level message, not a warning-level one. The first logging configuration in my example code just sets the default logging level (of the root logger) to WARNING to only receive debug-level messages from pytesseract.

If the log messages are still not visible after applying the patch from #483, your execution environment might filter the output further, for example when used inside the interactive Python shell. Plain python script.py should usually show these messages in this case.
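The logging setup described above can be sketched as follows. This is a standard-library-only illustration, not the code from #483: the logger name 'pytesseract' is inferred from the DEBUG:pytesseract: prefix in the log output, and 'some_other_library' is a made-up name.

```python
import logging

# Root logger at WARNING: libraries in general stay quiet.
logging.basicConfig(level=logging.WARNING)

# Only the logger named 'pytesseract' is lowered to DEBUG.
pytesseract_logger = logging.getLogger('pytesseract')
pytesseract_logger.setLevel(logging.DEBUG)

# Hypothetical second library, used to show that its debug output stays hidden.
other_logger = logging.getLogger('some_other_library')

# Emitted, because the 'pytesseract' logger accepts DEBUG records.
pytesseract_logger.debug('%r', ['tesseract', 'input.jpg', 'output_base', 'txt'])
# Suppressed, because this logger inherits WARNING from the root logger.
other_logger.debug('this message is filtered out')
```

With this configuration in place before calling pytesseract.image_to_string, a plain python script.py run should print the debug record to stderr.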

> It seems to me that you are talking about recompiling the tesseract.exe with an extra print statement for debugging purposes? A little clarification would be appreciated.

You have to differentiate between the Tesseract CLI (tesseract.exe) and the pytesseract Python package files here. I am only talking about pytesseract - no need to recompile anything here.

With #483 being merged now, you have (at least) two options:

1. Install the package from GitHub using the latest source (see https://github.com/madmaze/pytesseract#installation)
2. Edit the installed Python package files manually. This is only recommended if you have at least some further knowledge on this topic. Determine the location of the installed package files (Location field of pip show pytesseract) and edit the pytesseract/pytesseract.py file in there.

willdave865 commented on July 30, 2024

Hello Stefan,

DEBUG:pytesseract:['C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract', 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg', 'C:\\Users\\david\\AppData\\Local\\Temp\\tess_1zkcz_p9', '--dpi', '96', '--psm', '6', '-c', 'preserve_interword_spaces=1', '-c', 'tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.', 'txt']

I don't see what part of this debug logging shows me which CLI arguments pytesseract passes to Tesseract itself. What I see is extra escaping of path components, a reference to tess_1zkcz_p9 and finally a reference to 'txt'? But thank you for your troubleshooting help and the effort you are putting in.

stefan6419846 commented on July 30, 2024

> I don't see what part of this debug logging shows me which CLI arguments pytesseract passes to Tesseract itself. What I see is extra escaping of path components, a reference to tess_1zkcz_p9 and finally a reference to 'txt'?

The log entry shows the parameters for the actual subprocess call. It consists of the following parts:

- C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract is the actual Tesseract binary, which you specify by tesseract_cmd. Because your value is a raw string (r prefix), every backslash you typed is a literal character, so the regular string representation shown in the log escapes each of them; a bare \ would otherwise start a (possibly invalid) escape sequence.
- C:\\ocr\\target\\31832_226140__0001-00002b.jpg is the input file you have specified. As you already escaped your backslashes, no transformations have been done.
- C:\\Users\\david\\AppData\\Local\\Temp\\tess_1zkcz_p9 is a temporary file which acts as the reference name/basename for any output files Tesseract will generate.
- The DPI, PSM and configuration parameters (-c) are the parsed version of the config parameter you are passing.
- txt tells Tesseract which configuration file to use, in this case to generate a plain text .txt file (due to pytesseract.image_to_string). See the CONFIGFILE section of https://manpages.ubuntu.com/manpages/jammy/man1/tesseract.1.html for common values.

This list is passed to subprocess.Popen (could be subprocess.run as well) and evaluated there. The final internal system call basically joins the components together using ' '.join(), while correctly adding quotes around components where necessary.
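As a rough illustration of that joining step, the standard library's shlex.join builds a shell-style command line from such a list. This is only a sketch: the file names and the shortened whitelist are made up, POSIX systems hand the list to the program without joining it at all, and Windows applies its own quoting rules.

```python
import shlex

# Hypothetical argument list, shaped like the one in the debug log above.
cmd_args = [
    'tesseract', 'input.jpg', 'output_base',
    '--psm', '6',
    '-c', 'tessedit_char_whitelist=abc. ',  # note the trailing space
    'txt',
]

# Join into a single command line; components containing whitespace get quoted.
command_line = shlex.join(cmd_args)
print(command_line)

# Splitting the command line again recovers the original list unchanged.
round_trip = shlex.split(command_line)
print(round_trip == cmd_args)
```

The component with the trailing space is the only one that needs quotes, which is why a whitelist that lost its space during parsing cannot be repaired at this stage.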

willdave865 commented on July 30, 2024

Thank you. Almost everything you have written makes sense. I also looked at the Tesseract man page docs. Just for clarification please consider the following:

Re: C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract is the actual Tesseract binary, which you specify by tesseract_cmd. Due to your value being a raw string (r prefix), the regular string will escape each backslash as \ will point to an (possibly invalid) escape sequence otherwise.

My call to tesseract executable was:

pytesseract.pytesseract.tesseract_cmd = r'C:\\Program Files\\Tesseract-OCR\\tesseract'

I escape because I'm using Windows and that's what we do in Python. However it seems like I don't have to escape the quoted file path slashes when using raw. Because according to https://docs.python.org/3/reference/lexical_analysis.html#string-literals

"Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters. As a result, in string literals, '\U' and '\u' escapes in raw strings are not treated specially."

This can be proven because there is no error generated using my path C:\Program Files\Tesseract-OCR\tesseract (which is not true for my image source path).

So it seems the DEBUG:pytesseract output then escapes the backslashes of the raw file path string on output, but leaves the previously escaped image backslashes as they were. That is how I'm thinking.

stefan6419846 commented on July 30, 2024

Yes, for the tesseract_cmd, your double backslashes are not required due to the raw string. The translation essentially is:

>>> r'C:\Program Files\Tesseract-OCR\tesseract' == 'C:\\Program Files\\Tesseract-OCR\\tesseract'
True

The image path is not a raw string in this case, so the \t and \31 escape sequences have to be escaped properly, with additional backslashes as you do. You could write this as a raw string as well, avoiding the escapes:

>>> 'C:\ocr\target\31832_226140__0001-00002b.jpg'
'C:\\ocr\target\x19832_226140__0001-00002b.jpg'
>>> print('C:\ocr\target\31832_226140__0001-00002b.jpg')
C:\ocr	arget832_226140__0001-00002b.jpg
>>> r'C:\ocr\target\31832_226140__0001-00002b.jpg' == 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
True
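The differing backslash counts in the debug log can be reproduced with a short sketch. It assumes the log entry shows the repr() of each argument, which is consistent with the quoted list form in the output; the two values are the ones from this thread.

```python
# Raw string with doubled backslashes: each typed backslash is a literal character.
tesseract_cmd = r'C:\\Program Files\\Tesseract-OCR\\tesseract'
# Regular string with escaped backslashes: each pair is ONE backslash character.
image_path = 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'

# Number of actual backslash characters stored in each string.
print(tesseract_cmd.count('\\'))  # two per path separator
print(image_path.count('\\'))     # one per path separator

# repr() escapes every stored backslash again, doubling the count on display.
print(repr(tesseract_cmd))  # four backslashes per separator, as in the log
print(repr(image_path))     # two backslashes per separator, as in the log

# print() shows the characters that are really stored.
print(tesseract_cmd)
print(image_path)
```

So the log does not treat the two paths differently: the raw string simply contains twice as many backslashes to begin with.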
