
Comments (12)

stefan6419846 avatar stefan6419846 commented on July 30, 2024

You are simply disallowing any whitespace character in your output with your whitelist.

In your direct call to Tesseract, you have -c tessedit_char_whitelist='abcdefghijklm nopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. ', while you basically have -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. in your pytesseract call (for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract). Using

print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "'))

(or any variation which properly quotes the allowed characters) will generate the correct Tesseract call and correct output accordingly.
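
To illustrate why the quoting matters, here is a minimal sketch using the standard library's shlex.split in its default POSIX mode. That pytesseract splits the config string with shell-like rules is an assumption here; the exact parsing may differ by pytesseract version and platform. The helper whitelist_from is hypothetical and only exists for this illustration.

```python
import shlex

ALLOWED = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "


def whitelist_from(config):
    """Return the whitelist value after shell-style splitting, or None."""
    prefix = "tessedit_char_whitelist="
    for token in shlex.split(config):  # POSIX-style splitting (assumption)
        if token.startswith(prefix):
            return token[len(prefix):]
    return None


# Without quotes, the trailing space is treated as a token separator and lost.
unquoted = "--psm 6 -c tessedit_char_whitelist=" + ALLOWED
# With quotes, the space stays part of the whitelist value.
quoted = '--psm 6 -c tessedit_char_whitelist="' + ALLOWED + '"'

print(repr(whitelist_from(unquoted)))
print(repr(whitelist_from(quoted)))
```

Comparing the two printed values shows whether the space character survived the splitting step.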

from pytesseract.

stefan6419846 avatar stefan6419846 commented on July 30, 2024

> I have run tox

Just out of curiosity: Why? Where did you read that this would be required?


willdave865 avatar willdave865 commented on July 30, 2024

Thank you. You are correct. For me the issue was the use of single/double quotes. I used double quotes because I want to include a single quote in my whitelist, for names like O'Leary. So I guess I need to escape my single quote to include it.

I ran tox because somewhere in the install I was advised to. I thought it checked the system configuration.

I don't know the syntax for what you describe as "for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract". I haven't got that far in the manual yet.

You can close this issue or let me know if I should.


willdave865 avatar willdave865 commented on July 30, 2024

Sorry. You may have found some of my previous comments confusing.

Now that I have paid closer attention to what you have written, please consider the following:

  • The issue of interword spaces not being preserved is non-existent based on your example producing the required spaces
  • My misunderstanding of the syntax is in regard to the use of single and double quotation marks
  • My reference to using double quotes around the literal single quote for the name O'Leary has come later and should not be considered a part of this discussion

Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.

I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract"


stefan6419846 avatar stefan6419846 commented on July 30, 2024

Thanks for the explanations and glad to see that this could be solved.

> Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.

No problem.

> I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract"

This should show which arguments are being passed to Tesseract. It requires modifying the installed pytesseract source file to include the corresponding print statement. In your case, this enabled me to see that parsing the parameters would indeed drop the desired whitespace characters due to the invalid/missing quoting/escaping. A possible debugging solution can be seen in #483.

> You can close this issue or let me know if I should.

As I am just a Tesseract/pytesseract user myself who just looks at the issues here and tries to debug usage errors, I cannot close this issue myself. If it is resolved for you, feel free to close it yourself to keep the issue list clean.


willdave865 avatar willdave865 commented on July 30, 2024

User error was the problem.


willdave865 avatar willdave865 commented on July 30, 2024


stefan6419846 avatar stefan6419846 commented on July 30, 2024

> I tried your code & included my own path\to\tesseract.exe (which seemed to be missing from your #483 example)

This is correct, as I am on Linux where the system-wide installation of the binary is callable out-of-the-box.

> Received no warning level debug message - just the same output with missing spaces

You should receive a debug-level message, not a warning-level one. The first logging configuration in my example code just sets the default logging level (of the root logger) to WARNING, so that debug-level messages are only shown for pytesseract and not for every other library.

If the log messages are still not visible after applying the patch from #483, your execution environment might filter the output further, for example when used inside the interactive Python shell. Plain python script.py should usually show these messages in this case.
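
As a rough sketch of the logging setup described above: the root logger stays at WARNING while one named logger is lowered to DEBUG. The logger name "pytesseract" is an assumption on my side; check the patched source for the name it actually uses. The helper enable_pytesseract_debug_logging is hypothetical.

```python
import logging


def enable_pytesseract_debug_logging():
    """Keep the root logger at WARNING, but let one library log at DEBUG."""
    # Install a default handler if none is configured yet.
    logging.basicConfig(level=logging.WARNING)
    # Set the root level explicitly in case logging was configured before.
    logging.getLogger().setLevel(logging.WARNING)
    # Assumed logger name; only this logger emits debug-level records.
    logger = logging.getLogger("pytesseract")
    logger.setLevel(logging.DEBUG)
    return logger


logger = enable_pytesseract_debug_logging()
logger.debug("debug messages from this logger are now shown")
logging.getLogger("some.other.library").debug("this one stays hidden")
```

Run this from a plain script before calling pytesseract; only records from the named logger should appear at debug level.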

> It seems to me that you are talking about recompiling the tesseract.exe with an extra print statement for debugging purposes? A little clarification would be appreciated.

You have to differentiate between the Tesseract CLI (tesseract.exe) and the pytesseract Python package files here. I am only talking about pytesseract - no need to recompile anything here.

With #483 being merged now, you have (at least) two options:

  1. Install the package from GitHub using the latest source (see https://github.com/madmaze/pytesseract#installation)
  2. Edit the installed Python package files manually. This is only recommended if you have at least some further knowledge on this topic. Determine the location of the installed package files (Location field of pip show pytesseract) and edit the pytesseract/pytesseract.py file in there.
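
For the second option, the location of a module's source file can also be looked up from Python itself. This is only a sketch: the helper installed_source_file is hypothetical, and the demo call uses the standard library's json package so that it runs without pytesseract installed. Passing "pytesseract.pytesseract" instead should point at the file mentioned above, assuming the package is installed.

```python
import importlib.util
from pathlib import Path


def installed_source_file(module_name):
    """Return the path of the file a module would be loaded from, or None."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is not installed.
        return None
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


# Demo with a standard library package; replace the name as needed.
print(installed_source_file("json"))
```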


willdave865 avatar willdave865 commented on July 30, 2024


stefan6419846 avatar stefan6419846 commented on July 30, 2024

> I don't see what part of this debug logging shows me which CLI arguments pytesseract passes to Tesseract itself. What I see is extra escaping of path components, a reference to tess_1zkcz_p9 and finally a reference to 'txt'?

The log entry shows the parameters for the actual subprocess call. It consists of the following parts:

  • C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract is the actual Tesseract binary, which you specify by tesseract_cmd. Because your value is a raw string (r prefix) that already contains doubled backslashes, each of them is kept literally and the log output escapes every backslash once more. In a regular string, a single \ would otherwise start a (possibly invalid) escape sequence.
  • C:\\ocr\\target\\31832_226140__0001-00002b.jpg is your input file you have specified. As you already escaped your backslashes, no transformations have been done.
  • C:\\Users\\david\\AppData\\Local\\Temp\\tess_1zkcz_p9 is a temporary file which acts as the reference name/basename for any output files Tesseract will generate.
  • The DPI, PSM and configuration parameters (-c) are the parsed version of the config parameter you are passing.
  • txt tells Tesseract which configuration file to use - in this case to generate a plain text .txt file (due to pytesseract.image_to_string). See the CONFIGFILE section of https://manpages.ubuntu.com/manpages/jammy/man1/tesseract.1.html for example to see common values.

This list is passed to subprocess.Popen (could be subprocess.run as well) and evaluated there. On Windows, the components are then joined into a single command line, separated by spaces, with quotes added around components where necessary. On POSIX systems, the list is handed to the new process as separate arguments.
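
To get a feeling for how such a list turns into one command line on Windows, the standard library's subprocess.list2cmdline helper can be called directly. This is just an illustration: the paths and the shortened whitelist below are made-up example values, not the ones from your log.

```python
import subprocess

# Hypothetical argument list, shaped like the one in the debug log entry.
cmd_args = [
    r"C:\Program Files\Tesseract-OCR\tesseract",
    r"C:\ocr\input.jpg",
    r"C:\Temp\tess_example",
    "--psm", "6",
    "-c", "tessedit_char_whitelist=abc. ",
    "txt",
]

# Joins the components with spaces and quotes those containing whitespace.
command_line = subprocess.list2cmdline(cmd_args)
print(command_line)
```

Components that contain a space, such as the binary path and the whitelist parameter, come out wrapped in double quotes, while the others are left as they are.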


willdave865 avatar willdave865 commented on July 30, 2024


stefan6419846 avatar stefan6419846 commented on July 30, 2024

Yes, for the tesseract_cmd, your double backslashes are not required due to the raw string. The translation essentially is:

>>> r'C:\Program Files\Tesseract-OCR\tesseract' == 'C:\\Program Files\\Tesseract-OCR\\tesseract'
True

The image path is not a raw string in this case, thus the \t and \31 escape sequences have to be escaped properly, with additional backslashes as you do. You could write it as a raw string as well, avoiding the escapes:

>>> 'C:\ocr\target\31832_226140__0001-00002b.jpg'
'C:\\ocr\target\x19832_226140__0001-00002b.jpg'
>>> print('C:\ocr\target\31832_226140__0001-00002b.jpg')
C:\ocr	arget832_226140__0001-00002b.jpg
>>> r'C:\ocr\target\31832_226140__0001-00002b.jpg' == 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
True

