madmaze / pytesseract Goto Github PK

A Python wrapper for Google Tesseract

License: Apache License 2.0

Python 100.00%

pytesseract's Issues

ValueError

When I import pytesseract, it raises a ValueError from the Image import.

Environment:
python 2.7.10
PIL 1.1.7
pytesseract 0.1.7
Error:
ValueError: Attempted relative import in non-package

Improve the result for small text

Hello

I am trying to use pytesseract for text detection in a Java Swing GUI. The problem is that I don't know how to improve the result, especially when the font is small. Could I have some advice on that?

Thanks,
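
One commonly suggested step for small fonts is to enlarge the image before running OCR. The sketch below is only an illustration of that idea, not a pytesseract feature: `scaled_size` and `upscale_for_ocr` are hypothetical helper names, and the 30-pixel target text height is an assumed starting point that would need tuning per image.

```python
def scaled_size(width, height, text_height_px, target_text_height_px=30):
    """Return the (width, height) that brings text up to the target pixel height."""
    if text_height_px <= 0:
        raise ValueError("text_height_px must be positive")
    # Never shrink: a factor below 1.0 is clamped to 1.0.
    factor = max(1.0, target_text_height_px / float(text_height_px))
    return int(round(width * factor)), int(round(height * factor))


def upscale_for_ocr(image, text_height_px):
    """Enlarge a PIL image so that small glyphs are bigger before OCR."""
    from PIL import Image  # imported here so scaled_size works without Pillow

    new_size = scaled_size(image.width, image.height, text_height_px)
    return image.resize(new_size, Image.LANCZOS)
```

The enlarged image can then be passed to `pytesseract.image_to_string` as usual; whether it helps depends on the screenshot.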

OSError: [Errno 2] No such file or directory error

The image seems to open, but then when I run image_to_string, it throws an exception:

>>> i = Image.open('test.png')
>>> print(pytesseract.image_to_string(i))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 161, in image_to_string
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 94, in run_tesseract
    stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
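
The traceback ends inside `subprocess`, which usually means the `tesseract` program itself could not be started rather than the image being missing. A minimal sketch for checking that, assuming Python 3.3+ (for `shutil.which`); `find_tesseract` is a hypothetical helper name:

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the first candidate that resolves to an executable, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If this returns `None`, Tesseract is probably not installed or not on PATH, and `pytesseract.pytesseract.tesseract_cmd` would need to point at the executable explicitly.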

Error encoding characters

I am testing pytesseract; some images are recognized but others are not.
The images that fail are photos containing text. The error is:

Encoding is cp437
Running...
Traceback (most recent call last):
  File "encuesta.py", line 13, in <module>
    print(pytesseract.image_to_string(Image.open('enc1.tiff'), boxes=False))
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1010-1011: character maps to <undefined>

I tried several chcp commands, e.g.:

chcp 437
chcp 16001
# and others

My actual code is:

import sys
print "Encoding is", sys.stdin.encoding

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
print "Running..."
print(pytesseract.image_to_string(Image.open('enc1.tiff'), boxes=False))

Is there any way for pytesseract to continue with recognition regardless of errors?
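
Judging by the traceback, the exception is raised while `print` encodes the result for the cp437 console, after recognition has already finished. One possible workaround, sketched here under that assumption, is to replace unencodable characters before printing; `to_console_safe` is a hypothetical helper, not part of pytesseract.

```python
import sys


def to_console_safe(text, encoding=None):
    """Replace characters the console encoding cannot represent with '?'."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)
```

Usage would look like `print(to_console_safe(pytesseract.image_to_string(img)))`. Writing the text to a UTF-8 file instead keeps every character.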

CMD windows

While this runs, Python keeps opening CMD windows: http://i.imgur.com/pWI3JDW.png
How can I prevent this?

Permission to push code and raise a pull request

I have added code for getting the output in TSV format. Please give me permission to push it and open a pull request for review.
Thanks

Config for different psm values does not work

When I run print pytesseract.image_to_string(image, boxes=False, config="-psm 0")

I get the processed string of the text in the image. However, psm 0 is supposed to only test the orientation. If you run this via the command line, the output should be like this:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Orientation: 2
Orientation in degrees: 180
Orientation confidence: 0.28
Script: 2
Script confidence: 0.04

The code looks like it should be correct, and you have accounted for this, but I do not get the result. I'm not sure what the problem is, I tried to debug it for a few hours. Still trying.
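
Once the orientation report has been captured as text by whatever means, it is easy to turn into a dictionary. This is a sketch that assumes the `Key: value` layout shown above; `parse_osd` is a hypothetical helper name.

```python
def parse_osd(report):
    """Parse 'Key: value' lines from an orientation report into a dict."""
    result = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip banner lines that have no colon
        value = value.strip()
        try:
            result[key.strip()] = float(value) if "." in value else int(value)
        except ValueError:
            result[key.strip()] = value  # keep non-numeric values as text
    return result
```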

UnicodeDecodeErrors when processing certain images

I ran into instances of a UnicodeDecodeError when processing some images, particularly ones where the type isn't very clear.

It seems that tesseract is emitting some bytes for which Python has trouble picking a codec. I did a bit of digging, and it looks like the standard output from tesseract is encoded in UTF-8:

https://code.google.com/p/tesseract-ocr/wiki/FAQ#What_output_formats_can_Tesseract_produce?

Altering the file open command in pytesseract.py to explicitly state the expected encoding prevents the error, but also changes the print behaviour (returns encoded bytes instead of a string):

f = open(output_file_name, encoding='utf-8', errors='replace')

Altering the return command to encode the output to ascii, and then decode it, allows the current output to be maintained (as far as I can see - tested it with a few working images) and also squashes the UnicodeDecodeErrors.

return f.read().strip().encode('ascii', 'replace').decode('ascii', 'replace')

This seems to work, and it appears to allow any image to be processed to get some output, however useful it may be.

Thoughts welcome - not sure if it's the best way to go about resolving the errors.
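
A possible alternative to the ASCII round trip, sketched under the assumption that the output really is UTF-8: read the file as bytes and decode it with `errors='replace'`. This keeps valid non-ASCII characters and substitutes U+FFFD only for undecodable bytes. `decode_tesseract_output` is a hypothetical helper name.

```python
def decode_tesseract_output(raw_bytes):
    """Decode OCR output as UTF-8, substituting U+FFFD for invalid bytes."""
    return raw_bytes.decode("utf-8", errors="replace").strip()
```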

Where can I download the tesseract executable?

pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'

Include the above line, if you don't have tesseract executable in your PATH

Example tesseract_cmd: 'C:\Program Files (x86)\Tesseract-OCR\tesseract'

Where should I download the executable from? I am working on Ubuntu and Mac OS.

The 'Empty page!!' warning doesn't show in the output

Hi!
When I use tesseract in my terminal (OS X 10.11.3) like this:

tesseract test.png out

It raises a warning: Empty page!!
But when I use pytesseract in my code (Python 2.7.10) like this:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

receipt = Image.open('test.png')
receipt.load()
print(pytesseract.image_to_string(receipt))

The output consists of 2 empty lines.
So the 'Empty page!!' warning doesn't show up in the output.
Thanks.

IOError for OSD (psm 0)

Running pytesseract on Raspbian, python 2.7, tesseract 4.00 (4.00.00dev-625-g9c2fa0d).

My code is as follows:

ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")

Error:

Traceback (most recent call last):
  File "/home/pi/Vocable/TESTING.py", line 114, in <module>
    ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 126, in image_to_string
    f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory: '/tmp/tess_lN5JlN.txt'

If I run code with psm 1 [Recognition with OSD], I receive no errors but the upside down text is simply treated as right-side-up text, producing garbage results. (This was tested on an inverted test.png)

Essentially text recognition works but OSD does not.

I got these errors when I tried to run pytesseract

I installed the OCR engine from the Google page and imported the pytesseract library successfully, but when I run this code in Python I get these errors from pytesseract.py and subprocess.py.

My code is:

9 - from pytesseract import image_to_string
10 -from PIL import Image
11-
12- im = Image.open('an91cut.jpg')
13- print(im)
14-
15- print(image_to_string(im))

And the 5 errors are these:

Traceback (most recent call last):
  File "D:/Documentos/2015-2/Proyecto Electrónico 1/PycharmProjects/Extracción de placa/ocr2.py", line 15, in <module>
    print(image_to_string(im))
  File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 156, in image_to_string
    status, error_string = run_tesseract(input_file_name,output_file_name_base,lang=lang,boxes=boxes,config=config)
  File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 93, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "C:\Python27\lib\subprocess.py", line 706, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,cwd, env, universal_newlines,startupinfo, creationflags, shell,p2cread, p2cwrite,c2pread, c2pwrite, errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 936, in _execute_child
    hp, ht, pid, tid = _subprocess.CreateProcess(executable, args,None, None,int(not close_fds),creationflags,env,cwd, startupinfo)
WindowsError: [Error 5] Acceso denegado (Access denied)

Process finished with exit code 1

Please help. I'm developing a license plate recognition project, and my friends and I are behind on the solution. (Apologies if my English is wrong. Greetings from Peru!)

TypeError: a bytes-like object is required, not 'str'

So, I just installed this lib and I'm using Python 3.5 on windows 7.

My code that's giving the error

myText2 = image_to_string(Image.open("myImage.png"))

Error that I'm getting :

Traceback (most recent call last):
  File "F:/Competitions/Donations/Scrapping.py", line 113, in <module>
    Scrapping()
  File "F:/Competitions/Donations/Scrapping.py", line 58, in __init__
    myText2 = image_to_string(Image.open("captcha.png"))
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'

I went through all the previous Issues regarding this matter and I've updated my tesseract version to 4.0 and have all the trained files available. But, I'm getting this error, what's the issue and how can it be fixed?

Older issue : #32

I even tried this step mentioned in SO answer, but after this, I have the same problem as OP. The error changes to

Traceback (most recent call last):
  File "F:/Competitions/Donations/Scrapping.py", line 113, in <module>
    Scrapping()
  File "F:/Competitions/Donations/Scrapping.py", line 58, in __init__
    myText2 = image_to_string(Image.open("captcha.bmp"))
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 164, in image_to_string
    raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\eng.traineddata')

Unit testing

We need some basic unit tests to ensure when changes are made that we still support the same functionality.

WindowsError: [Error 2]

I ran the compiled executable under Windows 7 with Python 2.7 and got this error:

c:\temp>pytesseract captcha.png
Traceback (most recent call last):
  File "C:\Python27\Scripts\pytesseract-script.py", line 9, in <module>
    load_entry_point('pytesseract==0.1.6', 'console_scripts', 'pytesseract')()
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 187, in main
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 161, in image_to_string
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 94, in run_tesseract
  File "C:\Python27\lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
    startupinfo)
WindowsError: [Error 2]

Question about recognizing text with digits and letters

Hi there,

I need to recognize a batch of verification codes with Tesseract OCR, and I am using pytesseract.

It works perfectly for an image with only letters. However, if the image contains both digits and letters, the output is always wrong. Is there any way to improve quality in this scenario? Or is it possible to involve some supervised process to improve accuracy?

The attachment is an example of a failed recognition; the output is "N 3U’7V".

Here is the code:

import sys
from pytesseract import image_to_string
from PIL import Image
im = Image.open(sys.argv[1])
im = im.convert('L')

def initTable(threshold=140):
    # Build a lookup table mapping pixels below the threshold to 0, others to 1
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)

    return table

binaryImage = im.point(initTable(), '1')
binaryImage.show()

print(image_to_string(binaryImage, config='-psm 7'))
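
For verification codes drawn from a known alphabet, one option that may help is restricting the characters Tesseract is allowed to output through its `tessedit_char_whitelist` variable. The sketch below only builds the config string; `whitelist_config` is a hypothetical helper, `-psm` follows the spelling used in this post (newer Tesseract releases use `--psm`), and how well the whitelist is honoured is assumed to vary with the Tesseract version and engine.

```python
import string


def whitelist_config(allowed_chars, psm=7, psm_flag="-psm"):
    """Build a config string limiting recognition to allowed_chars."""
    if not allowed_chars or any(ch.isspace() for ch in allowed_chars):
        raise ValueError("allowed_chars must be non-empty and contain no whitespace")
    return "{} {} -c tessedit_char_whitelist={}".format(psm_flag, psm, allowed_chars)


# Example alphabet: upper-case letters and digits.
ALNUM_UPPER = string.ascii_uppercase + string.digits
```

The result would be passed as `config=whitelist_config(ALNUM_UPPER)` in the `image_to_string` call.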

Error parsing of tesseract output is brittle: a bytes-like object is required, not 'str'

When using python 3.5 and pillow (the original PIL library is quite old now), I receive an error on this very simple example:

import pytesseract

try:
    import Image
except ImportError:
    from PIL import Image

pytesseract.image_to_string(Image.open('test_image.png'))

The error is:

Traceback (most recent call last):
  File "tesseract_test.py", line 8, in <module>
    pytesseract.image_to_string(Image.open('test_image.png'))
  File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
  File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'

I'm using Windows 10, 64-bit, with python x64.

how can I use the preserve_interword_spaces

I am trying to use the preserve_interword_spaces option, but it doesn't seem to work.
I tried:
pytesseract.image_to_string(wandImg, lang='fra', config="-preserve_interword_spaces 1 -psm 5")
Is there something I am missing?
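
On the Tesseract command line, configuration variables are set as `-c name=value` rather than as their own flags, so the config string probably needs that form. A minimal sketch; `tesseract_vars` is a hypothetical helper, and `-psm 5` is kept exactly as written in the question.

```python
def tesseract_vars(**variables):
    """Render keyword arguments as '-c name=value' options, sorted by name."""
    return " ".join(
        "-c {}={}".format(name, value) for name, value in sorted(variables.items())
    )


# "-psm 5" mirrors the original post; newer Tesseract releases spell it "--psm".
config = "-psm 5 " + tesseract_vars(preserve_interword_spaces=1)
```

The call would then be `pytesseract.image_to_string(wandImg, lang='fra', config=config)`.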

pytesseract ko on image_to_string

Hello,

My source is the following:

import pytesseract as pt
import Image as im
from PIL import Image as PILim

print pt.image_to_string(im.open("t.png"))

img = im.open('t.png')
img.load()
print pt.image_to_string(img)

And I have:

Traceback (most recent call last):
  File "D:\Fichier\Eclipse\Test__init__.py", line 28, in <module>
    print pt.image_to_string(img)
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 161, in image_to_string
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 94, in run_tesseract
  File "C:\Python27\lib\subprocess.py", line 709, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 957, in _execute_child
    startupinfo)
WindowsError: [Error 2] Le fichier spécifié est introuvable (The system cannot find the file specified)

Do you have any idea?

Thanks in advance.

Using Custom Dictionary in Pytesseract

I have created a separate custom dictionary and a custom function that work perfectly with the normal command-line tesseract, but I want to know how to use them in the pytesseract package. pytesseract does not use that function, and I have no idea how to provide the function name.

Normally I use:
tesseract test.png myfunc #myfunc being the custom function

How can I do the same in pytesseract?

WinError 6

Windows 7, 32-bit.
I have added tesseract.exe to PATH.
When I call pytesseract.image_to_string, I get this exception:
2017-12-04 17:21:09,545 Thread-3: [WinError 6] 句柄无效。 (The handle is invalid.)
Traceback (most recent call last):
File "ui.py", line 136, in do_guoshui
row['nsrmc'] = g.login()
File "guoshui.py", line 164, in login
"checkCode": self.checkCode()
File "guoshui.py", line 153, in checkCode
cc = self.shibie_tesseract(image_file)
File "guoshui.py", line 114, in shibie_tesseract
num = self.image_to_string(img)
File "guoshui.py", line 87, in image_to_string
subprocess.call(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "subprocess.py", line 267, in call
File "subprocess.py", line 665, in __init__
File "subprocess.py", line 883, in _get_handles
OSError: [WinError 6] 句柄无效。 (The handle is invalid.)

So I searched online and found this solution:
# proc = subprocess.Popen(command, stderr=subprocess.PIPE)
proc = subprocess.Popen(command,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        stdin=subprocess.PIPE,
                        shell=True)

Improved method to discard alpha channel

Build info: pytesseract v0.17 with tesseract v4.0 built from source, running on Ubuntu 16.04 64-bit

In the function image_to_string there is a comment saying that the method used to remove the alpha channel is kind of a hack. That method works pretty well, but I'd like to propose an alternative: the convert method built into PIL Image objects. It is faster than the current implementation and simplifies the code.

if len(image.split()) == 4:
    # In case we have 4 channels, lets discard the Alpha.
    # Kind of a hack, should fix in the future some time.
    r, g, b, a = image.split()
    image = Image.merge("RGB", (r, g, b))

The current implementation takes 0.002046s to remove the alpha channel.
My proposed implementation took around 0.000678s. The code would directly replace the previous block:

if len(image.split()) == 4:
    # convert() returns a new image, so the result must be assigned
    image = image.convert('RGB')

This is the image both implementations were tested on:

I've submitted this change as PR#80.

Ability to specify the path to the tesseract binary via an argument

Perhaps turning the code into a class, the tesseract path as well as the rest of the configuration options could be passed during initialization.

TypeError: 'str' does not support the buffer interface

An error occurs when I run this code:

from PIL import Image
import pytesseract

im1_path = r"D:\pictures\8.png"
image = Image.open(im1_path)
captcha_code = pytesseract.image_to_string(image)
print(captcha_code)

traceback:

Traceback (most recent call last):
  File "D:/python_workplace/Learn-to-identify-similar-images-master/similarity.py", line 5, in <module>
    captcha_code = pytesseract.image_to_string(image)
  File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
  File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: 'str' does not support the buffer interface

What might the problem be here?

Can the current version recognize Arabic text?

Doesn't fully recognize my image

Hey guys! I am pretty new at programming Python so take it easy with me please! I have an image provided above and if I use:

print pytesseract.image_to_string(img)

it returns:

Rggimantasih

Which is pretty close but as you can guess not fully accurate.

I tried converting image to black and white/grayscale but that did not help. Image and letters seem pretty clear maybe you can help me out here? Thanks in advance!!

ResourceWarning

I am using pytesseract in Python3 code:

#!/usr/bin/env python3

from PIL import Image
import pytesseract

file = "file.txt"
text = pytesseract.image_to_string(Image.open(file), lang="eng")

Everything works fine, but when I wrote my first unit test I got the following warning:

/usr/lib/python3.4/site-packages/pytesseract/pytesseract.py:161: ResourceWarning: unclosed file <_io.BufferedReader name=4>  config=config)

Version: 0.1.6, unittesting via std unittest module.
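
The `name=4` in the warning looks like a file descriptor rather than a named file, which suggests an unclosed pipe. Assuming it is the stderr pipe opened by `subprocess.Popen(command, stderr=subprocess.PIPE)`, one way to avoid the warning is to collect the output with `communicate()`, which reads the pipe to the end and closes it. `run_and_capture_stderr` is a hypothetical name for illustration.

```python
import subprocess
import sys


def run_and_capture_stderr(command):
    """Run a command and return (exit_status, stderr_bytes) with pipes closed."""
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
    _, stderr_data = proc.communicate()  # reads stderr to EOF, then closes it
    return proc.returncode, stderr_data


if __name__ == "__main__":
    # Demonstration with the Python interpreter standing in for tesseract.
    print(run_and_capture_stderr([sys.executable, "-c", "import sys; sys.exit(0)"]))
```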

No output result

	image=Image.open('C:\\Users\\Bobliao\\Desktop\\GetValidateCode.jpg')
	tessdata_dir_config = '--tessdata-dir "E:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
	b=pytesseract.image_to_string(image,config=tessdata_dir_config)

The string b is empty and no errors are raised.

below is my sample JPG

Python 3 compatibility changes needed

pytesseract does not run in python 3 without errors. There are some small issues that need fixing to make things run without errors on python 2 and 3.

In pytesseract.py:

All print statements need to be inside parentheses.
```
print(image_to_string(image))
```
StringIO needs to be imported from the io module
```
from io import StringIO
```
StringIO direct usage needed
sys.stderr = StringIO()

os.tempnam needs to be refactored because it is deprecated. https://docs.python.org/2/library/os.html#os.tempnam. One solution is to replace:

return os.tempnam(None, 'tess_')

with

import tempfile
with tempfile.NamedTemporaryFile() as tf:
    tmpname = 'tess_' + os.path.basename(tf.name)
return tmpname

file() is deprecated and should be changed to open():
```
f = open(output_file_name)
```

In __init__.py the import needs to be explicit:

  from .pytesseract import image_to_string

With these changes I could run pytesseract on both python 2.7 and python 3.5.

When I use pytesseract, it raises a TypeError. How can I deal with it?

Here is my code.

from PIL import Image
import pytesseract
img = Image.open('/Users/songmingyang/Pictures/ocr-test.png')
pytesseract.image_to_string(img)

It then raises an error:

TypeError: a bytes-like object is required, not 'str'

I have tried many ways to deal with it, but none of them work.

image.split() causes error

img = Image.open('./imgs/SAM_0190.JPG')
pytesseract.image_to_string(img)
Causes:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 143, in image_to_string
if len(image.split()) == 4:
File "/usr/lib64/python2.7/site-packages/PIL/Image.py", line 1497, in split
if self.im.bands == 1:
AttributeError: 'NoneType' object has no attribute 'bands'

PIL version: 1.1.7
pytesseract: 0.1.6

bug importing tesseract

when I do import tesseract I get
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist-packages/tesseract.py", line 28, in
_tesseract = swig_import_helper()
File "/usr/lib/python2.7/dist-packages/tesseract.py", line 24, in swig_import_helper
_mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/lib/python2.7/dist-packages/_tesseract.x86_64-linux-gnu.so: undefined symbol: gplotSimpleXY1

UnicodeDecodeError when testing a JPG file or other pictures

UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 255: illegal multibyte sequence.
The error occurs with your test file "test-european.jpg".
Also, how can I get the language?

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>

What can I do about this error message in python 3? It happens from time to time and it should not throw an exception.

Pytesseract.image_to_string(image,None, False, "-psm 6")
Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>

Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata

I am trying to use pytesseract on Jupyter Notebook.

Windows 10 x64
Running Jupyter Notebook (Anaconda3, Python 3.6.1) with administrative privilege
The work directory containing TIFF file is in different drive (Z:)

When I run the following code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))

I get the following error:

TesseractError                            Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
     11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
     12 
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
     14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    123         if status:
    124             errors = get_errors(error_string)
--> 125             raise TesseractError(status, errors)
    126         f = open(output_file_name, 'rb')
    127         try:

TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')

I found these two references helpful but I am missing something:
#50
#64

Thank you for your time on this!
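
Tesseract names its language files with three-letter codes, so English is `eng.traineddata`; asking for `lang='en'` makes it look for an `en.traineddata` file that normally does not exist. A small sketch for checking which requested languages are missing from a tessdata directory; both function names are hypothetical.

```python
import os


def traineddata_path(tessdata_dir, lang):
    """Return the expected path of <lang>.traineddata inside tessdata_dir."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def missing_languages(tessdata_dir, langs):
    """List the language codes whose .traineddata file is not present."""
    return [
        lang for lang in langs
        if not os.path.isfile(traineddata_path(tessdata_dir, lang))
    ]
```

If `missing_languages(r"C:\Program Files (x86)\Tesseract-OCR\tessdata", ["eng"])` comes back empty, passing `lang='eng'` is likely the fix here.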

Training?

I've OCRed some text and I'm not getting great results. Is there a way to train pytesseract?

PermissionError: [Errno 13] Permission denied

I am using OS X with Python 3.6.

Here is my code:

try:
    import Image
except:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = '/usr/local/Cellar/tesseract/3.05.01/'

import cv2

img = cv2.imread('case_2.png')
img = Image.fromarray(img)
print(pytesseract.image_to_string(img, lang='chi_sim'))

But it outputted an error:

Traceback (most recent call last):
  File "/Users/Dylan/Documents/GitHub/Genedock/ocr_framework/tempfyllslkgxc.py", line 13, in <module>
    print(pytesseract.image_to_string(Image.open('case_2.png'), lang='chi_sim'))
  File "/Users/Dylan/anaconda/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 122, in image_to_string
    config=config)
  File "/Users/Dylan/anaconda/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 46, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "/Users/Dylan/anaconda/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/Users/Dylan/anaconda/lib/python3.6/subprocess.py", line 1326, in _execute_child
    raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied

I searched all over StackOverflow, but in vain.

I hope you can help me as soon as possible.

Access is denied(Windows 10, Python2.7/3.5)

I am getting an access denied error when trying to do a basic operation.
I downloaded the Windows version of tesseract (UB-Mannheim: tesseract-ocr-setup-3.05.00dev.exe).

I tried moving tesseract from the C drive to the D drive, but that didn't help.

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\Program Files (x86)\Tesseract-OCR'
x = pytesseract.image_to_string(Image.open('cropped.png'))
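
In the snippet above, `tesseract_cmd` is set to the installation folder rather than to the program inside it, and trying to execute a directory typically produces an access or permission error. A sketch of a sanity check that could be run before assigning the path; `check_tesseract_cmd` is a hypothetical helper.

```python
import os


def check_tesseract_cmd(path):
    """Return a problem description for path, or None if it looks runnable."""
    if os.path.isdir(path):
        return "path is a directory; point it at the tesseract executable inside"
    if not os.path.isfile(path):
        return "no such file"
    if not os.access(path, os.X_OK):
        return "file is not executable"
    return None
```

On Windows the corrected value would presumably end in `\tesseract.exe`, as in other reports on this page.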

add html mode

I can use your config interface to add config=+hocr.txt and include in that file a command to output hOCR, but then the file cleanup process gets buggy because it looks for either .box or .txt files, not .html files. The most logical fix would probably be to replace "box=true" with something like "output", where output permits either box or hocr/html.

WindowsError: [Error 2]

stuck in error.

python pytesseract.py test.png
Traceback (most recent call last):
  File "pytesseract.py", line 174, in <module>
    main()
  File "pytesseract.py", line 158, in main
    print(image_to_string(image))
  File "pytesseract.py", line 131, in image_to_string
    nice=nice)
  File "pytesseract.py", line 51, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "c:\Python27\lib\subprocess.py", line 390, in __init__
    errread, errwrite)
  File "c:\Python27\lib\subprocess.py", line 640, in _execute_child
    startupinfo)
WindowsError: [Error 2]

I think I've installed all the dependencies needed but can't solve this one.

windows 10 :pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\tessdata/chi_sim.traineddata')

I am using pytesseract on Windows 10 x64 with Python 3.5.2 x64 and Tesseract 4.0. The code is as follows:

# -*- coding: utf-8 -*-

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract


print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))

error:

Traceback (most recent call last):
  File "D:/test.py", line 10, in <module>
    print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 165, in image_to_string
    raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\tessdata/chi_sim.traineddata')

C:\Program Files (x86)\Tesseract-OCR\tessdata,like this:

Is there a way to avoid I/O?

The usage section says the way to use pytesseract is

print(pytesseract.image_to_string(Image.open('test.png')))

I'm wondering if there is a way to avoid I/O. I am using OpenCV to do some image pre-processing prior to sending images to tesseract. Is there a way I can send the image from OpenCV directly to pytesseract instead of first saving it to a file? Here is what I have to do now

#save image from OpenCV to disk
cv2.imwrite(path_to_image, my_image)
print(pytesseract.image_to_string(Image.open(path_to_image)))

This way I have to first write the image to the disk. I believe this will not be scalable in a distributed architecture because of the I/O involved. Any way to avoid this?
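
The explicit `cv2.imwrite` step can probably be skipped by converting the OpenCV array to a PIL image in memory. A sketch assuming an 8-bit, three-channel BGR array (OpenCV's default colour layout) and that NumPy and Pillow are installed; `bgr_array_to_pil` is a hypothetical helper name.

```python
def bgr_array_to_pil(bgr_array):
    """Convert an OpenCV-style BGR uint8 array (H x W x 3) to an RGB PIL image."""
    # Imported inside the function so this module loads without these packages.
    import numpy as np
    from PIL import Image

    rgb = np.ascontiguousarray(bgr_array[:, :, ::-1])  # reverse the channel order
    return Image.fromarray(rgb)
```

Usage would be `pytesseract.image_to_string(bgr_array_to_pil(my_image))`. Note that the tracebacks on this page show the wrapper itself creating temporary files for the tesseract process, so this removes only the caller's own disk write.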

Suppress console window

Hi,
I find your project very useful and would like to thank you for your work :)
After using it in a standalone executable on Windows, I'd like to suggest a small improvement.
Currently (on Windows 8, standalone exe) every call to pytesseract makes a console window appear for about 1s, which is very unpleasant. This small fix resolves the issue for me:

sinfo = subprocess.STARTUPINFO()
sinfo.dwFlags = subprocess.CREATE_NEW_CONSOLE | subprocess.STARTF_USESHOWWINDOW
sinfo.wShowWindow = subprocess.SW_HIDE
proc = subprocess.Popen(command, stderr=subprocess.PIPE, startupinfo=sinfo)

Would you consider adding a setting (or a better solution) that lets the user decide whether the console should be visible?
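
A platform-guarded variant of the same idea might look like the sketch below. It uses the commonly seen idiom of setting only `STARTF_USESHOWWINDOW` in `dwFlags`, and returns no extra arguments on other operating systems. `hidden_window_kwargs` is a hypothetical helper, not an existing pytesseract setting.

```python
import subprocess
import sys


def hidden_window_kwargs():
    """Return Popen keyword arguments that hide the console window on Windows."""
    if sys.platform != "win32":
        return {}  # nothing to hide on other platforms
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return {"startupinfo": startupinfo}
```

It would be applied as `subprocess.Popen(command, stderr=subprocess.PIPE, **hidden_window_kwargs())`.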

IOError: cannot write mode LA as BMP

This comes directly from PIL.
I am trying to run this very basic example against a single file in the same directory. It won't work.

Python 2.7 Mac OS X.

Traceback (most recent call last):

code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
print(pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim', config=tessdata_dir_config))

err:

Traceback (most recent call last):
File "/Users/huangangui/PycharmProjects/test/helloword.py", line 7, in <module>
print(pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim', config=tessdata_dir_config))
File "/Library/Python/2.7/site-packages/pytesseract/pytesseract.py", line 125, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, u'Error opening data file <replace_with_your_tessdata_dir_path>/tessdata/chi_sim.traineddata')

Python3.6 FileNotFoundError: [WinError 2] 系统找不到指定的文件。 (The system cannot find the file specified.)

In [1]: from PIL import Image

In [2]: im = Image.open('D:/ai.png')

In [3]: im
Out[3]: <PIL.BmpImagePlugin.BmpImageFile image mode=RGB size=858x68 at 0x3C93B50>

In [4]: import pytesseract

In [5]: pytesseract.image_to_string(im)

FileNotFoundError Traceback (most recent call last)
in ()
----> 1 pytesseract.image_to_string(im)

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
159 lang=lang,
160 boxes=boxes,
--> 161 config=config)
162 if status:
163 errors = get_errors(error_string)

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\site-packages\pytesseract\pytesseract.py in run_tesseract(input_filename, output_filename_base, lang, boxes, config)
92
93 proc = subprocess.Popen(command,
---> 94 stderr=subprocess.PIPE)
95 return (proc.wait(), proc.stderr.read())
96

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
705 c2pread, c2pwrite,
706 errread, errwrite,
--> 707 restore_signals, start_new_session)
708 except:
709 # Cleanup if the child failed starting.

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
988 env,
989 cwd,
--> 990 startupinfo)
991 finally:
992 # Child is launched. Close the parent's copy of those pipe

FileNotFoundError: [WinError 2] 系统找不到指定的文件。 (The system cannot find the file specified.)

AttributeError: 'NoneType' object has no attribute 'bands'

In [21]: d = Image.open('test.jpg')

In [22]: print(pytesseract.image_to_string(d))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-3c0cb3b15b33> in <module>()
----> 1 print(pytesseract.image_to_string(d))

/Library/Python/2.7/site-packages/pytesseract/pytesseract.pyc in image_to_string(image, lang, boxes, config)
    141     '''
    142
--> 143     if len(image.split()) == 4:
    144         # In case we have 4 channels, lets discard the Alpha.
    145         # Kind of a hack, should fix in the future some time.

/Library/Python/2.7/site-packages/PIL/Image.pyc in split(self)
   1495         "Split image into bands"
   1496
-> 1497         if self.im.bands == 1:
   1498             ims = [self.copy()]
   1499         else:

AttributeError: 'NoneType' object has no attribute 'bands'

What happened?
After running

>>> d.load()

the error is:

OSError: [Errno 2] No such file or directory

Handling multi-page tiffs

Was attempting to use this with a multi-page tiff, but because of how Image.open() works, only the first page is turned into a string.

So I tried looping over n_frames using the seek() method, but was unsuccessful because pytesseract closes the image, so seek() throws ValueError: seek of closed file.

Perhaps pytesseract.py should check for n_frames, and convert the entire file by using the seek() method before closing? (Or at least offer that option).

As an example of what I was trying:

img = Image.open('path/to/my/img')
raw = ''
for i in range(img.n_frames):
    img.seek(i)
    raw+=pytesseract.image_to_string(img)
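
Assuming the failure comes from the wrapper closing the image object it is given, one workaround is to hand it a copy of each frame so the original file stays open for the next `seek()`. In this sketch the OCR function is passed in as a parameter; `ocr_all_frames` is a hypothetical helper.

```python
def ocr_all_frames(img, ocr):
    """Run ocr() on a copy of every frame of a multi-frame image."""
    pages = []
    for index in range(getattr(img, "n_frames", 1)):
        img.seek(index)
        # copy() yields an independent image of the current frame, so closing
        # the copy does not close the multi-page file itself.
        pages.append(ocr(img.copy()))
    return pages
```

With the names from the example above, this would be called as `raw = ''.join(ocr_all_frames(img, pytesseract.image_to_string))`.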

UnicodeEncodeError: 'gbk' codec can't encode character u'\ufb01' in position 173 : illegal multibyte sequence

How can I solve this?

ValueError: Attempted relative import in non-package

I installed pytesseract with pip, but importing pytesseract raises "ValueError: Attempted relative import in non-package". Tesseract itself is installed correctly. The Python version is 2.7 and tesseract is 3.04.01. I don't know what to do.
