madmaze / pytesseract
A Python wrapper for Google Tesseract
License: Apache License 2.0
When I import pytesseract, it raises a ValueError on the image import.
environment:
python 2.7.10
PIL 1.1.7
pytesseract 0.1.7
error:
ValueError: Attempted relative import in non-package
Hello
I am trying to use pytesseract for text detection in a Java Swing GUI. The problem is that I don't know how to improve the results, especially when the font is small. Could I have some advice on that?
Thanks,
The image seems to open, but then when I run image_to_string, it throws an exception:
>>> i = Image.open('test.png')
>>> print(pytesseract.image_to_string(i))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 161, in image_to_string
config=config)
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 94, in run_tesseract
stderr=subprocess.PIPE)
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
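An `OSError: [Errno 2]` raised from inside `subprocess` usually means the `tesseract` binary itself could not be launched, rather than anything being wrong with the image. A minimal sketch for checking this, assuming Python 3.3+ (for `shutil.which`); `find_tesseract` is a hypothetical helper name and not part of pytesseract:

```python
import shutil


def find_tesseract(names=("tesseract", "tesseract.exe")):
    """Return the full path of the first Tesseract binary found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("tesseract was not found on PATH; install it or set tesseract_cmd")
else:
    print("tesseract found at", tesseract_path)
    # With pytesseract installed, the path could then be passed on:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```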
I am testing pytesseract; some images are recognized but others are not.
The images not recognized are photos with text, the error is:
Encoding is cp437
Running...
Traceback (most recent call last):
File "encuesta.py", line 13, in <module>
print(pytesseract.image_to_string(Image.open('enc1.tiff'), boxes=False))
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1010-1011: character maps to <undefined>
I tried several chcp commands, e.g.:
chcp 437
chcp 16001
# and others
My actual code is:
import sys
print "Encoding is", sys.stdin.encoding
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
print "Running..."
print(pytesseract.image_to_string(Image.open('enc1.tiff'), boxes=False))
Is there any way for pytesseract to continue with recognition regardless of errors?
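The traceback ends in `cp437.py`, which suggests the failure happens when `print` encodes the recognized text for the console, after recognition has already finished. One possible workaround is to replace characters the console cannot show before printing. This is a sketch under that assumption; `console_safe` is a hypothetical helper name:

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the console encoding cannot represent with '?'."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


# u"\u2019" (right single quotation mark) has no cp437 equivalent.
print(console_safe(u"it\u2019s", "cp437"))
```

The OCR result itself is left untouched; only the copy sent to the console is degraded.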
While this is running, Python keeps opening CMD windows: http://i.imgur.com/pWI3JDW.png
How can I prevent this?
I have added code for getting the output in TSV format. Please give me permission to push the code and open a pull request for review.
Thanks
When I run print pytesseract.image_to_string(image, boxes=False, config="-psm 0")
I get the processed string of the text in the image. However, psm 0 is supposed to only test the orientation. If you run this via the command line, the output should be like this:
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Orientation: 2
Orientation in degrees: 180
Orientation confidence: 0.28
Script: 2
Script confidence: 0.04
The code looks like it should be correct, and you have accounted for this, but I do not get the result. I'm not sure what the problem is, I tried to debug it for a few hours. Still trying.
I ran into instances of a UnicodeDecodeError when processing some images, particularly ones where the type isn't very clear.
It seems that tesseract is emitting some bytes for which Python has trouble picking a codec. I did a bit of digging, and it seems the standard output from tesseract is encoded in UTF-8:
https://code.google.com/p/tesseract-ocr/wiki/FAQ#What_output_formats_can_Tesseract_produce?
Altering the file open command in pytesseract.py to explicitly state the expected encoding prevents the error, but also changes the print behaviour (returns encoded bytes instead of a string):
f = open(output_file_name, encoding='utf-8', errors='replace')
Altering the return command to encode the output to ascii, and then decode it, allows the current output to be maintained (as far as I can see - tested it with a few working images) and also squashes the UnicodeDecodeErrors.
return f.read().strip().encode('ascii', 'replace').decode('ascii', 'replace')
This seems to work.. and seems to allow processing of any image to get some output, however useful it may be.
Thoughts welcome - not sure if it's the best way to go about resolving the errors.
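An alternative to the ASCII round trip described above is to read the output file as bytes and decode it as UTF-8 with `errors='replace'`, which keeps valid non-ASCII characters and only substitutes the undecodable bytes. A minimal sketch, where `read_ocr_output` is a hypothetical helper and the demo file merely stands in for a Tesseract output file:

```python
import io
import os
import tempfile


def read_ocr_output(path):
    """Read a Tesseract output file as UTF-8, replacing undecodable bytes."""
    with io.open(path, "rb") as f:
        return f.read().decode("utf-8", "replace").strip()


# Demonstration with a stand-in file containing one invalid byte (0xff).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"caf\xc3\xa9 \xff\n")
demo_text = read_ocr_output(demo_path)
os.remove(demo_path)
```

Here the accented character survives and only the invalid byte becomes the Unicode replacement character.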
pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'
Where should I download the executable, please? I am working on Ubuntu and macOS.
Hi!
When I use the tesseract in my terminal(OS X 10.11.3) like this:
tesseract test.png out
It raises a warning: Empty page!!
But when I use pytesseract in my code(Python 2.7.10) like this:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
receipt = Image.open('test.png')
receipt.load()
print(pytesseract.image_to_string(receipt))
The output consists of 2 empty lines.
So, the 'Empty page!!' warning doesn't show in the output.
Thanks.
Running pytesseract on Raspbian, python 2.7, tesseract 4.00 (4.00.00dev-625-g9c2fa0d).
My code is as follows:
ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")
Error:
Traceback (most recent call last):
File "/home/pi/Vocable/TESTING.py", line 114, in <module>
ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 126, in image_to_string
f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory: '/tmp/tess_lN5JlN.txt'
If I run code with psm 1 [Recognition with OSD], I receive no errors but the upside down text is simply treated as right-side-up text, producing garbage results. (This was tested on an inverted test.png)
Essentially text recognition works but OSD does not.
I installed the OCR engine from the Google page and successfully imported the pytesseract library, but when I try to run this code in Python I get these errors from pytesseract.py and subprocess.py:
9 - from pytesseract import image_to_string
10 -from PIL import Image
11-
12- im = Image.open('an91cut.jpg')
13- print(im)
14-
15- print(image_to_string(im))
And the 5 errors are these:
Traceback (most recent call last):
File "D:/Documentos/2015-2/Proyecto Electrónico 1/PycharmProjects/Extracción de placa/ocr2.py", line 15, in <module>
print(image_to_string(im))
File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 156, in image_to_string
status, error_string = run_tesseract(input_file_name,output_file_name_base,lang=lang,boxes=boxes,config=config)
File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 93, in run_tesseract
proc = subprocess.Popen(command, stderr=subprocess.PIPE)
File "C:\Python27\lib\subprocess.py", line 706, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,cwd, env, universal_newlines,startupinfo, creationflags, shell,p2cread, p2cwrite,c2pread, c2pwrite, errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 936, in _execute_child
hp, ht, pid, tid = _subprocess.CreateProcess(executable, args,None, None,int(not close_fds),creationflags,env,cwd, startupinfo)
WindowsError: [Error 5] Acceso denegado (Access denied)
Process finished with exit code 1
Help, please. I'm developing a license plate recognition project, and my friends and I are being held up by this problem. (Apologies if my English is wrong. Greetings from Peru!)
So, I just installed this lib and I'm using Python 3.5 on windows 7.
My code that's giving the error
myText2 = image_to_string(Image.open("myImage.png"))
Error that I'm getting :
Traceback (most recent call last):
File "F:/Competitions/Donations/Scrapping.py", line 113, in <module>
Scrapping()
File "F:/Competitions/Donations/Scrapping.py", line 58, in __init__
myText2 = image_to_string(Image.open("captcha.png"))
File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
errors = get_errors(error_string)
File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
error_lines = tuple(line for line in lines if line.find('Error') >= 0)
File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'
I went through all the previous Issues regarding this matter and I've updated my tesseract version to 4.0 and have all the trained files available. But, I'm getting this error, what's the issue and how can it be fixed?
Older issue : #32
I even tried this step mentioned in SO answer, but after this, I have the same problem as OP. The error changes to
Traceback (most recent call last):
File "F:/Competitions/Donations/Scrapping.py", line 113, in <module>
Scrapping()
File "F:/Competitions/Donations/Scrapping.py", line 58, in __init__
myText2 = image_to_string(Image.open("captcha.bmp"))
File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 164, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\eng.traineddata')
We need some basic unit tests to ensure when changes are made that we still support the same functionality.
I ran the compiled executable under Windows 7 with Python 2.7 and got this error:
c:\temp>pytesseract captcha.png
Traceback (most recent call last):
File "C:\Python27\Scripts\pytesseract-script.py", line 9, in <module>
load_entry_point('pytesseract==0.1.6', 'console_scripts', 'pytesseract')()
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 187, in main
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 161, in image_to_string
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 94, in run_tesseract
File "C:\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2]
hi there,
I have a feature request: I need to recognize a batch of verification codes with Tesseract OCR, and I am using pytesseract.
It works perfectly for an image that contains only letters. However, if the image contains both digits and letters, the output is always wrong. Is there any way to improve the quality in this scenario? Or is it possible to involve some supervised process to improve accuracy?
The attachment is an example of a misrecognized image; the output is "N 3U’7V".
Here is the code:
import sys
from pytesseract import image_to_string
from PIL import Image
im = Image.open(sys.argv[1])
im = im.convert('L')
def initTable(threshold=140):
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    return table
binaryImage = im.point(initTable(), '1')
binaryImage.show()
print(image_to_string(binaryImage, config='-psm 7'))
When using python 3.5 and pillow (the original PIL library is quite old now), I receive an error on this very simple example:
import pytesseract
try:
    import Image
except ImportError:
    from PIL import Image
pytesseract.image_to_string(Image.open('test_image.png'))
The error is:
Traceback (most recent call last):
File "tesseract_test.py", line 8, in <module>
pytesseract.image_to_string(Image.open('test_image.png'))
File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
errors = get_errors(error_string)
File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
error_lines = tuple(line for line in lines if line.find('Error') >= 0)
File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'
I'm using Windows 10, 64-bit, with python x64.
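The traceback points at `line.find('Error')` being called on bytes: under Python 3, the stderr pipe of a subprocess yields `bytes`, while `'Error'` is a `str`. A minimal sketch of an error filter that decodes first. This is an illustration under that reading of the traceback, not the library's actual code, and the UTF-8 decoding is an assumption:

```python
def get_errors(error_string):
    """Return the lines of Tesseract's stderr output that mention 'Error'."""
    if isinstance(error_string, bytes):
        # stderr read from a pipe arrives as bytes on Python 3
        error_string = error_string.decode("utf-8", "replace")
    error_lines = tuple(
        line for line in error_string.splitlines() if "Error" in line
    )
    if error_lines:
        return u"\n".join(error_lines)
    return error_string.strip()
```

For example, `get_errors(b"banner\nError opening data file x\n")` yields only the line that contains "Error".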
I am trying to use the preserve_interword_spaces option, but it doesn't seem to work.
I tried
pytesseract.image_to_string(wandImg, lang='fra', config="-preserve_interword_spaces 1 -psm 5")
Is there something I am missing?
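On the tesseract command line, variables such as `preserve_interword_spaces` are normally set with `-c NAME=VALUE`, so `-preserve_interword_spaces 1` may not be parsed as intended. A small sketch for assembling such a config string; `build_config` is a hypothetical helper, and the `-psm` spelling follows the call above (newer Tesseract releases spell it `--psm`, and whether a given variable is honored depends on the Tesseract version):

```python
def build_config(psm=None, variables=None, psm_flag="-psm"):
    """Assemble a Tesseract config string from a page mode and -c variables."""
    parts = []
    if psm is not None:
        parts.extend([psm_flag, str(psm)])
    for name, value in sorted((variables or {}).items()):
        parts.extend(["-c", "{}={}".format(name, value)])
    return " ".join(parts)


config = build_config(psm=5, variables={"preserve_interword_spaces": 1})
print(config)
```

The resulting string could then be passed as the `config` argument of `image_to_string`.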
Hello,
My source is the following:
import pytesseract as pt
import Image as im
from PIL import Image as PILim
img = im.open('t.png')
img.load()
print pt.image_to_string(img)
And I have:
Traceback (most recent call last):
File "D:\Fichier\Eclipse\Test__init__.py", line 28, in <module>
print pt.image_to_string(img)
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 161, in image_to_string
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 94, in run_tesseract
File "C:\Python27\lib\subprocess.py", line 709, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 957, in _execute_child
startupinfo)
WindowsError: [Error 2] Le fichier spécifié est introuvable (The specified file was not found)
Do you have any idea?
Thanks in advance.
I have created a separate custom dictionary and a custom function, which work perfectly with the normal command-line tesseract, but I want to know how to use them with the pytesseract package. pytesseract does not use that function, and I have no clue how to provide the function name.
In normal I use:
tesseract test.png myfunc #myfunc being the custom function
How can I do the same in pytesseract?
Windows 7, 32-bit.
I have added tesseract.exe to PATH.
When I call pytesseract.image_to_string, I get this exception:
2017-12-04 17:21:09,545 Thread-3: [WinError 6] 句柄无效。 (The handle is invalid.)
Traceback (most recent call last):
File "ui.py", line 136, in do_guoshui
row['nsrmc'] = g.login()
File "guoshui.py", line 164, in login
"checkCode": self.checkCode()
File "guoshui.py", line 153, in checkCode
cc = self.shibie_tesseract(image_file)
File "guoshui.py", line 114, in shibie_tesseract
num = self.image_to_string(img)
File "guoshui.py", line 87, in image_to_string
subprocess.call(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "subprocess.py", line 267, in call
File "subprocess.py", line 665, in __init__
File "subprocess.py", line 883, in _get_handles
OSError: [WinError 6] 句柄无效。 (The handle is invalid.)
So I searched on Google and found this solution:
# proc = subprocess.Popen(command, stderr=subprocess.PIPE)
proc = subprocess.Popen(command,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        stdin=subprocess.PIPE,
                        shell=True)
Build info: pytesseract v0.17 with tesseract v4.0 built from source, running on Ubuntu 16.04 64-bit
In the function image_to_string there's a comment saying that the method used to remove the alpha channel is kind of a hack. While that method works pretty well, I'd like to propose an alternative that uses the convert method built into PIL Image objects. It is faster than the current implementation and simplifies the code.
if len(image.split()) == 4:
    # In case we have 4 channels, lets discard the Alpha.
    # Kind of a hack, should fix in the future some time.
    r, g, b, a = image.split()
    image = Image.merge("RGB", (r, g, b))
The time for this current implementation to remove the alpha channel is 0.002046s,
My proposed implementation took around 0.000678s to remove the alpha channel. The code would directly replace the previous block of code:
if len(image.split()) == 4:
    # convert() returns a new image, so the result must be assigned
    image = image.convert('RGB')
This is the image both implementations were tested on:
I've submitted this change as PR#80.
Perhaps the code could be turned into a class, so that the tesseract path and the rest of the configuration options could be passed during initialization.
An error occurs when I run this code:
from PIL import Image
import pytesseract
im1_path = r"D:\pictures\8.png"
image = Image.open(im1_path)
captcha_code = pytesseract.image_to_string(image)
print(captcha_code)
traceback:
Traceback (most recent call last):
File "D:/python_workplace/Learn-to-identify-similar-images-master/similarity.py", line 5, in <module>
captcha_code = pytesseract.image_to_string(image)
File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
errors = get_errors(error_string)
File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
error_lines = tuple(line for line in lines if line.find('Error') >= 0)
File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: 'str' does not support the buffer interface
What might the problem be here?
Hey guys! I am pretty new at programming Python so take it easy with me please! I have an image provided above and if I use:
print pytesseract.image_to_string(img)
it returns:
Rggimantasih
Which is pretty close but as you can guess not fully accurate.
I tried converting image to black and white/grayscale but that did not help. Image and letters seem pretty clear maybe you can help me out here? Thanks in advance!!
I am using pytesseract in Python3 code:
#!/usr/bin/env python3
from PIL import Image
import pytesseract
file = "file.txt"
text = pytesseract.image_to_string(Image.open(file), lang='eng')
Everything works fine, but when I wrote my first unit test I got the following warning:
/usr/lib/python3.4/site-packages/pytesseract/pytesseract.py:161: ResourceWarning: unclosed file <_io.BufferedReader name=4> config=config)
Version: 0.1.6, unit testing via the standard unittest module.
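The warning names an unclosed `BufferedReader`, which would be consistent with the stderr pipe of the tesseract subprocess being read but never closed. One way to avoid that is `Popen.communicate()`, which waits for the process, reads the pipe and closes it. A sketch under that assumption; `run_and_capture_stderr` is a hypothetical name, and the Python one-liner only stands in for the tesseract command:

```python
import subprocess
import sys


def run_and_capture_stderr(command):
    """Run command; return (exit status, stderr bytes) with pipes closed."""
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
    _, stderr_data = proc.communicate()  # waits, reads stderr, closes the pipe
    return proc.returncode, stderr_data


status, err = run_and_capture_stderr(
    [sys.executable, "-c", "import sys; sys.stderr.write('Error: demo')"]
)
```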
image=Image.open('C:\\Users\\Bobliao\\Desktop\\GetValidateCode.jpg')
tessdata_dir_config = '--tessdata-dir "E:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
b=pytesseract.image_to_string(image,config=tessdata_dir_config)
The string b is empty, and no errors occur.
Below is my sample JPG:
pytesseract does not run in python 3 without errors. There are some small issues that need fixing to make things run without errors on python 2 and 3.
In pytesseract.py:
All print statements need to be inside parentheses.
print(image_to_string(image))
StringIO needs to be imported from the io module
from io import StringIO
StringIO direct usage needed
sys.stderr = StringIO()
os.tempnam needs to be refactored because it is deprecated. https://docs.python.org/2/library/os.html#os.tempnam. One solution is to replace:
return os.tempnam(None, 'tess_')
with
import tempfile
with tempfile.NamedTemporaryFile() as tf:
    tmpname = 'tess_' + os.path.basename(tf.name)
return tmpname
file() is deprecated and should be changed to open():
f = open(output_file_name)
In __init__.py the import needs to be explicit:
from .pytesseract import image_to_string
With these changes I could run pytesseract on both python 2.7 and python 3.5.
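As a variation on the `os.tempnam` replacement listed above, `tempfile.mkstemp` reserves a unique path (including its directory) atomically. This is only a sketch of that alternative; `get_temp_base` is a hypothetical name, and the `.txt` suffix mirrors the output file names that appear in the tracebacks on this page:

```python
import os
import tempfile


def get_temp_base(prefix="tess_"):
    """Reserve a unique temporary path and return it for use as a file name base."""
    fd, path = tempfile.mkstemp(prefix=prefix)
    os.close(fd)  # only the reserved name is needed here, not the open handle
    return path


base = get_temp_base()
output_file_name = base + ".txt"  # where an output file would be expected
os.remove(base)  # the caller is responsible for cleaning up the placeholder
```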
Here is my code.
from PIL import Image
import pytesseract
img = Image.open('/Users/songmingyang/Pictures/ocr-test.png')
pytesseract.image_to_string(img)
It then returns an error:
TypeError: a bytes-like object is required, not 'str'
I have tried many ways to deal with it, but nothing works.
img = Image.open('./imgs/SAM_0190.JPG')
pytesseract.image_to_string(img)
Causes:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 143, in image_to_string
if len(image.split()) == 4:
File "/usr/lib64/python2.7/site-packages/PIL/Image.py", line 1497, in split
if self.im.bands == 1:
AttributeError: 'NoneType' object has no attribute 'bands'
PIL version: 1.1.7
pytesseract: 0.1.6
when I do import tesseract I get
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist-packages/tesseract.py", line 28, in
_tesseract = swig_import_helper()
File "/usr/lib/python2.7/dist-packages/tesseract.py", line 24, in swig_import_helper
_mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/lib/python2.7/dist-packages/_tesseract.x86_64-linux-gnu.so: undefined symbol: gplotSimpleXY1
UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 255: illegal multibyte sequence.
The error occurred when using your test file "test-european.jpg".
Also, how can I get the lang?
What can I do about this error message in python 3? It happens from time to time and it should not throw an exception.
Pytesseract.image_to_string(image,None, False, "-psm 6")
Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>
I am trying to use pytesseract on Jupyter Notebook.
When I run the following code:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))
I get the following error:
TesseractError Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
12
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
123 if status:
124 errors = get_errors(error_string)
--> 125 raise TesseractError(status, errors)
126 f = open(output_file_name, 'rb')
127 try:
TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')
I found these two references helpful but I am missing something:
#50
#64
Thank you for your time on this!
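The message names `en.traineddata`, whereas the English data file shipped with Tesseract is named `eng.traineddata`, so `lang='eng'` may be what is needed here. A quick way to see which codes a tessdata folder actually offers is to list its `.traineddata` files. A sketch; `available_languages` is a hypothetical helper, and the demo folder merely stands in for a real tessdata directory:

```python
import os
import tempfile


def available_languages(tessdata_dir):
    """List language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


# Demonstration against a stand-in directory with two empty data files.
demo_dir = tempfile.mkdtemp()
for code in ("eng", "fra"):
    open(os.path.join(demo_dir, code + ".traineddata"), "w").close()
langs = available_languages(demo_dir)
print(langs)
```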
I've OCRed some text and I'm not getting great results. Is there a way to train pytesseract?
I am using OS X with Python 3.6.
Here is my code:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = '/usr/local/Cellar/tesseract/3.05.01/'
import cv2
img = cv2.imread('case_2.png')
img = Image.fromarray(img)
print(pytesseract.image_to_string(img, lang='chi_sim'))
But it outputted an error:
Traceback (most recent call last):
File "/Users/Dylan/Documents/GitHub/Genedock/ocr_framework/tempfyllslkgxc.py", line 13, in <module>
print(pytesseract.image_to_string(Image.open('case_2.png'), lang='chi_sim'))
File "/Users/Dylan/anaconda/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 122, in image_to_string
config=config)
File "/Users/Dylan/anaconda/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 46, in run_tesseract
proc = subprocess.Popen(command, stderr=subprocess.PIPE)
File "/Users/Dylan/anaconda/lib/python3.6/subprocess.py", line 707, in __init__
restore_signals, start_new_session)
File "/Users/Dylan/anaconda/lib/python3.6/subprocess.py", line 1326, in _execute_child
raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied
I searched all over Stack Overflow, but in vain.
I hope you can help me as soon as possible.
I am getting an access denied error when trying to do a basic operation.
I downloaded the Windows version of tesseract from UB-Mannheim (tesseract-ocr-setup-3.05.00dev.exe).
I tried moving tesseract from the C drive to the D drive, but that didn't help.
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\Program Files (x86)\Tesseract-OCR'
x = pytesseract.image_to_string(Image.open('cropped.png'))
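In the snippet above, `tesseract_cmd` is set to the installation folder rather than to the executable inside it, which may be what triggers the access denied error when the operating system is asked to run a directory. A sketch of a guard against that; `resolve_tesseract_cmd` is a hypothetical helper, and the demo folder stands in for a real installation:

```python
import os
import tempfile


def resolve_tesseract_cmd(path, names=("tesseract.exe", "tesseract")):
    """If path is a directory, return the Tesseract executable inside it."""
    if os.path.isdir(path):
        for name in names:
            candidate = os.path.join(path, name)
            if os.path.isfile(candidate):
                return candidate
        raise IOError("no tesseract executable found in " + path)
    return path


# Demonstration with a stand-in folder containing an empty "tesseract.exe".
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "tesseract.exe"), "w").close()
resolved = resolve_tesseract_cmd(demo_dir)
```

The resolved path could then be assigned to `pytesseract.pytesseract.tesseract_cmd`.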
I can use your config interface to add config=+hocr.txt, and include in that file a command to output hOCR. But then the file cleanup process gets buggy, because it looks for either .box or .txt files, not .html files. The most logical way to fix this would probably be to replace "boxes=True" with something like "output", where output permits either box or hOCR/HTML...
I'm stuck on this error:
python pytesseract.py test.png
Traceback (most recent call last):
File "pytesseract.py", line 174, in <module>
main()
File "pytesseract.py", line 158, in main
print(image_to_string(image))
File "pytesseract.py", line 131, in image_to_string
nice=nice)
File "pytesseract.py", line 51, in run_tesseract
proc = subprocess.Popen(command, stderr=subprocess.PIPE)
File "c:\Python27\lib\subprocess.py", line 390, in __init__
errread, errwrite)
File "c:\Python27\lib\subprocess.py", line 640, in _execute_child
startupinfo)
WindowsError: [Error 2]
I think I've installed all the dependencies needed but can't solve this one.
I am using pytesseract on Windows 10 x64 with Python 3.5.2 x64 and Tesseract 4.0. The code is as follows:
# -*- coding: utf-8 -*-
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
error:
Traceback (most recent call last):
File "D:/test.py", line 10, in <module>
print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 165, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\tessdata/chi_sim.traineddata')
C:\Program Files (x86)\Tesseract-OCR\tessdata, like this:
The usage section says the way to use pytesseract is
print(pytesseract.image_to_string(Image.open('test.png')))
I'm wondering if there is a way to avoid I/O. I am using OpenCV to do some image pre-processing prior to sending images to tesseract. Is there a way I can send the image from OpenCV directly to pytesseract instead of first saving it to a file? Here is what I have to do now
#save image from OpenCV to disk
cv2.imwrite(path_to_image, my_image)
print(pytesseract.image_to_string(Image.open(path_to_image)))
This way I have to first write the image to the disk. I believe this will not be scalable in a distributed architecture because of the I/O involved. Any way to avoid this?
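The explicit `cv2.imwrite` step can be skipped by converting the OpenCV array to a PIL image in memory. A sketch, assuming a 3-channel BGR array as returned by `cv2.imread`; `ocr_from_opencv` is a hypothetical helper. The tracebacks elsewhere on this page suggest pytesseract itself still exchanges temporary files with the tesseract binary, so this removes only the extra write, not all disk I/O:

```python
def ocr_from_opencv(bgr_image, ocr):
    """Convert an OpenCV BGR array to a PIL image in memory and OCR it.

    `ocr` is any callable taking a PIL image, e.g. pytesseract.image_to_string.
    """
    import cv2  # imported lazily so the sketch loads without OpenCV installed
    from PIL import Image

    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)  # OpenCV stores BGR
    return ocr(Image.fromarray(rgb))


# Possible usage, with OpenCV, Pillow and pytesseract installed:
# text = ocr_from_opencv(my_image, pytesseract.image_to_string)
```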
Hi,
I found your project very useful and I like to thank you for your work :)
But after using it in a standalone executable on Windows, I'd like to suggest a small improvement.
Currently (on Windows 8 with a standalone exe), every call to pytesseract makes a console window appear for about 1s, which is very unpleasant. This small fix resolves the issue for me:
sinfo = subprocess.STARTUPINFO()
sinfo.dwFlags = subprocess.CREATE_NEW_CONSOLE | subprocess.STARTF_USESHOWWINDOW
sinfo.wShowWindow = subprocess.SW_HIDE
proc = subprocess.Popen(command, stderr=subprocess.PIPE, startupinfo=sinfo)
Would you consider adding a setting (or a better solution) that lets the user decide whether the console should be visible?
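A cross-platform variant of the idea could build the extra `Popen` arguments only on Windows. This sketch assumes Python 3, where `subprocess` exposes `STARTF_USESHOWWINDOW` and `SW_HIDE` on Windows; `hidden_window_kwargs` is a hypothetical name. Unlike the snippet above, it sets only `STARTF_USESHOWWINDOW` in `dwFlags`, since `CREATE_NEW_CONSOLE` is documented as a process creation flag (the `creationflags` argument of `Popen`):

```python
import subprocess
import sys


def hidden_window_kwargs():
    """Return Popen keyword arguments that hide the console window on Windows."""
    if sys.platform != "win32":
        return {}
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return {"startupinfo": startupinfo}


# The Python one-liner stands in for the tesseract command.
proc = subprocess.Popen(
    [sys.executable, "-c", "pass"], stderr=subprocess.PIPE, **hidden_window_kwargs()
)
_, stderr_data = proc.communicate()
```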
This directly from PIL.
I am trying to run this very basic example against a single file in the same directory. It won't work.
Python 2.7, Mac OS X.
code:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
print(pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim', config=tessdata_dir_config))
err:
Traceback (most recent call last):
File "/Users/huangangui/PycharmProjects/test/helloword.py", line 7, in
print(pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim', config=tessdata_dir_config))
File "/Library/Python/2.7/site-packages/pytesseract/pytesseract.py", line 125, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, u'Error opening data file <replace_with_your_tessdata_dir_path>/tessdata/chi_sim.traineddata')
In [1]: from PIL import Image
In [2]: im = Image.open('D:/ai.png')
In [3]: im
Out[3]: <PIL.BmpImagePlugin.BmpImageFile image mode=RGB size=858x68 at 0x3C93B50>
In [4]: import pytesseract
In [5]: pytesseract.image_to_string(im)
FileNotFoundError Traceback (most recent call last)
in ()
----> 1 pytesseract.image_to_string(im)
c:\users\lenovo\appdata\local\programs\python\python36-32\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
159 lang=lang,
160 boxes=boxes,
--> 161 config=config)
162 if status:
163 errors = get_errors(error_string)
c:\users\lenovo\appdata\local\programs\python\python36-32\lib\site-packages\pytesseract\pytesseract.py in run_tesseract(input_filename, output_filename_base, lang, boxes, config)
92
93 proc = subprocess.Popen(command,
---> 94 stderr=subprocess.PIPE)
95 return (proc.wait(), proc.stderr.read())
96
c:\users\lenovo\appdata\local\programs\python\python36-32\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
705 c2pread, c2pwrite,
706 errread, errwrite,
--> 707 restore_signals, start_new_session)
708 except:
709 # Cleanup if the child failed starting.
c:\users\lenovo\appdata\local\programs\python\python36-32\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
988 env,
989 cwd,
--> 990 startupinfo)
991 finally:
992 # Child is launched. Close the parent's copy of those pipe
FileNotFoundError: [WinError 2] 系统找不到指定的文件。 (The system cannot find the file specified.)
In [21]: d = Image.open('test.jpg')
In [22]: print(pytesseract.image_to_string(d))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-3c0cb3b15b33> in <module>()
----> 1 print(pytesseract.image_to_string(d))
/Library/Python/2.7/site-packages/pytesseract/pytesseract.pyc in image_to_string(image, lang, boxes, config)
141 '''
142
--> 143 if len(image.split()) == 4:
144 # In case we have 4 channels, lets discard the Alpha.
145 # Kind of a hack, should fix in the future some time.
/Library/Python/2.7/site-packages/PIL/Image.pyc in split(self)
1495 "Split image into bands"
1496
-> 1497 if self.im.bands == 1:
1498 ims = [self.copy()]
1499 else:
AttributeError: 'NoneType' object has no attribute 'bands'
What happened?
After
>>> d.load()
OSError: [Errno 2] No such file or directory
Was attempting to use this with a multi-page TIFF, but because of how Image.open() works, only the first page is turned into a string.
So I tried looping over n_frames using the seek() method, but was unsuccessful because pytesseract closes the image, so seek() throws ValueError: seek of closed file.
Perhaps pytesseract.py should check for n_frames, and convert the entire file by using the seek() method before closing? (Or at least offer that option.)
As an example of what I was trying:
img = Image.open('path/to/my/img')
raw = ''
for i in range(img.n_frames):
    img.seek(i)
    raw += pytesseract.image_to_string(img)
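One possible workaround for the loop above is to hand a copy of each frame to the OCR call, so that whatever happens to the copy does not affect the original file handle. This is a sketch that assumes closing the copy leaves the source image usable; `ocr_all_frames` is a hypothetical helper, and `ocr` is any callable such as `pytesseract.image_to_string`:

```python
def ocr_all_frames(img, ocr):
    """Run `ocr` on a copy of every frame of a multi-frame image."""
    texts = []
    # Single-frame images may lack n_frames, so fall back to one frame.
    for index in range(getattr(img, "n_frames", 1)):
        img.seek(index)
        texts.append(ocr(img.copy()))  # the copy can be closed without harm
    return "\n".join(texts)


# Possible usage, with Pillow and pytesseract installed:
# raw = ocr_all_frames(Image.open('path/to/my/img'), pytesseract.image_to_string)
```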
I installed pytesseract with pip, but importing pytesseract raises the error "ValueError: Attempted relative import in non-package". Tesseract is installed correctly. The Python version is 2.7 and the tesseract version is 3.04.01. I don't know what to do.