madmaze / pytesseract

A Python wrapper for Google Tesseract

License: Apache License 2.0

Python 100.00%

pytesseract's Introduction

Python Tesseract

Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

USAGE

Quickstart

Note: Test images are located in the tests/data folder of the Git repo.

Library usage:

from PIL import Image

import pytesseract

# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))

# To bypass pytesseract's image conversions, pass a relative or absolute image path instead
# NOTE: In this case the image must be in a format tesseract supports, or tesseract will return an error
print(pytesseract.image_to_string('test.png'))

# List of available languages
print(pytesseract.get_languages(config=''))

# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')

# getting multiple types of output with one call to save compute time
# currently supports mix and match of the following: txt, pdf, hocr, box, tsv
text, boxes = pytesseract.run_and_get_multiple_output('test.png', extensions=['txt', 'box'])

Support for OpenCV image/NumPy array objects

import cv2

img_cv = cv2.imread(r'/<path_to_image>/digits.png')

# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
# we need to convert from BGR to RGB format/mode:
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb))
# OR
# Pillow expects the size as (width, height), while NumPy's shape is (height, width, channels)
img_rgb = Image.frombytes('RGB', img_cv.shape[1::-1], img_cv, 'raw', 'BGR', 0, 0)
print(pytesseract.image_to_string(img_rgb))

If you need custom configuration like oem/psm, use the config keyword.

# Example of adding any additional options
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)

# Example of using pre-defined tesseract config file with options
cfg_filename = 'words'
pytesseract.run_and_get_output(image, extension='txt', config=cfg_filename)

Add the following config if you get a tessdata error like: "Error opening data file..."

# Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.
tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
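The "Error opening data file..." message usually means Tesseract cannot find a `<lang>.traineddata` file in the directory it searches. As a rough pre-flight check, the sketch below lists which language files are missing from a given tessdata directory. The helper name `missing_traineddata` is hypothetical and not part of pytesseract; it only assumes the conventional `<lang>.traineddata` file naming.

```python
import os


def missing_traineddata(tessdata_dir, lang):
    """Return the codes in `lang` (e.g. 'eng+fra') that have no
    '<code>.traineddata' file in `tessdata_dir`.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    codes = [code for code in lang.split('+') if code]
    return [
        code for code in codes
        if not os.path.isfile(os.path.join(tessdata_dir, code + '.traineddata'))
    ]


# Example usage (the directory is a placeholder, as in the snippet above):
# missing = missing_traineddata('<replace_with_your_tessdata_dir_path>', 'chi_sim')
# if missing:
#     print('Missing language data for:', missing)
```

An empty list means every requested language file was found; a non-empty list points at a wrong directory or a wrong language code.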

Functions

get_languages Returns all currently supported languages by Tesseract OCR.
get_tesseract_version Returns the Tesseract version installed in the system.
image_to_string Returns unmodified output as string from Tesseract OCR processing
image_to_boxes Returns result containing recognized characters and their box boundaries
image_to_data Returns result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the Tesseract TSV documentation
image_to_osd Returns result containing information about orientation and script detection.
image_to_alto_xml Returns result in the form of Tesseract's ALTO XML format.
run_and_get_output Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
run_and_get_multiple_output Returns output like run_and_get_output but can handle multiple extensions. This function replaces the extension: str kwarg with an extensions: List[str] kwarg, where a list of extensions can be specified and the corresponding data is returned after only one tesseract call. This reduces the number of calls to tesseract when multiple output formats, such as both text and bounding boxes, are needed.

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)

image Object or String - either PIL Image, NumPy array or file path of the image to be processed by Tesseract. If you pass object instead of file path, pytesseract will implicitly convert the image to RGB mode.
lang String - Tesseract language code string. Defaults to eng if not specified! Example for multiple languages: lang='eng+fra'
config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
nice Integer - modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.
output_type Class attribute - specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of pytesseract.Output class.
timeout Integer or Float - duration in seconds for the OCR processing, after which, pytesseract will terminate and raise RuntimeError.
pandas_config Dict - only for the Output.DATAFRAME type. Dictionary with custom arguments for pandas.read_csv. Allows you to customize the output of image_to_data.
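As an illustration of the output_type parameter, the sketch below post-processes the dictionary that image_to_data returns with output_type=Output.DICT, keeping only words above a confidence threshold. It assumes the dictionary carries parallel 'text' and 'conf' lists (the TSV columns of the same names). The function name `confident_words` and the threshold of 60 are arbitrary choices for this example.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence) pairs whose confidence is at least min_conf.

    `data` is expected to look like the result of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    Rows with empty text (layout rows, usually with conf -1) are skipped.
    """
    words = []
    for text, conf in zip(data['text'], data['conf']):
        conf = float(conf)  # conf may arrive as int, float or str
        if text.strip() and conf >= min_conf:
            words.append((text, conf))
    return words


# Example usage (requires Tesseract and a real image):
# import pytesseract
# from PIL import Image
# data = pytesseract.image_to_data(Image.open('test.png'),
#                                  output_type=pytesseract.Output.DICT)
# print(confident_words(data))
```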

CLI usage:

pytesseract [-l lang] image_file

INSTALLATION

Prerequisites:

Python-tesseract requires Python 3.6+
You will need the Python Imaging Library (PIL) (or the Pillow fork). Please check the Pillow documentation for basic Pillow installation instructions.
Install Google Tesseract OCR (additional info how to install the engine on Linux, macOS and Windows). You must be able to invoke the tesseract command as tesseract. If this isn't the case, for example because tesseract isn't in your PATH, you will have to change the "tesseract_cmd" variable pytesseract.pytesseract.tesseract_cmd. Under Debian/Ubuntu you can use the package tesseract-ocr. macOS users should install the Homebrew package tesseract.

Note: In some rare cases, you might need to additionally install tessconfigs and configs from tesseract-ocr/tessconfigs if the OS specific package doesn't include them.
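To check whether the tesseract command can actually be found before calling pytesseract, a small standard-library sketch such as the following may help. The helper name `find_tesseract` is hypothetical; `shutil.which` is the only real API it relies on.

```python
import shutil


def find_tesseract(command='tesseract'):
    """Return the full path of `command` if it is on PATH, otherwise None."""
    return shutil.which(command)


if __name__ == '__main__':
    path = find_tesseract()
    if path is None:
        print('tesseract was not found on PATH; '
              'set pytesseract.pytesseract.tesseract_cmd to its full path')
    else:
        print('Found tesseract at', path)
```

If this reports that tesseract is missing, errors such as "WindowsError: [Error 2]" or "FileNotFoundError" when calling image_to_string are a likely consequence.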

Installing via pip:

Check the pytesseract package page for more information.

pip install pytesseract

Or if you have git installed:

pip install -U git+https://github.com/madmaze/pytesseract.git

Installing from source:

git clone https://github.com/madmaze/pytesseract.git
cd pytesseract && pip install -U .

Install with conda (via conda-forge):

conda install -c conda-forge pytesseract

TESTING

To run this project's test suite, install and run tox. Ensure that you have tesseract installed and in your PATH.

pip install tox
tox

LICENSE

Check the LICENSE file included in the Python-tesseract repository/distribution. As of Python-tesseract 0.3.1 the license is Apache License Version 2.0

CONTRIBUTORS

Originally written by Samuel Hoffstaetter
Full list of contributors

pytesseract's Issues

question about text recognise with digits and letters

hi there,

I have a feature request to recognize a bunch of verification codes with Tesseract-ocr, and I used pytesseract.

It works perfectly for images containing only letters. However, if the image contains both digits and letters, the output is always wrong. Is there any way to improve the quality in this scenario? Or is it possible to involve some supervised process to improve accuracy?

The attachment is an example of a failed recognition; the output is "N 3U’7V".

Here is the code:

import sys
from pytesseract import image_to_string
from PIL import Image
im = Image.open(sys.argv[1])
im = im.convert('L')

def initTable(threshold=140):
    # Lookup table mapping each gray level to 0 (below threshold) or 1
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)

    return table

binaryImage = im.point(initTable(), '1')
binaryImage.show()

print(image_to_string(binaryImage, config='-psm 7'))

WindowsError: [Error 2]

stuck in error.

python pytesseract.py test.png
Traceback (most recent call last):
  File "pytesseract.py", line 174, in <module>
    main()
  File "pytesseract.py", line 158, in main
    print(image_to_string(image))
  File "pytesseract.py", line 131, in image_to_string
    nice=nice)
  File "pytesseract.py", line 51, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "c:\Python27\lib\subprocess.py", line 390, in __init__
    errread, errwrite)
  File "c:\Python27\lib\subprocess.py", line 640, in _execute_child
    startupinfo)
WindowsError: [Error 2]

I think I've installed all the dependencies needed but can't solve this one.

When I use pytesseract, it returns a 'TypeError'. How do I deal with it?

Here is my code.

from PIL import Image
import pytesseract
img = Image.open('/Users/songmingyang/Pictures/ocr-test.png')
pytesseract.image_to_string(img)

Then return an Error.

TypeError: a bytes-like object is required, not 'str'

I have tried many ways to deal with it, but nothing works.

ValueError

When I import pytesseract, it raises a ValueError from the Image import.

environment:
python 2.7.10
PIL 1.1.7
pytesseract 0.1.7
error:
ValueError: Attempted relative import in non-package

Traceback (most recent call last):

code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
print(pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim', config=tessdata_dir_config))

err:

Traceback (most recent call last):
File "/Users/huangangui/PycharmProjects/test/helloword.py", line 7, in
print(pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim', config=tessdata_dir_config))
File "/Library/Python/2.7/site-packages/pytesseract/pytesseract.py", line 125, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, u'Error opening data file <replace_with_your_tessdata_dir_path>/tessdata/chi_sim.traineddata')

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>

What can I do about this error message in python 3? It happens from time to time and it should not throw an exception.

Pytesseract.image_to_string(image,None, False, "-psm 6")
Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>

Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata

I am trying to use pytesseract on Jupyter Notebook.

Windows 10 x64
Running Jupyter Notebook (Anaconda3, Python 3.6.1) with administrative privilege
The work directory containing TIFF file is in different drive (Z:)

When I run the following code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))

I get the following error:

TesseractError                            Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
     11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
     12 
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
     14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    123         if status:
    124             errors = get_errors(error_string)
--> 125             raise TesseractError(status, errors)
    126         f = open(output_file_name, 'rb')
    127         try:

TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')

I found these two references helpful but I am missing something:
#50
#64

Thank you for your time on this!

UnicodeDecodeError: testing a JPG file or other pictures sometimes gives this error

UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 255: illegal multibyte sequence.
The error occurs when I use your test file "test-european.jpg".
Also, how can I get the lang?

Python3.6 FileNotFoundError: [WinError 2] The system cannot find the file specified.

In [1]: from PIL import Image

In [2]: im = Image.open('D:/ai.png')

In [3]: im
Out[3]: <PIL.BmpImagePlugin.BmpImageFile image mode=RGB size=858x68 at 0x3C93B50>

In [4]: import pytesseract

In [5]: pytesseract.image_to_string(im)

FileNotFoundError                         Traceback (most recent call last)
in ()
----> 1 pytesseract.image_to_string(im)

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    159 lang=lang,
    160 boxes=boxes,
--> 161 config=config)
    162 if status:
    163 errors = get_errors(error_string)

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\site-packages\pytesseract\pytesseract.py in run_tesseract(input_filename, output_filename_base, lang, boxes, config)
     92
     93 proc = subprocess.Popen(command,
---> 94 stderr=subprocess.PIPE)
     95 return (proc.wait(), proc.stderr.read())
     96

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    705 c2pread, c2pwrite,
    706 errread, errwrite,
--> 707 restore_signals, start_new_session)
    708 except:
    709 # Cleanup if the child failed starting.

c:\users\lenovo\appdata\local\programs\python\python36-32\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    988 env,
    989 cwd,
--> 990 startupinfo)
    991 finally:
    992 # Child is launched. Close the parent's copy of those pipe

FileNotFoundError: [WinError 2] The system cannot find the file specified.

IOError: cannot write mode LA as BMP

This comes directly from PIL.
I am trying to run this very basic example against a single file in the same directory. It won't work.

Python 2.7 Mac OS X.

WinError 6

Windows 7, 32-bit
I have added tesseract.exe to PATH.
When I call pytesseract.image_to_string, I get this exception:
2017-12-04 17:21:09,545 Thread-3: [WinError 6] The handle is invalid.
Traceback (most recent call last):
  File "ui.py", line 136, in do_guoshui
    row['nsrmc'] = g.login()
  File "guoshui.py", line 164, in login
    "checkCode": self.checkCode()
  File "guoshui.py", line 153, in checkCode
    cc = self.shibie_tesseract(image_file)
  File "guoshui.py", line 114, in shibie_tesseract
    num = self.image_to_string(img)
  File "guoshui.py", line 87, in image_to_string
    subprocess.call(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "subprocess.py", line 267, in call
  File "subprocess.py", line 665, in __init__
  File "subprocess.py", line 883, in _get_handles
OSError: [WinError 6] The handle is invalid.

So I searched online and found this solution:
# proc = subprocess.Popen(command, stderr=subprocess.PIPE)
proc = subprocess.Popen(command,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        stdin=subprocess.PIPE,
                        shell=True)

TypeError: 'str' does not support the buffer interface

An error occurs when I run this code:

from PIL import Image
import pytesseract

im1_path = r"D:\pictures\8.png"
image = Image.open(im1_path)
captcha_code = pytesseract.image_to_string(image)
print(captcha_code)

traceback:

Traceback (most recent call last):
  File "D:/python_workplace/Learn-to-identify-similar-images-master/similarity.py", line 5, in <module>
    captcha_code = pytesseract.image_to_string(image)
  File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
  File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: 'str' does not support the buffer interface

What might the problem be here?

TypeError: a bytes-like object is required, not 'str'

So, I just installed this lib and I'm using Python 3.5 on windows 7.

My code that's giving the error

myText2 = image_to_string(Image.open("myImage.png"))

Error that I'm getting :

Traceback (most recent call last):
  File "F:/Competitions/Donations/Scrapping.py", line 113, in <module>
    Scrapping()
  File "F:/Competitions/Donations/Scrapping.py", line 58, in __init__
    myText2 = image_to_string(Image.open("captcha.png"))
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'

I went through all the previous Issues regarding this matter and I've updated my tesseract version to 4.0 and have all the trained files available. But, I'm getting this error, what's the issue and how can it be fixed?

Older issue : #32

I even tried this step mentioned in SO answer, but after this, I have the same problem as OP. The error changes to

Traceback (most recent call last):
  File "F:/Competitions/Donations/Scrapping.py", line 113, in <module>
    Scrapping()
  File "F:/Competitions/Donations/Scrapping.py", line 58, in __init__
    myText2 = image_to_string(Image.open("captcha.bmp"))
  File "C:\Users\User Name\AppData\Roaming\Python\Python35\site-packages\pytesseract\pytesseract.py", line 164, in image_to_string
    raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\eng.traineddata')

where to download the tesseract executable, please?

pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'

Include the above line, if you don't have tesseract executable in your PATH

Example tesseract_cmd: 'C:\Program Files (x86)\Tesseract-OCR\tesseract'

Where should I download the executable, please? I am working on Ubuntu and macOS.

WindowsError: [Error 2]

I ran the compiled executable under Windows 7 with Python 2.7 and got this error:

c:\temp>pytesseract captcha.png
Traceback (most recent call last):
  File "C:\Python27\Scripts\pytesseract-script.py", line 9, in
    load_entry_point('pytesseract==0.1.6', 'console_scripts', 'pytesseract')()
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 187, in main
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 161, in image_to_string
  File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 94, in run_tesseract
  File "C:\Python27\lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
    startupinfo)
WindowsError: [Error 2]

Python 3 compatibility changes needed

pytesseract does not run in python 3 without errors. There are some small issues that need fixing to make things run without errors on python 2 and 3.

In pytesseract.py:

All print statements need to be inside parentheses.
```
print(image_to_string(image))
```
StringIO needs to be imported from the io module
```
from io import StringIO
```
StringIO then needs to be used directly:
sys.stderr = StringIO()

os.tempnam needs to be refactored because it is deprecated. https://docs.python.org/2/library/os.html#os.tempnam. One solution is to replace:

return os.tempnam(None, 'tess_')

with

import tempfile
with tempfile.NamedTemporaryFile() as tf:
    tmpname = 'tess_' + os.path.basename(tf.name)
return tmpname

file() is deprecated and should be changed to open():
```
f = open(output_file_name)
```

In __init__.py the import needs to be explicit:

  from .pytesseract import image_to_string

With these changes I could run pytesseract on both python 2.7 and python 3.5.
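One caveat with the NamedTemporaryFile suggestion in the list above: only the base name is kept after the file is closed and deleted, so nothing guarantees that the name stays unused. A possible alternative, sketched here as an assumption rather than as what pytesseract actually adopted, is `tempfile.mkstemp`, which creates the file atomically and leaves cleanup to the caller. The function name `make_temp_base` is hypothetical.

```python
import os
import tempfile


def make_temp_base(prefix='tess_'):
    """Create an empty temporary file and return its full path.

    mkstemp() creates the file securely; the caller is responsible
    for deleting it (and any output files derived from its name).
    """
    fd, path = tempfile.mkstemp(prefix=prefix)
    os.close(fd)  # only the name is needed, so close the OS-level handle
    return path
```

Unlike os.tempnam, this works unchanged on both Python 2.7 and Python 3.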

Suppress console window

Hi,
I found your project very useful and I'd like to thank you for your work :)
After using it in a standalone executable on Windows, I'd like to suggest a small improvement.
Currently (on Windows 8 with a standalone exe) every call to pytesseract makes a console window appear for about 1 s, which is very unpleasant. This small fix resolves the issue for me:

sinfo = subprocess.STARTUPINFO()
sinfo.dwFlags = subprocess.CREATE_NEW_CONSOLE | subprocess.STARTF_USESHOWWINDOW
sinfo.wShowWindow = subprocess.SW_HIDE
proc = subprocess.Popen(command, stderr=subprocess.PIPE, startupinfo=sinfo)

Would you consider adding a setting (or a better solution) that lets the user decide whether the console should be visible?

pytesseract ko on image_to_string

Hello,

My source is the following:

import pytesseract as pt
import Image as im
from PIL import Image as PILim

print pt.image_to_string(im.open("t.png"))

img = im.open('t.png')
img.load()
print pt.image_to_string(img)

And I have:

Traceback (most recent call last):
File "D:\Fichier\Eclipse\Test__init__.py", line 28, in
print pt.image_to_string(img)
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 161, in image_to_string
File "build\bdist.win32\egg\pytesseract\pytesseract.py", line 94, in run_tesseract
File "C:\Python27\lib\subprocess.py", line 709, in init
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 957, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

Do you have any idea?

Thanks in advance.

Doesn't fully recognize my image

Hey guys! I am pretty new at programming Python so take it easy with me please! I have an image provided above and if I use:

print pytesseract.image_to_string(img)

it returns:

Rggimantasih

Which is pretty close but as you can guess not fully accurate.

I tried converting image to black and white/grayscale but that did not help. Image and letters seem pretty clear maybe you can help me out here? Thanks in advance!!

CMD windows

While this is running, Python keeps opening CMD windows: http://i.imgur.com/pWI3JDW.png
How can I prevent this?

Config for different psm values does not work

When I run print pytesseract.image_to_string(image, boxes=False, config="-psm 0")

I get the processed string of the text in the image. However, psm 0 is supposed to only test the orientation. If you run this via the command line, the output should be like this:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Orientation: 2
Orientation in degrees: 180
Orientation confidence: 0.28
Script: 2
Script confidence: 0.04

The code looks like it should be correct, and you have accounted for this, but I do not get the result. I'm not sure what the problem is, I tried to debug it for a few hours. Still trying.
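For reference, the "Key: value" lines shown in the command-line output above are easy to turn into a dictionary once the OSD text is available, for example from the image_to_osd function listed in the README. The parser below is only a sketch; the name `parse_osd` is hypothetical and it assumes nothing beyond the line format quoted in this issue.

```python
def parse_osd(text):
    """Turn 'Key: value' lines of OSD output into a dict of strings.

    Lines without a colon (such as the Tesseract banner) are ignored.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


# Example usage (requires Tesseract and a real image):
# import pytesseract
# from PIL import Image
# osd = parse_osd(pytesseract.image_to_osd(Image.open('test.png')))
# print(osd.get('Orientation in degrees'))
```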

bug importing tesseract

when I do import tesseract I get
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist-packages/tesseract.py", line 28, in
_tesseract = swig_import_helper()
File "/usr/lib/python2.7/dist-packages/tesseract.py", line 24, in swig_import_helper
_mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/lib/python2.7/dist-packages/_tesseract.x86_64-linux-gnu.so: undefined symbol: gplotSimpleXY1

UnicodeDecodeErrors when processing certain images

I ran into instances of a UnicodeDecodeError when processing some images, particularly ones where the type isn't very clear.

It seems that tesseract is emitting some bytes that Python cannot decode with its default codec. After a bit of digging, it appears that the standard output from tesseract is encoded in UTF-8:

https://code.google.com/p/tesseract-ocr/wiki/FAQ#What_output_formats_can_Tesseract_produce?

Altering the file open command in pytesseract.py to explicitly state the expected encoding prevents the error, but also changes the print behaviour (returns encoded bytes instead of a string):

f = open(output_file_name, encoding='utf-8', errors='replace')

Altering the return command to encode the output to ascii, and then decode it, allows the current output to be maintained (as far as I can see - tested it with a few working images) and also squashes the UnicodeDecodeErrors.

return f.read().strip().encode('ascii', 'replace').decode('ascii', 'replace')

This seems to work.. and seems to allow processing of any image to get some output, however useful it may be.

Thoughts welcome - not sure if it's the best way to go about resolving the errors.
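A variation on the idea above, sketched under the assumption that the output file really is UTF-8: read the file in binary mode and decode explicitly, so the platform's default codec ('charmap', 'gbk', ...) is never involved. Undecodable bytes become U+FFFD instead of raising, and valid non-ASCII characters are kept rather than replaced. The function name `decode_tesseract_output` is hypothetical.

```python
def decode_tesseract_output(raw):
    """Decode bytes from a Tesseract output file as UTF-8.

    Invalid byte sequences are replaced with U+FFFD rather than
    raising UnicodeDecodeError; surrounding whitespace is stripped.
    """
    return raw.decode('utf-8', errors='replace').strip()


# Example usage (output_file_name is the variable used in the snippet above):
# with open(output_file_name, 'rb') as f:
#     text = decode_tesseract_output(f.read())
```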

Handling multi-page tiffs

Was attempting to use this with a multi-page tiff, but because of how Image.open() works, only the first page is turned into a string.

So I tried looping over n_frames using the seek() method, but was unsuccessful because pytesseract closes the image, so seek() throws ValueError: seek of closed file.

Perhaps pytesseract.py should check for n_frames, and convert the entire file by using the seek() method before closing? (Or at least offer that option).

As an example of what I was trying:

img = Image.open('path/to/my/img')
raw = ''
for i in range(img.n_frames):
    img.seek(i)
    raw+=pytesseract.image_to_string(img)
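One possible workaround for the loop above, offered as a sketch rather than a confirmed fix, is to hand the OCR function a copy of each frame so that whatever happens to the copy cannot close the original file. The helper name `ocr_all_frames` is hypothetical; it relies only on Pillow's `n_frames`, `seek()` and `copy()`, and takes the OCR function as a parameter.

```python
def ocr_all_frames(img, ocr):
    """Run `ocr` on a copy of every frame of `img` and return the results.

    `img` is expected to behave like a Pillow image (n_frames, seek, copy);
    `ocr` is any callable taking an image, e.g. pytesseract.image_to_string.
    """
    pages = []
    for index in range(getattr(img, 'n_frames', 1)):
        img.seek(index)
        pages.append(ocr(img.copy()))  # the copy is independent of the open file
    return pages


# Example usage (requires Pillow, pytesseract and Tesseract):
# from PIL import Image
# import pytesseract
# pages = ocr_all_frames(Image.open('path/to/my/img'), pytesseract.image_to_string)
# raw = ''.join(pages)
```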

No output result

image = Image.open('C:\\Users\\Bobliao\\Desktop\\GetValidateCode.jpg')
tessdata_dir_config = '--tessdata-dir "E:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
b = pytesseract.image_to_string(image, config=tessdata_dir_config)

The string b is empty and no errors are raised.

Below is my sample JPG.

IOError for OSD (psm 0)

Running pytesseract on Raspbian, python 2.7, tesseract 4.00 (4.00.00dev-625-g9c2fa0d).

My code is as follows:

ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")

Error:

Traceback (most recent call last):
  File "/home/pi/Vocable/TESTING.py", line 114, in <module>
    ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 126, in image_to_string
    f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory: '/tmp/tess_lN5JlN.txt'

If I run code with psm 1 [Recognition with OSD], I receive no errors but the upside down text is simply treated as right-side-up text, producing garbage results. (This was tested on an inverted test.png)

Essentially text recognition works but OSD does not.

Improve the result for small text

Hello

I am trying to use pytesseract for text detection in a Java Swing GUI. The problem is that I don't know how to improve the result, especially when the font is small. Could I have some advice on that, please?

Thanks,

Error parsing of tesseract output is brittle: a bytes-like object is required, not 'str'

When using python 3.5 and pillow (the original PIL library is quite old now), I receive an error on this very simple example:

import pytesseract

try:
    import Image
except ImportError:
    from PIL import Image

pytesseract.image_to_string(Image.open('test_image.png'))

The error is:

Traceback (most recent call last):
  File "tesseract_test.py", line 8, in <module>
    pytesseract.image_to_string(Image.open('test_image.png'))
  File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
  File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 111, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Users\tbarik\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 111, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'

I'm using Windows 10, 64-bit, with python x64.

Unit testing

We need some basic unit tests to ensure when changes are made that we still support the same functionality.

AttributeError: 'NoneType' object has no attribute 'bands'

In [21]: d = Image.open('test.jpg')

In [22]: print(pytesseract.image_to_string(d))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-3c0cb3b15b33> in <module>()
----> 1 print(pytesseract.image_to_string(d))

/Library/Python/2.7/site-packages/pytesseract/pytesseract.pyc in image_to_string(image, lang, boxes, config)
    141     '''
    142
--> 143     if len(image.split()) == 4:
    144         # In case we have 4 channels, lets discard the Alpha.
    145         # Kind of a hack, should fix in the future some time.

/Library/Python/2.7/site-packages/PIL/Image.pyc in split(self)
   1495         "Split image into bands"
   1496
-> 1497         if self.im.bands == 1:
   1498             ims = [self.copy()]
   1499         else:

AttributeError: 'NoneType' object has no attribute 'bands'

What happened?
After

>>> d.load()

OSError: [Errno 2] No such file or directory

ResourceWarning

I am using pytesseract in Python3 code:

#!/usr/bin/env python3

from PIL import Image
import pytesseract

file = "file.txt"
text = pytesseract.image_to_string(Image.open(file), lang='eng')

Everything works fine, but when I wrote my first unit test I got the following warning:

/usr/lib/python3.4/site-packages/pytesseract/pytesseract.py:161: ResourceWarning: unclosed file <_io.BufferedReader name=4>  config=config)

Version: 0.1.6, unittesting via std unittest module.

I got these errors when I tried to run code using Pytesseract

I installed Tesseract OCR from the Google page and successfully imported the Pytesseract library, but when I run this code in Python I get these errors from pytesseract.py and subprocess.py.

My code is:

9 - from pytesseract import image_to_string
10 -from PIL import Image
11-
12- im = Image.open('an91cut.jpg')
13- print(im)
14-
15- print(image_to_string(im))

And the 5 errors are these:

Traceback (most recent call last):
  File "D:/Documentos/2015-2/Proyecto Electrónico 1/PycharmProjects/Extracción de placa/ocr2.py", line 15, in <module>
    print(image_to_string(im))
  File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 156, in image_to_string
    status, error_string = run_tesseract(input_file_name,output_file_name_base,lang=lang,boxes=boxes,config=config)
  File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 93, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "C:\Python27\lib\subprocess.py", line 706, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,cwd, env, universal_newlines,startupinfo, creationflags, shell,p2cread, p2cwrite,c2pread, c2pwrite, errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 936, in _execute_child
    hp, ht, pid, tid = _subprocess.CreateProcess(executable, args,None, None,int(not close_fds),creationflags,env,cwd, startupinfo)
WindowsError: [Error 5] Acceso denegado (Access is denied)

Process finished with exit code 1

Please help: I'm developing a license plate recognition project, and my friends and I are delayed while we wait for a solution. (Apologies if my English is wrong. Greetings from Peru!)

The 'Empty page!!' warning doesn't show in the output

Hi!
When I use tesseract in my terminal (OS X 10.11.3) like this:

tesseract test.png out

It raises a warning: Empty page!!
But when I use pytesseract in my code (Python 2.7.10) like this:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

receipt = Image.open('test.png')
receipt.load()
print(pytesseract.image_to_string(receipt))

The output consists of 2 empty lines.
So, the 'Empty page!!' warning doesn't show in the output.
Thanks.

PermissionError: [Errno 13] Permission denied

I am using OS X with Python 3.6.

Here is my code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = '/usr/local/Cellar/tesseract/3.05.01/'

import cv2

img = cv2.imread('case_2.png')
img = Image.fromarray(img)
print(pytesseract.image_to_string(img, lang='chi_sim'))

But it raised an error:

Traceback (most recent call last):
  File "/Users/Dylan/Documents/GitHub/Genedock/ocr_framework/tempfyllslkgxc.py", line 13, in <module>
    print(pytesseract.image_to_string(Image.open('case_2.png'), lang='chi_sim'))
  File "/Users/Dylan/anaconda/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 122, in image_to_string
    config=config)
  File "/Users/Dylan/anaconda/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 46, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "/Users/Dylan/anaconda/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/Users/Dylan/anaconda/lib/python3.6/subprocess.py", line 1326, in _execute_child
    raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied

I searched all over Stack Overflow, but in vain.

I hope you can help me as soon as possible.
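
The tesseract_cmd value in the code above ends in a slash, so it names a directory rather than the tesseract binary. On macOS and Linux, trying to execute a directory typically fails with Errno 13, which would match this traceback. A minimal sketch of a pre-flight check follows. check_tesseract_cmd is a hypothetical helper (not part of pytesseract), and the bin/tesseract suffix in the usage comment assumes the usual Homebrew Cellar layout.

```python
import os


def check_tesseract_cmd(path):
    """Return None if `path` looks like a runnable binary, else a short reason.

    Hypothetical helper for illustration; pytesseract does not provide it.
    """
    if os.path.isdir(path):
        return "path is a directory; point it at the tesseract binary inside it"
    if not os.path.isfile(path):
        return "no file exists at this path"
    if not os.access(path, os.X_OK):
        return "file exists but is not executable"
    return None


# Hypothetical usage (the bin/tesseract location is an assumption):
# cmd = '/usr/local/Cellar/tesseract/3.05.01/bin/tesseract'
# problem = check_tesseract_cmd(cmd)
# if problem is None:
#     pytesseract.pytesseract.tesseract_cmd = cmd
# else:
#     print("tesseract_cmd looks wrong:", problem)
```

Running the check before calling image_to_string turns an opaque subprocess error into a message that says what is wrong with the path.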

add html mode

I can use your config interface to add config=+hocr.txt and include in that file a command to output hOCR, but then the file cleanup process gets buggy because it looks for either .box or .txt files, not .html files. The most logical fix would probably be to replace "box=true" with something like "output", where output permits either box or hocr/html.

windows 10 :pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\tessdata/chi_sim.traineddata')

I am using pytesseract on Windows 10 x64 with Python 3.5.2 x64 and Tesseract 4.0. The code is as follows:

# -*- coding: utf-8 -*-

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract


print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))

error:

Traceback (most recent call last):
  File "D:/test.py", line 10, in <module>
    print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 165, in image_to_string
    raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\tessdata/chi_sim.traineddata')

C:\Program Files (x86)\Tesseract-OCR\tessdata,like this:

Permission to push code and raise a pull request

I have added code for producing output in TSV format. Please give me permission to push the code and raise a pull request for review.
Thanks

Training?

I've OCRed some text and I'm not getting great results. Is there a way to train pytesseract?

how can I use the preserve_interword_spaces

I am trying to use the preserve_interword_spaces option, but it doesn't seem to work.
I tried:
pytesseract.image_to_string(wandImg, lang='fra', config="-preserve_interword_spaces 1 -psm 5")
Is there something I am missing?
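
On the Tesseract command line, config variables such as preserve_interword_spaces are normally set with -c NAME=VALUE rather than -NAME VALUE, and the page segmentation mode is -psm N on Tesseract 3.x and --psm N on 4.x. Since pytesseract passes the config string on to the tesseract command, a sketch of building that string could look like the following. build_config is a hypothetical helper, not a pytesseract function, and whether the variable has a visible effect depends on the installed Tesseract version.

```python
def build_config(variables, psm=None, legacy_psm_flag=False):
    """Build a Tesseract config string from config variables and a PSM value.

    Hypothetical helper for illustration only.
    `legacy_psm_flag=True` emits the single-dash form used by Tesseract 3.x.
    """
    parts = []
    if psm is not None:
        flag = "-psm" if legacy_psm_flag else "--psm"
        parts.append("{} {}".format(flag, psm))
    for name, value in variables.items():
        # Each config variable gets its own "-c name=value" pair.
        parts.append("-c {}={}".format(name, value))
    return " ".join(parts)


config = build_config({"preserve_interword_spaces": 1}, psm=5)
print(config)

# Hypothetical usage with the call from the question:
# pytesseract.image_to_string(wandImg, lang='fra', config=config)
```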

image.split() causes error

img = Image.open('./imgs/SAM_0190.JPG')
pytesseract.image_to_string(img)
Causes:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 143, in image_to_string
    if len(image.split()) == 4:
  File "/usr/lib64/python2.7/site-packages/PIL/Image.py", line 1497, in split
    if self.im.bands == 1:
AttributeError: 'NoneType' object has no attribute 'bands'

PIL version: 1.1.7
pytesseract: 0.1.6

Error encoding characters

I am testing pytesseract; some images are recognized but not others.
The images that fail are photos containing text. The error is:

Encoding is cp437
Running...
Traceback (most recent call last):
  File "encuesta.py", line 13, in <module>
    print(pytesseract.image_to_string(Image.open('enc1.tiff'), boxes=False))
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1010-1011: character maps to <undefined>

I tried several chcp commands, e.g.:

chcp 437
chcp 16001
# and others

My actual code is:

import sys
print "Encoding is", sys.stdin.encoding

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
print "Running..."
print(pytesseract.image_to_string(Image.open('enc1.tiff'), boxes=False))

Is there any way for pytesseract to continue with recognition regardless of errors?
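
The traceback above ends inside encodings\cp437.py, which suggests the failure happens when print encodes the recognized text for the Windows console, after OCR has already finished. One way to keep going is to replace unencodable characters before printing. This is a minimal Python 3 sketch (the report uses Python 2.7, where encode accepts the same 'replace' error handler). console_safe is a hypothetical helper name.

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters the target encoding cannot represent
    replaced by '?', so that printing it does not raise UnicodeEncodeError.

    Hypothetical helper; falls back to ASCII if stdout reports no encoding.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# U+FB01 (the "fi" ligature) is not part of code page 437.
print(console_safe("\ufb01le", "cp437"))

# Hypothetical usage with OCR output:
# print(console_safe(pytesseract.image_to_string(Image.open('enc1.tiff'))))
```

The full recognized text is still available in the original string; only the copy sent to the console is made lossy.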

ValueError: Attempted relative import in non-package

I installed pytesseract with pip, but importing pytesseract raises the error "ValueError: Attempted relative import in non-package". Tesseract itself is installed correctly. The Python version is 2.7 and the tesseract version is 3.04.01. I don't know what to do.

Using Custom Dictionary in Pytesseract

I have created a separate custom dictionary and a custom function which work perfectly with the normal command-line tesseract, but I want to know how to use them with the pytesseract package. pytesseract is not using that function, and I have no clue how to provide the function name.

Normally I use:
tesseract test.png myfunc #myfunc being the custom function

How can I do the same in pytesseract?

Ability to specify the path to the tesseract binary via an argument

Perhaps the code could be turned into a class, so that the tesseract path and the rest of the configuration options could be passed during initialization.

Can the current version recognize Arabic text?

Is there a way to avoid I/O?

The usage section says the way to use pytesseract is

print(pytesseract.image_to_string(Image.open('test.png')))

I'm wondering if there is a way to avoid I/O. I am using OpenCV to do some image pre-processing prior to sending images to tesseract. Is there a way I can send the image from OpenCV directly to pytesseract instead of first saving it to a file? Here is what I have to do now

#save image from OpenCV to disk
cv2.imwrite(path_to_image, my_image)
print(pytesseract.image_to_string(Image.open(path_to_image)))

This way I have to first write the image to the disk. I believe this will not be scalable in a distributed architecture because of the I/O involved. Any way to avoid this?
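
One way to skip the explicit cv2.imwrite round trip is to convert the OpenCV array to a PIL image in memory. A minimal sketch, assuming the OpenCV image is a uint8 NumPy array in BGR (or grayscale) layout. opencv_to_pil is a hypothetical helper name, and NumPy slicing is used here in place of cv2.cvtColor so the example does not need OpenCV installed. Note that the tracebacks elsewhere on this page show pytesseract launching the tesseract binary through subprocess with input and output file names, so some temporary-file I/O inside pytesseract itself likely remains.

```python
import numpy as np
from PIL import Image


def opencv_to_pil(frame):
    """Convert an OpenCV-style uint8 array to a PIL image without touching disk.

    Hypothetical helper for illustration.
    """
    if frame.ndim == 2:
        # Single-channel (grayscale) data needs no channel reordering.
        return Image.fromarray(frame)
    # OpenCV stores colour images as BGR(A); take channels 2, 1, 0 to get RGB.
    rgb = np.ascontiguousarray(frame[:, :, 2::-1])
    return Image.fromarray(rgb)


# A 1x1 pure-blue pixel in BGR order becomes (0, 0, 255) in RGB.
blue_bgr = np.zeros((1, 1, 3), dtype=np.uint8)
blue_bgr[0, 0] = (255, 0, 0)
print(opencv_to_pil(blue_bgr).getpixel((0, 0)))

# Hypothetical usage with the variables from the question:
# print(pytesseract.image_to_string(opencv_to_pil(my_image)))
```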

UnicodeEncodeError: 'gbk' codec can't encode character u'\ufb01' in position 173 : illegal multibyte sequence

How can I solve this?

Improved method to discard alpha channel

Build info: pytesseract v0.17 with tesseract v4.0 built from source, running on Ubuntu 16.04 64-bit

In the function image_to_string there's a comment saying that the method used to remove the alpha channel is kind of a hack. While that method works pretty well, I'd like to propose an alternative: the convert method built into PIL Image objects. It is faster than the current implementation and simplifies the code.

if len(image.split()) == 4:
    # In case we have 4 channels, lets discard the Alpha.
    # Kind of a hack, should fix in the future some time.
    r, g, b, a = image.split()
    image = Image.merge("RGB", (r, g, b))

The current implementation takes 0.002046s to remove the alpha channel, while my proposed implementation took around 0.000678s. The code would directly replace the previous block of code:

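if len(image.split()) == 4:
    # convert() returns a new image rather than modifying in place,
    # so the result must be assigned back.
    image = image.convert('RGB')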

This is the image both implementations were tested on:

I've submitted this change as PR#80.

Access is denied (Windows 10, Python 2.7/3.5)

I am getting an access-denied error when trying to do a basic operation.
I downloaded the Windows version of tesseract from here (UB-Mannheim: tesseract-ocr-setup-3.05.00dev.exe).

I tried moving tesseract from the C drive to the D drive, but it didn't help.

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\Program Files (x86)\Tesseract-OCR'
x = pytesseract.image_to_string(Image.open('cropped.png'))

OSError: [Errno 2] No such file or directory error

The image seems to open, but then when I run image_to_string, it throws an exception:

>>> i = Image.open('test.png')
>>> print(pytesseract.image_to_string(i))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 161, in image_to_string
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 94, in run_tesseract
    stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
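
Since Image.open succeeded and the traceback ends in subprocess's _execute_child, the missing file is most likely the tesseract executable itself rather than test.png. A minimal sketch for checking whether the binary is reachable on PATH, using only the standard library. find_tesseract is a hypothetical helper name, and the sketch is Python 3 (shutil.which is not available in the Python 2.7 shown in the report).

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first candidate found on PATH, or None.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or set tesseract_cmd")
else:
    print("tesseract found at", path)
    # Hypothetical usage:
    # pytesseract.pytesseract.tesseract_cmd = path
```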