
tesserocr's People

Contributors

aggiebill, belval, bertsky, betaboon, bpotard, diiigle, fladi, flip111, glqstrauss, johnthagen, lambdaterm, llazzaro, nijel, noahmetzger, norm-ideal, peterzhizhin, polaris-, ricardomga, rmast, simonflueckiger, sirfz, stweil, timgates42, tirkarthi, tysonite, zdenop

tesserocr's Issues

Not supporting Python 3?

I got an error when I ran this code:

from tesserocr import PyTessBaseAPI

print(tesserocr.tesseract_version()) # print tesseract-ocr version
print(tesserocr.get_languages() )


TypeError Traceback (most recent call last)
in ()
----> 1 from tesserocr import PyTessBaseAPI
2
3
4 print(tesserocr.tesseract_version()) # print tesseract-ocr version

/home/parallels/tesserocr/tesserocr/tesserocr.pyx in init tesserocr (tesserocr.cpp:22946)()
42 cdef unicode _abs_path = abspath(join(_api.GetDatapath(), os.pardir)) + os.sep
43 cdef unicode _lang_s = _api.GetInitLanguagesAsString()
---> 44 cdef cchar_t *_DEFAULT_PATH = _abs_path
45 cdef cchar_t *_DEFAULT_LANG = _lang_s
46 _api.End()

TypeError: expected bytes, str found
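
The traceback suggests a text/bytes mismatch: a text (`str`) value is being assigned to a C `const char *`, which on Python 3 needs `bytes`. Below is a minimal pure-Python sketch of the usual remedy, encoding text before it crosses into C. The helper name `to_bytes` and the example path are hypothetical and not part of tesserocr.

```python
import os


def to_bytes(value, encoding="utf-8"):
    """Return `value` as bytes, encoding it first if it is text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Build a path as text (what Python 3 string handling produces), then
# convert it to bytes, which is what a C `const char *` parameter expects.
data_path = os.path.join("usr", "share", "tessdata") + os.sep
raw_path = to_bytes(data_path)
print(type(data_path).__name__, "->", type(raw_path).__name__)
```

On Python 2 the same helper is a no-op for ordinary `str` values, so a wrapper built this way can serve both interpreters.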

Installation error - error: command 'gcc' failed with exit status 1

Cython is installed via conda.
My compiler version:
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4

The full error message:
running install
running bdist_egg
running egg_info
creating tesserocr.egg-info
writing tesserocr.egg-info/PKG-INFO
writing top-level names to tesserocr.egg-info/top_level.txt
writing dependency_links to tesserocr.egg-info/dependency_links.txt
writing manifest file 'tesserocr.egg-info/SOURCES.txt'
reading manifest file 'tesserocr.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'tesserocr.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
skipping 'tesserocr.cpp' Cython extension (up-to-date)
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-2.7
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/yonatan/anaconda/envs/scraper/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
tesserocr.cpp: In function ‘PyObject* pyx_pf_9tesserocr_14PyPageIterator_20SetBoundingBoxComponents(_pyx_obj_9tesserocr_PyPageIterator, bool, bool)’:
tesserocr.cpp:3933:25: error: ‘class tesseract::PageIterator’ has no member named ‘SetBoundingBoxComponents’
__pyx_v_self->_piter->SetBoundingBoxComponents(__pyx_v_include_upper_dots, pyx_v_include_lower_dots);
^
tesserocr.cpp: In function ‘PyObject
pyx_pf_9tesserocr_14PyPageIterator_34GetImage(pyx_obj_9tesserocr_PyPageIterator, tesseract::PageIteratorLevel, int, PyObject)’:
tesserocr.cpp:5195:125: error: no matching function for call to ‘tesseract::PageIterator::GetImage(tesseract::PageIteratorLevel&, int&, Pix
&, int
, int
)’
__pyx_v_pix = __pyx_v_self->_piter->GetImage(__pyx_v_level, __pyx_v_padding, __pyx_v_opix, (&_pyx_v_left), (&pyx_v_top));
^
tesserocr.cpp:5195:125: note: candidate is:
In file included from tesserocr.cpp:258:0:
/usr/include/tesseract/pageiterator.h:239:8: note: Pix
tesseract::PageIterator::GetImage(tesseract::PageIteratorLevel, int, int
, int
) const
Pix* GetImage(PageIteratorLevel level, int padding,
^
/usr/include/tesseract/pageiterator.h:239:8: note: candidate expects 4 arguments, 5 provided
tesserocr.cpp: In function ‘PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_74AnalyseLayout(_pyx_obj_9tesserocr_PyTessBaseAPI, bool)’:
tesserocr.cpp:15239:83: error: no matching function for call to ‘tesseract::TessBaseAPI::AnalyseLayout(bool&)’
__pyx_v_piter = __pyx_v_self->_baseapi.AnalyseLayout(_pyx_v_merge_similar_words);
^
tesserocr.cpp:15239:83: note: candidate is:
In file included from tesserocr.cpp:262:0:
/usr/include/tesseract/baseapi.h:489:17: note: tesseract::PageIterator
tesseract::TessBaseAPI::AnalyseLayout()
PageIterator* AnalyseLayout();
^
/usr/include/tesseract/baseapi.h:489:17: note: candidate expects 0 arguments, 1 provided
tesserocr.cpp: In function ‘tesseract::TessResultRenderer* __pyx_f_9tesserocr_13PyTessBaseAPI__get_renderer(_pyx_obj_9tesserocr_PyTessBaseAPI, _pyx_t_9tesseract_cchar_t)’:
tesserocr.cpp:15592:88: error: no matching function for call to ‘tesseract::TessHOcrRenderer::TessHOcrRenderer(_pyx_t_9tesseract_cchar_t&, bool&)’
__pyx_t_2 = new tesseract::TessHOcrRenderer(__pyx_v_outputbase, __pyx_v_font_info);
^
tesserocr.cpp:15592:88: note: candidates are:
In file included from tesserocr.cpp:261:0:
/usr/include/tesseract/renderer.h:175:3: note: tesseract::TessHOcrRenderer::TessHOcrRenderer()
TessHOcrRenderer();
^
/usr/include/tesseract/renderer.h:175:3: note: candidate expects 0 arguments, 2 provided
/usr/include/tesseract/renderer.h:173:16: note: tesseract::TessHOcrRenderer::TessHOcrRenderer(const tesseract::TessHOcrRenderer&)
class TESS_API TessHOcrRenderer : public TessResultRenderer {
^
/usr/include/tesseract/renderer.h:173:16: note: candidate expects 1 argument, 2 provided
tesserocr.cpp:15635:106: error: no matching function for call to ‘tesseract::TessPDFRenderer::TessPDFRenderer(pyx_t_9tesseract_cchar_t&, const char)’
__pyx_t_3 = new tesseract::TessPDFRenderer(__pyx_v_outputbase, __pyx_v_self->baseapi.GetDatapath());
^
tesserocr.cpp:15635:106: note: candidates are:
In file included from tesserocr.cpp:261:0:
/usr/include/tesseract/renderer.h:188:3: note: tesseract::TessPDFRenderer::TessPDFRenderer(const char
)
TessPDFRenderer(const char _datadir);
^
/usr/include/tesseract/renderer.h:188:3: note: candidate expects 1 argument, 2 provided
/usr/include/tesseract/renderer.h:186:16: note: tesseract::TessPDFRenderer::TessPDFRenderer(const tesseract::TessPDFRenderer&)
class TESS_API TessPDFRenderer : public TessResultRenderer {
^
/usr/include/tesseract/renderer.h:186:16: note: candidate expects 1 argument, 2 provided
tesserocr.cpp:15719:69: error: no matching function for call to ‘tesseract::TessUnlvRenderer::TessUnlvRenderer(_pyx_t_9tesseract_cchar_t&)’
__pyx_t_4 = new tesseract::TessUnlvRenderer(__pyx_v_outputbase);
^
tesserocr.cpp:15719:69: note: candidates are:
In file included from tesserocr.cpp:261:0:
/usr/include/tesseract/renderer.h:227:3: note: tesseract::TessUnlvRenderer::TessUnlvRenderer()
TessUnlvRenderer();
^
/usr/include/tesseract/renderer.h:227:3: note: candidate expects 0 arguments, 1 provided
/usr/include/tesseract/renderer.h:225:16: note: tesseract::TessUnlvRenderer::TessUnlvRenderer(const tesseract::TessUnlvRenderer&)
class TESS_API TessUnlvRenderer : public TessResultRenderer {
^
/usr/include/tesseract/renderer.h:225:16: note: no known conversion for argument 1 from ‘_pyx_t_9tesseract_cchar_t* {aka const char}’ to ‘const tesseract::TessUnlvRenderer&’
tesserocr.cpp:15803:72: error: no matching function for call to ‘tesseract::TessBoxTextRenderer::TessBoxTextRenderer(_pyx_t_9tesseract_cchar_t&)’
__pyx_t_5 = new tesseract::TessBoxTextRenderer(__pyx_v_outputbase);
^
tesserocr.cpp:15803:72: note: candidates are:
In file included from tesserocr.cpp:261:0:
/usr/include/tesseract/renderer.h:238:3: note: tesseract::TessBoxTextRenderer::TessBoxTextRenderer()
TessBoxTextRenderer();
^
/usr/include/tesseract/renderer.h:238:3: note: candidate expects 0 arguments, 1 provided
/usr/include/tesseract/renderer.h:236:16: note: tesseract::TessBoxTextRenderer::TessBoxTextRenderer(const tesseract::TessBoxTextRenderer&)
class TESS_API TessBoxTextRenderer : public TessResultRenderer {
^
/usr/include/tesseract/renderer.h:236:16: note: no known conversion for argument 1 from ‘_pyx_t_9tesseract_cchar_t* {aka const char}’ to ‘const tesseract::TessBoxTextRenderer&’
tesserocr.cpp:15887:69: error: no matching function for call to ‘tesseract::TessTextRenderer::TessTextRenderer(_pyx_t_9tesseract_cchar_t&)’
__pyx_t_6 = new tesseract::TessTextRenderer(__pyx_v_outputbase);
^
tesserocr.cpp:15887:69: note: candidates are:
In file included from tesserocr.cpp:261:0:
/usr/include/tesseract/renderer.h:164:3: note: tesseract::TessTextRenderer::TessTextRenderer()
TessTextRenderer();
^
/usr/include/tesseract/renderer.h:164:3: note: candidate expects 0 arguments, 1 provided
/usr/include/tesseract/renderer.h:162:16: note: tesseract::TessTextRenderer::TessTextRenderer(const tesseract::TessTextRenderer&)
class TESS_API TessTextRenderer : public TessResultRenderer {
^
/usr/include/tesseract/renderer.h:162:16: note: no known conversion for argument 1 from ‘pyx_t_9tesseract_cchar_t* {aka const char}’ to ‘const tesseract::TessTextRenderer&’
tesserocr.cpp: In function ‘PyObject
__pyx_pf_9tesserocr_13PyTessBaseAPI_106IsValidCharacter(_pyx_obj_9tesserocr_PyTessBaseAPI, _pyx_t_9tesseract_cchar_t)’:
tesserocr.cpp:18170:60: error: ‘class tesseract::TessBaseAPI’ has no member named ‘IsValidCharacter’
__pyx_t_1 = __Pyx_PyBool_FromLong(__pyx_v_self->_baseapi.IsValidCharacter(__pyx_v_character)); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2045; __pyx_clineno = __LINE
; goto pyx_L1_error;}
^
tesserocr.cpp:365:36: note: in definition of macro ‘__Pyx_PyBool_FromLong’
#define __Pyx_PyBool_FromLong(b) ((b) ? __Pyx_NewRef(Py_True) : __Pyx_NewRef(Py_False))
^
tesserocr.cpp: In function ‘void inittesserocr()’:
tesserocr.cpp:23205:67: error: ‘PSM_RAW_LINE’ is not a member of ‘tesseract’
__pyx_t_2 = __Pyx_PyInt_From_enum__tesseract_3a__3a_PageSegMode(tesseract::PSM_RAW_LINE); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 124; __pyx_clineno = __LINE
; goto __pyx_L1_error;}
^
error: command 'gcc' failed with exit status 1

No package 'tesseract' found

I tried to install tesserocr on Ubuntu and got the following error. I have already installed tesseract, so I do not know why it cannot be found.

Can someone help me out?

$tesseract -v
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
$ CPPFLAGS=-I/usr/lib pip install tesserocr
  Using cached tesserocr-2.1.2.tar.gz
    Complete output from command python setup.py egg_info:
    running egg_info
    creating pip-egg-info/tesserocr.egg-info
    writing pip-egg-info/tesserocr.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/tesserocr.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/tesserocr.egg-info/dependency_links.txt
    writing manifest file 'pip-egg-info/tesserocr.egg-info/SOURCES.txt'
    warning: manifest_maker: standard file '-c' not found

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-DEBtw3/tesserocr/setup.py", line 166, in <module>
        test_suite='tests'
      File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
        dist.run_commands()
      File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "/home/eijmmmp/BCReader/.virtEnv/local/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 195, in run
        self.find_sources()
      File "/home/eijmmmp/BCReader/.virtEnv/local/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 222, in find_sources
        mm.run()
      File "/home/eijmmmp/BCReader/.virtEnv/local/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 306, in run
        self.add_defaults()
      File "/home/eijmmmp/BCReader/.virtEnv/local/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 335, in add_defaults
        sdist.add_defaults(self)
      File "/home/eijmmmp/BCReader/.virtEnv/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 160, in add_defaults
        build_ext = self.get_finalized_command('build_ext')
      File "/usr/lib/python2.7/distutils/cmd.py", line 311, in get_finalized_command
        cmd_obj = self.distribution.get_command_obj(command, create)
      File "/usr/lib/python2.7/distutils/dist.py", line 846, in get_command_obj
        cmd_obj = self.command_obj[command] = klass(self)
      File "/home/eijmmmp/BCReader/.virtEnv/local/lib/python2.7/site-packages/setuptools/__init__.py", line 137, in __init__
        _Command.__init__(self, dist)
      File "/usr/lib/python2.7/distutils/cmd.py", line 64, in __init__
        self.initialize_options()
      File "/tmp/pip-build-DEBtw3/tesserocr/setup.py", line 120, in initialize_options
        build_args = package_config()
      File "/tmp/pip-build-DEBtw3/tesserocr/setup.py", line 59, in package_config
        raise Exception(error)
    Exception: Package tesseract was not found in the pkg-config search path.
    Perhaps you should add the directory containing `tesseract.pc'
    to the PKG_CONFIG_PATH environment variable
    No package 'tesseract' found
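
The exception says that pkg-config cannot locate `tesseract.pc`. That file normally ships with the Tesseract development package rather than with the `tesseract` binary, so having the command-line tool installed is not enough. A small sketch for checking whether the file exists anywhere pkg-config would plausibly look. The directory list is an assumption about common Linux layouts, and `find_pc_file` is a hypothetical helper.

```python
import os


def find_pc_file(package, search_dirs):
    """Return the first path to `<package>.pc` found in search_dirs, or None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, package + ".pc")
        if os.path.isfile(candidate):
            return candidate
    return None


# Assumed, commonly used locations; adjust for the actual system.
DEFAULT_DIRS = [
    "/usr/lib/pkgconfig",
    "/usr/lib/x86_64-linux-gnu/pkgconfig",
    "/usr/local/lib/pkgconfig",
    "/usr/share/pkgconfig",
]

# Directories listed in PKG_CONFIG_PATH are checked first.
extra = [d for d in os.environ.get("PKG_CONFIG_PATH", "").split(os.pathsep) if d]
print(find_pc_file("tesseract", extra + DEFAULT_DIRS))
```

If this prints `None`, installing the development headers (on Ubuntu this is typically `libtesseract-dev` together with `libleptonica-dev`) or pointing `PKG_CONFIG_PATH` at the directory that contains `tesseract.pc` is the likely next step.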

Segfault in pixa_to_list

Hi, I've encountered a segfault in pixa_to_list and I can reproduce it consistently. I don't have any idea how to fix this though.

This image here will always make tesserocr segfault:
fail

This image on the other hand works fine:
success

The code I'm using for testing is simple:

import tesserocr
from PIL import Image
import sys

print(tesserocr.tesseract_version())
print(tesserocr.get_languages())

png = Image.open(sys.argv[1]).convert('L')

# print(tesserocr.image_to_text(png))

with tesserocr.PyTessBaseAPI() as api:
    api.SetImage(png)
    boxes = api.GetComponentImages(tesserocr.RIL.WORD, True)
    for _, box, _, _ in boxes:
        pad = box['h'] * 0.2
        api.SetRectangle(box['x']-pad, box['y']-pad, box['w']+pad, box['h']+pad)
        text = api.GetUTF8Text().strip()
        confidence=api.MeanTextConf()
        print(text, confidence)

Here is a crash report from OS X: crashreport.txt

Here is the output from a successful run (including version numbers and so on): success.txt

I'm using tesseract 3.05.00 which I compiled myself as I had this problem with the 3.04 also and I thought maybe the new version would fix the issue.

Here are the relevant environment variables I used when I executed python setup.py install for tesserocr:

declare -x CFLAGS="-g -fno-omit-frame-pointer  -UNDEBUG -O0"
declare -x CPPFLAGS="-I/Users/otimpe/dev/tesseract-3.05.00/dist/include"
declare -x DYLD_LIBRARY_PATH="/Users/otimpe/dev/tesseract-3.05.00/dist/lib"
declare -x LDFLAGS="-L/Users/otimpe/dev/tesseract-3.05.00/dist/lib -g"
declare -x TESSDATA_PREFIX="/usr/local/share"

How to recognize single character

If I use an image with several characters, it works. However, it doesn't work if I want to recognize a single character. What should I do?
Thanks!
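
One thing worth trying is a different page segmentation mode: by default Tesseract runs page layout analysis, which can fail to find anything in an image that holds a single glyph. A minimal sketch, assuming the image contains exactly one character; `recognize_single_char` is an illustrative name, and results still depend on image quality and size.

```python
def recognize_single_char(image_path, lang="eng"):
    """OCR an image that is assumed to contain exactly one character."""
    # Imported lazily so the sketch can be loaded without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM

    # PSM.SINGLE_CHAR tells Tesseract to treat the whole image as one
    # character instead of running its default layout analysis.
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_CHAR) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text().strip()
```

Usage would be along the lines of `print(recognize_single_char("char.png"))`, where `char.png` is a placeholder file name.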

PDF support?

Could anybody provide an example of how to OCR a PDF?

Thanks guys!

'tesserocr.PyPageIterator' object has no attribute 'GetUTF8Text'

Hello,
I'm iterating over RIL.BLOCK and want to get the text of each BLOCK.
In tesserocr.pyx line 390 the following is written:

        >>> for e in iterate_level(api.AnalyseLayout(), RIL.WORD):
            ...     word = e.GetUTF8Text()

Unfortunately this does not work. I get the following error:

AttributeError: 'tesserocr.PyPageIterator' object has no attribute 'GetUTF8Text'
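
The error is consistent with `AnalyseLayout()` returning a layout-only page iterator, which carries no recognized text. A sketch of the alternative, assuming the goal is the text of each block: run recognition first and walk the result iterator returned by `GetIterator()`. The function name `block_texts` is illustrative.

```python
def block_texts(image_path):
    """Return the recognized text of every block in the image."""
    # Imported lazily so the sketch can be loaded without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    texts = []
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()  # run OCR; layout analysis alone produces no text
        result_iter = api.GetIterator()
        if result_iter is None:  # nothing was recognized
            return texts
        for block in iterate_level(result_iter, RIL.BLOCK):
            # The result iterator takes the level whose text is wanted.
            texts.append(block.GetUTF8Text(RIL.BLOCK))
    return texts
```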

build fails on freeBSD

Apparently it does not find the leptonica headers, which are in /usr/local/include on BSD systems, so I guess this might be a problem on OS X too.

With pip the issue can be worked around by setting CPPFLAGS:

CPPFLAGS=-I/usr/local/include pip install git+https://github.com/sirfz/tesserocr.git

setup.py takes an -I parameter:

 python setup.py build build_ext -I/usr/local/include

Output of the failing pip install:

Collecting git+https://github.com/sirfz/tesserocr.git
  Cloning https://github.com/sirfz/tesserocr.git to /tmp/pip-uq2glx-build
Installing collected packages: tesserocr
  Running setup.py install for tesserocr ... error
    Complete output from command /usr/home/[...]/venv/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-uq2glx-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-Z6nb_e-record/install-record.txt --single-version-externally-managed --compile --install-headers /usr/home[...]/venv/include/site/python2.7/tesserocr:
    /usr/home/ub/work/artfacts-scanner/venv/lib/python2.7/site-packages/setuptools/dist.py:285: UserWarning: Normalizing '2.0.2-beta' to '2.0.2b0'
      normalized_version,
    running install
    running build
    running build_ext
    building 'tesserocr' extension
    creating build
    creating build/temp.freebsd-10.3-RELEASE-p3-amd64-2.7
    cc -fno-strict-aliasing -O2 -pipe -fstack-protector -fno-strict-aliasing -DNDEBUG -fPIC -I/usr/local/include/python2.7 -c tesserocr.cpp -o build/temp.freebsd-10.3-RELEASE-p3-amd64-2.7/tesserocr.o
    tesserocr.cpp:248:10: fatal error: 'leptonica/allheaders.h' file not found
    #include "leptonica/allheaders.h"
             ^
    1 error generated.
    error: command 'cc' failed with exit status 1

    ----------------------------------------
Command "/usr/home/[...]/venv/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-uq2glx-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-Z6nb_e-record/install-record.txt --single-version-externally-managed --compile --install-headers /usr/home/ub/[...]/venv/include/site/python2.7/tesserocr" failed with error code 1 in /tmp/pip-uq2glx-build

Error while executing Python as a script

Hi guys,
I was just experimenting with the API and found something that I think is an issue.
Instead of the script executing, a weird cursor (shaped like a plus symbol) appears and everything freezes.
Screenshot:
https://s13.postimg.org/ennhe0e1j/IMG_20170213_113348.jpg

Steps to replicate:
This is the script I wrote http://pastebin.com/JbFR7MaG
Instead of running the script the usual way with python script.py, I first made it executable with chmod +x script.py.
I then executed the script by doing ./script.py image.png
The script doesn't get past the import statement and stops with the + shaped cursor.

Is this an issue? Or am I doing something wrong?

Thanks,
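
A plus-shaped cursor is what ImageMagick's `import` command shows while it waits for a screen selection. One likely explanation (an assumption, since the script is only linked) is that the file has no shebang line, so the shell runs it as a shell script and executes `import` itself. A sketch of a script layout that lets `./script.py image.png` run under Python; the `main` function is illustrative.

```python
#!/usr/bin/env python
"""Executable OCR script; the shebang must be the very first line of the file."""
import sys


def main(image_path):
    # Imported here so the script header stays minimal in this sketch.
    import tesserocr

    print(tesserocr.file_to_text(image_path))


# Only run when invoked as `./script.py IMAGE`.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

With the shebang in place, `chmod +x script.py` followed by `./script.py image.png` hands the file to the Python interpreter instead of the shell.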

AttributeError when calling SetImage() (python 3)

In Python 3.5.2 (in an ipython console) I've copied the file eurotext.tif from this repository to my working directory. I get an error trying to work with that image:

In [50]: from tesserocr import PyTessBaseAPI

In [51]: with PyTessBaseAPI as api:
    ...:     api.SetImageFile('eurotext.tif')
    ...:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-51-252118aec8ba> in <module>()
----> 1 with PyTessBaseAPI as api:
      2     api.SetImageFile('eurotext.tif')
      3

AttributeError: __exit__

Also trying to use it directly:

In [52]: tesseract = PyTessBaseAPI()

In [53]: tesseract.SetImage('eurotext.tif')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-53-661b17ef8a1f> in <module>()
----> 1 tesseract.SetImage('eurotext.tif')

tesserocr.pyx in tesserocr.PyTessBaseAPI.SetImage (tesserocr.cpp:13256)()

tesserocr.pyx in tesserocr._image_buffer (tesserocr.cpp:2916)()

tesserocr.pyx in tesserocr._image_buffer (tesserocr.cpp:2780)()

AttributeError: 'str' object has no attribute 'save'

I'm able to open the image in PIL so it's a valid image:

In [54]: from PIL import Image

In [55]: im = Image.open('eurotext.tif')

In [56]: im
Out[56]: <PIL.TiffImagePlugin.TiffImageFile image mode=1 size=1024x800 at 0x7F7E100BF5F8>

What's going on here? Thanks in advance.

Here's what I have installed if that's helpful.

Cython==0.25.1
dask==0.12.0
decorator==4.0.10
ipython==5.1.0
ipython-genutils==0.1.0
networkx==1.11
numpy==1.11.2
pexpect==4.2.1
pickleshare==0.7.4
Pillow==3.4.2
prompt-toolkit==1.0.9
ptyprocess==0.5.1
Pygments==2.1.3
scikit-image==0.12.3
scipy==0.18.1
simplegeneric==0.8.1
six==1.10.0
tesserocr==2.1.3
toolz==0.8.1
traitlets==4.3.1
wcwidth==0.1.7

Also I am able to run tesserocr's tests (python3 setup.py test) without any errors so I think tesserocr is installed ok.
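
Both tracebacks look like usage problems rather than a Python 3 incompatibility. `with PyTessBaseAPI as api:` uses the class itself, not an instance (the parentheses are missing), and `SetImage` expects a PIL image, whereas `SetImageFile` takes a path. A sketch of the two working variants; the function names are illustrative.

```python
def ocr_from_path(image_path):
    """OCR an image file by path."""
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI() as api:  # note the parentheses: an instance is needed
        api.SetImageFile(image_path)
        return api.GetUTF8Text()


def ocr_from_pil(image_path):
    """OCR an image that is first loaded with PIL."""
    from PIL import Image
    from tesserocr import PyTessBaseAPI

    with Image.open(image_path) as image, PyTessBaseAPI() as api:
        api.SetImage(image)  # takes a PIL Image, not a file name
        return api.GetUTF8Text()
```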

pip install tesserocr in CentOS

Hi,

I have a problem installing tesserocr in CentOS.

It gives me this error

Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
  cythoning tesserocr.pyx to tesserocr.cpp
  building 'tesserocr' extension
  creating build
  creating build/temp.linux-x86_64-2.7
  gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include -I/usr/local/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  In file included from /usr/local/include/tesseract/genericvector.h:29:0,
                   from tesserocr.cpp:417:
  /usr/local/include/tesseract/helpers.h: In member function ‘void tesseract::TRand::set_seed(const string&)’:
  /usr/local/include/tesseract/helpers.h:50:5: error: ‘hash’ is not a member of ‘std’
       std::hash<std::string> hasher;
       ^
  /usr/local/include/tesseract/helpers.h:50:26: error: expected primary-expression before ‘>’ token
       std::hash<std::string> hasher;
                            ^
  /usr/local/include/tesseract/helpers.h:50:28: error: ‘hasher’ was not declared in this scope
       std::hash<std::string> hasher;
                              ^
  error: command 'gcc' failed with exit status 1

It worked just fine on Ubuntu. Any idea what's wrong?

I thought it might be a problem with the gcc version, so I updated it to 5.4.0, but that didn't help.

Error while performing OCR on image

Hello,
I am using the following code in order to perform OCR on an image (attached):

from tesserocr import PyTessBaseAPI
from PIL import Image

DEFAULT_LANGUAGE = "spa"
filePath = "/home/jorge/Desktop/prueba_tess/c1.png"

if __name__ == '__main__':
    img = Image.open(filePath)
    tesseract = PyTessBaseAPI(lang=DEFAULT_LANGUAGE)
    tesseract.SetImage(img)
    tesseract.Recognize()
    print(tesseract.GetUTF8Text())
    tesseract.End()

but what I am getting with this particular image in console is the following:
start >= 0 && start + num <= length_:Error:Assert failed:in file ratngs.cpp, line 321

Here is what I am using
tesserocr 2.2.1
tesseract 3.04.01
leptonica-1.71
libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

I think I have everything set up correctly because I have extracted text from other images, but this one throws that "error".
Any help is appreciated.
Thanks in advance!
c1

2.2.0 fails to build with tesseract 3.02.01

Compilation on a slightly older system (Ubuntu Precise, used on Travis CI) fails:

  Supporting tesseract v3.04.01
  Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 197633}, 'library_dirs': ['/home/travis/build/WeblateOrg/weblate/.tesseract/lib'], 'include_dirs': ['/home/travis/build/WeblateOrg/weblate/.tesseract/include']}
  running bdist_wheel
  running build
  running build_ext
  building 'tesserocr' extension
  creating build
  creating build/temp.linux-x86_64-2.7
  gcc -pthread -fno-strict-aliasing -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/travis/build/WeblateOrg/weblate/.tesseract/include -I/opt/python/2.7.9/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
  cc1plus: error: unrecognized command line option ‘-std=c++11’
  error: command 'gcc' failed with exit status 1

The problem seems to be in setup.py line 130: it compares TESSERACT_VERSION, which is 197633 at this point, with 4, while it should probably compare with 0x40000.
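
To make the mismatch concrete, here is a small sketch that decodes the packed number. It assumes the packing is `(major << 16) | (minor << 8) | patch`, which matches 197633 = 0x030401 alongside the "v3.04.01" line in the log; `decode_version` is a hypothetical helper, not code from setup.py.

```python
def decode_version(packed):
    """Split a packed version such as 0x030401 into (major, minor, patch)."""
    return (packed >> 16) & 0xFF, (packed >> 8) & 0xFF, packed & 0xFF


packed = 197633  # value of TESSERACT_VERSION reported by the build
print(decode_version(packed))  # (3, 4, 1)
print(packed >= 4)             # True: comparing with the bare major number misfires
print(packed >= 0x40000)       # False: 0x40000 is the packed form of 4.0.0
```

Comparing against the bare integer 4 is true for every packed version, which would explain why `-std=c++11` is passed even for Tesseract 3.x.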

Library crash on an image of 1px size

Hello!

I have an image which consists of one pixel (1x1). When I try to call:

api.SetImage(image)
all_lines = api.GetComponentImages(RIL.TEXTLINE, True)

The Python process crashes:

*** Error in `/home/peter/Projects/ContentTagging/env/bin/python': munmap_chunk(): invalid pointer: 0x00007f1d296a54b0 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x722ab)[0x7f1d322a92ab]
/usr/lib/libc.so.6(+0x7890e)[0x7f1d322af90e]
/home/peter/Projects/ContentTagging/env/lib/python3.6/site-packages/tesserocr.cpython-36m-x86_64-linux-gnu.so(+0x1bb3d)[0x7f1d28448b3d]
/usr/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x12c)[0x7f1d31e309bc]
/usr/lib/libpython3.6m.so.1.0(+0x168bdd)[0x7f1d31e3fbdd]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(+0x16853a)[0x7f1d31e3f53a]
/usr/lib/libpython3.6m.so.1.0(+0x168af3)[0x7f1d31e3faf3]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(+0x16853a)[0x7f1d31e3f53a]
/usr/lib/libpython3.6m.so.1.0(+0x168af3)[0x7f1d31e3faf3]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(+0x16853a)[0x7f1d31e3f53a]
/usr/lib/libpython3.6m.so.1.0(+0x168af3)[0x7f1d31e3faf3]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(PyEval_EvalCodeEx+0x277)[0x7f1d31e3ff47]
/usr/lib/libpython3.6m.so.1.0(PyEval_EvalCode+0x1b)[0x7f1d31dfba5b]
/usr/lib/libpython3.6m.so.1.0(+0x11c871)[0x7f1d31df3871]
/usr/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x8f)[0x7f1d31e3091f]
/usr/lib/libpython3.6m.so.1.0(+0x168bdd)[0x7f1d31e3fbdd]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(+0x167291)[0x7f1d31e3e291]
/usr/lib/libpython3.6m.so.1.0(+0x16878a)[0x7f1d31e3f78a]
/usr/lib/libpython3.6m.so.1.0(+0x168af3)[0x7f1d31e3faf3]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(+0x167291)[0x7f1d31e3e291]
/usr/lib/libpython3.6m.so.1.0(+0x16878a)[0x7f1d31e3f78a]
/usr/lib/libpython3.6m.so.1.0(+0x168af3)[0x7f1d31e3faf3]
/usr/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x317)[0x7f1d31dfbd77]
/usr/lib/libpython3.6m.so.1.0(PyEval_EvalCodeEx+0x277)[0x7f1d31e3ff47]
/usr/lib/libpython3.6m.so.1.0(PyEval_EvalCode+0x1b)[0x7f1d31dfba5b]
/usr/lib/libpython3.6m.so.1.0(+0x1eddc2)[0x7f1d31ec4dc2]
/usr/lib/libpython3.6m.so.1.0(PyRun_FileExFlags+0x9d)[0x7f1d31ec762d]
/usr/lib/libpython3.6m.so.1.0(PyRun_SimpleFileExFlags+0x1a7)[0x7f1d31ec7817]
/usr/lib/libpython3.6m.so.1.0(Py_Main+0x6b1)[0x7f1d31ebc6f1]
/home/peter/Projects/ContentTagging/env/bin/python(main+0xfd)[0x400a5d]
/usr/lib/libc.so.6(__libc_start_main+0xf1)[0x7f1d32257511]
/home/peter/Projects/ContentTagging/env/bin/python(_start+0x2a)[0x400b9a]

I am using version 2.1.3 with Python 3.6.0
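
Until the crash itself is fixed, a caller could sidestep it by skipping degenerate images before they reach Tesseract. A minimal sketch of such a guard; `is_ocr_candidate`, `safe_component_images` and the 3-pixel threshold are assumptions for illustration, not tesserocr API or a documented limit.

```python
def is_ocr_candidate(width, height, min_side=3):
    """Return True when both sides are at least `min_side` pixels."""
    return width >= min_side and height >= min_side


def safe_component_images(api, image, level):
    """Return component images for `image`, or [] if it is too small to OCR."""
    width, height = image.size  # PIL images expose (width, height)
    if not is_ocr_candidate(width, height):
        return []
    api.SetImage(image)
    return api.GetComponentImages(level, True)
```

With an open `PyTessBaseAPI` instance this would be called as `safe_component_images(api, image, RIL.TEXTLINE)`.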

RuntimeError: Error reading image

I checked out this project and ran test_api.py.

All tests named test_image* fail:

tesserocr/tests/test_api.py", line 70, in test_image_file
    self._api.SetImageFile(self._image_file)
  File "tesserocr.pyx", line 1545, in tesserocr.PyTessBaseAPI.SetImageFile (tesserocr.cpp:13568)
    raise RuntimeError('Error reading image')
RuntimeError: Error reading image

image_to_text recognizes more text than using the PyTessBaseAPI

This is kind of weird. The part of the document I am trying to OCR is the following:
image

image_to_text produces the following:

Test Results

Latest Result

 

, SINGAPORE

Parameter Unit Outcome

08-Jun-2017

17-021396-01
Viscosity @ 50°C cSt 367.8
Density @ 15°C kg/m® 989.0
Sulphur % (m/m) 2.69
Flash Point °C >67.0
Acid Number mg KOH /g 0.09
Total Sediment Ace. % (m/m) 0.05
Micro Carbon Residue % (m/m) 15.34
Pour Point °C <21
Water content % (V/V) 0.39
Ash % (m/m) 0.039
Vanadium mg/kg 112
Sodium mg/kg 21
Calcium mg/kg 9
Zinc mg/kg <1
Phosphorus mg/kg <1
Iron mg/kg 31
Nickel mg/kg 35
Magnesium mg/kg <1
Potassium mg/kg <1
Silicon mg/kg 11
Aluminium mg/kg 7
Aluminium + Silicon mg/kg 18
Quantity MT
Quantity loss/gain MT
CCAl 850
Net Specific Energy MJ/kg 40.18

When I perform the same operation using PyTessBaseAPI, I get the following bounding boxes:
image

The problem is that I need the relative positions of the values in order to infer key-to-value pairs, and I perform the same action for multiple documents, so I cannot (and don't want to) intervene manually. I cannot understand why 7 and 9 are not recognized. On top of this, I got far fewer results when I set the segmentation mode to AUTO (the result above was produced with SPARSE_TEXT). Is there a solution to this absurd problem, or is it a matter of luck?

L_SEVERITY_NONE was not declared in this scope

I am trying to install tesserocr with pip and pip3 on Debian stretch and get the following error:

# pip3 install tesserocr
Collecting tesserocr
  Using cached tesserocr-2.1.3.tar.gz
Building wheels for collected packages: tesserocr
  Running setup.py bdist_wheel for tesserocr ... error
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-3coz3mk5/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpmrdg6puopip-wheel- --python-tag cp35:
  running bdist_wheel
  running build
  running build_ext
  Supporting tesseract v3.04.01
  Configs from pkg-config: {'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 197633}, 'include_dirs': ['/usr/include']}
  cythoning tesserocr.pyx to tesserocr.cpp
  building 'tesserocr' extension
  creating build
  creating build/temp.linux-x86_64-3.5
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fdebug-prefix-map=/build/python3.5-MLq5fN/python3.5-3.5.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.5m -c tesserocr.cpp -o build/temp.linux-x86_64-3.5/tesserocr.o
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  tesserocr.cpp: In function ‘PyObject* PyInit_tesserocr()’:
  tesserocr.cpp:24651:18: error: ‘L_SEVERITY_NONE’ was not declared in this scope
     setMsgSeverity(L_SEVERITY_NONE);
                    ^~~~~~~~~~~~~~~
  tesserocr.cpp:24651:33: error: ‘setMsgSeverity’ was not declared in this scope
     setMsgSeverity(L_SEVERITY_NONE);
                                   ^
  error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
  
  ----------------------------------------
  Failed building wheel for tesserocr
  Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr
  Running setup.py install for tesserocr ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-3coz3mk5/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-j2y3y173-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    Supporting tesseract v3.04.01
    Configs from pkg-config: {'cython_compile_time_env': {'TESSERACT_VERSION': 197633}, 'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract']}
    skipping 'tesserocr.cpp' Cython extension (up-to-date)
    building 'tesserocr' extension
    creating build
    creating build/temp.linux-x86_64-3.5
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fdebug-prefix-map=/build/python3.5-MLq5fN/python3.5-3.5.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.5m -c tesserocr.cpp -o build/temp.linux-x86_64-3.5/tesserocr.o
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    tesserocr.cpp: In function ‘PyObject* PyInit_tesserocr()’:
    tesserocr.cpp:24651:18: error: ‘L_SEVERITY_NONE’ was not declared in this scope
       setMsgSeverity(L_SEVERITY_NONE);
                      ^~~~~~~~~~~~~~~
    tesserocr.cpp:24651:33: error: ‘setMsgSeverity’ was not declared in this scope
       setMsgSeverity(L_SEVERITY_NONE);
                                     ^
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-3coz3mk5/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-j2y3y173-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-3coz3mk5/tesserocr/

Pip install fails on linux virtual environment

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-lqe1qy/tesserocr/

Collecting tesserocr
Downloading tesserocr-2.1.1.tar.gz (47kB)
100% |████████████████████████████████| 51kB 3.1MB/s
Complete output from command python setup.py egg_info:
running egg_info
creating pip-egg-info/tesserocr.egg-info
writing pip-egg-info/tesserocr.egg-info/PKG-INFO
writing dependency_links to pip-egg-info/tesserocr.egg-info/dependency_links.txt
writing top-level names to pip-egg-info/tesserocr.egg-info/top_level.txt
writing manifest file 'pip-egg-info/tesserocr.egg-info/SOURCES.txt'
warning: manifest_maker: standard file '-c' not found

Package lept was not found in the pkg-config search path.
Perhaps you should add the directory containing `lept.pc'
to the PKG_CONFIG_PATH environment variable
No package 'lept' found
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-lqe1qy/tesserocr/setup.py", line 163, in <module>
    test_suite='tests'
  File "/opt/rh/python33/root/usr/lib64/python3.3/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/rh/python33/root/usr/lib64/python3.3/distutils/dist.py", line 929, in run_commands
    self.run_command(cmd)
  File "/opt/rh/python33/root/usr/lib64/python3.3/distutils/dist.py", line 948, in run_command
    cmd_obj.run()
  File "/var/lib/openshift/573147277628e1669600012d/python/virtenv/venv/lib/python3.3/site-packages/setuptools-22.0.5-py3.3.egg/setuptools/command/egg_info.py", line 193, in run
  File "/var/lib/openshift/573147277628e1669600012d/python/virtenv/venv/lib/python3.3/site-packages/setuptools-22.0.5-py3.3.egg/setuptools/command/egg_info.py", line 216, in find_sources
  File "/var/lib/openshift/573147277628e1669600012d/python/virtenv/venv/lib/python3.3/site-packages/setuptools-22.0.5-py3.3.egg/setuptools/command/egg_info.py", line 300, in run
  File "/var/lib/openshift/573147277628e1669600012d/python/virtenv/venv/lib/python3.3/site-packages/setuptools-22.0.5-py3.3.egg/setuptools/command/egg_info.py", line 329, in add_defaults
  File "/var/lib/openshift/573147277628e1669600012d/python/virtenv/venv/lib/python3.3/site-packages/setuptools-22.0.5-py3.3.egg/setuptools/command/sdist.py", line 132, in add_defaults
  File "/opt/rh/python33/root/usr/lib64/python3.3/distutils/cmd.py", line 298, in get_finalized_command
    cmd_obj = self.distribution.get_command_obj(command, create)
  File "/opt/rh/python33/root/usr/lib64/python3.3/distutils/dist.py", line 821, in get_command_obj
    cmd_obj = self.command_obj[command] = klass(self)
  File "/var/lib/openshift/573147277628e1669600012d/python/virtenv/venv/lib/python3.3/site-packages/setuptools-22.0.5-py3.3.egg/setuptools/__init__.py", line 132, in __init__
  File "/opt/rh/python33/root/usr/lib64/python3.3/distutils/cmd.py", line 62, in __init__
    self.initialize_options()
  File "/tmp/pip-build-lqe1qy/tesserocr/setup.py", line 117, in initialize_options
    build_args = package_config()
  File "/tmp/pip-build-lqe1qy/tesserocr/setup.py", line 72, in package_config
    opt = options[f[:2]]
KeyError: '-g'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-lqe1qy/tesserocr/
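The KeyError: '-g' above comes from a lookup that assumes every flag printed by pkg-config starts with a known two-character prefix. A minimal sketch of a more tolerant parser follows; the function name and the option table are assumptions for illustration, not the project's actual setup.py code. It would not fix the separate "No package 'lept' found" message, which points at a missing leptonica development package.

```python
def parse_pkg_config_flags(output):
    """Sort pkg-config flags into build options, skipping unknown flags."""
    # Hypothetical mapping from flag prefix to distutils-style option name.
    options = {'-I': 'include_dirs', '-L': 'library_dirs', '-l': 'libraries'}
    config = {}
    for flag in output.split():
        key = options.get(flag[:2])
        if key is None:
            # Flags such as '-g' or '-pthread' carry no path or library name.
            continue
        config.setdefault(key, []).append(flag[2:])
    return config


flags = parse_pkg_config_flags('-I/usr/include -g -ltesseract -llept')
```

Using dict.get instead of indexing means an unexpected flag is ignored rather than aborting the whole build configuration.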

build fails on osx

$ pip install ./tesserocr --upgrade
Processing ./tesserocr
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/var/folders/1f/8hp6xg6j5wn8pzb94y56x4f00000gn/T/pip-kjLPF8-build/setup.py", line 6, in <module>
        from Cython.Distutils import build_ext
    ImportError: No module named Cython.Distutils

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /var/folders/1f/8hp6xg6j5wn8pzb94y56x4f00000gn/T/pip-kjLPF8-build/

git bisect

ff1c3c764269864c731c305a59e1d03a9a6ac821 is the first bad commit
commit ff1c3c764269864c731c305a59e1d03a9a6ac821
Author: FZ <[email protected]>
Date:   Sat May 21 20:10:04 2016 +0300

    setup now requires Cython and passes tesseract version as cython_compile_time_env

:100644 100644 3edde89bbaec3781147180d476778fb51ff15f6f 1153e1d0f3f30149cb0c98a419a4355e0b15e10a M  setup.py
:000000 100644 0000000000000000000000000000000000000000 0ffc837bfff77266e17c0f48bb0fe52bb1733df7 A  tesseractversion.pyx
:100644 000000 32375e79fcba5805a9654f036f32f0b224cdded6 0000000000000000000000000000000000000000 D  tesserocr.cpp

crash on rotated image

reubano@tokpro [~]⚡ convert eurotext.tif -rotate 3 +repage eurotext_ang.tif
reubano@tokpro [~]⚡ tesseract eurotext_ang.tif - -psm 0 
Orientation: 0
Orientation in degrees: 0
Orientation confidence: 20.66
Script: 1
Script confidence: 39.58

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM

image = Image.open('eurotext_ang.tif')

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    api.SetImage(image)
    api.Recognize()
    it = api.AnalyseLayout()
    it.Orientation()

output

AttributeError: 'NoneType' object has no attribute 'Orientation'
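Until the underlying cause is understood, calling code can at least avoid the AttributeError by checking the iterator before using it. The sketch below assumes api is an initialised PyTessBaseAPI with an image already set; the helper name is hypothetical, and it only guards against the None result rather than fixing the detection itself.

```python
def layout_orientation(api):
    """Return the result of Orientation(), or None if no layout was found."""
    it = api.AnalyseLayout()
    if it is None:
        # AnalyseLayout() returned nothing, as in the report above.
        return None
    return it.Orientation()
```

A caller can then treat a None return as "orientation unknown" instead of crashing.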

Segfault when tessdata are missing

If tessdata is not installed or TESSDATA_PREFIX is wrong, Python with tesserocr segfaults. I think it should fail with an error instead.

The command line tesseract handles this gracefully with error:

Error opening data file /invalid/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

To reproduce this, just set the environment variable TESSDATA_PREFIX to a non-existent directory.
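As a stopgap on the Python side, the data file can be checked before the API object is created. This is a minimal sketch under assumptions: the helper names are hypothetical, and it looks both in prefix/tessdata and in prefix itself because the expected layout differs between tesseract versions.

```python
import os


def find_traineddata(prefix, lang='eng'):
    """Return the path of <lang>.traineddata under prefix, or None."""
    candidates = [
        os.path.join(prefix, 'tessdata', lang + '.traineddata'),
        os.path.join(prefix, lang + '.traineddata'),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def require_traineddata(prefix, lang='eng'):
    """Raise a regular Python error instead of letting the library crash."""
    path = find_traineddata(prefix, lang)
    if path is None:
        raise RuntimeError(
            'No %s.traineddata found under %r' % (lang, prefix))
    return path
```

Calling require_traineddata(os.environ.get('TESSDATA_PREFIX', '')) before constructing PyTessBaseAPI would turn the situation described above into an ordinary exception.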

Failed to install Tesserocr

Hi, I tried to install Tesseract 4.0 on Ubuntu as described on the website.
Everything worked fine except for the last command, pip install tesserocr, although I already have Python 2.7 installed.
I'm attaching a snapshot of the error I get when I run the above command.
Any ideas on how to solve this issue?
Thanks

Python tesserocr image_to_text Runtime Error:failed to read picture

Platform: Ubuntu 15.1 / Python 2.7. This is the demo I used, from https://github.com/sirfz/tesserocr

import tesserocr
from PIL import Image

print tesserocr.tesseract_version()  # print tesseract-ocr version
print tesserocr.get_languages()  # prints tessdata path and list of available languages

image = Image.open('03.jpg')  # I verified that the file and directory are correct
print tesserocr.image_to_text(image)  # print ocr text from image
# or
print tesserocr.file_to_text('03.jpg')

The above produces this output:

tesseract 3.04.00
 leptonica-1.73
  zlib 1.2.8

(u'/usr/share/tesseract-ocr/tessdata/', [u'eng', u'osd', u'equ'])
Traceback (most recent call last):
  File "testImage.py", line 8, in <module>
    print tesserocr.image_to_text(image)  # print ocr text from image
  File "tesserocr.pyx", line 2281, in tesserocr.image_to_text (tesserocr.cpp:20529)
RuntimeError: Failed to read picture

Does it support a whitelist?

How can I add a whitelist for recognition? For example, if I know my image contains only digits, how can I limit the result to [0-9]?
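One way to approach this is Tesseract's tessedit_char_whitelist variable, set through SetVariable on the API object. The sketch below is illustrative only: the two helper names are hypothetical, and whether the whitelist is honoured may depend on the Tesseract version and engine in use.

```python
def restrict_to_digits(api):
    """Limit recognition to the characters 0-9 on an initialised API object."""
    api.SetVariable('tessedit_char_whitelist', '0123456789')


def digits_from_file(path):
    """Run OCR on an image file with the digit whitelist applied."""
    # Imported here so restrict_to_digits() can be used without tesserocr.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI() as api:
        restrict_to_digits(api)
        api.SetImageFile(path)
        return api.GetUTF8Text()
```

Setting the variable before the image is recognised keeps the restriction local to that API instance.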

setup.py doesn't recognize tesseract 4 (the result of tesseract -v is printed in stdout)

In my tesseract installation (version 4.00), running tesseract -v prints the result to stdout, so the version read in get_tesseract_version() in setup.py is always an empty string.

A quick fix could be:

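import subprocess

# Capture both streams: some builds print the version to stderr, 4.00 prints it to stdout.
p = subprocess.Popen(['tesseract', '-v'], stderr=subprocess.PIPE, stdout=subprocess.PIPE)
stdout_version, version = p.communicate()
if not version:  # empty str on Python 2, empty bytes on Python 3
    version = stdout_version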

Segmentation fault on GetUTF8Text()

Hi.
Thanks for your previous messages.

I have just tested with your image file again.

Unfortunately, it reports "Segmentation fault (core dumped)" after finding 12 text lines.
The error occurred on the line ocrResult = api.GetUTF8Text().

It seems I have not installed tesserocr properly.

I am eager to hear from you soon.
Thanks.

Build failed on Ubuntu

Hi, I am trying to install tesserocr on Ubuntu 14.04 x64.

Unfortunately, the installation failed.
Here is the output:

ubuntu@ubuntu-MS:~/tmp/tesserocr$ sudo python setup.py build_ext -I/usr/local/include
running build_ext
Traceback (most recent call last):
  File "setup.py", line 166, in <module>
    test_suite='tests'
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 970, in run_command
    cmd_obj = self.get_command_obj(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 846, in get_command_obj
    cmd_obj = self.command_obj[command] = klass(self)
  File "/usr/lib/python2.7/dist-packages/setuptools/__init__.py", line 82, in __init__
    _Command.__init__(self,dist)
  File "/usr/lib/python2.7/distutils/cmd.py", line 64, in __init__
    self.initialize_options()
  File "setup.py", line 120, in initialize_options
    build_args = package_config()
  File "setup.py", line 59, in package_config
    raise Exception(error)
Exception: Package tesseract was not found in the pkg-config search path.
Perhaps you should add the directory containing `tesseract.pc'
to the PKG_CONFIG_PATH environment variable
No package 'tesseract' found

Any idea?

Restricting character set

Is it possible to restrict the character set which is recognized?
The tesseract project already has configuration files for such things (see e.g. how-to-recognize-only-digits), but I wasn't able to figure out how to do this with this project.

Add support for PSM mode 0

I found some code on the forum and a blog that may help:

OSResults *orientationStruct = new OSResults();
bool gotOrientation = myTess->DetectOS(orientationStruct);
int bestOrientation = -1;
float bestOrientationScore = 0;

if ((gotOrientation) && (orientationStruct->orientations != NULL)) {
    for (int i=0; i<4; i++) {
        if (orientationStruct->orientations[i] > bestOrientationScore) {
            bestOrientation = i;
            bestOrientationScore = orientationStruct->orientations[i];
        }
    }
}

// This is the result we were asked for
results.textOrientation = bestOrientation;

#include <tesseract/baseapi.h>
#include <tesseract/osdetect.h>
#include <leptonica/allheaders.h>

int main(int argc, char **argv) {  
    OSResults os_results;
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Assumption: the image path is passed as the first command-line argument.
    Pix *image = pixRead(argv[1]);
    api->SetImage(image);

    // To detect correct OS and flip images
    api->DetectOS(&os_results);
    OrientationDetector os_detector = OrientationDetector(&os_results);
    int correct_orientation = os_detector.get_orientation();

    // Had to add this condition because get_orientation result and
    // pixRotateOrth were not in sync.
    if (correct_orientation == 1) {
        image = pixRotate90(image, -1);
    }
    else if (correct_orientation == 3) {
        image = pixRotate90(image, 1);
    }
    else if (correct_orientation == 2) {
        pixRotate180(image, image);
    }

    api->SetImage(image);
    char* ocrResult = api->GetUTF8Text();
    fprintf(stdout, "Recognized Text: %s\n", ocrResult);
    api->End();    
    pixDestroy(&image);
    delete [] ocrResult;
    return 0;
}
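The selection loop in the first snippet can be expressed in Python as a rough sketch. It assumes the four orientation scores are already available as a plain sequence; how a binding would expose them is not covered here, and the function name is hypothetical.

```python
def best_orientation(scores):
    """Return the index of the highest positive score, or -1 if none."""
    best_index = -1
    best_score = 0.0
    for i, score in enumerate(scores):
        if score > best_score:
            best_index = i
            best_score = score
    return best_index
```

As in the C++ version, -1 signals that no orientation scored above zero.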

Can't install tesserocr with pip -r

Hi, I want to install tesserocr with pip install -r requirements.txt (actually while running tests with tox), but this scenario doesn't work. See:

$ virtualenv test
$ cd test/
$ source bin/activate
(test) $ echo Cython >> requirements.txt
(test) $ echo tesserocr >> requirements.txt
(test) $ pip install -r requirements.txt

I got this error:

Collecting Cython (from -r requirements.txt (line 1))
  Using cached Cython-0.25.2-cp35-cp35m-manylinux1_x86_64.whl
Collecting tesserocr (from -r requirements.txt (line 2))
  Using cached tesserocr-2.1.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-ag5vqu0y/tesserocr/setup.py", line 11, in <module>
        from Cython.Distutils import build_ext
    ImportError: No module named 'Cython'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-ag5vqu0y/tesserocr/

I'm a Python newbie, but I think this issue may be due to a missing setup_requires or install_requires in https://github.com/sirfz/tesserocr/blob/master/setup.py?
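The log suggests that setup.py for tesserocr is executed before Cython from the same requirements file has been installed. A possible workaround, independent of any change to setup.py, is to install the build dependency in a separate pip invocation first. The helper below is a hypothetical sketch that only builds the two command lines; it does not run them.

```python
def pip_install_commands(requirements, build_deps=('Cython',)):
    """Split requirements so build dependencies are installed first."""
    wanted = set(name.lower() for name in build_deps)
    first = [r for r in requirements if r.lower() in wanted]
    rest = [r for r in requirements if r.lower() not in wanted]
    commands = []
    if first:
        commands.append(['pip', 'install'] + first)
    if rest:
        commands.append(['pip', 'install'] + rest)
    return commands


commands = pip_install_commands(['Cython', 'tesserocr'])
```

Each returned list could be passed to subprocess.check_call in order, so Cython is importable by the time the tesserocr build starts.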

Cannot get text with French language

Hi !

I'm trying to use tesserocr with the French language, but I keep getting errors from the Unicode decoder:

api = PyTessBaseAPI(lang='fra')
api.SetImage(Image.open("20170509_182040.jpg"))
api.SetSourceResolution(300)
api.GetUTF8Text()
Returns:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "tesserocr.pyx", line 2033, in tesserocr.PyTessBaseAPI.GetUTF8Text (tesserocr.cpp:18137)
  File "tesserocr.pyx", line 294, in tesserocr._free_str (tesserocr.cpp:2567)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 341: invalid continuation byte

The English version works, though:

api = PyTessBaseAPI()
api.SetImage(Image.open("20170509_182040.jpg"))
api.SetSourceResolution(300)
api.GetUTF8Text()
Returns :
'The text that I want'

This is my installation :

tesserocr.__version__
'2.1.3'

tesserocr.tesseract_version()
'tesseract 3.05.00\n leptonica-1.74.1\n libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8\n'

MacOS Sierra

Is it a known issue or do I need to change something to get it to work ?

Thanks for your help !
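For readers unfamiliar with this error type, the short sketch below reproduces it in plain Python. The byte string is made up for illustration and says nothing about why the French model produced such bytes; it only shows that 0xc3 must be followed by a continuation byte, and how a lenient decode behaves.

```python
# 0xc3 starts a two-byte UTF-8 sequence, but a space follows instead
# of a continuation byte, so strict decoding fails.
raw = b'caf\xc3 au lait'

try:
    raw.decode('utf-8')
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

# A lenient decode substitutes U+FFFD for the broken sequence.
text = raw.decode('utf-8', errors='replace')
```

A lenient decode hides the symptom but loses the affected character, so it is a diagnostic aid rather than a fix.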

(Fixed): Segmentation Fault on import tesserocr

I successfully installed tesserocr using the tesseract4 branch.

This is the output of tesseract -v on my system (Ubuntu 14.04)

tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0

Found AVX
Found SSE

tesserocr, tesseract and leptonica have been built from source.

I get a segmentation fault when I import tesserocr in python. Here is the entire core dump I obtained using gdb if it helps. Please let me know the steps I should take to fix this.

buralako@puck:~/git/tesserocr$ echo "import tesserocr" > trial.py
buralako@puck:~/git/tesserocr$ gdb python
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run trial.py 
Starting program: /usr/bin/python trial.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
GenericVector<int>::clear (this=this@entry=0x5d31) at ../ccutil/genericvector.h:857
857	  if (size_reserved_ > 0) {
(gdb) backtrace
#0  GenericVector<int>::clear (this=this@entry=0x5d31) at ../ccutil/genericvector.h:857
#1  0x00007ffff5be5941 in ~GenericVector (this=0x5d31, __in_chrg=<optimized out>) at genericvector.h:666
#2  ~GenericVectorEqEq (this=0x5d31, __in_chrg=<optimized out>) at genericvector.h:642
#3  tesseract::UnicharCompress::Cleanup (this=this@entry=0xbc0f08) at unicharcompress.cpp:432
#4  0x00007ffff5be5f76 in tesseract::UnicharCompress::SetupDecoder (this=this@entry=0xbc0f08) at unicharcompress.cpp:387
#5  0x00007ffff5be639f in tesseract::UnicharCompress::DeSerialize (this=this@entry=0xbc0f08, fp=fp@entry=0x7fffffffce30)
    at unicharcompress.cpp:321
#6  0x00007ffff5b779ff in tesseract::LSTMRecognizer::DeSerialize (this=this@entry=0xbc0cb0, fp=fp@entry=0x7fffffffce30)
    at lstmrecognizer.cpp:111
#7  0x00007ffff5a69ccb in tesseract::Tesseract::init_tesseract_lang_data (this=this@entry=0xb765f0, arg0=arg0@entry=0xb6d0b8 "", 
    textbase=textbase@entry=0x0, language=language@entry=0xb4c928 "eng", oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, 
    configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, 
    set_only_non_debug_params=set_only_non_debug_params@entry=false, mgr=mgr@entry=0x7fffffffd090) at tessedit.cpp:193
#8  0x00007ffff5a6a216 in tesseract::Tesseract::init_tesseract_internal (this=this@entry=0xb765f0, arg0=arg0@entry=0xb6d0b8 "", 
    textbase=textbase@entry=0x0, language=language@entry=0xb4c928 "eng", oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, 
    configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, 
    set_only_non_debug_params=set_only_non_debug_params@entry=false, mgr=mgr@entry=0x7fffffffd090) at tessedit.cpp:402
#9  0x00007ffff5a6abb4 in tesseract::Tesseract::init_tesseract (this=0xb765f0, arg0=0xb6d0b8 "", textbase=textbase@entry=0x0, 
    language=language@entry=0x7ffff5bf050f "eng", oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, 
    configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, 
    set_only_non_debug_params=set_only_non_debug_params@entry=false, mgr=mgr@entry=0x7fffffffd090) at tessedit.cpp:324
#10 0x00007ffff5a1da46 in tesseract::TessBaseAPI::Init (this=this@entry=0x7ffff67eb720 <__pyx_v_9tesserocr__api>, data=data@entry=0x0, 
    data_size=data_size@entry=0, language=0x7ffff5bf050f "eng", language@entry=0x0, oem=oem@entry=tesseract::OEM_DEFAULT, 
    configs=configs@entry=0x0, configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, 
    set_only_non_debug_params=set_only_non_debug_params@entry=false, reader=reader@entry=0x0) at baseapi.cpp:330
#11 0x00007ffff5a1ddde in tesseract::TessBaseAPI::Init (this=this@entry=0x7ffff67eb720 <__pyx_v_9tesserocr__api>, 
    datapath=datapath@entry=0x0, language=language@entry=0x0, oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, 
    configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, 
    set_only_non_debug_params=set_only_non_debug_params@entry=false) at baseapi.cpp:284
#12 0x00007ffff65ca23d in Init (language=0x0, datapath=0x0, this=0x7ffff67eb720 <__pyx_v_9tesserocr__api>)
    at /usr/local/include/tesseract/baseapi.h:239
#13 inittesserocr () at tesserocr.cpp:25141
#14 0x000000000042266c in _PyImport_LoadDynamicModule ()
#15 0x0000000000540948 in ?? ()
#16 0x0000000000540d08 in ?? ()
#17 0x000000000054111b in ?? ()
#18 0x000000000051dc50 in ?? ()
#19 0x00000000004dc9cb in PyEval_CallObjectWithKeywords ()
#20 0x000000000049b87e in PyEval_EvalFrameEx ()
#21 0x00000000004a1634 in ?? ()
#22 0x000000000044e4a5 in PyRun_FileExFlags ()
#23 0x000000000044ec9f in PyRun_SimpleFileExFlags ()
#24 0x000000000044f904 in Py_Main ()
#25 0x00007ffff7815f45 in __libc_start_main (main=0x44f9c2 <main>, argc=2, argv=0x7fffffffde18, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffde08) at libc-start.c:287
#26 0x0000000000578c4e in _start ()

install breaks with tesseract with postfix "dev" in version

Traceback (most recent call last):
  File "setup.py", line 147, in <module>
    test_suite='tests'
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
    self.run_command(cmd_name)
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 970, in run_command
    cmd_obj = self.get_command_obj(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 846, in get_command_obj
    cmd_obj = self.command_obj[command] = klass(self)
  File "/home/leonardo/.virtualenvs/pbot/local/lib/python2.7/site-packages/setuptools/__init__.py", line 132, in __init__
    _Command.__init__(self, dist)
  File "/usr/lib/python2.7/distutils/cmd.py", line 64, in __init__
    self.initialize_options()
  File "setup.py", line 107, in initialize_options
    build_args = package_config()
  File "setup.py", line 74, in package_config
    config['cython_compile_time_env'] = {'TESSERACT_VERSION': version_to_int(version.strip())}
  File "setup.py", line 43, in version_to_int
    return int(''.join(version.split('.')), 16)
ValueError: invalid literal for int() with base 16: '30500dev'
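The traceback shows version_to_int joining the dotted parts and parsing them as hexadecimal, which fails once a suffix such as "dev" is attached. A minimal sketch of a more tolerant variant follows; it keeps the same encoding but reads only the leading numeric components. The regular expression is an assumption about what version strings look like, not the project's actual fix.

```python
import re


def version_to_int(version):
    """Encode 'X.YY[.ZZ][suffix]' as int('XYYZZ', 16), ignoring the suffix."""
    match = re.match(r'(\d+)\.(\d+)(?:\.(\d+))?', version.strip())
    if match is None:
        raise ValueError('Unrecognised version string: %r' % version)
    parts = [p for p in match.groups() if p is not None]
    return int(''.join(parts), 16)
```

For plain versions the result is unchanged: '3.04.01' still maps to 197633 and '3.03' to 771, the values that appear as TESSERACT_VERSION in the build logs on this page.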

installation problem in python3

Hi, I'm trying to install tesserocr in a Docker container that's set up as follows:

FROM ubuntu:14.04

RUN apt-get -y update && apt-get install -y tesseract-ocr python3-imaging python3-pip python3-skimage libtesseract-dev libleptonica-dev

RUN pip3 install pytesseract ipython Cython

Then inside the container I manually run the command:

pip3 install tesserocr

It fails with the following. Any tips for how to get past this? Thanks.

Downloading/unpacking tesserocr
  Downloading tesserocr-2.1.3.tar.gz (49kB): 49kB downloaded
  Running setup.py (path:/tmp/pip_build_root/tesserocr/setup.py) egg_info for package tesserocr
    /usr/local/lib/python3.4/dist-packages/Cython/Distutils/old_build_ext.py:30: UserWarning: Cython.Distutils.old_build_ext does not properly handle dependencies and is deprecated.
      "Cython.Distutils.old_build_ext does not properly handle dependencies "
    Supporting tesseract v3.03
    Building with configs: {'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 771}}
Installing collected packages: tesserocr
  Running setup.py install for tesserocr
    /usr/local/lib/python3.4/dist-packages/Cython/Distutils/old_build_ext.py:30: UserWarning: Cython.Distutils.old_build_ext does not properly handle dependencies and is deprecated.
      "Cython.Distutils.old_build_ext does not properly handle dependencies "
    Supporting tesseract v3.03
    Building with configs: {'cython_compile_time_env': {'TESSERACT_VERSION': 771}, 'libraries': ['tesseract', 'lept']}
    cythoning tesserocr.pyx to tesserocr.cpp
    building 'tesserocr' extension
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.4m -c tesserocr.cpp -o build/temp.linux-x86_64-3.4/tesserocr.o
    cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++ [enabled by default]
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_14PyPageIterator_20SetBoundingBoxComponents(__pyx_obj_9tesserocr_PyPageIterator*, bool, bool)':
    tesserocr.cpp:4610:25: error: 'class tesseract::PageIterator' has no member named 'SetBoundingBoxComponents'
       __pyx_v_self->_piter->SetBoundingBoxComponents(__pyx_v_include_upper_dots, __pyx_v_include_lower_dots);
                             ^
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_14PyPageIterator_34GetImage(__pyx_obj_9tesserocr_PyPageIterator*, tesseract::PageIteratorLevel, int, PyObject*)':
    tesserocr.cpp:5842:125: error: no matching function for call to 'tesseract::PageIterator::GetImage(tesseract::PageIteratorLevel&, int&, Pix*&, int*, int*)'
       __pyx_v_pix = __pyx_v_self->_piter->GetImage(__pyx_v_level, __pyx_v_padding, __pyx_v_opix, (&__pyx_v_left), (&__pyx_v_top));
                                                                                                                                 ^
    tesserocr.cpp:5842:125: note: candidate is:
    In file included from tesserocr.cpp:424:0:
    /usr/include/tesseract/pageiterator.h:239:8: note: Pix* tesseract::PageIterator::GetImage(tesseract::PageIteratorLevel, int, int*, int*) const
       Pix* GetImage(PageIteratorLevel level, int padding,
            ^
    /usr/include/tesseract/pageiterator.h:239:8: note:   candidate expects 4 arguments, 5 provided
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_74AnalyseLayout(__pyx_obj_9tesserocr_PyTessBaseAPI*, bool)':
    tesserocr.cpp:16244:83: error: no matching function for call to 'tesseract::TessBaseAPI::AnalyseLayout(bool&)'
       __pyx_v_piter = __pyx_v_self->_baseapi.AnalyseLayout(__pyx_v_merge_similar_words);
                                                                                       ^
    tesserocr.cpp:16244:83: note: candidate is:
    In file included from tesserocr.cpp:429:0:
    /usr/include/tesseract/baseapi.h:489:17: note: tesseract::PageIterator* tesseract::TessBaseAPI::AnalyseLayout()
       PageIterator* AnalyseLayout();
                     ^
    /usr/include/tesseract/baseapi.h:489:17: note:   candidate expects 0 arguments, 1 provided
    tesserocr.cpp: In function 'tesseract::TessResultRenderer* __pyx_f_9tesserocr_13PyTessBaseAPI__get_renderer(__pyx_obj_9tesserocr_PyTessBaseAPI*, __pyx_t_9tesseract_cchar_t*)':
    tesserocr.cpp:16588:88: error: no matching function for call to 'tesseract::TessHOcrRenderer::TessHOcrRenderer(__pyx_t_9tesseract_cchar_t*&, bool&)'
           __pyx_t_2 = new tesseract::TessHOcrRenderer(__pyx_v_outputbase, __pyx_v_font_info);
                                                                                            ^
    tesserocr.cpp:16588:88: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:175:3: note: tesseract::TessHOcrRenderer::TessHOcrRenderer()
       TessHOcrRenderer();
       ^
    /usr/include/tesseract/renderer.h:175:3: note:   candidate expects 0 arguments, 2 provided
    /usr/include/tesseract/renderer.h:173:16: note: tesseract::TessHOcrRenderer::TessHOcrRenderer(const tesseract::TessHOcrRenderer&)
     class TESS_API TessHOcrRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:173:16: note:   candidate expects 1 argument, 2 provided
    tesserocr.cpp:16631:106: error: no matching function for call to 'tesseract::TessPDFRenderer::TessPDFRenderer(__pyx_t_9tesseract_cchar_t*&, const char*)'
           __pyx_t_3 = new tesseract::TessPDFRenderer(__pyx_v_outputbase, __pyx_v_self->_baseapi.GetDatapath());
                                                                                                              ^
    tesserocr.cpp:16631:106: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:188:3: note: tesseract::TessPDFRenderer::TessPDFRenderer(const char*)
       TessPDFRenderer(const char *datadir);
       ^
    /usr/include/tesseract/renderer.h:188:3: note:   candidate expects 1 argument, 2 provided
    /usr/include/tesseract/renderer.h:186:16: note: tesseract::TessPDFRenderer::TessPDFRenderer(const tesseract::TessPDFRenderer&)
     class TESS_API TessPDFRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:186:16: note:   candidate expects 1 argument, 2 provided
    tesserocr.cpp:16715:69: error: no matching function for call to 'tesseract::TessUnlvRenderer::TessUnlvRenderer(__pyx_t_9tesseract_cchar_t*&)'
           __pyx_t_4 = new tesseract::TessUnlvRenderer(__pyx_v_outputbase);
                                                                         ^
    tesserocr.cpp:16715:69: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:227:3: note: tesseract::TessUnlvRenderer::TessUnlvRenderer()
       TessUnlvRenderer();
       ^
    /usr/include/tesseract/renderer.h:227:3: note:   candidate expects 0 arguments, 1 provided
    /usr/include/tesseract/renderer.h:225:16: note: tesseract::TessUnlvRenderer::TessUnlvRenderer(const tesseract::TessUnlvRenderer&)
     class TESS_API TessUnlvRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:225:16: note:   no known conversion for argument 1 from '__pyx_t_9tesseract_cchar_t* {aka const char*}' to 'const tesseract::TessUnlvRenderer&'
    tesserocr.cpp:16799:72: error: no matching function for call to 'tesseract::TessBoxTextRenderer::TessBoxTextRenderer(__pyx_t_9tesseract_cchar_t*&)'
           __pyx_t_5 = new tesseract::TessBoxTextRenderer(__pyx_v_outputbase);
                                                                            ^
    tesserocr.cpp:16799:72: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:238:3: note: tesseract::TessBoxTextRenderer::TessBoxTextRenderer()
       TessBoxTextRenderer();
       ^
    /usr/include/tesseract/renderer.h:238:3: note:   candidate expects 0 arguments, 1 provided
    /usr/include/tesseract/renderer.h:236:16: note: tesseract::TessBoxTextRenderer::TessBoxTextRenderer(const tesseract::TessBoxTextRenderer&)
     class TESS_API TessBoxTextRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:236:16: note:   no known conversion for argument 1 from '__pyx_t_9tesseract_cchar_t* {aka const char*}' to 'const tesseract::TessBoxTextRenderer&'
    tesserocr.cpp:16883:69: error: no matching function for call to 'tesseract::TessTextRenderer::TessTextRenderer(__pyx_t_9tesseract_cchar_t*&)'
           __pyx_t_6 = new tesseract::TessTextRenderer(__pyx_v_outputbase);
                                                                         ^
    tesserocr.cpp:16883:69: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:164:3: note: tesseract::TessTextRenderer::TessTextRenderer()
       TessTextRenderer();
       ^
    /usr/include/tesseract/renderer.h:164:3: note:   candidate expects 0 arguments, 1 provided
    /usr/include/tesseract/renderer.h:162:16: note: tesseract::TessTextRenderer::TessTextRenderer(const tesseract::TessTextRenderer&)
     class TESS_API TessTextRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:162:16: note:   no known conversion for argument 1 from '__pyx_t_9tesseract_cchar_t* {aka const char*}' to 'const tesseract::TessTextRenderer&'
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_108IsValidCharacter(__pyx_obj_9tesserocr_PyTessBaseAPI*, PyObject*)':
    tesserocr.cpp:19649:60: error: 'class tesseract::TessBaseAPI' has no member named 'IsValidCharacter'
       __pyx_t_1 = __Pyx_PyBool_FromLong(__pyx_v_self->_baseapi.IsValidCharacter(__pyx_t_2)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2161, __pyx_L1_error)
                                                                ^
    tesserocr.cpp:532:36: note: in definition of macro '__Pyx_PyBool_FromLong'
     #define __Pyx_PyBool_FromLong(b) ((b) ? __Pyx_NewRef(Py_True) : __Pyx_NewRef(Py_False))
                                        ^
    tesserocr.cpp: In function 'PyObject* PyInit_tesserocr()':
    tesserocr.cpp:25002:67: error: 'PSM_RAW_LINE' is not a member of 'tesseract'
       __pyx_t_2 = __Pyx_PyInt_From_enum__tesseract_3a__3a_PageSegMode(tesseract::PSM_RAW_LINE); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 132, __pyx_L1_error)
                                                                       ^
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    Complete output from command /usr/bin/python3 -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/tesserocr/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-cwirkbsc-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    /usr/local/lib/python3.4/dist-packages/Cython/Distutils/old_build_ext.py:30: UserWarning: Cython.Distutils.old_build_ext does not properly handle dependencies and is deprecated.
      "Cython.Distutils.old_build_ext does not properly handle dependencies "
    Supporting tesseract v3.03
    Building with configs: {'cython_compile_time_env': {'TESSERACT_VERSION': 771}, 'libraries': ['tesseract', 'lept']}
    cythoning tesserocr.pyx to tesserocr.cpp
    building 'tesserocr' extension
    creating build
    creating build/temp.linux-x86_64-3.4
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.4m -c tesserocr.cpp -o build/temp.linux-x86_64-3.4/tesserocr.o
    cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++ [enabled by default]
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_14PyPageIterator_20SetBoundingBoxComponents(__pyx_obj_9tesserocr_PyPageIterator*, bool, bool)':
    tesserocr.cpp:4610:25: error: 'class tesseract::PageIterator' has no member named 'SetBoundingBoxComponents'
       __pyx_v_self->_piter->SetBoundingBoxComponents(__pyx_v_include_upper_dots, __pyx_v_include_lower_dots);
                             ^
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_14PyPageIterator_34GetImage(__pyx_obj_9tesserocr_PyPageIterator*, tesseract::PageIteratorLevel, int, PyObject*)':
    tesserocr.cpp:5842:125: error: no matching function for call to 'tesseract::PageIterator::GetImage(tesseract::PageIteratorLevel&, int&, Pix*&, int*, int*)'
       __pyx_v_pix = __pyx_v_self->_piter->GetImage(__pyx_v_level, __pyx_v_padding, __pyx_v_opix, (&__pyx_v_left), (&__pyx_v_top));
                                                                                                                                 ^
    tesserocr.cpp:5842:125: note: candidate is:
    In file included from tesserocr.cpp:424:0:
    /usr/include/tesseract/pageiterator.h:239:8: note: Pix* tesseract::PageIterator::GetImage(tesseract::PageIteratorLevel, int, int*, int*) const
       Pix* GetImage(PageIteratorLevel level, int padding,
            ^
    /usr/include/tesseract/pageiterator.h:239:8: note:   candidate expects 4 arguments, 5 provided
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_74AnalyseLayout(__pyx_obj_9tesserocr_PyTessBaseAPI*, bool)':
    tesserocr.cpp:16244:83: error: no matching function for call to 'tesseract::TessBaseAPI::AnalyseLayout(bool&)'
       __pyx_v_piter = __pyx_v_self->_baseapi.AnalyseLayout(__pyx_v_merge_similar_words);
                                                                                       ^
    tesserocr.cpp:16244:83: note: candidate is:
    In file included from tesserocr.cpp:429:0:
    /usr/include/tesseract/baseapi.h:489:17: note: tesseract::PageIterator* tesseract::TessBaseAPI::AnalyseLayout()
       PageIterator* AnalyseLayout();
                     ^
    /usr/include/tesseract/baseapi.h:489:17: note:   candidate expects 0 arguments, 1 provided
    tesserocr.cpp: In function 'tesseract::TessResultRenderer* __pyx_f_9tesserocr_13PyTessBaseAPI__get_renderer(__pyx_obj_9tesserocr_PyTessBaseAPI*, __pyx_t_9tesseract_cchar_t*)':
    tesserocr.cpp:16588:88: error: no matching function for call to 'tesseract::TessHOcrRenderer::TessHOcrRenderer(__pyx_t_9tesseract_cchar_t*&, bool&)'
           __pyx_t_2 = new tesseract::TessHOcrRenderer(__pyx_v_outputbase, __pyx_v_font_info);
                                                                                            ^
    tesserocr.cpp:16588:88: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:175:3: note: tesseract::TessHOcrRenderer::TessHOcrRenderer()
       TessHOcrRenderer();
       ^
    /usr/include/tesseract/renderer.h:175:3: note:   candidate expects 0 arguments, 2 provided
    /usr/include/tesseract/renderer.h:173:16: note: tesseract::TessHOcrRenderer::TessHOcrRenderer(const tesseract::TessHOcrRenderer&)
     class TESS_API TessHOcrRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:173:16: note:   candidate expects 1 argument, 2 provided
    tesserocr.cpp:16631:106: error: no matching function for call to 'tesseract::TessPDFRenderer::TessPDFRenderer(__pyx_t_9tesseract_cchar_t*&, const char*)'
           __pyx_t_3 = new tesseract::TessPDFRenderer(__pyx_v_outputbase, __pyx_v_self->_baseapi.GetDatapath());
                                                                                                              ^
    tesserocr.cpp:16631:106: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:188:3: note: tesseract::TessPDFRenderer::TessPDFRenderer(const char*)
       TessPDFRenderer(const char *datadir);
       ^
    /usr/include/tesseract/renderer.h:188:3: note:   candidate expects 1 argument, 2 provided
    /usr/include/tesseract/renderer.h:186:16: note: tesseract::TessPDFRenderer::TessPDFRenderer(const tesseract::TessPDFRenderer&)
     class TESS_API TessPDFRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:186:16: note:   candidate expects 1 argument, 2 provided
    tesserocr.cpp:16715:69: error: no matching function for call to 'tesseract::TessUnlvRenderer::TessUnlvRenderer(__pyx_t_9tesseract_cchar_t*&)'
           __pyx_t_4 = new tesseract::TessUnlvRenderer(__pyx_v_outputbase);
                                                                         ^
    tesserocr.cpp:16715:69: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:227:3: note: tesseract::TessUnlvRenderer::TessUnlvRenderer()
       TessUnlvRenderer();
       ^
    /usr/include/tesseract/renderer.h:227:3: note:   candidate expects 0 arguments, 1 provided
    /usr/include/tesseract/renderer.h:225:16: note: tesseract::TessUnlvRenderer::TessUnlvRenderer(const tesseract::TessUnlvRenderer&)
     class TESS_API TessUnlvRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:225:16: note:   no known conversion for argument 1 from '__pyx_t_9tesseract_cchar_t* {aka const char*}' to 'const tesseract::TessUnlvRenderer&'
    tesserocr.cpp:16799:72: error: no matching function for call to 'tesseract::TessBoxTextRenderer::TessBoxTextRenderer(__pyx_t_9tesseract_cchar_t*&)'
           __pyx_t_5 = new tesseract::TessBoxTextRenderer(__pyx_v_outputbase);
                                                                            ^
    tesserocr.cpp:16799:72: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:238:3: note: tesseract::TessBoxTextRenderer::TessBoxTextRenderer()
       TessBoxTextRenderer();
       ^
    /usr/include/tesseract/renderer.h:238:3: note:   candidate expects 0 arguments, 1 provided
    /usr/include/tesseract/renderer.h:236:16: note: tesseract::TessBoxTextRenderer::TessBoxTextRenderer(const tesseract::TessBoxTextRenderer&)
     class TESS_API TessBoxTextRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:236:16: note:   no known conversion for argument 1 from '__pyx_t_9tesseract_cchar_t* {aka const char*}' to 'const tesseract::TessBoxTextRenderer&'
    tesserocr.cpp:16883:69: error: no matching function for call to 'tesseract::TessTextRenderer::TessTextRenderer(__pyx_t_9tesseract_cchar_t*&)'
           __pyx_t_6 = new tesseract::TessTextRenderer(__pyx_v_outputbase);
                                                                         ^
    tesserocr.cpp:16883:69: note: candidates are:
    In file included from tesserocr.cpp:427:0:
    /usr/include/tesseract/renderer.h:164:3: note: tesseract::TessTextRenderer::TessTextRenderer()
       TessTextRenderer();
       ^
    /usr/include/tesseract/renderer.h:164:3: note:   candidate expects 0 arguments, 1 provided
    /usr/include/tesseract/renderer.h:162:16: note: tesseract::TessTextRenderer::TessTextRenderer(const tesseract::TessTextRenderer&)
     class TESS_API TessTextRenderer : public TessResultRenderer {
                    ^
    /usr/include/tesseract/renderer.h:162:16: note:   no known conversion for argument 1 from '__pyx_t_9tesseract_cchar_t* {aka const char*}' to 'const tesseract::TessTextRenderer&'
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_108IsValidCharacter(__pyx_obj_9tesserocr_PyTessBaseAPI*, PyObject*)':
    tesserocr.cpp:19649:60: error: 'class tesseract::TessBaseAPI' has no member named 'IsValidCharacter'
       __pyx_t_1 = __Pyx_PyBool_FromLong(__pyx_v_self->_baseapi.IsValidCharacter(__pyx_t_2)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2161, __pyx_L1_error)
                                                                ^
    tesserocr.cpp:532:36: note: in definition of macro '__Pyx_PyBool_FromLong'
     #define __Pyx_PyBool_FromLong(b) ((b) ? __Pyx_NewRef(Py_True) : __Pyx_NewRef(Py_False))
                                        ^
    tesserocr.cpp: In function 'PyObject* PyInit_tesserocr()':
    tesserocr.cpp:25002:67: error: 'PSM_RAW_LINE' is not a member of 'tesseract'
       __pyx_t_2 = __Pyx_PyInt_From_enum__tesseract_3a__3a_PageSegMode(tesseract::PSM_RAW_LINE); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 132, __pyx_L1_error)
                                                                       ^
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1


----------------------------------------
Cleaning up...
Command /usr/bin/python3 -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/tesserocr/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-cwirkbsc-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/tesserocr
Storing debug log for failure in /root/.pip/pip.log
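
A note on reading the log: the build reports `Supporting tesseract v3.03` and `'TESSERACT_VERSION': 771`, and every compiler error is a member or constructor signature that the 3.03 headers under /usr/include/tesseract do not declare. As an observation only (not a statement about how setup.py computes the value), 771 is what you get when the digits of 3.03 are read as one hexadecimal number. The helper name below is made up for illustration.

```python
def encode_version(version):
    """Read the digits of a dotted version such as '3.03' as one hex number.

    Hypothetical helper: it only illustrates how 3.03 lines up with 771.
    """
    return int(version.replace(".", ""), 16)


print(encode_version("3.03"))  # 771, the TESSERACT_VERSION value in the log
```

If that reading is right, installing newer tesseract development headers before rebuilding is the first thing to try.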

error:start >= 0 && start + num <= length_:Error:Assert failed:in file ratngs.cpp, line 321

Here is the relevant part of my code:

import sys
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL, iterate_level
import cv2
import numpy as np

def processText(image):
    # Load the input and convert it to grayscale.
    image = cv2.imread(image)
    iimage = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((1, 1), np.uint8)
    #iimage = cv2.dilate(iimage, kernel, iterations=1)
    #iimage = cv2.erode(iimage, kernel, iterations=1)
    cv2.imwrite("7.png", iimage)
    # Binarize with adaptive thresholding and save the result for OCR.
    img = cv2.adaptiveThreshold(iimage, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    cv2.imwrite("8.png", img)
    image = Image.open('8.png')
    # Original image, used only for drawing the detected boxes.
    im1 = cv2.imread("midResult/init3.jpeg")

    wordList = []
    with PyTessBaseAPI(psm=1) as api:
        api.SetImage(image)
        api.Recognize()  # the assertion in the title is raised here
        boxes = api.GetComponentImages(RIL.TEXTLINE, False)
        print('Found {} textline image components.'.format(len(boxes)))
        for i, (im, box, _, _) in enumerate(boxes):
            radioSize = 0.08
            # Restrict recognition to the current text line.
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            ocrResult = api.GetUTF8Text()
            conf = api.MeanTextConf()
            print(u"Box[{0}]:x={x},y={y},w={w},h={h},confidence:{1},text:{2}".format(i, conf, ocrResult, **box))
            cv2.rectangle(im1, (box['x'], box['y']),
                          (box['x'] + box['w'],
                           box['y'] + box['h']), (0, 255, 0), 2)
    cv2.imwrite("11.png", im1)

processText("midResult/init3.jpeg")

While debugging I found that the error in the title occurs when execution reaches api.Recognize(). When I changed the image parameter to 3.jpeg, the error disappeared, even though the two pictures are nearly the same.
What is wrong here?
[Attached images: 3, init3]

Error while importing "from tesserocr import PyTessBaseAPI"

After successful compilation, the example code in the readme fails while importing PyTessBaseAPI:

from tesserocr import PyTessBaseAPI
  File "build/bdist.linux-x86_64/egg/tesserocr.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/tesserocr.py", line 6, in __bootstrap__
ImportError: /home/leonardo/.python-eggs/tesserocr-2.0.2b0-py2.7-linux-x86_64.egg-tmp/tesserocr.so: undefined symbol: _ZN9tesseract11TessBaseAPI13AnalyseLayoutEb
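
For readers puzzling over the symbol name: it is a mangled C++ name, and `c++filt` will decode it. The sketch below is a deliberately tiny decoder, written only for this simple shape of name (nested identifiers followed by builtin parameter types); `demangle_simple` and `BUILTIN_TYPES` are illustrative names, not part of any library.

```python
import re

# Partial table of single-letter builtin type codes from the Itanium C++ ABI.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int",
    "j": "unsigned int", "l": "long", "f": "float", "d": "double",
}


def demangle_simple(symbol):
    """Decode names of the form _ZN<len><name>...E<builtin codes>.

    Anything more complex (templates, pointers, const methods) raises ValueError.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("only simple nested names are handled")
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(match.group())
        pos += len(match.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing 'E' terminator")
    # Everything after 'E' encodes the parameter types.
    codes = symbol[pos + 1:]
    if any(code not in BUILTIN_TYPES for code in codes):
        raise ValueError("unsupported parameter encoding: %r" % codes)
    params = [BUILTIN_TYPES[code] for code in codes if code != "v"]
    return "::".join(parts) + "(" + ", ".join(params) + ")"


print(demangle_simple("_ZN9tesseract11TessBaseAPI13AnalyseLayoutEb"))
# tesseract::TessBaseAPI::AnalyseLayout(bool)
```

So the extension expects `AnalyseLayout(bool)`, and the libtesseract loaded at import time does not export it. One plausible reading is that the shared library found at runtime is older than the headers the extension was compiled against.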

Publish Wheels

Ideally with static libraries included, so that pip is all that is required for installation.

Building on Windows

I have been working on bots for programs that only run on Windows, and I was wondering if you had any pointers on compiling on Windows.
I was able to build tesserocr.lib, but I cannot get past that step.

I used https://github.com/peirick/VS2015_Tesseract to build libtesseract and used that to satisfy all of the imports.

Thank you.

Is this expected? PIL+image_to_text and file_to_text give different results

I ran the sample code from the readme on one of my images. Interestingly, it gives two different results depending on whether I open the image with PIL and call image_to_text or call file_to_text directly on the file. The PIL version seems to perform better, and the images are just regular JPEGs.
The sample code and its output are below.

CODE

import tesserocr
from PIL import Image

print(tesserocr.tesseract_version())  # print tesseract-ocr version
print(tesserocr.get_languages())  # prints tessdata path and list of available languages

image = Image.open(l.blobname)  # l.blobname is the path to the JPEG
print(tesserocr.image_to_text(image))  # print ocr text from image
print("===================================================================")
# or
print(tesserocr.file_to_text(l.blobname))

OUTPUT


tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

(u'/usr/share/tesseract-ocr/tessdata/', [u'equ', u'eng', u'osd'])
Everyday
Low
Price
Hem n 796577
um pm 5»

murAflWuNNEHSM

 


===================================================================
 

1m mm umvtnm

s 1 7885333252;

 
     
   
  

Everyday
Low

 

PVICE
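
To put a number on how far apart the two results are, one option is a plain text similarity ratio from the standard library. This is only a sketch for comparing outputs; it says nothing about why the two code paths differ. `similarity` is a made-up helper name, and the two sample strings are shortened excerpts of the output above.

```python
import difflib


def similarity(text_a, text_b):
    """Return a ratio in [0, 1]; runs of whitespace are collapsed first."""
    a = " ".join(text_a.split())
    b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


# Shortened excerpts of the two outputs shown above.
pil_result = "Everyday\nLow\nPrice\nHem n 796577"
file_result = "Everyday\nLow\n\nPVICE"
print(round(similarity(pil_result, file_result), 2))
```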

SetCvImage

Is there a way to set OpenCV images directly, like the one in Python Tesseract? OpenCV always comes in handy for preprocessing, and it would be a waste of resources to save the image to a file and read it again.
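
Until something like that exists, a workaround sketch: OpenCV images are NumPy arrays, so they can be wrapped in a PIL image in memory and passed to SetImage, which accepts PIL images. `cv_to_pil` and `ocr_cv_image` are hypothetical helper names, and the code assumes 8-bit grayscale or 3-channel BGR input.

```python
def cv_to_pil(array):
    """Convert an 8-bit OpenCV array (grayscale or BGR) to a PIL image."""
    from PIL import Image  # imported lazily; requires Pillow

    if array.ndim == 2:
        return Image.fromarray(array)
    if array.ndim == 3 and array.shape[2] == 3:
        # OpenCV stores channels as BGR; reverse them to get RGB.
        return Image.fromarray(array[:, :, ::-1].copy())
    raise ValueError("expected a grayscale or 3-channel BGR array")


def ocr_cv_image(array, api):
    """Run OCR on an in-memory OpenCV image using an open PyTessBaseAPI."""
    api.SetImage(cv_to_pil(array))
    return api.GetUTF8Text()
```

Typical use would be `with PyTessBaseAPI() as api: text = ocr_cv_image(cv2.imread(path), api)`, with no temporary file involved.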

incorrectly detects orientation

I've noticed that the orientation example doesn't distinguish between upside-down and right-side-up images, or between clockwise and counterclockwise rotations.

reubano@tokpro [~]⚡ tesseract -psm 0 up.jpg - 
Orientation: 0
Orientation in degrees: 0
Orientation confidence: 0.23
Script: 1
Script confidence: 0.98

reubano@tokpro [~]⚡ tesseract -psm 0 down.jpg - 
Orientation: 2
Orientation in degrees: 180
Orientation confidence: 0.21
Script: 1
Script confidence: 0.61

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    for path in ['up.jpg', 'down.jpg']:
        image = Image.open(path)
        api.SetImage(image)
        api.Recognize()
        it = api.AnalyseLayout()
        print(it.Orientation())

(0, 0, 2, 0.0)
(0, 0, 2, 0.0)
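
A note on reading those tuples: Orientation() returns (orientation, writing_direction, textline_order, deskew_angle), so the 2 in the third position is the text line order, not the page orientation. The sketch below labels the fields; the enum values are as I understand them from tesseract's publictypes.h and should be checked against your version, and `label` is a made-up helper.

```python
from collections import namedtuple

OrientationResult = namedtuple(
    "OrientationResult",
    ["orientation", "writing_direction", "textline_order", "deskew_angle"])

# Assumed enum values (tesseract publictypes.h); verify for your version.
ORIENTATION = {0: "PAGE_UP", 1: "PAGE_RIGHT", 2: "PAGE_DOWN", 3: "PAGE_LEFT"}
WRITING_DIRECTION = {0: "LEFT_TO_RIGHT", 1: "RIGHT_TO_LEFT", 2: "TOP_TO_BOTTOM"}
TEXTLINE_ORDER = {0: "LEFT_TO_RIGHT", 1: "RIGHT_TO_LEFT", 2: "TOP_TO_BOTTOM"}


def label(result):
    """Attach names to the raw tuple returned by Orientation()."""
    orientation, direction, order, angle = result
    return OrientationResult(ORIENTATION[orientation],
                             WRITING_DIRECTION[direction],
                             TEXTLINE_ORDER[order],
                             angle)


print(label((0, 0, 2, 0.0)))
```

Read this way, both images are reported as PAGE_UP, which still disagrees with the command-line result for down.jpg. If your tesserocr version provides it, `api.DetectOS()` with `PSM.OSD_ONLY` may be closer to what `tesseract -psm 0` prints.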
