prabhakar267 / image2text

:clipboard: Python wrapper to grab text from images and save as text files using Tesseract Engine


tesseract tesseract-engine optical-character-recognition ocr image2text tesseract-ocr python-wrapper tesseract-installation

image2text's Introduction

Image2Text

Image2Text is a Python wrapper that grabs text from images and saves it as text files using the Google Tesseract engine. Tesseract is an optical character recognition (OCR) engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available.

Quick Links:

Usage
- Running Tests
Tesseract Installation
- Linux
- Windows
- Mac
Sample Results
- Sample Image
- Text output

Usage

python main.py -i <input_path> -o <output_path>

usage: main.py [-h] -i INPUT [-o OUTPUT] [-d]

required arguments:
  -i INPUT, --input INPUT       Single image file path or images directory path

optional arguments:
  -o OUTPUT, --output OUTPUT    (Optional) Output directory for converted text
  -d, --debug                   Enable verbose DEBUG logging

python main.py -i sample/

python main.py -i sample/ -o output/
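
As a rough illustration of the options listed above, the following sketch builds an `argparse` parser with the same three flags. It is not the project's actual `main.py`; the function name `build_parser` is hypothetical and only the flag names and help strings are taken from the usage text.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # -i is required and accepts either one image file or a directory of images
    parser.add_argument("-i", "--input", required=True,
                        help="Single image file path or images directory path")
    # -o is optional; when omitted the attribute is None
    parser.add_argument("-o", "--output",
                        help="(Optional) Output directory for converted text")
    # -d is a boolean switch that defaults to False
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Enable verbose DEBUG logging")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["-i", "sample/", "-o", "output/"])
print(args.input, args.output, args.debug)  # sample/ output/ False
```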

Running Tests

python -m unittest

Tesseract Installation

Linux

[sudo] apt-get install tesseract-ocr

Windows

Install tesseract-ocr from UB Mannheim: https://github.com/UB-Mannheim/tesseract/wiki
Add the installed Tesseract-OCR directory to the PATH system variable.

Mac

brew install tesseract

Sample Results

Sample Image

(Wikipedia page for Google | Lang : Simple English)

Text output

A man signing in at Google’s main aﬁce, Googleplex.

Google Inc. is an American multinational corporation
that is best known for running one of the largest search
engines on the World Wide Web (WWW). Every day,
200 million (200,000,000) people use it. Google’s main
ofﬁce (“Googleplex”) is in Mountain View, California,
USA.

With Google Search, people can also search for pictures,
Usenet newsgroups, news, and things to buy online. By
June 2004, Google had 4.28 billion web pages on its
database, 880 million (880,000,000) pictures and 845
million (845,000,000) Usenet messages — six billion
things.

“To google,” as an action word (verb) means “to search
for something on Google”. Because Google is so popular
(more than half of people on the web use it) it has been
used to mean “to search the web”. Google dislikes this
use since the name of the company is a trademark.

As a public company, Google Inc. trades on the
NASDAQ under the tickers GOOG and GOOGL.

In August 2015, Google announced it was being restruc-
tured under a new holding company called Alphabet Inc.

1 History

Google was started in early 1996 by Larry Page and
Sergey Brin, two students at Stanford University, USA.
It used to be called Backrub. Later, they made it into a
company, Google Inc., on September 7, 1998 at a friend’s
garage in Menlo Park, California. In February 1999, the
company moved to 165 University Ave., Palo Alto, Cal-
ifornia. Later that year, it moved to another place, now

called the “Googleplex”.

In September 2001, Google’s rating system (“PageR-
ank”, for saying which information is more helpful) got a
US. Patent. The patent was to Stanford University, with
Lawrence (Larry) Page as the inventor (the person who
ﬁrst had the idea).

Google makes an important, though shrinking, percent-
age of its money through its friends like America Online
and InterActiveCorp. It has a special group known as the
Partner Solutions Organization (PSO) which helps make
contracts, helps making accounts better, and gives engi-
neering help.

2 How Google makes money

Google makes money by advertising. People or compa-
nies who want people to buy their product, service, or
ideas give Google money, and Google shows an adver-
tisement to people Google thinks will click on the adver-
tisement. Google only gets money when people click on
the link, so it tries to know as much about people as pos-
sible to only show the advertisement to the “right people”.
It does this with Google Analytics, which sends data back
to Google whenever someone visits a web site. From this
and other data, Google makes a proﬁle about the person,
which it then uses to ﬁgure out which advertisements to
show.

3 The name “Google”

The name “Google” is a misspelling of the word
g00g01.[7][8] Milton Sirotta, nephew of US. mathemati-
cian Edward Kasner, made this word in 1938, for the
number 1 followed by one hundred zeroes ( 10100 ). It
is said that the word “googol” was chosen as a name for
this number because it sounded like baby talk. Google
uses this word because the company wants to make lots
of stuff on the Web easy to ﬁnd and use. Andy Bechtol-
sheim ﬁrst thought of the name.

The name for Google’s main ofﬁce, the “Googleplex,” is a
play on a different, even bigger number, the "googolpleX",
which is 1 followed by one googol of zeroes.


image2text's Issues

traceback error

When I try to check the module using main.py, I get a traceback error (most recent call last):
.....
.....
.....

OSError: No such file or directory

The screenshot is attached

Add instructions to run script on MacOS

The Usage section in the README gives instructions for running the script on Linux, but it should be more generic and cover macOS as well.

Doesn't run on Windows

Hello,
I tried hard to run your program, but it doesn't work.
My computer has Tesseract-OCR installed (Windows).
Please help me: how do I run it on Windows?
Thanks

Add verbosity in arguments

Add a verbosity option (DEBUG / INFO) to the arguments, with INFO as the default.
Set the log level according to the user's input.
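
One way this request could be met is sketched below with the standard `logging` module. The function names `resolve_log_level` and `configure_logging` are hypothetical, and the boolean `debug` argument is assumed to come from a command-line switch such as `-d`.

```python
import logging


def resolve_log_level(debug=False):
    """Return DEBUG when the flag is set, otherwise the INFO default."""
    return logging.DEBUG if debug else logging.INFO


def configure_logging(debug=False):
    """Apply the chosen level to the root logger and return it."""
    level = resolve_log_level(debug)
    # basicConfig is a no-op if handlers already exist, so set the level explicitly too
    logging.basicConfig(level=level, format="%(levelname)s:%(name)s:%(message)s")
    logging.getLogger().setLevel(level)
    return level
```

Calling `configure_logging(debug=True)` would then make `logging.debug(...)` messages visible, while the default keeps output at INFO and above.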

How to use this tool with other language?

I want to use this tool for Vietnamese, but I cannot find where to set the language. I hope you will reply soon. Thanks.
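
For reference, the Tesseract command-line tool selects its language model with the `-l` option, and `vie` is the code for Vietnamese. The sketch below shows how a wrapper might pass that option through; the helper names are hypothetical, and nothing here implies that image2text already exposes such a setting. The matching language data must also be installed (for example the `tesseract-ocr-vie` package on Debian/Ubuntu).

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build a Tesseract CLI call; -l selects the language model to use."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract, which writes the recognised text to <output_base>.txt."""
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))


# Command that would be run for a Vietnamese page (file names are placeholders)
print(build_tesseract_command("page.png", "page", lang="vie"))
```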

How to run in Windows?

I'm trying to get this to run on Windows, but I keep getting this error:

INFO: Could not find files for the given pattern(s).
ERROR:root:tesseract-ocr missing, use install tesseract to resolve.

I have installed tesseract via pip install tesseract and I think I added it to my system PATH variable correctly (unsure). What can I do to get it to work?

Provide more informative error if "tesseract-ocr" is missing

If "tesseract-ocr" is missing on a user's system, the error the user gets is:

OSError: [Errno 2] No such file or directory

This is not very descriptive. We should give a more informative error, such as:
tesseract-ocr missing, use sudo apt-get install tesseract-ocr to resolve

Refer to #4
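
A minimal sketch of the idea, assuming Python 3, where a missing executable surfaces as `FileNotFoundError` (a subclass of `OSError`). The exception class and the `call_tesseract` helper are hypothetical names; only the message text comes from the issue.

```python
import subprocess

MISSING_MESSAGE = (
    "tesseract-ocr missing, use sudo apt-get install tesseract-ocr to resolve"
)


class TesseractMissingError(RuntimeError):
    """Raised when the Tesseract executable cannot be started."""


def call_tesseract(args, executable="tesseract"):
    """Run Tesseract and translate a missing binary into a descriptive error."""
    try:
        return subprocess.call([executable] + list(args))
    except FileNotFoundError as exc:
        # The OS could not find the executable, so explain how to fix that
        raise TesseractMissingError(MISSING_MESSAGE) from exc
```

The original exception is kept as the cause, so the low-level details remain available in the traceback.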

FileNotFoundError: [WinError 2] The system cannot find the file specified

I gave the image path as input on the command line as follows:
python main.py "C:\Users\colorssoftware1\Downloads\ocr-convert-image-to-text-master\sample\file-page1"

The output is as follows:
  File "main.py", line 54, in <module>
    main(path)
  File "main.py", line 16, in main
    if call(['which', 'tesseract']):
  File "C:\Program Files (x86)\Python36-32\lib\subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Program Files (x86)\Python36-32\lib\subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files (x86)\Python36-32\lib\subprocess.py", line 990, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

Where is the problem?
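
The traceback points at `call(['which', 'tesseract'])`. `which` is a Unix utility that is normally not available on Windows, so the lookup itself likely fails before Tesseract is ever checked. A portable alternative, sketched here as an assumption rather than as the project's fix, is `shutil.which` from the standard library (Python 3.3+); the helper name is hypothetical.

```python
import shutil


def tesseract_available(executable="tesseract"):
    """Return True if the executable can be found on PATH, on any platform."""
    # shutil.which searches PATH (and PATHEXT on Windows) without spawning a process
    return shutil.which(executable) is not None


if not tesseract_available():
    print("tesseract not found on PATH; install it and reopen the terminal")
```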

[Feature Request] Use only single input command to accept both file and directory

Currently we have two input arguments, input_dir and input_file. Instead, we should have a single input parameter and process it as a file or a directory depending on what the path points to.
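
A possible shape for this, as a sketch: one helper inspects the path and returns the list of images to process. The function name and the extension list are assumptions made for illustration.

```python
import os

# Assumed set of image extensions; adjust to whatever the script should accept
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif")


def collect_images(input_path):
    """Return image paths for a single file or for every image in a directory."""
    if os.path.isfile(input_path):
        # A single file is passed through as-is
        return [input_path]
    if os.path.isdir(input_path):
        # Keep only files whose extension looks like an image, in sorted order
        return sorted(
            os.path.join(input_path, name)
            for name in os.listdir(input_path)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    raise ValueError("Input path does not exist: {}".format(input_path))
```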

Add instructions to run script on Windows

The Usage section in the README gives instructions for running the script on Linux, but it should be more generic and cover Windows as well.

Add unit tests

Add unit tests for the script

Performance on large directories

Currently, the performance of the application on directories with a lot of images is very slow. #44 improves the performance on large directories.
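
The issue does not say how #44 achieves the speed-up. Purely as an illustration of one common approach, and not a description of that change, independent images can be converted concurrently with a thread pool. Here `convert_one` is a hypothetical callable that processes a single image path.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_all(image_paths, convert_one, max_workers=4):
    """Apply convert_one to every path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map yields results in the same order as image_paths
        return list(pool.map(convert_one, image_paths))
```

Threads are a reasonable fit when each conversion mostly waits on an external Tesseract process, since that waiting does not hold up the other workers.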

Add functionality to provide output path as well instead of assuming it

The output path is currently set to converted-text inside the source directory. Instead, we should take this information from the user.
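
A small sketch of how a user-supplied output path could be combined with the existing default. The function name is hypothetical; the `converted-text` default is the one described in this issue.

```python
import os


def resolve_output_dir(input_path, output_path=None, default_name="converted-text"):
    """Prefer the user's output path, else fall back to <source dir>/converted-text."""
    if output_path:
        return output_path
    # For a single file, the source directory is the file's parent directory
    source_dir = input_path if os.path.isdir(input_path) else os.path.dirname(input_path)
    return os.path.join(source_dir, default_name)
```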

ValueError when running example

I'm simply attempting to run the example and I get a ValueError in one of the PIL modules. Any ideas what is going on?

C:\Projects\OCR\ocr-convert-image-to-text-master>python main.py C:\Projects\OCR
Traceback (most recent call last):
  File "main.py", line 8, in <module>
    from pytesser.pytesser import *
  File "C:\Projects\OCR\ocr-convert-image-to-text-master\pytesser\pytesser.py", line 6, in <module>
    import Image
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 27, in <module>
    from . import VERSION, PILLOW_VERSION, _plugins
ValueError: Attempted relative import in non-package

Losing text in the result

Hi, thank you for sharing your solution.

I'm trying to convert a scanned PDF to text. I used ImageMagick to convert the PDF to JPG and OCR for the final conversion. I tried your code to see if I would get better results, but I still lose a lot of information in the process. Do you have any tips to improve my results? Or is there a better way to look for patterns in a document that I'm not seeing?

The documents I am trying to check do not have a standard format, but the information I need is usually in a table.