dannnylo / rtesseract

Ruby library for working with the Tesseract OCR.

Home Page: http://rubygems.org/gems/rtesseract

License: MIT License

Ruby 99.20% Shell 0.80%

rtesseract tesseract-ocr hacktoberfest tesseract ruby

rtesseract's Introduction

RTesseract

Ruby library for working with the Tesseract OCR.

Installation

Check if tesseract ocr programs are installed:

$ tesseract --version

If not, you can install them with a command like:

$ apt install tesseract-ocr

$ brew install tesseract

or, on Heroku 22, add the buildpack https://github.com/pathwaysmedical/heroku-buildpack-tesseract

Add this line to your application's Gemfile:

gem 'rtesseract'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rtesseract

Usage

It's very simple to use rtesseract.

Convert image to string

  image = RTesseract.new("my_image.jpg")
  image.to_s # Getting the value

Convert image to searchable PDF

  image = RTesseract.new("my_image.jpg")
  image.to_pdf  # Getting open file of pdf

Convert image to TSV

  image = RTesseract.new("my_image.jpg")
  image.to_tsv  # Getting open file of tsv

The PDF conversion preserves the image's colors, pictures and structure in the generated PDF.

Options

Language

    RTesseract.new('test.jpg', lang: 'deu')

eng - English
deu - German
deu-f - German fraktur
fra - French
ita - Italian
nld - Dutch
por - Portuguese
spa - Spanish
vie - Vietnamese
or any other supported by tesseract.

Note: Make sure the language data is installed for tesseract.

Other options

  RTesseract.new('test.jpg', config_file: :digits)  # Only digit recognition

  RTesseract.new('test.jpg', config_file: 'digits quiet')

Bounding box: get words with their positions

  RTesseract.new('test_words.png').to_box
  => [
    { :word => 'If', :confidence=>89, :x_start=>52, :y_start=>13, :x_end=>63, :y_end=>27},
    { :word => 'you', :confidence=>96, :x_start=>69, :y_start=>17, :x_end=>100, :y_end=>31},
    { :word => 'are', :confidence=>92, :x_start=>108, :y_start=>17, :x_end=>136, :y_end=>27},
    { :word => 'a', :confidence=>92, :x_start=>133, :y_start=>8, :x_end=>147, :y_end=>35},
    { :word => 'friend,', :confidence=>95, :x_start=>158, :y_start=>13, :x_end=>214, :y_end=>29},
    { :word => 'you', :confidence=>96, :x_start=>51, :y_start=>39, :x_end=>82, :y_end=>53},
    { :word => 'speak', :confidence=>96, :x_start=>90, :y_start=>35, :x_end=>140, :y_end=>53},
    { :word => 'the', :confidence=>96, :x_start=>146, :y_start=>35, :x_end=>174, :y_end=>49},
    { :word => 'password,', :confidence=>96, :x_start=>182, :y_start=>35, :x_end=>267, :y_end=>53},
    { :word => 'and', :confidence=>96, :x_start=>51, :y_start=>57, :x_end=>81, :y_end=>71},
    { :word => 'the', :confidence=>96, :x_start=>89, :y_start=>57, :x_end=>117, :y_end=>71},
    { :word => 'doors', :confidence=>96, :x_start=>124, :y_start=>57, :x_end=>172, :y_end=>71},
    { :word => 'will', :confidence=>96, :x_start=>180, :y_start=>57, :x_end=>208, :y_end=>71},
    { :word => 'open.', :confidence=>96, :x_start=>216, :y_start=>61, :x_end=>263, :y_end=>75}
  ]
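
As a minimal sketch of working with this result, the snippet below filters the words by confidence. The `boxes` array copies the first entries of the sample output above; the helper name `confident_words` and the threshold of 95 are assumptions chosen for illustration and are not part of rtesseract.

```ruby
# Sample data in the shape returned by to_box (first entries of the output above).
boxes = [
  { word: 'If',      confidence: 89, x_start: 52,  y_start: 13, x_end: 63,  y_end: 27 },
  { word: 'you',     confidence: 96, x_start: 69,  y_start: 17, x_end: 100, y_end: 31 },
  { word: 'are',     confidence: 92, x_start: 108, y_start: 17, x_end: 136, y_end: 27 },
  { word: 'a',       confidence: 92, x_start: 133, y_start: 8,  x_end: 147, y_end: 35 },
  { word: 'friend,', confidence: 95, x_start: 158, y_start: 13, x_end: 214, y_end: 29 }
]

# Hypothetical helper: keep only the words at or above a confidence threshold.
def confident_words(boxes, min_confidence)
  boxes.select { |box| box[:confidence] >= min_confidence }
       .map { |box| box[:word] }
end

puts confident_words(boxes, 95).join(' ') # prints "you friend,"
```

The same pattern works for the coordinates, for example to keep only words whose `x_start` falls inside a region of interest.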

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/dannnylo/rtesseract. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Rtesseract project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

rtesseract's Issues

Doesn't work well on CentOS

I use rtesseract but get no result and no error output.
It may happen in rtesseract 139:

`#{@command} "#{tmp_image.path}" "#{path.gsub(".txt","")}" #{lang} #{psm} #{config_file} #{clear_console_output}`

Avoid ignoring hyphens

rtesseract is awesome! Thank you for the library.

Is there any option to not remove hyphens (e.g. ABC-DEF)?

Unable to parse image at URL when run inside a Docker container

RTesseract.new takes the path to the image to be processed as shown:

image = RTesseract.new('./img.png')
puts image.to_s

=> text found in the image

Now let's consider the image hosted at this url: https://via.placeholder.com/150

When run locally, RTesseract appears to be able to take the URL as an image path.

image = RTesseract.new('https://via.placeholder.com/150')
puts image.to_s

=> 150x 150

However, this does not work when run inside an Alpine Docker container.

Traceback (most recent call last):
	3: from main.rb:3:in `<main>'
	2: from /usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract.rb:41:in `to_s'
	1: from /usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract/text.rb:8:in `run'
/usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract/command.rb:57:in `run': Error, cannot read input file https://via.placeholder.com/150: No such file or directory (RTesseract::Error)
Error during processing.

I've set up a small project where all this can be seen here:
https://github.com/abardallis/tesseract-test.git

To be honest, I'm a little surprised that it works locally. Its failure in the Docker container makes more sense, since passing in a URL isn't mentioned in the rtesseract documentation.

However, it does work locally and I would really enjoy it if it could work on this docker container as well given that I'm hoping to deploy this thing and do not want to store images locally first in order to be able to run them through rtesseract (for reasons).

This might be an underlying issue in how I'm building my docker image and have nothing to do with rtesseract, but I figure this is a good place to start. If there's no way of getting the thing to work on docker, and you could at least provide some context on why this thing is working locally, that would at least help me sleep at night.

Thanks!

ImageNotSelected using online file

Hello,

I'm trying to work with online files.
I tried to fetch the remote file into a tempfile so that rtesseract could read the words in it.
I tried using this code:
tmp_file = Tempfile.new(self.title)
open(fileUrl, 'r:UTF-8') do |url_file| #fileUrl is a string
  tmp_file.write(url_file.read)
end
tmp_file.rewind
begin
  RTesseract.new(self.title, command: 'tesseract_error', debug: true ).to_s
rescue => e
  return e.inspect
end

The result is an RTesseract::ImageNotSelectedError.
I don't know whether this is because I call to_s inside a method whose result is converted to JSON by a serializer, but when I return the image object itself I get formatted JSON showing an rmagick processor and a source.

Am I doing something wrong somewhere?

Thanks

Support for OCR engine modes

It seems like rtesseract currently doesn't support the option to specify an OCR engine mode.

I am currently using tesseract version 3.05.01 which has the following options:

OCR Engine modes:
  0    Original Tesseract only.
  1    Cube only.
  2    Tesseract + cube.
  3    Default, based on what is available.

I was wondering if there is any reason for not implementing such functionality. I'd be happy to try implementing it and opening a PR if needed.

Support hOCR output and parsing

I'd like to add support for reporting the bounding box of each word, similar to this API: https://github.com/meh/ruby-tesseract-ocr

I'm just adding this for record keeping and plan on working on it. I didn't miss anything, this sort of functionality doesn't exist in this gem right?

Conversion of image to searchable pdf

Hello, you describe how rtesseract can convert a scanned image into a searchable PDF using this code:

image = RTesseract.new("my_image.jpg")
image.to_pdf # Getting the pdf path
image.to_s # Still can get the value only.

...

some stuff

...

image.clean # to delete file once finished

I am already extracting the text into a PDF, but I don't know how to do exactly that: convert an image (actually a page from a scanned document that I have already converted to JPG) into a searchable PDF.

Is there anywhere you have described this process in a little more detail? I am sorry, but I don't understand how you do it.

Also, is it possible to convert an entire multi-page PDF document this way, without having to split it page by page?

Best regards and thanks in advance

Way to provide different data files?

Is there a way to supply a data file instead of english? I'm trying to do digital displays. I have the data file though.

Confidence

Is there a way to get the confidence of an OCR run?

Specs fail when run in random order

Adding --order rand to .rspec file and running the test suite causes numerous failures. I believe that the " support default config processors" specs in rtesseract_spec.rb interfere with other tests; other sources of interference are possible as well. I discovered this while trying to add a feature to this gem -- its tests passed when run on its own, but failed as part of the full suite.

One thing that might help address this would be a way to "reset" configuration options to their default; this could be run after each test in the suite. It would add some time to the test suite, but it's hard to see a better path to fixing this problem without a significant rewrite.
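
The "reset" idea can be sketched as follows. `ExampleConfig`, its `DEFAULTS` and the `reset!` method are hypothetical stand-ins written for this illustration; they are not rtesseract's actual configuration code, and the option names are only borrowed from the issues on this page.

```ruby
# Hypothetical stand-in for a gem's global configuration; not rtesseract's real code.
module ExampleConfig
  DEFAULTS = { command: 'tesseract', processor: 'rmagick' }.freeze

  class << self
    # Current settings, lazily initialised from a copy of the defaults.
    def settings
      @settings ||= DEFAULTS.dup
    end

    # Restore the defaults so one test cannot leak settings into the next.
    def reset!
      @settings = DEFAULTS.dup
    end
  end
end

ExampleConfig.settings[:processor] = 'mini_magick' # a spec changes a global option
ExampleConfig.reset!                               # an after(:each) hook would call this
puts ExampleConfig.settings[:processor]            # prints "rmagick"
```

In an RSpec suite the reset call would typically live in a global `config.after(:each)` hook, so that it runs regardless of the order in which examples execute.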

to_box returns empty array

Got a basic script to get all words from an image, but for some reason the to_box method returns an empty array. The to_s method returns the string from the image so OCR is working.

#scan.rb
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'rtesseract'
end

x = RTesseract.new('path/to/image.png', lang: 'nld', classify_enable_learning: '0', psm: '6', tessedit_char_blacklist: ': =').to_box
puts x.inspect

Weird collision between RTesseract and Commander gems

Heya. I've been trying to implement RTesseract in my app and it's been going great in testing so far, thanks for making it! However, once I started running my scripts from the command line, I've started running into a really strange problem when calling RTesseract.new(path):

wrong number of arguments (given 0, expected 1)
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/runner.rb:171:in `command'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/delegates.rb:16:in `command'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/command.rb:11:in `initialize'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/text.rb:6:in `new'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/text.rb:6:in `run'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract.rb:38:in `to_s'

I've been researching this pretty much all week and I've come to the conclusion that this is some sort of weird collision between the RTesseract and Commander gems. For whatever reason, when RTesseract::Command is initialized, this happens:

From: /home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/command.rb @ line 11 RTesseract::Command#initialize:

     7:     def initialize(source, output, options)
     8:       @source = source
     9:       @output = output
    10:       @options = options
 => 11: binding.pry
    12:       @full_command = [ options.command, @source, @output]
    13:     end

[1] pry(#<RTesseract::Command>)> options.command
ArgumentError: wrong number of arguments (given 0, expected 1)
from /home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/runner.rb:171:in `command'
[2] pry(#<RTesseract::Command>)> options
=> #<RTesseract::Configuration command="tesseract", debug_file="/dev/null">

Even though options is an RTesseract::Configuration object, it's somehow getting a callback or reference of some sort stuck on it from the Commander gem and thinks it's a method. I have no idea how to clear this up. Right now I'm going to fork RTesseract and replace options.command with 'tesseract' as that's the only command I need, but I'd rather not have to maintain that long term.

Any ideas?

Document how to scale

these two snippets

      image = RTesseract.read(image_path) do |img|
        img = img.scale 1.5
      end

      image = RTesseract.read(image_path) do |img|
        img = img.scale 2
      end

will have the same result. The image isn't changed and won't be touched. Absolutely no scaling here for the image outside the block.

The example is analogous to the one in the README.
For scaling, you should use the exclamation-mark (bang) methods, which change the instance itself.
So it should be:

      image = RTesseract.read(image_path) do |img|
        img.scale! 2
      end
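
The difference comes from plain Ruby semantics rather than from anything image-specific, which the following self-contained sketch tries to show. `FakeImage` and `read_image` are hypothetical stand-ins for the image class and for a reader that yields the image to a block; no RMagick code is involved.

```ruby
# Stand-in for an image object; Magick::Image is NOT used here.
class FakeImage
  attr_reader :width

  def initialize(width)
    @width = width
  end

  # Non-bang version: returns a new object and leaves the receiver untouched.
  def scale(factor)
    FakeImage.new(@width * factor)
  end

  # Bang version: changes the receiver in place.
  def scale!(factor)
    @width *= factor
    self
  end
end

# Mimics a reader that yields the image to a block and then keeps using
# the SAME object, ignoring whatever the block variable was reassigned to.
def read_image(image)
  yield image
  image
end

reassigned = read_image(FakeImage.new(100)) { |img| img = img.scale(2) }
mutated    = read_image(FakeImage.new(100)) { |img| img.scale!(2) }

puts reassigned.width # prints 100: only the block-local variable changed
puts mutated.width    # prints 200: the yielded object itself was modified
```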

Tesseract 3.04

Does it support tesseract-3.04?

RTesseract::ConversionError in Ruby on Rails app

The gems installed fine; tesseract and ImageMagick are already installed on the server.
Running the tesseract command manually in a terminal works.
Running the application locally in an OS X environment works.

uploaded_io = params[:picture]
    File.open(Rails.root.join('public', 'uploads', uploaded_io.original_filename), 'wb') do |file|
      file.write(uploaded_io.read)
    end
    dl = RTesseract.new(Rails.root.join('public', 'uploads',uploaded_io.original_filename).to_s)
    @string = dl.to_s

It only breaks, returning the error, once I've deployed to my development server. The files are copied to the public/uploads folder correctly and are readable, as tested by running the tesseract command outside of Ruby on the same file.

The result when running the app is RTesseract::ConversionError on the dl.to_s call.

I'm unsure what I'm missing.

Rails can't locate tesseract directory

I'm in a new rails 7 app running webpack. When I try to run

image_path = ActionController::Base.helpers.image_url('receipt.jpg')
image = RTesseract.new(image_path, lang: 'eng')
image.to_s

I keep getting the error

Errno::ENOENT: No such file or directory - tesseract

even though I have followed all the steps for setup. Any idea how to fix it?
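
`Errno::ENOENT ... tesseract` generally means that Ruby could not find an executable with that name when it tried to spawn it. A quick check, sketched below, is to look for the binary on the `PATH` seen by the Ruby process itself, which may differ from the shell's. `find_executable` is a hypothetical helper written for this example, not part of rtesseract, and it does not handle Windows extensions such as `.exe`.

```ruby
# Hypothetical helper: return the full path of an executable found on the
# given search path, or nil when it is missing.
def find_executable(name, search_path = ENV.fetch('PATH', ''))
  search_path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

path = find_executable('tesseract')
puts path ? "tesseract found at #{path}" : 'tesseract is not on PATH for this process'
```

Other issues on this page show a `command` setting (for example `command="tesseract"` inside an `RTesseract::Configuration`). Pointing it at a full path may help when the binary lives outside `PATH`, but that is an assumption to verify against the gem version in use.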

RMagick processor generates image with BPP higher than 32

The RMagick processor generates an image with more than 32 bits per pixel (BPP), which trips up Tesseract. I tried changing the rmagick processor to convert to PNG instead of TIF and that solves the problem.

Error in pixReadFromTiffStream: can't handle bpp > 32
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading test.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.

"can't modify frozen String (RuntimeError)" when trying to launch a RTesseract.new instance

Hey,

Thank you for your work on this great gem, I've already used it successfully in a couple of projects. But since I started to work on a new laptop a few days ago it stopped working...

Whenever I try to use old working code like

img = MiniMagick::Image.open(image_path)
str = RTesseract.new(img.path, processor: 'mini_magick').to_s

I get this error

Class:0x5220ff0: C:/RailsInstaller/Ruby1.9.3/lib/ruby/gems/1.9.1/gems/win32console-1.3.2-x86-mingw32/lib/Win32/Console/ANSI.rb:163:in `sub!': can't modify frozen String (RuntimeError)

I have everything installed correctly on my machine (Windows) : RailsInstaller, ImageMagick, Tesseract, and I am able to open and modify images with MiniMagick without any problem... Any help on this would be nice :)

Tests fail when português not installed

Hello!

I just ran the tests locally on my monolingual computer. Because I don't have the português dictionary for tesseract installed, one of the tests fails:
RTesseract.new(@image_tiff,{:lang=>"por"}).to_s_without_spaces.should eql("43ZZ")

I'm not sure whether the fix would be to install português, or to disable this test if the test runner doesn't have the proper dictionary.

What are your thoughts?
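
One possible approach, sketched here under assumptions, is to ask tesseract which languages it has and skip language-specific tests when the data is missing. The tesseract CLI provides `--list-langs`; the helper names below are hypothetical, and the sample output string is illustrative rather than captured from a real run.

```ruby
require 'open3'

# Parse the text printed by `tesseract --list-langs` into language codes.
# The header line ("List of available languages ...") is skipped.
def parse_language_list(output)
  output.lines.map(&:strip).reject { |line| line.empty? || line.start_with?('List of') }
end

# Ask the local tesseract binary which languages it has; returns [] when
# tesseract is not installed. stdout and stderr are merged because the
# stream used may differ between tesseract versions (an assumption).
def installed_languages(command = 'tesseract')
  output, _status = Open3.capture2e(command, '--list-langs')
  parse_language_list(output)
rescue Errno::ENOENT
  []
end

sample = "List of available languages (2):\neng\nosd\n" # illustrative output
puts parse_language_list(sample).include?('por') # prints "false"
```

In a spec, a guard such as `skip 'por data not installed' unless installed_languages.include?('por')` would then mark the example as skipped instead of failing.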

options -psm

Hello,
How do I pass the parameter -psm 1?

support for outputbase option

AFAICT there is no support for the outputbase option:

  tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

Is that correct?

Add support for converting to tiff without compression

Currently, RMagick will derive the compression to tiff based on the source input file when write is called on image_to_tiff. See http://www.imagemagick.org/RMagick/doc/info.html#quality

If you try to get the text for a jpeg, you'll get the following error:

CompressionNotSupported `JPEG' @ error/tiff.c/WriteTIFFImage/2589

This can be fixed by allowing write to be called with the compress option. See http://www.imagemagick.org/RMagick/doc/info.html#compression. Here's an example:

require "RMagick"
module RMagickProcessor
  extend self
  def image_to_tiff
    tmp_file = Tempfile.new(["",".tif"])
    cat = @instance || Magick::Image.read(@source.to_s).first
    cat.crop!(@x, @y, @w, @h) unless [@x, @y, @w, @h].compact == []
    cat.write(tmp_file.path.to_s){self.compression = Magick::NoCompression}
    return tmp_file
  end

  def read_with_processor(path)
    Magick::Image.read(path.to_s).first
  end

  def is_a_instance?(object)
    object.class == Magick::Image
  end
end

I can submit a patch for this if you want. Maybe allow a compression option on the RTesseract.new that can be passed through to this image_to_tiff method?

rtesseract aws eb not working

ImageNotSelectedError after using `.read`

I can successfully use RTesseract in this way:

filename = 'path/to/foo.jpg'
image = RTesseract.new(filename)
puts image.to_s

When I use .read to try to transform the image exactly as in the example in the README, though...

filename = 'path/to/foo.jpg'
image = RTesseract.read(filename) do |img|
  img = img.white_threshold(245)
  img = img.quantize(256,Magick::GRAYColorspace)
end
puts image.to_s

I get .../gems/rtesseract-1.3.0/lib/rtesseract.rb:228:in `to_s': RTesseract::ImageNotSelectedError (RTesseract::ImageNotSelectedError)

I get that same error even if I merely do

  image = RTesseract.read(filename) do |img| end
  puts image.to_s

Release new gem version

Hello, first of all thanks for your work!

I am stuck on version 2.0 of the gem, which is the latest on rubygems.org, and I am experiencing a file namespace clash due to

require 'utils'

in rtesseract.rb: rails has already require'd a utils.rb before it processed rtesseract.rb and does not do it twice.

Just before starting a pull request, I noticed you have already amended this. Can you please update rubygems with your latest code? Thanks again :)

Best Heroku Build Pack

Based on this comment I see that @jessethebuilder prefers this Heroku buildpack.

Is there a common consensus by folks using your gem on what Heroku buildpack is the best for running tesseract OCR in heroku?

Tesseract OCR

I am basically a beginner in programming and am developing a web application that converts images into text. I have used Orcad for this, but it only handles simple text and causes many issues. I want to know whether this library would help me build an interface where a user can upload a file in formats such as PNG or JPEG and the web app converts it into text. I am learning Node.js for this; kindly advise which is better and easier to learn for this particular project.

it's a good gem! but it still has some problems

I have a PNG picture. The captcha shows the number '8706', but I got '8705'. Sometimes, where I expect '0', it outputs 'u'.

No implicit conversion of Tempfile into String when using URL

I'm trying to get the text from an image downloaded via URL:

url = "https://urlhere.com/image.png"
file = Tempfile.new(['image', File.extname(url)])
file.binmode
file.write open(url).read
file.flush
            
image = RTesseract.new(file)
image.to_s

This results in the following error:

TypeError: no implicit conversion of Tempfile into String

Not sure what I'm doing wrong here.
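
The error message suggests that a String path is expected where the Tempfile object itself was passed, so passing `file.path` is the likely fix. The sketch below isolates that step. `copy_to_tempfile` is a hypothetical helper, and a `StringIO` stands in for the downloaded data so the example runs without network access or the gem.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper: copy any readable IO (for example the object returned
# by URI.open from open-uri) into a binary Tempfile and return the Tempfile.
def copy_to_tempfile(io, extension)
  file = Tempfile.new(['image', extension])
  file.binmode
  IO.copy_stream(io, file)
  file.flush
  file
end

# StringIO stands in for a downloaded image so the sketch runs offline.
file = copy_to_tempfile(StringIO.new('fake image bytes'), '.png')
puts file.path.class # prints "String", which is what a path argument needs to be
# With the gem installed and a real image: RTesseract.new(file.path).to_s
file.close! # closes and deletes the temporary file
```

Keep a reference to the Tempfile until OCR has finished, since the file is deleted when the object is closed with `close!` or garbage-collected.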

Orientation support?

Hello, is there any plan to support Tesseract's orientation detection?

Handling timeouts

Hey there, I need to implement timeout for a long running Tesseract command.

I came up with two options how to do it:

1. Add a timeout option to RTesseract.new and reimplement Command#run using Open3.popen3 instead of Open3.capture3, catching the timeout there (if set).
2. Add an async option to RTesseract.new and implement run_async and results methods, also using Open3.popen3, which would return the PID so that the timeout (killing the process) can be handled in the client code.

What do you think? Should I try to open a PR? Thanks!
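
A rough sketch of the first option, using only the Ruby standard library, might look like this. `run_with_timeout` is a hypothetical helper and not rtesseract code. To stay runnable without tesseract installed, the example launches the current Ruby interpreter as the long-running command.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a command, kill it if it exceeds `timeout`
# seconds, and report what happened.
def run_with_timeout(*command, timeout:)
  Open3.popen3(*command) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Drain both pipes in threads so a chatty process cannot block on a full pipe.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join returns nil when the limit expires before the process exits.
    timed_out = wait_thr.join(timeout).nil?
    if timed_out
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The process exited on its own just after the deadline.
      end
    end

    { timed_out: timed_out, status: wait_thr.value,
      stdout: out_reader.value, stderr: err_reader.value }
  end
end

result = run_with_timeout(RbConfig.ruby, '-e', 'sleep 5', timeout: 0.5)
puts result[:timed_out] # prints "true"
```

The second option from the list would instead return `wait_thr.pid` to the caller immediately and leave the decision to kill the process to client code.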

Is it possible to read image from object instead of file?

Error reading resulting txt file in convert operation

Intermittently, I encounter an error where the output of the tesseract command in RTesseract.convert does not produce the txt file in a timely fashion (or possibly at all).

The error is:

No such file or directory - /tmp/20140108-25243-1mgiinb.txt
2014-01-08T19:59:21Z 25243 TID-121zho WARN: /home/deployer/apps/my_app/shared/bundle/ruby/1.9.1/bundler/gems/rtesseract-5653a4485fe9/lib/rtesseract.rb:140:in `read'

This error is pretty common when I am processing multiple images concurrently (using Sidekiq) and the RMagick processor. It seems to be a lot less frequent when I switch over to the MiniMagick processor, but still occurs.

JPG parsed significantly better than PDF

When parsing the same file as a PDF instead of a JPG, I got far worse results. Is there an obvious reason for this difference?

Some .to_s results contain tesseract(?) error messages

When I process some (not all) images, then I see Tesseract(?) errors appended to the result text:

Error in boxClipToRectangle: box outside rectangle\nError in pixScanForForeground: invalid box\nCreated On: 10 March 2015\nRepository: Frost\n\nEntomological Museum (PSUC)\n\n \n\f

OS X 10.12.6
rvm Ruby 2.5.1
rtesseract 3.0.2
rmagick 2.16.0

Can supply the image if needed.
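
Until the cause is found, one workaround is to strip such diagnostic lines from the result after the fact. The helper below is a hypothetical post-processing sketch, not an rtesseract feature. It assumes the unwanted lines always have the form `Error in <function>: ...`, so it would also drop genuine text that happens to match that pattern.

```ruby
# Hypothetical post-processing helper: drop lines that look like
# "Error in <function>: ..." diagnostics from OCR output.
def strip_error_lines(text)
  text.lines.reject { |line| line.match?(/\AError in \w+: /) }.join
end

raw = "Error in boxClipToRectangle: box outside rectangle\n" \
      "Error in pixScanForForeground: invalid box\n" \
      "Created On: 10 March 2015\nRepository: Frost\n"

puts strip_error_lines(raw)
# prints:
# Created On: 10 March 2015
# Repository: Frost
```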

Use existing tiff

I'm extracting all the pages of a PDF as TIFFs thru the mini_magick gem & I'd like to feed each of these to rtesseract w/o having it unnecessarily re-generate new, temporary tiffs. Short of monkey patching your image method, is there any way to do this?

feature request/question

I have an image I'm trying to OCR where the text is in two columns. As you can imagine, parsing the text from column A to column B is rather difficult, so I was wondering how I might tell RTesseract to read just the left half of my image and then the right half.

I suppose I could use ImageMagick to split the image in two and then OCR both halves. It would be nice if I could tell Tesseract explicitly to use only a four-corner bounding box for the OCR and ignore everything else in the image. (That way I could eliminate the step of splitting the image in half and making two more images.)
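
Whichever tool ends up doing the cropping, the rectangles themselves are simple to compute. The sketch below uses a hypothetical helper, `column_boxes`, that splits an image of known size into vertical strips. The `x`, `y`, `w`, `h` key names mirror the `@x, @y, @w, @h` variables visible in the RMagick processor snippet elsewhere on this page; whether a given RTesseract version accepts them as options is not confirmed here.

```ruby
# Hypothetical helper: split an image of the given size into equal-width
# vertical strips and return one crop rectangle per strip.
def column_boxes(width, height, columns = 2)
  column_width = width / columns
  Array.new(columns) do |index|
    x = index * column_width
    # The last strip absorbs any pixels left over by integer division.
    w = index == columns - 1 ? width - x : column_width
    { x: x, y: 0, w: w, h: height }
  end
end

column_boxes(1001, 800).each { |box| puts box.inspect }
```

Each rectangle can then be handed to an image library's crop call, or to whatever area option the OCR wrapper provides, and the resulting texts concatenated in column order.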

Using class method read results in error

It appears that the example using RTesseract::read fails with the following error:

NameError: uninitialized constant RMagickProcessor::Magick
    from .../gems/rtesseract-1.2.0/lib/processors/rmagick.rb:21:in `read_with_processor'
    from .../gems/rtesseract-1.2.0/lib/rtesseract.rb:66:in `read'
    from (irb):2

Looking through the code, it appears the processor is not require'd (via RMagickProcessor::setup). The tests are passing since this only needs to happen once. This is taken care of by RTesseract#choose_processor! in the constructor but there is no such call in RTesseract::read

RTesseract.read throws Encoding::UndefinedConversionError

Hello, I have an issue with read, when doing:

ocr = RTesseract.read("my_image.jpg") do |img|
  img
end

this fails in the from_blob method with the error Encoding::UndefinedConversionError.

Now, the same image goes fine through RTesseract when doing

ocr = RTesseract.new("my_image.jpg")
ocr.to_s

so the file is fine. Any setting I should set to fix this?

How to pass CLI flags?

I see that options can be set via a hash, but what about flags like -l best/eng -c preserve_interword_spaces=1?

Conversion Error

Hi - I am trying to convert my first image... and getting this debug output and a Conversion Error exception. If I run tesseract to convert the image from the command line, it works fine. Possibly I have a configuration issue? Thanks for any suggestions.

Tesseract Open Source OCR Engine v3.03 with Leptonica
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/20150310-23757-162n079.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.
RTesseract::ConversionError: RTesseract::ConversionError

After further testing, it appears that this error comes from tesseract not being able to process .tif files.

tesseract sample.png stdout   ** WORKS **

tesseract sample.tif stdout   ** FAILS **

s.to_s RTesseract::ConversionError: RTesseract::ConversionError

Hi, I got this error. Please advise on how to solve this

irb(main):001:0> require 'rtesseract'
=> true
irb(main):002:0> RTesseract.new("imag.jpg")
=> #<RTesseract:0x007fceda3e6ce0 @command="tesseract", @lang="", @psm=nil, @processor=RMagickProcessor, @debug=false, @options_cmd=[], @clear_console_output=true, @options={}, @h=nil, @w=nil, @y=nil, @x=nil, @value="", @source=#<Pathname:imag.jpg>>
irb(main):003:0> s = _
=> #<RTesseract:0x007fceda3e6ce0 @command="tesseract", @lang="", @psm=nil, @processor=RMagickProcessor, @debug=false, @options_cmd=[], @clear_console_output=true, @options={}, @h=nil, @w=nil, @y=nil, @x=nil, @value="", @source=#<Pathname:imag.jpg>>
irb(main):004:0> s.to_s
RTesseract::ConversionError: RTesseract::ConversionError
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:204:in `rescue in convert'
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:199:in `convert'
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:225:in `to_s'
from (irb):4
from /Users/jalenong/.rbenv/versions/2.2.2/bin/irb:11:in `<main>'

`mini_magick` support is broken in `RTesseract.read`

TL;DR: RTesseract.read('foo.jpg', processor: 'mini_magick') results in cannot load such file -- RMagick

As the hash method with the misleading name option actually deletes keys from hashes, the given options hash won't be the same in lines six and nine.

So setting processor to mini_magick will just work until it gets deleted by the option method.
After that RTesseract will be initialized without processor option and therefore fall back to RMagick.

Set tesseract config

Is there any way to use another config file?

If so, it would be nice to see it documented.

Thank you very much!

support tesseract 3

Conversion Error on Heroku

I am getting the following when running RTesseract on Heroku

RTesseract::ConversionError (No such file or directory @ rb_sysopen - /tmp/1489639277.1960108205.txt)

The numbers for the file in tmp change, but I only get this on Heroku. Everything is working fine on localhost.