rtesseract's People

Contributors

abarrak, alexanderadam, antronin, c4lliope, dannnylo, debbbbie, deepsourcebot, dependabot[bot], espinosa, hooopo, jimi-c, joao, lukeasrodgers, luxflux, pragmaticed, pumpchaser, ryankopf, scrumers, steakchaser, thiagoalessio, tjad, zprickett

rtesseract's Issues

No implicit conversion of Tempfile into String when using URL

I'm trying to get the text from an image downloaded via URL:

url = "https://urlhere.com/image.png"
file = Tempfile.new(['image', File.extname(url)])
file.binmode
file.write open(url).read
file.flush
            
image = RTesseract.new(file)
image.to_s

This results in the following error:

TypeError: no implicit conversion of Tempfile into String

Not sure what I'm doing wrong here.
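
One likely cause, assuming this version of RTesseract expects a file path (a String or Pathname) rather than an IO object, is that the Tempfile itself is passed to RTesseract.new. A minimal sketch of the workaround is to pass file.path instead. The helper name bytes_to_tempfile is hypothetical, and the RTesseract call is left commented out because it needs the gem and the tesseract binary:

```ruby
require 'tempfile'

# Hypothetical helper: writes raw image bytes to a Tempfile and returns it.
# The extension is kept so downstream tools can infer the image format.
def bytes_to_tempfile(bytes, extension)
  file = Tempfile.new(['image', extension])
  file.binmode
  file.write(bytes)
  file.flush
  file
end

# In the issue the bytes come from open(url).read; a literal stands in here.
file = bytes_to_tempfile("\x89PNG\r\n".b, '.png')

path = file.path # a String, unlike the Tempfile object itself
puts path.class

# Assumes the rtesseract gem and the tesseract binary are installed:
# image = RTesseract.new(path)
# puts image.to_s
```

Keeping a reference to the Tempfile object (here, file) until OCR has finished matters, because Ruby may delete the file once the object is garbage collected.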

feature request/question

I have an image I'm trying to OCR where the text is in two columns. As you can imagine, separating the text of column A from column B is rather difficult, so I was wondering how I might tell RTesseract to read just the left half of my image and then the right half.

I suppose I could use ImageMagick to split the image in two and then OCR both halves. It would be nice if I could tell Tesseract explicitly to use only a four-corner bounding box for the OCR and ignore everything else in the image. (That way I could eliminate the step of splitting the image in half and creating two more images.)
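
As a rough sketch of the bounding-box idea, the two column rectangles can be computed from the image size. The helper name column_boxes is hypothetical, and whether the installed RTesseract version accepts such a rectangle is not confirmed here; another issue in this list shows the RMagick processor cropping with @x, @y, @w and @h, which is the shape used below:

```ruby
# Returns crop rectangles for the left and right halves of an image.
# Each rectangle uses x/y (top-left corner) and w/h (size) in pixels.
def column_boxes(width, height)
  left_width = width / 2
  [
    { x: 0, y: 0, w: left_width, h: height },
    { x: left_width, y: 0, w: width - left_width, h: height }
  ]
end

left, right = column_boxes(1001, 800)
puts left.inspect
puts right.inspect
```

With an odd width the extra pixel goes to the right half, so the two rectangles always cover the full image without overlapping.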

Conversion of image to searchable pdf

Hello, you describe how rtesseract can convert a scanned image into a searchable pdf using this code:

image = RTesseract.new("my_image.jpg")
image.to_pdf # Getting the pdf path
image.to_s # Still can get the value only.

...

some stuff

...

image.clean # to delete file once finished

I am already extracting the text into a PDF, but I don't know how to do exactly that: converting an image (actually a page from a scanned document that I have already converted to JPG) into a searchable PDF.

Is there any place where you have described this process in a little more detail? I am sorry, but I don't understand how you do it.

Also, is it possible to convert an entire multi-page PDF document this way, without having to split it into pages one by one?

Best regards and thanks in advance

Support for OCR engine modes

It seems like rtesseract currently doesn't support the option to specify an OCR engine mode.

I am currently using tesseract version 3.05.01 which has the following options:

OCR Engine modes:
  0    Original Tesseract only.
  1    Cube only.
  2    Tesseract + cube.
  3    Default, based on what is available.

I was wondering if there is any reason for not implementing such functionality. I'd be happy to try implementing it and opening a PR if needed.

"can't modify frozen String (RuntimeError)" when trying to launch a RTesseract.new instance

Hey,

Thank you for your work on this great gem, I've already used it successfully in a couple of projects. But since I started to work on a new laptop a few days ago it stopped working...

Whenever I try to use old working code like

img = MiniMagick::Image.open(image_path)
str = RTesseract.new(img.path, processor: 'mini_magick').to_s

I get this error

Class:0x5220ff0: C:/RailsInstaller/Ruby1.9.3/lib/ruby/gems/1.9.1/gems/win32console-1.3.2-x86-mingw32/lib/Win32/Console/ANSI.rb:163:in `sub!': can't modify frozen String (RuntimeError)

I have everything installed correctly on my machine (Windows) : RailsInstaller, ImageMagick, Tesseract, and I am able to open and modify images with MiniMagick without any problem... Any help on this would be nice :)

Add support for converting to tiff without compression

Currently, RMagick will derive the compression to tiff based on the source input file when write is called on image_to_tiff. See http://www.imagemagick.org/RMagick/doc/info.html#quality

If you try to get the text for a jpeg, you'll get the following error:

CompressionNotSupported `JPEG' @ error/tiff.c/WriteTIFFImage/2589

This can be fixed by allowing write to be called with the compress option. See http://www.imagemagick.org/RMagick/doc/info.html#compression. Here's an example:

require "RMagick"
module RMagickProcessor
  extend self
  def image_to_tiff
    tmp_file = Tempfile.new(["",".tif"])
    cat = @instance || Magick::Image.read(@source.to_s).first
    cat.crop!(@x, @y, @w, @h) unless [@x, @y, @w, @h].compact == []
    cat.write(tmp_file.path.to_s){self.compression = Magick::NoCompression}
    return tmp_file
  end

  def read_with_processor(path)
    Magick::Image.read(path.to_s).first
  end

  def is_a_instance?(object)
    object.class == Magick::Image
  end
end

I can submit a patch for this if you want. Maybe allow a compression option on the RTesseract.new that can be passed through to this image_to_tiff method?

How to pass CLI flags?

I see that options can be set via a hash, but what about flags like -l best/eng -c preserve_interword_spaces=1?
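
For orientation, here is a sketch of the command line those flags would need to end up on. It does not show how rtesseract maps its options hash to flags; the helper tesseract_args and the file names are hypothetical, while -l and -c are the flags quoted in the question:

```ruby
# Hypothetical helper: builds the argument list for a tesseract invocation.
# -l selects the language and each -c sets one config variable.
def tesseract_args(image, output_base, lang: nil, config_vars: {})
  args = ['tesseract', image, output_base]
  args += ['-l', lang] if lang
  config_vars.each { |key, value| args += ['-c', "#{key}=#{value}"] }
  args
end

args = tesseract_args('page.png', 'out',
                      lang: 'best/eng',
                      config_vars: { preserve_interword_spaces: 1 })
puts args.inspect
```

Keeping the arguments as an array rather than one string means they can be handed to Open3 or system without any shell quoting.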

Conversion Error

Hi - I am trying to convert my first image... and getting this debug output and a Conversion Error exception. If I run tesseract to convert the image from the command line, it works fine. Possibly I have a configuration issue? Thanks for any suggestions.

Tesseract Open Source OCR Engine v3.03 with Leptonica
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/20150310-23757-162n079.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.
RTesseract::ConversionError: RTesseract::ConversionError

After further testing, it appears that this error is caused by tesseract being unable to process .tif files.

tesseract sample.png stdout   ** WORKS **

tesseract sample.tif stdout   ** FAILS **

Specs fail when run in random order

Adding --order rand to .rspec file and running the test suite causes numerous failures. I believe that the " support default config processors" specs in rtesseract_spec.rb interfere with other tests; other sources of interference are possible as well. I discovered this while trying to add a feature to this gem -- its tests passed when run on its own, but failed as part of the full suite.

One thing that might help address this would be a way to "reset" configuration options to their default; this could be run after each test in the suite. It would add some time to the test suite, but it's hard to see a better path to fixing this problem without a significant rewrite.
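
A minimal sketch of such a reset, using a stand-in configuration class rather than the gem's real one. The names Settings, SETTINGS and the option keys are hypothetical; only the idea of restoring defaults after each example comes from the issue:

```ruby
# Hypothetical stand-in for a gem-wide configuration object.
class Settings
  DEFAULTS = { processor: 'rmagick', lang: nil }.freeze

  def initialize
    reset!
  end

  # Restores every option to its default value.
  def reset!
    @values = DEFAULTS.dup
  end

  def [](key)
    @values.fetch(key)
  end

  def []=(key, value)
    raise KeyError, "unknown option: #{key}" unless DEFAULTS.key?(key)

    @values[key] = value
  end
end

SETTINGS = Settings.new

# One spec changes the processor...
SETTINGS[:processor] = 'mini_magick'
# ...and an after-each hook would undo it, for example:
#   RSpec.configure { |c| c.after(:each) { SETTINGS.reset! } }
SETTINGS.reset!
puts SETTINGS[:processor]
```

Copying a small hash of defaults is cheap, so the cost per example should be negligible compared with running tesseract itself.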

ImageNotSelected using online file

Hello,

I'm trying to work with online files.
I tried to download a remote file into a tempfile so that rtesseract could read the words in it.
I tried using this code:
tmp_file = Tempfile.new(self.title)
open(fileUrl, 'r:UTF-8') do |url_file| # fileUrl is a string
  tmp_file.write(url_file.read)
end
tmp_file.rewind
begin
  RTesseract.new(self.title, command: 'tesseract_error', debug: true).to_s
rescue => e
  return e.inspect
end

The result is an RTesseract::ImageNotSelectedError
I don't know whether it is because I call to_s inside a method whose result is converted to JSON by a serializer, but when I return the image object instead I get well-formed JSON showing the RMagick processor and a source.

Am I doing something wrong somewhere?

Thanks

RMagick processor generates image with bpp higher than 32

The RMagick processor generates an image with more than 32 bits per pixel (bpp), which trips up Tesseract. I tried changing the RMagick processor to convert to PNG instead of TIF, and that solves the problem.

Error in pixReadFromTiffStream: can't handle bpp > 32
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading test.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.

s.to_s RTesseract::ConversionError: RTesseract::ConversionError

Hi, I got this error. Please advise on how to solve this

irb(main):001:0> require 'rtesseract'
=> true
irb(main):002:0> RTesseract.new("imag.jpg")
=> #<RTesseract:0x007fceda3e6ce0 @command="tesseract", @lang="", @psm=nil, @processor=RMagickProcessor, @debug=false, @options_cmd=[], @clear_console_output=true, @options={}, @h=nil, @w=nil, @y=nil, @x=nil, @value="", @source=#<Pathname:imag.jpg>>
irb(main):003:0> s = _
=> #<RTesseract:0x007fceda3e6ce0 @command="tesseract", @lang="", @psm=nil, @processor=RMagickProcessor, @debug=false, @options_cmd=[], @clear_console_output=true, @options={}, @h=nil, @w=nil, @y=nil, @x=nil, @value="", @source=#<Pathname:imag.jpg>>
irb(main):004:0> s.to_s
RTesseract::ConversionError: RTesseract::ConversionError
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:204:in `rescue in convert'
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:199:in `convert'
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:225:in `to_s'
from (irb):4
from /Users/jalenong/.rbenv/versions/2.2.2/bin/irb:11:in `<main>'

Doesn't work well on CentOS

When I use rtesseract, I get no text back and no error is printed.
It may be happening around line 139 of rtesseract:

`#{@command} "#{tmp_image.path}" "#{path.gsub(".txt","")}" #{lang} #{psm} #{config_file} #{clear_console_output}`

RTesseract::ConversionError in Ruby on Rails app

  • Installed gems fine; tesseract and ImageMagick were already installed on the server.
  • Running the tesseract command manually in a terminal works.
  • Running the application locally in an OS X environment works.

uploaded_io = params[:picture]
File.open(Rails.root.join('public', 'uploads', uploaded_io.original_filename), 'wb') do |file|
  file.write(uploaded_io.read)
end
dl = RTesseract.new(Rails.root.join('public', 'uploads', uploaded_io.original_filename).to_s)
@string = dl.to_s

It only breaks, returning the error, once I've deployed to my development server. The files are being copied to the public/uploads folder correctly, and they are readable, as tested by running the tesseract command outside of Ruby on the same file.

The result when running the app is RTesseract::ConversionError on the dl.to_s action

Unsure on what I'm missing..

Document how to scale

These two snippets

      image = RTesseract.read(image_path) do |img|
        img = img.scale 1.5
      end
      image = RTesseract.read(image_path) do |img|
        img = img.scale 2
      end

will have the same result. The image is not changed: reassigning the block variable has no effect on the image outside the block, so no scaling takes place.

The example is analogous to the one in the README.
For scaling, you should use the exclamation-mark methods, which change the instance itself.
So it should be:

      image = RTesseract.read(image_path) do |img|
        img.scale! 2
      end
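
The difference can be reproduced without any image library. In this sketch FakeImage and read_image are hypothetical stand-ins for the image object and for a read-style method that yields it; only the scale versus scale! distinction is taken from the issue:

```ruby
# Stand-in for an image object: scale returns a copy, scale! changes self.
class FakeImage
  attr_reader :width

  def initialize(width)
    @width = width
  end

  def scale(factor)
    FakeImage.new((@width * factor).round)
  end

  def scale!(factor)
    @width = (@width * factor).round
    self
  end
end

# Mimics a read-style method that yields the image and keeps its own reference.
def read_image(image)
  yield image
  image
end

# Reassigning the block variable only rebinds a local name inside the block.
unchanged = read_image(FakeImage.new(100)) { |img| img = img.scale(2) }
# The bang method mutates the object that the caller still holds.
changed = read_image(FakeImage.new(100)) { |img| img.scale!(2) }

puts unchanged.width
puts changed.width
```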

Tests fail when português not installed

Hello!

I just ran the tests locally on my monolingual computer. Because I don't have the português dictionary for tesseract installed, one of the tests fails:
RTesseract.new(@image_tiff,{:lang=>"por"}).to_s_without_spaces.should eql("43ZZ")

I'm not sure whether the fix would be to install português or to disable this test when the test runner doesn't have the proper dictionary.

What are your thoughts?
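
One possible approach, sketched under the assumption that the installed tesseract supports the --list-langs flag and prints a header line followed by one language code per line, is to detect the installed languages and skip the test when por is missing. The helper names parse_langs and installed_langs are hypothetical:

```ruby
require 'open3'

# Parses the text printed by `tesseract --list-langs`: a header line
# followed by one language code per line.
def parse_langs(output)
  output.lines.drop(1).map(&:strip).reject(&:empty?)
end

# Returns the installed language codes, or [] when tesseract is missing.
def installed_langs
  output, _status = Open3.capture2e('tesseract', '--list-langs')
  parse_langs(output)
rescue Errno::ENOENT
  []
end

sample = "List of available languages (3):\neng\nosd\npor\n"
puts parse_langs(sample).include?('por')

# In a spec this could guard the Portuguese example, for instance:
#   skip 'por traineddata not installed' unless installed_langs.include?('por')
```

Using capture2e merges stdout and stderr, which sidesteps the question of which stream a given tesseract version writes the list to.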

Tesseract OCR

I am basically a beginner in programming and am developing a web application that converts images into text. I have used Orcad for this, but it only handles simple text and is causing many issues. I want to know whether this library would help me: I would like to give users an interface where they can upload a file in a format such as PNG or JPEG and have the web app convert it into text. I am learning Node.js for this. Kindly advise which is better and easier to learn for this particular project.

Conversion Error on Heroku

I am getting the following when running RTesseract on Heroku

RTesseract::ConversionError (No such file or directory @ rb_sysopen - /tmp/1489639277.1960108205.txt)

The numbers for the file in tmp change, but I only get this on Heroku. Everything is working fine on localhost.

Handling timeouts

Hey there, I need to implement timeout for a long running Tesseract command.

I came up with two options how to do it:

  1. Add a timeout option to RTesseract.new and reimplement Command#run using Open3.popen3 instead of Open3.capture3, catching the timeout there (if set).
  2. Add an async option to RTesseract.new and implement run_async and results methods, also using Open3.popen3, which would return the PID so that the timeout (killing the process) can be handled in the client code.

What do you think? Should I try to open a PR? Thanks!
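
A minimal sketch of the first option, written as a standalone function rather than as the gem's Command#run. The names run_with_timeout and CommandTimeout are hypothetical, and a short Ruby one-liner stands in for the tesseract command so the example runs anywhere:

```ruby
require 'open3'
require 'rbconfig'

class CommandTimeout < StandardError; end

# Runs cmd (an argv array) and returns [stdout, stderr, status].
# Kills the child and raises CommandTimeout if it runs longer than
# `timeout` seconds.
def run_with_timeout(cmd, timeout)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Drain both pipes in threads so a chatty child cannot block on a full pipe.
    readers = [stdout, stderr].map { |io| Thread.new { io.read } }

    if wait_thr.join(timeout).nil?
      Process.kill('KILL', wait_thr.pid)
      wait_thr.join
      readers.each(&:join)
      raise CommandTimeout, "command exceeded #{timeout}s"
    end

    [*readers.map(&:value), wait_thr.value]
  end
end

out, _err, status = run_with_timeout([RbConfig.ruby, '-e', 'print 42'], 10)
puts out if status.success?
```

The second option would expose wait_thr.pid to the caller instead of killing the process here, at the cost of a larger public API.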

ImageNotSelectedError after using `.read`

I can successfully use RTesseract in this way:

filename = 'path/to/foo.jpg'
image = RTesseract.new(filename)
puts image.to_s

When I use .read to try to transform the image exactly as in the example in the README, though...

filename = 'path/to/foo.jpg'
image = RTesseract.read(filename) do |img|
  img = img.white_threshold(245)
  img = img.quantize(256,Magick::GRAYColorspace)
end
puts image.to_s

I get .../gems/rtesseract-1.3.0/lib/rtesseract.rb:228:in `to_s': RTesseract::ImageNotSelectedError (RTesseract::ImageNotSelectedError)

I get that same error even if I merely do

  image = RTesseract.read(filename) do |img| end
  puts image.to_s

Release new gem version

Hello, first of all thanks for your work!

I got stuck on the 2.0 version of the gem, which is the latest on rubygems.org. I am experiencing a file name clash due to

require 'utils'

in rtesseract.rb: rails has already require'd a utils.rb before it processed rtesseract.rb and does not do it twice.

Just before starting a pull request, I noticed you have already amended this. Can you please update rubygems with your latest code? Thanks again :)

Set tesseract config

Is there any way to use another config file?

If so, it would be nice to see it documented.

Thank you very much!

RTesseract.read throws Encoding::UndefinedConversionError

Hello, I have an issue with read, when doing:

ocr = RTesseract.read("my_image.jpg") do |img|
  img
end

this fails in the from_blob method with the error Encoding::UndefinedConversionError.

Now, the same image goes fine through RTesseract when doing

ocr = RTesseract.new("my_image.jpg")
ocr.to_s

so the file is fine. Any setting I should set to fix this?

Some .to_s results contain tesseract(?) error messages

When I process some (not all) images, then I see Tesseract(?) errors appended to the result text:

Error in boxClipToRectangle: box outside rectangle\nError in pixScanForForeground: invalid box\nCreated On: 10 March 2015\nRepository: Frost\n\nEntomological Museum (PSUC)\n\n \n\f

OS X 10.12.6
rvm Ruby 2.5.1
rtesseract 3.0.2
rmagick 2.16.0

Can supply the image if needed.

Error reading resulting txt file in convert operation

Intermittently, I encounter an error where the output of the tesseract command in RTesseract.convert does not produce the txt file in a timely fashion (or possibly at all).

The error is:

No such file or directory - /tmp/20140108-25243-1mgiinb.txt
2014-01-08T19:59:21Z 25243 TID-121zho WARN: /home/deployer/apps/my_app/shared/bundle/ruby/1.9.1/bundler/gems/rtesseract-5653a4485fe9/lib/rtesseract.rb:140:in `read'

This error is pretty common when I am processing multiple images concurrently (using Sidekiq) and the RMagick processor. It seems to be a lot less frequent when I switch over to the MiniMagick processor, but still occurs.

to_box returns empty array

Got a basic script to get all words from an image, but for some reason the to_box method returns an empty array. The to_s method returns the string from the image so OCR is working.

#scan.rb
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'rtesseract'
end

x = RTesseract.new('path/to/image.png', lang: 'nld', classify_enable_learning: '0', psm: '6', tessedit_char_blacklist: ': =').to_box
puts x.inspect

Weird collision between RTesseract and Commander gems

Heya. I've been trying to implement RTesseract in my app and it's been going great in testing so far, thanks for making it! However, once I started running my scripts from the command line, I've started running into a really strange problem when calling RTesseract.new(path):

wrong number of arguments (given 0, expected 1)
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/runner.rb:171:in `command'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/delegates.rb:16:in `command'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/command.rb:11:in `initialize'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/text.rb:6:in `new'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/text.rb:6:in `run'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract.rb:38:in `to_s'

I've been researching this pretty much all week and I've come to the conclusion that this is some sort of weird collision between the RTesseract and Commander gems. For whatever reason, when RTesseract::Command is initialized, this happens:

From: /home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/command.rb @ line 11 RTesseract::Command#initialize:

     7:     def initialize(source, output, options)
     8:       @source = source
     9:       @output = output
    10:       @options = options
 => 11: binding.pry
    12:       @full_command = [ options.command, @source, @output]
    13:     end

[1] pry(#<RTesseract::Command>)> options.command
ArgumentError: wrong number of arguments (given 0, expected 1)
from /home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/runner.rb:171:in `command'
[2] pry(#<RTesseract::Command>)> options
=> #<RTesseract::Configuration command="tesseract", debug_file="/dev/null">

Even though options is an RTesseract::Configuration object, it's somehow getting a callback or reference of some sort stuck on it from the Commander gem and thinks its a method. I have no idea how to clear this up. Right now I'm going to fork RTesseract and replace options.command with 'tesseract' as that's the only command I need, but I'd rather not have to maintain that long term.

Any ideas?
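
One plausible mechanism, consistent with the backtrace above but not confirmed in the issue, is that a command method mixed into Object is found before a configuration object gets the chance to resolve command dynamically. The sketch below reproduces that with stand-in classes; GlobalDsl and LazyConfig are hypothetical and are neither Commander's nor RTesseract's code:

```ruby
# Stand-in for a DSL that is mixed into Object, as a top-level DSL would be.
module GlobalDsl
  def command(name)
    "defining command #{name}"
  end
end
Object.include(GlobalDsl)

# Stand-in for a config object that resolves option names lazily.
class LazyConfig
  def initialize(values)
    @values = values
  end

  # Explicit lookup that no globally defined method can shadow.
  def [](key)
    @values.fetch(key)
  end

  def method_missing(name, *args)
    @values.key?(name) ? @values[name] : super
  end

  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end
end

config = LazyConfig.new(command: 'tesseract')

begin
  config.command
rescue ArgumentError => e
  puts e.message # the global method was called, not the config lookup
end

puts config[:command]
```

Commander's README describes a modular style (require 'commander' and include Commander::Methods in a class) that may avoid the global methods; whether that resolves this particular case is untested here.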

support for outputbase option

AFAICT there is no support for the outputbase option:

  tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

Is that correct?

Rails can't locate tesseract directory

I'm in a new rails 7 app running webpack. When I try to run

image_path = ActionController::Base.helpers.image_url('receipt.jpg')
image = RTesseract.new(image_path, lang: 'eng')
image.to_s

I keep getting the error

Errno::ENOENT: No such file or directory - tesseract

even though I have followed all the steps for setup. Any idea how to fix it?
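
An Errno::ENOENT that names tesseract usually points at the executable not being found on the PATH of the Ruby process, although the issue does not show the environment, so that is an assumption. A quick check can be sketched in plain Ruby (the helper name find_executable is hypothetical):

```ruby
# Returns the full path of an executable found on PATH, or nil.
def find_executable(name, path = ENV.fetch('PATH', ''))
  # On Windows PATHEXT lists suffixes such as .EXE; elsewhere it is unset.
  extensions = [''] + ENV.fetch('PATHEXT', '').split(';')
  path.split(File::PATH_SEPARATOR).each do |dir|
    extensions.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

location = find_executable('tesseract')
puts(location || 'tesseract is not on PATH for this process')
```

Running this inside the Rails console rather than a login shell shows the PATH the application actually sees. Separately, image_url returns a URL rather than a filesystem path, so the snippet in the question may need a local file path once the binary is found.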

avoid Ignoring hyphens

rtesseract is awesome! Thank you for the library.

Is there any option to not remove hyphens (e.g. ABC-DEF)?

`mini_magick` support is broken in `RTesseract.read`

TL;DR: RTesseract.read('foo.jpg', processor: 'mini_magick') results in cannot load such file -- RMagick

Because the hash method with the misleading name option actually deletes keys from the hash, the given options hash is not the same in lines six and nine.

So setting processor to mini_magick only works until the key gets deleted by the option method.
After that, RTesseract is initialized without the processor option and therefore falls back to RMagick.
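
The effect can be reproduced in isolation. In this sketch take_option mimics a helper that reads an option by deleting it, and read_option is a non-destructive alternative; both names are hypothetical and the real gem code is not shown here:

```ruby
# Mimics a helper that reads an option by deleting it from the hash.
def take_option(options, key, default)
  options.delete(key) || default
end

# Non-destructive variant: the caller's hash stays intact.
def read_option(options, key, default)
  options.fetch(key, default)
end

options = { processor: 'mini_magick' }

first = take_option(options, :processor, 'rmagick')
second = take_option(options, :processor, 'rmagick')
puts [first, second].inspect

options = { processor: 'mini_magick' }
puts [read_option(options, :processor, 'rmagick'),
      read_option(options, :processor, 'rmagick')].inspect
```

The second call to take_option sees a hash that no longer has the key, which matches the fallback to RMagick described above.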

Using class method read results in error

It appears that the example using RTesseract::read fails with the following error:

NameError: uninitialized constant RMagickProcessor::Magick
    from .../gems/rtesseract-1.2.0/lib/processors/rmagick.rb:21:in `read_with_processor'
    from .../gems/rtesseract-1.2.0/lib/rtesseract.rb:66:in `read'
    from (irb):2

Looking through the code, it appears the processor is not require'd (via RMagickProcessor::setup). The tests are passing since this only needs to happen once. This is taken care of by RTesseract#choose_processor! in the constructor but there is no such call in RTesseract::read

Unable to parse image at URL when run inside Docker container

RTesseract.new takes the path to the image to be processed as shown:

image = RTesseract.new('./img.png')
puts image.to_s

=> text found in the image

Now let's consider the image hosted at this url: https://via.placeholder.com/150

When run locally, RTesseract appears to be able to take the URL as an image path.

image = RTesseract.new('https://via.placeholder.com/150')
puts image.to_s

=> 150x 150

However, this does not work when run inside an Alpine Docker container.

Traceback (most recent call last):
	3: from main.rb:3:in `<main>'
	2: from /usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract.rb:41:in `to_s'
	1: from /usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract/text.rb:8:in `run'
/usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract/command.rb:57:in `run': Error, cannot read input file https://via.placeholder.com/150: No such file or directory (RTesseract::Error)
Error during processing.

I've set up a small project where all this can be seen here:
https://github.com/abardallis/tesseract-test.git

To be honest, I'm a little surprised that it works locally. Its failure in the Docker container makes more sense, given that passing a URL isn't mentioned in the rtesseract documentation.

However, it does work locally and I would really enjoy it if it could work on this docker container as well given that I'm hoping to deploy this thing and do not want to store images locally first in order to be able to run them through rtesseract (for reasons).

This might be an underlying issue in how I'm building my docker image and have nothing to do with rtesseract, but I figure this is a good place to start. If there's no way of getting the thing to work on docker, and you could at least provide some context on why this thing is working locally, that would at least help me sleep at night.

Thanks!

Use existing tiff

I'm extracting all the pages of a PDF as TIFFs thru the mini_magick gem & I'd like to feed each of these to rtesseract w/o having it unnecessarily re-generate new, temporary tiffs. Short of monkey patching your image method, is there any way to do this?
