dannnylo / rtesseract
Ruby library for working with the Tesseract OCR.
Home Page: http://rubygems.org/gems/rtesseract
License: MIT License
I'm trying to get the text from an image downloaded via URL:
url = "https://urlhere.com/image.png"
file = Tempfile.new(['image', File.extname(url)])
file.binmode
file.write open(url).read
file.flush
image = RTesseract.new(file)
image.to_s
This results in the following error:
TypeError: no implicit conversion of Tempfile into String
Not sure what I'm doing wrong here.
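The error message suggests that RTesseract.new wants a path (a String) rather than the Tempfile object itself. A minimal sketch of that idea follows; the helper names tempfile_for and ocr_url are hypothetical, and it assumes that handing over file.path is acceptable to the gem. It uses URI.open because a bare open(url) no longer fetches URLs on Ruby 3.

```ruby
require 'open-uri'
require 'tempfile'

# Hypothetical helper: copy raw image bytes into a Tempfile that keeps
# the original extension, then flush so other processes see the data.
def tempfile_for(bytes, extension)
  file = Tempfile.new(['image', extension])
  file.binmode
  file.write(bytes)
  file.flush
  file
end

# Hypothetical helper: download the image and hand RTesseract the
# tempfile's path (a String) rather than the Tempfile object itself.
def ocr_url(url)
  require 'rtesseract' # loaded lazily; assumes the gem is installed
  file = tempfile_for(URI.open(url, &:read), File.extname(URI(url).path))
  RTesseract.new(file.path).to_s
ensure
  file&.close! # remove the temporary file once OCR has finished
end
```

With this in place, ocr_url("https://urlhere.com/image.png") would stand in for the snippet above.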
I have an image I'm trying to OCR where the text is in two columns. As you can imagine, separating the text of Column A from Column B is rather difficult, so I was wondering how I might tell RTesseract to read just the left half of my image and then the right half.
I suppose I could use ImageMagick to split the image in two, then OCR both halves. It would be nice if I could tell Tesseract explicitly to use only a four-corner bounding box for the OCR and ignore everything else in the image. (That way I could eliminate the step of splitting the image in half and making two more images.)
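Whichever cropping step ends up being used (ImageMagick, or the gem's own crop support if your version has one), the rectangles for each column have to be computed first. The sketch below does only that arithmetic; column_boxes is a hypothetical helper, not part of rtesseract, and it assumes equal-width columns with no gutter.

```ruby
# Hypothetical helper: split an image of the given pixel size into
# equal-width column rectangles, ordered left to right.
def column_boxes(width, height, columns: 2)
  base = width / columns
  (0...columns).map do |i|
    # The last column absorbs any leftover pixels from integer division.
    w = i == columns - 1 ? width - base * i : base
    { x: base * i, y: 0, w: w, h: height }
  end
end

left, right = column_boxes(1001, 800)
puts left.inspect
puts right.inspect
```

Each hash can then be passed to the crop call of choice, so both halves are read from the same source image without writing two extra files by hand.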
Hello, you describe how rtesseract can convert a scanned image into a searchable PDF using this code:
image = RTesseract.new("my_image.jpg")
image.to_pdf # Getting the pdf path
image.to_s # Still can get the value only.
image.clean # to delete file once finished
I am already extracting the text into a PDF, but I don't know how to do exactly that: converting an image (actually a page from a scanned document that I have already converted to JPG) into a searchable PDF.
Is there any place where you have described this process in a little more detail? I am sorry, but I don't understand how you do it.
Also, is it possible to convert an entire PDF document this way, without having to split it into pages one by one?
Best regards and thanks in advance
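For the multi-page part of the question, one route that bypasses the gem is the tesseract command line itself: it accepts a text file listing one image per line and, with the pdf config, writes a single searchable PDF. The sketch below only prepares that list and the argument vector. searchable_pdf_command is a hypothetical helper, the page file names are placeholders, and the pages are assumed to have been rasterized to images already, since tesseract does not read PDF input.

```ruby
require 'tmpdir'

# Hypothetical helper: write one image path per line to a list file and
# return the argv for a single tesseract run that emits <output_base>.pdf.
def searchable_pdf_command(page_images, output_base, list_path)
  File.write(list_path, page_images.join("\n") + "\n")
  ['tesseract', list_path, output_base, 'pdf']
end

Dir.mktmpdir do |dir|
  cmd = searchable_pdf_command(
    %w[page1.jpg page2.jpg],          # placeholder page images
    File.join(dir, 'scan'),           # tesseract appends .pdf itself
    File.join(dir, 'pages.txt')
  )
  # system(*cmd) would run it; left out so the sketch has no side effects.
  puts cmd.join(' ')
end
```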
support tesseract 3
It seems like rtesseract currently doesn't support the option to specify an OCR engine mode.
I am currently using tesseract version 3.05.01, which has the following options:
OCR Engine modes:
0 Original Tesseract only.
1 Cube only.
2 Tesseract + cube.
3 Default, based on what is available.
I was wondering if there is any reason for not implementing such functionality. I'd be happy to try implementing it and opening a PR if needed.
Hello, is there any plan to support Tesseract's orientation detection?
Hey,
Thank you for your work on this great gem, I've already used it successfully in a couple of projects. But since I started to work on a new laptop a few days ago it stopped working...
Whenever I try to use old working code like
img = MiniMagick::Image.open(image_path)
str = RTesseract.new(img.path, processor: 'mini_magick').to_s
I get this error
w32/lib/Win32/Console/ANSI.rb:163:in `sub!': can't modify frozen String (RuntimeError)
I have everything installed correctly on my machine (Windows) : RailsInstaller, ImageMagick, Tesseract, and I am able to open and modify images with MiniMagick without any problem... Any help on this would be nice :)
Currently, RMagick will derive the TIFF compression from the source input file when write is called in image_to_tiff. See http://www.imagemagick.org/RMagick/doc/info.html#quality
If you try to get the text for a jpeg, you'll get the following error:
CompressionNotSupported `JPEG' @ error/tiff.c/WriteTIFFImage/2589
This can be fixed by allowing write to be called with the compress option. See http://www.imagemagick.org/RMagick/doc/info.html#compression. Here's an example:
require "RMagick"
module RMagickProcessor
  extend self

  def image_to_tiff
    tmp_file = Tempfile.new(["", ".tif"])
    cat = @instance || Magick::Image.read(@source.to_s).first
    cat.crop!(@x, @y, @w, @h) unless [@x, @y, @w, @h].compact == []
    # Force an uncompressed TIFF instead of inheriting the source's compression
    cat.write(tmp_file.path.to_s) { self.compression = Magick::NoCompression }
    return tmp_file
  end

  def read_with_processor(path)
    Magick::Image.read(path.to_s).first
  end

  def is_a_instance?(object)
    object.class == Magick::Image
  end
end
I can submit a patch for this if you want. Maybe allow a compression option on RTesseract.new that can be passed through to this image_to_tiff method?
I see that options can be set via a hash, but what about flags like -l best/eng -c preserve_interword_spaces=1?
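To make the question concrete, here is a sketch of how an options hash could be translated into those command-line flags. It is not the gem's implementation; tesseract_flags is a hypothetical helper, and the mapping (:lang to -l, every other key to -c key=value) is an assumption chosen to match the flags quoted above.

```ruby
# Hypothetical helper: turn an options hash into tesseract CLI arguments.
# :lang maps to -l; every other key becomes a -c key=value config variable.
def tesseract_flags(options)
  options.flat_map do |key, value|
    if key == :lang
      ['-l', value.to_s]
    else
      ['-c', "#{key}=#{value}"]
    end
  end
end

flags = tesseract_flags(lang: 'best/eng', preserve_interword_spaces: 1)
puts flags.join(' ')
```

Returning an array rather than a single string keeps each argument separate, which avoids shell quoting problems if the values are later passed to system or Open3.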
Hi - I am trying to convert my first image... and getting this debug output and a Conversion Error exception. If I run tesseract to convert the image from the command line, it works fine. Possibly I have a configuration issue? Thanks for any suggestions.
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/20150310-23757-162n079.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.
RTesseract::ConversionError: RTesseract::ConversionError
After further testing, it appears that this error comes from tesseract not being able to process .tif files.
tesseract sample.png stdout ** WORKS **
tesseract sample.tif stdout ** FAILS **
Adding --order rand to the .rspec file and running the test suite causes numerous failures. I believe that the "support default config processors" specs in rtesseract_spec.rb interfere with other tests; other sources of interference are possible as well. I discovered this while trying to add a feature to this gem: its tests passed when run on their own, but failed as part of the full suite.
One thing that might help address this would be a way to "reset" configuration options to their default; this could be run after each test in the suite. It would add some time to the test suite, but it's hard to see a better path to fixing this problem without a significant rewrite.
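The "reset" idea could look roughly like the sketch below. OcrConfig is a hypothetical stand-in for the gem's configuration, not its actual API, and the default values shown are placeholders.

```ruby
# Hypothetical stand-in for a gem-wide configuration object.
module OcrConfig
  DEFAULTS = { command: 'tesseract', processor: 'rmagick', lang: nil }.freeze

  class << self
    # Current settings, starting from a private copy of the defaults.
    def settings
      @settings ||= DEFAULTS.dup
    end

    def configure
      yield settings
    end

    # Throw away every override so the next reader sees the defaults.
    def reset!
      @settings = DEFAULTS.dup
    end
  end
end

# In spec_helper.rb this could be wired up as:
#   RSpec.configure { |c| c.after(:each) { OcrConfig.reset! } }
```

Because reset! only replaces one small hash, the per-example cost should be negligible, and specs that change the processor can no longer leak into later examples regardless of run order.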
Hello,
I'm trying to work with online files.
I tried to fetch a remote file into a tempfile so that rtesseract could read the words in it.
I tried using this code:
tmp_file = Tempfile.new(self.title)
open(fileUrl, 'r:UTF-8') do |url_file| # fileUrl is a string
  tmp_file.write(url_file.read)
end
tmp_file.rewind
begin
  RTesseract.new(self.title, command: 'tesseract_error', debug: true).to_s
rescue => e
  return e.inspect
end
The result is an RTesseract::ImageNotSelectedError
I don't know if it's because I call to_s in a method whose result is converted to JSON in a serializer, but when I return the image I get formatted JSON with the rmagick processor and a source.
Am I doing something wrong somewhere?
Thanks
The RMagick processor generates an image with a bpp (bits per pixel) higher than 32, which trips up Tesseract. I tried changing the RMagick processor to convert to PNG instead of TIF, and that solves the problem.
Error in pixReadFromTiffStream: can't handle bpp > 32
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading test.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.
Hi, I got this error. Please advise on how to solve this
irb(main):001:0> require 'rtesseract'
=> true
irb(main):002:0> RTesseract.new("imag.jpg")
=> #<RTesseract:0x007fceda3e6ce0 @command="tesseract", @lang="", @psm=nil, @processor=RMagickProcessor, @debug=false, @options_cmd=[], @clear_console_output=true, @options={}, @h=nil, @w=nil, @y=nil, @x=nil, @value="", @source=#<Pathname:imag.jpg>>
irb(main):003:0> s = _
=> #<RTesseract:0x007fceda3e6ce0 @command="tesseract", @lang="", @psm=nil, @processor=RMagickProcessor, @debug=false, @options_cmd=[], @clear_console_output=true, @options={}, @h=nil, @w=nil, @y=nil, @x=nil, @value="", @source=#<Pathname:imag.jpg>>
irb(main):004:0> s.to_s
RTesseract::ConversionError: RTesseract::ConversionError
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:204:in `rescue in convert'
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:199:in `convert'
from /Users/jalenong/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/rtesseract-1.3.0/lib/rtesseract.rb:225:in `to_s'
from (irb):4
from /Users/jalenong/.rbenv/versions/2.2.2/bin/irb:11:in `<main>'
I use rtesseract, but I get nothing back and no error is output.
It may be happening in rtesseract at line 139:
`#{@command} "#{tmp_image.path}" "#{path.gsub(".txt","")}" #{lang} #{psm} #{config_file} #{clear_console_output}`
uploaded_io = params[:picture]
File.open(Rails.root.join('public', 'uploads', uploaded_io.original_filename), 'wb') do |file|
  file.write(uploaded_io.read)
end
dl = RTesseract.new(Rails.root.join('public', 'uploads', uploaded_io.original_filename).to_s)
@string = dl.to_s
Only once I've deployed to my development server does it all break, returning the error. The files are being copied across to the public/uploads folder correctly, and they are readable, as tested by running the tesseract command outside of Ruby on the same file.
The result when running the app is RTesseract::ConversionError on the dl.to_s action
Unsure on what I'm missing..
These two snippets
image = RTesseract.read(image_path) do |img|
  img = img.scale 1.5
end
image = RTesseract.read(image_path) do |img|
  img = img.scale 2
end
will have the same result. The image isn't changed and won't be touched: reassigning img inside the block has no effect on the image outside the block, so no scaling happens.
The example is analogous to the one in the README.
For scaling you should use the exclamation-mark methods, which change the instance itself.
So it should be:
image = RTesseract.read(image_path) do |img|
  img.scale! 2
end
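The underlying Ruby behaviour can be reproduced without any image library. In the sketch below, FakeImage and read_image are hypothetical stand-ins, and it assumes, as described above, that the reader ignores the block's return value.

```ruby
# Hypothetical stand-in for an image object with both scaling styles.
class FakeImage
  attr_reader :width

  def initialize(width)
    @width = width
  end

  # Non-destructive: returns a new object and leaves the receiver alone.
  def scale(factor)
    FakeImage.new(width * factor)
  end

  # Destructive: changes the receiver in place.
  def scale!(factor)
    @width *= factor
    self
  end
end

# Hypothetical reader that yields the image and ignores the block's result.
def read_image(image)
  yield image
  image
end

unchanged = read_image(FakeImage.new(100)) { |img| img = img.scale(2) }
scaled    = read_image(FakeImage.new(100)) { |img| img.scale!(2) }
puts unchanged.width
puts scaled.width
```

The first call leaves the width at 100 because img = ... only rebinds the block's local variable; the second call reports 200 because scale! mutates the object the reader still holds.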
I'd like to add support for reporting the bounding box of each word, similar to this API: https://github.com/meh/ruby-tesseract-ocr
I'm just adding this for record keeping and plan on working on it. I didn't miss anything, this sort of functionality doesn't exist in this gem right?
Hello!
I just ran the tests locally on my monolingual computer. Because I don't have the português dictionary for tesseract installed, one of the tests fails:
RTesseract.new(@image_tiff,{:lang=>"por"}).to_s_without_spaces.should eql("43ZZ")
Not sure whether the fix would be to install português, or to disable this test if the test runner doesn't have the proper dictionary.
What are your thoughts?
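One way to handle the second option is to ask tesseract which languages it has and skip the example when por is missing. The sketch below assumes the usual --list-langs output of a header line ending in a colon followed by one language code per line; both helper names are hypothetical.

```ruby
# Hypothetical helper: extract language codes from the text printed by
# `tesseract --list-langs` (a header line followed by one code per line).
def parse_language_list(output)
  output.lines.map(&:strip).reject { |line| line.empty? || line.end_with?(':') }
end

# Hypothetical helper: ask the local tesseract which languages it has.
# Some versions print the list to stderr, so both streams are captured.
def installed_languages
  require 'open3'
  out, _status = Open3.capture2e('tesseract', '--list-langs')
  parse_language_list(out)
rescue Errno::ENOENT
  [] # tesseract itself is not installed
end

# In a spec this could guard the Portuguese example:
#   skip 'por traineddata not installed' unless installed_languages.include?('por')
```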
I have a PNG picture. The captcha is the number '8706', but I got '8705'. Sometimes, where I expect '0', it outputs 'u'.
I am basically a beginner in programming and am developing a web application which converts an image into text. I have used Orcad for this, but it only handles simple text and is creating many issues. I want to know whether this library would be helpful to me: I want to give the user an interface in which they can upload a file in formats like PNG, JPEG, etc., and have the web app convert it into text. I am basically learning Node.js for this. Kindly guide me on which is better and easier to learn, particularly for this project.
I am getting the following when running RTesseract on Heroku
RTesseract::ConversionError (No such file or directory @ rb_sysopen - /tmp/1489639277.1960108205.txt)
The numbers for the file in tmp change, but I only get this on Heroku. Everything is working fine on localhost.
Hey there, I need to implement a timeout for a long-running Tesseract command.
I came up with two options for how to do it:
- RTesseract.new and reimplement Command#run using Open3.popen3 instead of Open3.capture3, and catch the timeout there (if set)
- async option to RTesseract.new and implement some run_async and results methods, also using Open3.popen3, which would return the PID so that the timeout (killing the process) can be handled in the client code.
What do you think? Should I try to open a PR? Thanks!
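The first option might be sketched as follows, independent of the gem's internals. run_with_timeout and CommandTimeout are hypothetical names, and the design (drain both pipes in threads, wait on the process thread with a deadline, kill on expiry) is one possible approach rather than what rtesseract does.

```ruby
require 'open3'

# Hypothetical error raised when the child process runs too long.
class CommandTimeout < StandardError; end

# Run a command, returning [stdout, stderr, status], or kill it and raise
# CommandTimeout once `timeout` seconds have passed.
def run_with_timeout(*cmd, timeout:)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Drain both pipes in the background so a chatty child cannot block.
    readers = [stdout, stderr].map { |io| Thread.new { io.read } }

    # Thread#join returns nil when the deadline passes first.
    unless wait_thr.join(timeout)
      Process.kill('KILL', wait_thr.pid)
      readers.each(&:kill)
      raise CommandTimeout, "gave up after #{timeout}s: #{cmd.join(' ')}"
    end

    [*readers.map(&:value), wait_thr.value]
  end
end
```

A call such as run_with_timeout('tesseract', 'page.png', 'stdout', timeout: 30) would then either return the recognized text on stdout or raise after 30 seconds. The second option in the list would instead return wait_thr.pid to the caller and leave the killing to client code.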
Is there a way to supply a data file instead of english? I'm trying to do digital displays. I have the data file though.
I can successfully use RTesseract in this way:
filename = 'path/to/foo.jpg'
image = RTesseract.new(filename)
puts image.to_s
When I use .read to try and transform the image exactly as in the example in the README, though...
filename = 'path/to/foo.jpg'
image = RTesseract.read(filename) do |img|
  img = img.white_threshold(245)
  img = img.quantize(256,Magick::GRAYColorspace)
end
puts image.to_s
I get .../gems/rtesseract-1.3.0/lib/rtesseract.rb:228:in `to_s': RTesseract::ImageNotSelectedError (RTesseract::ImageNotSelectedError)
I get that same error even if I merely do
image = RTesseract.read(filename) do |img| end
puts image.to_s
Hello, first of all thanks for your work!
I got stuck on version 2.0 of the gem, which is the latest on rubygems.org. I am experiencing a file namespace clash due to require 'utils' in rtesseract.rb: Rails has already require'd a utils.rb before it processed rtesseract.rb, and does not do it twice.
Just before starting a pull request, I noticed you have already amended this. Can you please update rubygems with your latest code? Thanks again :)
Is there any way to use another config file?
If yes, it would be nice to see it documented.
Thank you very much!
Hello, I have an issue with read, when doing:
ocr = RTesseract.read("my_image.jpg") do |img|
img
end
this fails in the from_blob method with the error Encoding::UndefinedConversionError.
Now, the same image goes fine through RTesseract when doing
ocr = RTesseract.new("my_image.jpg")
ocr.to_s
so the file is fine. Any setting I should set to fix this?
When I process some (not all) images, then I see Tesseract(?) errors appended to the result text:
Error in boxClipToRectangle: box outside rectangle\nError in pixScanForForeground: invalid box\nCreated On: 10 March 2015\nRepository: Frost\n\nEntomological Museum (PSUC)\n\n \n\f
OS X 10.12.6
rvm Ruby 2.5.1
rtesseract 3.0.2
rmagick 2.16.0
Can supply the image if needed.
Intermittently, I encounter an error where the output of the tesseract command in RTesseract.convert does not produce the txt file in a timely fashion (or possibly at all).
The error is:
No such file or directory - /tmp/20140108-25243-1mgiinb.txt
2014-01-08T19:59:21Z 25243 TID-121zho WARN: /home/deployer/apps/my_app/shared/bundle/ruby/1.9.1/bundler/gems/rtesseract-5653a4485fe9/lib/rtesseract.rb:140:in `read'
This error is pretty common when I am processing multiple images concurrently (using Sidekiq) and the RMagick processor. It seems to be a lot less frequent when I switch over to the MiniMagick processor, but still occurs.
Got a basic script to get all words from an image, but for some reason the to_box method returns an empty array. The to_s method returns the string from the image so OCR is working.
#scan.rb
require 'bundler/inline'
gemfile do
source 'https://rubygems.org'
gem 'rtesseract'
end
x = RTesseract.new('path/to/image.png', lang: 'nld', classify_enable_learning: '0', psm: '6', tessedit_char_blacklist: ': =').to_box
puts x.inspect
Based on this comment I see that @jessethebuilder prefers this Heroku buildpack.
Is there a common consensus by folks using your gem on what Heroku buildpack is the best for running tesseract OCR in heroku?
Hello,
How do I pass the parameter -psm 1?
Heya. I've been trying to implement RTesseract in my app and it's been going great in testing so far, thanks for making it! However, once I started running my scripts from the command line, I've started running into a really strange problem when calling RTesseract.new(path):
wrong number of arguments (given 0, expected 1)
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/runner.rb:171:in `command'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/delegates.rb:16:in `command'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/command.rb:11:in `initialize'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/text.rb:6:in `new'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/text.rb:6:in `run'
/home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract.rb:38:in `to_s'
I've been researching this pretty much all week and I've come to the conclusion that this is some sort of weird collision between the RTesseract and Commander gems. For whatever reason, when RTesseract::Command is initialized, this happens:
From: /home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/rtesseract-3.0.2/lib/rtesseract/command.rb @ line 11 RTesseract::Command#initialize:
7: def initialize(source, output, options)
8: @source = source
9: @output = output
10: @options = options
=> 11: binding.pry
12: @full_command = [ options.command, @source, @output]
13: end
[1] pry(#<RTesseract::Command>)> options.command
ArgumentError: wrong number of arguments (given 0, expected 1)
from /home/libras/.rbenv/versions/2.4.5/lib/ruby/gems/2.4.0/gems/commander-4.4.7/lib/commander/runner.rb:171:in `command'
[2] pry(#<RTesseract::Command>)> options
=> #<RTesseract::Configuration command="tesseract", debug_file="/dev/null">
Even though options is an RTesseract::Configuration object, it's somehow getting a callback or reference of some sort stuck on it from the Commander gem and thinks it's a method. I have no idea how to clear this up. Right now I'm going to fork RTesseract and replace options.command with 'tesseract', as that's the only command I need, but I'd rather not have to maintain that long term.
Any ideas?
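The symptom can be reproduced in plain Ruby under one assumption: that the configuration object resolves attributes lazily through method_missing while another library defines a global command method that takes one argument. In the sketch below, LazyConfig is a hypothetical stand-in for RTesseract::Configuration, and the Kernel patch is a stand-in for Commander's globally delegated methods; neither is copied from the real gems.

```ruby
# Hypothetical stand-in for a configuration object that resolves
# attribute reads lazily through method_missing.
class LazyConfig
  def initialize(values)
    @values = values
  end

  # Explicit lookup that never relies on dispatching by attribute name.
  def [](key)
    @values.fetch(key)
  end

  def method_missing(name, *args)
    @values.key?(name) ? @values[name] : super
  end

  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end
end

config = LazyConfig.new(command: 'tesseract')
before = config.command # resolved by method_missing

# Simulate another library adding a global, one-argument `command` method.
module Kernel
  def command(name)
    "global command #{name}"
  end
end

after = begin
  config.command # now found on Kernel before method_missing is consulted
rescue ArgumentError => e
  e.message
end

puts before
puts after
puts config[:command]
```

If that is what is happening, a lookup that does not go through a method named command (the [] accessor here) sidesteps the clash, as would loading Commander in a way that does not add its methods globally.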
AFAICT there is no support for the outputbase option:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
Is that correct?
I'm in a new rails 7 app running webpack. When I try to run
image_path = ActionController::Base.helpers.image_url('receipt.jpg')
image = RTesseract.new(image_path, lang: 'eng')
image.to_s
I keep getting the error
Errno::ENOENT: No such file or directory - tesseract
even though I have followed all the steps for setup. Any idea how to fix it?
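Errno::ENOENT with "tesseract" as the missing file usually means the tesseract executable could not be found on the PATH of the Ruby process, which can differ from the PATH of an interactive shell. The sketch below checks that from inside Ruby; find_executable is a hypothetical helper, not part of rtesseract.

```ruby
# Hypothetical helper: look for an executable on a PATH-style string and
# return its full path, or nil when it cannot be found.
def find_executable(name, path = ENV.fetch('PATH', ''))
  # On Windows, PATHEXT lists suffixes such as .EXE; elsewhere it is unset.
  extensions = ENV.fetch('PATHEXT', '').split(';').unshift('')
  path.split(File::PATH_SEPARATOR).each do |dir|
    extensions.each do |ext|
      candidate = File.join(dir, name + ext)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

location = find_executable('tesseract')
puts(location ? "tesseract found at #{location}" : 'tesseract is not on PATH for this process')
```

Running this from a Rails console shows whether the server process can see the binary at all. Separately, image_url returns a URL rather than a filesystem path, so the argument passed to RTesseract.new may also need to be a local file path.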
rtesseract is awesome! Thank you for the library.
Is there any option to not remove hyphens (e.g. ABC-DEF)?
TL;DR: RTesseract.read('foo.jpg', processor: 'mini_magick') results in cannot load such file -- RMagick
As the hash method with the misleading name option actually deletes keys from hashes, the given options hash won't be the same in lines six and nine.
So setting processor to mini_magick will only work until it gets deleted by the option method.
After that, RTesseract will be initialized without the processor option and therefore fall back to RMagick.
It appears that the example using RTesseract::read fails with the following error:
NameError: uninitialized constant RMagickProcessor::Magick
from .../gems/rtesseract-1.2.0/lib/processors/rmagick.rb:21:in `read_with_processor'
from .../gems/rtesseract-1.2.0/lib/rtesseract.rb:66:in `read'
from (irb):2
Looking through the code, it appears the processor is not require'd (via RMagickProcessor::setup). The tests are passing since this only needs to happen once. This is taken care of by RTesseract#choose_processor! in the constructor, but there is no such call in RTesseract::read.
Does it support tesseract-3.04?
RTesseract.new takes the path to the image to be processed as shown:
image = RTesseract.new('./img.png')
puts image.to_s
=> text found in the image
Now let's consider the image hosted at this url: https://via.placeholder.com/150
When run locally, RTesseract appears to be able to take the URL as an image path.
image = RTesseract.new('https://via.placeholder.com/150')
puts image.to_s
=> 150x 150
However, this does not work when run inside an Alpine Docker container.
Traceback (most recent call last):
3: from main.rb:3:in `<main>'
2: from /usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract.rb:41:in `to_s'
1: from /usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract/text.rb:8:in `run'
/usr/lib/ruby/gems/2.6.0/gems/rtesseract-3.1.2/lib/rtesseract/command.rb:57:in `run': Error, cannot read input file https://via.placeholder.com/150: No such file or directory (RTesseract::Error)
Error during processing.
I've set up a small project where all this can be seen here:
https://github.com/abardallis/tesseract-test.git
To be honest, I'm a little surprised that it works locally; the fact that it doesn't work in the Docker container makes a little more sense, seeing as passing in a URL isn't mentioned in the documentation for rtesseract.
However, it does work locally, and I would really like it to work in this Docker container as well, given that I'm hoping to deploy this thing and do not want to store images locally first in order to run them through rtesseract (for reasons).
This might be an underlying issue in how I'm building my Docker image and have nothing to do with rtesseract, but I figure this is a good place to start. If there's no way of getting it to work on Docker, and you could at least provide some context on why it works locally, that would at least help me sleep at night.
Thanks!
I'm extracting all the pages of a PDF as TIFFs thru the mini_magick gem & I'd like to feed each of these to rtesseract w/o having it unnecessarily re-generate new, temporary tiffs. Short of monkey patching your image method, is there any way to do this?
When parsing the same file as a PDF instead of a JPG, I got far worse results. Is there an obvious reason for this difference?