Comments (28)
Tesseract will probably be able to do this in the future: tesseract-ocr/tesseract#728. If Tesseract's recognition process can pick the right "dehyphenation" rules on a per-language basis, that's all we need.
Otherwise, removing hyphens on the dpScreenOCR side will require either a library for natural language processing or at least a spell checking library. In either case, the task is not trivial, since the recognized text can contain fragments in different languages. It will also require users to install extra data in addition to Tesseract languages.
Processing on the Tesseract side would definitely be the best solution, so I'd rather wait for tesseract-ocr/tesseract#728 for a while (although the issue is more than 5.5 years old :)
from dpscreenocr.
Thank you for your interesting feedback.
Your two proposals would lead to a perfect solution.
However, I think we would have to wait another 5 years for Tesseract. And your suggestion for the dpScreenOCR side is very complex (but would be perfect).
Question:
Couldn't we implement the whole thing experimentally and "imperfectly" on the dpScreenOCR side by simply replacing certain characters?
These are rules for Linux Bash. Maybe you can do something similar in C:
------------------------------------------------------------------------------------------
Rule 1: Replace "{a..z}-\n{a..z}" with ""
Rule 2: Replace "{a..z}\n{a..z}" with ""
(Rule 3: Do not replace "{a..z}-\n{A..Z}")
Then do the same with:
Hyphen U+2010
Breaking Hyphen U+2011
Figure Dash U+2012
EN Dash U+2013
EM Dash U+2014
Horizontal Bar U+2015
Hyphen Bullet U+2043
I think this would be very easy to implement. Of course, it would not be perfect, but it would save me a lot of post-processing work on the captured OCR text.
And of course this would not work for all languages (Arabic, Chinese, Russian), but it would be enough for European, American, Australian, and African(?) languages.
Of course, there must be a way to switch this function on or off as an option in the GUI.
Feedback is appreciated. Thank you.
from dpscreenocr.
Unfortunately, the "naive" algorithm will not work in most cases, removing hyphens when they should be kept, e.g., "twentieth-century music".
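For illustration, here is a minimal Python sketch of the naive rule and of the failure mentioned above. The names `DASHES`, `NAIVE_RULE`, and `naive_dehyphenate` are made up for this example, and only ASCII lowercase letters are matched, mirroring the `{a..z}` notation of the proposal:

```python
import re

# Dash characters listed in the proposal, plus the ASCII hyphen.
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2015\u2043"

# Lowercase letter, dash, line break, lowercase letter.
NAIVE_RULE = re.compile(r"([a-z])[" + re.escape(DASHES) + r"]\n([a-z])")


def naive_dehyphenate(text):
    """Join words split by a dash at a line end, without any dictionary."""
    return NAIVE_RULE.sub(r"\1\2", text)


# A hyphenated line break is joined as intended ...
print(naive_dehyphenate("beau-\ncoup"))
# ... but a hyphen that belongs to the word is lost as well.
print(naive_dehyphenate("twentieth-\ncentury music"))
```

The second call produces "twentiethcentury music", which is the kind of result the naive approach cannot avoid.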
If you don't mind this kind of de-hyphenation, you can do it in a script executed via the "Run executable" action. In fact, this way it's easy to implement the proper algorithm, which removes a hyphen only if the de-hyphenated word is in a list of valid words stored in a file. For French, you can download such a list here:
https://salsa.debian.org/gpernot/wfrench/-/blob/master/french
On Unix-like systems, you can also install this file (as /usr/share/dict/french) via the package manager. For example, this is the "wfrench" package on Ubuntu.
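A rough sketch of the word-list idea in Python, under a few assumptions: the list has one word per line, the file is UTF-8 encoded (this may not hold for every dictionary package), and only the ASCII hyphen is handled. The function names are hypothetical:

```python
import re


def load_words(path):
    """Read a one-word-per-line list into a set (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


# A word, an ASCII hyphen at the end of the line, and the next word.
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def dehyphenate_with_wordlist(text, words):
    """Remove a line-end hyphen only if the joined word is in `words`."""
    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in words:
            return joined
        # Unknown word: keep the hyphen and the line break as they are.
        return match.group(0)

    return HYPHEN_BREAK.sub(join_if_known, text)
```

It could be used as `dehyphenate_with_wordlist(text, load_words("/usr/share/dict/french"))`. Words missing from the list are simply left untouched.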
from dpscreenocr.
Thank you.
Yes, I will try to write a script for the naive algorithm. (I am not skilled enough to do the "wfrench" version.)
But it looks like the argument that dpScreenOCR passes in $1 contains no '\n'. This means I cannot replace "{a..z}\n{a..z}".
This is what my script looks like:
#!/bin/bash
ersetzt="${1//'\n'/'eeeeeeee'}"
echo "$ersetzt" > "~/MyPath/ScreenOCR.txt"
Content of the file ScreenOCR.txt:
célèbre = berühmt
à la campagne = auf dem Lande
des promenades au bord de la Seine = Spaziergänge am Seine-Ufer
lire un bon livre = ein gutes Buch lesen
le rôle principal = die Hauptrolle
--> Confused: it looks like no '\n' has been replaced, but the line breaks are still there. (?) Very strange.
from dpscreenocr.
I'm not skilled enough in Bash, so here is a simple Python script that unwraps paragraphs using Aspell for spell checking. You will need to install the Aspell dictionary for your language (e.g. the aspell-fr package for French on Ubuntu) and set ASPELL_LANG (it will be passed to Aspell as the --lang option).
The script works not only with the ASCII hyphen, but also with other kinds of dashes (en dash, em dash, etc.).
You may want to remove the second call to is_valid_word(), so that in case of ambiguity the script prefers the word without the hyphen. This is probably the right thing to do in the general case, e.g. you don't want "car-pet" instead of "carpet".
#!/usr/bin/env python3

import datetime
import os
import subprocess
import sys
import unicodedata


ASPELL_LANG = 'fr'
APPEND_TO_FILE = os.path.expanduser("~/ocr_history.txt")


def is_dash(c):
    # 'Pd' is the Unicode "Punctuation, dash" category.
    return unicodedata.category(c) == 'Pd'


def is_valid_word(word):
    with subprocess.Popen(
            ('aspell',
             '-a',
             '--lang=' + ASPELL_LANG,
             '--dont-suggest'),
            stdout=subprocess.PIPE,
            stdin=subprocess.PIPE,
            universal_newlines=True) as p:
        # ! to enter the terse mode (don't print * for correct words).
        # ^ to spell check the rest of the line.
        aspell_out = p.communicate(input='!\n^' + word)[0]

    # We use this function to check words both with and without
    # dashes. For a word with a dash, Aspell checks each
    # dash-separated part as an individual word.
    #
    # If all words are correct in the terse mode, the output will be
    # a version info and an empty line.
    return aspell_out.count('\n') == 2


def unwrap_paragraphs(text, out_f):
    para = ''

    for line in text.splitlines():
        if not line:
            # Empty line is a paragraph separator
            if para:
                out_f.write(para)
                out_f.write('\n')
                para = ''

            out_f.write('\n')
            continue

        if not para:
            para = line
            continue

        if not is_dash(para[-1]):
            para += ' '
            para += line
            continue

        para_rpartition = para.rpartition(' ')
        para_last_word = para_rpartition[2]

        line_lpartition = line.partition(' ')
        line_first_word = line_lpartition[0]

        word_with_dash = para_last_word + line_first_word
        word_without_dash = para_last_word[:-1] + line_first_word

        if (is_valid_word(word_without_dash)
                # If the word is valid both with and without the
                # dash, keep the dashed variant.
                and not is_valid_word(word_with_dash)):
            para = (para_rpartition[0]
                    + para_rpartition[1]
                    + word_without_dash
                    + line_lpartition[1]
                    + line_lpartition[2])
        else:
            para += line

    if para:
        out_f.write(para)


if __name__ == '__main__':
    with open(APPEND_TO_FILE, 'a', encoding='utf-8') as out_f:
        out_f.write(
            '=== {} ===\n\n'.format(
                datetime.datetime.now().strftime(
                    "%Y-%m-%d %H:%M:%S")))
        unwrap_paragraphs(sys.argv[1], out_f)
        out_f.write('\n\n')
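The ambiguity handling described before the script (keep "car-pet" or prefer "carpet") can be isolated into a small function to make the two behaviours easier to compare. This is only a sketch: `pick_word` and its `prefer_dashed` flag are hypothetical names, and the spell checker is passed in as `is_valid` so that the example runs without Aspell:

```python
def pick_word(with_dash, without_dash, is_valid, prefer_dashed=True):
    """Choose between the dashed and the joined variant of a word.

    With prefer_dashed=True this mirrors the script above: the dash is
    removed only if the joined word is valid and the dashed one is not.
    With prefer_dashed=False the second check is skipped, so a valid
    joined word always wins.
    """
    if not is_valid(without_dash):
        return with_dash
    if prefer_dashed and is_valid(with_dash):
        return with_dash
    return without_dash


# Stand-in for the spell checker: every dash-separated part must be known.
KNOWN = {"car", "pet", "carpet"}


def is_valid(word):
    return all(part in KNOWN for part in word.split("-"))


print(pick_word("car-pet", "carpet", is_valid))
print(pick_word("car-pet", "carpet", is_valid, prefer_dashed=False))
```

The first call keeps "car-pet" and the second returns "carpet", which corresponds to removing the second is_valid_word() call.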
from dpscreenocr.
Thank you very much for your script. :-)
I appreciate it.
I have made the file "dpScreenOCRPython.py" with the content of this script and have added the path to it into the "action" tab.
I have found the output in ~/ocr_history.txt.
It works more or less.
Why only "more or less" ?
It looks like Tesseract sometimes thinks that there are two line breaks ('\n\n') where there is only one.
Result from tesseract:
Le centre commercial
Au centre commercial, on trouve sous un même
toit? beaucoup de magasins de détail et
de services (banque, poste, restaurant, etc.).
Les clients stressés ne doivent plus aller
d'un magasin à l'autre pour faire leurs courses.
Les centres commerciaux se trouvent à la péri-
phérie des villes. On y va donc en voiture et
on gare sa voiture dans les grands parkings.
Il y a des familles qui passent toute la journée
du samedi dans les centres commerciaux.
And of course your Python script converts this into:
Le centre commercial
Au centre commercial, on trouve sous un même toit? beaucoup de magasins de détail et
de services (banque, poste, restaurant, etc.). Les clients stressés ne doivent plus aller
d'un magasin à l'autre pour faire leurs courses. Les centres commerciaux se trouvent à la périphérie des villes. On y va donc en voiture et
on gare sa voiture dans les grands parkings.
Il y a des familles qui passent toute la journée du samedi dans les centres commerciaux.
So it looks like it is not enough for the Python script to handle only '\n'. It should also handle '\n\n'.
Second:
I would prefer that the script not create the "ocr_history.txt" file, but put the output directly on the clipboard instead.
I will not use the action options ...
- copy text to clipboard and
- run a program
... at the same time, so there will be no conflict.
from dpscreenocr.
To copy text to the clipboard, you can use xsel or xclip. If you're not familiar with Python, it would be easier for you to replace the last block in the script (the one that starts with if __name__ == '__main__':) with the following:
if __name__ == '__main__':
    unwrap_paragraphs(sys.argv[1], sys.stdout)
This way, the script will print to standard output instead of a file, so you will be able to invoke it from a Bash script and then call xsel/xclip, like:
#!/bin/bash
TEXT=$(~/dpScreenOCRPython.py "$1")
xsel --clipboard <<< "$TEXT"
Unfortunately, removing empty lines will unconditionally join all paragraphs. This is something that should be done on the Tesseract side; they already have an issue on the tracker: tesseract-ocr/tesseract#2155. If you don't mind removing all empty lines, you can do it with TEXT=$(sed '/^$/d' <<< "$1") before calling the Python script. Alternatively, here is a slightly more sophisticated Python script that removes an empty line only if the next line is also empty or starts with a lower-case character:
#!/usr/bin/env python3

import sys

lines = sys.argv[1].splitlines()
for i, line in enumerate(lines):
    # Skip an empty line if the next line is empty or starts with a
    # lower-case character.
    if (not line
            and i + 1 < len(lines)
            and (not lines[i + 1]
                 or lines[i + 1][0].islower())):
        continue

    print(line)
You can combine both scripts like:
#!/bin/bash
TEXT=$(~/remove_empty_lines.py "$1")
TEXT=$(~/dpScreenOCRPython.py "$TEXT")
xsel --clipboard <<< "$TEXT"
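If you would rather avoid the extra Bash wrapper, the clipboard step could also be done from Python. The sketch below assumes xsel is installed; `copy_to_clipboard` is a hypothetical helper name, and the command is a parameter so that another tool can be substituted:

```python
import subprocess


def copy_to_clipboard(text, command=("xsel", "--clipboard", "--input")):
    """Send `text` to the standard input of a clipboard tool.

    The default command assumes xsel is available; any program that
    reads the text from standard input can be passed instead.
    """
    subprocess.run(command, input=text, universal_newlines=True,
                   check=True)
```

With this helper, the last block of the Python script could collect the result in an io.StringIO object and pass its contents to copy_to_clipboard() instead of printing it.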
from dpscreenocr.
Thank you very much. That's great stuff.
I think this is good enough for my purpose (translating from French into German with DeepL).
from dpscreenocr.
Follow up:
Your original Python script (#23 (comment)) does two things:
- it replaces things like:
beau-
coup
into
beaucoup
- it replaces things like:
les
championnats
into
les championnats
Actually, I would now prefer a script that only does the
beau-
coup
replacement.
Am I right that your original Python script has separate sections for these two tasks? If yes, which section does what?
Thank you.
from dpscreenocr.
In the block that starts with if not is_dash(para[-1]):, replace para += ' ' with para += '\n'.
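As an alternative illustration, here is a standalone sketch that only joins words hyphenated across a line break and leaves all other line breaks alone. It is not the script above: `join_hyphenated_breaks` is a hypothetical name, the spell checker is passed in as `is_valid`, and the joined word is moved to the end of the previous line while the rest of the line stays where it was:

```python
import unicodedata


def is_dash(c):
    return unicodedata.category(c) == 'Pd'


def join_hyphenated_breaks(text, is_valid):
    """Join words split by a dash at a line end; keep other breaks."""
    out = []
    for line in text.split('\n'):
        if out and out[-1] and line and is_dash(out[-1][-1]):
            head, sep, last_word = out[-1].rpartition(' ')
            first_word, _, rest = line.partition(' ')
            joined = last_word[:-1] + first_word
            if is_valid(joined):
                # Put the joined word on the previous line and keep
                # the remainder of the current line on its own line.
                out[-1] = head + sep + joined
                if rest:
                    out.append(rest)
                continue
        out.append(line)
    return '\n'.join(out)
```

For example, with a validator that accepts "beaucoup", the text "beau-\ncoup de\nles\nchampionnats" becomes "beaucoup\nde\nles\nchampionnats".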
from dpscreenocr.
Thank you.
from dpscreenocr.
Is there any way to use it without aspell?
from dpscreenocr.
You can replace aspell with another spell checker (e.g. hunspell), but without a spell checker the script will be useless since there will be no way to tell if a word without a hyphen is correct.
from dpscreenocr.
@danpla okay, but there are too many scripts. Which one should dpScreenOCR execute, the Bash one?
Weirdly, it did not work for me, so I tried to debug it:
Traceback (most recent call last):
  File "/home/tbb/dpScreenOCRPython.py", line 94, in <module>
    unwrap_paragraphs(sys.argv[1], out_f)
IndexError: list index out of range
from dpscreenocr.
It looks like you called the script without an argument. dpScreenOCRPython.py "some text" should work. The script is meant to be used with the "Run executable" action, in which case the argument (the recognized text) will be passed by dpScreenOCR.
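To get a clearer message than the IndexError above when the argument is missing, a small guard could be added. This is only a sketch, and `get_text` is a hypothetical helper that is not part of the original script:

```python
def get_text(argv):
    """Return the text argument, or exit with a usage message."""
    if len(argv) < 2:
        raise SystemExit('usage: {} TEXT'.format(argv[0]))
    return argv[1]
```

The main block would then call unwrap_paragraphs(get_text(sys.argv), out_f) instead of reading sys.argv[1] directly.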
from dpscreenocr.
I already do it that way, but it doesn't work. I want the text fixed by dpScreenOCRPython.py on my clipboard, but I guess I have to use the Bash script to achieve that. I don't know what to do. Can you give instructions for someone who doesn't know any coding?
from dpscreenocr.
@danpla I made it work somehow, I don't know how. Is there any way to turn "slee-py" into "sleepy", I mean when the "-" is in the middle of a word?
from dpscreenocr.
It should work automatically if you set English by changing ASPELL_LANG = 'fr' to ASPELL_LANG = 'en' in the script.
from dpscreenocr.
@danpla you should add it as a feature to dpScreenOCR though. Sometimes it does not work at all, which is weird. Thanks anyway.
from dpscreenocr.
@danpla it works in the terminal (dpscreenocrpy), but not with the "Run executable" option. Should I enable other options (copy text to clipboard, add text to history)?
from dpscreenocr.
By default, the Python script appends text to the ocr_history.txt file in your home directory. If you want the text to be copied to the clipboard instead, see #23 (comment).
from dpscreenocr.
@danpla but I don't understand: should I execute the Bash script or the Python script to get this working in dpScreenOCR, since dpScreenOCR can't execute multiple things?
from dpscreenocr.
You should use the Bash script (the piece of code that starts with #!/bin/bash in #23 (comment)) with "Run executable". This script will, in turn, get the text from the Python script and then send it to the clipboard using the xsel utility.
You will need to disable the "Copy text to clipboard" action, since otherwise it will overwrite the clipboard text set by xsel.
from dpscreenocr.
@danpla thanks for the help. Can you make it work for examples like these?
It works for slee- py
but not for slee -py
from dpscreenocr.
@danpla
Sorry if this is a silly question: should I turn the "Split Text Blocks" feature on or off for this script to work better?
from dpscreenocr.
This option has no effect on how the script works. But if you're capturing several columns of text at once, then it probably makes sense to enable "Split text blocks," regardless of whether you're using this script.
from dpscreenocr.