Comments (2)
ocropus-gpageseg assumes that text lines are roughly the same scale. In return, it can detect even touching text lines in noisy documents pretty well. But that's only one of many strategies and possible tradeoffs. Your documents look like they are quite clean but have large variations in font size.
The best way to do text line recognition reliably is probably to run multiple different line detectors and combine their outputs.
As a simple version of that, you could run ocropus-gpageseg at several different scales, try to recognize every candidate text line from each parameter setting, and throw away the candidates that come out as gibberish because they were merged or split up.
Obviously, that is not going to be cheap. But ultimately, the only arbiter of whether a text line has been correctly segmented is whether you can recognize it, so for general purpose text line segmentation, invoking a recognizer somewhere is necessary.
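The multi-scale idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how ocropy does anything: the `Candidate` record, the vocabulary-based `plausibility` score, and the 0.5 threshold are all hypothetical stand-ins. In practice the candidates would come from segmenting the page at several scale settings and running a recognizer on each line, and the score would more likely come from the recognizer or a language model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate text line (hypothetical record, not an ocropy type)."""
    box: tuple    # (y0, x0, y1, x1), half-open pixel coordinates
    text: str     # recognizer output for this line
    scale: float  # segmentation scale setting that produced it


def plausibility(text, vocabulary):
    """Fraction of words found in a known vocabulary (crude gibberish test)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


def overlaps(a, b):
    """True if two half-open boxes share any area."""
    ay0, ax0, ay1, ax1 = a
    by0, bx0, by1, bx1 = b
    return ay0 < by1 and by0 < ay1 and ax0 < bx1 and bx0 < ax1


def select_lines(candidates, vocabulary, threshold=0.5):
    """Drop gibberish candidates, then greedily keep the best non-overlapping ones."""
    scored = [(plausibility(c.text, vocabulary), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s >= threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    kept = []
    for _, cand in scored:
        if not any(overlaps(cand.box, k.box) for k in kept):
            kept.append(cand)
    # Return in reading order: top to bottom, then left to right.
    return sorted(kept, key=lambda c: (c.box[0], c.box[1]))


vocabulary = {"hello", "world", "second", "line"}
candidates = [
    # One scale setting merged two lines into one; the recognizer output is garbage.
    Candidate(box=(0, 0, 40, 100), text="xq#z tlk@", scale=1.0),
    # Another setting separated them correctly.
    Candidate(box=(0, 0, 20, 100), text="hello world", scale=2.0),
    Candidate(box=(20, 0, 40, 100), text="second line", scale=2.0),
    # A split-off fragment that does not read as text.
    Candidate(box=(0, 0, 10, 100), text="hcllo wor1d", scale=0.5),
]
selected = select_lines(candidates, vocabulary)
```

Here `selected` keeps only the two correctly separated lines. The greedy overlap step is one possible design choice; it is what lets lines from different scale settings coexist on the same page when font sizes vary.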
For Latin script, you can also try to classify individual connected components as text/non-text and then attempt to group those together.
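A minimal sketch of that connected-component approach, using only the standard library. The size and aspect-ratio limits in `looks_like_text` and the 50% vertical-overlap rule in `group_into_lines` are assumptions chosen for illustration; a real text/non-text classifier would need to be trained or tuned on actual documents.

```python
from collections import deque


def connected_components(image):
    """Bounding boxes (y0, x0, y1, x1) of 4-connected foreground regions."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            y0, x0, y1, x1 = y, x, y, x
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                y0, x0 = min(y0, cy), min(x0, cx)
                y1, x1 = max(y1, cy), max(x1, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and image[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((y0, x0, y1 + 1, x1 + 1))  # half-open box
    return boxes


def looks_like_text(box, min_size=2, max_size=50, max_aspect=5.0):
    """Hypothetical heuristic: glyph-sized and not extremely elongated."""
    y0, x0, y1, x1 = box
    height, width = y1 - y0, x1 - x0
    if not (min_size <= height <= max_size and width <= max_size):
        return False
    return width / height <= max_aspect


def group_into_lines(boxes):
    """Group boxes whose vertical extents overlap by more than half."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # left to right
        for line in lines:
            overlap = min(line["y1"], box[2]) - max(line["y0"], box[0])
            if overlap > 0.5 * min(line["y1"] - line["y0"], box[2] - box[0]):
                line["boxes"].append(box)
                line["y0"] = min(line["y0"], box[0])
                line["y1"] = max(line["y1"], box[2])
                break
        else:
            lines.append({"y0": box[0], "y1": box[2], "boxes": [box]})
    return sorted(lines, key=lambda line: line["y0"])


def segment(image):
    """Label components, keep the text-like ones, and group them into lines."""
    text_boxes = [b for b in connected_components(image) if looks_like_text(b)]
    return group_into_lines(text_boxes)
```

Because grouping depends on each component's own height rather than on a single page-wide scale, small and large text can end up in separate lines on the same page. Touching lines, which merge into a single component, are the case this sketch does not handle.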
I'm planning on releasing a 2D LSTM based segmenter at some point, but that will still take a while.
from dup-ocropy.
Actually, in my example above, the layout segmentation is perfect with ocropus-gpageseg --vscale 2.