I tried with this small piece of an image, but it didn't recognize anything.


Fails to recognize short codes about normcap HOT 4 CLOSED

danibs commented on June 10, 2024

Fails to recognize short codes

from normcap.

Comments (4)

danibs commented on June 10, 2024

Your problem is with this specific kind of image, correct? Answer: YES
If you e.g. try to recognize text like this paragraph of my comment, it works? Answer: there is an issue with I (uppercase i), which was read as | (pipe). The same issue occurs whether I choose Italian+English or English only.

I tried with v0.4.0.

I will try tesseract directly and, if it works, I will post the solution.

Thanks


danibs commented on June 10, 2024

@dynobo I asked for help in Google Groups, and Nguyen answered me.
I hope it helps you improve NormCap (if you want to).
I reproduce the answer faithfully below.

I think you may need to preprocess your image before sending it to tesseract:

For example:
[Images of each intermediate result, in order: image, gray_image, blur1, otsu, erosion, blur]

SINGLE_LINE
6KDYT?79M"

AUTO
6KDYT?79M"

RAW_LINE
6KDYT79M

SPARSE_TEXT_OSD
6KDYT?79M"

SINGLE_WORD
6KDYT79M

As you can see, two PSM modes (RAW_LINE and SINGLE_WORD) give the correct result.

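Here is the full code in Python:

import cv2
import numpy as np

# Note: cv2_show (a display helper) and get_text (runs tesseract with several
# PSM modes) are not included in this answer and must be defined separately.

image_org = cv2.imread("unnamed.png")
height, width = image_org.shape[:2]

# Calculate the number of pixels to crop from each border (10% per side)
x_border = int(width * 0.1)
y_border = int(height * 0.1)

image = image_org[y_border:height - y_border, x_border:width - x_border]
cv2_show("image", image, 600)

# Convert to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2_show("gray_image", gray_image, 600)

# Strong Gaussian blur (21x21 kernel) before thresholding
blur1 = cv2.GaussianBlur(gray_image, (21, 21), 0)
cv2_show("blur1", blur1, 600)

# Global thresholding with Otsu's method; inverted, so dark pixels become white
_, otsu = cv2.threshold(blur1, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cv2_show("otsu", otsu, 800)

# Erode the white regions with a 3x3 kernel
kernel = np.ones((3, 3), np.uint8)
erosion = cv2.erode(otsu, kernel, iterations=1)
cv2_show("erosion", erosion, 800)

# Light blur (5x5 kernel) to smooth the edges
blur = cv2.GaussianBlur(erosion, (5, 5), 0)
cv2_show("blur", blur, 600)

# Invert the image again before OCR, then print each PSM mode name
# followed by the text recognized in that mode
results = get_text(255 - blur)
for result in results:
    print(result[0][0])
    print(result[1][0])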
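
The code above calls a get_text helper that the answer does not show. As a rough sketch only, such a helper might look like the following. It assumes the pytesseract package and a tesseract binary are installed. The names PSM_MODES and psm_config are hypothetical, and the return shape is chosen only so that the printing loop above works; the original helper may differ.

```python
# Hypothetical stand-in for the get_text helper, which the answer does not show.
# Assumes the pytesseract package and a tesseract binary are installed.

# Tesseract page segmentation modes seen in the output above (name -> --psm number)
PSM_MODES = {
    "SINGLE_LINE": 7,
    "AUTO": 3,
    "RAW_LINE": 13,
    "SPARSE_TEXT_OSD": 12,
    "SINGLE_WORD": 8,
}


def psm_config(psm: int) -> str:
    """Build the tesseract config string for one page segmentation mode."""
    return f"--psm {psm}"


def get_text(image, lang="eng"):
    """Run OCR once per PSM mode and return a list of ([mode_name], [text]) pairs."""
    import pytesseract  # imported lazily so the mode table is usable without it

    results = []
    for name, psm in PSM_MODES.items():
        text = pytesseract.image_to_string(image, lang=lang, config=psm_config(psm))
        results.append(([name], [text.strip()]))
    return results
```

Running every mode on the same preprocessed image makes it easy to compare which segmentation mode suits a given kind of input.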


dynobo commented on June 10, 2024

I'm glad you found a solution, and thanks a lot for taking the time to share it here 🙂

I probably won't include the sequence of filters in NormCap, as these seem very use-case specific and might hurt detection under different circumstances.

But your experiments regarding PSM modes are really interesting. In the past, I also stumbled upon the mediocre detection quality for character sequences that are not real words (like UUIDs or hashes), and always wanted to add a mode to NormCap that helps in such use cases. There are also the tesseract settings load_system_dawg and load_freq_dawg to disable the dictionary-based heuristics, and I can imagine that those settings, combined with the PSM setting RAW_LINE or SINGLE_WORD, could be added as such a new mode...

I've created a new issue #412 to follow up on that idea, and I'm closing this issue.
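
To illustrate the idea, here is a minimal sketch of how those settings could be passed to the tesseract command line. This is not NormCap's implementation. The function names build_tesseract_cmd and recognize_code are hypothetical, and it assumes a tesseract binary is on the PATH.

```python
import subprocess


def build_tesseract_cmd(image_path, psm=13, use_dictionaries=False, lang="eng"):
    """Build a tesseract CLI call; psm=13 is RAW_LINE, psm=8 is SINGLE_WORD."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if not use_dictionaries:
        # Disable the word-list based heuristics for non-word sequences
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


def recognize_code(image_path, **kwargs):
    """Run tesseract on the image and return the recognized text."""
    completed = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()
```

For example, recognize_code("code.png", psm=8) would try SINGLE_WORD with the dictionaries disabled. Whether this actually improves detection for a given image would need to be tested.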


dynobo commented on June 10, 2024

@danibs , thanks for reporting this issue and submitting a sample!

Just to be sure: Your problem is with this specific kind of image, correct? If you e.g. try to recognize text like this paragraph of my comment, does it work?

I tried to detect your sample, and the result is indeed a complete mess. I locally tried a lot of different settings, downloaded the larger "best" .traineddata files and tested various pre-processing of the image (especially scaling it down, as the font seems to be made for very small text), but I wasn't able to improve the detection quality significantly. 🙁

I'm afraid the problem is too difficult for NormCap with its general-purpose settings. In particular, the combination of an unusual "dotted" font with random letters (no "real" words) makes it really hard to detect.

If you have a lot of those sequences to detect, you could try to run tesseract directly and try to tweak preprocessing and settings for your specific use case.

I'll leave this issue open for some weeks, maybe someone else has an idea...

