Describe the bug I am evaluating the UnstructuredClient for proce

Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet

DarioBernardo commented on May 29, 2024

Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet

Comments (3)

christinestraub commented on May 29, 2024

Hi @DarioBernardo, can you please share the PDF document (c_20230111133942393_2525540.pdf)?

DarioBernardo commented on May 29, 2024

Hi @christinestraub, thank you for looking into my issue. Unfortunately I can't share the document, but I am confident the issue can be reproduced with most Greek documents. One thing worth mentioning: the document is a scan of a paper document, so its pages are images.

DarioBernardo commented on May 29, 2024

I'd like to provide some additional context regarding the issue. I searched online for publicly available PDF documents that could help replicate the problem, and I've confirmed that the issue arises when the API performs OCR on text contained in images within PDFs. Specifically, when the PDF is a scan of a document, the OCR tool behind the API fails to recognize Greek characters and substitutes ASCII characters for them. However, if the text can be read directly from the PDF, the correct non-ASCII characters are returned as Unicode escapes. This may be due to limitations in Tesseract, which I believe is the OCR tool behind the API.

For instance, you can test this using the document available here. The document title, being part of an image, is not recognized correctly, whereas the rest of the document, which is text-based, is accurately processed.
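To make the symptom easier to spot in extracted output, here is a minimal sketch that counts Greek versus Latin letters in a piece of text and flags text that is expected to be Greek but contains mostly other letters. It uses only the Python standard library. The function names, the 0.5 threshold, and the sample strings are hypothetical and chosen for illustration; they are not part of unstructured or its API.

```python
import unicodedata


def script_counts(text: str) -> dict:
    """Count alphabetic characters by script, based on their Unicode names."""
    counts = {"greek": 0, "latin": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK"):
            counts["greek"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


def looks_misrecognized(text: str, min_greek_share: float = 0.5) -> bool:
    """Return True if text expected to be Greek has too few Greek letters.

    The threshold is an assumption for illustration, not a tuned value.
    """
    counts = script_counts(text)
    letters = sum(counts.values())
    if letters == 0:
        return False  # nothing to judge
    return counts["greek"] / letters < min_greek_share


# Hypothetical samples: one read correctly, one rendered with Latin letters.
print(looks_misrecognized("Ελληνικά"))
print(looks_misrecognized("EAAnvika"))
```

Running a check like this over each extracted element would show which parts of a document came back in the wrong alphabet, for example an image-based title versus text-based body content.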
