Comments (3)
Hi @DarioBernardo, can you please share the PDF document (c_20230111133942393_2525540.pdf)?
from unstructured.
Hi @christinestraub, thank you for looking into my issue. Unfortunately I can't share the document, but I am sure the issue can be reproduced with most Greek documents. One thing worth mentioning: the document is a scan of a paper document, so its pages are images rather than text.
I'd like to provide some additional context regarding the issue. I searched online for publicly available PDF documents that could help replicate the problem, and I can confirm that the issue arises when the API has to run OCR on text contained in images inside a PDF. Specifically, when the PDF is a scan of a document, the OCR tool behind the API fails to recognize Greek characters and substitutes ASCII characters for them. When the text can be read directly from the PDF, however, the correct non-ASCII characters are returned (as Unicode escapes). This may be due to limitations in Tesseract, which I believe is the OCR tool behind the API.
For instance, you can test this using the document available here. The document title, being part of an image, is not recognized correctly, whereas the rest of the document, which is text-based, is accurately processed.
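To make the symptom easier to spot in the returned elements, here is a minimal sketch of a check that flags text where Greek was expected but mostly Latin/ASCII letters came back. The helper names `greek_letter_ratio` and `looks_like_failed_greek_ocr` and the 0.5 threshold are my own assumptions for illustration; they are not part of unstructured or Tesseract.

```python
import unicodedata


def greek_letter_ratio(text: str) -> float:
    """Return the share of alphabetic characters in `text` that are Greek letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Greek letters contain the word "GREEK",
    # e.g. "GREEK SMALL LETTER ALPHA WITH TONOS".
    greek = [ch for ch in letters if "GREEK" in unicodedata.name(ch, "")]
    return len(greek) / len(letters)


def looks_like_failed_greek_ocr(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumed threshold): text has letters, but few of them are Greek."""
    has_letters = any(ch.isalpha() for ch in text)
    return has_letters and greek_letter_ratio(text) < threshold
```

Running this over the text of each returned element should separate the image-based title (flagged) from the text-based body (not flagged) in a document like the one above. It only detects the problem; it does not fix it. For reference, Tesseract's Greek model is named `ell`, but whether the API lets you pass a language through to the OCR step is something I have not verified.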