unstructured-io / unstructured-api
License: Apache License 2.0
I'm using the dockerized version, unfortunately the container fails after a few seconds, here are the logs:
docker-unstructured-unstructured-api-1 | Traceback (most recent call last):
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/__init__.py", line 168, in _load_global_deps
docker-unstructured-unstructured-api-1 | ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/ctypes/__init__.py", line 373, in __init__
docker-unstructured-unstructured-api-1 | self._handle = _dlopen(self._name, mode)
docker-unstructured-unstructured-api-1 | OSError: /home/notebook-user/.local/lib/python3.8/site-packages/torch/lib/../../nvidia/curand/lib/libcurand.so.10: invalid ELF header
docker-unstructured-unstructured-api-1 |
docker-unstructured-unstructured-api-1 | During handling of the above exception, another exception occurred:
docker-unstructured-unstructured-api-1 |
docker-unstructured-unstructured-api-1 | Traceback (most recent call last):
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/bin/uvicorn", line 8, in <module>
docker-unstructured-unstructured-api-1 | sys.exit(main())
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
docker-unstructured-unstructured-api-1 | return self.main(*args, **kwargs)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 1055, in main
docker-unstructured-unstructured-api-1 | rv = self.invoke(ctx)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
docker-unstructured-unstructured-api-1 | return ctx.invoke(self.callback, **ctx.params)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
docker-unstructured-unstructured-api-1 | return __callback(*args, **kwargs)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/main.py", line 410, in main
docker-unstructured-unstructured-api-1 | run(
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/main.py", line 578, in run
docker-unstructured-unstructured-api-1 | server.run()
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/server.py", line 61, in run
docker-unstructured-unstructured-api-1 | return asyncio.run(self.serve(sockets=sockets))
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
docker-unstructured-unstructured-api-1 | return loop.run_until_complete(main)
docker-unstructured-unstructured-api-1 | File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/server.py", line 68, in serve
docker-unstructured-unstructured-api-1 | config.load()
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/config.py", line 473, in load
docker-unstructured-unstructured-api-1 | self.loaded_app = import_from_string(self.app)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/importer.py", line 21, in import_from_string
docker-unstructured-unstructured-api-1 | module = importlib.import_module(module_str)
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
docker-unstructured-unstructured-api-1 | return _bootstrap._gcd_import(name[level:], package, level)
docker-unstructured-unstructured-api-1 | File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
docker-unstructured-unstructured-api-1 | File "<frozen importlib._bootstrap>", line 991, in _find_and_load
docker-unstructured-unstructured-api-1 | File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
docker-unstructured-unstructured-api-1 | File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
docker-unstructured-unstructured-api-1 | File "<frozen importlib._bootstrap_external>", line 843, in exec_module
docker-unstructured-unstructured-api-1 | File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/prepline_general/api/app.py", line 11, in <module>
docker-unstructured-unstructured-api-1 | from .general import router as general_router
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/prepline_general/api/general.py", line 29, in <module>
docker-unstructured-unstructured-api-1 | from unstructured_inference.models.chipper import MODEL_TYPES as CHIPPER_MODEL_TYPES
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/models/chipper.py", line 5, in <module>
docker-unstructured-unstructured-api-1 | import torch
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/__init__.py", line 228, in <module>
docker-unstructured-unstructured-api-1 | _load_global_deps()
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/__init__.py", line 189, in _load_global_deps
docker-unstructured-unstructured-api-1 | _preload_cuda_deps(lib_folder, lib_name)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/__init__.py", line 155, in _preload_cuda_deps
docker-unstructured-unstructured-api-1 | ctypes.CDLL(lib_path)
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/ctypes/__init__.py", line 373, in __init__
docker-unstructured-unstructured-api-1 | self._handle = _dlopen(self._name, mode)
docker-unstructured-unstructured-api-1 | OSError: /home/notebook-user/.local/lib/python3.8/site-packages/nvidia/curand/lib/libcurand.so.10: invalid ELF header
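For anyone debugging the same log: "invalid ELF header" means the loader did not find the ELF magic bytes at the start of libcurand.so.10, which may point to a truncated or otherwise damaged file in the image (the log alone does not prove this). A minimal sketch for checking that, assuming you can run Python inside the container; the helper names are hypothetical and not part of unstructured-api:

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every valid ELF file


def has_elf_header(path):
    """Return True if the file at `path` starts with the ELF magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(4) == ELF_MAGIC


def find_invalid_shared_objects(root):
    """Yield shared-object files under `root` that do not start with an ELF header."""
    for candidate in Path(root).rglob("*.so*"):
        if candidate.is_file() and not has_elf_header(candidate):
            yield candidate
```

Pointing find_invalid_shared_objects at the site-packages/nvidia directory from the traceback should list any library files that are damaged in this way.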
Being able to use the fast mode of Unstructured via the api is very important for our use case.
Our users interact in "interactive-mode" and wait on the same page while the document processes. None of them have uploaded a document that really needed OCR, and they are OK with using other services for that before uploading to our service.
Expose chunking_strategy as an API parameter now that https://github.com/Unstructured-IO/unstructured/pull/1304/files has merged.
Also support related args: multipage_sections, combine_under_n_chars and new_after_n_chars.
docker+unstructured-api locally hosts
Access to fetch at 'http://223.240.77.64:8000/general/v0/general' from origin 'http://localhost' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
VM15891 QGOe:98 POST http://223.240.77.64:8000/general/v0/general net::ERR_FAILED
If I don't specify the ocr_languages, it will default to "eng", but if the text is in a non-English language, it may result in garbled output. Is there a way to provide automatic language detection?
We're exploring using the unstructured API at work.
We're running quay.io/unstructured-io/unstructured-api:c9b74d4 on a "Pro" (private service) Render instance (i.e. 4GB RAM).
We're using the service to process PDFs with the following parameters: strategy=hi_res, pdf_infer_table_structure=true and skip_infer_table_types=[]. We're also using parallel mode via UNSTRUCTURED_PARALLEL_MODE_ENABLED=true (using the defaults for the other environment vars).
We've seen the service fall over several times due to OOM, and looking at metrics it looks as if there are resources not being freed after processing runs.
Each spike represents a processing run, with about 10 minutes between each.
Hi, the table parsing doesn't seem to work at all in my case.
I tried with multiple files (.pdf, .jpeg, .docx...).
It returns most cells as UncategorizedText and a few as Title.
I call the API using the following parameters:
data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only" if file.filename.lower().endswith((".jpeg", ".jpg", ".png")) else "auto")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")
and
async with session.post(
"http://unstructured-api:8000/general/v0/general",
headers={'accept': 'application/json'},
data=data
) as response:
Thanks !
The pipeline_api should be able to return a text/csv response instead of JSON if the response_type passed to pipeline_api is "text/csv".
In this case, pipeline_api should call convert_to_csv before returning the result.
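A rough sketch of the requested behaviour, for discussion only. The elements_to_csv helper below is a hypothetical stand-in for convert_to_csv (whose real signature is not shown in this issue), build_response is likewise an invented name, and elements are assumed to be plain dicts with "type" and "text" keys:

```python
import csv
import io


def elements_to_csv(elements):
    """Serialize element dicts to CSV text (hypothetical stand-in for convert_to_csv)."""
    fieldnames = ["type", "text"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for element in elements:
        # Missing keys become empty cells instead of raising KeyError.
        writer.writerow({name: element.get(name, "") for name in fieldnames})
    return buffer.getvalue()


def build_response(elements, response_type="application/json"):
    """Return (body, media_type), switching to CSV only when it is requested."""
    if response_type == "text/csv":
        return elements_to_csv(elements), "text/csv"
    return elements, "application/json"
```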
Definition of Done
In certain docker environments (e.g. google cloud run) this will fail with
env: 'bash ': No such file or directory
env: use -[v]S to pass options in shebang lines
Container called exit(127).
Note the space after bash
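To catch this kind of problem before building the image, a small check along these lines could be run over the entrypoint scripts. It is only a sketch; the function name is hypothetical and nothing like it exists in the repo as far as this issue shows:

```python
def shebang_problem(first_line: bytes):
    """Return a description of the problem in a shebang line, or None if it looks fine."""
    if not first_line.startswith(b"#!"):
        return None  # not a shebang line, nothing to check
    body = first_line[2:]
    if body.endswith(b"\r\n") or body.endswith(b"\r"):
        return "carriage return at end of shebang (CRLF line ending)"
    stripped = body.rstrip(b"\n")
    if stripped != stripped.rstrip(b" \t"):
        return "trailing whitespace at end of shebang"
    return None
```

Reading the first line of each script in binary mode and passing it to shebang_problem would flag the trailing space that makes env look for a program literally named 'bash '.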
hey, first of all, I want to express my gratitude for this project!
I think it's amazing and I'm almost done packaging it for NixOS. (If you are interested in the developments, please let me know your handle and I would be happy to tag you in the PRs. So far I've only packaged unstructured, so I still need to package unstructured-api and paddle-OCR. All the rest is done, it's almost there.)
While testing the server that you generously made publicly available, it seems that you are blocking requests made from localhost. I've got a small server that I like to test locally with. I noticed that the requests work when the server is deployed, but they don't work in my local environment.
I'm wondering if there is any way to disable that, no worries if there isn't.
We need to match the change in Unstructured here: Unstructured-IO/unstructured#1400
This will include:
languages param
Note: the ocr_languages param is deprecated.
I am running this in Python and my first PDF worked fine with the API. The first PDF was 38 pages.
I am trying now with a PDF which is 98 pages and I receive the error: "Receive unexpected status code 400 from the API.".
Any idea why?
Hello,
I'm using Docker to run unstructured, so I need to use the partition_via_api method to extract data from PDFs.
The extraction gives me some table elements, but only as plain text. I need to build pandas DataFrames from the table information.
Maybe the infer_table_structure option could help me, but how do I use this option with the partition_via_api method?
The documentation is not very clear about that.
Thanks,
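One possible way to get from the API output to rows, sketched under an assumption: that Table elements in the JSON response carry an HTML rendering of the table in metadata["text_as_html"] when table structure inference is enabled. The class and function names are made up for illustration, and only the standard library is used:

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(element):
    """Return the rows of one Table element dict, or [] if it has no HTML."""
    html = element.get("metadata", {}).get("text_as_html")
    if element.get("type") != "Table" or not html:
        return []
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

If the first row is a header, pandas.DataFrame(rows[1:], columns=rows[0]) would then give the DataFrame.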
API users are hitting this error on certain files.
PdfStreamError: Stream has ended unexpectedly
File "prepline_general/api/general.py", line 686, in pipeline_1
list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
File "prepline_general/api/general.py", line 607, in response_generator
response = pipeline_api(
File "prepline_general/api/general.py", line 278, in pipeline_api
pdf = PdfReader(file)
File "pypdf/_reader.py", line 332, in __init__
self.read(stream)
File "pypdf/_reader.py", line 1554, in read
self._find_eof_marker(stream)
File "pypdf/_reader.py", line 1625, in _find_eof_marker
line = read_previous_line(stream)
File "pypdf/_utils.py", line 268, in read_previous_line
raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
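The traceback shows pypdf failing while searching for the end-of-file marker, which suggests the uploaded bytes may be truncated. A cheap pre-check could reject such uploads with a clear message before PdfReader is called. This is a heuristic sketch with a hypothetical function name, not what the API currently does:

```python
def looks_truncated(pdf_bytes: bytes, tail_size: int = 1024) -> bool:
    """Return True if the data lacks a PDF header or an %%EOF marker near its end."""
    if not pdf_bytes.startswith(b"%PDF-"):
        return True
    # A complete PDF normally ends with %%EOF, possibly followed by a little padding.
    return b"%%EOF" not in pdf_bytes[-tail_size:]
```

Some otherwise readable PDFs have junk before the header, so a real implementation might want to be more lenient than this.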
In our efforts to refine the Dockerfile in this PR we found that the container successfully parses some document types, but for .pdf and .jpg doc types it downloads the model and then raises: Could not initialise NNPACK! Reason: Unsupported hardware.
How to reproduce:
make docker-build
make docker-start-api
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/layout-parser-paper.pdf'
Definition of done:
.pdf, .jpg, .pptx, .html and .eml documents in the sample-docs folder. The output should be a bunch of Unstructured elements.
Thanks for your efforts. I have one question with regard to data privacy: do you store any data/text sent to your API?
The response is a 500 internal error with no detailed information, so I cannot figure out what went wrong.
As in the subject line, I have received this error a few times across the course of the last few hours.
Full code
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-370-f9d5eba144d9> in <module>
9 )
10
---> 11 chev_docs_1 = loader.load()
/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in load(self)
84 def load(self) -> List[Document]:
85 """Load file."""
---> 86 elements = self._get_elements()
87 if self.mode == "elements":
88 docs: List[Document] = list()
/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in _get_elements(self)
267
268 def _get_elements(self) -> List:
--> 269 return get_elements_from_api(
270 file_path=self.file_path,
271 api_key=self.api_key,
/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in get_elements_from_api(file_path, file, api_url, api_key, **unstructured_kwargs)
202 from unstructured.partition.api import partition_via_api
203
--> 204 return partition_via_api(
205 filename=file_path,
206 file=file,
/opt/anaconda3/lib/python3.8/site-packages/unstructured/partition/api.py in partition_via_api(filename, content_type, file, file_filename, api_url, api_key, **request_kwargs)
88 files = [
89 ("files", (metadata_filename, file, content_type)), # type: ignore
---> 90 ]
91 response = requests.post(
92 api_url,
ValueError: Receive unexpected status code 504 from the API.
loader = UnstructuredAPIFileLoader(
file_path = tmpFilePath,
mode = "paged",
api_key = 'XXX',
content_type = 'multipart/form-data',
ocr_languages="eng+chi_sim"
)
docs = loader.load()
ValueError: Receive unexpected status code 400 from the API.
Following the tests done for partition in Unstructured main repo here, the idea is to create some simple tests that make use of the documents in sample-docs to execute most of the capabilities of the api:
Definition of done:
make test (update if necessary).
Hi
I checked GitHub and the documentation for a way to return the full text of any document (I don't need partitioning), and I couldn't find a solution mentioned anywhere.
I can loop through the JSON output and get the "text" attribute in my code, but I thought I'd ask whether this is already available as an option in unstructured-api.
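For reference, the client-side loop described above can be quite short. A minimal sketch, assuming the response body is the usual JSON list of elements, each with a "text" field; full_text is a hypothetical helper name:

```python
import json


def full_text(api_response: str, separator: str = "\n\n") -> str:
    """Join the "text" field of every element in the API's JSON output."""
    elements = json.loads(api_response)
    # Skip elements with empty or missing text, such as page breaks.
    return separator.join(e["text"] for e in elements if e.get("text"))
```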
In our efforts to refine the Dockerfile in this PR we found that both the container and the web app for the API successfully parse some document types, but .doc, .docx and .ppt raise ValueError: Invalid file. File type not support in partition. The filetype at the point of this error is FileType.UNK inside detect_filetype(filename, file) in unstructured.partition.auto.partition.
This happens despite having unstructured v0.5.0 installed.
How to reproduce:
make docker-build && make docker-start-api
make run-web-app
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/fake.doc'
(or using @sample-docs/fake.docx or @sample-docs/fake-power-point.ppt).
Definition of done:
.doc, .docx, .ppt, .html and .eml documents in the sample-docs folder (web app and docker container). The output should be a bunch of Unstructured elements.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte
(23 additional frame(s) were not displayed)
...
File "prepline_general/api/general.py", line 686, in pipeline_1
list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
File "prepline_general/api/general.py", line 607, in response_generator
response = pipeline_api(
File "prepline_general/api/general.py", line 418, in pipeline_api
raise e
File "prepline_general/api/general.py", line 396, in pipeline_api
elements = partition(
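Byte 0xc4 followed by a non-continuation byte is typical of text saved in a single-byte encoding such as cp1252 or latin-1 rather than UTF-8. One possible mitigation is to try a short list of encodings before giving up. The sketch below is an assumption about how that could look, with a hypothetical function name and encoding list; it is not the current behaviour of partition:

```python
def decode_text(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Decode bytes with the first encoding in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when a custom list without latin-1 fails entirely.
    return raw.decode("utf-8", errors="replace")
```

Guessing the encoding this way can produce wrong characters for other code pages, so letting the caller pass an explicit encoding is probably still worth supporting.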
When using the API and selecting the "auto" strategy for processing PDF files, an error occurs stating that the strategy is invalid. According to the documentation, there should be three available strategies: hi_res, fast, and auto.
Error: "Invalid strategy: auto. Must be one of ['fast', 'hi_res']"
Why can't OCR extract text from images stored in a .doc file, while it can do so from a PDF file? What args should I set?
Is there a parallel parsing method available to improve the speed of OCR recognition?
Hello Team,
What type of OCR is implemented in the docker version?
I ran a test on an image with both the docker version and the hosted API, and the results are very different; the docker version does not extract valid text most of the time.
With gzip compressed files support in https://github.com/Unstructured-IO/unstructured-api-tools , make this support explicit in unstructured-api.
Definition of Done
gz_uncompressed_content_type
)gz_uncompressed_content_type
)
(document-processing) root@wbj:/etc/unstructured-api# sudo make run-web-app
PYTHONPATH=/etc/unstructured-api uvicorn prepline_general.api.app:app --reload --log-config logger_config.yaml
2023-07-07 10:43:57,554 uvicorn.error INFO Will watch for changes in these directories: ['/etc/unstructured-api']
2023-07-07 10:43:57,555 uvicorn.error INFO Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
2023-07-07 10:43:57,555 uvicorn.error INFO Started reloader process [2805] using WatchFiles
[nltk_data] Error loading punkt: <urlopen error [Errno 111] Connection
[nltk_data] refused>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] [Errno 111] Connection refused>
2023-07-07 10:43:59,556 uvicorn.error INFO Started server process [2807]
2023-07-07 10:43:59,557 uvicorn.error INFO Waiting for application startup.
2023-07-07 10:43:59,557 uvicorn.error INFO Application startup complete.
2023-07-07 10:44:18,178 127.0.0.1:53472 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-07-07 10:44:18,178 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 290, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 241, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 169, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/usr/local/lib/python3.8/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/etc/unstructured-api/prepline_general/api/general.py", line 465, in pipeline_1
list(response_generator(is_multipart=False))[0]
File "/etc/unstructured-api/prepline_general/api/general.py", line 424, in response_generator
response = pipeline_api(
File "/etc/unstructured-api/prepline_general/api/general.py", line 239, in pipeline_api
elements = partition(
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/auto.py", line 180, in partition
elements = partition_pdf(
File "/usr/local/lib/python3.8/dist-packages/unstructured/documents/elements.py", line 119, in wrapper
elements = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/unstructured/file_utils/filetype.py", line 519, in wrapper
elements = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 82, in partition_pdf
return partition_pdf_or_image(
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 131, in partition_pdf_or_image
return _partition_pdf_with_pdfminer(
File "/usr/local/lib/python3.8/dist-packages/unstructured/utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 244, in _partition_pdf_with_pdfminer
elements = _process_pdfminer_pages(
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 301, in _process_pdfminer_pages
element = element_from_text(_text)
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text.py", line 151, in element_from_text
elif is_possible_narrative_text(text):
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text_type.py", line 222, in sentence_count
sentences = sent_tokenize(text)
File "/usr/local/lib/python3.8/dist-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
return _sent_tokenize(text)
File "/usr/local/lib/python3.8/dist-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 876, in _open
return find(path, path + [""]).open()
File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
When I run "import nltk" in python3, a new error appears:
import nltk
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/__init__.py", line 153, in <module>
from nltk.translate import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/translate/__init__.py", line 24, in <module>
from nltk.translate.meteor_score import meteor_score as meteor
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/translate/meteor_score.py", line 13, in <module>
from nltk.corpus import WordNetCorpusReader, wordnet
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/corpus/__init__.py", line 64, in <module>
from nltk.corpus.reader import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/corpus/reader/__init__.py", line 106, in <module>
from nltk.corpus.reader.panlex_lite import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/corpus/reader/panlex_lite.py", line 15, in <module>
import sqlite3
File "/root/.pyenv/versions/3.8.17/lib/python3.8/sqlite3/__init__.py", line 23, in <module>
from sqlite3.dbapi2 import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/sqlite3/dbapi2.py", line 27, in <module>
from _sqlite3 import *
ModuleNotFoundError: No module named '_sqlite3'
Many users are receiving this error. We already support .gz files, so this should be straightforward.
Per this gist: https://gist.github.com/cragwolfe/7789a3653c1dad2178c65014f0132233
the unstructured library is returning "auto" results for "fast" and something different for "hi_res" (which is good).
However, requesting "hi_res" from the API is currently also returning fast results, as documented in Unstructured-IO/unstructured#1150.
I'm packaging this for NixOS, and it would make things much easier for package maintainers on various distributions.
I understand if this is low priority though.
Thank you for this project!
This returns json:
curl 'http://localhost:8000/general/v0/general' --header 'Accept: application/json' --form files=@sample-docs/layout-parser-paper-fast.pdf --form output_format="text/csv"
And this returns csv:
curl 'http://localhost:8000/general/v0/general' --form files=@sample-docs/layout-parser-paper-fast.pdf --form output_format="text/csv"
I think we want some sort of indication that the Accept header is overriding the output_format, maybe an error when both are present with different filetypes?
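To make the proposal concrete, here is one way the conflict check could work. This is only a sketch: resolve_media_type is a hypothetical name, only exact single-value Accept headers are considered, and anything else (wildcards, multi-value headers, a missing header) is treated as "no preference":

```python
SUPPORTED = {"application/json", "text/csv"}


def resolve_media_type(accept_header, output_format):
    """Pick the response media type, rejecting requests that ask for two different ones."""
    accept = accept_header if accept_header in SUPPORTED else None
    requested = output_format if output_format in SUPPORTED else None
    if accept and requested and accept != requested:
        raise ValueError(
            f"Accept header ({accept}) conflicts with output_format ({requested})"
        )
    # output_format wins over a missing or wildcard Accept; JSON is the fallback.
    return requested or accept or "application/json"
```

With this rule the first curl command above would get an error instead of silently returning JSON, and the second would still return CSV.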
Images and PDF support the ocr_languages parameter in unstructured. Is this supported by the API as well?
Did you consider poetry for packaging ?
I would like to package this for nixos (a linux based distribution), and poetry makes building the application reproducible, which makes packaging for linux distribution much easier.
Just wondering if you have anything against it or if there are requirements that poetry wouldn't meet.
It would be nice to have the option of passing the file as a downloadable URL rather than a blob, so I don't have to download the file in my application server and then send it to unstructured (this is more efficient and avoids the 32 MB body limit on some hosting platforms).
In our efforts to refine the Dockerfile in this PR we found that the web app successfully parses some document types, but for .txt it raises: TypeError: cannot use a string pattern on a bytes-like object in both the container and web app.
How to reproduce:
make docker-build && make docker-start-api
make run-web-app
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/fake-text.txt'
Definition of done:
.txt, .html and .eml documents in the sample-docs folder (web app and docker container). The output should be a bunch of Unstructured elements.
I tried uploading a 50-page PDF using the API and received this error:
TypeError: __init__() got an unexpected keyword argument 'detection_class_prob'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-199-16b05c99d0ba> in <module>
10 )
11
---> 12 docs = loader.load()
/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in load(self)
84 def load(self) -> List[Document]:
85 """Load file."""
---> 86 elements = self._get_elements()
87 if self.mode == "elements":
88 docs: List[Document] = list()
/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in _get_elements(self)
267
268 def _get_elements(self) -> List:
--> 269 return get_elements_from_api(
270 file_path=self.file_path,
271 api_key=self.api_key,
/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in get_elements_from_api(file_path, file, api_url, api_key, **unstructured_kwargs)
202 from unstructured.partition.api import partition_via_api
203
--> 204 return partition_via_api(
205 filename=file_path,
206 file=file,
/opt/anaconda3/lib/python3.8/site-packages/unstructured/partition/api.py in partition_via_api(filename, content_type, file, file_filename, api_url, api_key, **request_kwargs)
86 "metadata_filename must be specified as well.",
87 )
---> 88 files = [
89 ("files", (metadata_filename, file, content_type)), # type: ignore
90 ]
/opt/anaconda3/lib/python3.8/site-packages/unstructured/staging/base.py in elements_from_json(filename, text, encoding)
117 """Loads a list of elements from a JSON file or a string."""
118 exactly_one(filename=filename, text=text)
--> 119
120 if filename:
121 with open(filename, encoding=encoding) as f:
/opt/anaconda3/lib/python3.8/site-packages/unstructured/staging/base.py in dict_to_elements(element_dict)
100 metadata=metadata,
101 ),
--> 102 )
103
104 return elements
/opt/anaconda3/lib/python3.8/site-packages/unstructured/staging/base.py in isd_to_elements(isd)
75 def isd_to_elements(isd: List[Dict[str, Any]]) -> List[Element]:
76 """Converts an Initial Structured Data (ISD) dictionary to a list of elements."""
---> 77 elements: List[Element] = []
78
79 for item in isd:
/opt/anaconda3/lib/python3.8/site-packages/unstructured/documents/elements.py in from_dict(cls, input_dict)
171 if isinstance(self.filename, pathlib.Path):
172 self.filename = str(self.filename)
--> 173
174 if self.filename is not None:
175 file_directory, filename = os.path.split(self.filename)
Anyone seen this before?
Hi, I'm using the following request to the API in order to extract some tables from a PDF file:
curl --location --request POST 'http://localhost:8000/general/v0/general' \
--form 'strategy="hi_res"' \
--form 'pdf_infer_table_structure="true"' \
--form 'files=@"TelecomArgentina_Report 2020.pdf"' \
--form 'ocr_languages="eng"' \
--form 'skip_infer_table_types=""'
The request fails after 14 minutes, and I see the following in the logs:
2023-08-09 06:52:34,237 172.17.0.1:46490 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-08-09 06:52:34,238 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/applications.py", line 289, in __call__
await super().__call__(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 273, in app
raw_response = await run_endpoint_function(
File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 192, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/notebook-user/prepline_general/api/general.py", line 591, in pipeline_1
list(response_generator(is_multipart=False))[0]
File "/home/notebook-user/prepline_general/api/general.py", line 535, in response_generator
response = pipeline_api(
File "/home/notebook-user/prepline_general/api/general.py", line 339, in pipeline_api
elements = partition(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/auto.py", line 221, in partition
elements = partition_pdf(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/documents/elements.py", line 222, in wrapper
elements = func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/file_utils/filetype.py", line 628, in wrapper
elements = func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 95, in partition_pdf
return partition_pdf_or_image(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 184, in partition_pdf_or_image
layout_elements = _partition_pdf_or_image_local(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 248, in _partition_pdf_or_image_local
layout = process_data_with_model(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 344, in process_data_with_model
layout = process_file_with_model(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 389, in process_file_with_model
else DocumentLayout.from_file(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 109, in from_file
page = PageLayout.from_image(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 315, in from_image
page.get_elements_with_detection_model()
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 235, in get_elements_with_detection_model
elements = self.get_elements_from_layout(inferred_layout)
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 246, in get_elements_from_layout
elements = [
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 247, in <listcomp>
get_element_from_block(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 414, in get_element_from_block
element.text = element.extract_text(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layoutelement.py", line 32, in extract_text
text = super().extract_text(
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 216, in extract_text
text = aggregate_by_block(self, image, objects, ocr_strategy)
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 316, in aggregate_by_block
text = ocr(text_region, image, languages=ocr_languages)
File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 272, in ocr
return agent.detect(cropped_image)
File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 122, in detect
res = self._detect(image)
File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 89, in _detect
res["text"] = pytesseract.image_to_string(
File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 423, in image_to_string
return {
File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 426, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 288, in run_and_get_output
run_tesseract(**kwargs)
File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 264, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 1425')
If I run the same query with the fast strategy, everything works fine, but the results are not acceptable.
I took a look at the Tesseract repo but could not find anything relevant. I would be glad for any help.
I am running the API locally through Docker, all on default settings.
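A note on reading the error tuple above: the last traceback frame shows pytesseract raising TesseractError(proc.returncode, ...), so -8 is the return code of the tesseract process, not a Tesseract error number. Python's subprocess module reports a child that was killed by signal N as -N on POSIX, and signal 8 is SIGFPE on Linux. That would suggest the tesseract binary itself crashed, with "Estimating resolution as 1425" being only the last thing it wrote to stderr. A minimal sketch for decoding such codes (describe_returncode is a hypothetical helper, not part of pytesseract):

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code as Python reports it on POSIX."""
    if returncode < 0:
        # subprocess encodes "terminated by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-returncode} ({name})"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


# The first element of the TesseractError tuple in the log above.
print(describe_returncode(-8))
```

If the crash is reproducible, running tesseract directly on the offending page image inside the container may narrow it down further.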
In our efforts to refine the Dockerfile in this PR, we found that the web app successfully parses some document types, but for .pptx it raises: AttributeError: 'SpooledTemporaryFile' object has no attribute 'seekable' (the same should happen on the main branch, but this has not been tested). The same occurs for the container parsing .docx files.
How to reproduce:
make docker-build && make docker-start-api
make run-web-app
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/fake.doc'
(or using @sample-docs/fake-power-point.ppt or @sample-docs/fake.docx)
Definition of done:
.pptx, .docx, .html and .eml documents in the sample-docs folder (web app and docker container). The output should be a bunch of Unstructured elements.
A file that either has the .pdf extension or has the Content-Type: application/pdf multipart/form header but is not truly a .pdf file (such as plain text) results in a 500 with a stack trace, shown below.
Instead, a 400 should be returned with a message indicating the issue. This should be the case for strategy=fast or strategy=hi_res.
In the future, the app should be able to process a payload based on inspection rather than filename (or Content-Type: multipart/form header) given the presence of another form parameter (e.g., infer-content-type-from-file).
File "/Users/cragwolfe/r/unstructured-api/prepline_general/api/general.py", line 291, in pipeline_1
response = pipeline_api(
File "/Users/cragwolfe/r/unstructured-api/prepline_general/api/general.py", line 98, in pipeline_api
elements = partition(
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/auto.py", line 82, in partition
return partition_pdf(
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 50, in partition_pdf
return partition_pdf_or_image(
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 111, in partition_pdf_or_image
return _partition_pdf_with_pdfminer(
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/utils.py", line 40, in wrapper
return func(*args, **kwargs)
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 213, in _partition_pdf_with_pdfminer
elements = _process_pdfminer_pages(
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 240, in _process_pdfminer_pages
for i, page in enumerate(PDFPage.get_pages(fp)):
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/pdfminer/pdfpage.py", line 151, in get_pages
doc = PDFDocument(parser, password=password, caching=caching)
File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 752, in __init__
raise PDFSyntaxError("No /Root object! - Is this really a PDF?")
pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?
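The requested behavior (a 400 rather than a 500 for files that only claim to be PDFs) could be approximated with a cheap pre-check before partitioning. The sketch below looks for the %PDF- marker near the start of the stream; this is a heuristic, not full validation, and looks_like_pdf, check_pdf_or_raise and NotAPdfError are hypothetical names, not part of unstructured or the API code.

```python
import io


class NotAPdfError(ValueError):
    """Hypothetical error that a request handler could translate into an HTTP 400."""


def looks_like_pdf(stream, probe_size: int = 1024) -> bool:
    """Heuristic: a PDF carries the %PDF- marker near the start of the file."""
    start = stream.tell()
    head = stream.read(probe_size)
    stream.seek(start)  # leave the stream where we found it for the real parser
    return b"%PDF-" in head


def check_pdf_or_raise(stream, filename: str) -> None:
    """Raise NotAPdfError if the stream does not look like a PDF."""
    if not looks_like_pdf(stream):
        raise NotAPdfError(f"{filename} does not appear to be a valid PDF")


# A plain-text payload mislabeled as .pdf, as described in the issue.
fake = io.BytesIO(b"just some plain text")
try:
    check_pdf_or_raise(fake, "notes.pdf")
except NotAPdfError as err:
    print(err)
```

In a FastAPI handler, the except branch could raise fastapi.HTTPException(status_code=400, detail=str(err)). A file that passes this check can still be malformed, so catching pdfminer's PDFSyntaxError around the partition call would still be needed.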
Hello,
Is there a way to use the PaddleOCR backend instead of Tesseract? I want to speed up table extraction and noticed that the main bottleneck is OCR. I have a GPU and want to maximise its usage. What is the recommended way to approach this?