unstructured-io / unstructured-api


License: Apache License 2.0

Makefile 2.59% Dockerfile 1.13% Jupyter Notebook 25.25% Python 62.32% Shell 8.58% HTML 0.06% Rich Text Format 0.07%

unstructured-api's Issues

invalid ELF header

I'm using the dockerized version. Unfortunately, the container fails after a few seconds. Here are the logs:
docker-unstructured-unstructured-api-1 | Traceback (most recent call last):
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/init.py", line 168, in _load_global_deps
docker-unstructured-unstructured-api-1 | ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/ctypes/init.py", line 373, in init
docker-unstructured-unstructured-api-1 | self._handle = _dlopen(self._name, mode)
docker-unstructured-unstructured-api-1 | OSError: /home/notebook-user/.local/lib/python3.8/site-packages/torch/lib/../../nvidia/curand/lib/libcurand.so.10: invalid ELF header
docker-unstructured-unstructured-api-1 |
docker-unstructured-unstructured-api-1 | During handling of the above exception, another exception occurred:
docker-unstructured-unstructured-api-1 |
docker-unstructured-unstructured-api-1 | Traceback (most recent call last):
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/bin/uvicorn", line 8, in
docker-unstructured-unstructured-api-1 | sys.exit(main())
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 1130, in call
docker-unstructured-unstructured-api-1 | return self.main(*args, **kwargs)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 1055, in main
docker-unstructured-unstructured-api-1 | rv = self.invoke(ctx)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
docker-unstructured-unstructured-api-1 | return ctx.invoke(self.callback, **ctx.params)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
docker-unstructured-unstructured-api-1 | return __callback(*args, **kwargs)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/main.py", line 410, in main
docker-unstructured-unstructured-api-1 | run(
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/main.py", line 578, in run
docker-unstructured-unstructured-api-1 | server.run()
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/server.py", line 61, in run
docker-unstructured-unstructured-api-1 | return asyncio.run(self.serve(sockets=sockets))
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
docker-unstructured-unstructured-api-1 | return loop.run_until_complete(main)
docker-unstructured-unstructured-api-1 | File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/server.py", line 68, in serve
docker-unstructured-unstructured-api-1 | config.load()
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/config.py", line 473, in load
docker-unstructured-unstructured-api-1 | self.loaded_app = import_from_string(self.app)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/importer.py", line 21, in import_from_string
docker-unstructured-unstructured-api-1 | module = importlib.import_module(module_str)
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/importlib/init.py", line 127, in import_module
docker-unstructured-unstructured-api-1 | return _bootstrap._gcd_import(name[level:], package, level)
docker-unstructured-unstructured-api-1 | File "", line 1014, in _gcd_import
docker-unstructured-unstructured-api-1 | File "", line 991, in _find_and_load
docker-unstructured-unstructured-api-1 | File "", line 975, in _find_and_load_unlocked
docker-unstructured-unstructured-api-1 | File "", line 671, in _load_unlocked
docker-unstructured-unstructured-api-1 | File "", line 843, in exec_module
docker-unstructured-unstructured-api-1 | File "", line 219, in _call_with_frames_removed
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/prepline_general/api/app.py", line 11, in
docker-unstructured-unstructured-api-1 | from .general import router as general_router
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/prepline_general/api/general.py", line 29, in
docker-unstructured-unstructured-api-1 | from unstructured_inference.models.chipper import MODEL_TYPES as CHIPPER_MODEL_TYPES
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/models/chipper.py", line 5, in
docker-unstructured-unstructured-api-1 | import torch
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/init.py", line 228, in
docker-unstructured-unstructured-api-1 | _load_global_deps()
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/init.py", line 189, in _load_global_deps
docker-unstructured-unstructured-api-1 | _preload_cuda_deps(lib_folder, lib_name)
docker-unstructured-unstructured-api-1 | File "/home/notebook-user/.local/lib/python3.8/site-packages/torch/init.py", line 155, in _preload_cuda_deps
docker-unstructured-unstructured-api-1 | ctypes.CDLL(lib_path)
docker-unstructured-unstructured-api-1 | File "/usr/local/lib/python3.8/ctypes/init.py", line 373, in init
docker-unstructured-unstructured-api-1 | self._handle = _dlopen(self._name, mode)
docker-unstructured-unstructured-api-1 | OSError: /home/notebook-user/.local/lib/python3.8/site-packages/nvidia/curand/lib/libcurand.so.10: invalid ELF header

Fast Parameter

Being able to use the fast mode of Unstructured via the api is very important for our use case.

Our users interact in "interactive-mode" and wait on the same page while the document processes. None of them have uploaded a document that really needed OCR, and they are OK with using other services for that before uploading to our service.

feat: add chunking strategy API params

Expose chunking_strategy as an API parameter now that https://github.com/Unstructured-IO/unstructured/pull/1304/files has merged.

Also support related args: multipage_sections, combine_under_n_chars and new_after_n_chars.
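
A minimal sketch of how a client might pass these parameters once they are exposed, assuming they are accepted as multipart form fields alongside files, like the other options of this API. The helper name build_chunking_form, the "by_title" default, and the example values are assumptions for illustration, not confirmed API behavior.

```python
from typing import Dict, Optional


def build_chunking_form(
    chunking_strategy: str = "by_title",  # assumed strategy name
    multipage_sections: Optional[bool] = None,
    combine_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
) -> Dict[str, str]:
    """Build the form fields for a chunked partition request.

    Form values are sent as strings; unset options are omitted so that
    the server-side defaults apply.
    """
    form = {"chunking_strategy": chunking_strategy}
    if multipage_sections is not None:
        form["multipage_sections"] = "true" if multipage_sections else "false"
    if combine_under_n_chars is not None:
        form["combine_under_n_chars"] = str(combine_under_n_chars)
    if new_after_n_chars is not None:
        form["new_after_n_chars"] = str(new_after_n_chars)
    return form


if __name__ == "__main__":
    print(build_chunking_form(combine_under_n_chars=500, new_after_n_chars=1500))
```

With the third-party requests package, the resulting dictionary could be sent as the data argument of requests.post("http://localhost:8000/general/v0/general", files=..., data=form).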

CORS error when running the service locally, please help

unstructured-api hosted locally with Docker

Access to fetch at 'http://223.240.77.64:8000/general/v0/general' from origin 'http://localhost' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
VM15891 QGOe:98 POST http://223.240.77.64:8000/general/v0/general net::ERR_FAILED

bug/parallel mode: unsupported operand types

I don't have the file that caused this, but element.metadata.page_number is apparently None.
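
A hedged sketch of a None-safe fix, assuming the failure comes from adding a page offset to page_number when the results of separately processed page ranges are merged. The dict-based elements and the function name shift_page_numbers are stand-ins for illustration; the real code operates on element objects.

```python
from typing import Any, Dict, List


def shift_page_numbers(
    elements: List[Dict[str, Any]], page_offset: int
) -> List[Dict[str, Any]]:
    """Add page_offset to each element's page number, leaving None untouched."""
    shifted = []
    for element in elements:
        metadata = dict(element.get("metadata") or {})
        page_number = metadata.get("page_number")
        if page_number is not None:
            # Only offset real page numbers; None + int would raise TypeError.
            metadata["page_number"] = page_number + page_offset
        shifted.append({**element, "metadata": metadata})
    return shifted
```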

Can the language type of a PDF be automatically detected?

If I don't specify the ocr_languages, it will default to "eng", but if the text is in a non-English language, it may result in garbled output. Is there a way to provide automatic language detection?

Memory leak

We're exploring using the unstructured API at work.

We're running quay.io/unstructured-io/unstructured-api:c9b74d4 on a "Pro" (private service) Render instance (i.e. 4GB RAM)

We're using the service to process PDFs with the following parameters strategy=hi_res, pdf_infer_table_structure=true and skip_infer_table_types=[]. We're also using parallel mode via UNSTRUCTURED_PARALLEL_MODE_ENABLED=true (using the defaults for the other environment vars).

We've seen the service fall over several times due to OOM, and looking at metrics it looks as if there are resources not being freed after processing runs.

Each spike represents a processing run, with about 10 minutes between each.

Table parsing doesn't work

Hi, the table parsing doesn't seem to work at all in my case.
I tried with multiple files (.pdf, .jpeg, .docx...)

It returns most cells as UncategorizedText and a few as Title.

I call the API using the following parameters :

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only" if file.filename.lower().endswith((".jpeg", ".jpg", ".png")) else "auto")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")

and

async with session.post(
    "http://unstructured-api:8000/general/v0/general",
    headers={'accept': 'application/json'},
    data=data
) as response:

Thanks !

Ability to return text/csv instead of json

The pipeline_api should be able to return a text/csv response instead of json if the response_type passed through pipeline_api is "text/csv".

In this case, pipeline_api should call convert_to_csv before returning the result.

Definition of Done

Unit tests added confirming the new functionality
Plus a smoketest
README is updated indicating this capability
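
To illustrate the shape of the requested response, here is a small standard-library sketch that turns element dictionaries into CSV text. It is not the library's convert_to_csv; the function name elements_to_csv and the choice of the "type" and "text" columns are assumptions for illustration.

```python
import csv
import io
from typing import Any, Dict, List


def elements_to_csv(elements: List[Dict[str, Any]]) -> str:
    """Serialize the "type" and "text" fields of each element as CSV."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["type", "text"])
    writer.writeheader()
    for element in elements:
        writer.writerow(
            {"type": element.get("type", ""), "text": element.get("text", "")}
        )
    return buffer.getvalue()
```

The csv module quotes fields containing commas or newlines, which matters for narrative text.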

Docker entry point error

https://github.com/Unstructured-IO/unstructured-api/blob/c31465029047fcfc67e401e5b8b75fae95493400/scripts/app-start.sh#L1C17-L1C17

In certain docker environments (e.g. google cloud run) this will fail with

env: 'bash ': No such file or directory
env: use -[v]S to pass options in shebang lines
Container called exit(127).

Note the space after bash
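
A small check that could catch this class of problem before an image is built, sketched with a hypothetical helper name (shebang_problem). In environments that pass the rest of the shebang line to env verbatim, a trailing space or carriage return becomes part of the program name, which matches the 'bash ' in the error above.

```python
from typing import Optional


def shebang_problem(first_line: bytes) -> Optional[str]:
    """Return a description of the problem with a shebang line, or None."""
    if not first_line.startswith(b"#!"):
        return None  # not a shebang line, nothing to check
    body = first_line.rstrip(b"\n")
    if body.endswith(b"\r"):
        return "CRLF line ending"
    if body != body.rstrip(b" \t"):
        return "trailing whitespace"
    return None


if __name__ == "__main__":
    print(shebang_problem(b"#!/usr/bin/env bash \n"))
```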

https://github.com/Unstructured-IO/unstructured-api.git

unblocking requests from localhost

Hey, first of all, I want to express my gratitude for this project!
I think it's amazing and I'm almost done packaging it for NixOS. (If you are interested in the developments, please let me know your handle and I would be happy to tag you in the PRs. So far I've only packaged unstructured, so I still need to package unstructured-api and paddle-OCR. All the rest is done; it's almost there.)

While testing the server you generously made publicly available, it seems that you are blocking requests made from localhost. I've got a small server that I like to test locally with. I noticed that the requests work when the server is deployed but they don't work in my local environment.

I'm wondering if there is any way to disable that, no worries if there isn't.

chore: Add languages param to api

We need to match the change in Unstructured here: Unstructured-IO/unstructured#1400

This will include:

Adding support for the new languages param
Updating the unit test
Updating the docs

Note the ocr_languages param is deprecated.

Receive unexpected status code 400 from the API.

I am running this in Python and my first PDF worked fine with the API. The first PDF was 38 pages.

I am trying now with a PDF which is 98 pages and I receive the error: "Receive unexpected status code 400 from the API.".

Any idea why?

Extract table structure with partition_via_api method

Hello,
I'm using Docker to run unstructured, so I need to use the partition_via_api method to extract data from PDFs.
The extraction gives me some table elements, but only as plain text. I need to build pandas DataFrames from the table information.
Maybe the infer_table_structure option could help, but how do I use it with partition_via_api?
The documentation is not very clear about that.

Thanks,
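
One possible approach, sketched under assumptions: partition_via_api takes extra keyword arguments (**request_kwargs in its signature), so pdf_infer_table_structure=True together with strategy="hi_res" could be passed that way. Assuming the returned Table elements then carry an HTML rendering in metadata.text_as_html, the rows can be recovered with the standard library. TableRows and table_html_to_rows are hypothetical helpers, not part of unstructured.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_html_to_rows(table_html: str) -> List[List[str]]:
    """Parse one HTML table into rows of stripped cell strings."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]
```

If the first row is a header, pandas.DataFrame(rows[1:], columns=rows[0]) would give the DataFrame asked for above.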

PdfStreamError: Stream has ended unexpectedly

API users are hitting this error on certain files.

PdfStreamError: Stream has ended unexpectedly
  File "prepline_general/api/general.py", line 686, in pipeline_1
    list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
  File "prepline_general/api/general.py", line 607, in response_generator
    response = pipeline_api(
  File "prepline_general/api/general.py", line 278, in pipeline_api
    pdf = PdfReader(file)
  File "pypdf/_reader.py", line 332, in __init__
    self.read(stream)
  File "pypdf/_reader.py", line 1554, in read
    self._find_eof_marker(stream)
  File "pypdf/_reader.py", line 1625, in _find_eof_marker
    line = read_previous_line(stream)
  File "pypdf/_utils.py", line 268, in read_previous_line
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
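
The traceback shows pypdf failing while searching for the end-of-file marker, which usually points to a truncated or incomplete upload. A heuristic pre-check like the one below could let the API reject such files with a clear client error instead of an unhandled exception. This is a sketch only; looks_truncated and the 1024-byte window are assumptions.

```python
import io
from typing import BinaryIO


def looks_truncated(stream: BinaryIO, tail_size: int = 1024) -> bool:
    """Return True if no %%EOF marker is found in the last tail_size bytes."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_size))
    tail = stream.read()
    stream.seek(0)  # leave the stream ready for the real parser
    return b"%%EOF" not in tail


if __name__ == "__main__":
    print(looks_truncated(io.BytesIO(b"%PDF-1.7\n1 0 obj\n")))
```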

bug: `partition_pdf_or_image` raises `Unsupported hardware` inside docker container

In our efforts to refine the Dockerfile in this PR we found that the container successfully parses some document types, but for .pdf and .jpg files it downloads the model and then raises: Could not initialise NNPACK! Reason: Unsupported hardware.

How to reproduce:

make docker-build
make docker-start-api
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/layout-parser-paper.pdf'

Definition of done:

Successfully curl POST the .pdf, .jpg, .pptx, .html and .eml documents in the sample-docs folder. The output should be a bunch of Unstructured elements.

Data Privacy

Thanks for your efforts. I have one question regarding data privacy: do you store any data or text sent to your API?

API response 500 error

The response is a 500 internal error with no detailed information, and I cannot figure out the cause.

UnboundLocalError: local variable 'pdf' referenced before assignment

See attached. There's a logic bug where pdf isn't set. Quick glance at the code - this may mean a file has type pdf but the extension is not .pdf.

ValueError: Receive unexpected status code 504 from the API.

As in the subject line, I have received this error a few times across the course of the last few hours.

Full code

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-370-f9d5eba144d9> in <module>
      9 )
     10 
---> 11 chev_docs_1 = loader.load()

/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in load(self)
     84     def load(self) -> List[Document]:
     85         """Load file."""
---> 86         elements = self._get_elements()
     87         if self.mode == "elements":
     88             docs: List[Document] = list()

/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in _get_elements(self)
    267 
    268     def _get_elements(self) -> List:
--> 269         return get_elements_from_api(
    270             file_path=self.file_path,
    271             api_key=self.api_key,

/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in get_elements_from_api(file_path, file, api_url, api_key, **unstructured_kwargs)
    202         from unstructured.partition.api import partition_via_api
    203 
--> 204         return partition_via_api(
    205             filename=file_path,
    206             file=file,

/opt/anaconda3/lib/python3.8/site-packages/unstructured/partition/api.py in partition_via_api(filename, content_type, file, file_filename, api_url, api_key, **request_kwargs)
     88         files = [
     89             ("files", (metadata_filename, file, content_type)),  # type: ignore
---> 90         ]
     91         response = requests.post(
     92             api_url,

ValueError: Receive unexpected status code 504 from the API.
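
Since a 504 is typically transient, a client-side workaround is to retry with exponential backoff. The sketch below is generic; call_with_retries and its defaults are assumptions for illustration. Note that the same ValueError type is raised for other status codes such as 400, where retrying will not help.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ValueError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff when it raises retry_on."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the loader from the traceback, this would be used as call_with_retries(loader.load).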

400 error, I cannot figure it out

loader = UnstructuredAPIFileLoader(
    file_path=tmpFilePath,
    mode="paged",
    api_key='XXX',
    content_type='multipart/form-data',
    ocr_languages="eng+chi_sim",
)
docs = loader.load()
ValueError: Receive unexpected status code 400 from the API.

Add test script(s) for global coverage > 95%

Following the tests done for partition in Unstructured main repo here, the idea is to create some simple tests that make use of the documents in sample-docs to execute most of the capabilities of the api:

Definition of done:

Test coverage > 95% for the make test target (update if necessary).
Include all document types in the tests even if the coverage goal has already been achieved.

Return just text (not json)

I checked GitHub and the documentation for a way to return the full text of a document (I don't need partitioning), and I couldn't find anyone mentioning a solution.

I can loop through the json output and get the "text" attribute in my code but I thought to ask you if this is already available as an option in unstructured-api
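
For reference, the client-side loop described above can be a one-liner. This sketch assumes the JSON response is a list of element objects with a "text" field, as described in the issue; elements_to_text and the separator are illustrative choices.

```python
from typing import Any, Dict, List


def elements_to_text(elements: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Join the non-empty "text" fields of the returned elements."""
    return separator.join(e["text"] for e in elements if e.get("text"))


if __name__ == "__main__":
    sample = [
        {"type": "Title", "text": "Hello"},
        {"type": "NarrativeText", "text": "World"},
    ]
    print(elements_to_text(sample))
```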

bug: DOC|DOCX and PPT filetypes are UNK in web-app/docker-container

In our efforts to refine the Dockerfile in this PR we found that both the container and the web app for the API successfully parse some document types, but for .doc, .docx and .ppt they raise ValueError: Invalid file. File type not support in partition. The filetype at the point of this error is FileType.UNK inside detect_filetype(filename, file) in unstructured.partition.auto.partition. This happens despite having unstructured v0.5.0 installed.

How to reproduce:

For the image/container: make docker-build && make docker-start-api
For the web app: make run-web-app
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/fake.doc' (or using @sample-docs/fake.docx or @sample-docs/fake-power-point.ppt).

Definition of done:

Successfully curl POST the .doc, .docx, .ppt, .html and .eml documents in the sample-docs folder (web app and docker container). The output should be a bunch of Unstructured elements.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte
(23 additional frame(s) were not displayed)
...
  File "prepline_general/api/general.py", line 686, in pipeline_1
    list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
  File "prepline_general/api/general.py", line 607, in response_generator
    response = pipeline_api(
  File "prepline_general/api/general.py", line 418, in pipeline_api
    raise e
  File "prepline_general/api/general.py", line 396, in pipeline_api
    elements = partition(
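
The error means the uploaded bytes are not valid UTF-8 (byte 0xc4 followed by a non-continuation byte is common in Windows-1252 or Latin-1 text). One way to make text handling more tolerant is to try a short list of encodings before giving up. This is a sketch of that idea, not the API's behavior; the function name and the encoding order are assumptions.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Decode data with the first encoding that succeeds.

    Returns the decoded text and the name of the encoding used.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller's list has no catch-all such as latin-1.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

A fallback of this kind guesses rather than detects, so the result should be treated as best effort.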

"auto" is invalid strategy for pdfs

When using the API and selecting the "auto" strategy for processing PDF files, an error occurs stating that the strategy is invalid. According to the documentation, there should be three available strategies: hi_res, fast, and auto.

Error: "Invalid strategy: auto. Must be one of ['fast', 'hi_res']"

PDF Unable to be processed by API because of ".pdf does not appear to be a valid PDF"

I passed in a PDF file that is of kind ".pdf" and our api returned an error stating the file wasn't a valid PDF. However, the PDF is able to be processed by Unstructured's python library as a pdf and is identified by our file detection system as a PDF.

OCR cannot extract text from images in a .doc file

Why can't OCR extract text from images embedded in a .doc file when it can do so for a PDF file? Which arguments should I set?

OCR is slow

Is there a parallel parsing method available to improve OCR speed?

OCR type in api docker version

Hello Team,

What type of OCR is implemented in the docker version?

I tested an image with both the Docker version and the hosted API and the results are very different; most of the time the Docker version does not extract valid text.

Ability to accept gzip compressed files

With gzip compressed files support in https://github.com/Unstructured-IO/unstructured-api-tools , make this support explicit in unstructured-api.

Definition of Done

Unit tests added confirming the new functionality (both with and without the form parameter gz_uncompressed_content_type)
Plus a couple of smoke tests (both with and without the form parameter gz_uncompressed_content_type)
README is updated indicating this capability with a couple of sample curl commands.
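
A hedged sketch of what a client-side request could look like, in Python rather than curl. The helper build_gzip_upload, the ".gz" suffix, and the "application/gzip" part type are assumptions; only the gz_uncompressed_content_type form parameter comes from this issue.

```python
import gzip
from typing import Dict, Optional, Tuple

FilePart = Tuple[str, bytes, str]


def build_gzip_upload(
    filename: str,
    content: bytes,
    uncompressed_content_type: Optional[str] = None,
) -> Tuple[Dict[str, FilePart], Dict[str, str]]:
    """Return (files, data) for a multipart request carrying a gzipped file."""
    files = {"files": (filename + ".gz", gzip.compress(content), "application/gzip")}
    data: Dict[str, str] = {}
    if uncompressed_content_type is not None:
        # Tells the server what the payload is once decompressed.
        data["gz_uncompressed_content_type"] = uncompressed_content_type
    return files, data
```

Both return values have the shape expected by the files and data arguments of requests.post from the third-party requests package.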

Hi, when I post PDF files, an error comes up

(document-processing) root@wbj:/etc/unstructured-api# sudo make run-web-app
PYTHONPATH=/etc/unstructured-api uvicorn prepline_general.api.app:app --reload --log-config logger_config.yaml
2023-07-07 10:43:57,554 uvicorn.error INFO Will watch for changes in these directories: ['/etc/unstructured-api']
2023-07-07 10:43:57,555 uvicorn.error INFO Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
2023-07-07 10:43:57,555 uvicorn.error INFO Started reloader process [2805] using WatchFiles
[nltk_data] Error loading punkt: <urlopen error [Errno 111] Connection
[nltk_data] refused>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] [Errno 111] Connection refused>
2023-07-07 10:43:59,556 uvicorn.error INFO Started server process [2807]
2023-07-07 10:43:59,557 uvicorn.error INFO Waiting for application startup.
2023-07-07 10:43:59,557 uvicorn.error INFO Application startup complete.
2023-07-07 10:44:18,178 127.0.0.1:53472 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-07-07 10:44:18,178 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 290, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in call
raise exc
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in call
raise exc
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in call
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in call
raise e
File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in call
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 241, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 169, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/usr/local/lib/python3.8/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/etc/unstructured-api/prepline_general/api/general.py", line 465, in pipeline_1
list(response_generator(is_multipart=False))[0]
File "/etc/unstructured-api/prepline_general/api/general.py", line 424, in response_generator
response = pipeline_api(
File "/etc/unstructured-api/prepline_general/api/general.py", line 239, in pipeline_api
elements = partition(
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/auto.py", line 180, in partition
elements = partition_pdf(
File "/usr/local/lib/python3.8/dist-packages/unstructured/documents/elements.py", line 119, in wrapper
elements = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/unstructured/file_utils/filetype.py", line 519, in wrapper
elements = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 82, in partition_pdf
return partition_pdf_or_image(
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 131, in partition_pdf_or_image
return _partition_pdf_with_pdfminer(
File "/usr/local/lib/python3.8/dist-packages/unstructured/utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 244, in _partition_pdf_with_pdfminer
elements = _process_pdfminer_pages(
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/pdf.py", line 301, in _process_pdfminer_pages
element = element_from_text(_text)
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text.py", line 151, in element_from_text
elif is_possible_narrative_text(text):
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
File "/usr/local/lib/python3.8/dist-packages/unstructured/partition/text_type.py", line 222, in sentence_count
sentences = sent_tokenize(text)
File "/usr/local/lib/python3.8/dist-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
return _sent_tokenize(text)
File "/usr/local/lib/python3.8/dist-packages/nltk/tokenize/init.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 876, in open
return find(path, path + [""]).open()
File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt/PY3/english.pickle

Searched in:
- '/root/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''

When I run "import nltk" in python3, a new error comes up:

import nltk
Traceback (most recent call last):
File "", line 1, in
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/init.py", line 153, in
from nltk.translate import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/translate/init.py", line 24, in
from nltk.translate.meteor_score import meteor_score as meteor
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/translate/meteor_score.py", line 13, in
from nltk.corpus import WordNetCorpusReader, wordnet
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/corpus/init.py", line 64, in
from nltk.corpus.reader import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/corpus/reader/init.py", line 106, in
from nltk.corpus.reader.panlex_lite import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/site-packages/nltk/corpus/reader/panlex_lite.py", line 15, in
import sqlite3
File "/root/.pyenv/versions/3.8.17/lib/python3.8/sqlite3/init.py", line 23, in
from sqlite3.dbapi2 import *
File "/root/.pyenv/versions/3.8.17/lib/python3.8/sqlite3/dbapi2.py", line 27, in
from _sqlite3 import *
ModuleNotFoundError: No module named '_sqlite3'

How can I fix this problem?

File type application/zip is not supported.

Many users are receiving this error. We already support .gz files, so this should be straightforward.
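
As a rough illustration of what zip support could involve, the sketch below unpacks an uploaded archive in memory so each member can be partitioned on its own. The function name iter_zip_members is hypothetical and this is not the project's implementation. Note that Office formats such as .docx are themselves zip containers, so a real implementation would need to tell those apart from plain archives.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_zip_members(archive: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) for every regular file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            yield info.filename, zf.read(info)
```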

"fast" output returned for "hi_res"

Per this gist: https://gist.github.com/cragwolfe/7789a3653c1dad2178c65014f0132233
the unstructured library is returning "auto" results for "fast" and something different for "hi_res" (which is good).

However, requesting "hi_res" from the API is currently also returning fast results, as documented in
Unstructured-IO/unstructured#1150 .

Would you consider making GitHub releases for each version?

I'm packaging this for NixOS, and it would make things much easier for package maintainers on various distributions.
I understand if this is low priority though.
Thank you for this project!

bug/output_format param does not work when Accept header is sent

This returns json:

curl 'http://localhost:8000/general/v0/general' --header 'Accept: application/json' --form files=@sample-docs/layout-parser-paper-fast.pdf --form output_format="text/csv"

And this returns csv:

curl 'http://localhost:8000/general/v0/general' --form files=@sample-docs/layout-parser-paper-fast.pdf --form output_format="text/csv"

I think we want some sort of indication that the Accept header is overriding the output_type, maybe an error when both are present with different filetypes?
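
The suggested rule can be sketched as a small resolver: treat a missing or wildcard Accept header (curl sends "Accept: */*" by default, which is why the second command returns CSV) as no preference, and raise when an explicit Accept value disagrees with output_format. The function name, the supported types, and the single-value Accept simplification are assumptions for illustration.

```python
from typing import Optional

SUPPORTED_MEDIA_TYPES = ("application/json", "text/csv")
DEFAULT_MEDIA_TYPE = "application/json"


def resolve_media_type(accept: Optional[str], output_format: Optional[str]) -> str:
    """Pick the response media type, rejecting conflicting requests.

    Simplification: accept is treated as a single media type, without
    q-values or comma-separated alternatives.
    """
    requested = accept.strip() if accept else None
    if requested in (None, "", "*/*"):
        requested = None  # no explicit preference
    if requested and output_format and requested != output_format:
        raise ValueError(
            f"Accept header {requested!r} conflicts with output_format {output_format!r}"
        )
    chosen = output_format or requested or DEFAULT_MEDIA_TYPE
    if chosen not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported media type: {chosen!r}")
    return chosen
```

In an HTTP handler, the conflict case would more naturally map to a 4xx response than to a Python exception.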

Question: Is there a way to use the ocr_languages parameter with the API

Images and PDF support the ocr_languages parameter in unstructured. Is this supported by the API as well?

Test issue 2

test issue 3

consider packaging with poetry

Did you consider Poetry for packaging?
I would like to package this for NixOS (a Linux distribution), and Poetry makes building the application reproducible, which makes packaging for Linux distributions much easier.
Just wondering if you have anything against it or if there are requirements that poetry wouldn't meet.

Pass the file as a URL and not as a blob

It would be nice to have the option of passing the file as a downloadable URL rather than a blob, so I don't have to download the file on my application server and then send it to unstructured. This is more efficient and avoids the 32 MB body limit on some hosting platforms.

{'detail': 'API key is invalid, please provide a valid API key in the header.'}

bug: TXT-parse fails for web-app/docker-container

In our efforts to refine the Dockerfile in this PR, we found that both the web app and the container successfully parse some document types, but for .txt files they raise TypeError: cannot use a string pattern on a bytes-like object.

How to reproduce:

For the image/container: make docker-build && make docker-start-api
For the web app: make run-web-app
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/fake-text.txt'

Definition of done:

Successfully curl POST the .txt, .html and .eml documents in the sample-docs folder (web app and Docker container). The output should be a bunch of Unstructured elements.
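The error message in this issue is the one Python's `re` module raises when a pattern compiled from `str` is applied to `bytes`. A minimal sketch of that failure and one possible fix follows. It assumes the uploaded .txt content reaches the text partitioner as raw bytes, which the issue does not confirm, and `to_text` is a hypothetical helper.

```python
import re

WORD = re.compile(r"\w+")  # a str pattern, as typically used for text cleaning


def to_text(payload, encoding: str = "utf-8") -> str:
    """Return str content whether the upload arrived as bytes or str."""
    if isinstance(payload, (bytes, bytearray)):
        return bytes(payload).decode(encoding, errors="replace")
    return payload


raw = b"Hello unstructured world"

try:
    WORD.findall(raw)  # str pattern applied to bytes: raises TypeError
except TypeError as exc:
    error_message = str(exc)

words = WORD.findall(to_text(raw))  # decoding first avoids the error
```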

Test issue

TypeError: __init__() got an unexpected keyword argument 'detection_class_prob'

I tried uploading a 50-page PDF using the API and received this error:

TypeError: __init__() got an unexpected keyword argument 'detection_class_prob'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-199-16b05c99d0ba> in <module>
     10 )
     11 
---> 12 docs = loader.load()

/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in load(self)
     84     def load(self) -> List[Document]:
     85         """Load file."""
---> 86         elements = self._get_elements()
     87         if self.mode == "elements":
     88             docs: List[Document] = list()

/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in _get_elements(self)
    267 
    268     def _get_elements(self) -> List:
--> 269         return get_elements_from_api(
    270             file_path=self.file_path,
    271             api_key=self.api_key,

/opt/anaconda3/lib/python3.8/site-packages/langchain/document_loaders/unstructured.py in get_elements_from_api(file_path, file, api_url, api_key, **unstructured_kwargs)
    202         from unstructured.partition.api import partition_via_api
    203 
--> 204         return partition_via_api(
    205             filename=file_path,
    206             file=file,

/opt/anaconda3/lib/python3.8/site-packages/unstructured/partition/api.py in partition_via_api(filename, content_type, file, file_filename, api_url, api_key, **request_kwargs)
     86                 "metadata_filename must be specified as well.",
     87             )
---> 88         files = [
     89             ("files", (metadata_filename, file, content_type)),  # type: ignore
     90         ]

/opt/anaconda3/lib/python3.8/site-packages/unstructured/staging/base.py in elements_from_json(filename, text, encoding)
    117     """Loads a list of elements from a JSON file or a string."""
    118     exactly_one(filename=filename, text=text)
--> 119 
    120     if filename:
    121         with open(filename, encoding=encoding) as f:

/opt/anaconda3/lib/python3.8/site-packages/unstructured/staging/base.py in dict_to_elements(element_dict)
    100                     metadata=metadata,
    101                 ),
--> 102             )
    103 
    104     return elements

/opt/anaconda3/lib/python3.8/site-packages/unstructured/staging/base.py in isd_to_elements(isd)
     75 def isd_to_elements(isd: List[Dict[str, Any]]) -> List[Element]:
     76     """Converts an Initial Structured Data (ISD) dictionary to a list of elements."""
---> 77     elements: List[Element] = []
     78 
     79     for item in isd:

/opt/anaconda3/lib/python3.8/site-packages/unstructured/documents/elements.py in from_dict(cls, input_dict)
    171         if isinstance(self.filename, pathlib.Path):
    172             self.filename = str(self.filename)
--> 173 
    174         if self.filename is not None:
    175             file_directory, filename = os.path.split(self.filename)

Anyone seen this before?
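This kind of TypeError typically appears when a dictionary carries a key that the receiving class's `__init__` does not declare, for example when a server returns a metadata field that an older client library does not know about. Whether that is the cause here is an assumption. The sketch below reproduces the pattern with a hypothetical `Metadata` dataclass, which is a stand-in and not the library's real class.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for an older client-side metadata class."""
    filename: Optional[str] = None
    page_number: Optional[int] = None


def from_dict_lenient(payload: Dict[str, Any]) -> Metadata:
    """Build Metadata while ignoring keys the class does not declare."""
    known = {f.name for f in fields(Metadata)}
    return Metadata(**{k: v for k, v in payload.items() if k in known})


# A response from a newer server that includes an extra field.
payload = {"filename": "report.pdf", "page_number": 1, "detection_class_prob": 0.93}

try:
    Metadata(**payload)  # strict construction fails on the unknown key
except TypeError as exc:
    strict_error = str(exc)

meta = from_dict_lenient(payload)
```

If the mismatch really is between client and server versions, upgrading the client-side unstructured package is likely the more direct remedy than filtering keys.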

Internal server error - Extracting tables from a PDF file

Hi, I'm using the following request to the API to extract some tables from a PDF file:

curl --location --request POST 'http://localhost:8000/general/v0/general' \
--form 'strategy="hi_res"' \
--form 'pdf_infer_table_structure="true"' \
--form 'files=@"TelecomArgentina_Report 2020.pdf"' \
--form 'ocr_languages="eng"' \
--form 'skip_infer_table_types=""'

The request fails after 14 minutes, and I see the following in the logs:

2023-08-09 06:52:34,237 172.17.0.1:46490 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-08-09 06:52:34,238 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 192, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/notebook-user/prepline_general/api/general.py", line 591, in pipeline_1
    list(response_generator(is_multipart=False))[0]
  File "/home/notebook-user/prepline_general/api/general.py", line 535, in response_generator
    response = pipeline_api(
  File "/home/notebook-user/prepline_general/api/general.py", line 339, in pipeline_api
    elements = partition(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/auto.py", line 221, in partition
    elements = partition_pdf(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/documents/elements.py", line 222, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/file_utils/filetype.py", line 628, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 95, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 184, in partition_pdf_or_image
    layout_elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/utils.py", line 43, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 248, in _partition_pdf_or_image_local
    layout = process_data_with_model(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 344, in process_data_with_model
    layout = process_file_with_model(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 389, in process_file_with_model
    else DocumentLayout.from_file(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 109, in from_file
    page = PageLayout.from_image(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 315, in from_image
    page.get_elements_with_detection_model()
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 235, in get_elements_with_detection_model
    elements = self.get_elements_from_layout(inferred_layout)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 246, in get_elements_from_layout
    elements = [
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 247, in <listcomp>
    get_element_from_block(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 414, in get_element_from_block
    element.text = element.extract_text(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layoutelement.py", line 32, in extract_text
    text = super().extract_text(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 216, in extract_text
    text = aggregate_by_block(self, image, objects, ocr_strategy)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 316, in aggregate_by_block
    text = ocr(text_region, image, languages=ocr_languages)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 272, in ocr
    return agent.detect(cropped_image)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 122, in detect
    res = self._detect(image)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 89, in _detect
    res["text"] = pytesseract.image_to_string(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 423, in image_to_string
    return {
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 426, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 288, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 264, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 1425')

If I run the same query with the fast strategy, everything works fine, but the results are not acceptable.
I looked through the tesseract repo but could not find anything relevant. I would be glad for any help.
I am running the API locally through Docker, all on default settings.
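As the traceback shows, pytesseract raises TesseractError with `proc.returncode` from the tesseract subprocess. On POSIX systems, Python reports a child process killed by signal N as return code -N, so the -8 above may indicate that tesseract was terminated by signal 8 rather than exiting on its own. The helper below is a small sketch for decoding such codes. `describe_returncode` is a hypothetical name, and applying this reading to this particular failure is an assumption.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode > 0:
        return f"exited with status {returncode}"
    # Negative values mean "terminated by signal -returncode" on POSIX.
    try:
        name = signal.Signals(-returncode).name
    except ValueError:
        name = f"signal {-returncode}"
    return f"killed by {name}"
```

On Linux, `describe_returncode(-8)` reports SIGFPE, which would point to a crash inside the tesseract binary on a specific page image rather than to a Python-level problem.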

bug: filetypes as `SpooledTemporaryFile` for web-app api

In our efforts to refine the Dockerfile in this PR, we found that the web app successfully parses some document types, but for .pptx it raises AttributeError: 'SpooledTemporaryFile' object has no attribute 'seekable' (the same should happen on the main branch, but this has not been tested). The same occurs when the container parses .docx files.

How to reproduce:

For the image/container: make docker-build && make docker-start-api
For the web app: make run-web-app
curl -X 'POST' 'http://localhost:8000/general/v0.0.4/general' -F 'files=@sample-docs/fake.doc' (or using @sample-docs/fake-power-point.ppt or @sample-docs/fake.docx).

Definition of done:

Successfully curl POST the .pptx, .docx, .html and .eml documents in the sample-docs folder (web app and Docker container). The output should be a bunch of Unstructured elements.
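On Python versions before 3.11, `tempfile.SpooledTemporaryFile` does not implement `seekable()`, which some document libraries call on file-like inputs. One possible workaround, sketched here under the assumption that uploads are small enough to hold in memory, is to copy the upload into an `io.BytesIO` buffer before partitioning. `as_seekable` is a hypothetical helper name.

```python
import io
import tempfile


def as_seekable(upload) -> io.BytesIO:
    """Copy a file-like upload into an in-memory buffer that supports seekable()."""
    upload.seek(0)
    return io.BytesIO(upload.read())


# Simulate an uploaded file as the web framework would hand it over.
spooled = tempfile.SpooledTemporaryFile()
spooled.write(b"fake docx bytes")

buffer = as_seekable(spooled)
```

The copy doubles memory use for the duration of the request, so large uploads might be better served by writing to a named temporary file instead.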

Posting a bad .pdf results in a 500

A file that has either the .pdf extension or the Content-Type: application/pdf multipart/form header, but is not truly a PDF (for example, plain text), results in a 500 with a stack trace, shown below.

Instead, a 400 should be returned with a message indicating the issue. This should be the case for strategy=fast or strategy=hi_res.

In the future, the app should be able to process a payload based on inspection rather than filename (or Content-Type: multipart/form header) given the presence of another form parameter (e.g., infer-content-type-from-file).

  File "/Users/cragwolfe/r/unstructured-api/prepline_general/api/general.py", line 291, in pipeline_1
    response = pipeline_api(
  File "/Users/cragwolfe/r/unstructured-api/prepline_general/api/general.py", line 98, in pipeline_api
    elements = partition(
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/auto.py", line 82, in partition
    return partition_pdf(
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 50, in partition_pdf
    return partition_pdf_or_image(
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 111, in partition_pdf_or_image
    return _partition_pdf_with_pdfminer(
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/utils.py", line 40, in wrapper
    return func(*args, **kwargs)
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 213, in _partition_pdf_with_pdfminer
    elements = _process_pdfminer_pages(
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 240, in _process_pdfminer_pages
    for i, page in enumerate(PDFPage.get_pages(fp)):
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/pdfminer/pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/Users/cragwolfe/.pyenv/versions/unstapi/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 752, in __init__
    raise PDFSyntaxError("No /Root object! - Is this really a PDF?")
pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?
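One way to return a 400 instead of a 500 is to check for the `%PDF-` signature before handing the payload to the partitioner. The sketch below assumes a hypothetical `InvalidFileError` that the web layer would map to a 400 response. Neither it nor `ensure_pdf` exists in the project.

```python
import io

PDF_MAGIC = b"%PDF-"


class InvalidFileError(ValueError):
    """Hypothetical error that the web layer would translate into a 400."""


def ensure_pdf(file_obj, filename: str = "file") -> None:
    """Raise InvalidFileError unless the stream carries a PDF signature."""
    head = file_obj.read(1024)
    file_obj.seek(0)  # rewind so the partitioner sees the whole file
    # Some producers emit a few bytes before the header, so search the
    # first KiB rather than requiring the signature at offset 0.
    if PDF_MAGIC not in head:
        raise InvalidFileError(f"{filename} does not appear to be a valid PDF")


# A plain-text payload mislabeled as a PDF is rejected before parsing.
try:
    ensure_pdf(io.BytesIO(b"just some plain text"), "notes.pdf")
except InvalidFileError as exc:
    rejection = str(exc)
```

A signature check only catches obviously mislabeled files. A truncated or corrupt PDF would still need the parser's own exception to be caught and converted.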

Using paddle ocr backend instead of tesseract for hi_res models

Hello,

Is there a way to use the paddle ocr backend instead of tesseract? I want to speed up table extraction and noticed that the main bottleneck is OCR. I have a GPU and want to maximise its usage. What is the recommended approach?


how can i fix this problem
