Comments (10)
We are able to run Unstructured in a container on AWS Lambda without issue (or, well, there are issues, but we can work around them.)
Things to consider (sorry that these points are a bit............unstructured):
- If your container image is large (e.g. 1 GB or more), the first invocation after you deploy a new image can take quite a while (e.g. 30 seconds) while Lambda downloads the image from AWS ECR. This only happens on the first invocation after a new image is deployed; after that, Lambda will have cached the image. It looks like you install NLTK models and libmagic libraries into the container, so your image is likely around 1 GB.
  - Since you mention that it's only the first invocation that's slow, this initial downloading/caching from ECR is probably what you're seeing.
  - Consider using multi-stage builds to reduce the number of new layers that Lambda has to download from ECR.
- I noticed you're not pre-downloading/installing any of the models that Unstructured uses to partition/read/parse PDF files. Unless you're using the "fast" strategy, it's likely that Unstructured will try to use a model to parse PDF files. Unstructured lazy-loads these models, so they're not downloaded until Unstructured needs them, which will take some time the first time. That's assuming you've configured your environment to download these models into `/tmp` (the only writable location in Lambda) with enough disk space. Also, note that Unstructured's current default model, `detectron2_onnx`, won't work in AWS Lambda because of an underlying issue in onnxruntime->pytorch->cpuinfo. Use the model-override parameter to specify chipper instead, though this will further balloon your container image. Consider using only the `fast` strategy instead, e.g. `partition(file, strategy="fast")`. Make sure to add tesseract to your container image.
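As a rough illustration of the points above, a handler could point cache directories at `/tmp` before importing Unstructured and then call `partition` with the `fast` strategy. This is only a minimal sketch: the choice of environment variables (`HF_HOME`, `XDG_CACHE_HOME`, `MPLCONFIGDIR`), the `configure_writable_caches` helper, and the `event["path"]` field are assumptions made for illustration, and which caches actually matter depends on the extras you install.

```python
import os
import tempfile

# Assumption: these are the cache locations worth redirecting. They are
# general-purpose variables (Hugging Face, XDG, matplotlib), not settings
# defined by Unstructured itself.
CACHE_ENV_VARS = ("HF_HOME", "XDG_CACHE_HOME", "MPLCONFIGDIR")


def configure_writable_caches(base_dir=None):
    """Create one cache directory per variable under base_dir and export it."""
    base = base_dir or tempfile.gettempdir()
    paths = {}
    for name in CACHE_ENV_VARS:
        path = os.path.join(base, name.lower())
        os.makedirs(path, exist_ok=True)
        os.environ[name] = path
        paths[name] = path
    return paths


def handler(event, context):
    # /tmp is the only writable location in Lambda, so set it up first.
    configure_writable_caches("/tmp")

    # Imported lazily so the environment variables above are already set.
    from unstructured.partition.auto import partition

    # Hypothetical event shape: {"path": "/tmp/XXXXX.pdf"}
    elements = partition(filename=event["path"], strategy="fast")
    return {"element_count": len(elements)}
```

Setting the variables before the import matters because libraries often read their cache locations once, at import time.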
If you're already accounting for the ECR download/caching time, one other thing you can try is to run a "fake" partition script during the build of your container image. This helps "warm up" any libraries/dependencies that run first-time setup tasks (like building/caching fonts or downloading models). For example, in the same way you "warm up" the NLTK libraries, you could add a RUN step:
COPY XXXXX.pdf /tmp/XXXXX.pdf
RUN [ "python3", "-c", "from unstructured.partition.auto import partition; partition(filename='/tmp/XXXXX.pdf')" ]
However, this may exacerbate the first point about container image size.
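If the one-liner grows, the same warm-up could live in a small Python module that the build step calls. The sketch below is an assumption about how one might structure it: `warm_up` and its `partition_fn` parameter are hypothetical names, and the only Unstructured call used is the `partition(filename=...)` call from the RUN step above.

```python
import time


def warm_up(sample_path, partition_fn=None):
    """Partition one sample file so first-use setup happens at build time.

    Returns (element_count, elapsed_seconds) and raises if nothing was
    extracted, so a broken warm-up fails the image build.
    """
    if partition_fn is None:
        # Imported lazily; this is the same call as in the RUN step.
        from unstructured.partition.auto import partition as partition_fn

    started = time.perf_counter()
    elements = partition_fn(filename=sample_path)
    elapsed = time.perf_counter() - started

    if not elements:
        raise RuntimeError(f"warm-up produced no elements for {sample_path}")
    return len(elements), elapsed
```

Saved as, say, `warmup.py` (a hypothetical filename) and copied into the image, it could be invoked with `RUN [ "python3", "-c", "from warmup import warm_up; print(warm_up('/tmp/XXXXX.pdf'))" ]`. Accepting `partition_fn` as a parameter is only there so the function can be exercised without Unstructured installed.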
from unstructured.
@cds-code can you describe how you are running `unstructured` in AWS Lambda?
from unstructured.
I'm running a Docker image in AWS Lambda:
FROM public.ecr.aws/docker/library/python:3.11.6-slim-bookworm
RUN apt-get update && apt-get install -y \
    # unstructured package requirements for file type detection
    libmagic-mgc libmagic1
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]
RUN [ "python3", "-c", "import nltk; nltk.download('averaged_perceptron_tagger', download_dir='/usr/local/nltk_data')" ]
pip install unstructured "unstructured[pdf]" "unstructured[docx]" "unstructured[xlsx]" "unstructured[pptx]" "unstructured[md]"
from unstructured.partition.auto import partition
partition(filename="XXXXX.pdf")
from unstructured.
Have you accounted for spin-up (cold-start) time of the Lambda instance? Like only start timing after receiving the first response?
Also, can you provide some specific timings?
from unstructured.
And how much memory is allocated to the Lambda instance?
from unstructured.
I have the same problem in AWS Batch running on Fargate. I allocated 2 vCPUs and 4 GB of memory.
from unstructured.
> Have you accounted for spin-up (cold-start) time of the Lambda instance? Like only start timing after receiving the first response?
> Also, can you provide some specific timings?
The timing does not include the Lambda cold-start time; it covers only the `partition(filename="XXXXX.pdf")` call.
I tried Lambda memory sizes of 1024 MB, 2048 MB, and 4096 MB, and the execution time was the same.
When the call is executed three times in a loop, only the first iteration is very slow.
It seems that the initialization of unstructured takes a lot of time in the AWS Lambda environment.
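For anyone who wants to reproduce this kind of measurement, a loop along the following lines would isolate the per-call time from the Lambda cold start. It is a minimal sketch, not the script used here: `time_calls` is a hypothetical helper that times any zero-argument callable.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - started)
    return durations
```

Inside the handler it could be used as `time_calls(lambda: partition(filename="XXXXX.pdf"))`, printing the returned list so that a slow first entry followed by fast ones stands out.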
```
| 2024-04-24T05:10:30.306Z | --1--- 2024-04-24 05:09:52.577591 ⇒loop 1
| 2024-04-24T05:10:30.306Z | --2--- 2024-04-24 05:10:07.032373 ⇒loop 1
| 2024-04-24T05:10:30.306Z | --1--- 2024-04-24 05:10:07.032521 ⇒loop 2
| 2024-04-24T05:10:30.306Z | --2--- 2024-04-24 05:10:08.106393 ⇒loop 2
| 2024-04-24T05:10:30.306Z | --1--- 2024-04-24 05:10:08.106490 ⇒loop 3
| 2024-04-24T05:10:30.306Z | --2--- 2024-04-24 05:10:08.840540 ⇒loop 3
```
from unstructured.
@cds-code what "extras" are you installing with `unstructured`? Like where you do `pip install unstructured[x]`, what `x` do you use? (Never mind, I see you posted that above :)
from unstructured.
Also, just out of curiosity, can you give me a sense of the cold-start times you've seen?
from unstructured.
This works for me, thanks.
COPY XXXXX.pdf /tmp/XXXXX.pdf
RUN [ "python3", "-c", "from unstructured.partition.auto import partition; partition(filename='/tmp/XXXXX.pdf')" ]
Does unstructured itself provide an initial-load method that does the same thing as the step above?
from unstructured.