This repository provides a guide and the necessary scripts to deploy Tesseract OCR in an AWS Lambda function using a Docker container to build the required binaries and the Serverless Framework for deployment.
Tesseract is a popular OCR (Optical Character Recognition) engine, and pytesseract is a Python wrapper for it. pytesseract only calls the Tesseract executable, which is not part of the AWS Lambda runtime, so installing pytesseract alone won't work out of the box. We need to compile Tesseract for Amazon Linux, the operating system Lambda runs on, and include it in our deployment package.
By following this guide, you will build Tesseract within an Amazon Linux Docker container, create a ZIP file with the necessary binaries and dependencies, and then deploy it to AWS Lambda as a layer.
- Docker
- Serverless Framework configured
When preparing to build Tesseract within a Docker container, it is essential to select an appropriate base image that aligns with your target AWS Lambda execution environment. The Dockerfile provided in this repository starts with a specific version of the Lambda base image for Python. Here is an example:
FROM public.ecr.aws/lambda/python:3.9-arm64
Ensure the Python version in the Dockerfile (python:3.9-arm64) matches the runtime configured for your Lambda function. Each Lambda base image mirrors the operating system and system libraries of its runtime, so binaries built on a different image may fail to load.
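The arm64 tag selects the ARM64 architecture, which can offer better price-performance than x86_64 for some AWS Lambda workloads. Binaries built for one architecture will not run on the other, so the Lambda function must be configured for arm64 as well (Lambda defaults to x86_64). Confirm that all dependencies, including Tesseract, support ARM64. For x86_64 workloads, switch the Dockerfile to an x86_64 base image.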
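To confirm which architecture a build container or a deployed function is actually running on, a small check like the following can help. This is a minimal sketch: the function name `lambda_architecture` and the alias table are illustrative and not part of this repository.

```python
import platform

# Map common platform.machine() values to the names Lambda uses.
_ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "x86_64",
    "amd64": "x86_64",
}


def lambda_architecture(machine=None):
    """Return 'arm64' or 'x86_64' for a platform.machine() style string."""
    raw = (machine or platform.machine()).lower()
    if raw not in _ARCH_ALIASES:
        raise ValueError(f"unrecognized machine type: {raw!r}")
    return _ARCH_ALIASES[raw]
```

Calling `lambda_architecture()` with no argument inspects the current machine, so logging its result once in the build container and once in the function makes a mismatch easy to spot.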
When building Tesseract OCR in Docker, you can include language-specific .traineddata files according to your requirements. Keep in mind that AWS Lambda limits a deployment package to 50 MB zipped for direct upload, and a function together with all of its layers to 250 MB unzipped. The .traineddata files can be quite large, and including too many can exceed these limits.
To choose where the language files come from, modify the TESSDATA_VERSION environment variable in the Dockerfile provided in this repository. There are three options for the source of the .traineddata files, which vary in file size and accuracy:
- tessdata: This is the standard version with a balance between size and accuracy.
- tessdata_fast: These are smaller, less accurate files optimized for speed and size.
- tessdata_best: These files are larger and more accurate but may cause you to reach the Lambda layer size limit more quickly.
Here's an excerpt from the Dockerfile where you can set the environment variable:
# Dockerfile
...
# Set the TESSDATA_VERSION to the desired tessdata repository
ENV TESSDATA_VERSION="tessdata"
# Alternatively, set TESSDATA_VERSION to "tessdata_fast" or "tessdata_best" for different versions
...
After setting TESSDATA_VERSION, list the languages you want to include in your Docker build environment. For example:
# Dockerfile
...
# Set the TESSDATA_LANGUAGES to the languages you want to include, separated by spaces
ENV TESSDATA_LANGUAGES="eng spa deu"
...
In the build script build_tesseract.sh, include a loop that downloads the specified .traineddata files for each language you set in TESSDATA_LANGUAGES. Here's an example snippet you could use:
#!/bin/bash
# build_tesseract.sh
# Loop through each language in the TESSDATA_LANGUAGES environment variable
for lang in $TESSDATA_LANGUAGES; do
    # Download the respective .traineddata file
    wget "https://github.com/tesseract-ocr/${TESSDATA_VERSION}/raw/main/${lang}.traineddata" -P path_to_tessdata_dir
done
# Continue with the rest of the build process
...
Remember to replace path_to_tessdata_dir with the actual path to the tessdata directory in your build environment.
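To preview which files the download loop would fetch for a given configuration, the URL pattern can be reproduced in a few lines of Python. This is only a sketch: `traineddata_urls` is a hypothetical helper, and it assumes the same URL layout as the snippet above.

```python
ALLOWED_VERSIONS = {"tessdata", "tessdata_fast", "tessdata_best"}


def traineddata_urls(version, languages):
    """Build one download URL per language, mirroring the shell loop.

    `languages` is a space-separated string such as "eng spa deu",
    the same format used for TESSDATA_LANGUAGES.
    """
    if version not in ALLOWED_VERSIONS:
        raise ValueError(f"unknown tessdata repository: {version!r}")
    return [
        f"https://github.com/tesseract-ocr/{version}/raw/main/{lang}.traineddata"
        for lang in languages.split()
    ]
```

Printing the result for your chosen TESSDATA_VERSION and TESSDATA_LANGUAGES values is a quick way to catch a misspelled language code before starting a full Docker build.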
By adjusting the TESSDATA_VERSION and TESSDATA_LANGUAGES environment variables, you can control which .traineddata files are included in your Tesseract build and manage the overall size of your Lambda layer. Always test your final build to ensure that it fits within the Lambda layer size limit and meets your performance expectations.
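One way to check the size of the language data before packaging is to add up the .traineddata files on disk. The sketch below is illustrative: the helper names are hypothetical, and `budget_mb` is whichever limit applies to your deployment. It measures uncompressed files, so treat the result as a rough estimate of what the language data adds to the layer.

```python
from pathlib import Path

BYTES_PER_MB = 1024 * 1024


def traineddata_size_mb(tessdata_dir):
    """Sum the size of all .traineddata files in a directory, in MB."""
    total_bytes = sum(
        path.stat().st_size for path in Path(tessdata_dir).glob("*.traineddata")
    )
    return total_bytes / BYTES_PER_MB


def fits_budget(tessdata_dir, budget_mb):
    """Return True if the language data stays within the given budget."""
    return traineddata_size_mb(tessdata_dir) <= budget_mb
```

Running this against the tessdata directory after each change to TESSDATA_LANGUAGES shows how much each additional language costs.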
Before building Tesseract, ensure you have Docker installed and running.
- Clone this repository to your local machine:
git clone https://github.com/leonbeckert/tesseract-lambda.git
cd tesseract-lambda
- Inside the repository, you will find a build script build_tesseract.sh. Make sure it's executable:
chmod +x build_tesseract.sh
- Use the following Docker commands to build the Tesseract binaries:
docker build -t tesseract .
docker run -v "$PWD/tesseract:/tmp/build" --entrypoint /tmp/build_tesseract.sh tesseract
- After the build process is complete, the Tesseract binaries will be located in the tesseract folder as tesseract.zip.
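Before turning the archive into a layer, it can be worth checking what it contains. The following sketch lists the top-level entries of the ZIP file. It assumes the archive has a tessdata directory at its root, which is what the TESSDATA_PREFIX value of /opt/tessdata used later in this guide implies. The helper names are hypothetical.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names found inside the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            {name.split("/", 1)[0] for name in archive.namelist() if name}
        )


def has_tessdata(zip_path):
    """Check that a tessdata directory sits at the root of the archive."""
    return "tessdata" in top_level_entries(zip_path)
```

If `has_tessdata("tesseract/tesseract.zip")` returns False, the language files were probably downloaded to a different directory than the one the build script packages.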
Once Tesseract has been built and the tesseract.zip file has been created inside the new tesseract directory, follow these steps to create your Lambda Layer:
- Navigate to the tesseract folder created by the Docker process and extract the contents of tesseract.zip:
cd tesseract
unzip tesseract.zip -d /path/to/your/lambda-layer/tesseract-layer
Replace /path/to/your/lambda-layer with the actual path to your Lambda Layer directory. This step will create a tesseract-layer directory within your Lambda Layer directory with all the necessary binaries and files for Tesseract.
- In your serverless.yml file, define the layer with its path and specify the runtimes that are compatible with the Tesseract binaries. Below is an example configuration:
# serverless.yml
functions:
  myFunction:
    handler: handler.myHandler
    layers:
      - {Ref: TesseractLambdaLayer}
layers:
  tesseract:
    path: lambda_layers/tesseract-layer
    compatibleRuntimes:
      - python3.9
In this configuration, myFunction is your AWS Lambda function that will use the layer. The layer itself is defined under layers, where you set the path to the extracted tesseract binaries and specify python3.9 as the compatible runtime.
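The name TesseractLambdaLayer in the Ref is not arbitrary: the Serverless Framework derives it from the layer key (tesseract) by capitalizing it and appending LambdaLayer. The sketch below reproduces that rule for simple alphanumeric keys only. `layer_ref_name` is a hypothetical helper, and keys containing hyphens or other special characters are normalized differently by the framework and are deliberately rejected here.

```python
def layer_ref_name(layer_key):
    """Derive the CloudFormation reference name for a simple layer key."""
    if not layer_key.isalnum():
        raise ValueError("this sketch only handles alphanumeric layer keys")
    return layer_key[0].upper() + layer_key[1:] + "LambdaLayer"
```

If you rename the layer in serverless.yml, the Ref under the function has to change accordingly, otherwise the deployment will fail on an unresolved reference.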
After adding the Lambda Layer, you need to set an environment variable in your Python script to point to the Tesseract data:
import os
os.environ['TESSDATA_PREFIX'] = '/opt/tessdata'
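Put together, a handler that uses the layer might look like the sketch below. Several things here are assumptions rather than part of this repository: the event fields `image_path` and `lang` are hypothetical, and pytesseract and Pillow are assumed to be packaged with the function. The function name matches the handler.myHandler entry from the serverless.yml example.

```python
import os


def configure_tessdata_prefix(layer_root="/opt"):
    """Point Tesseract at the language data shipped in the layer."""
    prefix = f"{layer_root}/tessdata"
    os.environ["TESSDATA_PREFIX"] = prefix
    return prefix


def myHandler(event, context):
    # Imported inside the handler so this module can still be loaded
    # (for example in tests) where pytesseract and Pillow are not installed.
    import pytesseract
    from PIL import Image

    configure_tessdata_prefix()
    # "image_path" and "lang" are hypothetical event fields for this sketch.
    image = Image.open(event["image_path"])
    text = pytesseract.image_to_string(image, lang=event.get("lang", "eng"))
    return {"text": text}
```

The language passed as `lang` must be one of those listed in TESSDATA_LANGUAGES at build time, since only those .traineddata files exist in the layer.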
After configuring your environment and script, deploy your AWS Lambda function with the Serverless Framework:
serverless deploy
or, using the shorter alias:
sls deploy
The Serverless Framework will handle the creation and deployment of the layer. It will also attach the layer to your Lambda function as specified in the serverless.yml file.
If you encounter any issues during deployment or execution:
- Ensure you have the right permissions set for Serverless Framework.
- Verify that the Python version of the base image in your Dockerfile matches the runtime in your serverless.yml.
- Check the Lambda function's logs for any runtime errors.
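When the logs point to a missing binary or language file, a small diagnostic run inside the function can narrow things down. The following is a sketch under stated assumptions: `diagnose` is a hypothetical helper, it expects the tesseract executable to be discoverable on PATH, and it falls back to /opt/tessdata when TESSDATA_PREFIX is not set.

```python
import os
import shutil
from pathlib import Path


def diagnose(languages, tessdata_dir=None):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if shutil.which("tesseract") is None:
        problems.append("tesseract binary not found on PATH")
    tessdata = Path(
        tessdata_dir or os.environ.get("TESSDATA_PREFIX", "/opt/tessdata")
    )
    for lang in languages:
        if not (tessdata / f"{lang}.traineddata").is_file():
            problems.append(f"missing {lang}.traineddata in {tessdata}")
    return problems
```

Logging the result of `diagnose(["eng"])` at the start of the handler shows whether the layer was attached and extracted as expected.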
This AWS Lambda Tesseract deployment guide is largely based on the concepts and procedures outlined in the blog post "Tesseract on AWS Lambda: OCR as a Service" by Typless. The content here has been adapted, improved upon, and updated to fix some errors and suit the current deployment strategies and best practices. I extend my heartfelt thanks to the Typless team for their original and valuable contribution to the community.