Giter VIP home page Giter VIP logo

nlp_on_aws_lambda's Introduction

Spacy and Textacy on AWS Lambda (Applies to other packages too)

How to install and use spacy and textacy on aws lambda (works with python 3.8)

As spacy is a huge package, installing it on lambda has a lot of issues. We will use layers to install and use these packages. Also see https://aws.amazon.com/about-aws/whats-new/2018/11/aws-lambda-now-supports-custom-runtimes-and-layers/

In a nutshell

As of now, the lambda function deployment zip can be at max 50MB large, you can check limits here: https://docs.aws.amazon.com/lambda/latest/dg/limits.html This means that installing spacy packages which are typically >250MB in size is impossible. Therefore we try to make the function size as small as possible by offloading dependencies on to layers. Further optimization is done by removing non-English models and removing data that is not needed at runtime like cache files. The spacy data models are downloaded and then loaded separately at runtime from /tmp as that directory has a higher limit of 512MB

Steps are

  • Create a new clean virtualenv with python3.8
  • In the virtualenv, install the package using pip, ie. run pip install spacy
  • Check all dependencies using pip freeze and put them in a requirements.txt file
  • Remove dependencies that are satisfied by other layers, in my case, I have removed scipy as AWS provides these for us using a layer. The assumption is that the one provided by AWS will be maintained so one less thing to maintain and hopefully they'll optimize for size and speed of loading both? See details
  • Using a lambda docker image, install all requirements without further deps - See https://medium.com/@qtangs/creating-new-aws-lambda-layer-for-python-pandas-library-348b126e9f3e for instructions. Essentially, do sudo docker run --rm -v $(pwd):/foo -w /foo lambci/lambda:build-python3.8 pip install -r requirements.txt --no-deps -t python
  • In the generated folder, get a list of all __pycache__ directories and remove them. Something like find . -type d | grep __pycache__ would give you the candidates
  • You may need to strip the .so binaries using strip. Do find . -name '*.so' | sudo xargs strip if needed.
  • Optionally remove all directories with test in them, however networkx notoriously depends on tests, so don't remove tests under networkx (While we are not packaging numpy here, but if you do, then make sure you keep numpy/testing too)
  • Remove all languages except en from spacy/lang. This is the most crucial step as it reduces the size significantly
  • Creat the layer zip: zip -r9 spacyLayer.zip .
  • Upload to s3 or directly upload to layers from the console

Commands in a nutshell are

docker run -e CFLAGS="-g0 -Wl,--strip-all -Os -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" --rm -v $(pwd):/foo -w /foo lambci/lambda:build-python3.8 pip install -r requirements_compile.txt --no-deps --no-cache-dir --compile --global-option="build_ext"  --global-option="-j 4" -t python
docker run --rm -v $(pwd):/foo -w /foo lambci/lambda:build-python3.8 pip install -r requirements.txt --no-deps --no-cache-dir  -t python
cd python
find . -name '*.so' | sudo xargs strip
find . -name 'test*' -type d | grep -v networkx | grep -v numpy | sudo xargs rm -rf
sudo rm -rf numpy/tests
find . -type d | grep __pycache__ | sudo xargs rm -rf
# cleanup python/spacy/lang
cd spacy/lang
find . -name lemmatizer* | grep -v en | sudo xargs rm -rf
cd ../../..
zip -r9 spacyLayer2.1.0_38.zip .
aws s3 cp spacyLayer2.1.0_38.zip s3://...
  • Your lambda function needs to use the scipy layer mentioned above and your new layer you just created, both, with the scipy layer coming first (as spacy depends on that)
  • In your lambda code, you need to install the data, e.g.en_core_web_sm before loading spacy, so do
import os
import tarfile
import urllib.request

# Check to download only once
if not os.path.isdir("/tmp/en_core_web_sm-2.0.0"):
    urllib.request.urlretrieve(
        "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz",
        "/tmp/en_core_web_sm-2.0.0.tar.gz",
    )
    # Extract all data
    with open("/tmp/en_core_web_sm-2.0.0.tar.gz", "rb") as f:
        t = tarfile.open(fileobj=f)
        t.extractall(path="/tmp/")
    # Cleanup
    os.remove("/tmp/en_core_web_sm-2.0.0.tar.gz")
nlp = spacy.load("/tmp/en_core_web_sm-2.0.0/en_core_web_sm/en_core_web_sm-2.0.0")

Pytorch layer

https://course.fast.ai/deployment_aws_lambda.html

Shrinking layers

https://github.com/szelenka/shrink-linalg

Contact me: [email protected] or [email protected]

nlp_on_aws_lambda's People

Contributors

theashworld avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

weisisheng

nlp_on_aws_lambda's Issues

textacy not working

Hello,

Thanks for your post, it was very helpful for me. I manage to use spacy and many libraries, but textacy I cannot. How did you manage to get it work (like the title of your article suggests) ?

I have this error while trying to use textacy on the AWS lambda console

[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'spacy.lang.xx'

Any suggestion will be welcome !

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.