
Testing lightgbm.dask


This repository can be used to test and develop changes to LightGBM's Dask integration. It contains the following useful features:

  • make recipes for building a local development image with lightgbm installed from a local copy, and Jupyter Lab running for interactive development
  • Jupyter notebooks for testing lightgbm.dask against a LocalCluster (multi-worker, single-machine) and a dask_cloudprovider.aws.FargateCluster (multi-worker, multi-machine)
  • make recipes for publishing a custom container image to ECR Public repository, for use with AWS Fargate


Getting Started

To begin, clone a copy of LightGBM to a folder LightGBM at the root of this repo. You can do this however you want, for example:

git clone --recursive git@github.com:microsoft/LightGBM.git LightGBM

If you're developing a reproducible example for an issue or you're testing a potential pull request, you probably want to clone LightGBM from your fork, instead of the main repo.


Develop in Jupyter

This section describes how to test a version of LightGBM in Jupyter.

1. Build the notebook image

Run the following to build an image that includes lightgbm, all its dependencies, and a JupyterLab setup.

make notebook-image

The first time you run this, it will take a few minutes, because this project has to build a base image with LightGBM's dependencies and compile the LightGBM C++ library.

Every time after that, make notebook-image should run very quickly.

2. Run a notebook locally

Start up Jupyter Lab! This command will run Jupyter Lab in a container using the image you built with make notebook-image.

make start-notebook

Navigate to http://127.0.0.1:8888/lab in your web browser.

The command make start-notebook mounts your current working directory into the running container. That means that even though Jupyter Lab is running inside the container, changes that you make in it will be saved on your local filesystem even after you shut the container down. So you can edit and create notebooks and other code in there with confidence!

When you're done with the notebook, stop the container by running the following from another shell:

make stop-notebook

Test with a LocalCluster

To test lightgbm.dask on a LocalCluster, run the steps in "Develop in Jupyter", then try out local.ipynb or your own notebooks.


Test with a FargateCluster

There are some problems with Dask code that only arise in a truly distributed, multi-machine setup. To test for these sorts of issues, I like to use dask-cloudprovider.

The steps below describe how to test a local copy of LightGBM on a FargateCluster from dask-cloudprovider.

1. Build the cluster image

Build an image that can be used for the scheduler and works in the Dask cluster you'll create on AWS Fargate. This image will have your local copy of LightGBM installed in it.

make cluster-image

2. Install and configure the AWS CLI

For the rest of the steps in this section, you'll need access to AWS resources. To begin, install the AWS CLI if you don't already have it.

pip install --upgrade awscli

Next, configure your shell to make authenticated requests to AWS. If you've never done this, you can see the AWS CLI docs.

The rest of this section assumes that the shell variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID have been set.

I like to set these by keeping them in a file

# file: aws.env
AWS_SECRET_ACCESS_KEY=your-key-here
AWS_ACCESS_KEY_ID=your-access-key-id-here

and then sourcing that file

set -o allexport
source aws.env
set +o allexport

3. Push the cluster image to ECR

To use the cluster image in the containers you spin up on Fargate, it has to be available in a container registry. This project uses the free AWS Elastic Container Registry (ECR) Public. For more information on ECR Public, see the AWS docs.

The command below will create a new repository on ECR Public, store the details of that repository in a file ecr-details.json, and push the cluster image to it. The cluster image will not contain your credentials, notebooks, or other local files.

make push-image

This may take a few minutes to complete.

4. Run the AWS notebook

Follow the steps in "Develop in Jupyter" to get a local Jupyter Lab running. Open aws.ipynb. That notebook contains sample code that uses dask-cloudprovider to provision a Dask cluster on AWS Fargate.

You can view the cluster's current state and its logs by navigating to the Elastic Container Service (ECS) section of the AWS console.

5. Clean Up

As you work on whatever experiment you're doing, you'll probably find yourself wanting to repeat these steps multiple times.

To remove the image you pushed to ECR Public and the repository you created there, run the following

make delete-repo

Then, repeat the steps above to rebuild your images and test again.


Run LightGBM unit tests

This repo makes it easy to run lightgbm's Dask unit tests in a containerized setup.

make lightgbm-unit-tests

Pass the variable DASK_VERSION to use a different version of dask / distributed.

make lightgbm-unit-tests \
    -e DASK_VERSION=2023.4.0

Profile LightGBM code

runtime profiling

To identify expensive parts of lightgbm's code paths, you can run its examples under cProfile and then visualize the profiling results with snakeviz.

make profile

Then navigate to http://0.0.0.0:8080/snakeviz/%2Fprofiling-output in your web browser.

memory profiling

To summarize memory allocations in typical uses of LightGBM, and to attribute those memory allocations to particular codepaths, you can run its examples under memray.

make profile-memory-usage

That will generate a bunch of HTML files. View them in your browser by running the following, then navigating to localhost:1234.

python -m http.server \
    --directory ./profiling-output/memory-usage \
    1234


Contributors: jameslamb

lightgbm-dask-testing's Issues

cannot build cluster image directly

From a totally clean repo, and with 0 images built yet...

make clean

...attempting to build the cluster image fails.

make cluster-image

Unable to find image 'lightgbm-dask-testing-notebook-base:2022.7.0' locally
docker: Error response from daemon: pull access denied for lightgbm-dask-testing-notebook-base, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
make: *** [LightGBM/lib_lightgbm.so] Error 125

It should be possible to build the cluster image directly.

Benefit of fixing this

cluster-image is used as the main, lightweight image in this project for tasks like "run LightGBM's unit tests" and "run LightGBM under memory profiling".

Such tasks shouldn't require that you also build the images with Jupyter notebook support.

reduce image sizes

The images built in this project have Dockerfiles that were optimized for build time without much concern for image size, because I mainly expected that this project would be used for local, interactive development.

However, excessive image sizes can cause issues even if you never need to push the image to a registry.


This project can be used to build 3 images:

make notebook-image
make base-image
make cluster-image

These are all currently 2+ GB on disk.

Things that might help

  • adopting some of the tips from https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html
  • walking back through the dependencies of these images (by examining FROM statements), pulling the contents of those upstream images into this project's Dockerfiles, and then removing things that are unnecessary
  • using conda clean --all
  • removing the LightGBM repo (which includes intermediate files in LightGBM/python-package/compile) after installation
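The last two ideas can be combined into a single RUN layer so that the intermediate files never land in an image layer at all. A hedged sketch (the requirements file name here is hypothetical):

```dockerfile
# sketch only: install and clean up in one layer so cached
# packages and the cloned repo are never committed to the image
RUN conda install --yes --file requirements.txt \
    && conda clean --all --yes \
    && rm -rf ./LightGBM
```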

How to test changes

make base-image
docker images \
    --filter=reference='lightgbm-dask-testing*' \
    --format "table {{.Repository}}\t{{.CreatedSince}}\t{{.Size}}"

This will produce a table like the following.

REPOSITORY                   CREATED          SIZE
lightgbm-dask-testing-base   12 minutes ago   2.13G

[ci] run CI on a schedule

This project contains some unpinned dependencies, which means that changes in the project's dependencies can break the project.

Today, continuous integration (CI) jobs are only run on commits in pull requests.

As a result, this situation is possible:

  • today: everything is working
  • 6 weeks from today: open a pull request and things totally unrelated to your PR are broken

That situation is not ideal: it makes maintenance work on the project "lumpy" instead of spread smoothly over time. It also means the project can break and stay broken for a while without maintainers noticing, which is especially likely in a project like this with very few users and little development activity.

How to fix this issue

Add a schedule to https://github.com/jameslamb/lightgbm-dask-testing/blob/main/.github/workflows/main.yml, running that pipeline once a week. Any day of the week / time is fine.

See https://docs.github.com/en/actions/reference/events-that-trigger-workflows#schedule for details on how to do this.

Alternatively, consult the example at https://github.com/uptake/pkgnet/blob/96ac2f1636f80a943f8a4fd6f2e93f870ff27ebb/.github/workflows/smoke-tests.yaml#L4-L10.
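A minimal schedule trigger might look like the following; the particular day and time are an arbitrary choice.

```yaml
# .github/workflows/main.yml
on:
  pull_request: {}
  schedule:
    # run once a week; Monday 08:00 UTC is arbitrary
    - cron: "0 8 * * 1"
```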

Notes

This will be even more valuable if / when #20 is added to the project's CI.

[ci] test image builds in CI

This project's continuous integration (CI) setup does not currently test that this project's images can be built successfully. That means that it's possible to merge a pull request that breaks the project. That's bad 😬

How to fix this issue

Add a step to https://github.com/jameslamb/lightgbm-dask-testing/blob/main/.github/workflows/main.yml that builds this project's images, and fails if docker build fails.

This step should use make targets rather than raw docker commands, so that it also serves as a test of the code that the documentation recommends running when using the project.

building cluster image is broken

Running make base-image cluster-image, building the cluster image fails with the following error.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 1.3.1 requires botocore<1.20.50,>=1.20.49, but you have botocore 1.20.109 which is incompatible.

allow customizing Python version

It should be possible to pass the desired Python version through as configuration.

This should be implemented similar to the way that DASK_VERSION is passed through this project.

Trying to re-build an image with the same Python version + Dask version should not actually re-run docker build.
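One possible shape for this, mirroring how DASK_VERSION is handled; the target name, variable name, and build arg here are hypothetical, not taken from this project's Makefile. Tagging the image by Python version means a rebuild with identical inputs resolves to an existing tag and Docker's layer cache, rather than re-running the build.

```make
# sketch: tag images by Python version so identical rebuilds hit the cache
PYTHON_VERSION ?= 3.11

base-image:
	docker build \
		--build-arg PYTHON_VERSION=$(PYTHON_VERSION) \
		-t lightgbm-dask-testing-base:$(PYTHON_VERSION) \
		.
```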

add profiling

Would be cool to add steps here making it easy to profile the memory usage of the lightgbm Python package.

It could be interesting to profile both the unit tests and more expensive training scripts.

References

https://anticdent.org/profiling-placement-in-docker.html

rewrite .dockerignore

As of this writing, this project's .dockerignore is written like "allow all files, except those matching these patterns".

# unnecessary LightGBM files
LightGBM/.appveyor.yml
LightGBM/build-cran-package.sh
LightGBM/build_r.R
LightGBM/.editorconfig
LightGBM/.github/*
LightGBM/helpers/*
LightGBM/.nuget/*
LightGBM/pmml/*
LightGBM/R-package/*
# key material
*.env
*.pem
*.pub
*.rdp
*_rsa
# exclusions
!image.env

It would be better and less error-prone to change that format to "ignore all files, except those matching these patterns".

Like this:

*
!python-package/lightgbm
!python-package/setup.py
!python-package/LICENSE
!python-package/README.rst
# etc., etc.

This should be written to ensure that only files needed to build the Python package are copied into this project's images.

How this improves lightgbm-dask-testing

  • keeps image sizes small, by reducing the risk of unnecessary files being added to images
  • increases the possibility of cache hits in local builds and pushes to remote registries
