ipa-arash's Introduction

Abstract

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs.

To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase.
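
As a rough illustration of the decision problem described above, the Python sketch below picks a model variant, batch size, and replica count for each pipeline stage so that a weighted accuracy-minus-cost score is maximized while the summed stage latency stays within the SLA and every stage keeps up with the arrival rate. It is not IPA's implementation: IPA uses Integer Programming, whereas this sketch uses plain exhaustive search, and the Variant class, the linear latency model, the alpha/beta weights, and every number are hypothetical assumptions made only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variant:
    """Hypothetical description of one model variant."""
    name: str
    accuracy: float      # assumed accuracy score in [0, 1]
    cores: int           # assumed CPU cores per replica
    base_latency: float  # assumed seconds to process a batch of size 1


def batch_latency(variant, batch):
    # Assumed linear latency model; a real system would use profiled values.
    return variant.base_latency * (0.5 + 0.5 * batch)


def throughput(variant, batch, replicas):
    # Requests per second that a stage can sustain with this configuration.
    return replicas * batch / batch_latency(variant, batch)


def best_config(stages, arrival_rate, sla, alpha=1.0, beta=0.05,
                batches=(1, 2, 4, 8), max_replicas=4):
    """Exhaustively search per-stage (variant, batch, replicas) choices.

    Returns the best tuple of choices, or None if nothing is feasible.
    """
    # Keep only the per-stage options that can sustain the arrival rate.
    options_per_stage = [
        [(v, b, r)
         for v in variants
         for b in batches
         for r in range(1, max_replicas + 1)
         if throughput(v, b, r) >= arrival_rate]
        for variants in stages
    ]
    best, best_score = None, float("-inf")
    for combo in product(*options_per_stage):
        # End-to-end latency is approximated as the sum of stage latencies.
        if sum(batch_latency(v, b) for v, b, _ in combo) > sla:
            continue
        accuracy = sum(v.accuracy for v, _, _ in combo)
        cost = sum(v.cores * r for v, _, r in combo)
        score = alpha * accuracy - beta * cost
        if score > best_score:
            best, best_score = combo, score
    return best


if __name__ == "__main__":
    demo_stages = [
        [Variant("small", 0.70, 1, 0.05), Variant("large", 0.90, 2, 0.20)],
        [Variant("small", 0.60, 1, 0.04), Variant("large", 0.85, 2, 0.15)],
    ]
    for v, b, r in best_config(demo_stages, arrival_rate=10.0, sla=1.0):
        print(v.name, "batch", b, "replicas", r)
```

Raising beta pushes the search toward cheaper configurations, and setting it to zero makes accuracy the only objective. Exhaustive search grows exponentially with the number of stages, which is one reason a solver-based formulation is preferable in practice.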

1 Project Setup Steps (Automated Artifact Evaluation - For JSys Reviewers)

  1. Go to infrastructure for the guide to setting up the K8S cluster and related dependencies; the complete installation takes ~30 minutes.

  2. After downloading the IPA data as explained in step 1, the logs of the experiments presented in the paper will be available in the directory data/results/final. To draw the figures in the paper, go to experiments/runner/notebooks. Each figure has its own Jupyter notebook, e.g. to draw Figure 8 of the paper (the pipeline figure), use experiments/runner/notebooks/paper-fig8-e2e-video.ipynb. The notebooks for the results presented in the revised version of the manuscript, which uses the new accuracy measure, start with the paper-revision prefix.

  3. If you don't want to use the logs and instead want to rerun the paper's main e2e experiments (e.g. the paper's Figure 8), follow the steps below. IPA uses YAML config files for running experiments; the config files used in the paper are stored in the data/configs/final folder. Depending on whether you want to regenerate the results of the initially submitted version or of the revised version of the manuscript, follow one of these routes:

For the initial submitted version results:

  1. Go to experiments/runner and run source run.sh. This takes ~7 hours, since each of the 20 experiments runs under a 20-minute load (20 * 20 = 400 minutes ~ 7 hours). The results and logs will be saved under ipa/data/results/final/20, and the final figure will be saved in ipa/data/figures as metaseries-20-video.pdf.
  2. Open the experiments/runner/notebooks/Jsys-reviewers.ipynb notebook to check that the generated figure matches the one from paper-fig8-e2e-video.ipynb, which was generated from the downloaded logs. Due to K8S and distributed scheduling uncertainties, there might be slight differences between the figures, as the figures below show for a sample run of the artifact evaluation, but the general trend should be the same.

For generating the revised version results:

  1. Go to experiments/runner and run source run-revised.sh. This takes ~7 hours, since each of the 20 experiments runs under a 20-minute load (20 * 20 = 400 minutes ~ 7 hours). The results and logs will be saved under ipa/data/results/final/21, and the final figure will be saved in ipa/data/figures as metaseries-21-video.pdf.
  2. Open the experiments/runner/notebooks/Jsys-reviewers-revised.ipynb notebook to check that the generated figure matches the one from paper-revision-fig8-e2e-video.ipynb, which was generated from the downloaded logs. Due to K8S and distributed scheduling uncertainties, there might be slight differences between the figures, as the figures below show for a sample run of the artifact evaluation, but the general trend should be the same.

Initial submission reproducibility

Paper figure: [Image: Figure 8 in the paper]
Artifact evaluation: [Image: Sample artifact evaluation figure]

Revised version submission reproducibility

Paper figure: [Image: Figure 8 in the paper]
Artifact evaluation: [Image: Sample artifact evaluation figure]

Experiment console

A typical log of an IPA run session:

[Image: experiment]

Kubernetes pod autoscaling

Pods being added/deleted by IPA autoconfiguration module:

[Image: log]

2 Modules Overview

Here is the mapping between code modules and the IPA description in the paper:

[Image: mapping]

  1. Model Loader and Object Store: At the entry point, IPA loads models into an object store so that containers can access them cluster-wide. IPA uses the MinIO object store.

  2. Pipeline System: The IPA inference pipeline management system uses a combination of open-source technologies and self-made modules. A forked version of MLServer is used as the backend of the serving platform for model containers and queues. Each of the five inference pipelines introduced in the paper is available in the pipelines folder. The pipeline containers are available in pipelines/mlserver-centralized, and the queue and router containers are available in queue and router. The router is the central request distributor that connects the model containers. The queue is the central queue for the stages of the inference pipeline.

  3. Adapter: This folder contains optimizer/adapter.py, the adapter module that periodically checks the state of the Kubernetes cluster and modifies it through the Kubernetes Python API. The logic of the Gurobi solver and of the pipeline simulation is also available in other files in the same folder.

  4. External Load Generation Module: This module is responsible for generating the different load patterns used in the paper. It uses load patterns from the Twitter trace dataset explained in the paper.

  5. Monitoring: The monitoring daemon uses the Prometheus time-series database for scraping the incoming load in the inference pipeline.

  6. Other Modules: The code for the other modules presented in the paper is available in the following folders:

    1. Offline profiler of latency of models under different model variants and core assignments
    2. LSTM load predictor
    3. Preprocessing of the Twitter dataset
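
To make the monitoring step in item 5 more concrete, the sketch below shows one way to read an incoming load value from Prometheus through its HTTP instant-query endpoint (/api/v1/query) using only the Python standard library. It is an assumption-laden illustration, not the code of the IPA monitoring daemon: the function names, the example PromQL expression, and the metric name pipeline_requests_total are hypothetical, and the network call itself is left out so that the parsing logic can be run offline.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Return the URL of a Prometheus instant query for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response.

    Prometheus returns each sample as [timestamp, "value-as-string"].
    """
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return [(series["metric"], float(series["value"][1]))
            for series in body["data"]["result"]]


def total_load(payload):
    """Sum the values of all returned series, e.g. requests per second."""
    return sum(value for _, value in parse_instant_vector(payload))


if __name__ == "__main__":
    # Hypothetical metric name; the real metric depends on the deployment.
    print(build_query_url("http://localhost:9090",
                          "rate(pipeline_requests_total[1m])"))
```

An adapter loop could fetch such a URL periodically (for example with urllib.request.urlopen) and pass the summed load to the optimizer as the arrival rate.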

3 Project Setup Steps (Manual)

  1. Go to infrastructure for the guide to setting up the K8S cluster and related dependencies; the complete installation takes ~30 minutes.

  2. IPA uses YAML config files for running experiments; the config files used in the paper are stored in the data/configs/final folder.

  3. To run a specific experiment and pipeline, refer to the relevant YAML file in the data/configs/final folder. Set the metaseries and series fields of the experiment to tag it. After setting the appropriate configs, go to experiments/runner and run the relevant config file, e.g.:

conda activate central
python runner_script.py --config-name sample-audio-qa

The logs of the experiment are now available at results/<metaseries>/<series>.
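
As a small convenience, the results folder for a given tag can be located programmatically. The sketch below assumes only the results/<metaseries>/<series> layout stated above; the helper names results_dir and list_logs, and the choice of the results root, are hypothetical and not part of the IPA code base.

```python
from pathlib import Path


def results_dir(results_root, metaseries, series):
    """Build the folder path for one experiment tag under the results root."""
    return Path(results_root) / str(metaseries) / str(series)


def list_logs(results_root, metaseries, series):
    """Return the sorted file names in the experiment folder, if it exists."""
    folder = results_dir(results_root, metaseries, series)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


if __name__ == "__main__":
    # Example tag; adjust the root to wherever your results are stored.
    print(results_dir("results", 20, 1))
    print(list_logs("results", 20, 1))
```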

Note: For now, we have provided all the configs used for the video pipelines for the artifact evaluation (explained in Section 1), along with samples from the other pipelines for interested users who wish to set up larger clusters for running the rest of the experiments. We are currently working on extending the automation described earlier for the video pipeline to the rest of the inference pipelines.
