Abstract

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs.

To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase.

1 Project Setup Steps (Automated Artifact Evaluation - For JSys Reviewers)

Go to the infrastructure for the guide to set up the K8S cluster and related depandancies, the complete installtion takes ~30 minutes.
After downloading ipa data explained in 1 the log of the experiments presented in the paper will be avialable in the directory data/results/final to draw the figures in the paper go to experiments/runner/notebooks to draw each figure presented in the paper. Each figure is organized in a different Jupyter notebook e.g. to draw the figure 8 of the paper pipeline figure experiments/runner/notebooks/paper-fig8-e2e-video.ipynb. The notebooks for the results presented in the revised version of the manuscaripts with the new accuracy measure starts with the paper-revision prefix.
If you don't want to use the logs and want to check the main paper e2e experiments (E.g. paper's figure 8) do the following steps. IPA use config yaml files for running experiments, the config files used in the paper are stored in the data/configs/final folder. Depending on whether you want to regenerate the initial version or the revised version of the manuscript do one of these routes:

For the initial submitted version results:

Go to the experiments/runner and run source run.sh, this will take ~7 hours since each of the 20 experiments is conducted on a 20 minute load (20 * 20 = 400 minutes ~ 7 hours). The results and logs will be saved under ipa/data/results/final/20 and the final figure will be in the ipa/data/figures under the name of metaseries-20-video.pdf
Go to the experiments/runner/notebooks/Jsys-reviewers.ipynb notebook to see the generated figure is same as the paper-fig8-e2e-video.ipynb that was generated from the downloaded log. Due to the K8S and distributed scheduling uncertainties there might be slight differences in the figures as shown below figures (for a sample run of the artifact evaluation) but the general trend should be the same.

For generating the revised version results:

Go to the experiments/runner and run source run-revised.sh, this will take ~7 hours since each of the 20 experiments is conducted on a 20 minute load (20 * 20 = 400 minutes ~ 7 hours). The results and logs will be saved under ipa/data/results/final/21 and the final figure will be in the ipa/data/figures under the name of metaseries-21-video.pdf
Go to the experiments/runner/notebooks/Jsys-reviewers-revised.ipynb notebook to see the generated figure is same as the paper-revision-fig8-e2e-video.ipynb that was generated from the downloaded log. Due to the K8S and distributed scheduling uncertainties there might be slight differences in the figures as shown below figures (for a sample run of the artifact evaluation) but the general trend should be the same.

Initial submission reproducibility

Figure 8 in the paper

Sample artifact evaluation figure

Revised version submission reproducibility

Figure 8 in the paper

Sample artifact evaluation figure

Experiment console

A typical log of an IPA run session:

Kubernetes pod autoscaling

Pods being added/deleted by IPA autoconfiguration module:

2 Modules Overview

Here is the mapping between code modules and the IPA description in the paper:

Model Loader and Object Store: At the entry, IPA loads models to an object storage for cluster wide access of models in containers. IPA uses Minio Object Store.
Pipeline System: IPA inference pipeline management system uses a combination of open source technologies and self made modules. A forked version of MLServer available in here is used as the backend of the serving platform of model containers and queues. Each of the five inference pipelines introduced in the paper are availalbe in pipelines folder. The pipelines containers are available in pipelines/mlserver-centralized, the containers of queue and router are also available in queue and router. The router is the central request distributer for making the connections between model containers. Queue is also the central queue for stage of the infernece pipeline.
Adapter This folder contains the optimizer/adapter.py which is the apater module that periodically checks the state of the Kuberntes cluster and modifies the state of the cluster through Kubernetes Python API. The logic of the Gurobi solver and simulating the pipeline are also available in other files in the same folder.
External Load generation module: This module is responsible for generating different load patterns in the paper. It uses load patterns from the Twitter trace dataset explained in the paper.
Monitoring The monitoring deamon uses Prometheus timeseries database for scrapping the incoming load in the inference pipeline.
Other Modules The code for other modules presented in the paper are available in the following folders:
1. Offline profiler of latency of models under different model variants and core assignments
2. LSTM load predictor
3. Preprocessing of the Twitter dataset

3 Project Setup Steps (Manual)

Go to the infrastructure for the guide to set up the K8S cluster and related depandancies, the complete installtion takes ~30 minutes.
IPA use config yaml files for running experiments, the config files used in the paper are stored in the data/configs/final folder.
To run a specific experiment and pipelines refer to the relevant yaml file in the data/configs/final folder. Set the metaseries and series field of the experiment for tagging this experiment. After setting the approperiate configs refer go to the experiments/runner and run the relevant config file e.g.:

conda activate central
python runner_script.py --config-name sample-audio-qa

The log of the experiments are now available at results/<metaseries>/<series> of the experiments.

Note: For now we have provided all the configs used for the video pipelines for the artifact evaluation (explained in 1) and samples from other pipelines for intetersted users who wish to setup larger clusters for running the rest of the experiements. We are currently working on making the same automation for the video pipeline explained earlier for the rest of the inference pipelines.

arashasg / ipa-arash Goto Github PK

ipa-arash's Introduction

Abstract

1 Project Setup Steps (Automated Artifact Evaluation - For JSys Reviewers)

Initial submission reproducibility

Revised version submission reproducibility

Experiment console

Kubernetes pod autoscaling

2 Modules Overview

3 Project Setup Steps (Manual)

ipa-arash's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent