alpa-projects / mms

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23)
The pickle binary contains the profiling results, which can be loaded by `ProfilingDatabase` in `alpa_serve/profiling.py`.
Content:
model_name | batch_size | dp | op | pp |
---|---|---|---|---|
bert-1.3b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8 |
bert-2.6b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16, 32 |
bert-6.7b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16, 32 |
moe-1.3b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
moe-2.4b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
moe-7.1b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
moe-10.2b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
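As an illustration of how such a pickle might be loaded and queried with the standard library, here is a minimal sketch. The dictionary layout below (model name → `(dp, op, pp)` configuration → batch size → latency) and all the numbers in it are hypothetical; the authoritative schema is whatever `ProfilingDatabase` in `alpa_serve/profiling.py` defines.

```python
import os
import pickle
import tempfile

# Hypothetical layout of the profiling results (NOT the real schema):
# model_name -> (dp, op, pp) -> {batch_size: latency_in_seconds}
profile = {
    "bert-1.3b": {
        (1, 1, 1): {1: 0.10, 2: 0.15, 4: 0.24, 8: 0.41, 16: 0.75},
        (1, 2, 1): {1: 0.06, 2: 0.09, 4: 0.14, 8: 0.24, 16: 0.44},
    }
}

# Round-trip through a pickle file, as the repo's binary would be loaded.
path = os.path.join(tempfile.mkdtemp(), "profiling_result.pkl")
with open(path, "wb") as f:
    pickle.dump(profile, f)
with open(path, "rb") as f:
    db = pickle.load(f)

# Look up the latency of bert-1.3b at op=2 with batch size 8.
latency = db["bert-1.3b"][(1, 2, 1)][8]
print(latency)  # 0.24
```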
I found that the simulator is not very accurate in some cases of our goodput experiment.
Check out this branch: https://github.com/alpa-projects/mms/tree/inaccurate
python3 gen_data_goodput.py --mode simulate
cat *.tsv
results:
selective replication, goodput=0.330
model parallelism, goodput=0.599
python3 gen_data_goodput.py --mode run
cat *.tsv
results:
selective replication, goodput=0.325
model parallelism, goodput=0.456
The simulator is very accurate for selective replication, but not accurate for pipeline parallelism.
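To quantify the gap, the relative error between the simulated and measured goodput can be computed directly from the numbers reported above:

```python
# Goodput numbers copied from the experiment above.
sim  = {"selective replication": 0.330, "model parallelism": 0.599}
real = {"selective replication": 0.325, "model parallelism": 0.456}

# Relative error of the simulator against the real run, per policy.
rel_err = {p: abs(sim[p] - real[p]) / real[p] for p in sim}
for p, e in rel_err.items():
    print(f"{p}: relative error = {e:.1%}")
# Selective replication is off by ~1.5%, model parallelism by ~31%.
```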
Hi, I have a question.
According to the paper:
We evaluate the two placements when the requests to each model follow an independent Poisson process with an arrival rate of 1.5 request/s. Fig. 2a shows the cumulative distribution function (CDF) and average of request latency (which includes the GPU execution time and queuing delay). Model parallel placement reduces the average latency of the simple placement from 0.70s to 0.55s, a 1.3× speedup. The speedup comes from the better burst tolerance: when a burst arrives that exceeds the capability of a single GPU, simple placement must begin queuing requests. However, as long as the other model does not receive many requests, the model parallel placement can use both GPUs to serve the requests for the popular model via statistical multiplexing of the GPUs.
If I understand correctly, latency is only meaningful to compare under the same fixed load: given a fixed number of concurrent clients (or threads) that continuously send requests, compare the latency (or equivalently the throughput, which equals num_clients / avg_latency) between Simple Placement and Model Parallel. Specifically, assuming A == B and A0 == A1, and that A0, A1, B0, and B1 each occupy less than 50% of a GPU's utilization while running, the latency theoretically remains constant (except for the first request), and the two placements should have the same throughput and latency at concurrency == 2.
In this example, Model Parallel (A0 => A1, B0 => B1) can support at most 4 clients, while Simple Placement (A0A1 => B0B1) supports only 2 without caching. However, the sequential placement (A0 -> A1 on card 0, B0 -> B1 on card 1) can in fact also support up to 4 concurrent clients.
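The closed-loop reasoning above can be checked with Little's law: with N concurrent clients that each wait for a reply before sending the next request, throughput = N / avg_latency. The stage time and utilization numbers below are hypothetical, chosen only to match the "<50% of a GPU per stage" assumption in the question.

```python
# Hypothetical numbers to sanity-check the "4 clients max" claim.
t = 0.25        # wall-clock time per half-model stage (A0, A1, B0, or B1)
util = 0.5      # each stage occupies at most 50% of one GPU while running
latency = 2 * t             # two stages per request, assuming no queuing
gpu_work = 2 * util * t     # GPU-seconds consumed by one request

for clients in (2, 4):
    throughput = clients / latency      # requests/s (Little's law)
    demand = throughput * gpu_work      # GPU-seconds required per second
    print(f"{clients} clients: {throughput:.0f} req/s, demand {demand:.1f} GPUs")
# 4 clients yield demand == 2.0, exactly the two GPUs available, so 4
# concurrent clients is the queuing-free maximum under these assumptions.
```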
Environment:
Python Version: 3.9
Server Hardware: Two V100 GPUs
Ray Version: 2.8.0
Alpa Version: 1.0.0.dev0
Description:
While attempting to run illustrative_example.py in parallel mode on a server equipped with 2 V100 GPUs, I encountered a ValueError related to the run_controller function. The issue seems to stem from an existing actor name conflict in Ray.
I would appreciate any insights or solutions to resolve this error. Thank you for your support.
Hello Lianmin Zheng,
I would like to ask how the controller works.
Can the controller be combined with batching? For example, if the controller sends ten requests to group1, can those requests be batched so that the overall latency is shorter?
We can port the batching logic in #30 to the new fast simulator.
The assumption of the fast simulator is that all GPUs are FIFO streams, which is compatible with batching.
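To illustrate why FIFO streams are compatible with batching, here is a minimal, self-contained sketch (an illustration only, not the actual `alpa_serve` simulator): each GPU is modeled as a single FIFO timeline, and all requests already queued when the stream becomes free are merged into one batch that occupies the stream once. The batch cost model `lat` is hypothetical.

```python
def simulate_fifo_batching(arrivals, batch_latency, max_batch):
    """arrivals: sorted request arrival times; batch_latency(b): execution
    time of a batch of size b; returns the finish time of each request."""
    finish, stream_free, i = [], 0.0, 0
    while i < len(arrivals):
        # The stream starts the next batch when it is free and work exists.
        start = max(stream_free, arrivals[i])
        # Batch every request that has already arrived, up to max_batch.
        batch = [a for a in arrivals[i:i + max_batch] if a <= start]
        end = start + batch_latency(len(batch))
        finish.extend([end] * len(batch))
        stream_free = end
        i += len(batch)
    return finish

# Ten requests arriving at once: the first 8 run as one batch, then the
# remaining 2, instead of ten serial executions.
lat = lambda b: 0.1 + 0.02 * b   # hypothetical sub-linear batch cost
res = simulate_fifo_batching([0.0] * 10, lat, max_batch=8)
print([round(f, 3) for f in res])
```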
We can port the group selection, batching, and dropping logic here
mms/alpa_serve/simulator/controller.py
Lines 556 to 580 in efee5dc
Hi, I am interested in your nice work.
I want to obtain a parallel configuration for my server.
I read the code, but it is hard to find documentation or steps for AlpaServe (as opposed to Alpa).
Could you give some advice on running the AlpaServe system on a server?
(How should AlpaServe be used to obtain a parallel configuration?)
I have already installed the prerequisite packages (Ray and the other Python packages).
get_scores with ray