alpa-projects / mms

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23)
The pickle binary contains the profiling results, which can be loaded by `ProfilingDatabase` in `alpa_serve/profiling.py`.
Content:
model_name | batch_size | dp | op | pp |
---|---|---|---|---|
bert-1.3b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8 |
bert-2.6b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16, 32 |
bert-6.7b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16, 32 |
moe-1.3b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
moe-2.4b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
moe-7.1b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
moe-10.2b | 1, 2, 4, 8, 16 | 1 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
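As an illustration of how such a pickle might be loaded and queried with the standard library, here is a minimal sketch. The dictionary layout below (model name → `(dp, op, pp)` configuration → batch size → latency) and all the numbers in it are hypothetical; the authoritative schema is whatever `ProfilingDatabase` in `alpa_serve/profiling.py` defines.

```python
import os
import pickle
import tempfile

# Hypothetical layout of the profiling results (NOT the real schema):
# model_name -> (dp, op, pp) -> {batch_size: latency_in_seconds}
profile = {
    "bert-1.3b": {
        (1, 1, 1): {1: 0.10, 2: 0.15, 4: 0.24, 8: 0.41, 16: 0.75},
        (1, 2, 1): {1: 0.06, 2: 0.09, 4: 0.14, 8: 0.24, 16: 0.44},
    }
}

# Round-trip through a pickle file, as the repo's binary would be loaded.
path = os.path.join(tempfile.mkdtemp(), "profiling_result.pkl")
with open(path, "wb") as f:
    pickle.dump(profile, f)
with open(path, "rb") as f:
    db = pickle.load(f)

# Look up the latency of bert-1.3b at op=2 with batch size 8.
latency = db["bert-1.3b"][(1, 2, 1)][8]
print(latency)  # 0.24
```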
I found that the simulator is not very accurate in some cases of our goodput experiment.
Check out this branch: https://github.com/alpa-projects/mms/tree/inaccurate
python3 gen_data_goodput.py --mode simulate
cat *.tsv
results:
selective replication, goodput=0.330
model parallelism, goodput=0.599
python3 gen_data_goodput.py --mode run
cat *.tsv
results:
selective replication, goodput=0.325
model parallelism, goodput=0.456
The simulator is very accurate for selective replication, but not accurate for pipeline parallelism.
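To quantify the gap, the relative error between the simulated and measured goodput can be computed directly from the numbers reported above:

```python
# Goodput numbers copied from the experiment above.
sim  = {"selective replication": 0.330, "model parallelism": 0.599}
real = {"selective replication": 0.325, "model parallelism": 0.456}

# Relative error of the simulator against the real run, per policy.
rel_err = {p: abs(sim[p] - real[p]) / real[p] for p in sim}
for p, e in rel_err.items():
    print(f"{p}: relative error = {e:.1%}")
# Selective replication is off by ~1.5%, model parallelism by ~31%.
```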
Hi, I have a question.
According to the paper:
We evaluate the two placements when the requests to each model follow an independent Poisson process with an arrival rate of 1.5 request/s. Fig. 2a shows the cumulative distribution function (CDF) and average of request latency (which includes the GPU execution time and queuing delay). Model parallel placement reduces the average latency of the simple placement from 0.70s to 0.55s, a 1.3× speedup. The speedup comes from the better burst tolerance: when a burst arrives that exceeds the capability of a single GPU, simple placement must begin queuing requests. However, as long as the other model does not receive many requests, the model parallel placement can use both GPUs to serve the requests for the popular model via statistical multiplexing of the GPUs.
If I understand correctly, latency is only meaningful to compare under the same fixed load: given a fixed number of concurrent clients (or threads) that continuously send requests, compare the latency (or equivalently the throughput, which equals num_clients / avg_latency) between Simple Placement and Model Parallel. Specifically, assuming A == B and A0 == A1, and that A0, A1, B0, and B1 each occupy less than 50% of a GPU's utilization while running, the latency theoretically remains constant (except for the first request), and the two placements should have the same throughput and latency at concurrency == 2.
In this example, Model Parallel (A0 => A1, B0 => B1) can support at most 4 clients, while Simple Placement (A0A1 => B0B1) supports only 2 without caching. However, the sequential placement (A0 -> A1 on card 0, B0 -> B1 on card 1) can in fact also support up to 4 concurrent clients.
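The closed-loop reasoning above can be checked with Little's law: with N concurrent clients that each wait for a reply before sending the next request, throughput = N / avg_latency. The stage time and utilization numbers below are hypothetical, chosen only to match the "<50% of a GPU per stage" assumption in the question.

```python
# Hypothetical numbers to sanity-check the "4 clients max" claim.
t = 0.25        # wall-clock time per half-model stage (A0, A1, B0, or B1)
util = 0.5      # each stage occupies at most 50% of one GPU while running
latency = 2 * t             # two stages per request, assuming no queuing
gpu_work = 2 * util * t     # GPU-seconds consumed by one request

for clients in (2, 4):
    throughput = clients / latency      # requests/s (Little's law)
    demand = throughput * gpu_work      # GPU-seconds required per second
    print(f"{clients} clients: {throughput:.0f} req/s, demand {demand:.1f} GPUs")
# 4 clients yield demand == 2.0, exactly the two GPUs available, so 4
# concurrent clients is the queuing-free maximum under these assumptions.
```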
Environment:
Python Version: 3.9
Server Hardware: Two V100 GPUs
Ray Version: 2.8.0
Alpa Version: 1.0.0.dev0
Description:
While attempting to run illustrative_example.py in parallel mode on a server equipped with 2 V100 GPUs, I encountered a ValueError related to the run_controller function. The issue seems to stem from an existing actor name conflict in Ray.
I would appreciate any insights or solutions to resolve this error. Thank you for your support.
Hello Lianmin Zheng,
I would like to ask how the controller works.
Can the controller be combined with batching? For example, if the controller sends ten requests to group1, can those requests be batched so that the overall latency is shorter?
We can port the batching logic in #30 to the new fast simulator.
The assumption of the fast simulator is that all GPUs are FIFO streams, which is compatible with batching.
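To illustrate why FIFO streams are compatible with batching, here is a minimal, self-contained sketch (an illustration only, not the actual `alpa_serve` simulator): each GPU is modeled as a single FIFO timeline, and all requests already queued when the stream becomes free are merged into one batch that occupies the stream once. The batch cost model `lat` is hypothetical.

```python
def simulate_fifo_batching(arrivals, batch_latency, max_batch):
    """arrivals: sorted request arrival times; batch_latency(b): execution
    time of a batch of size b; returns the finish time of each request."""
    finish, stream_free, i = [], 0.0, 0
    while i < len(arrivals):
        # The stream starts the next batch when it is free and work exists.
        start = max(stream_free, arrivals[i])
        # Batch every request that has already arrived, up to max_batch.
        batch = [a for a in arrivals[i:i + max_batch] if a <= start]
        end = start + batch_latency(len(batch))
        finish.extend([end] * len(batch))
        stream_free = end
        i += len(batch)
    return finish

# Ten requests arriving at once: the first 8 run as one batch, then the
# remaining 2, instead of ten serial executions.
lat = lambda b: 0.1 + 0.02 * b   # hypothetical sub-linear batch cost
res = simulate_fifo_batching([0.0] * 10, lat, max_batch=8)
print([round(f, 3) for f in res])
```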
We can port the group selection, batching, and dropping logic here
mms/alpa_serve/simulator/controller.py
Lines 556 to 580 in efee5dc
Hi, I am interested in your nice work.
I want to obtain a parallel configuration for my server.
I read the code, but it is hard to find documentation or steps for AlpaServe (as opposed to Alpa).
Could you give some advice on running the AlpaServe system on a server?
(How should AlpaServe be used to obtain a parallel configuration?)
I have already installed the prerequisite packages (Ray and the other Python packages).
get_scores with ray