Comments (7)
A sample program to explain what I mean:
// Build with relocatable device code, which dynamic parallelism requires:
//   nvcc -rdc=true -o sample sample.cu
#include <cuda_runtime.h>
#include <cstdio>
#include <iostream>

__global__ void subKernel(int *data) {
    printf("Data before sub kernel: %d\n", *data);
    (*data) -= 1;
}

__global__ void addKernel(int *data) {
    printf("Data before add kernel: %d\n", *data);
    (*data) += 1;
}

struct UserData {
    int data;
    bool op;
};

__global__ void launchChildKernelFromDevice(void *_userData) {
    UserData *userData = (UserData *)_userData;
    // The branch is evaluated on the device every time this kernel runs,
    // so the child kernel can change between graph replays even though
    // only the parent launch is baked into the graph.
    if (userData->op) {
        addKernel<<<1, 1>>>(&userData->data);
    } else {
        subKernel<<<1, 1>>>(&userData->data);
    }
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    UserData *userData;
    cudaMallocHost(&userData, sizeof(UserData));  // pinned, device-accessible
    userData->data = 10;
    userData->op = true;

    // Run the launcher kernel once outside a graph as a sanity check.
    std::cout << "Data before kernel: " << userData->data << std::endl;
    launchChildKernelFromDevice<<<1, 1, 0, stream>>>(userData);
    cudaStreamSynchronize(stream);
    std::cout << "Data after kernel: " << userData->data << std::endl;

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Capture the launcher kernel into a graph; the device-side child
    // launch is not captured, only the parent launch is.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchChildKernelFromDevice<<<1, 1, 0, stream>>>(userData);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    printf("Data after graph capture: %d\n", userData->data);

    // Replay the graph.
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
    printf("Data after graph replay: %d\n", userData->data);

    // Flip the op and replay again: the device-side branch now picks subKernel.
    userData->op = false;
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
    printf("Data after graph replay with different op: %d\n", userData->data);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFreeHost(userData);  // pinned memory must be freed with cudaFreeHost
    return 0;
}
from flashinfer.
Hi @AgrawalAmey, thanks for bringing this up. I have some ideas about CUDA graph integration with flashinfer:
The kernels to be executed can be determined before a decode/prefill step (for all layers) by analyzing the shapes. We can capture a CUDA graph for each possible combination (not too many) ahead of time, and dispatch to one of them according to the shapes.
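To make the dispatch idea concrete, here is a minimal host-side sketch of a graph dispatch table keyed by bucketed shapes. This is an illustration only: `GraphHandle` stands in for `cudaGraphExec_t`, and the bucketing and key choice are assumptions, not flashinfer's actual scheme.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <cassert>

// Hypothetical opaque handle standing in for cudaGraphExec_t.
using GraphHandle = int;

// Bucket a sequence length up to the next power of two so the number of
// distinct keys (and hence pre-captured graphs) stays small.
uint32_t bucket(uint32_t seq_len) {
    uint32_t b = 1;
    while (b < seq_len) b <<= 1;
    return b;
}

// Dispatch table: (batch_size, bucketed seq_len) -> pre-captured graph.
std::map<std::pair<uint32_t, uint32_t>, GraphHandle> graph_table;

// Look up the graph captured ahead of time for this shape combination.
GraphHandle dispatch(uint32_t batch_size, uint32_t seq_len) {
    auto it = graph_table.find({batch_size, bucket(seq_len)});
    assert(it != graph_table.end() && "no graph captured for this shape");
    return it->second;
}
```

At runtime the selected handle would be replayed with `cudaGraphLaunch`; the table is populated once at startup by capturing each combination.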
Regarding dynamic parallelism:
introduce a launcher kernel which would factor in the input metadata and launch the actual CUDA kernel using dynamic parallelism
It sounds tricky to me because the required shared memory size and grid size vary across schedules.
Hi @yzh119!
I have one implementation in sarathi-serve which tries to list different combinations and capture them. But with increasing batch size and large variance in input sequences, the number of possibilities seemed to explode. Plus, clubbing prefill and decode requests together makes it even more challenging. The memory cost of CUDA graphs becomes too high as the number of combinations increases.
The child kernel/dynamic parallelism proposal aims to solve the challenge of differing grid sizes etc. Essentially, the launcher kernel will be triggered with a single warp. Inside the launcher kernel, we can determine all the launch params and launch the actual attention kernel.
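A sketch of the shape-dependent computation the single-warp launcher would perform before issuing the child kernel via dynamic parallelism. The metadata struct and tiling rule here are illustrative assumptions, not flashinfer's API; on-device the result would feed a `<<<grid, block, smem>>>` child launch.

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical per-step metadata the launcher kernel would read on-device.
struct AttnMetadata {
    uint32_t num_qo_heads;  // query/output heads
    uint32_t total_rows;    // total query rows across the batch
    uint32_t rows_per_cta;  // row-tile size chosen by the schedule
};

// Grid-size computation: one CTA per row tile, per head. The launcher
// evaluates this at replay time, so the child grid can track the actual
// batch shape even though the parent launch is fixed in the graph.
uint32_t compute_grid_x(const AttnMetadata &md) {
    uint32_t tiles = (md.total_rows + md.rows_per_cta - 1) / md.rows_per_cta;
    return tiles * md.num_qo_heads;
}
```

Shared memory size could be derived the same way and passed as the third launch parameter of the device-side child launch.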
Thanks for your explanation, that sounds reasonable.
To proceed, I'd love to write some documentation on our dispatching rules and see if we can express them with dynamic parallelism. Before that I have to finish #75, because it will affect our dispatching strategy.
I'll be glad to follow up next week and we can schedule a meeting on zoom (you can drop me an email at [email protected]).
Yes, that would be great. I will send out a when2meet link by email, thank you!
Hi, @AgrawalAmey, will your sarathi or sarathi-serve be open-sourced?
Hey @ZSL98, we are working with the vLLM team to get Sarathi-Serve scheduler support inside vLLM.