Comments (7)
A sample program to explain what I mean:
// Build with relocatable device code, which dynamic parallelism requires:
//   nvcc -rdc=true -o sample sample.cu
#include <cuda_runtime.h>
#include <cstdio>
#include <iostream>

__global__ void subKernel(int *data) {
    printf("Data before sub kernel: %d\n", *data);
    (*data) -= 1;
}

__global__ void addKernel(int *data) {
    printf("Data before add kernel: %d\n", *data);
    (*data) += 1;
}

struct UserData {
    int data;
    bool op;
};

__global__ void launchChildKernelFromDevice(void *_userData) {
    UserData *userData = (UserData *)_userData;
    // The branch is evaluated on the device every time this kernel runs,
    // so the child kernel can change between graph replays even though
    // only the parent launch is baked into the graph.
    if (userData->op) {
        addKernel<<<1, 1>>>(&userData->data);
    } else {
        subKernel<<<1, 1>>>(&userData->data);
    }
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    UserData *userData;
    cudaMallocHost(&userData, sizeof(UserData));  // pinned, device-accessible
    userData->data = 10;
    userData->op = true;

    // Run the launcher kernel once outside a graph as a sanity check.
    std::cout << "Data before kernel: " << userData->data << std::endl;
    launchChildKernelFromDevice<<<1, 1, 0, stream>>>(userData);
    cudaStreamSynchronize(stream);
    std::cout << "Data after kernel: " << userData->data << std::endl;

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Capture the launcher kernel into a graph; the device-side child
    // launch is not captured, only the parent launch is.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchChildKernelFromDevice<<<1, 1, 0, stream>>>(userData);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    printf("Data after graph capture: %d\n", userData->data);

    // Replay the graph.
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
    printf("Data after graph replay: %d\n", userData->data);

    // Flip the op and replay again: the device-side branch now picks subKernel.
    userData->op = false;
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
    printf("Data after graph replay with different op: %d\n", userData->data);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFreeHost(userData);  // pinned memory must be freed with cudaFreeHost
    return 0;
}
from flashinfer.
Hi @AgrawalAmey, thanks for bringing this up. I have some ideas about CUDA graph integration with flashinfer:
The kernels to be executed can be determined before a decode/prefill step (for all layers) by analyzing the shapes. We can capture a CUDA graph for each possible combination (not too many) ahead of time, and dispatch to one of them according to the shapes.
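To make the dispatch idea concrete, here is a minimal host-side sketch of a graph dispatch table keyed by bucketed shapes. This is an illustration only: `GraphHandle` stands in for `cudaGraphExec_t`, and the bucketing and key choice are assumptions, not flashinfer's actual scheme.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <cassert>

// Hypothetical opaque handle standing in for cudaGraphExec_t.
using GraphHandle = int;

// Bucket a sequence length up to the next power of two so the number of
// distinct keys (and hence pre-captured graphs) stays small.
uint32_t bucket(uint32_t seq_len) {
    uint32_t b = 1;
    while (b < seq_len) b <<= 1;
    return b;
}

// Dispatch table: (batch_size, bucketed seq_len) -> pre-captured graph.
std::map<std::pair<uint32_t, uint32_t>, GraphHandle> graph_table;

// Look up the graph captured ahead of time for this shape combination.
GraphHandle dispatch(uint32_t batch_size, uint32_t seq_len) {
    auto it = graph_table.find({batch_size, bucket(seq_len)});
    assert(it != graph_table.end() && "no graph captured for this shape");
    return it->second;
}
```

At runtime the selected handle would be replayed with `cudaGraphLaunch`; the table is populated once at startup by capturing each combination.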
Regarding dynamic parallelism:
introduce a launcher kernel which would factor in the input metadata and launch the actual CUDA kernel using dynamic parallelism
It sounds tricky to me because the required shared memory size and grid size vary across schedules.
Hi @yzh119!
I have one implementation in sarathi-serve which tries to list different combinations and capture them. But with increasing batch size and large variance in input sequences, the number of possibilities seemed to explode. Plus, clubbing prefill and decode requests together makes it even more challenging. The memory cost of CUDA graphs becomes too high as the number of combinations increases.
The child kernel/dynamic parallelism proposal aims to solve the challenge of differing grid sizes etc. Essentially, the launcher kernel will be triggered with a single warp. Inside the launcher kernel, we can determine all the launch params and launch the actual attention kernel.
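A sketch of the shape-dependent computation the single-warp launcher would perform before issuing the child kernel via dynamic parallelism. The metadata struct and tiling rule here are illustrative assumptions, not flashinfer's API; on-device the result would feed a `<<<grid, block, smem>>>` child launch.

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical per-step metadata the launcher kernel would read on-device.
struct AttnMetadata {
    uint32_t num_qo_heads;  // query/output heads
    uint32_t total_rows;    // total query rows across the batch
    uint32_t rows_per_cta;  // row-tile size chosen by the schedule
};

// Grid-size computation: one CTA per row tile, per head. The launcher
// evaluates this at replay time, so the child grid can track the actual
// batch shape even though the parent launch is fixed in the graph.
uint32_t compute_grid_x(const AttnMetadata &md) {
    uint32_t tiles = (md.total_rows + md.rows_per_cta - 1) / md.rows_per_cta;
    return tiles * md.num_qo_heads;
}
```

Shared memory size could be derived the same way and passed as the third launch parameter of the device-side child launch.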
Thanks for your explanation, that sounds reasonable.
To proceed, I'd love to write some documentation on our dispatching rules and see if we can express them with dynamic parallelism. Before that I have to finish #75, because it will affect our dispatching strategy.
I'll be glad to follow up next week and we can schedule a meeting on zoom (you can drop me an email at [email protected]).
Yes, that would be great. I will send out a when2meet link by email, thank you!
Hi, @AgrawalAmey, will your sarathi or sarathi-serve be open-sourced?
Hey @ZSL98, we are working with the vLLM team to get Sarathi-Serve scheduler support inside vLLM.