
Comments (5)

abuccts commented on May 22, 2024

There's indeed a size issue; can you try whether 857a8ba fixes your case?

For the CUBLAS_STATUS_NOT_INITIALIZED issue, can you run successfully with a smaller batch? What's the total memory size of your GPU? It may be insufficient for this large batched GEMM.

from superbenchmark.

nishshah0 commented on May 22, 2024

It does work for smaller batch sizes. With batch = 64, the A, B, and C matrices are each 8 GB, for a total of 24 GB. The GPU I am running on has 80 GB of memory, so this is well below its capacity. I was able to verify using a different tool that batch = 64 does work, but it does not with superbench. It would be good to get it fixed in superbench so I can leverage the FP8 support.


abuccts commented on May 22, 2024

I just tried 857a8ba on a DGX H100 (driver 525.85.12, CUDA 12.0, Docker image superbench/superbench:v0.7.0-cuda11.8) and confirmed it works fine with batch sizes 64 and 128 by running /opt/superbench/bin/cublaslt_gemm -b [64/128] -m 8192 -k 8192 -n 8192 -i 1000 -t [bf16/fp16].

I was able to verify using a different tool that batch = 64 does work

Does this tool leverage cutlass, cublas, or cublaslt?

But once I get past cudaMalloc, cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED

I don't think cublaslt_gemm will call cublasCreate, do you mean cublasLtCreate?

Because CUBLAS_STATUS_NOT_INITIALIZED may indicate errors in the CUDA Runtime API called by the cuBLASLt routine, or a hardware setup problem, and I cannot reproduce the issue, can you set CUBLASLT_LOG_LEVEL=5 when running cublaslt_gemm to see whether you can get more information from your environment?

It would also be great if you could share more hardware/software information or try with the above driver/CUDA/image versions.


nishshah0 commented on May 22, 2024

Verified that the patch works. Also, I was able to run batch = 64 successfully; it turned out to be an issue with the device.

I am attaching the patch which has changes that helped me use the utility better:

  1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred.
  2. Added support to specify the device from the command line, so I can script around it and control which device is used.
  3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.

If you find it useful, feel free to check it in. I am unable to push the branch with these changes.
cublaslt_gemm_upgrade.patch


abuccts commented on May 22, 2024

Verified that the patch works. Also, I was able to run batch = 64 successfully; it turned out to be an issue with the device.

I am attaching the patch which has changes that helped me use the utility better:

  1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred.
  2. Added support to specify the device from the command line, so I can script around it and control which device is used.
  3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.

If you find it useful, feel free to check it in. I am unable to push the branch with these changes. cublaslt_gemm_upgrade.patch

Hi @nishshah-msft, I have created #503 to merge the size fix.

For your patch:

  1. The PR will also refine the error message with the function name and line number, thanks for your contribution!
  2. You can always use CUDA_VISIBLE_DEVICES to specify the device you want to use in any CUDA program, so adding a command-line argument seems unnecessary.
  3. If you run the benchmark through the superbench CLI instead of executing the binary directly, you will get parsed results with correct meanings; see the cublaslt-gemm metrics.

