
Comments (5)

abuccts commented on May 22, 2024

There's indeed a size issue; can you try whether 857a8ba fixes your case?

For the CUBLAS_STATUS_NOT_INITIALIZED issue, can you run successfully with a smaller batch? What's the total memory size of your GPU? It may be insufficient for this large batched GEMM.

from superbenchmark.

nishshah0 commented on May 22, 2024

It does work for smaller batch sizes. With batch = 64, the A, B, and C matrices are each 8 GB, for a total of 24 GB. The GPU I am running on has 80 GB of memory, so this is well below its capacity. I was able to verify using a different tool that batch = 64 does work, but it does not with superbench. It would be good to get it fixed in superbench so I can leverage the FP8 support.


abuccts commented on May 22, 2024

I just tried 857a8ba on a DGX H100 (driver 525.85.12, CUDA 12.0, Docker image superbench/superbench:v0.7.0-cuda11.8) and confirmed it works fine with batch sizes 64 and 128 by running /opt/superbench/bin/cublaslt_gemm -b [64/128] -m 8192 -k 8192 -n 8192 -i 1000 -t [bf16/fp16].

I was able to verify using a different tool that batch = 64 does work

Does this tool leverage cutlass, cublas, or cublaslt?

But once I get past cudaMalloc, cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED

I don't think cublaslt_gemm will call cublasCreate, do you mean cublasLtCreate?

Because CUBLAS_STATUS_NOT_INITIALIZED may indicate errors in the CUDA Runtime API called by the cuBLASLt routine, or a hardware setup problem, and I cannot reproduce the issue, can you set CUBLASLT_LOG_LEVEL=5 when running cublaslt_gemm to see whether you can get more information from your environment?

It would also be great if you could share more hardware/software information or try with the above driver/CUDA/image versions.


nishshah0 commented on May 22, 2024

Verified that the patch works. Also, I was able to run batch = 64 successfully; it turned out to be an issue with the device.

I am attaching the patch which has changes that helped me use the utility better:

  1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred.
  2. Added support to specify the device from the command line, so I can script around it and control which device is used.
  3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.

If you find it useful, feel free to check it in. I am unable to push the branch with these changes.
cublaslt_gemm_upgrade.patch


abuccts commented on May 22, 2024

Verified that the patch works. Also, I was able to run batch = 64 successfully; it turned out to be an issue with the device.

I am attaching the patch which has changes that helped me use the utility better:

  1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred.
  2. Added support to specify the device from the command line, so I can script around it and control which device is used.
  3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.

If you find it useful, feel free to check it in. I am unable to push the branch with these changes. cublaslt_gemm_upgrade.patch

Hi @nishshah-msft, I have created #503 to merge the size fix.

For your patch:

  1. The PR will also refine the error message with the function name and line number, thanks for your contribution!
  2. You can always use CUDA_VISIBLE_DEVICES to specify the device you want to use in any CUDA program, so adding a command-line argument seems unnecessary.
  3. If you run the benchmark through the superbench CLI instead of executing the binary directly, you will get parsed results with correct meanings; see the cublaslt-gemm metrics.

