Comments (5)
There's a size issue indeed; can you try whether 857a8ba fixes your case?
For the CUBLAS_STATUS_NOT_INITIALIZED issue, can you run successfully with a smaller batch? What's the total memory size of your GPU? It may be insufficient for this large batched GEMM.
from superbenchmark.
It does work for smaller batch sizes. With batch = 64, the A, B, and C matrices are 8 GB each, 24 GB in total. The GPU I am running on has 80 GB of memory, so this is well below its capacity. I was able to verify using a different tool that batch = 64 does work, but it does not with superbench. It would be good to get it fixed in superbench so I can leverage FP8 support.
I just tried 857a8ba on DGX H100 (driver 525.85.12, CUDA 12.0, Docker image superbench/superbench:v0.7.0-cuda11.8) and confirmed it works fine with batch sizes 64 and 128 by running /opt/superbench/bin/cublaslt_gemm -b [64/128] -m 8192 -k 8192 -n 8192 -i 1000 -t [bf16/fp16].
> I was able to verify using a different tool that batch = 64 does work

Does this tool leverage cutlass, cublas, or cublaslt?

> But once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED

I don't think cublaslt_gemm will call cublasCreate; do you mean cublasLtCreate?
Because CUBLAS_STATUS_NOT_INITIALIZED may indicate errors in CUDA Runtime API calls made by the cuBLASLt routine, or a hardware setup problem, and I cannot reproduce the issue, can you set CUBLASLT_LOG_LEVEL=5 when running cublaslt_gemm to see whether you can get more information from your environment?
It would also be great if you could share more hardware/software information, or try the driver/CUDA/image versions above.
Verified that the patch works. I was also able to run batch = 64 successfully; it turned out to be an issue with the device.
I am attaching a patch with changes that helped me use the utility better:
- When an error occurs, it prints exactly which function and line caused the error, instead of throwing a std::logic_error that carries no information about where the error occurred.
- Added support for specifying the device on the command line, so I can script around the tool and control which device is used.
- Added a header printf so the user knows what each output column means without reading the source file.
If you find it useful, feel free to check it in. I am unable to push a branch with these changes.
cublaslt_gemm_upgrade.patch
Hi @nishshah-msft, I have created #503 to merge the size fix.
For your patch:
- the PR will also refine the error message with the function name and line number, thanks for your contribution!
- you can always use CUDA_VISIBLE_DEVICES to specify the device you want to use in any CUDA program, so adding a command-line argument seems unnecessary
- if you run the benchmark through the superbench CLI instead of executing the binary directly, you will get parsed results with correct meanings; see the cublaslt-gemm metrics