broncotc / bitsandbytes-rocm

License: MIT License

bitsandbytes-rocm's Introduction

bitsandbytes

bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

TL;DR

Requirements: Linux distribution (Ubuntu, etc.) + CUDA >= 10.0. LLM.int8() requires Turing or Ampere GPUs.
Installation: pip install bitsandbytes

Using 8-bit optimizer:

Comment out optimizer: #torch.optim.Adam(....)
Add 8-bit optimizer of your choice bnb.optim.Adam8bit(....) (arguments stay the same)
Replace embedding layer if necessary: torch.nn.Embedding(..) -> bnb.nn.Embedding(..)

Using 8-bit Inference:

Comment out torch.nn.Linear: #linear = torch.nn.Linear(...)
Add bnb 8-bit linear light module: linear = bnb.nn.Linear8bitLt(...) (base arguments stay the same)
There are two modes:
- Mixed 8-bit training with 16-bit main weights. Pass the argument has_fp16_weights=True (default)
- Int8 inference. Pass the argument has_fp16_weights=False
To use the full LLM.int8() method, use the threshold=k argument. We recommend k=6.0.

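import torch
import bitsandbytes as bnb

# LLM.int8()
# dim1, dim2 and x are placeholders for your layer sizes and input tensor
linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
# inputs need to be fp16
out = linear(x.to(torch.float16))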

Features

8-bit Matrix multiplication with mixed precision decomposition
LLM.int8() inference
8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB (saves 75% memory)
Stable Embedding Layer: Improved stability through better initialization, and normalization
8-bit quantization: Quantile, Linear, and Dynamic quantization
Fast quantile estimation: Up to 100x faster than other algorithms
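
The 75% figure for the 8-bit optimizers follows from simple arithmetic. The sketch below assumes Adam keeps two state tensors per parameter and ignores the small per-block quantization constants; the function name and the 1-billion-parameter model are illustrative only.

```python
def adam_state_bytes(num_params: int, bits_per_state: int, num_states: int = 2) -> int:
    """Bytes of optimizer state, assuming `num_states` tensors per parameter."""
    return num_params * num_states * bits_per_state // 8


num_params = 1_000_000_000  # hypothetical 1B-parameter model
fp32_bytes = adam_state_bytes(num_params, bits_per_state=32)
int8_bytes = adam_state_bytes(num_params, bits_per_state=8)
saving = 1 - int8_bytes / fp32_bytes

print(f"32-bit Adam state: {fp32_bytes / 1e9:.1f} GB")
print(f"8-bit Adam state:  {int8_bytes / 1e9:.1f} GB")
print(f"saving: {saving:.0%}")
```

The saving is independent of model size, since both figures scale linearly with the number of parameters.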

Requirements & Installation

Requirements: anaconda, cudatoolkit, pytorch

Hardware requirements:

LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
8-bit optimizers and quantization: NVIDIA Maxwell GPU or newer (>=GTX 9XX).

Supported CUDA versions: 10.2 - 11.7

The bitsandbytes library is currently only supported on Linux distributions. Windows is not supported at the moment.

The requirements can best be fulfilled by installing pytorch via anaconda. You can install PyTorch by following the "Get Started" instructions on the official website.

Using bitsandbytes

Using Int8 Matrix Multiplication

For straight Int8 matrix multiplication with mixed precision decomposition you can use bnb.matmul(...). To enable mixed precision decomposition, use the threshold parameter:

bnb.matmul(..., threshold=6.0)
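
To illustrate what the threshold does, here is a minimal pure-Python sketch of mixed precision decomposition. It is not the library's kernel: it assumes row-wise absmax quantization for the inputs, column-wise for the weights, and treats any input feature dimension containing a value with magnitude at or above the threshold as an outlier that stays in floating point. All function names are hypothetical.

```python
def quantize_rows(rows):
    """Absmax-quantize each row to the int8 range; returns (int rows, per-row scales)."""
    q_rows, scales = [], []
    for row in rows:
        absmax = max((abs(v) for v in row), default=0.0) or 1.0
        scale = absmax / 127.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def matmul_mixed(x, w, threshold=6.0):
    """Compute x @ w for x: [n][k], w: [k][m], keeping outlier dims of x in float."""
    n, k, m = len(x), len(w), len(w[0])
    is_outlier = [any(abs(x[i][j]) >= threshold for i in range(n)) for j in range(k)]
    regular = [j for j in range(k) if not is_outlier[j]]
    outliers = [j for j in range(k) if is_outlier[j]]

    # Int8 path: quantize rows of x and columns of w over the regular dims only.
    xq, xs = quantize_rows([[x[i][j] for j in regular] for i in range(n)])
    wq, ws = quantize_rows([[w[j][c] for j in regular] for c in range(m)])

    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            acc = sum(a * b for a, b in zip(xq[i], wq[c]))  # integer accumulation
            out[i][c] = acc * xs[i] * ws[c]                 # dequantize
            out[i][c] += sum(x[i][j] * w[j][c] for j in outliers)  # float path
    return out


x = [[0.5, -1.0, 8.0], [1.5, 0.25, -0.5]]  # third feature dim holds an outlier
w = [[1.0, -2.0], [0.5, 0.25], [2.0, 1.0]]
exact = [[sum(x[i][j] * w[j][c] for j in range(3)) for c in range(2)] for i in range(2)]
approx = matmul_mixed(x, w, threshold=6.0)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print("max abs error:", max_err)
```

Because the large value never passes through the quantizer, it does not inflate the absmax scale of its row, which is the motivation for splitting it out.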

For instructions on how to use LLM.int8() inference layers in your own code, see the TL;DR above; for extended instructions, see this blog post.

Using the 8-bit Optimizers

With bitsandbytes, 8-bit optimizers can be used by changing a single line of code in your codebase. For NLP models, we also recommend using the StableEmbedding layer (see below), which improves results and helps with stable 8-bit optimization. To get started with 8-bit optimizers, it is sufficient to replace your old optimizer with the 8-bit optimizer in the following way:

import bitsandbytes as bnb

# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent


# recommended for NLP models: torch.nn.Embedding(...) -> bnb.nn.StableEmbedding(...)

Note that by default all parameter tensors with fewer than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done because such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:

# parameter tensors with fewer than 16384 values are optimized in 32-bit
# it is recommended to use multiples of 4096
adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
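
To see which tensors a given min_8bit_size would leave in 32-bit, a small stand-alone sketch like the one below can help. It only mirrors the size rule described above and does not call bitsandbytes; the function name and the layer sizes are hypothetical.

```python
def split_by_state_precision(param_sizes, min_8bit_size=4096):
    """Partition parameter names by optimizer-state precision using the size rule."""
    eight_bit = [name for name, numel in param_sizes.items() if numel >= min_8bit_size]
    thirty_two_bit = [name for name, numel in param_sizes.items() if numel < min_8bit_size]
    return eight_bit, thirty_two_bit


# Hypothetical model: number of elements per parameter tensor
param_sizes = {
    "embed.weight": 50_000 * 768,
    "fc.weight": 768 * 768,
    "fc.bias": 768,
    "layer_norm.weight": 768,
}

eight_bit, thirty_two_bit = split_by_state_precision(param_sizes)
print("8-bit state: ", eight_bit)
print("32-bit state:", thirty_two_bit)
```

With a real model, the sizes could be collected from the model's named parameters instead of being written by hand.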

Change Bits and other Hyperparameters for Individual Parameters

If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the GlobalOptimManager. With it, you can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, two things are needed: (1) register the parameters while they are still on the CPU, and (2) override the config with the new desired hyperparameters (anytime, anywhere). See our guide for more details.
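
The two steps might look like the following sketch. It is based on the GlobalOptimManager calls described in the upstream bitsandbytes guide (get_instance, register_parameters, override_config); treat the exact signatures as an assumption and check them against your installed version. The wrapper function and its argument names are hypothetical.

```python
def build_optimizer_with_override(model, high_precision_param):
    """Sketch: 8-bit Adam for all parameters except one kept at 32-bit.

    `model` is expected to still be on the CPU when this is called, and
    `high_precision_param` is one of its parameters (e.g. an unstable weight).
    """
    # Imported lazily so the sketch can be loaded on a machine without a GPU setup.
    import bitsandbytes as bnb

    manager = bnb.optim.GlobalOptimManager.get_instance()
    manager.register_parameters(model.parameters())  # (1) register while on the CPU

    model = model.cuda()
    optimizer = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

    # (2) override the config: this parameter now uses 32-bit optimizer state
    manager.override_config(high_precision_param, "optim_bits", 32)
    return model, optimizer
```

A call could then look like build_optimizer_with_override(model, model.fc1.weight), where fc1 stands in for whichever layer needs the higher precision.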

Fairseq Users

To use the Stable Embedding Layer, override the respective build_embedding(...) function of your model. Make sure to also use the --no-scale-embedding flag to disable scaling of the word embedding layer (now replaced with layer norm). You can use the optimizers by replacing the optimizer in the respective file (adam.py etc.).

Release and Feature History

For upcoming features, changes, and the full history, see the Patch Notes.

Errors

RuntimeError: CUDA error: no kernel image is available for execution on the device. Solution
_fatbinwrap.. Solution

Compile from source

To compile from source, please follow the compile_from_source.md instructions.

License

The majority of bitsandbytes is licensed under MIT; however, portions of the project are available under separate license terms: PyTorch is licensed under the BSD license.

We thank Fabio Cannizzo for his work on FastBinarySearch which we use for CPU quantization.

How to cite us

If you found this library and LLM.int8() useful, please consider citing our work:

@article{dettmers2022llmint8,
  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2208.07339},
  year={2022}
}

For 8-bit optimizers or quantization routines, please consider citing the following work:

@article{dettmers2022optimizers,
  title={8-bit Optimizers via Block-wise Quantization},
  author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
  journal={9th International Conference on Learning Representations, ICLR},
  year={2022}
}

bitsandbytes-rocm's Issues

Is this still viable under ROCm 5.6.0?

There have been no updates in an extremely long time despite all the modern ROCm releases, so I just wanted to know.

CUDA Setup failed despite GPU being available (RX 6900XT)

CUDA SETUP: Setup Failed!
CUDA SETUP: Setup Failed!
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
<make_cmd here, commented out>
python setup.py install
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/dockerx/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/dockerx/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/dockerx/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg/bitsandbytes/functional.py", line 15, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/dockerx/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg/bitsandbytes/cextension.py", line 67, in <module>
    raise RuntimeError('''
RuntimeError: 
        CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs aboveto fix your environment!
        If you cannot find any issues and suspect a bug, please open an issue with detals about your environment:
        https://github.com/TimDettmers/bitsandbytes/issues

no matching member function for call to 'BlockedToStriped'

Hi, I'm using a Hygon DCU Z100 (an AMD-like GPU) on CentOS 7.6 with ROCm 4.0.1.
I tried to get this to work for the Stanford Alpaca extension for 8-bit Adam, but I ran into problems:

make CUDA_VERSION=gfx1030 hip results in the following:

which: no nvcc in (/public/software/compiler/dtk-22.10.1/bin:/public/software/compiler/dtk-22.10.1/llvm/bin:/public/software/compiler/dtk-22.10.1/hip/bin:/public/software/compiler/dtk-22.10.1/hip/bin/hipify:/public/software/compiler/dtk-22.10.1/miopen/bin:/opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/bin:/opt/hpc/software/mpi/hpcx/v2.11.0/hcoll/bin:/opt/hpc/software/mpi/hpcx/v2.11.0/ucx_without_rocm/bin:/opt/rh/devtoolset-7/root/usr/bin:/opt/hpc/setfreq:/opt/gridview/slurm/bin:/opt/gridview/slurm/sbin:/opt/gridview/munge/bin:/opt/gridview/munge/sbin:/opt/clusconf/sbin:/opt/clusconf/bin:/work/home/ac842t8oeo/miniconda3/envs/alpaca/bin:/work/home/ac842t8oeo/miniconda3/condabin:/opt/hpc/setfreq:/usr/lib64/qt-3.3/bin:/work/home/ac842t8oeo/perl5/bin:/opt/gridview/slurm/bin:/opt/gridview/slurm/sbin:/opt/gridview/munge/bin:/opt/gridview/munge/sbin:/opt/clusconf/sbin:/opt/clusconf/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/work/home/ac842t8oeo/.local/bin:/work/home/ac842t8oeo/bin)
# /usr/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx1030 -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/include -I /public/software/compiler/dtk-22.10.1/include -I/public/software/compiler/dtk-22.10.1/llvm/include -I/public/software/compiler/dtk-22.10.1/hip/include -I/public/software/compiler/dtk-22.10.1/miopen/include -I /opt/rocm/hipcub/include -o /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/build/ops.o -D NO_CUBLASLT /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/ops.cu
# /usr/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx1030 -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/include -I /public/software/compiler/dtk-22.10.1/include -I/public/software/compiler/dtk-22.10.1/llvm/include -I/public/software/compiler/dtk-22.10.1/hip/include -I/public/software/compiler/dtk-22.10.1/miopen/include -I /opt/rocm/hipcub/include -o /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/build/kernels.o -D NO_CUBLASLT /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/kernels.cu
/public/software/compiler/dtk-22.10.1/bin/hipcc -std=c++14 -c -fPIC --offload-arch=gfx1030 -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/include -I /public/software/compiler/dtk-22.10.1/include -I/public/software/compiler/dtk-22.10.1/llvm/include -I/public/software/compiler/dtk-22.10.1/hip/include -I/public/software/compiler/dtk-22.10.1/miopen/include -I /opt/rocm/hipcub/include -o /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/build/ops.o -D NO_CUBLASLT /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/ops.cu
/public/software/compiler/dtk-22.10.1/bin/hipcc -std=c++14 -c -fPIC --offload-arch=gfx1030 -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc -I /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/include -I /public/software/compiler/dtk-22.10.1/include -I/public/software/compiler/dtk-22.10.1/llvm/include -I/public/software/compiler/dtk-22.10.1/hip/include -I/public/software/compiler/dtk-22.10.1/miopen/include -I /opt/rocm/hipcub/include -o /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/build/kernels.o -D NO_CUBLASLT /work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/kernels.cu

/work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/kernels.cu:1874:40: error: no matching member function for call to 'BlockedToStriped'  
    BlockExchange(temp_storage.exchange).BlockedToStriped(local_col_absmax_values);  
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
/work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/kernels.cu:1899:26: note: in instantiation of function template specialization 'kgetColRowStats<__half, 64, 4, 16, 256, 0>' requested here
template __global__ void kgetColRowStats<half, 64, 4, 16, 64*4, 0>(half * __restrict__ A, float *rowStats, float *colStats, int * nnz_count_row, float nnz_threshold, int rows, int cols, int tiledRows, int tiledCols);                         
/opt/rocm/hipcub/include/hipcub/rocprim/block/block_exchange.hpp:91:10: note: candidate function template not viable: requires 2 arguments, but 1 was provided    
    void BlockedToStriped(InputT (&input_items)[ITEMS_PER_THREAD],         
/work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/kernels.cu:1874:40: error: no matching member function for call to 'BlockedToStriped'  
    BlockExchange(temp_storage.exchange).BlockedToStriped(local_col_absmax_values);  
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
/work/home/ac842t8oeo/bnb-zhr/bitsandbytes-rocm/csrc/kernels.cu:1900:26: note: in instantiation of function template specialization 'kgetColRowStats<__half, 64, 4, 16, 256, 1>' requested here
template __global__ void kgetColRowStats<half, 64, 4, 16, 64*4, 1>(half * __restrict__ A, float *rowStats, float *colStats, int * nnz_count_row, float nnz_threshold, int rows, int cols, int tiledRows, int tiledCols);                         
/opt/rocm/hipcub/include/hipcub/rocprim/block/block_exchange.hpp:91:10: note: candidate function template not viable: requires 2 arguments, but 1 was provided    
    void BlockedToStriped(InputT (&input_items)[ITEMS_PER_THREAD],         
2 errors generated when compiling for gfx1030.
make: *** [Makefile:113: hip] Error 1

Any assistance in this matter would be much appreciated, and thanks for your time!

Makefile:10: WARNING: CUDA_VERSION not set

Apologies for bugging you with what is likely a silly error on my part, but hopefully something here will be useful to people doing similar things going forward:

Issue: after using CD to get to the bitsandbytes-rocm main directory and using the command "make hip", I got the error (after redacting my personal $path information for brevity and readability):

Makefile:10: WARNING: CUDA_VERSION not set. Call make with CUDA string, for example: make cuda11x CUDA_VERSION=115 or make cpuonly CUDA_VERSION=CPU
/usr/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx1030 -I bitsandbytes-rocm/csrc
-I bitsandbytes-rocm/include
-o bitsandbytes-rocm/build/ops.o
-D NO_CUBLASLT bitsandbytes-rocm/csrc/ops.cu
make: /usr/bin/hipcc: No such file or directory
make: *** [Makefile:107: hip] Error 127

To clarify, that was a raw "make" command, with no arguments passed. I did not adjust the Makefile, as I wasn't sure if I should specify a CUDA version to emulate, although I did attempt to use GFX 1030 as the "CUDA version" in the Makefile, though after it was unsuccessful I reverted the change. I'm not terribly familiar with makefiles so I may have had a syntax error.

System Hardware:
RX 6700XT (GFX 1030)
Ryzen 9 5900X
32 GB RAM
Relevant files loaded onto a PCIe gen 3.0 SSD

System information:
Garuda Linux (Arch Linux derivative), kernel: 6.2.2-zen1-1-zen (64-bit)
Python: 3.10.9
Error occurred in a venv with ROCm dependencies manually installed via pip (notably torch, and torchvision rocm variants, version 5.2), and I attempted to install bitsandbytes-rocm, after I used command accelerate to run a machine learning related workload. In addition, to deal with an issue related to GFX_Version, I ran the below command to manually set ROCm software into a functional state in this venv

export HSA_OVERRIDE_GFX_VERSION=10.3.0

Other notable information:
After installing the AUR-provided packages related to ROCm outside of this venv, my GPU is listed as gfx1031 in a fresh terminal. I attempted to build this just from the venv, and installed the official AUR packages after that failed, and ran into the same issue.
I wrote some simple pytorch code and confirmed that ROCm is functioning as intended.

Any assistance in this matter would be much appreciated, and I hope to create an ample data trail to help future users avoid the same issues. Thanks for your time.
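
Since the make step above failed because nothing exists at /usr/bin/hipcc, a quick check of where hipcc actually lives can narrow things down. The sketch below is only a diagnostic aid; the function name is hypothetical, and the extra directories it probes (such as /opt/rocm/bin) are assumptions about common ROCm layouts rather than something this Makefile guarantees.

```python
import os
import shutil


def find_hipcc(extra_dirs=("/opt/rocm/bin", "/usr/bin")):
    """Return the first hipcc found on PATH or in `extra_dirs`, else None."""
    found = shutil.which("hipcc")
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, "hipcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


hipcc = find_hipcc()
print(hipcc or "hipcc not found on PATH or in the probed directories")
```

If the reported location differs from the path the build invokes, the compiler path used by the Makefile would need to point at the reported one.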

hipErrorNoBinaryForGpu

I have an RX 6700 XT and I am on Manjaro OS
I am attempting to get this fork working for Stable Diffusion Dreambooth extension for 8bit adam
Some users said they used this fork to get it working
But I do not think it works for me atm.
I am looking for help.

make CUDA_VERSION=gfx1030 hip results in the following

which: no nvcc in (/home/theiceleopard/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/var/lib/flatpak/exports/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl)
/opt/rocm/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx1030 -I /home/theiceleopard/temp/bitsandbytes-rocm/csrc -I /home/theiceleopard/temp/bitsandbytes-rocm/include -o /home/theiceleopard/temp/bitsandbytes-rocm/build/ops.o -D NO_CUBLASLT /home/theiceleopard/temp/bitsandbytes-rocm/csrc/ops.cu
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
/opt/rocm/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx1030 -I /home/theiceleopard/temp/bitsandbytes-rocm/csrc -I /home/theiceleopard/temp/bitsandbytes-rocm/include -o /home/theiceleopard/temp/bitsandbytes-rocm/build/kernels.o -D NO_CUBLASLT /home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
/home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu:2461:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
__global__ void kspmm_coo_very_sparse_naive(int *max_count, int *max_idx, int *offset_rowidx, int *rowidx, int *colidx, half *values, T *B, half *out, float * __restrict__ const dequant_stats, int nnz, int rowsA, int rowsB, int colsB)
                ^
/home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu:2461:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu:2461:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu:2461:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu:2461:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/theiceleopard/temp/bitsandbytes-rocm/csrc/kernels.cu:2461:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
6 warnings generated when compiling for gfx1030.
/usr/bin/hipcc -fPIC -static /home/theiceleopard/temp/bitsandbytes-rocm/build/ops.o /home/theiceleopard/temp/bitsandbytes-rocm/build/kernels.o -o /home/theiceleopard/temp/bitsandbytes-rocm/build/link.so
/usr/bin/g++ -std=c++14 -D__HIP_PLATFORM_AMD__ -DBUILD_CUDA -shared -fPIC -I /opt/rocm/include -I /home/theiceleopard/temp/bitsandbytes-rocm/csrc -I /home/theiceleopard/temp/bitsandbytes-rocm/include /home/theiceleopard/temp/bitsandbytes-rocm/build/ops.o /home/theiceleopard/temp/bitsandbytes-rocm/build/kernels.o /home/theiceleopard/temp/bitsandbytes-rocm/csrc/common.cpp /home/theiceleopard/temp/bitsandbytes-rocm/csrc/cpu_ops.cpp /home/theiceleopard/temp/bitsandbytes-rocm/csrc/pythonInterface.c -L/opt/rocm-5.3.0/lib -L/opt/rocm-5.3.0/llvm/bin/../lib/clang/15.0.0/lib/linux -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -lgcc_s -lgcc -lpthread -lm -lrt -lamdhip64 -lhipblas -lhipsparse -lclang_rt.builtins-x86_64 -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc -o ./bitsandbytes/libbitsandbytes_hip_nocublaslt.so

sudo CUDA_VERSION=gfx1030 python setup.py install results in the following

`libs: ['libbitsandbytes_hip_nocublaslt.so']
running install
/usr/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
/usr/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing bitsandbytes.egg-info/PKG-INFO
writing dependency_links to bitsandbytes.egg-info/dependency_links.txt
writing entry points to bitsandbytes.egg-info/entry_points.txt
writing top-level names to bitsandbytes.egg-info/top_level.txt
reading manifest file 'bitsandbytes.egg-info/SOURCES.txt'
adding license file 'LICENSE'
adding license file 'NOTICE.md'
writing manifest file 'bitsandbytes.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying bitsandbytes/libbitsandbytes_hip_nocublaslt.so -> build/lib/bitsandbytes
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/bitsandbytes
copying build/lib/bitsandbytes/__init__.py -> build/bdist.linux-x86_64/egg/bitsandbytes
copying build/lib/bitsandbytes/libbitsandbytes_hip_nocublaslt.so -> build/bdist.linux-x86_64/egg/bitsandbytes
copying build/lib/bitsandbytes/utils.py -> build/bdist.linux-x86_64/egg/bitsandbytes
copying build/lib/bitsandbytes/functional.py -> build/bdist.linux-x86_64/egg/bitsandbytes
creating build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/sgd.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/__init__.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/optimizer.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/adagrad.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/adamw.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/rmsprop.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/lars.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/lamb.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/optim/adam.py -> build/bdist.linux-x86_64/egg/bitsandbytes/optim
copying build/lib/bitsandbytes/cextension.py -> build/bdist.linux-x86_64/egg/bitsandbytes
creating build/bdist.linux-x86_64/egg/bitsandbytes/autograd
copying build/lib/bitsandbytes/autograd/__init__.py -> build/bdist.linux-x86_64/egg/bitsandbytes/autograd
copying build/lib/bitsandbytes/autograd/_functions.py -> build/bdist.linux-x86_64/egg/bitsandbytes/autograd
copying build/lib/bitsandbytes/debug_cli.py -> build/bdist.linux-x86_64/egg/bitsandbytes
creating build/bdist.linux-x86_64/egg/bitsandbytes/nn
copying build/lib/bitsandbytes/nn/__init__.py -> build/bdist.linux-x86_64/egg/bitsandbytes/nn
copying build/lib/bitsandbytes/nn/modules.py -> build/bdist.linux-x86_64/egg/bitsandbytes/nn
copying build/lib/bitsandbytes/__main__.py -> build/bdist.linux-x86_64/egg/bitsandbytes
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/functional.py to functional.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/sgd.py to sgd.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/optimizer.py to optimizer.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/adagrad.py to adagrad.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/adamw.py to adamw.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/rmsprop.py to rmsprop.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/lars.py to lars.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/lamb.py to lamb.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/optim/adam.py to adam.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/cextension.py to cextension.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/autograd/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/autograd/_functions.py to _functions.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/debug_cli.py to debug_cli.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/nn/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/nn/modules.py to modules.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/bitsandbytes/__main__.py to __main__.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying bitsandbytes.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitsandbytes.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitsandbytes.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitsandbytes.egg-info/entry_points.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitsandbytes.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
bitsandbytes.__pycache__.cextension.cpython-310: module references __file__
creating 'dist/bitsandbytes-0.35.4-py3.10.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing bitsandbytes-0.35.4-py3.10.egg
removing '/usr/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg' (and everything under it)
creating /usr/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg
Extracting bitsandbytes-0.35.4-py3.10.egg to /usr/lib/python3.10/site-packages
bitsandbytes 0.35.4 is already the active version in easy-install.pth
Installing debug_cuda script to /usr/bin

Installed /usr/lib/python3.10/site-packages/bitsandbytes-0.35.4-py3.10.egg
Processing dependencies for bitsandbytes==0.35.4
Finished processing dependencies for bitsandbytes==0.35.4`

python -m bitsandbytes results in the following in a different terminal right after "python install"

/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//debuginfod.archlinux.org'), PosixPath('https')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/etc/gtk-2.0/gtkrc')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/home/theiceleopard/.gtkrc'), PosixPath('/etc/gtk/gtkrc')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/Sessions/1')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/Windows/1')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('@/tmp/.ICE-unix/1114,unix/StormyGamingRig'), PosixPath('local/StormyGamingRig')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Seat0')}
  warn(
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Session1')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary /home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/theiceleopard/.local/lib/python3.10/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn(
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
zsh: IOT instruction (core dumped) python -m bitsandbytes
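
Note that the install step above wrote the egg to /usr/lib/python3.10/site-packages, while this run loaded libbitsandbytes_cpu.so from a copy under ~/.local. A small sketch like the following can list every copy of a package visible on the import path, which may help confirm whether an older CPU-only install is shadowing the freshly built one. The function name is hypothetical, and it only detects packages that exist as plain directories.

```python
import os
import sys


def find_package_copies(package, search_paths=None):
    """List every directory on `search_paths` (default: sys.path) containing `package`."""
    paths = sys.path if search_paths is None else search_paths
    copies = []
    for entry in paths:
        candidate = os.path.join(entry or ".", package)
        if os.path.isdir(candidate) and candidate not in copies:
            copies.append(candidate)
    return copies


# Entries are listed in sys.path order, so the first one is normally the copy
# that `import bitsandbytes` would pick up.
for path in find_package_copies("bitsandbytes"):
    print(path)
```

If more than one path is printed, removing or upgrading the stale copy is one thing to try before rebuilding.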

But in the same terminal after python install

"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" zsh: IOT instruction (core dumped) python -m bitsandbytes

Int8 Matmul not supported on gfx1030?

Attempting to use this library on a gfx1030 (6800XT) with the huggingface transformers results in:

python -m bitsandbytes
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ DEBUG INFORMATION +++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable

SUCCESS!
Installation was successful!

Trying to load a simple huggingface transformer results in:

=============================================
ERROR: Your GPU does not support Int8 Matmul!
=============================================

python3: /dockerx/temp/bitsandbytes-rocm/csrc/ops.cu:347: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t *, const int8_t *, void *, float *, int, int, int) [FORMATB = 3, DTYPE_OUT = 32, SCALE_ROWS = 0]: Assertion `false' failed.
Aborted (core dumped)

I am using Rocm 5.4.0 (I updated the library paths in the makefile to point to 5.4)