

bitsandbytes's People

Contributors

lsz-fb, sirrob1997, timdettmers


bitsandbytes's Issues

import bitsandbytes as bnb error

import bitsandbytes as bnb
produces the following:
OSError: /home/anaconda3/envs/ner/lib/python3.6/site-packages/bitsandbytes/libbitsandbytes.so: undefined symbol: __fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37

Hello, how can I resolve this?

bnb.optim.AdamW

Hey @TimDettmers,

Awesome library! bnb.optim.Adam saved me from having to use model parallelism 😍

Do you think it would be easy to also add a bnb.optim.AdamW version for https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW ?

Happy to give it a try if you think it's easily feasible :-)
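
A sketch of the drop-in usage this would enable, assuming the requested bnb.optim.AdamW mirrors torch.optim.AdamW's constructor (the bnb.optim.AdamW name is the hypothetical class being asked for here):

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Today: optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Requested 8-bit-state counterpart, assumed to take the same arguments:
optimizer = bnb.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

loss = model(torch.randn(16, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()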

Optimizer2State: Unsafe use of eval() in __init__

The Optimizer2State class accepts a string for the optional betas parameter during initialization. The string is passed to eval() without prior validation, potentially leading to execution of arbitrary code.

class Optimizer2State(Optimizer8bit):
    def __init__(self, optimizer_name, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0, optim_bits=32, args=None,
                 min_8bit_size=4096, percentile_clipping=100, block_wise=True, max_unorm=0.0,
                 skip_zeros=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if isinstance(betas, str):
            betas = eval(betas)
            print(betas, 'parsed')

bnb.optim.Adam, bnb.optim.Adam8bit and bnb.optim.Adam32bit exhibit the same behaviour.

#!/usr/bin/env python3

hello = "exec(\"import os;os.system('/usr/bin/id');\")"

try:
    from bitsandbytes.optim.optimizer import Optimizer2State
    Optimizer2State('test', 'test', betas=hello)
except:
    pass

try:
    import bitsandbytes as bnb
    bnb.optim.Adam('test', betas=hello)
except:
    pass

try:
    import bitsandbytes as bnb
    bnb.optim.Adam8bit('test', betas=hello)
except:
    pass

try:
    import bitsandbytes as bnb
    bnb.optim.Adam32bit('test', betas=hello)
except:
    pass
$ id
uid=1000(asdf) gid=1000(asdf) groups=1000(asdf)
$ ./test.py 
uid=1000(asdf) gid=1000(asdf) groups=1000(asdf)
None parsed
uid=1000(asdf) gid=1000(asdf) groups=1000(asdf)
None parsed
uid=1000(asdf) gid=1000(asdf) groups=1000(asdf)
None parsed
uid=1000(asdf) gid=1000(asdf) groups=1000(asdf)
None parsed
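
A minimal sketch of a safer way to accept the same string input, using ast.literal_eval (which only evaluates Python literals) in place of eval; this is a suggested fix, not the library's current code:

import ast

def parse_betas(betas):
    """Accept either a (beta1, beta2) tuple or its string form, without eval()."""
    if isinstance(betas, str):
        try:
            betas = ast.literal_eval(betas)  # literals only; raises on arbitrary code
        except (ValueError, SyntaxError) as exc:
            raise ValueError(f"Invalid betas string: {betas!r}") from exc
    if (not isinstance(betas, (tuple, list)) or len(betas) != 2
            or not all(isinstance(b, (int, float)) for b in betas)):
        raise ValueError(f"betas must be two numbers, got {betas!r}")
    return tuple(betas)

# parse_betas("(0.9, 0.999)") -> (0.9, 0.999)
# parse_betas("exec(\"import os;os.system('/usr/bin/id')\")") -> raises ValueError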

Quantization functions test fail on Pascal

The following tests fail on Pascal:

tests/test_functional.py::test_estimate_quantiles[float] FAILED
tests/test_functional.py::test_estimate_quantiles[half] FAILED
tests/test_functional.py::test_quantile_quantization FAILED

My guess is that this is due to atomicAdd for floats behaving differently on Pascal.

Is AdamW8bit compatible with OSS in fairscale?

Thank you for the nice project.

When I use the AdamW8bit optimizer on its own, it saves GPU memory.
However, when I combine the optimizer with OSS in fairscale,
GPU memory is not reduced.

Is this library not compatible with OSS in fairscale, or is something else the issue?
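
For concreteness, a sketch of the combination in question, assuming fairscale's documented OSS(params, optim=..., **defaults) interface; note that OSS shards optimizer state across data-parallel ranks, so a single-rank run would not see any reduction from sharding:

import torch
import torch.distributed as dist
import bitsandbytes as bnb
from fairscale.optim.oss import OSS  # assumes fairscale is installed

# OSS must run inside an initialized torch.distributed process group;
# it shards the wrapped optimizer's state across the data-parallel ranks.
assert dist.is_initialized()

model = torch.nn.Linear(4096, 4096).cuda()

optimizer = OSS(
    params=model.parameters(),
    optim=bnb.optim.AdamW8bit,  # optimizer class; OSS instantiates it per shard
    lr=1e-3,
    weight_decay=0.01,
)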

Feature request: Please, add implementation for Novograd algorithm

8-bit optimizer crashes when fine-tuning gpt2-large

Using the bnb.optim.Adam8bit optimizer in place of torch.optim.Adam causes a crash after a handful of batches:

12it [00:22, 1.82s/it]Error an illegal memory access was encountered at line 198 in file /home/alyssa/gpt_math/bitsandbytes/csrc/ops.cu

I am fine-tuning Hugging Face's gpt2-large model on an Ampere RTX 3090 GPU with CUDA 11.6 and NVIDIA driver 510.73.05. I have tried compiling bitsandbytes from source on my machine, as well as the set_optim_to_run_embedding_in_fp32 trick from huggingface/transformers#14819; neither affected the behavior. Running with the standard PyTorch Adam optimizer works fine. nvidia-smi shows 16 GB of memory used on a GPU with 24 GB, so it should not be running out of memory or anywhere close to it.
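
For reference, a sketch of the bitsandbytes-side variant of that trick, keeping the embedding's optimizer state in 32 bits via GlobalOptimManager (this mirrors the override API used by the transformers integration; since the trick itself did not change the behavior above, it is shown only for completeness):

import bitsandbytes as bnb
from transformers import GPT2LMHeadModel  # the model family from the report

model = GPT2LMHeadModel.from_pretrained("gpt2-large").cuda()

# Keep optimizer state for the (large, sparsely updated) token embedding in 32 bits,
# while all other parameters use the 8-bit state.
manager = bnb.optim.GlobalOptimManager.get_instance()
manager.register_module_override(model.transformer.wte, "weight", {"optim_bits": 32})

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)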

'NoneType' object has no attribute 'cdequantize_blockwise_cpu_fp32'

I am trying to train GPT-J with 8-bit weights. It works well on GPU, but when I try to use it on CPU, it gives this error:

'NoneType' object has no attribute 'cdequantize_blockwise_cpu_fp32'

I am using dequantize_blockwise from bitsandbytes.functional. The following is the class in which it is used:

import torch
import torch.nn.functional as F
from bitsandbytes.functional import dequantize_blockwise

class DequantizeAndLinear(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input: torch.Tensor, weights_quantized: torch.ByteTensor,
                absmax: torch.FloatTensor, code: torch.FloatTensor, bias: torch.FloatTensor):
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        ctx.save_for_backward(input, weights_quantized, absmax, code)
        ctx._has_bias = bias is not None
        return F.linear(input, weights_deq, bias)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        assert not ctx.needs_input_grad[1] and not ctx.needs_input_grad[2] and not ctx.needs_input_grad[3]
        input, weights_quantized, absmax, code = ctx.saved_tensors
        # grad_output: [*batch, out_features]
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        grad_input = grad_output @ weights_deq
        grad_bias = grad_output.flatten(0, -2).sum(dim=0) if ctx._has_bias else None
        return grad_input, None, None, None, grad_bias

Is it possible to run this on the CPU, or do I have to run it only on the GPU?
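
A sketch of the work-around I would expect to be needed, assuming the CPU kernels simply are not available in this build: guard on where the tensors live and run the blockwise dequantization on the GPU (dequantize_blockwise is used exactly as in the class above):

import torch
from bitsandbytes.functional import dequantize_blockwise

def dequantize_weights(weights_quantized, absmax, code):
    """Blockwise-dequantize on the GPU when possible; the CPU kernel may be unavailable."""
    if weights_quantized.is_cuda:
        return dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
    if torch.cuda.is_available():
        # Round-trip through the GPU, then bring the fp32 result back to the CPU.
        deq = dequantize_blockwise(
            weights_quantized.cuda(), absmax=absmax.cuda(), code=code.cuda()
        )
        return deq.cpu()
    raise RuntimeError("bitsandbytes CPU dequantization kernel not available in this build")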

Does it work on Windows?

I get OSError: [WinError 193] %1 is not a valid Win32 application from lib = ct.cdll.LoadLibrary(os.path.dirname(__file__) + '/libbitsandbytes.so') in functional.py. What am I doing wrong?

Support for Tesla Architecture

First of all, great work!

Secondly, I can see that you specify that Maxwell Architecture is necessary, and I am wondering if

  1. it's possible to do 8-bit optimization on Tesla Architecture
  2. there are plans to implement it

I ask because Kaggle and Colab notebooks use Tesla-series GPUs (P100, K80), and I'm sure those communities, myself included, would be interested in using bitsandbytes.

The code uses more GPU memory with Multi-scale Vision Transformers

Hi,

Thanks for the great work! I'm currently trying to apply your code to vision transformers, specifically, on this code base:
https://github.com/facebookresearch/SlowFast/tree/main/projects/mvit
When using torch.optim.SGD(momentum=0.9), the code consumes 9221 MiB of GPU memory during training. After changing it to bnb.optim.SGD8bit() with the same arguments, it consumes slightly more GPU memory: 9235 MiB. Do you have any idea why this would happen? Thank you! My CUDA version is 10.2 and my torch version is 1.9.1.

Best,
Junwei
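
One way to localize the difference is to measure the optimizer state directly rather than reading nvidia-smi, which also counts the CUDA context and PyTorch's caching allocator; note that SGD with momentum keeps only one state buffer per parameter, so the 8-bit variant can only shrink that buffer. A generic sketch (not specific to the SlowFast code base):

import torch

def optimizer_state_bytes(optimizer: torch.optim.Optimizer) -> int:
    """Sum the sizes of all tensors held in the optimizer's per-parameter state."""
    total = 0
    for per_param_state in optimizer.state.values():
        for value in per_param_state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total

# Call after at least one optimizer.step(), since state is created lazily:
# print(f"optimizer state: {optimizer_state_bytes(optimizer) / 2**20:.1f} MiB")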

undefined symbol: __fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37

(torch1.8-py3.8) jiaofangkai@dell-PowerEdge-T640:/home/share/jiaofangkai$ python check_bnb_install.py
Traceback (most recent call last):
  File "check_bnb_install.py", line 1, in <module>
    import bitsandbytes as bnb
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/site-packages/bitsandbytes/__init__.py", line 5, in <module>
    from .optim import adam
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/site-packages/bitsandbytes/optim/__init__.py", line 5, in <module>
    from .adam import Adam, Adam8bit, Adam32bit
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/site-packages/bitsandbytes/optim/adam.py", line 6, in <module>
    from bitsandbytes.optim.optimizer import Optimizer2State
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 6, in <module>
    import bitsandbytes.functional as F
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/site-packages/bitsandbytes/functional.py", line 13, in <module>
    lib = ct.cdll.LoadLibrary(os.path.dirname(__file__) + '/libbitsandbytes.so')
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/ctypes/__init__.py", line 459, in LoadLibrary
    return self._dlltype(name)
  File "/home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/share/jiaofangkai/anaconda3/envs/torch1.8-py3.8/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes.so: undefined symbol: __fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37

Hi, I have encountered a problem similar to #5. I have tested with a Tesla T4 and an RTX 2080 Ti, but both failed.

The environments are as follows:

# TeslaT4
Ubuntu 18.04.6, Tesla T4, cuda-10.1, driver version: 418.197.02, python=3.8, torch=1.8.1+cu101

# RTX 2080Ti
Ubuntu 20.04.3, RTX 2080Ti, cuda-10.1, driver version: 435.21, python=3.8, torch=1.8.1+cu101

python setup.py install error

(bitsandbytes) chenxin@chenxin-Nitro-AN515-52:~/disk1/github/bitsandbytes$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 15, in <module>
    name = f"bitsandbytes-cuda{os.environ['CUDA_VERSION']}",
  File "/home/chenxin/disk1/anaconda3/envs/bitsandbytes/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'CUDA_VERSION'
(bitsandbytes) chenxin@chenxin-Nitro-AN515-52:~/disk1/github/bitsandbytes$ conda list | grep cudatoolkit
cudatoolkit 11.1.1 h6406543_8 conda-forge

Errors when training reaches the third epoch, every time

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=1 : invalid argument
Traceback (most recent call last):
  File "train_pointunet.py", line 211, in <module>
    loss_seg = lossfunc_seg(outputs_seg, labels)+lossfunc_dice(outputs_seg,labels)
  File "/home/why/miniconda3/envs/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/why/miniconda3/envs/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (1) : invalid argument at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29

I'm very confused because it works fine for the first several epochs.

undefined symbol: __fatbinwrap_38

With some CUDA versions and on some architectures this error occurs:

Traceback (most recent call last):
  File "check_bnb_install.py", line 1, in <module>
    import bitsandbytes as bnb
  File "/miniconda/envs/pytorch_env/lib/python3.7/site-packages/bitsandbytes/__init__.py", line 5, in <module>
    from .optim import adam
  File "/miniconda/envs/pytorch_env/lib/python3.7/site-packages/bitsandbytes/optim/__init__.py", line 5, in <module>
    from .adam import Adam, Adam8bit, Adam32bit
  File "/miniconda/envs/pytorch_env/lib/python3.7/site-packages/bitsandbytes/optim/adam.py", line 5, in <module>
    from bitsandbytes.optim.optimizer import Optimizer2State
  File "/miniconda/envs/pytorch_env/lib/python3.7/site-packages/bitsandbytes/optim/optimizer.py", line 6, in <module>
    import bitsandbytes.functional as F
  File "/miniconda/envs/pytorch_env/lib/python3.7/site-packages/bitsandbytes/functional.py", line 13, in <module>
    lib = ct.cdll.LoadLibrary(os.path.dirname(__file__) + '/libbitsandbytes.so')
  File "/miniconda/envs/pytorch_env/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
    return self._dlltype(name)
  File "/miniconda/envs/pytorch_env/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /miniconda/envs/pytorch_env/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes.so: undefined symbol: __fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37

Confirmed for CUDA 10.1 for compute capability 7.5 (V100).
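
For anyone debugging this, a slightly fuller check than a bare import: one 8-bit Adam step forces the CUDA kernels in libbitsandbytes.so to actually be called (a sketch; the original check_bnb_install.py is not reproduced above beyond its first line):

import torch
import bitsandbytes as bnb

# A single large parameter and one Adam8bit step exercise both the library load
# and an 8-bit optimizer kernel on the GPU.
p = torch.nn.Parameter(torch.rand(4096, 4096, device="cuda"))
optimizer = bnb.optim.Adam8bit([p], lr=1e-3)

(p ** 2).sum().backward()
optimizer.step()
print("bitsandbytes 8-bit Adam step succeeded")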

Llama?

New Model out. Any chance it'll be supported by you guys?

bfloat16 grads are not supported

Are there any plans to support models/gradients in bfloat16? bfloat16 has gained quite a bit of popularity lately, since every Ampere GPU supports the type and it eliminates the need for loss scaling compared to float16.
This is what I get when I try to initialize bnb.AdamW with a model cast to bfloat16:
ValueError: Gradient+optimizer bit data type combination not supported: grad torch.bfloat16, optimizer torch.uint8
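
A minimal sketch of the failing combination (module sizes are made up; exactly where the ValueError surfaces may vary, but it appears once bfloat16 gradients meet the uint8 optimizer state):

import torch
import bitsandbytes as bnb

# Model, and therefore its gradients, in bfloat16; optimizer state in 8 bits.
model = torch.nn.Linear(4096, 4096).to(device="cuda", dtype=torch.bfloat16)
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-3)

out = model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16))
out.sum().backward()
optimizer.step()  # ValueError: Gradient+optimizer bit data type combination not supported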

Did you ever try MNMT systems?

As reported in the paper, when training a bidirectional transformer model on WMT14 or WMT16, the performance of 8-bit Adam stays relatively consistent with the 32-bit counterpart. I was also able to verify this on other data sources for training bidirectional models with my own setup.

However, I've also tried multiple variations of the 8-bit optimizers on multilingual neural machine translation (MNMT) models in fairseq, and there it seems that even with --no-scale-embedding as well as the StableEmbedding, performance is roughly 3 BLEU behind the counterparts. The --no-scale-embedding flag accounts for roughly a 7 BLEU gain, while the Xavier init accounts for roughly a 0.4 BLEU gain. I haven't looked into the effect of the layer norm in the stable embeddings yet.

Did you do any testing on that, and do you have practical tips for getting the performance up?
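
For context, the StableEmbedding swap referred to above looks roughly like this (a sketch with illustrative sizes; the surrounding fairseq model code is omitted):

import torch.nn as nn
import bitsandbytes as bnb

vocab_size, embed_dim, padding_idx = 32000, 1024, 1  # illustrative sizes

# Standard fairseq-style token embedding:
baseline_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)

# StableEmbedding: Xavier-uniform init plus a layer norm, intended for 8-bit optimizers;
# same constructor arguments, so it is a drop-in replacement.
stable_embed = bnb.nn.StableEmbedding(vocab_size, embed_dim, padding_idx=padding_idx)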

no difference in memory usage

Hi.
I am training my network with bnb.optim.Adam8bit vs. torch.optim.Adam, but I don't see any difference in memory consumption.

Running on a GTX 2080 Ti (single GPU or DDP),
with cudatoolkit 11.1.74 and
bitsandbytes-cuda111.

Looking at nvidia-smi, I see 9.6 GB in both cases.
Am I missing something here?
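
Two things worth checking, sketched below with assumed layer sizes: bitsandbytes only keeps 8-bit state for tensors at or above the min_8bit_size threshold (4096 elements by default), and nvidia-smi includes the CUDA context plus memory cached by PyTorch's allocator, so torch.cuda.max_memory_allocated gives a cleaner comparison:

import torch
import bitsandbytes as bnb

def peak_mib_for(optimizer_cls):
    """Peak allocated memory (MiB) over one forward/backward/step with the given optimizer."""
    model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
    optimizer = optimizer_cls(model.parameters(), lr=1e-3)
    torch.cuda.reset_peak_memory_stats()
    model(torch.randn(32, 4096, device="cuda")).sum().backward()
    optimizer.step()  # optimizer state is allocated here
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated() / 2**20
    del model, optimizer
    torch.cuda.empty_cache()
    return peak

print("torch.optim.Adam   :", peak_mib_for(torch.optim.Adam))
print("bnb.optim.Adam8bit :", peak_mib_for(bnb.optim.Adam8bit))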
