qwopqwop200 / gptq-for-llama
4 bits quantization of LLaMA using GPTQ
License: Apache License 2.0
Model: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b/tree/main
The performance of oasst-sft seems much better than LLaMA's. It would be nice to quantize it to 4 bits and run it on a much lower-end GPU.
Could you point out whether it is possible to quantize it with GPTQ?
Below is a segment of the 7B 4bit weights generated using the line in the same environment with two different video cards. An A4000 (on the left) and an A6000 (on the right).
Notice how there is a half-byte difference every 20-40 bytes? These differences are always off by one: a B becomes an A, a 5 becomes a 6, and so on. This issue seems to persist across all model sizes when producing weights on different cards.
No idea what is causing it.
Without reproducible builds it is hard to say if we're actually producing the same weights.
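One way to put numbers on the observation above is to compare the two weight files nibble by nibble. The sketch below is only an illustration: the function names are made up, and it assumes each byte holds two 4-bit values with the high nibble first, which may not match the repository's actual packing order.

```python
def nibble_diffs(a: bytes, b: bytes):
    """Return (nibble_index, value_a, value_b) for every differing 4-bit value."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    diffs = []
    for i, (x, y) in enumerate(zip(a, b)):
        # Assumption: two 4-bit values per byte, high nibble first.
        for shift, slot in ((4, 0), (0, 1)):
            na, nb = (x >> shift) & 0xF, (y >> shift) & 0xF
            if na != nb:
                diffs.append((2 * i + slot, na, nb))
    return diffs


def summarize(a: bytes, b: bytes):
    """Count differing nibbles and how many of them differ by exactly one."""
    diffs = nibble_diffs(a, b)
    off_by_one = sum(1 for _, na, nb in diffs if abs(na - nb) == 1)
    return {"nibbles": 2 * len(a), "differing": len(diffs), "off_by_one": off_by_one}


# Tiny made-up example: "b" vs "a" and "5" vs "6", as described in the report.
left = bytes.fromhex("1b5a")
right = bytes.fromhex("1a6a")
print(summarize(left, right))
```

If nearly every differing nibble is off by exactly one, that would be consistent with rounding differences between the two cards rather than with corrupted output, although this check alone cannot prove it.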
ptb_text_only uses the validation file instead of the test file. While it is still from the same dataset and should give similar results, this makes one-to-one comparisons difficult.
c4 only has validation, so that is fine.
wikitext-2 uses test (line 13 in 468c47c)
ptb_text_only uses validation (line 35 in 468c47c)
c4 uses validation (lines 59 to 61 in 468c47c)
please correct me if this is intended. :)
How does someone produce a quantized model (e.g. 4bit) themselves?
What are the steps?
Presumably more than 130 GB of RAM? How much would it slow it down if using a swap file? Anything else? It seems like since GPTQ has the best results on larger models this should be looked into. It would be incredible to get almost the whole performance of the 65B model using only 16 GB vRAM.
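The 130 GB figure matches a back-of-the-envelope estimate of the fp16 weights alone. The sketch below is a rough calculation, not a measurement: it counts only weight storage, ignores quantization scales, zero points, activations and the working memory GPTQ itself needs, and uses 1 GB = 1e9 bytes.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# 65e9 is the nominal parameter count of the 65B model.
for bits in (16, 4, 3, 2):
    print(f"65B at {bits}-bit: {weight_memory_gb(65e9, bits):.1f} GB")
```

By this estimate the 65B weights take about 130 GB at 16 bits and about 32.5 GB at 4 bits, so fitting them in 16 GB would need roughly 2 bits per weight.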
If I follow the instructions in the readme, I'm getting an error now even though it worked a few days ago.
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
Output:
Traceback (most recent call last):
File "~/text-generation-webui/repositories/GPTQ-for-LLaMa/setup_cuda.py", line 6, in <module>
ext_modules=[cpp_extension.CUDAExtension(
File "~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
library_dirs += library_paths(cuda=True)
File "~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
if (not os.path.exists(_join_cuda_home(lib_dir)) and
File "~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
If I try to manually set CUDA_HOME=$CONDA_PREFIX/
(which wasn't necessary previously) it still doesn't work. I get this error:
running install
~/miniconda3/envs/textgen/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
~/miniconda3/envs/textgen/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
error: [Errno 2] No such file or directory: 'CUDA_HOME=~/miniconda3/envs/textgen/bin/nvcc'
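In the last error, the path that could not be found starts with the literal text `CUDA_HOME=`. That suggests the assignment itself ended up inside the variable's value, though this is only a guess from the log. The sketch below uses hypothetical helper names to show how the toolkit root could be derived from the location of `nvcc` and how such a malformed value could be spotted.

```python
import os
import shutil
from typing import Optional


def cuda_home_from_nvcc(nvcc_path: Optional[str]) -> Optional[str]:
    """Given a path to nvcc, return the toolkit root (nvcc lives in <root>/bin/nvcc)."""
    if not nvcc_path:
        return None
    return os.path.dirname(os.path.dirname(nvcc_path))


def looks_malformed(value: str) -> bool:
    """Flag values such as 'CUDA_HOME=/path', where the assignment leaked into the value."""
    return "=" in value


current = os.environ.get("CUDA_HOME", "")
if looks_malformed(current):
    print("CUDA_HOME looks malformed:", current)

# shutil.which returns None when nvcc is not on PATH.
print("suggested CUDA_HOME:", cuda_home_from_nvcc(shutil.which("nvcc")))
```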
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
Traceback (most recent call last):
File "llama.py", line 419, in <module>
llama_eval(model, testloader, DEV)
File "/home/zhangjp/anaconda3/envs/pytorch2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "llama.py", line 121, in llama_eval
layers = model.model.decoder.layers
File "/home/zhangjp/anaconda3/envs/pytorch2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LLaMAModel' object has no attribute 'decoder'
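The traceback shows `llama_eval` reaching for `model.model.decoder.layers`, an OPT-style layout, on a model object that has no `decoder` attribute. A defensive lookup could look like the sketch below. It assumes the installed LLaMA implementation exposes its blocks as `model.model.layers`, which should be checked against the installed transformers version; `get_decoder_layers` is a made-up helper name, not part of the repository.

```python
from types import SimpleNamespace


def get_decoder_layers(model):
    """Return the list of transformer blocks for either attribute layout."""
    inner = model.model
    # OPT-style layout: model.model.decoder.layers
    decoder = getattr(inner, "decoder", None)
    if decoder is not None:
        return decoder.layers
    # Assumed LLaMA-style layout without a `decoder` wrapper: model.model.layers
    return inner.layers


# Stand-in objects that only mimic the attribute layout; no real model is loaded.
opt_like = SimpleNamespace(model=SimpleNamespace(decoder=SimpleNamespace(layers=["block0"])))
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["block0", "block1"]))
print(len(get_decoder_layers(opt_like)), len(get_decoder_layers(llama_like)))
```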
I get the following error when trying to run setup.py from the GPTQ install. I have an RTX 3090 and followed the instructions from this GitHub gist.
FAILED: D:/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\TH -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\cruge\miniconda3\envs\textgen\include -IC:\Users\cruge\miniconda3\envs\textgen\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" -c D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o 
D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "C:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py", line 1808, in _run_ninja_build
subprocess.run(
File "C:\Users\cruge\miniconda3\envs\textgen\lib\subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
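The command above invokes `nvcc` from CUDA v11.0 while asking for `compute_86`, the architecture of the RTX 3090. A toolkit that predates an architecture rejects it with exactly this kind of message. The lookup below is a sketch only: the minimum versions in the table are recalled from NVIDIA's release notes and should be verified there, and `nvcc_supports` is a made-up name.

```python
# Minimum CUDA toolkit release whose nvcc accepts each architecture.
# Assumption: values recalled from NVIDIA release notes; verify before relying on them.
MIN_CUDA_FOR_ARCH = {
    "70": (9, 0),   # Volta
    "75": (10, 0),  # Turing
    "80": (11, 0),  # Ampere (A100)
    "86": (11, 1),  # Ampere (RTX 30 series)
    "89": (11, 8),  # Ada
    "90": (11, 8),  # Hopper
}


def nvcc_supports(arch: str, cuda_version: tuple) -> bool:
    """Return True if a toolkit of the given (major, minor) version knows the architecture."""
    return cuda_version >= MIN_CUDA_FOR_ARCH[arch]


print(nvcc_supports("86", (11, 0)))
print(nvcc_supports("86", (11, 7)))
```

If the table is right, the v11.0 toolkit on this machine is too old for the card, and a newer toolkit that matches the installed PyTorch build would be needed.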
On running the setup_cuda.py install, I was initially getting:
RuntimeError: The detected CUDA version (12.0) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.
Tried being clever and manually compiling which returned the following errors:
`
/usr/local/lib64/python3.11/site-packages/torch/include/ATen/core/qualified_name.h(73): here
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type::type>::cast_op_type pybind11::detail::cast_op(make_caster&)’:
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected template-name before ‘<’ token
42 | return caster.operator typename make_caster::template cast_op_type();
| ^
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected identifier before ‘<’ token
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:123: error: expected primary-expression before ‘>’ token
42 | return caster.operator typename make_caster::template cast_op_type();
| ^
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:126: error: expected primary-expression before ‘)’ token
42 | return caster.operator typename make_caster::template cast_op_type();
|
`
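The initial `RuntimeError` comes from comparing the system toolkit version with the CUDA version PyTorch was built against. The sketch below reproduces that comparison outside the build so the mismatch can be seen before compiling. The sample `nvcc --version` line is illustrative, the helper names are made up, and the exact rule PyTorch applies (a hard error on a major mismatch, something milder on a minor one) is an assumption here.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Extract (major, minor) from `nvcc --version` text such as '... release 12.0, ...'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_version(text: str):
    """Turn a string like '11.8' (the form of torch.version.cuda) into (11, 8)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def compare(nvcc_output: str, torch_cuda: str) -> str:
    toolkit = parse_nvcc_release(nvcc_output)
    built = parse_version(torch_cuda)
    if toolkit[0] != built[0]:
        return "major mismatch"
    if toolkit != built:
        return "minor mismatch"
    return "match"


# Illustrative input; in practice the first argument would be real `nvcc --version`
# output and the second would be torch.version.cuda.
sample = "Cuda compilation tools, release 12.0, V12.0.76"
print(compare(sample, "11.8"))
```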
The generation takes more time with each message, as if there's an overhead
For example: The second response is 11x faster than the last response. They have the same number of tokens.
The issue persists both on llama-7b and llama-13b
Running llama with: python3.10 server.py --load-in-4bit --model llama-7b-hf --cai-chat --no-stream
specs:
Gpu: RTX 3060 12GB
Cpu: Intel i5 12400f
Ram: 64GB DDR4 3200MHz
OS: Linux
I could get the model compression running, and the benchmark also works.
But how would I use the model for inference? Is there any example? The standard transformers methods do not work and fail with "Only supports a single token currently." That seems related to #6.
Would it be possible to adapt https://github.com/mstnegate/int4matmul_kernels to this repository, or any other kernel that would allow inference on prompts more than one token in length?
Hello, I have downloaded the official checkpoint (consolidated.00.pth and tokenizer.model)
to my computer. How should I convert them to fit your repo, other than downloading from Hugging Face again?
Amazing work! Thank you so much for sharing this.
Despite my attempts, I wasn't able to replicate the quantization functions without CUDA. It would be hugely helpful if users could use AMD or Apple Silicon GPUs too (the latter already have PyTorch support as 'mps').
Apple Silicon may seem like an odd option, but shared memory means these machines are some of the only options for high-memory inference on consumer hardware. For example, it is possible to get up to 64 GB of GPU-accessible memory on the Mac Studio.
Any code changes or advice to achieve this would be sincerely appreciated!
I was getting this error when running python setup_cuda.py
quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant2MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(87): here
quant_cuda_kernel.cu(261): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant3MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(171): here
quant_cuda_kernel.cu(337): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant4MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(283): here
quant_cuda_kernel.cu(409): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant8MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(359): here
So I changed all the integers to doubles... and it compiled with a bunch of warnings... Does this defeat the point, or is it a valid solution?
Nothing runs at the moment, so I guess not.
This isn't a bug per se, more a cry for help at this point
I'm trying to get 4bit working on oobabooga/text-generation-webui but can't get the cuda extensions in this repository to build.
I am using docker with continuumio/miniconda3 as base image, picked because the instructions for text-generation-webui setup uses conda. I already have it working with that repository and with cuda set up and gpu working there.
The setup for apt and conda is:
RUN apt-get update && apt-get install -y git software-properties-common build-essential gnupg ninja-build && apt-get clean
RUN conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
RUN conda install -c "nvidia/label/cuda-11.7.0" cuda
I also tried the last line with cudatoolkit, cuda-libraries and many others. I've verified that nvcc and the CUDA header files are in place, and am frankly unsure what's missing.
The compile error I get is long, but (I think) the relevant sections are:
#0 3.448 No CUDA runtime is found, using CUDA_HOME='/opt/conda'
#0 3.470 running install
#0 3.470 /opt/conda/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
#0 3.470 warnings.warn(
#0 3.528 /opt/conda/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
..........
#0 3.642 File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1780, in _get_cuda_arch_flags
#0 3.642 arch_list[-1] += '+PTX'
#0 3.642 IndexError: list index out of range
I assume the error is connected to "No CUDA runtime is found" - but after a day's search I still haven't figured out what's missing in the installation. The docker file and repo can be found at https://github.com/TheTerrasque/text-generation-webui/tree/feature/docker
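The `IndexError` is raised while PyTorch works out which GPU architectures to compile for. During a `docker build` no GPU is normally visible, so there is nothing to detect. PyTorch's extension builder reads the `TORCH_CUDA_ARCH_LIST` environment variable before trying detection, so setting it explicitly is one way around this. The sketch below only formats such a value. The helper name is made up, and the compute capability (8.6) is an example, not something taken from the report.

```python
import os


def arch_list_value(capabilities, ptx_for_last=True):
    """Format compute capabilities like [(8, 6)] as a TORCH_CUDA_ARCH_LIST value."""
    if not capabilities:
        raise ValueError("at least one compute capability is required")
    entries = [f"{major}.{minor}" for major, minor in sorted(set(capabilities))]
    if ptx_for_last:
        # Also emit PTX for the newest listed architecture so newer GPUs can JIT-compile it.
        entries[-1] += "+PTX"
    return ";".join(entries)


value = arch_list_value([(8, 6)])  # example capability; use the target GPU's own
os.environ.setdefault("TORCH_CUDA_ARCH_LIST", value)
print(value)
```

In a Dockerfile the same value would typically be set with an `ENV` line before the step that runs `setup_cuda.py`.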
The following command:
python repositories/GPTQ-for-LLaMa/llama.py /path/to/my/text-generation-webui/models/llama-7b c4 --wbits 4 --save llama-7b-4bit.pt
Fails with:
Traceback (most recent call last):
File "/path/to/my/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 410, in <module>
quantizers = llama_sequential(model, dataloader, DEV)
File "/usr/local/lib64/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/path/to/my/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 88, in llama_sequential
outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in call_impl
return forward_call(*input, **kwargs)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in call_impl
return forward_call(*input, **kwargs)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 228, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, offset=offset)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 142, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 136, in rotate_half
return torch.cat((-x2, x1), dim=-1)
RuntimeError: Tensors must have same number of dimensions: got 3 and 4
Any thoughts on what might be wrong? That source model works just fine on its own.
I've tried downloading models converted by others and using them in text-generation-webui. The server starts fine, but whenever I click generate, I again get the same "Tensors must have same number of dimensions" error.
In case it matters, I do have two GPUs. But setting CUDA_VISIBLE_DEVICES=0 doesn't help, neither with llama.py nor text-generation-webui's server.py. Nor does --gpu_memory 5 5 or similar help in text-generation-webui.
I am not sure how to deal with this.
Python 3.10.9 on Arch Linux.
[0] # CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
Loading checkpoint shards: 100%|██████████| 33/33 [00:09<00:00, 3.60it/s]
Downloading readme: 100%|██████████| 2.38k/2.38k [00:00<00:00, 3.79MB/s]
Downloading and preparing dataset json/allenai--c4 to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data: 100%|██████████| 319M/319M [01:13<00:00, 4.36MB/s]
Downloading data files: 100%|██████████| 1/1 [01:14<00:00, 74.30s/it]
Extracting data files: 100%|██████████| 1/1 [00:02<00:00, 2.50s/it]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Downloading and preparing dataset json/allenai--c4 to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data: 100%|██████████| 40.5M/40.5M [00:06<00:00, 6.15MB/s]
Downloading data files: 100%|██████████| 1/1 [00:07<00:00, 7.63s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 3.17it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Downloading (…)okenizer_config.json: 100%|██████████| 141/141 [00:00<00:00, 175kB/s]
Traceback (most recent call last):
File "/root/llama/GPTQ-for-LLaMa/llama.py", line 393, in <module>
dataloader, testloader = get_loaders(
File "/root/llama/GPTQ-for-LLaMa/datautils.py", line 111, in get_loaders
return get_c4(nsamples, seed, seqlen, model)
File "/root/llama/GPTQ-for-LLaMa/datautils.py", line 64, in get_c4
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
File "/root/llama/venv/lib/python3.10/site-packages/transformers-4.27.0.dev0-py3.10.egg/transformers/models/auto/tokenization_auto.py", line 676, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.
(venv)
[0] # pip list
Package Version Editable project location
------------------------ ------------------ ---------------------------------------------
aiohttp 3.8.4
aiosignal 1.3.1
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.2.0
certifi 2022.12.7
charset-normalizer 3.1.0
datasets 2.10.1
dill 0.3.6
exceptiongroup 1.1.0
expecttest 0.1.4
filelock 3.9.0
frozenlist 1.3.3
fsspec 2023.3.0
huggingface-hub 0.13.1
hypothesis 6.68.2
idna 3.4
Jinja2 3.1.2
MarkupSafe 2.1.2
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
networkx 3.0
numpy 1.24.2
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
packaging 23.0
pandas 1.5.3
pip 22.3.1
psutil 5.9.4
pyarrow 11.0.0
python-dateutil 2.8.2
pytz 2022.7.1
PyYAML 6.0
quant-cuda 0.0.0
regex 2022.10.31
requests 2.28.2
responses 0.18.0
setuptools 65.5.0
six 1.16.0
sortedcontainers 2.4.0
sympy 1.11.1
tokenizers 0.13.2
torch 2.1.0a0+gitc7bd9b9 /root/llama/venv/lib/python3.10/site-packages
tqdm 4.65.0
transformers 4.27.0.dev0
types-dataclasses 0.6.6
typing_extensions 4.5.0
urllib3 1.26.14
wheel 0.38.4
xxhash 3.2.0
yarl 1.8.2
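The `ValueError` in the traceback above means the class name stored in the model's `tokenizer_config.json` is not one the installed transformers build can import. One workaround that is sometimes suggested is to edit that field in a local copy of the config so it matches the class name the installed version actually provides. The sketch below shows only the edit. Which name is correct (`LLaMATokenizer` or `LlamaTokenizer`) depends on the installed transformers revision and is an assumption here, and `set_tokenizer_class` is a made-up helper.

```python
import json
import tempfile
from pathlib import Path


def set_tokenizer_class(config_path, new_name):
    """Rewrite the `tokenizer_class` field of a tokenizer_config.json; return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old = config.get("tokenizer_class")
    config["tokenizer_class"] = new_name
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return old


# Demonstration on a throwaway file rather than a real model directory.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "tokenizer_config.json"
    cfg.write_text(json.dumps({"tokenizer_class": "LLaMATokenizer"}), encoding="utf-8")
    print(set_tokenizer_class(cfg, "LlamaTokenizer"))  # target name is an assumption
```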
Getting
RuntimeError: The current installed version of g++ (12.2.1) is greater than the maximum required version by CUDA 11.7 (11.5.0). Please make sure to use an adequate version of g++ (>=6.0.0, <=11.5.0).
when trying to run
python setup_cuda.py install
Using Nobara Linux (Fedora with CUDA patched kernel). What's the best approach to proceed?
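The error message states the accepted range directly: g++ must be at least 6.0.0 and at most 11.5.0 for CUDA 11.7. The small check below, with made-up helper names and the bounds copied from that message, shows how a candidate compiler version could be tested before pointing the build at it.

```python
import re


def parse_version(text):
    """Turn a version string such as '12.2.1' into a comparable tuple (12, 2, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def in_supported_range(version, minimum="6.0.0", maximum="11.5.0"):
    """Bounds default to the ones quoted in the error message for CUDA 11.7."""
    return parse_version(minimum) <= parse_version(version) <= parse_version(maximum)


print(in_supported_range("12.2.1"))  # the system compiler from the report
print(in_supported_range("11.3.0"))  # a hypothetical older g++ installed alongside it
```

A common approach is to install an older g++ next to the system one and select it for the build, for example through the `CC` and `CXX` environment variables. Whether that is enough on Nobara has not been verified here.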
See: https://github.com/tloen/alpaca-lora/blob/main/generate.py
Tried modifying the code to look like this, but no luck initially.
from peft import PeftModel
from transformers import LLaMATokenizer, LLaMAForCausalLM, GenerationConfig
tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
Currently using https://huggingface.co/decapoda-research/llama-13b-hf for the .config file and https://huggingface.co/decapoda-research/llama-13b-hf-int4 for the actual model file.
On load, following the readme, I get this error.
Missing key(s) in state_dict: "model.decoder.embed_tokens.weight", "model.decoder.layers.0.self_attn.q_proj.zeros", "model.decoder.layers.0.self_attn.q_proj.scales", etc...
Using transformers Version: 4.27.0.dev0
Using torch Version: 1.13.1
Using datasets Version: 2.10.1
and on CUDA release 11.7
As far as I can tell, everything has been set up properly, and the kernel compiles.
EDIT:
After downgrading to 1.12.1, I'm being told 3060 doesn't work with the version of pytorch. It's still giving me the error, though.
EDIT 2: Going to downgrade CUDA to 11.3 to see if it does anything.
Facebook published expected results for the WinoGrande test, with a score of 70 for the 7B model.
I wrote a small script (see #40) that fetches the dataset from datasets and runs the tests.
Because the prompt and parameters were not published (see meta-llama/llama#188), I wrote a prompt myself. It is probably not very good, but it was the only version that worked at all.
The problem: with the 4-bit 7B model I only get about 48%, which means the model is no better than random.
So something is off. One or more of:
As I am new to this topic, it can very well be a problem on my end.
So I would like to get help fixing the prompt/script and would also like to see results for other models:
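Since each WinoGrande item offers two options, chance accuracy is 50%. A quick way to judge whether a score differs from chance is a normal-approximation z-score, sketched below. The function name and the example count of 1000 items are placeholders, not the real size of the evaluation set.

```python
import math


def z_score_vs_chance(accuracy, n_examples, chance=0.5):
    """Normal-approximation z-score of an observed accuracy against a chance rate."""
    std_err = math.sqrt(chance * (1 - chance) / n_examples)
    return (accuracy - chance) / std_err


# n_examples=1000 is a placeholder; substitute the number of items actually evaluated.
print(round(z_score_vs_chance(0.48, 1000), 2))
print(round(z_score_vs_chance(0.70, 1000), 2))
```

A z-score within about two of zero, as for 48% on a set of that size, cannot be told apart from guessing. That would point at the prompt, the scoring, or the quantized model rather than at noise.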
I have tried quantizing galactica-30b with this command:
CUDA_VISIBLE_DEVICES=0 python opt.py /models/galactica-30b --wbits 4 --save galactica-30b-4bit.pt c4
And then using it in the web UI with this one:
python server.py --listen --gptq-bits 4 --model galactica-30b --gptq-model-type opt
The results look very bad. For the prompt
The top 10 equations of all time are:
I get the completion
The top 10 equations of all time are:
- The equation that has been used the most is a simple one, namely y=x; it was also found in our earlier study on elementary functions and integrals[ A Study On Solving Equations With New Methods For Symbolic Integration And Differentiation Of Computer Algebra Systems In General Purpose Calculators By Using Microsoft Excel VBA Programming Language”] as well) but now we see its usage even more frequently than before! It should be noted here though what can not happen to this very useful function since by using MS-Excel’s own builtin “y=m*n+b/(cde...ghijklmnopqrstuvwxyz|{}~–“the user will get exactly zero result for any number he may enter into x variable!! So there must exist some kindred method which gives us results like those obtained from just mentioned formula above with no problems at least regarding division operation involved.. One such possibility might come out if you look carefully enough around your office
Am I doing something wrong?
Hi,
When running python setup_cuda.py install
I get the following error:
running install
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
error: [WinError 2] The system cannot find the file specified
I have no idea why this is happening. Any help would be appreciated.
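`[WinError 2]` during `build_ext` generally means the build tried to launch an executable that Windows could not find. Which one is missing is not visible in this log, so the list below is a guess at the usual candidates (`nvcc` from the CUDA toolkit, `cl` from MSVC, `ninja`). `missing_tools` is a made-up helper for checking what is on `PATH`.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the tool names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


# Candidate executables are an assumption; adjust the list to the actual toolchain.
print(missing_tools(["nvcc", "cl", "ninja"]))
```

On Windows, `cl` is usually only on `PATH` inside a Visual Studio developer command prompt, which may be relevant here.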
From the research paper and the tables in the README, it looks like group size 64 is very effective in improving the quality of the models. This is most noticeable in the smaller models and in the 3-bit version.
The tables suggest that group size is usable, but the README also states that group size cannot be used with CUDA. Doesn't this whole project need CUDA? I built a group-size-64 model, but I cannot run the benchmark or inference.
Is group size usable? If so, how?
Packing ...
model.decoder.layers.0.self_attn.q_proj
Traceback (most recent call last):
File "/root/convert/GPTQ-for-LLaMa/llama.py", line 420, in <module>
llama_pack3(model, quantizers)
File "/root/convert/GPTQ-for-LLaMa/llama.py", line 216, in llama_pack3
qlayers[name].pack(layers[name], quantizers[name].scale, quantizers[name].zero)
File "/root/convert/GPTQ-for-LLaMa/quant.py", line 142, in pack
self.bias = linear.bias.clone()
AttributeError: 'NoneType' object has no attribute 'clone'
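The traceback ends in `linear.bias.clone()` on a layer whose bias is `None`. In the Hugging Face LLaMA implementation the projection layers are created without a bias term, so a `pack` method written for models with biases needs a guard. The sketch below uses small stand-in classes (all hypothetical) to show the guard. It is not the repository's actual fix.

```python
from types import SimpleNamespace


class FakeTensor(list):
    """Stand-in for a tensor; only `clone` is needed for this illustration."""

    def clone(self):
        return FakeTensor(self)


class PackedLinear:
    """Hypothetical stand-in for the quantized layer; only bias handling is shown."""

    def __init__(self):
        self.bias = None

    def pack(self, linear):
        # Copy the bias only when the source layer actually has one.
        self.bias = linear.bias.clone() if linear.bias is not None else None


with_bias = PackedLinear()
with_bias.pack(SimpleNamespace(bias=FakeTensor([0.1, 0.2])))
without_bias = PackedLinear()
without_bias.pack(SimpleNamespace(bias=None))
print(with_bias.bias, without_bias.bias)
```

Any code that later adds the bias during the forward pass would need the same `None` check.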
I am quite new to LLMs. I checked the LLaMA GitHub repo and the Hugging Face repo. I found that https://huggingface.co/decapoda-research/llama-7b-hf/tree/main has multiple .bin files, but the GitHub repo's download.sh does not seem to produce files like that.
Now I am quite lost, as I cannot find a guide on how to use them anywhere (maybe this is common knowledge in LLM usage?).
Could you teach me how to deal with the .bin files and save them to decapoda-research/llama-7b-hf, as in your example?
FP8 would enable greater dynamic range than Int8, and less information loss during compression. It would require GPUs with 2x more memory than Int4, for those who can afford it.
When building with (textgen_3.10.venv) PS C:\g\GPTQ-for-LLaMa> python setup_cuda.py install
Creating library C:\g\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda.cp310-win_amd64.lib and object C:\g\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda.cp310-win_amd64.exp
quant_cuda.obj : error LNK2001: unresolved external symbol __imp___tls_offset_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
quant_cuda_kernel.obj : error LNK2001: unresolved external symbol __imp___tls_offset_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
quant_cuda.obj : error LNK2001: unresolved external symbol __imp___tls_index_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
quant_cuda_kernel.obj : error LNK2001: unresolved external symbol __imp___tls_index_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
build\lib.win-amd64-cpython-310\quant_cuda.cp310-win_amd64.pyd : fatal error LNK1120: 2 unresolved externals
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.34.31933\\bin\\HostX64\\x64\\link.exe' failed with exit code 1120
(textgen_3.10.venv) PS C:\g\GPTQ-for-LLaMa> python setup_cuda.py install
I am on CUDA 12.0
(textgen_3.10.venv) PS C:\g\GPTQ-for-LLaMa> nvidia-smi.exe
Sat Mar 11 17:49:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.02 Driver Version: 528.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
Running into an error:
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
error: [WinError 2] The system cannot find the file specified
First a big thanks for this amazing effort!
I was just trying to fine-tune this 4-bit model under the transformers framework. The model loaded successfully and the training loop ran, but the loss became nan after a single loss.backward().
Here is my code:
import sys
from pathlib import Path

import torch
from tqdm import tqdm

# Make the GPTQ-for-LLaMa checkout importable so that llama.load_quant is available.
sys.path.insert(0, str(Path("/efs-storage/text-generation-webui/repositories/GPTQ-for-LLaMa")))
import llama
load_quant = llama.load_quant
# Arguments: base model directory, 4-bit checkpoint, weight bit width.
model = load_quant(
    "/file/llama-7b-hf",
    '/file/llama-7b-hf-int4/llama-7b-4bit.pt',
    4
)
model.to(device)
....
for batch in tqdm(dataloader):
    model.train()
    input_ids = batch[0]['input_ids'].squeeze(1).to(device)
    attention_mask = batch[0]['attention_mask'].squeeze(1).to(device)
    tgt_labels = batch[1].squeeze(1).to(device)
    loss = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=tgt_labels
    ).loss
    model.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.empty_cache()
And here is what my loss looks like after training on one batch:
tensor(nan, device='cuda:1', dtype=torch.float16, grad_fn=<NllLossBackward0>)
Is there any way to fine-tune the 4-bit model? Thanks!
python llama_inference.py ./llama-7b-hf --wbits 4 --load ./llama-7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
File "/root/GPTQ-for-LLaMa/llama_inference.py", line 114, in <module>
tokenizer = AutoTokenizer.from_pretrained(args.model)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 679, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
return cls._from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 72, in __init__
self.sp_model.Load(vocab_file)
File "/opt/conda/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/opt/conda/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
File "llama_inference.py", line 115, in <module>
generated_ids = model.generate(
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers-4.27.0.dev0-py3.8.egg/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/root/miniconda3/lib/python3.8/site-packages/transformers-4.27.0.dev0-py3.8.egg/transformers/generation/utils.py", line 2504, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I tried installing in an NVIDIA Docker container, and the generated ninja build includes an incorrect SM architecture flag such as -gencode arch=compute_52,code=sm_52
# Install kernels
python setup_cuda.py install
cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14
These flags should be OK.
setup_cuda.py should be changed, though: sm_89, the architecture of the RTX 4090, is not among its parameters.
from setuptools import setup, Extension
from torch.utils import cpp_extension
nvcc_args = [
'-gencode', 'arch=compute_80,code=sm_80',
'-gencode', 'arch=compute_86,code=sm_86',
'-gencode', 'arch=compute_90,code=sm_90'
]
setup(
name='quant_cuda',
ext_modules=[cpp_extension.CUDAExtension(
'quant_cuda', ['quant_cuda.cpp', 'quant_cuda_kernel.cu'], extra_compile_args={'nvcc': nvcc_args}
)],
cmdclass={'build_ext': cpp_extension.BuildExtension}
)
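As a rough sketch of how the nvcc_args list above could be generated rather than typed by hand, the hypothetical helper below (gencode_flags is not part of the repository) expands compute capabilities into -gencode pairs. It adds 8.9, the compute capability of the RTX 4090; this assumes a CUDA toolkit recent enough to know that architecture (11.8 or newer, as far as I know).

```python
from typing import Iterable, List


def gencode_flags(capabilities: Iterable[str]) -> List[str]:
    """Turn compute capabilities such as "8.6" into nvcc -gencode arguments.

    Hypothetical helper for illustration; not part of GPTQ-for-LLaMa.
    """
    flags: List[str] = []
    for cap in capabilities:
        major, minor = cap.split(".")
        arch = f"{major}{minor}"
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


# 8.9 is the compute capability of the RTX 4090.
nvcc_args = gencode_flags(["8.0", "8.6", "8.9", "9.0"])
print(nvcc_args)
```

The resulting list can be passed as extra_compile_args={'nvcc': nvcc_args}, exactly as in the setup script above.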
Hello, while trying to run python setup_cuda.py install, I get this error:
(venv) C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa>python setup_cuda.py install
running install
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:387: UserWarning: The detected CUDA version (11.4) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'quant_cuda' extension
"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\TH -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpquant_cuda.cpp /Fobuild\temp.win-amd64-cpython-310\Release\quant_cuda.obj /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0
quant_cuda.cpp
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
Then after a long list of errors, I get this at the end:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc" -c quant_cuda_kernel.cu -o build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\TH -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 
-gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --use-local-env
quant_cuda_kernel.cu
C:/Users/Username/Documents/GitHub/GPTQ-for-LLaMa/venv/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here
2 errors detected in the compilation of "quant_cuda_kernel.cu".
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.4\\bin\\nvcc.exe' failed with exit code 1
Any idea what could be causing this? I've tried installing CUDA Toolkit 11.3 and Torch 1.12.1, but they too give the same error.
There was talk of this in oobabooga/text-generation-webui#177. Creating an issue to track here.
Problem: GPTQ-for-LLaMA appears to use a relatively large amount of VRAM in addition to the model sizes. This negates some of the size reduction benefit of using low-bit quantization. A large portion of this VRAM may be due to the attention mechanism used.
Solution: A more efficient Attention mechanism would further reduce the VRAM requirements of GPTQ-for-LLaMA.
I had a quick glance at the GPTQ paper yesterday, but haven't dug into details yet.
Do you think it is possible to demonstrate a simple routine for performing quantization using this method?
For example, what is the most trivial way (not necessarily optimal) to implement a function like this:
// src - input 32-bit floats
// dst - output quantized data
// n - number of input floats
void quantize_gptq(float * src, void * dst, int n);
If I can get a prototype of this and it does not look too complex, I can try to plug it in ggml.
The main challenge will be to implement it efficiently with SIMD, but I need to see some initial implementation to work on.
Originally posted by @ggerganov in ggerganov/llama.cpp#9 (comment)
@qwopqwop200 This is for a related project. I thought you might be qualified to answer the question above.
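To make the shape of such a routine concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization of one block of floats. It is deliberately not GPTQ: GPTQ additionally uses calibration inputs to compensate for rounding error, which a signature taking only the weights cannot express. The function names, the per-block scale/minimum layout, and the two-values-per-byte packing are all assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_rtn_4bit(src: List[float]) -> Tuple[bytes, float, float]:
    """Round-to-nearest 4-bit quantization of one non-empty block.

    Returns (packed nibbles, scale, minimum). Baseline only: no GPTQ-style
    error compensation is performed.
    """
    lo, hi = min(src), max(src)
    # 15 = 2**4 - 1 quantization steps; fall back to 1.0 for a constant block.
    scale = (hi - lo) / 15 or 1.0
    q = [min(15, max(0, round((x - lo) / scale))) for x in src]
    if len(q) % 2:
        q.append(0)  # pad so values pack two per byte
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))
    return packed, scale, lo


def dequantize_rtn_4bit(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Inverse of quantize_rtn_4bit; n is the original number of floats."""
    out: List[float] = []
    for byte in packed:
        out.append((byte & 0x0F) * scale + lo)  # low nibble first
        out.append((byte >> 4) * scale + lo)    # then high nibble
    return out[:n]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
packed, scale, lo = quantize_rtn_4bit(weights)
print(dequantize_rtn_4bit(packed, scale, lo, len(weights)))
```

Each reconstructed value differs from its input by at most half a quantization step, which is the error GPTQ tries to reduce further at the layer level.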
I'm trying to get LLaMa 30B, quantized to 4 bits, to run with 12 GB of VRAM, and I'm hitting OOM since the model is a bit more than 16 GB.
Is it possible to use offloading with GPTQ to load a percentage of the model onto the CPU?
Hi @qwopqwop200,
I'm trying to convert LLaMA HF models to 4bit, all files being local (input and output)
I get this error about hf token:
python llama.py /media/alex/Daemon/AI/LLaMA-HF/llama-7b c4 --wbits 4 --save /media/alex/Daemon/AI/LLaMA-HF/llama-7b/llama-7b-4bit.pt
Loading checkpoint shards: 100%|████████████████████| 33/33 [00:03<00:00, 8.45it/s]
Traceback (most recent call last):
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 401, in <module>
dataloader, testloader = get_loaders(
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/datautils.py", line 111, in get_loaders
return get_c4(nsamples, seed, seqlen, model)
File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/datautils.py", line 56, in get_c4
traindata = load_dataset(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1759, in load_dataset
builder_instance = load_dataset_builder(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1496, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1218, in dataset_module_factory
raise e1 from None
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1185, in dataset_module_factory
raise e
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1158, in dataset_module_factory
dataset_info = hf_api_dataset_info(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/utils/_hf_hub_fixes.py", line 152, in dataset_info
return hf_api.dataset_info(repo_id, revision=revision, timeout=timeout, use_auth_token=use_auth_token)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1676, in dataset_info
headers = self._build_hf_headers(token=token)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 4205, in _build_hf_headers
return build_hf_headers(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_headers.py", line 117, in build_hf_headers
token_to_send = get_token_to_send(token)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_headers.py", line 149, in get_token_to_send
raise EnvironmentError(
OSError: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.
What am I missing? I'm stuck
Thank you!
Ale
Hello. I tested the OPT-2.7B-Erebus model with and without 4-bit quantization, using the same prompt.
I got the following results on an RTX 4090:
Original model:
Loading OPT-2.7B-Erebus...
Loaded the model in 3.33 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 9.10 seconds (21.97 tokens/s, 200 tokens)
4-bit quantized model:
Loading OPT-2.7B-Erebus...
Loading model ...
Done.
Loaded the model in 1.00 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 57.77 seconds (3.46 tokens/s, 200 tokens)
Notice the large performance drop: generation is roughly 6x slower (3.46 vs. 21.97 tokens/s).
Do you have any ideas what might be causing it?
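For reference, the slowdown can be recomputed from the two log lines above. This is only arithmetic on the reported numbers (200 tokens in 9.10 s and in 57.77 s); the function name is made up for the example.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput as reported by the web UI: generated tokens per second."""
    return tokens / seconds


fp16 = tokens_per_second(200, 9.10)    # original model
int4 = tokens_per_second(200, 57.77)   # 4-bit quantized model
slowdown = fp16 / int4

print(f"{fp16:.2f} tok/s vs {int4:.2f} tok/s, slowdown {slowdown:.1f}x")
```

The ratio comes out a little above 6, so the gap is somewhat larger than a 5x estimate suggests.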
I'm running llama 65b on dual 3090s and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It seems my CPU is only using a single core and maxing it out to 100%... Is there something it's doing that's heavily serialized? ... Any way to parallelize the workflow?
I successfully loaded the 7b llama model in 4bit, but when I try to generate some text this happens:
Starting the web UI...
Loading the extension "gallery"... Ok.
Loading llama-7b...
CUDA extension not installed.
Loading model ...
Done.
Loaded the model in 4.07 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\routes.py", line 374, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\blocks.py", line 1017, in process_api
result = await self.call_function(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\blocks.py", line 849, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\utils.py", line 453, in async_iteration
return next(iterator)
File "C:\Users\zblac\Downloads\oobabooga\text-generation-webui\modules\chat.py", line 126, in chatbot_wrapper
for reply in generate_reply(f"{prompt}{' ' if len(reply) > 0 else ''}{reply}", max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, eos_token=eos_token, stopping_string=f"\n{name1}:"):
File "C:\Users\zblac\Downloads\oobabooga\text-generation-webui\modules\text_generation.py", line 170, in generate_reply
output = eval(f"shared.model.generate({', '.join(generate_params)}){cuda}")[0]
File "<string>", line 1, in <module>
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
return self.sample(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2468, in sample
outputs = self(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 772, in forward
outputs = self.model(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 218, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 198, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
NameError: name 'quant_cuda' is not defined
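The log above prints "CUDA extension not installed." at load time and only fails later with a NameError, which suggests the import of quant_cuda is attempted and its failure is tolerated. A minimal sketch of a guard that fails earlier with a clearer message might look like the following; try_import and require_quant_cuda are hypothetical names, not functions from the repository.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# None when the compiled kernels have not been built and installed.
quant_cuda = try_import("quant_cuda")


def require_quant_cuda() -> ModuleType:
    """Hypothetical guard: raise a descriptive error instead of a NameError."""
    if quant_cuda is None:
        raise RuntimeError(
            "quant_cuda is not installed; build it with "
            "`python setup_cuda.py install` in the GPTQ-for-LLaMa directory"
        )
    return quant_cuda
```

In practice the underlying fix is still to build the extension in the same environment the web UI runs in.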
Hi,
I'm trying to quantise 65B on a server with 8x V100. Obviously, that's not going to fit in VRAM on any single GPU 😅
Is it possible to use more than one GPU for quantisation, or load and quantise layer-by-layer?
I've tried on CPU, but I get the error:
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
Thanks! (and great work!)
Trying to test these models in 4bit but having an issue running the benchmark on the compressed file.
The command
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --load llama7b-4bit.pt --benchmark 2048 --check
fails on my machine with the log
Benchmarking ...
Traceback (most recent call last):
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/llama.py", line 410, in <module>
benchmark(model, input_ids, check=args.check)
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/llama.py", line 309, in benchmark
out = model(
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 857, in forward
outputs = self.model.decoder(
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 629, in forward
layer_outputs = decoder_layer(
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
result = forward_call(*input, **kwargs)
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 199, in forward
attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
For extra information the kernel test completes successfully
Benchmarking LLaMa-33B FC2 matvec ...
FP16: 0.0007241859436035157
4bit: 0.00010311341285705566
Verifiying kernel correctness ...
Simu: tensor([-0.5063, -0.5161, 0.7185, ..., -0.0747, -0.4014, 0.3023],
device='cuda:0')
Kern: tensor([-0.5063, -0.5161, 0.7185, ..., -0.0747, -0.4014, 0.3023],
device='cuda:0')
The benchmark also completes successfully when just using the normal HF model without 4bit conversion.
$ python server.py --load-in-4bit --model llama-13b-hf
Loading llama-13b-hf...
Loading model ...
Traceback (most recent call last):
File "/mnt/e/Projects/text-generation-webui/server.py", line 191, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/mnt/e/Projects/text-generation-webui/modules/models.py", line 119, in load_model
model = load_quant(path_to_model, Path(f"models/{pt_model}"), 4)
File "/mnt/e/Projects/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 241, in load_quant
if checkpoint.endswith('.safetensors'):
AttributeError: 'PosixPath' object has no attribute 'endswith'
Getting this issue when trying to run on Win11 under WSL2 with text generation WebUI. Behaviour is different prior to 68cfaf9, though still broken.
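The AttributeError arises because the caller passes a pathlib.Path while the check uses the str method endswith. One way to make such a check accept both types is sketched below; is_safetensors is a hypothetical helper written for this example, not the repository's code.

```python
from pathlib import Path
from typing import Union


def is_safetensors(checkpoint: Union[str, Path]) -> bool:
    """Hypothetical helper: True if the checkpoint path ends in .safetensors.

    Works for both str and Path, unlike calling .endswith() directly.
    """
    return Path(checkpoint).suffix == ".safetensors"


print(is_safetensors(Path("models/llama-13b-4bit.safetensors")))
print(is_safetensors("models/llama-13b-4bit.pt"))
```

Alternatively, the caller could wrap the path in str() before passing it to load_quant.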
Output of conda list:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
blas 1.0 mkl
brotlipy 0.7.0 py310h7f8727e_1002
bzip2 1.0.8 h7b6447c_0
c-ares 1.18.1 h7f8727e_0
ca-certificates 2023.01.10 h06a4308_0
certifi 2022.12.7 py310h06a4308_0
cffi 1.15.1 py310h5eee18b_3
charset-normalizer 2.0.4 pyhd3eb1b0_0
cryptography 39.0.1 py310h9ce1e76_0
curl 7.88.1 h5eee18b_0
datasets 2.10.1 pypi_0 pypi
dill 0.3.6 pypi_0 pypi
expat 2.4.9 h6a678d5_0
ffmpeg 4.3 hf484d3e_0 pytorch
flit-core 3.6.0 pyhd3eb1b0_0
freetype 2.12.1 h4a9f257_0
gdbm 1.18 hd4cb3f1_4
gettext 0.21.0 hf68c758_0
giflib 5.2.1 h5eee18b_3
git 2.34.1 pl5262hc120c5b_0
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
icu 58.2 he6710b0_3
idna 3.4 py310h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
jpeg 9e h5eee18b_1
krb5 1.19.4 h568e23c_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libcurl 7.88.1 h91b91d3_0
libdeflate 1.17 h5eee18b_0
libedit 3.1.20221030 h5eee18b_0
libev 4.33 h7f8727e_1
libffi 3.4.2 h6a678d5_6
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libnghttp2 1.46.0 hce63b2e_0
libpng 1.6.39 h5eee18b_0
libssh2 1.10.0 h8f2d780_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.16.0 h27cfd23_0
libtiff 4.5.0 h6a678d5_2
libunistring 0.9.10 h27cfd23_0
libuuid 1.41.5 h5eee18b_0
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
libxml2 2.9.14 h74e7548_0
lz4-c 1.9.4 h6a678d5_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py310h7f8727e_0
mkl_fft 1.3.1 py310hd6ae3a3_0
mkl_random 1.2.2 py310h00e6091_0
multiprocess 0.70.14 pypi_0 pypi
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
numpy 1.23.5 py310hd5efca6_0
numpy-base 1.23.5 py310h8e6c178_0
openh264 2.1.1 h4ff587b_0
openssl 1.1.1t h7f8727e_0
pcre2 10.37 he7ceb23_1
perl 5.34.0 h5eee18b_2
pillow 9.4.0 py310h6a678d5_0
pip 23.0.1 py310h06a4308_0
pyarrow 11.0.0 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pydub 0.25.1 pypi_0 pypi
pyopenssl 23.0.0 py310h06a4308_0
pyparsing 3.0.9 pypi_0 pypi
pysocks 1.7.1 py310h06a4308_0
python 3.10.9 h7a1cb2a_2
pytorch 1.13.1 py3.10_cpu_0 pytorch
pytorch-mutex 1.0 cpu pytorch
pytz 2022.7.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
quant-cuda 0.0.0 pypi_0 pypi
readline 8.2 h5eee18b_0
requests 2.28.1 py310h06a4308_0
responses 0.18.0 pypi_0 pypi
rfc3986 2.0.0 pypi_0 pypi
safetensors 0.3.0 pypi_0 pypi
sentencepiece 0.1.97 pypi_0 pypi
setuptools 65.6.3 py310h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.40.1 h5082296_0
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.2 pypi_0 pypi
torchaudio 0.13.1 py310_cpu pytorch
torchvision 0.14.1 py310_cpu pytorch
typing_extensions 4.4.0 py310h06a4308_0
tzdata 2022g h04d1e81_0
urllib3 1.26.14 py310h06a4308_0
wheel 0.38.4 py310h06a4308_0
xxhash 3.2.0 pypi_0 pypi
xz 5.2.10 h5eee18b_1
zlib 1.2.13 h5eee18b_0
zstd 1.5.2 ha4553b6_0
I haven't tried this yet, but I wondered whether it is possible to save the quantized checkpoints. Or does it quantize the model every time you run it?
I believe that we can achieve further optimisation beyond even 4bit quantization with selective quantization of specifically chosen layers down to 2bits.
See: https://arxiv.org/abs/2203.08368
By selectively quantizing 50% of the layers down to 2bits, it may even be possible to run 65B Llama on a 24GB VRAM gpu.
I don't know precisely which layers would work best (finding out may be an arduous process of trial and error). Perhaps the best thing to do would be to let the user specify the level of quantization they want for each layer.
4bit is not the end of the road.
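A back-of-the-envelope check of the 24GB claim, as a sketch under loud assumptions: a nominal 65e9 parameters, all layers the same size, and weights only. It ignores quantization metadata (scales and zero points), any unquantized tensors such as embeddings, activations, and the KV cache, so real usage would be higher.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 65e9  # nominal parameter count, an assumption

all_4bit = weight_gib(N_PARAMS, 4)
# Half the weights at 4 bits and half at 2 bits averages 3 bits per weight.
mixed = weight_gib(N_PARAMS, 0.5 * 4 + 0.5 * 2)

print(f"all 4-bit:       {all_4bit:.1f} GiB")
print(f"50% 4-bit/2-bit: {mixed:.1f} GiB")
```

Under these assumptions the mixed scheme lands just under 24 GiB for the weights, leaving very little headroom for everything else.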
How hard is it to implement quantizing other models to 4 bit? I see there is already a Python file for bloom, but that model only comes in really small and really big flavors.
If we could convert other 13-30b models, it would be a big help. Or is the plan to wait on bitsandbytes?