
qwen's Issues

flash-attention fails to install (build error)

With both Python 3.8 and 3.10 I cannot install flash-attention; the build fails with:

Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

Has anyone else run into this problem?

Bug when tokenizing "<|endoftext|>"

When tokenizing "<|endoftext|>", it is split into multiple tokens instead of the single token 151643.

Script to reproduce:

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
print('encode <|endoftext|>: {}'.format(tokenizer.encode('<|endoftext|>')))

Tokenization result:

encode <|endoftext|>: [27, 91, 8691, 723, 427, 91, 29]

Hope the Qwen team can fix this.
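
For reference, the underlying tiktoken encoding only maps "<|endoftext|>" to a single id when it is explicitly allowed. A minimal workaround sketch, assuming the tokenizer wraps a tiktoken Encoding exposed as tokenizer.tokenizer (the attribute name is inferred from the tokenizer source quoted later in this thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
# tokenizer.tokenizer is assumed to be the wrapped tiktoken Encoding;
# listing "<|endoftext|>" in allowed_special makes tiktoken emit its single id
ids = tokenizer.tokenizer.encode('<|endoftext|>', allowed_special={'<|endoftext|>'})
print(ids)  # expected: [151643]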

QLoRA multi-turn dialogue fine-tuning implemented on top of Qwen-7B, with API and Web demo support

First of all, thanks for open-sourcing the Qwen-7B model. I implemented QLoRA multi-turn dialogue fine-tuning based on it; project: https://github.com/hiyouga/LLaMA-Efficient-Tuning

QLoRA instruction fine-tuning:

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --do_train \
    --dataset sharegpt_zh \
    --template chatml \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir qwen_lora \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 3e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --fp16

Web Demo:

python src/web_demo.py \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --template chatml

API deployment (OpenAI-compatible format):

python src/api_demo.py \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --template chatml

Also, I hope the developers can fix the tokenizer's decode method so that it honors the skip_special_tokens parameter, which currently has no effect; this would help downstream development. (Fixed in the latest version.)

Corresponding source location: huggingface.co/Qwen/Qwen-7B-Chat/blob/5e7f6a3f41724e7cb8ea3e3be7a1faf2bd5d6a38/tokenization_qwen.py#L228

def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    return self.tokenizer.decode(token_ids)
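
A minimal sketch of the kind of change requested (illustrative only, not the official fix; it assumes the special-token ids are available on the tokenizer, e.g. via all_special_ids):

from typing import List, Union

def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    if skip_special_tokens:
        # assumption: the tokenizer exposes its special-token ids;
        # drop them before handing the rest to the wrapped tiktoken decoder
        token_ids = [i for i in token_ids if i not in set(self.all_special_ids)]
    return self.tokenizer.decode(token_ids)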

RuntimeError: value cannot be converted to type at::Half without overflow

Both the example in the official README and demo.py in the repo raise this error.

File "/root/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B-Chat/44e46a0f02169a2c4790fbcccec82cd20f4df717/qwen_generation_utils.py", line 349, in call
scores[i, self.eos_token_id] = float(2**30)
RuntimeError: value cannot be converted to type at::Half without overflow
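
For context, float16 can only represent values up to 65504, so writing 2**30 into an fp16 scores tensor overflows. A small illustration (not the repository's fix):

import torch

scores = torch.zeros(1, 4, dtype=torch.half)
# scores[0, 0] = float(2**30)  # raises: value cannot be converted to type at::Half without overflow
scores[0, 0] = torch.finfo(scores.dtype).max  # 65504.0, the largest representable fp16 value
print(scores[0, 0])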

Unable to train with DeepSpeed ZeRO-3

│ /root/.cache/huggingface/modules/transformers_modules/Qwen-7B/modeling_qwen.py:206 in __init__   │
│                                                                                                  │
│    203 │   │   self.use_logn_attn = config.use_logn_attn                                         │
│    204 │   │                                                                                     │
│    205 │   │   logn_list = [math.log(i, self.seq_length) if i > self.seq_length else 1 for i in  │
│ ❱  206 │   │   self.logn_tensor = torch.Tensor(logn_list)[None, :, None, None]                   │
│    207 │   │   self._ntk_cached = 1.0                                                            │
│    208 │   │                                                                                     │
│    209 │   │   self.attn_dropout = nn.Dropout(config.attn_pdrop)                                 │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py:209 in    │
│ new_tensor                                                                                       │
│                                                                                                  │
│    206 def get_new_tensor_fn_for_dtype(dtype: torch.dtype) -> Callable:                          │
│    207 │   def new_tensor(cls, *args) -> Tensor:                                                 │
│    208 │   │   device = torch.device(get_accelerator().device_name(os.environ["LOCAL_RANK"]))    │
│ ❱  209 │   │   tensor = _orig_torch_empty(0, device=device).new_empty(*args)                     │
│    210 │   │   if tensor.is_floating_point():                                                    │
│    211 │   │   │   tensor = tensor.to(dtype)                                                     │
│    212                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: new_empty(): argument 'size' must be tuple of ints, but found element of type float at pos 2049

Please look into this.
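
For reference, the traceback suggests DeepSpeed ZeRO-3 routes the torch.Tensor(...) call through a size-based constructor, so the list elements are interpreted as dimensions. A hedged sketch of one commonly suggested change (the bounds below are placeholders, not the model's actual values):

import math
import torch

seq_length = 2048        # placeholder value for illustration
max_positions = 32768    # placeholder upper bound

logn_list = [
    math.log(i, seq_length) if i > seq_length else 1
    for i in range(1, max_positions + 1)
]
# torch.tensor(...) builds the tensor from data and is not intercepted as a
# size-style constructor the way torch.Tensor(...) is under zero.Init
logn_tensor = torch.tensor(logn_list)[None, :, None, None]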

Request for an int4 quantized model

The unquantized model fills up RAM and crashes in low-memory environments (e.g. Google Colaboratory and Kaggle Notebook), so I hope the team can upload an int4 quantized model to Hugging Face, as ChatGLM-6B did.

Questions about label masking for SFT training

  1. Should the <|im_end|> of the system and user turns be label-masked (label_id set to -100)?
  2. Should the \n after the assistant's <|im_end|> be label-masked?

Test input:

<|im_start|>system
system test<|im_end|>
<|im_start|>user
round 1 query<|im_end|>
<|im_start|>assistant
round 1 answer<|im_end|>
<|im_start|>user
round 2 query<|im_end|>
<|im_start|>assistant
round 2 answer<|im_end|>

Tokenizer output (screenshot omitted).
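
For reference, a common convention (an assumption on my part, not an official answer) is to supervise only the assistant's reply and its <|im_end|>, so the model learns where to stop, while the system/user turns (including their <|im_end|>) and the trailing \n are masked. A minimal sketch, assuming the tokenizer encodes the ChatML special tokens as single ids:

IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_labels(tokenizer, turns):
    # turns: list of (role, text) pairs in ChatML order
    input_ids, labels = [], []
    for role, text in turns:
        prefix_ids = tokenizer.encode(f"<|im_start|>{role}\n")
        reply_ids = tokenizer.encode(f"{text}<|im_end|>")
        newline_ids = tokenizer.encode("\n")
        input_ids += prefix_ids + reply_ids + newline_ids
        if role == "assistant":
            # supervise the reply and its <|im_end|>; mask the prefix and the trailing \n
            labels += [IGNORE_INDEX] * len(prefix_ids) + reply_ids + [IGNORE_INDEX] * len(newline_ids)
        else:
            labels += [IGNORE_INDEX] * (len(prefix_ids) + len(reply_ids) + len(newline_ids))
    return input_ids, labels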

logn attention size does not match

modeling_qwen.py, line 373

seq_end = key.size(0)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]

should be

seq_start = key.size(1) - query.size(1)
seq_end = key.size(1)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]

About running with text-generation-webui (for those earlier asking about a web UI)

Comments on Hugging Face may go unseen, so I'm posting here where it's more active:
When loading Qwen/Qwen-7B-Chat with text-generation-webui, I used the parameters shown in the first screenshot (this machine has a weak GPU but a decent CPU). After loading, only one CPU thread is used by default (second screenshot), most CPU cores sit idle, and inference is extremely slow. I checked the open-source README but found no information about launch parameters. Where can I adjust them so more CPU cores are used for inference? Thanks.
PS: When cloning from Hugging Face with Git, the file qwen.tiktoken was missing by default; I'm not sure whether this is specific to my setup.
(Screenshots omitted.)

Problems this project needs to fix: 1. ...

1. After following the README from start to finish, the project fails to start.
2. After downloading flash-attention, pip install csrc/layer_norm and pip install csrc/rotary both fail.
3. No streaming chat.
4. No web UI.
5. No explanation of how to load a local model, or where the local model path should go; a code example would help (see the sketch below).
6. After setting up the environment as instructed, running python demo.py from a CMD prompt inside the project fails with an error about device_map="auto".
Summary: the documentation should be clearer and more complete (at minimum, following the README from start to finish should produce a working setup).
If this is not improved, adoption will suffer: however many people praise the project, there is no real hands-on review and no video walkthrough, because nobody can get it running from the README.
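
For item 5, a minimal sketch of loading a locally downloaded checkpoint (the path below is a placeholder), following the same API the README uses:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

local_path = "/path/to/Qwen-7B-Chat"  # placeholder: directory containing the downloaded model files
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(local_path, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(local_path, trust_remote_code=True)
response, history = model.chat(tokenizer, "你好", history=None)
print(response)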

Flash attention speedup is small, only about a 5% improvement in inference speed

Hi,
I followed your flash attention installation steps and installed it successfully.
At runtime the log also shows:
use flash_attn rotary
use flash_attn rms_norm

Testing on an A100, installing flash attention gives less than a 5% inference speedup (per-token generation time) compared to not installing it.
So I'd like to ask: in your internal tests, roughly how much performance improvement does flash attention bring?

Out of memory on a 24 GB GPU

Running your demo hits an out-of-memory error. Does that mean only the quantized model is usable?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

What are the hardware requirements for Qwen-7B?

Have the hardware requirements for training and evaluation been published? I need to estimate the required resources, but the markdown files don't seem to state them explicitly. Has anyone seen this information?

No output when the input text is long

First, thanks for open-sourcing the Qwen-7B model!
When using the Chat version, I get no output when the input text is long. The prompt is 4722 characters, and its input_ids after tokenization are 3172 tokens long. I modified the input-length setting in generation_config.json:

  "max_context_size": 4096

But the model's response is an empty string. Stepping through with a debugger, I confirmed that generation is not terminated early because the input is too long; it enters the normal autoregressive decoding loop, and the first two tokens it outputs happen to be two of the tokens in stop_words_ids. The README says an 8K context is supported:

Support of 8K Context Length. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts.

If I truncate the prompt to 3265 characters, output returns to normal. What could cause this? Is it simply degraded quality on long inputs, or am I using the model incorrectly?

How tool calling was evaluated

Very valuable work!
However, there is little material on how tool calling was evaluated. Could the team share the scale of the evaluation, whether training targeted specific APIs, and how well tool selection works for APIs the model has not seen? A reply from the developers would be appreciated, thanks!

fail to save tokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
tokenizer.save_pretrained('checkpoint')

Saving the tokenizer fails with:

    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
TypeError: save_vocabulary() got an unexpected keyword argument 'filename_prefix'
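
The traceback suggests that save_pretrained passes a filename_prefix keyword that the custom save_vocabulary does not accept. A hedged sketch of the kind of signature change involved (illustrative only; the body that writes the vocab file is unchanged and omitted here):

from typing import Optional, Tuple

def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None, **kwargs) -> Tuple[str]:
    # accepting (and, in this sketch, ignoring) filename_prefix lets
    # tokenizer.save_pretrained() call this method without the TypeError above
    ...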

Will a streaming chat interface be provided? I tried adding one myself, but sometimes the output is garbled 😂

I added it to modeling_qwen.py, but sometimes the output contains garbled characters.

Here is the diff; advice would be appreciated.

diff --git a/modeling_qwen.py b/modeling_qwen.py
index cc58746..a0361d9 100644
--- a/modeling_qwen.py
+++ b/modeling_qwen.py
@@ -883,6 +883,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         history: Optional[HistoryType],
         system: str = "You are a helpful assistant.",
         append_history: bool = True,
+        stream: Optional[bool] = False,
     ) -> Tuple[str, HistoryType]:
 
         if history is None:
@@ -902,25 +903,39 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         )
         input_ids = torch.tensor([context_tokens]).to(self.device)
 
-        outputs = self.generate(
-            input_ids,
-            stop_words_ids=stop_words_ids,
-            return_dict_in_generate=False,
-        )
+        if stream:
+            from transformers_stream_generator.main import NewGenerationMixin, StreamGenerationConfig
+            self.__class__.generate = NewGenerationMixin.generate
+            self.__class__.sample_stream = NewGenerationMixin.sample_stream
+            stream_config = StreamGenerationConfig(**self.generation_config.to_dict(), do_stream=True)
 
-        response = decode_tokens(
-            outputs[0],
-            tokenizer,
-            raw_text_len=len(raw_text),
-            context_length=len(context_tokens),
-            chat_format=self.generation_config.chat_format,
-            verbose=False,
-        )
+            def stream_generator():
+                outputs = []
+                for token in self.generate(input_ids, stop_words_ids=stop_words_ids, return_dict_in_generate=False, generation_config=stream_config):
+                    outputs.append(token.item())
+                    yield tokenizer.decode(outputs, skip_special_tokens=True)
+
+            return stream_generator()
+        else:
+            outputs = self.generate(
+                input_ids,
+                stop_words_ids=stop_words_ids,
+                return_dict_in_generate=False,
+            )
+
+            response = decode_tokens(
+                outputs[0],
+                tokenizer,
+                raw_text_len=len(raw_text),
+                context_length=len(context_tokens),
+                chat_format=self.generation_config.chat_format,
+                verbose=False,
+            )
 
-        if append_history:
-            history.append((query, response))
+            if append_history:
+                history.append((query, response))
 
-        return response, history
+            return response, history
 
     def generate(
         self,
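
For reference, one likely cause of the garbled characters (an assumption, not a confirmed diagnosis): decoding the accumulated ids after every step can end on an incomplete multi-byte character, which decode renders as the U+FFFD replacement character. A small sketch of holding back such a partial tail before yielding:

def safe_partial_text(text: str) -> str:
    # an incomplete UTF-8 sequence at the end of the stream decodes to U+FFFD;
    # hold it back until the next token completes it
    return text.rstrip("\ufffd")

# usage inside the stream_generator above:
#   yield safe_partial_text(tokenizer.decode(outputs, skip_special_tokens=True))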

What is the padding token?

Thanks for your amazing work. May I ask what the padding token is in your tokenizer? Without it, I don't think I can fine-tune this model.

flash-attn installation error

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

MPS does not support cumsum op with int64 input

Hi, I'm trying to run the model on an M1 Mac. Because of memory constraints, I added offload_folder and torch_dtype; code below:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("/Users/sniper/model/Qwen-7b-chat", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("/Users/sniper/model/Qwen-7b-chat", device_map="auto",
                                             offload_folder="offload", torch_dtype=torch.float16,
                                             trust_remote_code=True, fp16=True).eval()


model.generation_config = GenerationConfig.from_pretrained("/Users/sniper/model/Qwen-7b-chat",
                                                           trust_remote_code=True)  
# 第一轮对话 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

But the chat call (second-to-last line) raises an error:

 position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: MPS does not support cumsum op with int64 input

What could be causing this?
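
For reference, this usually means the PyTorch MPS backend in use does not implement cumsum for int64 inputs. One commonly suggested workaround (assuming a CPU fallback for the unsupported op is acceptable) is to enable the MPS fallback before torch is imported:

import os
# must be set before importing torch so the MPS backend falls back to CPU
# for operators it does not implement (such as int64 cumsum here)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported after the environment variable is set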
