
qwen's Issues

flash-attention fails to install (build error)

With both Python 3.8 and 3.10 I cannot install flash-attention; the build fails with:

Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

Has anyone else run into this problem?

Bug when tokenizing "<|endoftext|>"

When tokenizing "<|endoftext|>", it is split into multiple tokens instead of the single token 151643.

Script to reproduce:

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
print('encode <|endoftext|>: {}'.format(tokenizer.encode('<|endoftext|>')))

Tokenization result:

encode <|endoftext|>: [27, 91, 8691, 723, 427, 91, 29]

Hope the Qwen team can fix this.
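
For reference, the underlying tiktoken encoding only maps "<|endoftext|>" to a single id when it is explicitly allowed. A minimal workaround sketch, assuming the tokenizer wraps a tiktoken Encoding exposed as tokenizer.tokenizer (the attribute name is inferred from the tokenizer source quoted later in this thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
# tokenizer.tokenizer is assumed to be the wrapped tiktoken Encoding;
# listing "<|endoftext|>" in allowed_special makes tiktoken emit its single id
ids = tokenizer.tokenizer.encode('<|endoftext|>', allowed_special={'<|endoftext|>'})
print(ids)  # expected: [151643]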

QLoRA multi-turn dialogue fine-tuning implemented on top of Qwen-7B, with API and Web demo support

First of all, thanks for open-sourcing the Qwen-7B model. I implemented QLoRA multi-turn dialogue fine-tuning based on it; project: https://github.com/hiyouga/LLaMA-Efficient-Tuning

QLoRA instruction fine-tuning:

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --do_train \
    --dataset sharegpt_zh \
    --template chatml \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir qwen_lora \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 3e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --fp16

Web Demo:

python src/web_demo.py \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --template chatml

API deployment (OpenAI-compatible format):

python src/api_demo.py \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --template chatml

Also, I hope the developers can fix the tokenizer's decode method so that it honors the skip_special_tokens parameter, which currently has no effect; this would help downstream development. (Fixed in the latest version.)

Corresponding source location: huggingface.co/Qwen/Qwen-7B-Chat/blob/5e7f6a3f41724e7cb8ea3e3be7a1faf2bd5d6a38/tokenization_qwen.py#L228

def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    return self.tokenizer.decode(token_ids)
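
A minimal sketch of the kind of change requested (illustrative only, not the official fix; it assumes the special-token ids are available on the tokenizer, e.g. via all_special_ids):

from typing import List, Union

def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    if skip_special_tokens:
        # assumption: the tokenizer exposes its special-token ids;
        # drop them before handing the rest to the wrapped tiktoken decoder
        token_ids = [i for i in token_ids if i not in set(self.all_special_ids)]
    return self.tokenizer.decode(token_ids)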

RuntimeError: value cannot be converted to type at::Half without overflow

Both the example in the official README and demo.py in the repo raise this error.

File "/root/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B-Chat/44e46a0f02169a2c4790fbcccec82cd20f4df717/qwen_generation_utils.py", line 349, in call
scores[i, self.eos_token_id] = float(2**30)
RuntimeError: value cannot be converted to type at::Half without overflow
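
For context, float16 can only represent values up to 65504, so writing 2**30 into an fp16 scores tensor overflows. A small illustration (not the repository's fix):

import torch

scores = torch.zeros(1, 4, dtype=torch.half)
# scores[0, 0] = float(2**30)  # raises: value cannot be converted to type at::Half without overflow
scores[0, 0] = torch.finfo(scores.dtype).max  # 65504.0, the largest representable fp16 value
print(scores[0, 0])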

Unable to train with DeepSpeed ZeRO-3

│ /root/.cache/huggingface/modules/transformers_modules/Qwen-7B/modeling_qwen.py:206 in __init__   │
│                                                                                                  │
│    203 │   │   self.use_logn_attn = config.use_logn_attn                                         │
│    204 │   │                                                                                     │
│    205 │   │   logn_list = [math.log(i, self.seq_length) if i > self.seq_length else 1 for i in  │
│ ❱  206 │   │   self.logn_tensor = torch.Tensor(logn_list)[None, :, None, None]                   │
│    207 │   │   self._ntk_cached = 1.0                                                            │
│    208 │   │                                                                                     │
│    209 │   │   self.attn_dropout = nn.Dropout(config.attn_pdrop)                                 │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py:209 in    │
│ new_tensor                                                                                       │
│                                                                                                  │
│    206 def get_new_tensor_fn_for_dtype(dtype: torch.dtype) -> Callable:                          │
│    207 │   def new_tensor(cls, *args) -> Tensor:                                                 │
│    208 │   │   device = torch.device(get_accelerator().device_name(os.environ["LOCAL_RANK"]))    │
│ ❱  209 │   │   tensor = _orig_torch_empty(0, device=device).new_empty(*args)                     │
│    210 │   │   if tensor.is_floating_point():                                                    │
│    211 │   │   │   tensor = tensor.to(dtype)                                                     │
│    212                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: new_empty(): argument 'size' must be tuple of ints, but found element of type float at pos 2049

Please look into this.
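
For reference, the traceback suggests DeepSpeed ZeRO-3 routes the torch.Tensor(...) call through a size-based constructor, so the list elements are interpreted as dimensions. A hedged sketch of one commonly suggested change (the bounds below are placeholders, not the model's actual values):

import math
import torch

seq_length = 2048        # placeholder value for illustration
max_positions = 32768    # placeholder upper bound

logn_list = [
    math.log(i, seq_length) if i > seq_length else 1
    for i in range(1, max_positions + 1)
]
# torch.tensor(...) builds the tensor from data and is not intercepted as a
# size-style constructor the way torch.Tensor(...) is under zero.Init
logn_tensor = torch.tensor(logn_list)[None, :, None, None]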

Request for an int4 quantized model

The unquantized model fills up RAM and crashes in low-memory environments (e.g. Google Colaboratory and Kaggle Notebook), so I hope the team can upload an int4 quantized model to Hugging Face, as ChatGLM-6B did.

Questions about label masking for SFT training

  1. Should the <|im_end|> of the system and user turns be label-masked (label_id set to -100)?
  2. Should the \n after the assistant's <|im_end|> be label-masked?

Test input:

<|im_start|>system
system test<|im_end|>
<|im_start|>user
round 1 query<|im_end|>
<|im_start|>assistant
round 1 answer<|im_end|>
<|im_start|>user
round 2 query<|im_end|>
<|im_start|>assistant
round 2 answer<|im_end|>

Tokenizer output (screenshot omitted).
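
For reference, a common convention (an assumption on my part, not an official answer) is to supervise only the assistant's reply and its <|im_end|>, so the model learns where to stop, while the system/user turns (including their <|im_end|>) and the trailing \n are masked. A minimal sketch, assuming the tokenizer encodes the ChatML special tokens as single ids:

IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_labels(tokenizer, turns):
    # turns: list of (role, text) pairs in ChatML order
    input_ids, labels = [], []
    for role, text in turns:
        prefix_ids = tokenizer.encode(f"<|im_start|>{role}\n")
        reply_ids = tokenizer.encode(f"{text}<|im_end|>")
        newline_ids = tokenizer.encode("\n")
        input_ids += prefix_ids + reply_ids + newline_ids
        if role == "assistant":
            # supervise the reply and its <|im_end|>; mask the prefix and the trailing \n
            labels += [IGNORE_INDEX] * len(prefix_ids) + reply_ids + [IGNORE_INDEX] * len(newline_ids)
        else:
            labels += [IGNORE_INDEX] * (len(prefix_ids) + len(reply_ids) + len(newline_ids))
    return input_ids, labels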

logn attention size does not match

modeling_qwen.py, line 373

seq_end = key.size(0)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]

should be

seq_start = key.size(1) - query.size(1)
seq_end = key.size(1)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]

About running with text-generation-webui (for those earlier asking about a web UI)

Comments on Hugging Face may go unseen, so I'm posting here where it's more active:
When loading Qwen/Qwen-7B-Chat with text-generation-webui, I used the parameters shown in the first screenshot (this machine has a weak GPU but a decent CPU). After loading, only one CPU thread is used by default (second screenshot), most CPU cores sit idle, and inference is extremely slow. I checked the open-source README but found no information about launch parameters. Where can I adjust them so more CPU cores are used for inference? Thanks.
PS: When cloning from Hugging Face with Git, the file qwen.tiktoken was missing by default; I'm not sure whether this is specific to my setup.
(Screenshots omitted.)

Problems this project needs to fix: 1. ...

1. After following the README from start to finish, the project fails to start.
2. After downloading flash-attention, pip install csrc/layer_norm and pip install csrc/rotary both fail.
3. No streaming chat.
4. No web UI.
5. No explanation of how to load a local model, or where the local model path should go; a code example would help (see the sketch below).
6. After setting up the environment as instructed, running python demo.py from a CMD prompt inside the project fails with an error about device_map="auto".
Summary: the documentation should be clearer and more complete (at minimum, following the README from start to finish should produce a working setup).
If this is not improved, adoption will suffer: however many people praise the project, there is no real hands-on review and no video walkthrough, because nobody can get it running from the README.
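
For item 5, a minimal sketch of loading a locally downloaded checkpoint (the path below is a placeholder), following the same API the README uses:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

local_path = "/path/to/Qwen-7B-Chat"  # placeholder: directory containing the downloaded model files
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(local_path, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(local_path, trust_remote_code=True)
response, history = model.chat(tokenizer, "你好", history=None)
print(response)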

Flash attention speedup is small, only about a 5% improvement in inference speed

Hi,
I followed your flash attention installation steps and installed it successfully.
At runtime the log also shows:
use flash_attn rotary
use flash_attn rms_norm

Testing on an A100, installing flash attention gives less than a 5% inference speedup (per-token generation time) compared to not installing it.
So I'd like to ask: in your internal tests, roughly how much performance improvement does flash attention bring?

Out of memory on a 24 GB GPU

Running your demo hits an out-of-memory error. Does that mean only the quantized model is usable?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

What are the hardware requirements for Qwen-7B?

Have the hardware requirements for training and evaluation been published? I need to estimate the required resources, but the markdown files don't seem to state them explicitly. Has anyone seen this information?

No output when the input text is long

First, thanks for open-sourcing the Qwen-7B model!
When using the Chat version, I get no output when the input text is long. The prompt is 4722 characters, and its input_ids after tokenization are 3172 tokens long. I modified the input-length setting in generation_config.json:

  "max_context_size": 4096

But the model's response is an empty string. Stepping through with a debugger, I confirmed that generation is not terminated early because the input is too long; it enters the normal autoregressive decoding loop, and the first two tokens it outputs happen to be two of the tokens in stop_words_ids. The README says an 8K context is supported:

Support of 8K Context Length. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts.

If I truncate the prompt to 3265 characters, output returns to normal. What could cause this? Is it simply degraded quality on long inputs, or am I using the model incorrectly?

How tool calling was evaluated

Very valuable work!
However, there is little material on how tool calling was evaluated. Could the team share the scale of the evaluation, whether training targeted specific APIs, and how well tool selection works for APIs the model has not seen? A reply from the developers would be appreciated, thanks!

fail to save tokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
tokenizer.save_pretrained('checkpoint')

Saving the tokenizer fails with:

    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
TypeError: save_vocabulary() got an unexpected keyword argument 'filename_prefix'
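
The traceback suggests that save_pretrained passes a filename_prefix keyword that the custom save_vocabulary does not accept. A hedged sketch of the kind of signature change involved (illustrative only; the body that writes the vocab file is unchanged and omitted here):

from typing import Optional, Tuple

def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None, **kwargs) -> Tuple[str]:
    # accepting (and, in this sketch, ignoring) filename_prefix lets
    # tokenizer.save_pretrained() call this method without the TypeError above
    ...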

Will a streaming chat interface be provided? I tried adding one myself, but sometimes the output is garbled 😂

I added it to modeling_qwen.py, but sometimes the output contains garbled characters.

Here is the diff; advice would be appreciated.

diff --git a/modeling_qwen.py b/modeling_qwen.py
index cc58746..a0361d9 100644
--- a/modeling_qwen.py
+++ b/modeling_qwen.py
@@ -883,6 +883,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         history: Optional[HistoryType],
         system: str = "You are a helpful assistant.",
         append_history: bool = True,
+        stream: Optional[bool] = False,
     ) -> Tuple[str, HistoryType]:
 
         if history is None:
@@ -902,25 +903,39 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         )
         input_ids = torch.tensor([context_tokens]).to(self.device)
 
-        outputs = self.generate(
-            input_ids,
-            stop_words_ids=stop_words_ids,
-            return_dict_in_generate=False,
-        )
+        if stream:
+            from transformers_stream_generator.main import NewGenerationMixin, StreamGenerationConfig
+            self.__class__.generate = NewGenerationMixin.generate
+            self.__class__.sample_stream = NewGenerationMixin.sample_stream
+            stream_config = StreamGenerationConfig(**self.generation_config.to_dict(), do_stream=True)
 
-        response = decode_tokens(
-            outputs[0],
-            tokenizer,
-            raw_text_len=len(raw_text),
-            context_length=len(context_tokens),
-            chat_format=self.generation_config.chat_format,
-            verbose=False,
-        )
+            def stream_generator():
+                outputs = []
+                for token in self.generate(input_ids, stop_words_ids=stop_words_ids, return_dict_in_generate=False, generation_config=stream_config):
+                    outputs.append(token.item())
+                    yield tokenizer.decode(outputs, skip_special_tokens=True)
+
+            return stream_generator()
+        else:
+            outputs = self.generate(
+                input_ids,
+                stop_words_ids=stop_words_ids,
+                return_dict_in_generate=False,
+            )
+
+            response = decode_tokens(
+                outputs[0],
+                tokenizer,
+                raw_text_len=len(raw_text),
+                context_length=len(context_tokens),
+                chat_format=self.generation_config.chat_format,
+                verbose=False,
+            )
 
-        if append_history:
-            history.append((query, response))
+            if append_history:
+                history.append((query, response))
 
-        return response, history
+            return response, history
 
     def generate(
         self,
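
For reference, one likely cause of the garbled characters (an assumption, not a confirmed diagnosis): decoding the accumulated ids after every step can end on an incomplete multi-byte character, which decode renders as the U+FFFD replacement character. A small sketch of holding back such a partial tail before yielding:

def safe_partial_text(text: str) -> str:
    # an incomplete UTF-8 sequence at the end of the stream decodes to U+FFFD;
    # hold it back until the next token completes it
    return text.rstrip("\ufffd")

# usage inside the stream_generator above:
#   yield safe_partial_text(tokenizer.decode(outputs, skip_special_tokens=True))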

What is the padding token?

Thanks for your amazing work. May I ask what the padding token is in your tokenizer? Without it, I don't think I can fine-tune this model.

flash-attn installation error

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

MPS does not support cumsum op with int64 input

Hi, I'm trying to run the model on an M1 Mac. Because of memory constraints, I added offload_folder and torch_dtype; code below:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("/Users/sniper/model/Qwen-7b-chat", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("/Users/sniper/model/Qwen-7b-chat", device_map="auto",
                                             offload_folder="offload", torch_dtype=torch.float16,
                                             trust_remote_code=True, fp16=True).eval()


model.generation_config = GenerationConfig.from_pretrained("/Users/sniper/model/Qwen-7b-chat",
                                                           trust_remote_code=True)  
# 第一轮对话 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

But the chat call (second-to-last line) raises an error:

 position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: MPS does not support cumsum op with int64 input

What could be causing this?
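
For reference, this usually means the PyTorch MPS backend in use does not implement cumsum for int64 inputs. One commonly suggested workaround (assuming a CPU fallback for the unsupported op is acceptable) is to enable the MPS fallback before torch is imported:

import os
# must be set before importing torch so the MPS backend falls back to CPU
# for operators it does not implement (such as int64 cumsum here)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported after the environment variable is set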
