Does tiktoken not support multithreaded tokenization? (qwen, CLOSED)

skepsun commented on May 17, 2024

Does tiktoken not support multithreaded tokenization?

Comments (8)

songkq commented on May 17, 2024

As an alternative, use GPT2Tokenizer, which supports multithreading: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat/tree/main

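jklj077 commented on May 17, 2024

This may be a problem with the version of HuggingFace datasets used by LLaMA-Efficient-Tuning. Could you try upgrading datasets to the latest version?

Below is an MWE; it runs correctly with recent versions of datasets.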

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
def process(example):
    ids = tokenizer.encode(example['text'])
    out = {'ids': ids, 'len': len(ids)}
    return out

dataset = load_dataset("stas/openwebtext-10k") # just an example
tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=3,
)

See also:
datasets commit: huggingface/datasets#5552
datasets issue: huggingface/datasets#5769
LLaMA-Efficient-Tuning issue: hiyouga/LLaMA-Factory#328

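jklj077 commented on May 17, 2024

We have no control over the multiprocessing logic inside datasets. In general, for multi-process tokenization it is best to initialize the tokenizer inside each process and avoid passing the tokenizer object between processes, which may trigger unexpected problems.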

geekinglcq commented on May 17, 2024

Hi, could you provide more detailed code so that we can reproduce this?

skepsun commented on May 17, 2024

@geekinglcq Thanks for the reply. The training framework I am using is https://github.com/hiyouga/LLaMA-Efficient-Tuning. Setting preprocessing_num_workers to anything above 1 triggers this error. My script looks like this:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 accelerate launch --num_processes=7 src/train_bash.py \
    --stage sft \
    --deepspeed configs/ds_zero2.json \
    --lora_target q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --template vicuna \
    --model_name_or_path ../Qwen-7B \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type full \
    --warmup_ratio 0.03 \
    --output_dir outputs/qwen-7b-sft \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 12 \
    --lr_scheduler_type cosine \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --logging_steps 1 \
    --save_steps 100 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --dev_ratio 0.001 \
    --num_train_epochs 3 \
    --resume_lora_training True \
    --plot_loss \
    --report_to wandb \
    --fp16 \
    --tf32 True

nobodybut commented on May 17, 2024

When I call it from text-generation-webui, it also uses only one CPU thread, and inference is extremely slow. I opened an issue here, but nobody has responded...

zhaochs1995 commented on May 17, 2024

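@skepsun Have you solved this? Besides switching to single-threaded processing, is there any other fix? The problem persists after upgrading datasets to the latest version.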

JianxinMa commented on May 17, 2024

Hi, would writing it like this work?

import os
import threading
from transformers import AutoTokenizer

tokenizer_dict = {}  # one tokenizer per (process, thread)

def process(example):
    # A tuple key avoids the ambiguity of concatenating the two IDs as strings.
    k = (os.getpid(), threading.get_ident())
    if k not in tokenizer_dict:
        last_error = None
        for _ in range(100):  # try multiple times when the network is unreliable
            try:
                tokenizer_dict[k] = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
                break
            except Exception as e:
                last_error = e
        else:
            # Every attempt failed: re-raise instead of hitting a NameError below.
            raise last_error
    tokenizer = tokenizer_dict[k]
    ids = tokenizer.encode(example["text"])
    out = {"ids": ids, "len": len(ids)}
    return out
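
A possible variation on the per-process, per-thread cache is the standard-library `threading.local`, which gives each thread its own slot without a hand-built key. This is a minimal sketch under assumptions: `build_tokenizer` and `WhitespaceTokenizer` are hypothetical stand-ins (not part of transformers or tiktoken) so the example runs offline. The PID check is there because a forked child can inherit the forking thread's local data.

```python
import os
import threading

_local = threading.local()  # separate attribute namespace per thread


class WhitespaceTokenizer:
    """Stand-in tokenizer (assumption) so the sketch runs offline."""

    def encode(self, text):
        # Fake "ids": the length of each whitespace-separated token.
        return [len(tok) for tok in text.split()]


def build_tokenizer():
    # Hypothetical factory. In real use this would presumably wrap
    # AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True).
    return WhitespaceTokenizer()


def get_tokenizer():
    # Rebuild when this thread has no tokenizer yet, or when the cached
    # one was created in a different process (for example before a fork).
    pid = os.getpid()
    if getattr(_local, "pid", None) != pid:
        _local.tokenizer = build_tokenizer()
        _local.pid = pid
    return _local.tokenizer


def process(example):
    ids = get_tokenizer().encode(example["text"])
    return {"ids": ids, "len": len(ids)}
```

Within one thread, repeated calls to `process` reuse the same tokenizer; a second thread or a forked worker builds its own on first use.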
