Comments (8)

songkq commented on May 17, 2024

As an alternative, use a GPT2Tokenizer-based tokenizer, which supports multithreading: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat/tree/main

from qwen.

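jklj077 commented on May 17, 2024

This may be a problem with the version of HuggingFace datasets that LLaMA-Efficient-Tuning uses. Could you try upgrading datasets to the latest version?

Below is an MWE; it runs correctly with a recent version of datasets.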

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
def process(example):
    ids = tokenizer.encode(example['text'])
    out = {'ids': ids, 'len': len(ids)}
    return out

dataset = load_dataset("stas/openwebtext-10k") # just an example
tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=3,
)

See also:
datasets commit: huggingface/datasets#5552
datasets issue: huggingface/datasets#5769
LLaMA-Efficient-Tuning issue: hiyouga/LLaMA-Factory#328
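
Before and after upgrading (for example with `pip install -U datasets`), it can help to confirm which datasets version the training environment actually has installed. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here, not part of datasets or LLaMA-Efficient-Tuning.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the version seen by the current Python environment.
print("datasets:", installed_version("datasets"))
```

Run this with the same interpreter that launches training, since a different virtual environment may have a different datasets version.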

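jklj077 commented on May 17, 2024

We have no control over the multiprocessing logic inside datasets. In general, for multi-process tokenization it is best to initialize the tokenizer inside each process and avoid passing the tokenizer object between processes, which may trigger unexpected problems.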
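
The general advice above can be sketched with the standard-library `multiprocessing.Pool`, whose `initializer` argument runs once in every worker. This is only an illustration of per-process initialization, not how datasets implements `num_proc`. `ByteTokenizer`, `init_worker`, and `tokenize_parallel` are hypothetical names, and the stand-in tokenizer is an assumption so the sketch runs without downloading a model.

```python
import multiprocessing as mp

_tokenizer = None  # filled in separately inside every worker process


class ByteTokenizer:
    """Stand-in tokenizer (an assumption) that encodes text as UTF-8 byte values."""

    def encode(self, text):
        return list(text.encode("utf-8"))


def init_worker():
    """Runs once per worker process and builds that worker's own tokenizer."""
    global _tokenizer
    # With the real model this line would instead be, for example:
    # _tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
    _tokenizer = ByteTokenizer()


def process(example):
    """Tokenize one example with the tokenizer owned by the current process."""
    ids = _tokenizer.encode(example["text"])
    return {"ids": ids, "len": len(ids)}


def tokenize_parallel(examples, workers=3):
    """Tokenize in a pool; only plain dicts cross process boundaries, never the tokenizer."""
    with mp.Pool(processes=workers, initializer=init_worker) as pool:
        return pool.map(process, examples)
```

In a script, call `tokenize_parallel([{"text": "hi"}], workers=2)` from under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn rather than fork workers.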

geekinglcq commented on May 17, 2024

Hello, could you provide more detailed code so that we can reproduce this?

skepsun commented on May 17, 2024

@geekinglcq Thanks for the reply. The training framework I am using is the repository https://github.com/hiyouga/LLaMA-Efficient-Tuning. The error appears as soon as preprocessing_num_workers is set above 1. My script looks like this:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 accelerate launch --num_processes=7 src/train_bash.py \
    --stage sft \
    --deepspeed configs/ds_zero2.json \
    --lora_target q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --template vicuna \
    --model_name_or_path ../Qwen-7B \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type full \
    --warmup_ratio 0.03 \
    --output_dir outputs/qwen-7b-sft \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 12 \
    --lr_scheduler_type cosine \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --logging_steps 1 \
    --save_steps 100 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --dev_ratio 0.001 \
    --num_train_epochs 3 \
    --resume_lora_training True \
    --plot_loss \
    --report_to wandb \
    --fp16 \
    --tf32 True

nobodybut commented on May 17, 2024

I am calling it from text-generation-webui, and it also uses only one CPU thread, so inference is extremely slow. I opened an issue for it, but nobody has responded.

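zhaochs1995 commented on May 17, 2024

@skepsun Did you manage to solve this? Apart from switching to a single thread, is there any other fix? The problem persists after upgrading datasets to the latest version.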

JianxinMa commented on May 17, 2024

Hello, would writing it like the following work?

import os
import threading
from transformers import AutoTokenizer

# One tokenizer per (process, thread), created lazily inside the worker.
tokenizer_dict = {}

def process(example):
    k = (os.getpid(), threading.get_ident())
    if k not in tokenizer_dict:
        for _ in range(100):  # try multiple times when the network is unreliable
            try:
                tokenizer_dict[k] = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
                break
            except Exception as e:
                last_error = e
        else:
            # every attempt failed: re-raise instead of using an unbound tokenizer
            raise last_error
    tokenizer = tokenizer_dict[k]
    ids = tokenizer.encode(example["text"])
    out = {"ids": ids, "len": len(ids)}
    return out
