
Comments (3)

Uxito-Ada avatar Uxito-Ada commented on September 25, 2024

Hi @YongZhuIntel ,

I reproduced it and got the same error, which means the XPU memory on the platform has been exhausted.

In addition, profiling shows that the chatglm3-6b model in BF16 takes ~11G+ once the trainer is started. During the subsequent forward/backward passes, memory consumption grows gradually, so it easily exceeds the 16G limit on Arc.

Also, note that multi-instance training is data-parallel: it loads the whole model on each card and therefore does not save any memory.
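The ~11G figure lines up with back-of-envelope arithmetic for the weights alone (a rough sketch; real usage also includes activations, gradients, and optimizer state):

```python
# Rough weight-memory estimate for a 6B-parameter model.
# Illustrative numbers only; activations, gradients, and optimizer
# state add substantially on top of this.
PARAMS = 6e9

def weight_memory_gb(bytes_per_param: float) -> float:
    """Memory for the raw weights alone, in GiB."""
    return PARAMS * bytes_per_param / 1024**3

bf16 = weight_memory_gb(2)    # BF16: 2 bytes per parameter
nf4 = weight_memory_gb(0.5)   # NF4: 4 bits per parameter

print(f"BF16 weights: ~{bf16:.1f} GiB")  # close to the profiled ~11G+
print(f"NF4 weights:  ~{nf4:.1f} GiB")
```

Since data parallelism replicates the full model, each card pays this cost in full, which is why quantizing the base model helps so much on 16G Arc cards.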

Two suggestions:

Firstly, you could try QLoRA, which quantizes the base model into NF4 and so requires less memory than BF16. Since the base model is frozen, this does not harm tuning accuracy. Moreover, we have already validated chatglm with QLoRA. This is the most recommended option.

Secondly, hyperparameters can be tuned to decrease memory consumption. With the configurations below, I can run more than 100 steps on 2 cards; other configurations can be tried as well:

# in alpaca_lora_finetuning.py
lora_r: int = 2,
lora_alpha: int = 4,
lora_dropout: float = 0.85,

# in .sh script
......
      python ./alpaca_lora_finetuning.py \
      --micro_batch_size 1 \
      --batch_size 2 \
......
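As a side note on the suggested values: in alpaca-lora style scripts, `batch_size` and `micro_batch_size` are typically linked through gradient accumulation, so a smaller micro batch trades extra accumulation steps for lower peak memory. A sketch of the usual relationship (the exact script internals may differ):

```python
# Typical relationship in alpaca-lora style finetuning scripts
# (assumption: the effective batch is assembled from micro-batches
# via gradient accumulation).
def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    assert batch_size % micro_batch_size == 0, "batch_size must divide evenly"
    return batch_size // micro_batch_size

# With the suggested values, each optimizer step accumulates 2 micro-batches,
# so only one sample's activations are held in memory at a time.
print(accumulation_steps(batch_size=2, micro_batch_size=1))
```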

from bigdl.

YongZhuIntel avatar YongZhuIntel commented on September 25, 2024

@Uxito-Ada Thanks for your help. I have successfully run qlora_finetune_chatglm3_6b on 1 card, but when trying to run it on 2 cards, I got an error at step 100.

2 cards script:

export MASTER_ADDR=127.0.0.1
export OMP_NUM_THREADS=6
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export TORCH_LLM_ALLREDUCE=0
mpirun -n 2 \
    python ./alpaca_qlora_finetuning.py \
    --base_model "/home/intel/models/chatglm3-6b" \
    --data_path "yahma/alpaca-cleaned" \
    --lora_target_modules '[query_key_value,dense,dense_h_to_4h,dense_4h_to_h]' \
    --output_dir "./ipex-llm-qlora-alpaca"

error message:

OSError: [Errno 39] Directory not empty: './ipex-llm-qlora-alpaca/tmp-checkpoint-100' -> './ipex-llm-qlora-alpaca/checkpoint-100'
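This looks like both ranks trying to rotate the same checkpoint directory, so one rank finds `checkpoint-100` already created and non-empty (an assumption about the cause). The OS-level error itself is easy to reproduce with the standard library:

```python
import errno
import os
import tempfile

# Recreate the situation: a temporary checkpoint dir being renamed onto a
# destination that another process has already populated.
root = tempfile.mkdtemp()
src = os.path.join(root, "tmp-checkpoint-100")
dst = os.path.join(root, "checkpoint-100")
os.makedirs(src)
os.makedirs(dst)
open(os.path.join(dst, "stale.bin"), "w").close()  # destination is non-empty

err = None
try:
    os.rename(src, dst)  # fails: cannot replace a non-empty directory
except OSError as e:
    err = e.errno

print(err == errno.ENOTEMPTY)  # Errno 39: Directory not empty
```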

Also, #11099 says this issue was fixed in transformers 4.39.1.
But after I installed transformers 4.39.1:

pip install transformers==4.39.1
pip install accelerate==0.28.0

I got a new error:

Traceback (most recent call last):
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 759, in convert_to_tensors
    tensor = as_tensor(value)
             ^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 721, in as_tensor
    return torch.tensor(value)
           ^^^^^^^^^^^^^^^^^^^
ValueError: expected sequence of length 256 at dim 1 (got 255)
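For context, this ValueError means the collator handed `torch.tensor` a ragged batch: one example has 255 token ids while the rest have 256, so they cannot be stacked into one tensor. It is likely a padding/collation behavior difference between transformers versions (an assumption). A pure-Python sketch of the shape problem and the usual fix (token ids and pad id 0 are illustrative):

```python
# A batch of token-id lists that is not rectangular cannot be turned
# into a tensor: one row is 255 tokens, the others 256.
batch = [[1] * 256, [1] * 255]

def pad_batch(seqs, pad_id=0):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

lengths = {len(s) for s in pad_batch(batch)}
print(lengths)  # every row now has the same length
```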

Is there something else that needs to be installed?

error log:
qlora_finetune_chatglm3_6b_arc_2_card_def_tmp.log

from bigdl.

Uxito-Ada avatar Uxito-Ada commented on September 25, 2024

Hi @YongZhuIntel ,

I reproduced your error, and the dependencies below solve it:

pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install transformers==4.36.1
pip install accelerate==0.23.0
