Comments (26)
This error is caused by pip trying to uninstall Python libraries that were installed by the OS package manager (e.g., apt). In your example, it's blinker.
A simple workaround is to add --ignore-installed to the pip command:
pip install transformers==4.36.0 --ignore-installed blinker
If this command doesn't work, please uninstall the OS-installed package first:
apt remove python3-blinker
pip install transformers==4.36.0
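To see in advance which packages pip may refuse to uninstall, one option is to inspect the metadata each distribution ships. The sketch below is only a heuristic and not pip's actual logic: the function name describe_install is hypothetical, and it assumes that a distribution with no recorded file list is the kind pip reports as "distutils installed".

```python
from importlib import metadata


def describe_install(name: str) -> str:
    """Roughly classify how a distribution was installed (heuristic only)."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return "not installed"
    # pip writes "pip" into an INSTALLER file; other installers may write
    # their own name there or omit the file entirely.
    installer = (dist.read_text("INSTALLER") or "").strip()
    if installer:
        return f"installed by {installer}"
    # Without a file list, an uninstaller cannot tell which files belong
    # to the package, which is the situation the pip error describes.
    if dist.files is None:
        return "no file record (likely distutils or OS package)"
    return "unknown installer"


if __name__ == "__main__":
    for pkg in ("blinker", "transformers"):
        print(pkg, "->", describe_install(pkg))
```

Running this inside the image before the pip install step should show whether blinker is one of the packages that needs apt remove or --ignore-installed.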
from bigdl.
If I try to enable DeepSpeed via accelerate config, it tells me that DeepSpeed is not installed.
Do you want to enable dynamic shape tracing? [yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: yes
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/config.py", line 67, in config_command
config = get_user_input()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/config.py", line 40, in get_user_input
config = get_cluster_input()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/cluster.py", line 192, in get_cluster_input
is_deepspeed_available()
AssertionError: DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source
exit code: 1
I skipped DeepSpeed and generated this default_config.yaml:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: IPEX
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Yes. We omit deepspeed from the requirements (so does axolotl).
You can answer no for DeepSpeed. If you need deepspeed at a later stage (tensor parallel, etc.), you can install it with pip and re-run accelerate config.
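Before answering the DeepSpeed prompt, it can help to check whether the package is importable at all. This is a minimal sketch using only the standard library; the helper name is_module_available is hypothetical, and it is not the check accelerate itself performs (the traceback shows accelerate calling its own is_deepspeed_available).

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_module_available("deepspeed"):
        print("deepspeed found: answering 'yes' in `accelerate config` should work")
    else:
        print("deepspeed missing: answer 'no', or run `pip install deepspeed` first")
```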
from bigdl.
Currently axolotl reports error when saving the prepared dataset:
This looks to be related to OpenAccess-AI-Collective/axolotl#1544.
from bigdl.
LIBXSMM_TARGET: adl [12th Gen Intel(R) Core(TM) i3-12100]
Hi @kwaa
I will add the dataset preparation failure to the troubleshooting guide. Thank you for your solution! :)
This segmentation fault may be related to intel-extension-for-pytorch and oneAPI.
Can you provide oneAPI version, GPU driver version and pip list? We will try to reproduce this issue.
from bigdl.
Hmm... Something went wrong. I tried to run sycl-ls inside a container and it only showed the CPU.
This is probably a NixOS bug; after I rolled back the system, sycl-ls produced normal output:
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i3-12100 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.42]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 730 1.3 [1.3.27191]
This is because older versions of intel-compute-runtime are not compatible with the 6.8 kernel; see intel/compute-runtime#710.
The Arc A770 is device 0. Please change the environment variable:
export ONEAPI_DEVICE_SELECTOR=level_zero:0
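Judging by this thread, the index used in ONEAPI_DEVICE_SELECTOR matches the number in the ext_oneapi_level_zero:gpu:N entries that sycl-ls prints. The sketch below works under that assumption and picks the selector by device name instead of guessing the index. The function name find_level_zero_selector is hypothetical, and SAMPLE is copied from the sycl-ls output posted above.

```python
import re

# Matches lines such as:
# [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
_LEVEL_ZERO_LINE = re.compile(r"^\[(?:ext_oneapi_)?level_zero:gpu:(\d+)\]\s*(.*)$")


def find_level_zero_selector(sycl_ls_output: str, name_fragment: str):
    """Return 'level_zero:<index>' for the first Level Zero GPU matching name_fragment."""
    for line in sycl_ls_output.splitlines():
        match = _LEVEL_ZERO_LINE.match(line.strip())
        # Compare case-insensitively against the device description.
        if match and name_fragment.lower() in match.group(2).lower():
            return f"level_zero:{match.group(1)}"
    return None  # no matching Level Zero GPU was listed


SAMPLE = """\
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.42]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 730 1.3 [1.3.27191]
"""

if __name__ == "__main__":
    print(find_level_zero_selector(SAMPLE, "Arc"))
```

On the machine in this thread, the Arc A770 maps to level_zero:0 and the UHD 730 iGPU to level_zero:1.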
from bigdl.
I think it might be useful to provide an accelerate/default_config.yaml reference file to avoid misconfiguration.
Also, I fixed this (#10821 (comment)) by setting environment variables (intel/compute-runtime#710 (comment)), but at the moment the Trainer doesn't seem to be running correctly: it runs for a second and then never logs again, and I don't get any usable files other than the JSON in the output_dir.
Maybe there is something wrong with my axolotl config?
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-23 18:33:37,631 - INFO - intel_extension_for_pytorch auto imported
2024-04-23 18:33:37,650 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-23 18:33:38,905] [INFO] [datasets.<module>:58] [PID:53] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-23 18:33:39,850] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:53] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-23 18:33:39,853] [WARNING] [axolotl.validate_config:263] [PID:53] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-23 18:33:39,854] [INFO] [axolotl.normalize_config:169] [PID:53] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-23 18:33:39,854] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:53] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-23 18:33:39,854] [INFO] [axolotl.scripts.check_user_token:371] [PID:53] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: None / None
[2024-04-23 18:33:40,126] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:53] [RANK:0] Loading prepared dataset from disk at /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375...
[2024-04-23 18:33:40,134] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:53] [RANK:0] Prepared dataset loaded from disk...
[2024-04-23 18:33:40,143] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_tokens: 18727
[2024-04-23 18:33:40,146] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] `total_supervised_tokens: 14240`
[2024-04-23 18:33:43,050] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 18727
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] data_loader_len: 3
[2024-04-23 18:33:43,051] [INFO] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est across ranks: [0.914404296875]
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est: None
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_steps: 12
[2024-04-23 18:33:43,062] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_tokens: 338809
[2024-04-23 18:33:43,071] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] `total_supervised_tokens: 249975`
[2024-04-23 18:33:43,072] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 338809
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] data_loader_len: 80
[2024-04-23 18:33:43,073] [INFO] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est across ranks: [0.9618260583212209]
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est: 0.97
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_steps: 320
[2024-04-23 18:33:43,085] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading tokenizer... /workspace/models/llama-3-8b
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: None / None
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading model and peft_config...
[2024-04-23 18:33:43,287] [INFO] [axolotl.load_model:366] [PID:53] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 2.76it/s]
[2024-04-23 18:34:38,518] [INFO] [axolotl.load_model:677] [PID:53] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-23 18:34:39,488] [INFO] [axolotl.load_lora:789] [PID:53] [RANK:0] found linear modules: ['k_proj', 'v_proj', 'gate_proj', 'up_proj', 'down_proj', 'o_proj', 'q_proj']
trainable params: 2,143,322,112 || all params: 5,851,353,088 || trainable%: 36.629512520711515
[2024-04-23 18:35:15,919] [INFO] [axolotl.load_model:714] [PID:53] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-23 18:35:17,927] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-23 18:35:18,034] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Starting trainer...
[2024-04-23 18:35:18,204] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-23 18:35:18,205] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-23 18:35:18,273] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
We will consider providing a default accelerate/default_config.yaml. However, different hardware leads to different configs, so it may be hard to provide one that suits everyone.
I just checked your axolotl config. There are several problems:
- Llama 3 is not supported by axolotl v0.4.0 (ipex-llm only supports axolotl v0.4.0 right now). It is only supported in the main branch, via OpenAccess-AI-Collective/axolotl#1536 and OpenAccess-AI-Collective/axolotl#1553 . Llama 3 also requires a different yaml, especially for tokens: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/examples/llama-3/lora-8b.yml#L67 .
- The dataset path (path: /workspace/datasets/alpaca_2k_test) should be fine.
To support Llama 3 finetuning on Arc, we need to switch to the main branch and upgrade several key libraries (e.g., peft, transformers). We also need to modify some source code in ipex-llm.
The good news is that we are already working on the peft upgrade and Llama 3 axolotl support. We will let you know when it's ready. :)
from bigdl.
Update:
- We just added an Axolotl example for Llama-3-8B: #10984
- accelerate default config: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/LLM-Finetuning/axolotl/default_config.yaml
- An axolotl image for fine-tuning is a work in progress: #10971
from bigdl.
Hi @kwaa
Thank you for submitting this issue. We will consider adding Docker images for IPEX-LLM + Axolotl and other examples. However, it usually takes some time to go through internal reviews (especially for Docker images).
Back to your Dockerfile: can you share the error messages from the Docker build? Maybe we can fix this problem first. :)
from bigdl.
can you share the error messages during docker building? Maybe we can fix this problem first. :)
Okay. The problem is that the installation of the dependency fails:
# Install requirements
RUN pip install -e . && \
pip install transformers==4.36.0
podman -v # podman version 5.0.2
sudo podman compose build
Downloading smmap-5.0.1-py3-none-any.whl (24 kB)
Downloading svgwrite-1.4.3-py3-none-any.whl (67 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.1/67.1 kB 7.1 MB/s eta 0:00:00
Building wheels for collected packages: optimum, rouge-score, fire, ffmpy, wavedrom
Building wheel for optimum (pyproject.toml): started
Building wheel for optimum (pyproject.toml): finished with status 'done'
Created wheel for optimum: filename=optimum-1.13.2-py3-none-any.whl size=395599 sha256=38896e176613a1c92028a9c2383cfe66ab8aef2f86d9540096930717ef731afb
Stored in directory: /root/.cache/pip/wheels/c7/36/5c/712f2d963d6d312afee816293b58610a3442d1a1de2182e651
Building wheel for rouge-score (setup.py): started
Building wheel for rouge-score (setup.py): finished with status 'done'
Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=2f55fb6e5a68b745b71c26dd809038bc0d897f941d83d80cbaa0d0f031a4ec11
Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Building wheel for fire (setup.py): started
Building wheel for fire (setup.py): finished with status 'done'
Created wheel for fire: filename=fire-0.6.0-py2.py3-none-any.whl size=117047 sha256=f2fc59bf03786a40ae6b203d36fd6cd26335e2987e9598f6b8a1a1f4cc368d49
Stored in directory: /root/.cache/pip/wheels/6a/f3/0c/fa347dfa663f573462c6533d259c2c859e97e103d1ce21538f
Building wheel for ffmpy (setup.py): started
Building wheel for ffmpy (setup.py): finished with status 'done'
Created wheel for ffmpy: filename=ffmpy-0.3.2-py3-none-any.whl size=5600 sha256=9f04bfa5e3cc1ac776f4482a61ec953af63db0bb0ded8f0dd7bc7457e2301df6
Stored in directory: /root/.cache/pip/wheels/55/3c/f2/f6e34046bac0d57c13c7d08123b85872423b89c8f59bafda51
Building wheel for wavedrom (setup.py): started
Building wheel for wavedrom (setup.py): finished with status 'done'
Created wheel for wavedrom: filename=wavedrom-2.0.3.post3-py2.py3-none-any.whl size=30071 sha256=dadea55437d57655f9f7d85627c29ddc43c7629a4bb5eb9330129dd0f36b71df
Stored in directory: /root/.cache/pip/wheels/23/cf/3b/4dcf6b22fa41c5ece715fa5f4e05afd683e7b0ce0f2fcc7bb6
Successfully built optimum rouge-score fire ffmpy wavedrom
Installing collected packages: wcwidth, pytz, pydub, nh3, ffmpy, appdirs, aniso8601, addict, xxhash, wrapt, werkzeug, websockets, tzdata, toolz, tomlkit, threadpoolctl, termcolor, tensorboard-data-server, svgwrite, sqlparse, smmap, shtab, shortuuid, shellingham, setproctitle, sentry-sdk, semantic-version, scipy, ruff, rpds-py, querystring-parser, python-multipart, python-dateutil, pynvml, pygments, pyasn1, pyarrow-hotfix, pyarrow, protobuf, prompt-toolkit, packaging, orjson, multidict, mdurl, markdown2, markdown, Mako, llvmlite, kiwisolver, joblib, jmespath, itsdangerous, importlib-resources, humanfriendly, httpcore, hf_transfer, grpcio, greenlet, graphql-core, google-crc32c, frozenlist, fonttools, entrypoints, docstring-parser, docker-pycreds, dill, decorator, cycler, contourpy, cloudpickle, cachetools, blinker, attrs, art, aioitertools, aiofiles, absl-py, yarl, wavedrom, tensorboard, sqlalchemy, scikit-learn, rsa, responses, requests-oauthlib, referencing, pyasn1-modules, proto-plus, pandas, numba, nltk, multiprocess, matplotlib, markdown-it-py, httpx, gunicorn, graphql-relay, googleapis-common-protos, google-resumable-media, gitdb, Flask, fire, docker, coloredlogs, botocore, aiosignal, rouge-score, rich, jsonschema-specifications, graphene, gradio-client, google-auth, gitpython, bitsandbytes, alembic, aiohttp, accelerate, wandb, tyro, typer, mlflow, jsonschema, google-auth-oauthlib, google-api-core, fschat, aiobotocore, s3fs, peft, google-cloud-core, datasets, bert-score, altair, trl, optimum, gradio, google-cloud-storage, evaluate, gcsfs, axolotl
Attempting uninstall: websockets
Found existing installation: websockets 12.0
Uninstalling websockets-12.0:
Successfully uninstalled websockets-12.0
Attempting uninstall: protobuf
Found existing installation: protobuf 5.27.0rc1
Uninstalling protobuf-5.27.0rc1:
Successfully uninstalled protobuf-5.27.0rc1
Attempting uninstall: packaging
Found existing installation: packaging 24.0
Uninstalling packaging-24.0:
Successfully uninstalled packaging-24.0
Attempting uninstall: blinker
Found existing installation: blinker 1.4
ERROR: Cannot uninstall 'blinker'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
Error: building at STEP "RUN pip install -e . && pip install transformers==4.36.0": while running runtime: exit status 1
from bigdl.
Update: The image builds successfully after adding apt remove -y python3-blinker, but it looks like I still need to set up the accelerate config as described here.
from bigdl.
If I try to enable DeepSpeed via accelerate config, it tells me that DeepSpeed is not installed.
Do you want to enable dynamic shape tracing? [yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: yes
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/config.py", line 67, in config_command
config = get_user_input()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/config.py", line 40, in get_user_input
config = get_cluster_input()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/cluster.py", line 192, in get_cluster_input
is_deepspeed_available()
AssertionError: DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source
exit code: 1
I skipped DeepSpeed and generated this default_config.yaml:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: IPEX
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
from bigdl.
Currently, axolotl reports an error when saving the prepared dataset:
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-22 16:47:55,676 - INFO - intel_extension_for_pytorch auto imported
2024-04-22 16:47:55,683 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-22 16:47:56,906] [INFO] [datasets.<module>:58] [PID:46] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-22 16:47:57,920] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:46] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-22 16:47:57,922] [WARNING] [axolotl.validate_config:263] [PID:46] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-22 16:47:57,923] [INFO] [axolotl.normalize_config:169] [PID:46] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-22 16:47:57,924] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:46] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-22 16:47:57,924] [INFO] [axolotl.scripts.check_user_token:371] [PID:46] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:216] [PID:46] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:217] [PID:46] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:218] [PID:46] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:219] [PID:46] [RANK:0] UNK: None / None
[2024-04-22 16:47:58,135] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:46] [RANK:0] Unable to find prepared dataset in /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375
[2024-04-22 16:47:58,135] [INFO] [axolotl.load_tokenized_prepared_datasets:182] [PID:46] [RANK:0] Loading raw datasets...
[2024-04-22 16:47:58,135] [WARNING] [axolotl.load_tokenized_prepared_datasets:184] [PID:46] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-04-22 16:47:58,135] [INFO] [axolotl.load_tokenized_prepared_datasets:191] [PID:46] [RANK:0] No seed provided, using default seed of 42
[2024-04-22 16:48:18,281] [INFO] [axolotl.load_tokenized_prepared_datasets:394] [PID:46] [RANK:0] merging datasets
[2024-04-22 16:48:18,310] [INFO] [axolotl.load_tokenized_prepared_datasets:404] [PID:46] [RANK:0] Saving merged prepared dataset to disk... /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375
Traceback (most recent call last):
File "/workspace/axolotl/finetune.py", line 86, in <module>
fire.Fire(do_cli)
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/finetune.py", line 81, in do_cli
dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 315, in load_datasets
train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/utils/data.py", line 78, in prepare_dataset
train_dataset, eval_dataset, prompters = load_prepare_datasets(
^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/utils/data.py", line 441, in load_prepare_datasets
dataset, prompters = load_tokenized_prepared_datasets(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/utils/data.py", line 405, in load_tokenized_prepared_datasets
dataset.save_to_disk(prepared_ds_path)
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fsspec/core.py", line 383, in url_to_fs
chain = _un_chain(url, kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fsspec/core.py", line 323, in _un_chain
if "::" in path
^^^^^^^^^^^^
TypeError: argument of type 'PosixPath' is not iterable
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'finetune.py', 'lora.yml']' returned non-zero exit status 1.
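The first traceback above ends in fsspec evaluating "::" in path while path is a PosixPath, which does not support the in operator. The sketch below reproduces that failure in isolation and shows one way around it, assuming the caller can convert the path to a string before handing it over. The helper name has_chain_separator and the example path are hypothetical.

```python
import os
from pathlib import Path


def has_chain_separator(path) -> bool:
    """Check for the '::' chaining marker, accepting either str or Path input."""
    # os.fspath turns a Path into a plain string, so the `in` test is valid.
    return "::" in os.fspath(path)


if __name__ == "__main__":
    prepared = Path("/workspace/last_run_prepared/example")  # hypothetical path
    try:
        _ = "::" in prepared  # mirrors the failing check from the traceback
    except TypeError as exc:
        print("fails:", exc)
    print("with str conversion:", has_chain_separator(prepared))
```

By the same reasoning, passing str(prepared_ds_path) to save_to_disk would likely avoid the error without touching fsspec.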
from bigdl.
I fixed this by patching the code (moeru-ai/Moeru-Llama-3-8B@c65e3b1#diff-b135d17426f077f767e0ec29114d24b182dcaa3f6dadaee03d8ff424adcdff0bR407). The problem now is that it segfaults in the "Starting trainer" phase:
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-22 18:13:19,839 - INFO - intel_extension_for_pytorch auto imported
2024-04-22 18:13:19,861 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-22 18:13:21,510] [INFO] [datasets.<module>:58] [PID:46] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-22 18:13:23,047] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:46] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-22 18:13:23,050] [WARNING] [axolotl.validate_config:263] [PID:46] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-22 18:13:23,051] [INFO] [axolotl.normalize_config:169] [PID:46] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-22 18:13:23,051] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:46] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-22 18:13:23,051] [INFO] [axolotl.scripts.check_user_token:371] [PID:46] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:216] [PID:46] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:217] [PID:46] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:218] [PID:46] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:219] [PID:46] [RANK:0] UNK: None / None
[2024-04-22 18:13:23,380] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:46] [RANK:0] Loading prepared dataset from disk at /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375...
[2024-04-22 18:13:23,383] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:46] [RANK:0] Prepared dataset loaded from disk...
[2024-04-22 18:13:23,387] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_tokens: 18727
[2024-04-22 18:13:23,388] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] `total_supervised_tokens: 14240`
[2024-04-22 18:13:26,569] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 18727
[2024-04-22 18:13:26,569] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] data_loader_len: 3
[2024-04-22 18:13:26,569] [INFO] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est across ranks: [0.914404296875]
[2024-04-22 18:13:26,569] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est: None
[2024-04-22 18:13:26,569] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_steps: 12
[2024-04-22 18:13:26,571] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_tokens: 338809
[2024-04-22 18:13:26,580] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] `total_supervised_tokens: 249975`
[2024-04-22 18:13:26,582] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 338809
[2024-04-22 18:13:26,583] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] data_loader_len: 80
[2024-04-22 18:13:26,583] [INFO] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est across ranks: [0.9618260583212209]
[2024-04-22 18:13:26,583] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est: 0.97
[2024-04-22 18:13:26,583] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_steps: 320
[2024-04-22 18:13:26,598] [DEBUG] [axolotl.train.log:60] [PID:46] [RANK:0] loading tokenizer... /workspace/models/llama-3-8b
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:216] [PID:46] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:217] [PID:46] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:218] [PID:46] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:219] [PID:46] [RANK:0] UNK: None / None
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.train.log:60] [PID:46] [RANK:0] loading model and peft_config...
[2024-04-22 18:13:26,816] [INFO] [axolotl.load_model:366] [PID:46] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.79it/s]
[2024-04-22 18:14:33,304] [INFO] [axolotl.load_model:677] [PID:46] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-22 18:14:34,232] [INFO] [axolotl.load_lora:789] [PID:46] [RANK:0] found linear modules: ['up_proj', 'k_proj', 'gate_proj', 'down_proj', 'v_proj', 'o_proj', 'q_proj']
trainable params: 2,143,322,112 || all params: 5,851,353,088 || trainable%: 36.629512520711515
[2024-04-22 18:15:12,793] [INFO] [axolotl.load_model:714] [PID:46] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-22 18:15:12,838] [INFO] [axolotl.train.log:60] [PID:46] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-22 18:15:12,947] [INFO] [axolotl.train.log:60] [PID:46] [RANK:0] Starting trainer...
[2024-04-22 18:15:13,108] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-22 18:15:13,109] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-22 18:15:13,197] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
0%| | 0/80 [00:00<?, ?it/s]
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [12th Gen Intel(R) Core(TM) i3-12100]
Registry and code: 13 MB
Command: /usr/bin/python3 finetune.py lora.yml
Uptime: 115.136140 s
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'finetune.py', 'lora.yml']' died with <Signals.SIGSEGV: 11>.
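For readers unfamiliar with the last line of this traceback: the launcher raises `CalledProcessError` with the child's return code, and on POSIX a negative return code means the child was killed by a signal. The following is a minimal sketch of how such a code can be decoded; `describe_returncode` is a hypothetical helper written for illustration, not part of accelerate.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a subprocess return code the way POSIX encodes it."""
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-returncode}"
    return f"exited with status {returncode}"


# -11 corresponds to the SIGSEGV reported in the traceback above.
print(describe_returncode(-11))
```

A segmentation fault reported this way comes from native code in the training process, so the Python traceback from the launcher does not show where the crash happened.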
from bigdl.
Can you provide your oneAPI version, GPU driver version, and pip list output? We will try to reproduce this issue.
I installed mesa 24.0.5, intel-compute-runtime 24.09.28717.12, and level-zero 1.16.14 on the host; the rest is in the Dockerfile (intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT).
requirements.txt was copied from the axolotl example and has not been modified.
My GPUs are currently A770 16G and UHD730.
sudo intel_gpu_top -L
# card2 Intel Dg2 (Gen12) pci:vendor=8086,device=56A0,card=0
# └─renderD129
# card1 Intel Alderlake_s (Gen12) pci:vendor=8086,device=4692,card=0
# └─renderD128
from bigdl.
sudo intel_gpu_top -L
Thank you for providing the detailed env! :)
Your Dockerfile seems fine. It is based on our XPU inference image, which means you are using intel/oneapi-basekit:2024.0.1 with the level-zero 1.16.14 driver.
Based on the previous error message and env settings, I think the main problem is that the finetune program is running on card 1, i.e., the iGPU. This may lead to a segmentation fault and OOM.
Please refer to this doc and select the Arc A770 as the main GPU. In most cases, you can choose the GPU with an environment variable.
export ONEAPI_DEVICE_SELECTOR=level_zero:1
from bigdl.
BTW, instead of patching Axolotl, we can downgrade datasets to 2.15.0 to avoid the earlier data preparation failure. Will add this change to the doc and quick start. #10849
pip install datasets==2.15.0
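To confirm that the pin took effect inside the container, a small check along these lines can be run with the same interpreter that launches the finetune script. This is a sketch using only the standard library; `installed_version` and `pin_status` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_status(package: str, pinned: str) -> str:
    """Compare the installed version of a package against the expected pin."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found == pinned:
        return f"{package}=={pinned} is installed"
    return f"{package} {found} is installed, expected {pinned}"


# The pin suggested in the comment above.
print(pin_status("datasets", "2.15.0"))
```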
from bigdl.
export ONEAPI_DEVICE_SELECTOR=level_zero:1
Hmm... Something went wrong. I tried to run sycl-ls inside a container and it only showed the CPU.
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i3-12100 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
from bigdl.
Hmm... Something went wrong. I tried to run sycl-ls inside a container and it only showed the CPU.
This is probably a NixOS bug; after I rolled back the system, sycl-ls output was normal again.
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i3-12100 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.42]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 730 1.3 [1.3.27191]
This is because older versions of intel-compute-runtime are not compatible with the 6.8 kernel; see intel/compute-runtime#710.
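Since the device index is machine-specific, it may be safer to derive the ONEAPI_DEVICE_SELECTOR value from the sycl-ls listing than to hard-code it. The sketch below parses output in the format shown above and assumes that the trailing number in each Level-Zero entry is the index the selector expects; `level_zero_selector` is a hypothetical helper, not an oneAPI tool.

```python
import re
from typing import Optional

# Matches entries such as:
# [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
LEVEL_ZERO_LINE = re.compile(r"^\[(?:ext_oneapi_)?level_zero:gpu:(\d+)\]\s*(.*)$")


def level_zero_selector(sycl_ls_output: str, name_fragment: str) -> Optional[str]:
    """Return 'level_zero:<index>' for the first Level-Zero GPU matching name_fragment."""
    for line in sycl_ls_output.splitlines():
        match = LEVEL_ZERO_LINE.match(line.strip())
        if match and name_fragment.lower() in match.group(2).lower():
            return f"level_zero:{match.group(1)}"
    return None


# Lines copied from the sycl-ls output in this comment.
SAMPLE = """\
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 730 1.3 [1.3.27191]
"""

print(level_zero_selector(SAMPLE, "A770"))  # prints level_zero:0
```

With this listing the Arc A770 is Level-Zero GPU 0 and the UHD 730 is GPU 1, so it is worth checking the listing on your own machine before exporting the variable.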
from bigdl.
I think it might be useful to provide an accelerate/default_config.yaml reference file to avoid misconfiguration.
Also, I fixed this (#10821 (comment)) by setting environment variables (intel/compute-runtime#710 (comment)), but at the moment the Trainer doesn't seem to be running correctly: it runs for a second and then never logs again, and I don't get any usable files in the output_dir apart from the JSON.
Maybe there is something wrong with my axolotl config?
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-23 18:33:37,631 - INFO - intel_extension_for_pytorch auto imported
2024-04-23 18:33:37,650 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-23 18:33:38,905] [INFO] [datasets.<module>:58] [PID:53] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-23 18:33:39,850] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:53] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-23 18:33:39,853] [WARNING] [axolotl.validate_config:263] [PID:53] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-23 18:33:39,854] [INFO] [axolotl.normalize_config:169] [PID:53] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-23 18:33:39,854] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:53] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-23 18:33:39,854] [INFO] [axolotl.scripts.check_user_token:371] [PID:53] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: None / None
[2024-04-23 18:33:40,126] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:53] [RANK:0] Loading prepared dataset from disk at /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375...
[2024-04-23 18:33:40,134] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:53] [RANK:0] Prepared dataset loaded from disk...
[2024-04-23 18:33:40,143] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_tokens: 18727
[2024-04-23 18:33:40,146] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] `total_supervised_tokens: 14240`
[2024-04-23 18:33:43,050] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 18727
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] data_loader_len: 3
[2024-04-23 18:33:43,051] [INFO] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est across ranks: [0.914404296875]
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est: None
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_steps: 12
[2024-04-23 18:33:43,062] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_tokens: 338809
[2024-04-23 18:33:43,071] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] `total_supervised_tokens: 249975`
[2024-04-23 18:33:43,072] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 338809
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] data_loader_len: 80
[2024-04-23 18:33:43,073] [INFO] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est across ranks: [0.9618260583212209]
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est: 0.97
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_steps: 320
[2024-04-23 18:33:43,085] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading tokenizer... /workspace/models/llama-3-8b
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: None / None
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading model and peft_config...
[2024-04-23 18:33:43,287] [INFO] [axolotl.load_model:366] [PID:53] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 2.76it/s]
[2024-04-23 18:34:38,518] [INFO] [axolotl.load_model:677] [PID:53] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-23 18:34:39,488] [INFO] [axolotl.load_lora:789] [PID:53] [RANK:0] found linear modules: ['k_proj', 'v_proj', 'gate_proj', 'up_proj', 'down_proj', 'o_proj', 'q_proj']
trainable params: 2,143,322,112 || all params: 5,851,353,088 || trainable%: 36.629512520711515
[2024-04-23 18:35:15,919] [INFO] [axolotl.load_model:714] [PID:53] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-23 18:35:17,927] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-23 18:35:18,034] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Starting trainer...
[2024-04-23 18:35:18,204] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-23 18:35:18,205] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-23 18:35:18,273] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
from bigdl.
It looks like I need to download llama-2-7b and try the example lora.yml to confirm that the current version works.
from bigdl.
I tried unsloth/llama-2-7b and it was consistent with the previous behavior.
[2024-04-24 17:20:25,931] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading tokenizer... /workspace/models/llama-2-7b
[2024-04-24 17:20:25,986] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 2 / </s>
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 1 / <s>
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 0 / <unk>
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: 0 / <unk>
[2024-04-24 17:20:25,987] [INFO] [axolotl.load_tokenizer:224] [PID:53] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading model and peft_config...
[2024-04-24 17:20:25,990] [INFO] [axolotl.load_model:366] [PID:53] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.88s/it]
[2024-04-24 17:20:56,250] [INFO] [axolotl.load_model:677] [PID:53] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-24 17:20:56,501] [INFO] [axolotl.load_lora:789] [PID:53] [RANK:0] found linear modules: ['q_proj', 'up_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj']
trainable params: 39,976,960 || all params: 3,742,765,056 || trainable%: 1.0681129967245264
[2024-04-24 17:21:33,106] [INFO] [axolotl.load_model:714] [PID:53] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-24 17:21:35,274] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-24 17:21:35,278] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Starting trainer...
[2024-04-24 17:21:35,516] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 414041
[2024-04-24 17:21:35,517] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 414041
[2024-04-24 17:21:35,612] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 414041
from bigdl.
I tried unsloth/llama-2-7b and it was consistent with the previous behavior.
It seems axolotl loaded the checkpoint but didn't start training. Can you dump GPU usage with the following command?
sudo xpu-smi dump -d 0 -m 0,1,2,5,18
-d 0 means device 0. Change it if you are using another device.
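If it is more convenient to collect this from a script, the command can be wrapped as in the sketch below. It first checks that xpu-smi is on PATH, then runs the dump for a few seconds. The helper names are hypothetical, and the timeout handling assumes that the dump keeps printing until it is interrupted.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_dump_command(device: int, metrics: Sequence[int]) -> List[str]:
    """Build the `xpu-smi dump` argument list for one device and a set of metric IDs."""
    return ["xpu-smi", "dump", "-d", str(device), "-m", ",".join(str(m) for m in metrics)]


def dump_gpu_metrics(device: int = 0,
                     metrics: Sequence[int] = (0, 1, 2, 5, 18),
                     seconds: float = 10.0) -> str:
    """Run xpu-smi for a limited time and return whatever it printed."""
    if shutil.which("xpu-smi") is None:
        raise FileNotFoundError("xpu-smi not found in PATH; check the oneAPI environment")
    command = build_dump_command(device, metrics)
    try:
        completed = subprocess.run(command, capture_output=True, text=True, timeout=seconds)
        return completed.stdout
    except subprocess.TimeoutExpired as expired:
        # Assumed: the dump does not stop on its own, so the timeout is the normal exit path.
        output = expired.stdout or ""
        return output if isinstance(output, str) else output.decode(errors="replace")


if __name__ == "__main__":
    try:
        print(dump_gpu_metrics(device=0))
    except FileNotFoundError as error:
        print(error)
```

The original command uses sudo; the sketch assumes it is run by a user that already has access to the device, for example root inside the container.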
from bigdl.
Hi @kwaa
We built our example with Meta's llama-2-7b. Not sure if it works with other models.
Can you give us the model version and link you are using for finetuning? We can try to reproduce this error.
from bigdl.
Oh, sorry, I missed the message before.
I'm using unsloth/llama-2-7b.
I now suspect it may have something to do with the container running in the background.
If I run the container in the foreground, the interface gets stuck, so yesterday I started trying to fix my system so that the display uses the iGPU...
I'll keep updating if there are changes.
from bigdl.
Update: I fixed the iGPU display issue with i915.enable_psr=1, but the Trainer doesn't run with the container in the foreground either.
I also tried running xpu-smi inside the container, but it doesn't seem to exist:
crun: executable file `xpu-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
from bigdl.
level_zero
Hi @kwaa
If xpu-smi was not found in PATH, the cause may be a wrong environment or forgetting to source /xxx/oneapi/setvars.sh.
If you are using our image as the base image, you can follow this command: https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/finetune/qlora/xpu/docker/start-qlora-finetuning-on-xpu.sh#L5
from bigdl.