Using a well-crafted FAUXPILOT, we can execute inference tasks based on the Codegen mo

Comments (9)

moyix commented on May 4, 2024 2

I don't know of a good guide to fine-tuning unfortunately! One of my colleagues, @shailja-thakur, has fine-tuned CodeGen on Verilog code, but it takes a lot of VRAM to fine-tune the 16B model (we had to use 80GB A100s).

The --dataset_name is just the location of the code you want to train on in a format that Huggingface Datasets recognizes. The simplest is probably to use JSONL format – a JSON file with one dictionary per line, using the format:

{"text": "content_of_source_file_1", "url": "path_to_source_file_1"}
{"text": "content_of_source_file_2", "url": "path_to_source_file_2"}
...

(You can add other keys if you want; the only field used by the training script is text, but I find it helpful to include some extra metadata so I can keep track of where the code came from.)
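The JSONL layout described above can be produced with a few lines of Python. This is a minimal sketch, not the script used for the Debian dataset: the function name `build_jsonl`, the `suffixes` filter, and the output file name are all assumptions made for illustration.

```python
import json
from pathlib import Path


def build_jsonl(source_dir, output_path, suffixes=(".c", ".h", ".py")):
    """Write one JSON object per line with "text" and "url" keys.

    `suffixes` is a hypothetical filter; adjust it to the languages you train on.
    Returns the number of records written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            # errors="replace" keeps the run going on files that are not valid UTF-8.
            text = path.read_text(encoding="utf-8", errors="replace")
            # "text" is the only field the training script reads; "url" is metadata.
            record = {"text": text, "url": str(path)}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A file written this way should be loadable with Hugging Face Datasets via `load_dataset("json", data_files="train.jsonl")`, where `train.jsonl` is whatever output path you chose.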

You can see an example of a dataset I put together of C/C++ code found in Debian here: https://huggingface.co/datasets/moyix/debian_csrc

I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code.

Also note that it is still a bit tricky to get a custom model working – you'll have to run the conversion from HF to FasterTransformer after training it, and create a configuration file for the new model (there is a script for this in the converter directory: https://github.com/moyix/fauxpilot/blob/main/converter/triton_config_gen.py).

from fauxpilot.

shailja-thakur commented on May 4, 2024 1

Hello Geunsik, thank you for your email. I will be happy to help. Can you share your my-codegen-350m-deepspeed-finetune.sh, ds_config.json, and the size of the training data, so I can get an idea of what could be happening in your case? Thank you, Shailja


leemgs commented on May 4, 2024 1

AttributeError: 'CodeGenAttention' object has no attribute 'causal_mask'

FIXED. I figured out what was causing this problem: the Transformers version I used for training differed from the one I used for sampling. The problem was resolved by using the latest Transformers version (e.g., 4.25.0.dev0) and by fixing the incorrect weights setting in the config.json file. I hope this report is useful to anyone who runs into a similar difficulty. 😄
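A version mismatch of this kind can be spotted before loading the checkpoint. The sketch below assumes the `config.json` saved next to the model carries a `transformers_version` field, which Transformers normally writes when saving a model; the helper names are hypothetical and not part of any library.

```python
import json
from importlib import metadata
from pathlib import Path


def saved_transformers_version(model_dir):
    """Return the "transformers_version" recorded in config.json, or None."""
    config = json.loads(Path(model_dir, "config.json").read_text(encoding="utf-8"))
    return config.get("transformers_version")


def installed_transformers_version():
    """Return the installed Transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


def versions_match(model_dir, installed=None):
    """True only when the checkpoint and the environment report the same version."""
    saved = saved_transformers_version(model_dir)
    if installed is None:
        installed = installed_transformers_version()
    return saved is not None and saved == installed
```

Running `versions_match("./codegen-350M-finetuned")` in the environment used for sampling gives a quick yes/no answer; an exact string comparison is deliberately strict and may flag versions that would in fact be compatible.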

The model card information: fine-tuned Codegen-350M-multi model
- /mylab/fine-tuning-codegen/codegen-350M-finetuned$ cat ./README.md

license: bsd-3-clause
tags:
- generated_from_trainer
datasets:
- moyix/debian_csrc
model-index:
- name: codegen-350M-finetuned
  results: []

This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment.

codegen-350M-finetuned

This model is a fine-tuned version of Salesforce/codegen-350M-multi on the moyix/debian_csrc dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1.0

Training results

Framework versions

Transformers 4.25.0.dev0
Pytorch 1.13.0
Datasets 2.6.1
Tokenizers 0.11.0


leemgs commented on May 4, 2024

I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code.

Yepp, I think so. :)


leemgs commented on May 4, 2024

I don't know of a good guide to fine-tuning unfortunately! One of my colleagues, @shailja-thakur, has fine-tuned CodeGen on Verilog code, but it takes a lot of VRAM to fine-tune the 16B model (we had to use 80GB A100s).

@moyix, @shailja-thakur, I got an unexpected OOM error (torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated; 200.50 MiB free; 10.70 GiB reserved in total by PyTorch)) while running the fine-tuning task with the smallest model (350M) and your Debian dataset on Ubuntu 22.04 (DRAM 32GB) with an Nvidia Titan Xp (VRAM 12GB).

Have you had a similar experience? Did you have to use an Nvidia A100 with 80GB (or 40GB) of VRAM even when fine-tuning the smallest model, such as the 350M? Can we change the ds_config.json file to reduce GPU VRAM consumption so that the fine-tuning run completes successfully? Any feedback will be appreciated.

Screenshot:

$ my-codegen-350m-deepspeed-finetune.sh
     ......... OMISSION ..........
[INFO|trainer.py:1608] 2022-11-04 11:17:11,278 >> ***** Running training *****
[INFO|trainer.py:1609] 2022-11-04 11:17:11,278 >>   Num examples = 3786289
[INFO|trainer.py:1610] 2022-11-04 11:17:11,278 >>   Num Epochs = 1
[INFO|trainer.py:1611] 2022-11-04 11:17:11,278 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1612] 2022-11-04 11:17:11,278 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1613] 2022-11-04 11:17:11,278 >>   Gradient Accumulation steps = 32
[INFO|trainer.py:1614] 2022-11-04 11:17:11,278 >>   Total optimization steps = 118321
[INFO|trainer.py:1615] 2022-11-04 11:17:11,278 >>   Number of trainable parameters = 354858103
  0%|
/work/qtlab/transformers/src/transformers/models/codegen/modeling_codegen.py:167: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version
  attn_weights = torch.where(causal_mask, attn_weights, mask_value)
Traceback (most recent call last):
  File "/work/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 580, in <module>
    main()
  File "/work/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 528, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 1501, in train
    return inner_training_loop(
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 1749, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 2508, in training_step
    loss = self.compute_loss(model, inputs)
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 2540, in compute_loss
    outputs = model(**inputs)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/invain/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/invain/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1680, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/qtlab/transformers/src/transformers/models/codegen/modeling_codegen.py", line 711, in forward
    lm_logits = self.lm_head(hidden_states).to(torch.float32)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated; 200.50 MiB free; 10.70 GiB reserved in total by PyTorch) If re
  0%|
[2022-11-04 11:17:13,621] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3296
[2022-11-04 11:17:13,621] [ERROR] [launch.py:324:sigkill_handler] ['/home/invain/anaconda3/envs/deepspeed/bin/python', '-u', './run_clm.py', '--local_rank= 'moyix/debian_csrc', '--tokenizer_name', 'Salesforce/codegen-350M-multi', '--block_size', '2048', '--gradient_accumulation_steps', '32', '--do_train', '--fp16', '--overwrite_output_dir', '--deepspeed',

real    94m15.273s
user    461m18.611s
sys     3m52.003s
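The step count printed in the log can be reproduced from the other numbers it reports. The sketch below assumes a single GPU and that the Trainer drops the final partial accumulation window (floor division); the function name is hypothetical.

```python
def optimization_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Estimate the number of optimizer updates for a run.

    Assumes the last incomplete accumulation window in each epoch is dropped.
    """
    # Number of micro-batches per epoch (ceil division: a partial last batch still counts).
    batches_per_epoch = -(-num_examples // (per_device_batch * num_gpus))
    # One optimizer update per `grad_accum` micro-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Values taken from the log: 3786289 examples, batch size 1, 32 accumulation steps.
steps = optimization_steps(3786289, per_device_batch=1, grad_accum=32)
print(steps)  # 118321, matching "Total optimization steps" in the log
```

Note that the crash happens at 0%, so the run fails on the first forward pass rather than after memory has grown over many steps.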


leemgs commented on May 4, 2024

Can you share your my-codegen-350m-deepspeed-finetune.sh, ds_config.json, and the size of the training data, so I get an idea of what could be happening in your case?

@shailja-thakur, I don't know why this training setup still hits a CUDA out-of-memory error on an older Nvidia GPU (VRAM 12GB).

Fine-tuning options with the DeepSpeed framework (my-codegen-350m-deepspeed-finetune.sh)
- 12th Gen Intel Core i7 + DRAM 31GB + Nvidia Titan Xp (VRAM 12GB): failed with CUDA OOM 😭
- 12th Gen Intel Core i7 + DRAM 31GB + Nvidia A100 (VRAM 80GB): succeeded thanks to the 80GB of VRAM 😄

 --num_gpus 1 --num_nodes 1 $RUN_CLM --model_name_or_path=Salesforce/codegen-${PARAM_SIZE}-multi \
 --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 \
 --output_dir=./codegen-${PARAM_SIZE}-finetuned --dataset_name $MY_DATASET \
 --tokenizer_name Salesforce/codegen-${PARAM_SIZE}-multi  \
 --block_size 2048 --gradient_accumulation_steps 32 --do_train --fp16 --overwrite_output_dir \
 --deepspeed $DS_CONFIG

ds_config.json

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
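The fragment above omits the outer braces and any other sections the real file may contain. As a sketch only, the shown keys can be regenerated from Python with the two bucket sizes exposed as a parameter. DeepSpeed documents these bucket sizes as limits on the memory used for communication buffers, so a smaller value is one thing that could be tried; nothing in this thread confirms that it is enough to avoid the OOM on a 12GB card. The function names and the output path are assumptions.

```python
import json


def make_ds_config(bucket_size=int(2e8)):
    """Rebuild the ZeRO stage 2 settings shown in the thread.

    `bucket_size` is applied to both allgather_bucket_size and reduce_bucket_size.
    Only keys visible in the posted fragment are included here.
    """
    return {
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "allgather_partitions": True,
            "allgather_bucket_size": bucket_size,
            "overlap_comm": True,
            "reduce_scatter": True,
            "reduce_bucket_size": bucket_size,
            "contiguous_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 2000,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }


def write_ds_config(path, bucket_size=int(2e8)):
    """Write the config as JSON so it can be passed to --deepspeed."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(make_ds_config(bucket_size), f, indent=4)
```

For example, `write_ds_config("ds_config_small.json", bucket_size=int(5e7))` writes a variant with buckets a quarter of the original size; the file name and the value are illustrative, not recommendations from the thread.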

the size of the training data
- 153G ~/.cache/huggingface/datasets/moyix___parquet/

At that time, I concentrated on parameters, gradients, and optimizer states to avoid the CUDA OOM issue on the Nvidia GPU with 12GB of VRAM. However, I still could not find a recipe that avoids the OOM on that GPU.

Source : MS Research blog, https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
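A back-of-the-envelope estimate of those three components may help put the 12GB limit in perspective. The sketch uses the per-parameter accounting from the ZeRO work for mixed-precision Adam: 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states. The function name is hypothetical, and the estimate ignores activations, communication buffers, and framework overhead.

```python
GIB = 1024 ** 3


def model_state_bytes(num_params, offload_optimizer=False):
    """Estimate bytes held for model states under mixed-precision Adam."""
    fp16_weights = 2 * num_params
    fp16_grads = 2 * num_params
    # fp32 master weights + Adam momentum + Adam variance = 12 bytes per parameter.
    # With optimizer offload these are assumed to live in CPU memory instead.
    optimizer_states = 0 if offload_optimizer else 12 * num_params
    return fp16_weights + fp16_grads + optimizer_states


# Trainable parameter count taken from the training log.
n_params = 354858103
on_gpu = model_state_bytes(n_params) / GIB
with_offload = model_state_bytes(n_params, offload_optimizer=True) / GIB
print(f"all states on GPU: {on_gpu:.1f} GiB, optimizer offloaded: {with_offload:.1f} GiB")
```

This comes to roughly 5.3 GiB without offload and 1.3 GiB with it, both well below the 10.55 GiB that PyTorch reported as allocated. That suggests, without proving it, that activations for a 2048-token block account for much of the remaining memory.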


leemgs commented on May 4, 2024

12th Gen Intel Core i7 + DRAM 31GB + Nvidia Titan Xp (VRAM 12GB): failed with CUDA OOM 😭
12th Gen Intel Core i7 + DRAM 31GB + Nvidia A100 (VRAM 80GB): succeeded thanks to the 80GB of VRAM 😄

@shailja-thakur, are there any hints for getting fine-tuning to work on an NVIDIA Titan Xp? I tried various things, but they all failed. So for now I use a high-performance GPU (an NVIDIA A100 with 80GB of VRAM) to avoid the CUDA OOM reported above.


leemgs commented on May 4, 2024

Also note that it is still a bit tricky to get a custom model working – you'll have to run the conversion from HF to FasterTransformer after training it,

@moyix, first of all, thank you for sharing your experience.
Thanks to that, I was able to create a fine-tuned model (codegen-350M-multi-finetuned) as follows.

$ tree ./codegen-350M-multi-finetuned/
./codegen-350M-multi-finetuned/
├── added_tokens.json
├── all_results.json
├── config.json
├── merges.txt
├── pytorch_model.bin
├── README.md
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── trainer_state.json
├── training_args.bin
├── train_results.json
└── vocab.json

$ ls -al ./codegen-350M-multi-finetuned/
total 778380
drwxr-xr-x 2 leemgs leemgs      4096 Nov 10 16:40 .
drwxr-xr-x 6 leemgs leemgs      4096 Nov 10 16:43 ..
-rw-r--r-- 1 leemgs leemgs      1080 Nov 10 16:31 added_tokens.json
-rw-r--r-- 1 leemgs leemgs       582 Nov 10 16:31 all_results.json
-rw-r--r-- 1 leemgs leemgs      1011 Nov 10 16:31 config.json
-rw-r--r-- 1 leemgs leemgs    456356 Nov 10 16:31 merges.txt
-rw-r--r-- 1 leemgs leemgs 793630000 Nov 10 16:31 pytorch_model.bin
-rw-r--r-- 1 leemgs leemgs      1149 Nov 10 16:31 README.md
-rw-r--r-- 1 leemgs leemgs        99 Nov 10 16:31 special_tokens_map.json
-rw-r--r-- 1 leemgs leemgs       283 Nov 10 16:31 tokenizer_config.json
-rw-r--r-- 1 leemgs leemgs   2114827 Nov 10 16:31 tokenizer.json
-rw-r--r-- 1 leemgs leemgs       998 Nov 10 16:31 trainer_state.json
-rw-r--r-- 1 leemgs leemgs      4539 Nov 10 16:31 training_args.bin
-rw-r--r-- 1 leemgs leemgs       582 Nov 10 16:31 train_results.json
-rw-r--r-- 1 leemgs leemgs    798156 Nov 10 16:31 vocab.json
(deepspeed) leemgs@ai02:~/qtlab/CodeGen/checkpoints$

Using the generated fine-tuned model, I ran the "def hello_world" test.
I followed the official CodeGen documentation:

https://github.com/salesforce/CodeGen#sampling-with-repository

However, I hit an unexpected error message:

error message: 'CodeGenAttention' object has no attribute 'causal_mask'
I am perplexed as to why the "pytorch_model.bin" file I produced during fine-tuning is incompatible.
Any feedback or experience with this error message would be helpful.

(.venv) $ python3 -m jaxformer.hf.sample --model codegen-350M-multi --context "def hello_world():"


loading parameters
loading parameters took 9.95s
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 253, in <module>
    main()
  File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 225, in main
    model = create_model(ckpt=ckpt, fp16=use_fp16).to(device)
  File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 63, in create_model
    return CodeGenForCausalLM.from_pretrained(ckpt, revision='float16', torch_dtype=torch.float16, low_cpu_mem_usage=True)
  File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1526, in from_pretrained
    cls._load_state_dict_into_model_low_mem(model, loaded_state_dict_keys, resolved_archive_file)
  File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1786, in _load_state_dict_into_model_low_mem
    new_val = getattr(submodule, param_name)
  File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CodeGenAttention' object has no attribute 'causal_mask'


leemgs commented on May 4, 2024

I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code.

@moyix, I have one query about the fine-tuned Codegen model. With the 350M Codegen model, how can I compare the quality/accuracy of the original Codegen model and the fine-tuned Codegen model? I'm curious if there are any well-known benchmarking tools or general methods for comparing the quality/accuracy of these two models.
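The thread does not answer this question. One general, widely used way to compare two language models is perplexity on held-out code that neither model was trained on, where lower is better. The sketch below shows only the arithmetic and assumes you already have per-token natural-log probabilities from each model; how to obtain them is outside this example, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood over tokens (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy numbers for illustration only, not measurements from either model.
base_model = perplexity([math.log(0.25)] * 8)
fine_tuned = perplexity([math.log(0.5)] * 8)
print(base_model, fine_tuned)
```

Perplexity measures how well a model predicts the held-out text, not whether its completions compile or pass tests, so it is best treated as a first indicator rather than a full quality measure.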


How to optimize CodeGen for my code before launching FauxPilot about fauxpilot HOT 9 OPEN
