bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
License: Apache License 2.0
It would be possible to pre-build and publish the Docker images to ghcr.io, so that not everyone has to rebuild them. I mean something like this:
https://github.com/orgs/huggingface/packages?repo_name=text-generation-inference
I am happy to do this. I think I have permissions to publish packages. But, I want to confirm with @loubnabnl or someone before I do.
Running human_eval with Llama 7B gets 0 for pass@1 and pass@10, but it does achieve the correct values (pass@1 ≈ 10%) in other repos.
To reproduce, simply run
accelerate launch main.py \
--model huggyllama/llama-7b \
--max_length_generation 512 \
--tasks humaneval \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution
which returns
"humaneval": {
"pass@1": 0.0,
"pass@10": 0.0
},
I reduced n_samples from 200 to 20 to make this run in about 1 hour on a single A100.
Running the same human_eval with CodeCapybara's repo gets the correct values {'pass@1': 0.09542682926829267, 'pass@10': 0.12530930709402355}.
Could be related to huggingface/transformers#22402, although I tried explicitly setting the eos, bos and pad token ids to the same values as CodeCapybara (see here) and didn't see a change, so it might be something else.
If anyone has successfully run it here, would appreciate some tips!
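For reference, this is roughly what I mean by pinning the special token ids before generation (just a sketch; the concrete ids are assumptions and should be read from the tokenizer/config rather than hard-coded):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# LLaMA checkpoints usually ship without a pad token, so reuse eos for padding.
# Verify these ids against the tokenizer/config; they are assumptions here.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id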
{
"codexglue_code_to_text-python-left": 0.06565988797511521,
"config": {
"model": "bigcode/christmas-models"
}
}
It would be better to also include the metric name in this output, imo.
I couldn't load HuggingFaceH4/starchat-beta using our existing codebase on a single machine with multiple 40GB GPUs.
See loading instructions on the model card.
This library seems overly hard-coded, to the point where building off of it would be difficult. Is there a particular reason you decided to use this architecture for your code? Or why a fork of https://github.com/EleutherAI/lm-evaluation-harness or https://github.com/BigScience-Workshop/lm-evaluation-harness wouldn't suit your needs?
bigcode-evaluation-harness/lm_eval/utils.py:388 in complete_code

  385             if not INFILL_MODE:
  386                 gen_code = gen_code[len(prefix) :]
  387             if postprocess:
❱ 388                 code_gens[sample].append(
  389                     task.postprocess_generation(gen_code, int(sample))
Allow max_length_generation to change from batch to batch, to speed up tasks where the length varies a lot.
For tasks scored with exact match, we even know the maximum length for each sample, so it would be nice to just limit the max length for those samples. Need to be careful to make it work with batching.
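A minimal sketch of the idea, with hypothetical helper names (the real dataloader/tokenizer plumbing in the harness would differ):

def batch_max_new_tokens(batch_references, tokenizer, margin=16, cap=512):
    # For exact-match tasks the reference length bounds how long a correct
    # generation can be, so cap max_new_tokens per batch accordingly.
    longest_ref = max(len(tokenizer(ref).input_ids) for ref in batch_references)
    return min(longest_ref + margin, cap)

# inside the generation loop (sketch):
# gen_kwargs["max_new_tokens"] = batch_max_new_tokens(batch_refs, tokenizer)
# generated_tokens = model.generate(**inputs, **gen_kwargs)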
This would be a bigger refactoring, but imo it'd be better to save generations as each one finishes and, along with that, offer restarting from previously unfinished runs (e.g. if the run is interrupted or something).
Just leaving this here in case someone is interested.
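A rough sketch of what incremental saving plus restart could look like (the file layout and the generate_for_problem helper are assumptions, not the harness API):

import json
import os

def save_partial(generations, path="generations_partial.json"):
    # Rewrite the partial file after every finished problem, so an interrupted
    # run loses at most one problem's worth of work.
    with open(path, "w") as f:
        json.dump(generations, f)

def load_partial(path="generations_partial.json"):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return []

generations = load_partial()
for idx in range(len(generations), n_problems):     # n_problems: task size
    generations.append(generate_for_problem(idx))   # hypothetical helper
    save_partial(generations)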
We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!
Please take a look at the existing sheet of evaluation benchmarks here.
Please use the following template to suggest new tasks for the Evaluation Harness.
Name | Link | Number of samples | Languages | Available on the HF Hub |
---|---|---|---|---|
HumanEval | https://github.com/openai/human-eval | 164 | Python | Yes |
Here's the Markdown snippet that you can copy/paste:
|Name|Link|Number of samples| Languages |Available on the HF Hub|
|:-|:-|:-|:-|:-|
| | | | | |
In the original APPS paper and their original code, the Standard Input format is used when fn_name is not given.
But here, the Standard Input format is used when fn_name is given.
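In other words, the check should be inverted, something like this (a sketch, not the exact harness code):

def apps_call_format(sample):
    # APPS: problems without a fn_name read from standard input; problems
    # with a fn_name are call-based (the tests call the function directly).
    return "standard_input" if not sample.get("fn_name") else "call_based"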
Add tests to the existing evaluation benchmarks to make sure they are not broken by new additions. (e.g: ensure fixed generations for a specific model using greedy sampling with a fixed seed)
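A hedged sketch of what such a test could look like (the tiny test model is just an example, and the expected string is a placeholder to be pinned once reference generations are frozen):

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

def test_greedy_generation_is_stable():
    set_seed(0)  # greedy decoding + fixed seed => deterministic output
    name = "hf-internal-testing/tiny-random-gpt2"  # small model for CI
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=False, max_new_tokens=16)
    text = tok.decode(out[0], skip_special_tokens=True)
    expected = None  # placeholder: pin to the frozen reference generation
    if expected is not None:
        assert text == expected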
The command below fails with batch_size 16, but works when setting batch_size 1 🧐
(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 840.09it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 949.22it/s]
number of problems for this task is 14918
0it [00:06, ?it/s]
Traceback (most recent call last):
File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
main()
File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
results[task] = evaluator.evaluate(task)
File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
generations, references = self.generate_text(task_name)
File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
generations = parallel_generations(
File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
generations = complete_code(
File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 83, in complete_code
for step, batch in tqdm(enumerate(dataloader)):
File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/data_loader.py", line 491, in __iter__
observed_batch_size = find_batch_size(batch)
File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/utils/operations.py", line 177, in find_batch_size
raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
TypeError: Can only find the batch size of tensors but got <class 'NoneType'>.
Probably related:
(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --limit 8 --max_length_generation 512 --do_sample False --n_samples 100 --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 782.52it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 901.68it/s]
number of problems for this task is 8
0it [00:00, ?it/s]
Traceback (most recent call last):
File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
main()
File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
results[task] = evaluator.evaluate(task)
File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
generations, references = self.generate_text(task_name)
File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
generations = parallel_generations(
File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
generations = complete_code(
File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 87, in complete_code
generated_tokens = accelerator.unwrap_model(model).generate(
File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/transformers/generation/utils.py", line 1513, in generate
raise ValueError(
ValueError: num_return_sequences has to be 1, but is 16 when doing greedy search.
Thank you for providing an excellent evaluation toolkit! It is very convenient and flexible.
But when I used the evaluation tool to evaluate HumanEval performance on starcoder, I obtained the following results.
{
"humaneval": {
"pass@1": 0.3011280487804878,
"pass@10": 0.41708568124396794,
"pass@100": 0.5175640419344132
},
"config": {
"model": "../ckpt/starcoder",
"temperature": 0.2,
"n_samples": 200
}
}
This is lower than the pass@1 of 33.6 reported in the paper. Did I miss anything crucial? All parameters are the defaults.
Just wondering if we need to use fp32 for evaluation of santacoder?
I tried fp16 evaluation because I fine-tuned santacoder on the stack-dedup Python dataset for 1000 steps with fp16 precision. But when I ran fp16 evaluation on humaneval, it led to the following error (for both --model=bigcode/santacoder and --model=myfp16_finetuned_santacoder):
File "/home/ywen/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2583, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
The error went away when I used --precision=fp32, leading to 37.19% pass@100 on humaneval, which is kinda close to the number reported in the paper. This is the command I used to run the fp16 evaluation on humaneval.
accelerate launch main.py \
--model bigcode/santacoder \
--max_length_generation 368 \
--tasks humaneval \
--temperature 0.4 \
--n_samples 100 \
--batch_size 20 \
--allow_code_execution \
--trust_remote_code \
--use_auth_token \
--generation_only \
--precision fp16 \
--save_generations
Add this benchmark for documentation translation to the evaluation harness: https://huggingface.co/datasets/code_x_glue_tt_text_to_text
Add this code translation (with unit tests) task: https://github.com/facebookresearch/TransCoder. The C++ -> Python subset was used in PaLM. This requires:
Hi, I am trying to test the PR submitted for Odex and Conala task support. The repository is here. I am able to successfully run the bigcode-evaluation-harness code for inference. However, the same setup throws an error when I run the PR code.
Here is the accelerate config used
$ accelerate config
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:1
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
The command used is
$ accelerate launch main.py --model Salesforce/codegen-350M-mono --tasks odex-en --temperature 0.8 --top_p 0.95 --do_sample True --n_samples 100 --batch_size 10 --save_generations --allow_code_execution
This is the error I am getting
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
File "main.py", line 10, in <module>
from lm_eval.evaluator import Evaluator
File "/home/rudra/bigcode-evaluation-harness/lm_eval/evaluator.py", line 5, in <module>
from lm_eval import tasks
File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/__init__.py", line 3, in <module>
from . import apps, codexglue_code_to_text, conala, concode, humaneval, mbpp, codexglue_text_to_text, odex, mconala
File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/codexglue_code_to_text.py", line 56, in <module>
def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: list[tuple[str, str]]):
TypeError: 'type' object is not subscriptable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2237438) of binary: /home/rudra/.cache/A100/bin/python
Traceback (most recent call last):
File "/home/rudra/.cache/A100/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-23_02:34:25
host : cccxc578.pok.ibm.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2237438)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
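For what it's worth, the TypeError: 'type' object is not subscriptable comes from the list[tuple[str, str]] annotation in codexglue_code_to_text.py, which is only subscriptable at runtime on Python 3.9+, while the environment above is Python 3.8. Either of these changes should make the module import again (a sketch of the fix, not the merged code):

# Option 1: postpone annotation evaluation (add at the very top of the module)
from __future__ import annotations

# Option 2: use typing generics, which work on Python 3.8
from typing import List, Tuple

def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: List[Tuple[str, str]]):
    ...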
The call to .generate() in utils.py complete_code() seems to be misconfigured, since it produces the stack trace below.
Here I use model='hf-internal-testing/tiny-random-gpt2' (but codeparrot fails in the same way), and allow-code-execution=True.
Traceback (most recent call last):
File "~/bigcode-evaluation-harness/main.py", line 147, in <module>
main()
File "~/bigcode-evaluation-harness/main.py", line 132, in main
results[task] = evaluator.evaluate(task)
File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 193, in evaluate
generations, references = self.generate_text(task)
File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 70, in generate_text
generations = parallel_generations(
File "~/bigcode-evaluation-harness/lm_eval/generation.py", line 140, in parallel_generations
generations = complete_code(
File "~/bigcode-evaluation-harness/lm_eval/utils.py", line 177, in complete_code
generated_tokens = accelerator.unwrap_model(model).generate(
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1320, in generate
return self.sample(
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1938, in sample
outputs = self(
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1048, in forward
transformer_outputs = self.transformer(
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 835, in forward
position_embeds = self.wpe(position_ids)
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Hi, I have been trying to evaluate the CodeGen2 and Replit-Code models on the mbpp task, but the code runs extremely slowly. While the corresponding eval time for other models is around 2 hours, the ETA for these two models varies significantly and sometimes goes above 90 hours. Any help resolving this issue? Thanks!
Models are usually evaluated on APPS after fine-tuning on the train split, but one can also do few-shot evaluation. It is already implemented in this evaluation harness: the prompt includes two shortened examples from the train split, one for each call type (Standard Input and Call Based).
We want to improve these examples:
For the HumanEval task, we remove the last block based on the stop tokens: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/tasks/humaneval.py#L70
If no stop word is found in the generation (for example, if by chance the generation ends exactly at the function's last return statement, or before), then remove_last_block removes the entire generation and returns an empty string.
It seems to me that we should instead keep only the first block, i.e. remove anything after the first match with one of the stop tokens, if there is one.
If this issue makes sense, I'm happy to create a PR for that.
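Concretely, something along these lines (a sketch of the proposed behaviour, not the current remove_last_block):

def truncate_at_stop_tokens(generation, stop_words):
    # Keep everything up to the first occurrence of any stop word; if no stop
    # word appears, return the generation unchanged instead of dropping it.
    cut = len(generation)
    for stop in stop_words:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]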
I would like to express my sincere gratitude for the ease of code eval provided by your repository.
I see that tasks and features are being added rapidly thanks to lots of contributors.
Do you have any plan for attaching release tags for version control?
Hi, I am performing code generations using the following command
accelerate launch main.py --model bigcode/santacoder --tasks humaneval --max_length_generation 256 \
--temperature 0.8 --top_p 0.95 --do_sample True --generation_only --n_samples 100 --batch_size 32 \
--output_generations generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json \
--save_generations --allow_code_execution --trust_remote_code
I am expecting the number of candidate generations per task to be around 100. However, on inspecting the generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json file, I see that there are 96 generations per task.
Is there something I am missing? Thanks
The SantaCoder FIM evaluation with MultiPL-E uses exact match. We should also execute the generated code. The dataset is here:
https://huggingface.co/datasets/bigcode/santacoder-fim-task
All that is needed is to execute item['prefix'] + generated_solution + item['suffix'] + item['tests'].
I recommend supporting n samples per item.
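A rough sketch of the execution step (the column names are taken from the snippet above; the timeouts/sandboxing used by the harness's existing code-execution utilities would still be needed):

import os
import subprocess
import tempfile

def run_fim_sample(item, generated_solution, timeout=15):
    # Assemble the full program: prefix + generated middle + suffix + tests.
    program = item["prefix"] + generated_solution + item["suffix"] + item["tests"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0  # pass iff the assembled program exits cleanly
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)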
Recently, code generation models have been shown to be good at solving natural language and/or math reasoning tasks (1 and 2), so it would be good to evaluate the BigCode models on these tasks.
As discussed in the evaluation meeting, we could explore the options of adding the PAL datasets and/or reasoning tasks from HELM.
PAL Datasets:
(Related issue here: #73)
The requirements.txt file lists transformers==4.25.1, which doesn't support a lot of the newer models such as bigcode/starcoder and huggyllama/llama-7b (it gives errors such as KeyError: 'gpt_bigcode'). This is simple to fix from the user side (just install a newer version of transformers), but I thought I'd flag it here since it's probably best if requirements.txt can accommodate these newer models.
We would appreciate help with adding one of the following tasks to the Evaluation Harness. See the issues below:
I am trying to run your code in a Docker container from ghcr.io/bigcode-project/evaluation-harness.
The exact bash command is
accelerate launch main.py \
--model bigcode/starcoder \
--use_auth_token \
--max_length_generation 512 \
--tasks pal-gsm8k-greedy \
--n_samples 1 \
--temperature 0 \
--batch_size 1 \
--do_sample False \
--allow_code_execution \
--save_generations \
--save_generations_path ./output/starcoder_on_gsm8k.json
However, it returns the following:
Evaluating generations...
{
"pal-gsm8k-greedy": {
"accuracy": 0.0,
"num_failed_execution": 1319
},
"config": {
"model": "bigcode/starcoder",
"revision": null,
"temperature": 0.0,
"n_samples": 1
}
}
where the saved generation contents are like:
Any solutions?
Currently, the harness raises an exception when used with 8-bit models:
Traceback (most recent call last):
File "bigcode-evaluation-harness/main.py", line 233, in <module>
main()
File "bigcode-evaluation-harness/main.py", line 216, in main
results[task] = evaluator.evaluate(task)
File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 67, in evaluate
generations, references = self.generate_text(task_name)
File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
generations = parallel_generations(
File "bigcode-evaluation-harness/lm_eval/generation.py", line 83, in parallel_generations
model = model.to(accelerator.device)
File "/root/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1873, in to
raise ValueError(
ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
For context, this is the model I've been trying to eval: https://huggingface.co/cassanof/santacoder-lua/tree/main
It seems like a check is needed for every .to call... any suggestions?
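One possible guard, sketched against the parallel_generations call in the traceback above (the is_loaded_in_8bit attribute is what recent transformers versions set on bitsandbytes-loaded models, but the exact flag may vary by version):

# in lm_eval/generation.py, parallel_generations() (sketch)
if not getattr(model, "is_loaded_in_8bit", False):
    model = model.to(accelerator.device)
# 8-bit models are already dispatched to the right devices by
# load_in_8bit/device_map, so calling .to() on them must be skipped.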
Thanks for the great work and the convenient benchmarking tool!
I would like to evaluate the CodeGen-16B model on the humaneval benchmark. I have A6000 GPUs with 48 GB of memory each. The evaluation script crashes due to CUDA out of memory here (i.e. at accelerator.prepare), even with the smallest batch size of 1.
Since this is model evaluation, I would expect most of the memory to be occupied by the model params (no optimizer states). Naively, the model should fit on a single GPU if loaded in half precision, since 2 x 16 = 32 GB < 48 GB. However, when setting mixed precision to fp16 in accelerate launch, I still face the OOM error.
What measures would you suggest to fit the model onto a single GPU?
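For what it's worth, accelerate's fp16 mixed precision still keeps a full fp32 copy of the weights (~64 GB for 16B params), which is likely why it OOMs; loading the checkpoint directly in half precision is what makes it fit. A sketch (the checkpoint name is an assumed CodeGen-16B variant; in the harness this corresponds to passing --precision fp16 rather than setting mixed precision in accelerate config):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-16B-mono"  # assumed CodeGen-16B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~2 bytes/param -> ~32 GB, fits in 48 GB
).to("cuda")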
Before adding more tasks it could be a good time to take a step back and see if it makes sense to do a bit of refactoring of the code. A few aspects to consider:
These are just a few thoughts, let me know if you think this makes sense @loubnabnl.
The guide on adding a new task recommends:
"From the bigcode-evaluation-harness project root, copy over the new_task.py template to lm_eval/tasks"
but there is no new_task.py template in the project root.
I suspect this is a relic from the lm-evaluation-harness repo?
As part of the integration of MultiPL-E benchmark create Dockerfile/Docker image with all dependencies required to execute the code generations for different programming languages
Add this benchmark for code refinement to the evaluation harness: https://huggingface.co/datasets/code_x_glue_cc_code_refinement
Hi, I am trying to run the evaluation of Santacoder using the script provided, but I am getting the following error and I cannot figure out why:
File "/home/kcdharma/ndec/eval/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 910, in launch_command
simple_launcher(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 397, in simple_launcher
process = subprocess.Popen(cmd, env=current_env)
File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1762, in _execute_child
env_list.append(k + b'=' + os.fsencode(v))
File "/usr/lib/python3.10/os.py", line 810, in fsencode
filename = fspath(filename) # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType
Any help is appreciated.
We do not have natural language prompts for all tasks in the Evaluation Harness. We would either like to find prompts which have been adopted by other research groups or design prompts that work well for the task at hand. For example, we would appreciate help with designing prompts for the following tasks:
Running pip install -r requirements.txt gives me:
ERROR: Cannot install -r requirements.txt (line 1) and huggingface_hub==0.8.1 because these package versions have conflicting dependencies.
The conflict is caused by:
The user requested huggingface_hub==0.8.1
transformers 4.25.1 depends on huggingface-hub<1.0 and >=0.10.0
Instead, huggingface_hub>=0.10.0 fixed it for me and hasn't broken anything so far.
Hi. Thank you for your hard work!
I am trying to run bigcode/starcoder model on a server. I downloaded the huggingface repo with git and transferred the folder to the server.
To be extra sure I wasn't passing the path wrong, I modified main.py:
parser.add_argument(
"--model",
default="/home/ubuntu/.cache/huggingface/hub/models--bigcode--starcoder/snapshots/8a57e3930912e5d22ddc4d5f46b4b99f169afbe9",
help="Model to evaluate, provide a repo name in Hugging Face hub or a local path",
)
Once I run python main.py I get:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/star-coder/bigcode-evaluation-harness/main.py:230 in <module> │
│ │
│ 227 │
│ 228 │
│ 229 if __name__ == "__main__": │
│ ❱ 230 │ main() │
│ 231 │
│ │
│ /home/ubuntu/star-coder/bigcode-evaluation-harness/main.py:176 in main │
│ │
│ 173 │ │ │ │ f"Non valid precision {args.precision}, choose from: fp16, fp32, bf16" │
│ 174 │ │ │ ) │
│ 175 │ │ print(f"Loading tokenizer and model (in {args.precision})") │
│ ❱ 176 │ │ model = AutoModelForCausalLM.from_pretrained( │
│ 177 │ │ │ args.model, │
│ 178 │ │ │ revision=args.revision, │
│ 179 │ │ │ torch_dtype=dict_precisions[args.precision], │
│ │
│ /home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:434 in │
│ from_pretrained │
│ │
│ 431 │ │ ] │
│ 432 │ │ hub_kwargs = {name: kwargs.pop(name) for name in hub_kwargs_names if name in kwa │
│ 433 │ │ if not isinstance(config, PretrainedConfig): │
│ ❱ 434 │ │ │ config, kwargs = AutoConfig.from_pretrained( │
│ 435 │ │ │ │ pretrained_model_name_or_path, │
│ 436 │ │ │ │ return_unused_kwargs=True, │
│ 437 │ │ │ │ trust_remote_code=trust_remote_code, │
│ │
│ /home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py:8 │
│ 29 in from_pretrained │
│ │
│ 826 │ │ │ ) │
│ 827 │ │ │ return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs) │
│ 828 │ │ elif "model_type" in config_dict: │
│ ❱ 829 │ │ │ config_class = CONFIG_MAPPING[config_dict["model_type"]] │
│ 830 │ │ │ return config_class.from_dict(config_dict, **unused_kwargs) │
│ 831 │ │ else: │
│ 832 │ │ │ # Fallback: use pattern matching on the string. │
│ │
│ /home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py:5 │
│ 36 in __getitem__ │
│ │
│ 533 │ │ if key in self._extra_content: │
│ 534 │ │ │ return self._extra_content[key] │
│ 535 │ │ if key not in self._mapping: │
│ ❱ 536 │ │ │ raise KeyError(key) │
│ 537 │ │ value = self._mapping[key] │
│ 538 │ │ module_name = model_type_to_module_name(key) │
│ 539 │ │ if module_name not in self._modules: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'gpt_bigcode'
Any idea on how to fix this issue? I had trouble authenticating with Hugging Face, which is why I am trying to load a local model.
I've tried many nodes and this error is always reported. According to this link it seems that the torch version needs to be upgraded, but the highest torch version supported on Python 3.7 is 1.13.1, so it looks like a dead end. How do I avoid this problem, given that others have apparently run this successfully? My evaluation command is as follows:
accelerate launch main.py \
--model bigcode/starcoder \
--tasks apps-introductory \
--max_length_generation 2048 \
--temperature 0.8 \
--n_samples 1 \
--batch_size 32 \
--save_generations \
--precision bf16 \
--save_generations_path generations.json \
--metric_output_path evaluation_results.json \
--allow_code_execution
Add support for the FIM task on HumanEval discussed in this issue, and make sure the numbers match the MultiPL-E implementation for santacoder, for example (see Table 7 of this paper).
The evaluation harness already supports FIM mode for santacoder and incoder, which is used by the DS-1000 task for insertion mode.
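For reference, a sketch of how a SantaCoder-style FIM prompt could be assembled and scored with exact match (the sentinel strings below are the dashed SantaCoder variants and should be double-checked against the tokenizer; StarCoder uses underscored variants):

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim-prefix>", "<fim-suffix>", "<fim-middle>"

def build_fim_prompt(prefix, suffix):
    # PSM (prefix-suffix-middle) ordering: the model sees prefix and suffix,
    # then generates the missing middle after the <fim-middle> sentinel.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def fim_exact_match(generated_middle, reference_middle):
    # Exact-match scoring, as in the MultiPL-E single-line infilling setup.
    return generated_middle.strip() == reference_middle.strip()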
I am running the following :
accelerate launch main.py \
--model bigcode/starcoder \
--max_length_generation 512 \
--tasks multiple-js \
--n_samples 120 \
--batch_size 10 \
--temperature 0.2 \
--precision bf16 \
--allow_code_execution --use_auth_token
The result is:
{
"multiple-js": {
"pass@1": 0.0,
"pass@10": 0.0,
"pass@100": 0.0
},
"config": {
"model": "bigcode/starcoderbase",
"temperature": 0.1,
"n_samples": 120
}
}
Are there any other parameters that I might be missing?
If "bs" in the Language list means bash, I think it should be changed to "sh".
Even if you look at the dataset name in hf, it is sh, so the current code does not load the dataset properly when working with All_task.
HumanEval-X from CodeGeeX is a multilingual version of HumanEval for Java, JS, C++ and Go. In addition to code generation, it can also be used for code translation. We want to: