
bigcode-evaluation-harness's People

Contributors

andre15silva, arjunguha, armelrandy, benlipkin, cassanof, changwangss, chiyeunglaw, didier-durand, elfsong, ganler, icsawyer, infinitylogesh, iq179, keytoyze, loubnabnl, lvwerra, manandey, maxmatical, mitya52, muennighoff, raymondli0, sedrickkeh, shehrozek-cerebras, siviltaram, terryyz, thomwolf, vikparuchuri


bigcode-evaluation-harness's Issues

Llama 7B fails for Human Eval

Running humaneval with Llama 7B gets 0 for pass@1 and pass@10, but the model does achieve the expected values (pass@1 ≈ 10%) in other repos.

To reproduce, simply run

accelerate launch  main.py \
  --model huggyllama/llama-7b \
  --max_length_generation 512 \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution

which returns

"humaneval": {
    "pass@1": 0.0,
    "pass@10": 0.0
  },

I reduced n_samples from 200 to 20 to make this run in about 1 hour on a single A100.

Running the same human_eval with CodeCapybara's repo gets the correct values {'pass@1': 0.09542682926829267, 'pass@10': 0.12530930709402355}

This could be related to huggingface/transformers#22402, although I tried explicitly setting the eos, bos, and pad token ids to the same values as CodeCapybara (see here) and didn't see a change, so it might be something else.
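For reference, here is a minimal sketch of what "explicitly setting eos, bos, pad token ids" might look like; the concrete ids are assumptions based on common LLaMA configs, not values taken from this repo or from CodeCapybara.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Assumed LLaMA defaults; LLaMA has no pad token, so id 0 is reused for padding.
tokenizer.bos_token_id = 1
tokenizer.eos_token_id = 2
tokenizer.pad_token_id = 0

# Keep the model's generation-time ids in sync with the tokenizer.
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id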

If anyone has successfully run it here, would appreciate some tips!

Show metric in outfile

{
  "codexglue_code_to_text-python-left": 0.06565988797511521,
  "config": {
    "model": "bigcode/christmas-models"
  }
}

It would be better to also include the metric name in the output file, IMO.

error: list index out of range, when testing in multi-gpu?

bigcode-evaluation-harness/lm_eval/utils.py:388 in complete_code

  385             if not INFILL_MODE:
  386                 gen_code = gen_code[len(prefix) :]
  387             if postprocess:
> 388                 code_gens[sample].append(
  389                     task.postprocess_generation(gen_code, int(sample))

Variable max_length_generation

Allow max_length_generation to change from batch to batch to speed up tasks where length varies a lot.
For tasks scored with exact match we even know the maximum length for each sample, so it would be nice to limit the max length for those samples. We need to be careful to make it work with batching.
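A rough sketch of the idea (hypothetical helper, not harness code): derive the budget for each batch from the longest prompt-plus-reference in that batch instead of using one global value.

def batch_max_length(prompt_lengths, reference_lengths, hard_cap=512, margin=16):
    """Per-batch max_length for exact-match tasks where reference lengths are known."""
    longest = max(p + r for p, r in zip(prompt_lengths, reference_lengths))
    return min(longest + margin, hard_cap)

# Token lengths for the samples grouped into one batch.
print(batch_max_length([32, 40, 37], [8, 12, 10]))  # -> 68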

Would be better to save generations on the fly

This would be a bigger refactoring, but IMO it would be better to save generations as soon as each one finishes, and along with that offer restarting from previously unfinished runs (e.g. if the job is interrupted).

Just leaving this here in case someone is interested.
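A rough sketch of the suggestion (hypothetical file layout, not the harness's actual save path): persist each problem's generations as soon as they finish and skip already-completed problems on restart.

import json
import os

def load_done(path):
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {int(k): v for k, v in json.load(f).items()}

def save_done(path, done):
    with open(path, "w") as f:
        json.dump(done, f)

def run_with_resume(problems, generate_fn, path="generations_partial.json"):
    done = load_done(path)
    for idx, problem in enumerate(problems):
        if idx in done:  # resume: skip problems finished in a previous run
            continue
        done[idx] = generate_fn(problem)
        save_done(path, done)  # persist after every problem
    return [done[i] for i in range(len(problems))]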

Suggest tasks for the Evaluation Harness

Creating an Evaluation Harness for code LLMs

We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!

Existing list

Please take a look at the existing sheet of evaluation benchmarks here.

Contribute

Please use the following template to suggest new tasks for the Evaluation Harness.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| HumanEval | https://github.com/openai/human-eval | 164 | Python | Yes |

Here's the Markdown snippet that you can copy/paste:

|Name|Link|Number of samples|Languages|Available on the HF Hub|
|:-|:-|:-|:-|:-|
| | | | | |

Add tests to the evaluation harness

Add tests to the existing evaluation benchmarks to make sure they are not broken by new additions (e.g., assert fixed generations for a specific model using greedy sampling with a fixed seed).
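A minimal sketch of such a test (pytest style), assuming a tiny public test model; the expected string is a placeholder that would be recorded once, after which any change in greedy output fails the test.

from transformers import AutoModelForCausalLM, AutoTokenizer

def test_greedy_generation_is_stable():
    name = "hf-internal-testing/tiny-random-gpt2"  # tiny model keeps CI fast
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer("def add(a, b):", return_tensors="pt")
    # Greedy decoding is deterministic, so the output can be pinned exactly.
    output = model.generate(inputs.input_ids, do_sample=False, max_new_tokens=8)
    text = tokenizer.decode(output[0])
    assert text == "<EXPECTED_OUTPUT_RECORDED_ONCE>"  # placeholder reference string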

support for batch size > 1 for single problem generations (n_samples=1)

The command below works when setting batch_size to 1 🧐

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 840.09it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 949.22it/s]
number of problems for this task is 14918
0it [00:06, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 83, in complete_code
    for step, batch in tqdm(enumerate(dataloader)):
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/data_loader.py", line 491, in __iter__
    observed_batch_size = find_batch_size(batch)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/utils/operations.py", line 177, in find_batch_size
    raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
TypeError: Can only find the batch size of tensors but got <class 'NoneType'>.

Probably related:

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --limit 8 --max_length_generation 512 --do_sample False --n_samples 100 --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 782.52it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 901.68it/s]
number of problems for this task is 8
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 87, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/transformers/generation/utils.py", line 1513, in generate
    raise ValueError(
ValueError: num_return_sequences has to be 1, but is 16 when doing greedy search.

Reproducing the performance of HumanEval on starcoder

Thank you for providing an excellent evaluation toolkit! It is very convenient and flexible.

But when I used the evaluation tool to evaluate HumanEval performance on starcoder, I obtained the following results.

{
  "humaneval": {
    "pass@1": 0.3011280487804878,
    "pass@10": 0.41708568124396794,
    "pass@100": 0.5175640419344132
  },
  "config": {
    "model": "../ckpt/starcoder",
    "temperature": 0.2,
    "n_samples": 200
  }
}

This is lower than the paper's result, where pass@1 is 33.6. Did I miss anything crucial? All parameters are the defaults.

santacoder fp16 causes NaN on humaneval?

Just wondering if we need to use fp32 for evaluation of santacoder?
I tried fp16 evaluation because I fine-tuned santacoder on the stack-dedup python dataset for 1000 steps with fp16 precision. But when I ran fp16 evaluation on humaneval, it led to the following error (for both --model=bigcode/santacoder and --model=myfp16_finetuned_santacoder):

File "/home/ywen/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2583, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The error went away when I used --precision=fp32, which led to 37.19% pass@100 on humaneval, fairly close to the number reported in the paper. This is the command I used to run the fp16 evaluation on humaneval.

accelerate launch main.py \
    --model bigcode/santacoder \
    --max_length_generation 368 \
    --tasks humaneval \
    --temperature 0.4 \
    --n_samples 100 \
    --batch_size 20 \
    --allow_code_execution \
    --trust_remote_code \
    --use_auth_token \
    --generation_only \
    --precision fp16 \
    --save_generations 

add TransCoder task for code translation

Add this code translation (with unit tests) task: https://github.com/facebookresearch/TransCoder. The C++ -> Python subset was used in PaLM. This requires:

  • adding the evaluation metric to HuggingFace evaluate https://huggingface.co/docs/evaluate/index
  • adding the TransCoder dataset to the HuggingFace Hub; there is already this dataset, but make sure it matches the original dataset in the GitHub repo.
  • adding the benchmark to the evaluation harness in a few-shot setting (similarly to the PaLM approach)

Error Running Odex Integration Code

Hi, I am trying to test the PR submitted for Odex and Conala task support. The repository is here. I am able to successfully run the bigcode-evaluation-harness code for inference. However, using the same setup throws an error when I run the PR code.

Here is the accelerate config used

$ accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo? [yes/NO]: NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]: 1
What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: all
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

The command used is

$ accelerate launch main.py --model Salesforce/codegen-350M-mono --tasks odex-en --temperature 0.8 --top_p 0.95 --do_sample True --n_samples 100 --batch_size 10 --save_generations --allow_code_execution

This is the error I am getting

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "main.py", line 10, in <module>
    from lm_eval.evaluator import Evaluator
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/evaluator.py", line 5, in <module>
    from lm_eval import tasks
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/__init__.py", line 3, in <module>
    from . import apps, codexglue_code_to_text, conala, concode, humaneval, mbpp, codexglue_text_to_text, odex, mconala
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/codexglue_code_to_text.py", line 56, in <module>
    def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: list[tuple[str, str]]):
TypeError: 'type' object is not subscriptable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2237438) of binary: /home/rudra/.cache/A100/bin/python
Traceback (most recent call last):
  File "/home/rudra/.cache/A100/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-23_02:34:25
  host      : cccxc578.pok.ibm.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2237438)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
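The failing line in the traceback uses a built-in generic annotation (list[tuple[str, str]]), which requires Python 3.9+ (PEP 585); under the Python 3.8 environment shown above it raises exactly this TypeError at import time. A backward-compatible spelling would be:

from typing import List, Tuple

def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: List[Tuple[str, str]]):
    ...

Alternatively, adding from __future__ import annotations at the top of that module defers evaluation of the annotations and also avoids the error on 3.8.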

main() crashes with --allow-code-execution=True

The call to .generate() in utils.py's complete_code() seems to be misconfigured, since it produces the stack trace below.

Here I use model='hf-internal-testing/tiny-random-gpt2' (but codeparrot fails in the same way) and allow-code-execution=True.

Traceback (most recent call last):
  File "~/bigcode-evaluation-harness/main.py", line 147, in <module>
    main()
  File "~/bigcode-evaluation-harness/main.py", line 132, in main
    results[task] = evaluator.evaluate(task)
  File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 193, in evaluate
    generations, references = self.generate_text(task)
  File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 70, in generate_text
    generations = parallel_generations(
  File "~/bigcode-evaluation-harness/lm_eval/generation.py", line 140, in parallel_generations
    generations = complete_code(
  File "~/bigcode-evaluation-harness/lm_eval/utils.py", line 177, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1320, in generate
    return self.sample(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1938, in sample
    outputs = self(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1048, in forward
    transformer_outputs = self.transformer(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 835, in forward
    position_embeds = self.wpe(position_ids)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
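One possible cause (not confirmed here): the failure happens in the position-embedding lookup (wpe), which typically means the prompt plus generated tokens exceeds the model's position-embedding table; hf-internal-testing/tiny-random-gpt2 in particular has a very small context window, so a generic max_length easily overflows it. A hedged sketch of a guard:

def safe_max_length(model, requested_max_length):
    # Attribute names vary per architecture, hence the defensive getattr chain.
    cfg = model.config
    limit = getattr(cfg, "max_position_embeddings", None) or getattr(cfg, "n_positions", None)
    return requested_max_length if limit is None else min(requested_max_length, limit)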

MBPP eval extremely slow for CodeGen2 and Replit-Code

Hi, I have been trying to evaluate the CodeGen2 and Replit-Code models on the mbpp task, but the code runs extremely slowly. While the corresponding eval time for other models is around 2 hours, the ETA for these two models varies significantly and sometimes goes above 90 hours. Any help resolving this issue? Thanks!

improve the prompt examples of one-shot setting in APPS evaluation

Models are usually evaluated on APPS after fine-tuning on the train split, but one can also do few-shot evaluation. It is already implemented in this evaluation harness: the prompt includes two shortened examples from the train split, one for each call type (Standard Input and Call Based).

We want to improve these examples:

  • analyse the different question types of APPS and build 2 or 3 examples to cover these types (make sure they aren't in the test set)
  • see how models behave given different examples (you can play with the model demos/spaces in this org; there's codeparrot, incoder, and codegen)
  • the prompt shouldn't end up being too long

HumanEval post-processing

For the HumanEval task, we remove the last block, based on the stop tokens: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/tasks/humaneval.py#L70

If no stopword is found in the generation (for example if by chance the generation ends exactly at the function's last return statement, or before), then remove_last_block would remove the entire generation and return an empty string.

It seems to me that we should instead remove anything that comes after the first block only when there is a match with one of the stop tokens, and otherwise keep the generation intact.
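A minimal sketch of the proposed behaviour: truncate at the earliest stop-token occurrence, and keep the generation as-is when no stop token appears (instead of returning an empty string).

def truncate_at_first_stop(generation, stop_tokens):
    cut = len(generation)
    for stop in stop_tokens:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

print(truncate_at_first_stop("    return a + b\nclass Foo:", ["\nclass", "\ndef"]))
# -> "    return a + b"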

If this issue makes sense, happy to create a PR for that.

Any plan for attaching release tag?

I would like to express my sincere gratitude for the ease of code eval provided by your repository.
I see that tasks and features are being added rapidly thanks to lots of contributors.
Do you have any plans for attaching release tags for version control?

Query around n_samples argument

Hi, I am performing code generations using the following command

accelerate launch  main.py --model bigcode/santacoder --tasks humaneval --max_length_generation 256 \
--temperature 0.8 --top_p 0.95 --do_sample True --generation_only --n_samples 100 --batch_size 32 \
--output_generations generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json \
--save_generations --allow_code_execution --trust_remote_code

I am expecting the number of candidate generations per task to be around 100. However, on inspecting the generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json file I see that there are 96 generations per task.

Is there something I am missing? Thanks

Add Reasoning tasks to the evaluation

In recent times, code generation models have been shown to be good at solving natural language and/or math reasoning tasks (1 and 2). So it would be good to evaluate the BigCode models on these tasks.

As discussed in the evaluation meeting, we could explore the options of adding the PAL datasets and/or reasoning tasks from HELM.

PAL Datasets:

requirements.txt doesn't support newer models (KeyError)

(Related issue here: #73)
The requirements.txt file pins transformers==4.25.1, which doesn't support many of the newer models such as bigcode/starcoder and huggyllama/llama-7b (it gives errors such as KeyError: 'gpt_bigcode'). This is simple to fix on the user side (just install a newer version of transformers), but I thought I'd flag it here, since it's probably best if requirements.txt can accommodate these newer models.

failed evaluation on GSM8K

I tried to run your code in a docker container from ghcr.io/bigcode-project/evaluation-harness.

The exact bash command is

accelerate launch  main.py \
  --model bigcode/starcoder \
  --use_auth_token \
  --max_length_generation 512 \
  --tasks pal-gsm8k-greedy \
  --n_samples 1 \
  --temperature 0 \
  --batch_size 1 \
  --do_sample False \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./output/starcoder_on_gsm8k.json

However, it returns the following:

Evaluating generations...
{
  "pal-gsm8k-greedy": {
    "accuracy": 0.0,
    "num_failed_execution": 1319
  },
  "config": {
    "model": "bigcode/starcoder",
    "revision": null,
    "temperature": 0.0,
    "n_samples": 1
  }
}

where the saved generation contents are like:
(screenshot of the saved generations attached in the original issue)

Any solutions?

8-bit models unsupported

Currently, the harness raises an exception when used with 8-bit models:

Traceback (most recent call last):
  File "bigcode-evaluation-harness/main.py", line 233, in <module>
    main()
  File "bigcode-evaluation-harness/main.py", line 216, in main
    results[task] = evaluator.evaluate(task)
  File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 67, in evaluate
    generations, references = self.generate_text(task_name)
  File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "bigcode-evaluation-harness/lm_eval/generation.py", line 83, in parallel_generations
    model = model.to(accelerator.device)
  File "/root/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1873, in to
    raise ValueError(
ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

For context, this is the model I've been trying to eval: https://huggingface.co/cassanof/santacoder-lua/tree/main

It seems like a check is needed for every .to call... any suggestions?
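A sketch of the kind of guard suggested here, wrapping the .to call; is_loaded_in_8bit is the flag transformers sets on bitsandbytes-quantized models, accessed defensively in case the attribute is absent.

def maybe_to(model, device):
    if getattr(model, "is_loaded_in_8bit", False):
        return model  # quantized models are already placed correctly; don't move or cast them
    return model.to(device)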

How to evaluate the model memory efficiently?

Thanks for the great work and convenient benchmarking tool!

I would like to evaluate the CodeGen-16B model on the humaneval benchmark. At my disposal are A6000 GPUs with 48 GB of memory. The evaluation script crashes due to CUDA out of memory here (i.e. in accelerator.prepare) even with the smallest batch size, 1.

Since this is model evaluation, I would expect most of the memory to be occupied by the model parameters (no optimizer states).
Naively, this model should fit on a single GPU if loaded in half precision, since 2 x 16 = 32 GB < 48 GB. However, when setting mixed precision to fp16 in accelerate launch, I still face the OOM error.

What measures would you suggest to fit the model onto a single GPU?
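One detail worth noting: accelerate's mixed precision autocasts the compute but keeps the weights in fp32 (~64 GB for 16B parameters), so it does not by itself bring the footprint down to the ~32 GB that fp16 weights would need. Loading the checkpoint itself in half precision does; a minimal sketch, assuming the CodeGen-16B mono variant (the harness's --precision fp16 flag should have a similar effect):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-16B-mono",  # assumption: the 16B variant being evaluated
    torch_dtype=torch.float16,      # store weights in fp16 (~2 bytes/param)
)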

Consider a refactoring

Before adding more tasks, it could be a good time to take a step back and see if it makes sense to do a bit of refactoring of the code. A few aspects to consider:

  • How can we make it as easy as possible to add new metrics? It's possible that we may want to add a few dozen more datasets, each with its own quirks. We can look at other frameworks like the lm-evaluation-harness to see how it's done there and whether it makes sense to build on top of it or just take inspiration. E.g., it would be nice if adding a new evaluation required changes in as few places as possible.
  • Going multilingual, we might need to run the code execution in different environments. Maybe we should decouple generation and execution by saving the results to disk in between.
  • For the execution part, we will probably need to think about Docker environments to execute code in different frameworks.

These are just a few thoughts, let me know if you think this makes sense @loubnabnl.

MultiPL-E Integration

As part of the integration of the MultiPL-E benchmark, create a Dockerfile/Docker image with all dependencies required to execute the code generations for the different programming languages.

Problem launching evaluation

Hi, I am trying to run the evaluation of Santacoder using the script provided, but I am getting the following error, which I am not able to figure out:

File "/home/kcdharma/ndec/eval/bin/accelerate", line 8, in <module>
  sys.exit(main())
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
  args.func(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 910, in launch_command
  simple_launcher(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 397, in simple_launcher
  process = subprocess.Popen(cmd, env=current_env)
File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
  self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1762, in _execute_child
  env_list.append(k + b'=' + os.fsencode(v))
File "/usr/lib/python3.10/os.py", line 810, in fsencode
  filename = fspath(filename)  # Does type-checking of `filename`.

TypeError: expected str, bytes or os.PathLike object, not NoneType

Any help is appreciated.

Design prompts for few-shot evaluation tasks

We do not have natural language prompts for all tasks in the Evaluation Harness. We would either like to find prompts which have been adopted by other research groups or design prompts that work well for the task at hand. For example, we would appreciate help with designing prompts for the following tasks:

  • APPS
  • CoNaLA
  • Concode

[Minor] Conflicting dependencies in requirements.txt

Running pip install -r requirements.txt gives me

ERROR: Cannot install -r requirements.txt (line 1) and huggingface_hub==0.8.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested huggingface_hub==0.8.1
    transformers 4.25.1 depends on huggingface-hub<1.0 and >=0.10.0

Instead,

huggingface_hub>=0.10.0

fixed it for me and hasn't broken anything so far.

Cannot run eval with local model directory

Hi. Thank you for your hard work!

I am trying to run bigcode/starcoder model on a server. I downloaded the huggingface repo with git and transferred the folder to the server.

To be extra sure I wasn't passing the path wrong, I modified main.py:

    parser.add_argument(
        "--model",
        default="/home/ubuntu/.cache/huggingface/hub/models--bigcode--starcoder/snapshots/8a57e3930912e5d22ddc4d5f46b4b99f169afbe9",
        help="Model to evaluate, provide a repo name in Hugging Face hub or a local path",
    )

Once I run python main.py I get:

Traceback (most recent call last):
  File "/home/ubuntu/star-coder/bigcode-evaluation-harness/main.py", line 230, in <module>
    main()
  File "/home/ubuntu/star-coder/bigcode-evaluation-harness/main.py", line 176, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 829, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 536, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'

Any idea on how to fix this issue? I had trouble authenticating with Hugging Face, which is why I am trying to load a local model.

When I evaluated the dataset APPS, I got the error RuntimeError: stack.size() >= frames.back().function->n_inputs INTERNAL ASSERT FAILED

I've tried many nodes and this error is reported on all of them. According to this link, it seems that the torch version needs to be upgraded, but the highest torch version supported for Python 3.7 is 1.13.1, so it looks like this is a dead end. How do I avoid this problem, given that others have apparently run this evaluation successfully? My evaluation command is as follows:
accelerate launch main.py \
  --model bigcode/starcoder \
  --tasks apps-introductory \
  --max_length_generation 2048 \
  --temperature 0.8 \
  --n_samples 1 \
  --batch_size 32 \
  --save_generations \
  --precision bf16 \
  --save_generations_path generations.json \
  --metric_output_path evaluation_results.json \
  --allow_code_execution

Getting Zeros for StarCoder on multiple-js

I am running the following :

accelerate launch  main.py   \
  --model bigcode/starcoder   \
  --max_length_generation 512  \
  --tasks multiple-js   \
  --n_samples 120  \
  --batch_size 10  \
  --temperature 0.2  \
  --precision bf16  \
  --allow_code_execution   --use_auth_token

The result is:

{
  "multiple-js": {
    "pass@1": 0.0,
    "pass@10": 0.0,
    "pass@100": 0.0
  },
  "config": {
    "model": "bigcode/starcoderbase",
    "temperature": 0.1,
    "n_samples": 120
  }
}

Are there any other parameters that I might be missing?
