
bigcode-evaluation-harness's People

Contributors

andre15silva, arjunguha, armelrandy, benlipkin, cassanof, changwangss, chiyeunglaw, didier-durand, elfsong, ganler, icsawyer, infinitylogesh, iq179, keytoyze, loubnabnl, lvwerra, manandey, maxmatical, mitya52, muennighoff, raymondli0, sedrickkeh, shehrozek-cerebras, siviltaram, terryyz, thomwolf, vikparuchuri


bigcode-evaluation-harness's Issues

Llama 7B fails for Human Eval

Running humaneval with Llama 7B gets 0 for pass@1 and pass@10, but the model does achieve the expected values (pass@1 ≈ 10%) in other repos.

To reproduce, simply run

accelerate launch  main.py \
  --model huggyllama/llama-7b \
  --max_length_generation 512 \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution

which returns

"humaneval": {
    "pass@1": 0.0,
    "pass@10": 0.0
  },

I reduced n_samples from 200 to 20 to make this run in about 1 hour on a single A100.

Running the same human_eval with CodeCapybara's repo gets the correct values {'pass@1': 0.09542682926829267, 'pass@10': 0.12530930709402355}

This could be related to huggingface/transformers#22402, although I tried explicitly setting the eos, bos, and pad token ids to the same values as CodeCapybara (see here) and didn't see a change, so it might be something else.
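For reference, here is a minimal sketch of what "explicitly setting eos, bos, pad token ids" might look like; the concrete ids are assumptions based on common LLaMA configs, not values taken from this repo or from CodeCapybara.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Assumed LLaMA defaults; LLaMA has no pad token, so id 0 is reused for padding.
tokenizer.bos_token_id = 1
tokenizer.eos_token_id = 2
tokenizer.pad_token_id = 0

# Keep the model's generation-time ids in sync with the tokenizer.
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id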

If anyone has successfully run it here, would appreciate some tips!

Show metric in outfile

{
  "codexglue_code_to_text-python-left": 0.06565988797511521,
  "config": {
    "model": "bigcode/christmas-models"
  }
}

It would be better to also include the metric name in the output file, IMO.

error: list index out of range, when testing in multi-gpu?

bigcode-evaluation-harness/lm_eval/utils.py:388 in complete_code

  385             if not INFILL_MODE:
  386                 gen_code = gen_code[len(prefix) :]
  387             if postprocess:
> 388                 code_gens[sample].append(
  389                     task.postprocess_generation(gen_code, int(sample))

Variable max_length_generation

Allow max_length_generation to change from batch to batch to speed up tasks where length varies a lot.
For tasks scored with exact match we even know the maximum length for each sample, so it would be nice to limit the max length for those samples. We need to be careful to make it work with batching.
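A rough sketch of the idea (hypothetical helper, not harness code): derive the budget for each batch from the longest prompt-plus-reference in that batch instead of using one global value.

def batch_max_length(prompt_lengths, reference_lengths, hard_cap=512, margin=16):
    """Per-batch max_length for exact-match tasks where reference lengths are known."""
    longest = max(p + r for p, r in zip(prompt_lengths, reference_lengths))
    return min(longest + margin, hard_cap)

# Token lengths for the samples grouped into one batch.
print(batch_max_length([32, 40, 37], [8, 12, 10]))  # -> 68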

Would be better to save generations on the fly

This would be a bigger refactoring, but IMO it would be better to save generations as soon as each one finishes, and along with that offer restarting from previously unfinished runs (e.g. if the job is interrupted).

Just leaving this here in case someone is interested.
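A rough sketch of the suggestion (hypothetical file layout, not the harness's actual save path): persist each problem's generations as soon as they finish and skip already-completed problems on restart.

import json
import os

def load_done(path):
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {int(k): v for k, v in json.load(f).items()}

def save_done(path, done):
    with open(path, "w") as f:
        json.dump(done, f)

def run_with_resume(problems, generate_fn, path="generations_partial.json"):
    done = load_done(path)
    for idx, problem in enumerate(problems):
        if idx in done:  # resume: skip problems finished in a previous run
            continue
        done[idx] = generate_fn(problem)
        save_done(path, done)  # persist after every problem
    return [done[i] for i in range(len(problems))]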

Suggest tasks for the Evaluation Harness

Creating an Evaluation Harness for code LLMs

We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!

Existing list

Please take a look at the existing sheet of evaluation benchmarks here.

Contribute

Please use the following template to suggest new tasks for the Evaluation Harness.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| HumanEval | https://github.com/openai/human-eval | 164 | Python | Yes |

Here's the Markdown snippet that you can copy/paste:

|Name|Link|Number of samples|Languages|Available on the HF Hub|
|:-|:-|:-|:-|:-|
| | | | | |

Add tests to the evaluation harness

Add tests to the existing evaluation benchmarks to make sure they are not broken by new additions (e.g., assert fixed generations for a specific model using greedy sampling with a fixed seed).
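A minimal sketch of such a test (pytest style), assuming a tiny public test model; the expected string is a placeholder that would be recorded once, after which any change in greedy output fails the test.

from transformers import AutoModelForCausalLM, AutoTokenizer

def test_greedy_generation_is_stable():
    name = "hf-internal-testing/tiny-random-gpt2"  # tiny model keeps CI fast
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer("def add(a, b):", return_tensors="pt")
    # Greedy decoding is deterministic, so the output can be pinned exactly.
    output = model.generate(inputs.input_ids, do_sample=False, max_new_tokens=8)
    text = tokenizer.decode(output[0])
    assert text == "<EXPECTED_OUTPUT_RECORDED_ONCE>"  # placeholder reference string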

support for batch size > 1 for single problem generations (n_samples=1)

The command below works when setting batch_size to 1 🧐

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 840.09it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 949.22it/s]
number of problems for this task is 14918
0it [00:06, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 83, in complete_code
    for step, batch in tqdm(enumerate(dataloader)):
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/data_loader.py", line 491, in __iter__
    observed_batch_size = find_batch_size(batch)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/utils/operations.py", line 177, in find_batch_size
    raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
TypeError: Can only find the batch size of tensors but got <class 'NoneType'>.

Probably related:

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --limit 8 --max_length_generation 512 --do_sample False --n_samples 100 --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 782.52it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 901.68it/s]
number of problems for this task is 8
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 87, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/transformers/generation/utils.py", line 1513, in generate
    raise ValueError(
ValueError: num_return_sequences has to be 1, but is 16 when doing greedy search.

Reproducing the performance of HumanEval on starcoder

Thank you for providing an excellent evaluation toolkit! It is very convenient and flexible.

But when I used the evaluation tool to evaluate HumanEval performance on starcoder, I obtained the following results.

{
  "humaneval": {
    "pass@1": 0.3011280487804878,
    "pass@10": 0.41708568124396794,
    "pass@100": 0.5175640419344132
  },
  "config": {
    "model": "../ckpt/starcoder",
    "temperature": 0.2,
    "n_samples": 200
  }
}

This is lower than the paper's result, where pass@1 is 33.6. Did I miss anything crucial? All parameters are the defaults.

santacoder fp16 causes NaN on humaneval?

Just wondering if we need to use fp32 for evaluation of santacoder?
I tried fp16 evaluation because I fine-tuned santacoder on the stack-dedup python dataset for 1000 steps with fp16 precision. But when I ran fp16 evaluation on humaneval, it led to the following error (for both --model=bigcode/santacoder and --model=myfp16_finetuned_santacoder):

File "/home/ywen/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2583, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The error went away when I used --precision=fp32, which led to 37.19% pass@100 on humaneval, fairly close to the number reported in the paper. This is the command I used to run the fp16 evaluation on humaneval.

accelerate launch main.py \
    --model bigcode/santacoder \
    --max_length_generation 368 \
    --tasks humaneval \
    --temperature 0.4 \
    --n_samples 100 \
    --batch_size 20 \
    --allow_code_execution \
    --trust_remote_code \
    --use_auth_token \
    --generation_only \
    --precision fp16 \
    --save_generations 

add TransCoder task for code translation

Add this code translation (with unit tests) task: https://github.com/facebookresearch/TransCoder. The C++ -> Python subset was used in PaLM. This requires:

  • adding the evaluation metric to HuggingFace evaluate https://huggingface.co/docs/evaluate/index
  • adding the TransCoder dataset to the HuggingFace Hub; there is already this dataset, but make sure it matches the original dataset in the GitHub repo.
  • adding the benchmark to the evaluation harness in a few-shot setting (similarly to the PaLM approach)

Error Running Odex Integration Code

Hi, I am trying to test the PR submitted for Odex and Conala task support. The repository is here. I am able to successfully run the bigcode-evaluation-harness code for inference. However, using the same setup throws an error when I run the PR code.

Here is the accelerate config used

$ accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo? [yes/NO]: NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]: 1
What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: all
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

The command used is

$ accelerate launch main.py --model Salesforce/codegen-350M-mono --tasks odex-en --temperature 0.8 --top_p 0.95 --do_sample True --n_samples 100 --batch_size 10 --save_generations --allow_code_execution

This is the error I am getting

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "main.py", line 10, in <module>
    from lm_eval.evaluator import Evaluator
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/evaluator.py", line 5, in <module>
    from lm_eval import tasks
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/__init__.py", line 3, in <module>
    from . import apps, codexglue_code_to_text, conala, concode, humaneval, mbpp, codexglue_text_to_text, odex, mconala
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/codexglue_code_to_text.py", line 56, in <module>
    def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: list[tuple[str, str]]):
TypeError: 'type' object is not subscriptable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2237438) of binary: /home/rudra/.cache/A100/bin/python
Traceback (most recent call last):
  File "/home/rudra/.cache/A100/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-23_02:34:25
  host      : cccxc578.pok.ibm.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2237438)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
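The failing line in the traceback uses a built-in generic annotation (list[tuple[str, str]]), which requires Python 3.9+ (PEP 585); under the Python 3.8 environment shown above it raises exactly this TypeError at import time. A backward-compatible spelling would be:

from typing import List, Tuple

def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: List[Tuple[str, str]]):
    ...

Alternatively, adding from __future__ import annotations at the top of that module defers evaluation of the annotations and also avoids the error on 3.8.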

main() crashes with --allow-code-execution=True

The call to .generate() in utils.py's complete_code() seems to be misconfigured, since it produces the stack trace below.

Here I use model='hf-internal-testing/tiny-random-gpt2' (but codeparrot fails in the same way) and allow-code-execution=True.

Traceback (most recent call last):
  File "~/bigcode-evaluation-harness/main.py", line 147, in <module>
    main()
  File "~/bigcode-evaluation-harness/main.py", line 132, in main
    results[task] = evaluator.evaluate(task)
  File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 193, in evaluate
    generations, references = self.generate_text(task)
  File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 70, in generate_text
    generations = parallel_generations(
  File "~/bigcode-evaluation-harness/lm_eval/generation.py", line 140, in parallel_generations
    generations = complete_code(
  File "~/bigcode-evaluation-harness/lm_eval/utils.py", line 177, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1320, in generate
    return self.sample(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1938, in sample
    outputs = self(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1048, in forward
    transformer_outputs = self.transformer(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 835, in forward
    position_embeds = self.wpe(position_ids)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
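One possible cause (not confirmed here): the failure happens in the position-embedding lookup (wpe), which typically means the prompt plus generated tokens exceeds the model's position-embedding table; hf-internal-testing/tiny-random-gpt2 in particular has a very small context window, so a generic max_length easily overflows it. A hedged sketch of a guard:

def safe_max_length(model, requested_max_length):
    # Attribute names vary per architecture, hence the defensive getattr chain.
    cfg = model.config
    limit = getattr(cfg, "max_position_embeddings", None) or getattr(cfg, "n_positions", None)
    return requested_max_length if limit is None else min(requested_max_length, limit)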

MBPP eval extremely slow for CodeGen2 and Replit-Code

Hi, I have been trying to evaluate the CodeGen2 and Replit-Code models on the mbpp task, but the code runs extremely slowly. While the corresponding eval time for other models is around 2 hours, the ETA for these two models varies significantly and sometimes goes above 90 hours. Any help resolving this issue? Thanks!

improve the prompt examples of one-shot setting in APPS evaluation

Models are usually evaluated on APPS after fine-tuning on the train split, but one can also do few-shot evaluation. It is already implemented in this evaluation harness: the prompt includes two shortened examples from the train split, one for each call type (Standard Input and Call Based).

We want to improve these examples:

  • analyse the different question types of APPS and build 2 or 3 examples to cover these types (make sure they aren't in the test set)
  • see how models behave given different examples (you can play with the model demos/spaces in this org; there's codeparrot, incoder, and codegen)
  • the prompt shouldn't end up being too long

HumanEval post-processing

For the HumanEval task, we remove the last block, based on the stop tokens: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/tasks/humaneval.py#L70

If no stopword is found in the generation (for example if by chance the generation ends exactly at the function's last return statement, or before), then remove_last_block would remove the entire generation and return an empty string.

It seems to me that we should instead remove anything that comes after the first block only when there is a match with one of the stop tokens, and otherwise keep the generation intact.
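A minimal sketch of the proposed behaviour: truncate at the earliest stop-token occurrence, and keep the generation as-is when no stop token appears (instead of returning an empty string).

def truncate_at_first_stop(generation, stop_tokens):
    cut = len(generation)
    for stop in stop_tokens:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

print(truncate_at_first_stop("    return a + b\nclass Foo:", ["\nclass", "\ndef"]))
# -> "    return a + b"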

If this issue makes sense, happy to create a PR for that.

Any plan for attaching release tag?

I would like to express my sincere gratitude for the ease of code eval provided by your repository.
I see that tasks and features are being added rapidly thanks to lots of contributors.
Do you have any plans for attaching release tags for version control?

Query around n_samples argument

Hi, I am performing code generations using the following command

accelerate launch  main.py --model bigcode/santacoder --tasks humaneval --max_length_generation 256 \
--temperature 0.8 --top_p 0.95 --do_sample True --generation_only --n_samples 100 --batch_size 32 \
--output_generations generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json \
--save_generations --allow_code_execution --trust_remote_code

I am expecting the number of candidate generations per task to be around 100. However, on inspecting the generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json file I see that there are 96 generations per task.

Is there something I am missing? Thanks

Add Reasoning tasks to the evaluation

In recent times, code generation models have been shown to be good at solving natural language and/or math reasoning tasks (1 and 2). So it would be good to evaluate the BigCode models on these tasks.

As discussed in the evaluation meeting, we could explore the options of adding the PAL datasets and/or reasoning tasks from HELM.

PAL Datasets:

requirements.txt doesn't support newer models (KeyError)

(Related issue here: #73)
The requirements.txt file pins transformers==4.25.1, which doesn't support many of the newer models such as bigcode/starcoder and huggyllama/llama-7b (it gives errors such as KeyError: 'gpt_bigcode'). This is simple to fix on the user side (just install a newer version of transformers), but I thought I'd flag it here, since it's probably best if requirements.txt can accommodate these newer models.

failed evaluation on GSM8K

I tried to run your code in a docker container from ghcr.io/bigcode-project/evaluation-harness.

The exact bash command is

accelerate launch  main.py \
  --model bigcode/starcoder \
  --use_auth_token \
  --max_length_generation 512 \
  --tasks pal-gsm8k-greedy \
  --n_samples 1 \
  --temperature 0 \
  --batch_size 1 \
  --do_sample False \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./output/starcoder_on_gsm8k.json

However, it returns the following:

Evaluating generations...
{
  "pal-gsm8k-greedy": {
    "accuracy": 0.0,
    "num_failed_execution": 1319
  },
  "config": {
    "model": "bigcode/starcoder",
    "revision": null,
    "temperature": 0.0,
    "n_samples": 1
  }
}

where the saved generation contents are like:
(screenshot of the saved generations attached in the original issue)

Any solutions?

8-bit models unsupported

Currently, the harness raises an exception when used with 8-bit models:

Traceback (most recent call last):
  File "bigcode-evaluation-harness/main.py", line 233, in <module>
    main()
  File "bigcode-evaluation-harness/main.py", line 216, in main
    results[task] = evaluator.evaluate(task)
  File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 67, in evaluate
    generations, references = self.generate_text(task_name)
  File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "bigcode-evaluation-harness/lm_eval/generation.py", line 83, in parallel_generations
    model = model.to(accelerator.device)
  File "/root/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1873, in to
    raise ValueError(
ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

For context, this is the model I've been trying to eval: https://huggingface.co/cassanof/santacoder-lua/tree/main

It seems like a check is needed for every .to call... any suggestions?
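A sketch of the kind of guard suggested here, wrapping the .to call; is_loaded_in_8bit is the flag transformers sets on bitsandbytes-quantized models, accessed defensively in case the attribute is absent.

def maybe_to(model, device):
    if getattr(model, "is_loaded_in_8bit", False):
        return model  # quantized models are already placed correctly; don't move or cast them
    return model.to(device)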

How to evaluate the model memory efficiently?

Thanks for the great work and convenient benchmarking tool!

I would like to evaluate the CodeGen-16B model on the humaneval benchmark. At my disposal are A6000 GPUs with 48 GB of memory. The evaluation script crashes due to CUDA out of memory here (i.e. in accelerator.prepare) even with the smallest batch size, 1.

Since this is model evaluation, I would expect most of the memory to be occupied by the model parameters (no optimizer states).
Naively, this model should fit on a single GPU if loaded in half precision, since 2 x 16 = 32 GB < 48 GB. However, when setting mixed precision to fp16 in accelerate launch, I still face the OOM error.

What measures would you suggest to fit the model onto a single GPU?
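One detail worth noting: accelerate's mixed precision autocasts the compute but keeps the weights in fp32 (~64 GB for 16B parameters), so it does not by itself bring the footprint down to the ~32 GB that fp16 weights would need. Loading the checkpoint itself in half precision does; a minimal sketch, assuming the CodeGen-16B mono variant (the harness's --precision fp16 flag should have a similar effect):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-16B-mono",  # assumption: the 16B variant being evaluated
    torch_dtype=torch.float16,      # store weights in fp16 (~2 bytes/param)
)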

Consider a refactoring

Before adding more tasks, it could be a good time to take a step back and see if it makes sense to do a bit of refactoring of the code. A few aspects to consider:

  • How can we make it as easy as possible to add new metrics? It's possible that we may want to add a few dozen more datasets, each with its own quirks. We can look at other frameworks like the lm-evaluation-harness to see how it's done there and whether it makes sense to build on top of it or just take inspiration. E.g., it would be nice if adding a new evaluation required changes in as few places as possible.
  • Going multilingual, we might need to run the code execution in different environments. Maybe we should decouple generation and execution by saving the results to disk in between.
  • For the execution part, we will probably need to think about Docker environments to execute code in different frameworks.

These are just a few thoughts, let me know if you think this makes sense @loubnabnl.

MultiPL-E Integration

As part of the integration of the MultiPL-E benchmark, create a Dockerfile/Docker image with all dependencies required to execute the code generations for the different programming languages.

Problem launching evaluation

Hi, I am trying to run the evaluation of Santacoder using the script provided, but I am getting the following error, which I am not able to figure out:

File "/home/kcdharma/ndec/eval/bin/accelerate", line 8, in <module>
  sys.exit(main())
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
  args.func(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 910, in launch_command
  simple_launcher(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 397, in simple_launcher
  process = subprocess.Popen(cmd, env=current_env)
File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
  self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1762, in _execute_child
  env_list.append(k + b'=' + os.fsencode(v))
File "/usr/lib/python3.10/os.py", line 810, in fsencode
  filename = fspath(filename)  # Does type-checking of `filename`.

TypeError: expected str, bytes or os.PathLike object, not NoneType

Any help is appreciated.

Design prompts for few-shot evaluation tasks

We do not have natural language prompts for all tasks in the Evaluation Harness. We would either like to find prompts which have been adopted by other research groups or design prompts that work well for the task at hand. For example, we would appreciate help with designing prompts for the following tasks:

  • APPS
  • CoNaLA
  • Concode

[Minor] Conflicting dependencies in requirements.txt

Running pip install -r requirements.txt gives me

ERROR: Cannot install -r requirements.txt (line 1) and huggingface_hub==0.8.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested huggingface_hub==0.8.1
    transformers 4.25.1 depends on huggingface-hub<1.0 and >=0.10.0

Instead,

huggingface_hub>=0.10.0

fixed it for me and hasn't broken anything so far.

Cannot run eval with local model directory

Hi. Thank you for your hard work!

I am trying to run bigcode/starcoder model on a server. I downloaded the huggingface repo with git and transferred the folder to the server.

To be extra sure I wasn't passing the path wrong, I modified main.py:

    parser.add_argument(
        "--model",
        default="/home/ubuntu/.cache/huggingface/hub/models--bigcode--starcoder/snapshots/8a57e3930912e5d22ddc4d5f46b4b99f169afbe9",
        help="Model to evaluate, provide a repo name in Hugging Face hub or a local path",
    )

Once I run python main.py I get:

Traceback (most recent call last):
  File "/home/ubuntu/star-coder/bigcode-evaluation-harness/main.py", line 230, in <module>
    main()
  File "/home/ubuntu/star-coder/bigcode-evaluation-harness/main.py", line 176, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 829, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 536, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'

Any idea on how to fix this issue? I had trouble authenticating with Hugging Face, which is why I am trying to load a local model.

When I evaluated the dataset APPS, I got the error RuntimeError: stack.size() >= frames.back().function->n_inputs INTERNAL ASSERT FAILED

I've tried many nodes and this error is reported on all of them. According to this link, it seems that the torch version needs to be upgraded, but the highest torch version supported for Python 3.7 is 1.13.1, so it looks like this is a dead end. How do I avoid this problem, given that others have apparently run this evaluation successfully? My evaluation command is as follows:
accelerate launch main.py \
  --model bigcode/starcoder \
  --tasks apps-introductory \
  --max_length_generation 2048 \
  --temperature 0.8 \
  --n_samples 1 \
  --batch_size 32 \
  --save_generations \
  --precision bf16 \
  --save_generations_path generations.json \
  --metric_output_path evaluation_results.json \
  --allow_code_execution

Getting Zeros for StarCoder on multiple-js

I am running the following :

accelerate launch  main.py   \
  --model bigcode/starcoder   \
  --max_length_generation 512  \
  --tasks multiple-js   \
  --n_samples 120  \
  --batch_size 10  \
  --temperature 0.2  \
  --precision bf16  \
  --allow_code_execution   --use_auth_token

The result is:

{
  "multiple-js": {
    "pass@1": 0.0,
    "pass@10": 0.0,
    "pass@100": 0.0
  },
  "config": {
    "model": "bigcode/starcoderbase",
    "temperature": 0.1,
    "n_samples": 120
  }
}

Are there any other parameters that I might be missing?
