Comments (8)
You can use:
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt codellama \
--save_generations_path generations_humanevalsynthesizepython_codellama.json \
--metric_output_path evaluation_humanevalsynthesizepython_codellama.json \
--max_length_generation 2048 \
--precision fp16
Hi, to use an instruction version of the HumanEval prompt, you can use the HumanEvalSynthesize task (the one used for instruction models on the leaderboard when evaluating on Python), for example:
accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalsynthesizepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
To change how the instruction prompt is built, you can update the --prompt argument; check the code for the list of options (i.e. the transformations we apply to the HumanEval prompts to make them instruction-friendly).
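For illustration only, here is a minimal sketch of what such a transformation could look like; the function name and the exact template string are assumptions made for this example, not the harness's code verbatim:

def build_instruction_prompt(instruction, context, prompt_style="octocoder"):
    # Hypothetical sketch: wrap a HumanEval problem description
    # (instruction) and its function signature/imports (context) into an
    # instruction-style prompt. The template below is an assumption for
    # illustration, not the exact string produced by --prompt.
    if prompt_style == "octocoder":
        # Question/Answer format for OctoCoder-style instruction models
        return f"Question: {instruction}\n\nAnswer:\n{context}"
    raise ValueError(f"No template defined for prompt style: {prompt_style}")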
Thanks @loubnabnl, do we have to specify an instruction token for this task? Much appreciated.
Also, do you mind sharing the exact settings to replicate the codellama-instruct performance? Thank you so much.
If your model uses different tokens, you'll need to build a new prompt and update the code. See this PR for adding the codellama prompt: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/130/files
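As a rough illustration of what "build a new prompt" means for a model with different instruction tokens, the sketch below shows a CodeLlama-Instruct-style template that wraps the instruction in [INST] ... [/INST]; the exact template the harness adopted lives in the linked PR, so treat this only as an approximation:

def build_codellama_style_prompt(instruction, context):
    # Hypothetical sketch for a model that expects [INST] ... [/INST]
    # instruction tokens; a model with a different chat template would
    # need its own branch like this one in the prompt-building code.
    return f"[INST] {instruction}\n[/INST]\n{context}"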
Ah, I just want to replicate the performance of codellama-instruct on the HF leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard . Do you know what config/args they run the evaluation with? Also, is the reported number for "humanevalsynthesize-python"? Thanks.
Thank you. Another minor detail: on the leaderboard, HF says this is the setting they use, "All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50.", but here you use fp16; is this correct? Much appreciated.
The displayed models were indeed evaluated in that setting, but we've found greedy decoding to give results close to top-p sampling with 50 samples, so you can use greedy to speed up the evaluation. HumanEvalSynthesize requires a sequence length of 2048, though, not 512.
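For context on why the two settings end up close: the harness reports pass@k using the standard unbiased estimator from the Codex paper. A minimal sketch of that estimator (assuming numpy is available) looks roughly like this:

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator from the Codex paper:
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
    # n = generations per problem, c = generations that pass the tests.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

With n_samples=50, pass@1 reduces to the average pass rate over the 50 low-temperature generations per problem, which is consistent with the observation above that greedy decoding (a single deterministic generation per problem) lands close to the sampled estimate.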