Comments (8)
You can use:
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt codellama \
--save_generations_path generations_humanevalsynthesizepython_codellama.json \
--metric_output_path evaluation_humanevalsynthesizepython_codellama.json \
--max_length_generation 2048 \
--precision fp16
Hi, to use an instruction version of the HumanEval prompt, you can use the HumanEvalSynthesize task (the one used for instruction models on the leaderboard when evaluating on Python), for example:
accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalsynthesizepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
To change how the instruction prompt is built, you can update the --prompt argument; check the code for the list of options (i.e. the transformations we apply to the HumanEval prompts to make them instruction-friendly).
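For illustration only, here is a minimal sketch of what such a transformation could look like; the function name and the exact template string are assumptions made for this example, not the harness's code verbatim:

def build_instruction_prompt(instruction, context, prompt_style="octocoder"):
    # Hypothetical sketch: wrap a HumanEval problem description
    # (instruction) and its function signature/imports (context) into an
    # instruction-style prompt. The template below is an assumption for
    # illustration, not the exact string produced by --prompt.
    if prompt_style == "octocoder":
        # Question/Answer format for OctoCoder-style instruction models
        return f"Question: {instruction}\n\nAnswer:\n{context}"
    raise ValueError(f"No template defined for prompt style: {prompt_style}")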
Thanks @loubnabnl, do we have to specify an instruction token for this task? Much appreciated.
Also, do you mind sharing the exact settings to replicate the codellama-instruct performance? Thank you so much.
If your model uses different tokens, you'll need to build a new prompt and update the code. See this PR for adding the codellama prompt: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/130/files
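As a rough illustration of what "build a new prompt" means for a model with different instruction tokens, the sketch below shows a CodeLlama-Instruct-style template that wraps the instruction in [INST] ... [/INST]; the exact template the harness adopted lives in the linked PR, so treat this only as an approximation:

def build_codellama_style_prompt(instruction, context):
    # Hypothetical sketch for a model that expects [INST] ... [/INST]
    # instruction tokens; a model with a different chat template would
    # need its own branch like this one in the prompt-building code.
    return f"[INST] {instruction}\n[/INST]\n{context}"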
Ah, I just want to replicate the performance of codellama-instruct on the HF leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard . Do you know what config/args they run the evaluation with? Also, is the reported number for "humanevalsynthesize-python"? Thanks.
Thank you. Another minor detail: on the leaderboard, HF says this is the setting they use, "All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50.", but here you use fp16; is this correct? Much appreciated.
The displayed models were indeed evaluated in that setting, but we've found greedy decoding to give results close to top-p sampling with 50 samples, so you can use greedy to speed up the evaluation. HumanEvalSynthesize requires a sequence length of 2048, though, not 512.
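For context on why the two settings end up close: the harness reports pass@k using the standard unbiased estimator from the Codex paper. A minimal sketch of that estimator (assuming numpy is available) looks roughly like this:

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator from the Codex paper:
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
    # n = generations per problem, c = generations that pass the tests.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

With n_samples=50, pass@1 reduces to the average pass rate over the 50 low-temperature generations per problem, which is consistent with the observation above that greedy decoding (a single deterministic generation per problem) lands close to the sampled estimate.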