Comments (11)

loubnabnl avatar loubnabnl commented on August 22, 2024 1

When I ran the experiments with the required greedy configuration, I got a score of 29.8%, which was the same as the leaderboard. However, I still had the question of why there is a difference of ~2-3% w.r.t. the paper, although #142 (comment) clarifies part of my doubt.

Yes, as explained in my comment, it might be due to small differences in post-processing or inference precision. When comparing models it should be fine as long as you use the same framework and pipeline, which is what the leaderboard is intended for, instead of just comparing scores across research papers that might use different implementations.

However, I also want to know: is there any reason these models' repos (like the mistral-src or codellama repos) do not provide any reproducibility script?

That's up to the Mistral and CodeLLaMa authors to answer ^^

Regarding the differences you're observing with respect to quantization: greedy evaluation might have more noise than top-p sampling with a large number of solutions, e.g. 50, where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

from bigcode-evaluation-harness.

loubnabnl avatar loubnabnl commented on August 22, 2024 1

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem and compute pass@1.

You can also use greedy decoding, which is deterministic for the model under the same setup and gives results within 1 or 2 points of top_p sampling with 50 samples. However, if you want to reduce noise, sampling with a large number of samples (e.g. 50 or 100) might be less noisy than greedy, since you give the model more than one chance to solve each problem.
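For concreteness, here is a minimal sketch of the sampling setup described above using Hugging Face `transformers` (the checkpoint name, prompt, and generation length are placeholders, not the leaderboard's exact configuration; the harness also applies task-specific prompting and post-processing on top of this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the model you are evaluating.
checkpoint = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Leaderboard-style setup: nucleus sampling, 50 completions per problem.
samples = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=50,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(samples, skip_special_tokens=True)
```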

from bigcode-evaluation-harness.

loubnabnl avatar loubnabnl commented on August 22, 2024 1

yes

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024 1

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator

Ahh I see, got it. Thanks for this reference, appreciate it :)

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

That's something worth trying out, will do that. Thanks for clearing some of my doubts.

from bigcode-evaluation-harness.

phqtuyen avatar phqtuyen commented on August 22, 2024

May I also ask what decoding strategy is applied by default? Does it affect the observed performance? Much appreciated.

from bigcode-evaluation-harness.

phqtuyen avatar phqtuyen commented on August 22, 2024

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume that the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume that the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

Yes, you are right. But for pass@1 we want greedy decoding (which should produce a deterministic response), so the temperature is kept at 0 and top_p is not required; None should work.
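As a side note on the temperature=0 / top_p=None point: with Hugging Face `transformers`, greedy decoding is usually selected by turning sampling off rather than by zeroing the temperature (some versions reject temperature=0 as invalid when sampling is enabled). A minimal sketch, reusing the `model`, `tokenizer`, and `inputs` placeholders from the sampling sketch above:

```python
# Greedy decoding: one deterministic completion per problem (n = 1).
greedy = model.generate(
    **inputs,
    do_sample=False,  # sampling off, so temperature/top_p are not used
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```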

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

Oh, I see. Actually, I was following the convention from the CodeLlama paper, but this makes sense. I would like to do that for my evaluation.

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem and compute pass@1

And I believe the temperature is 0.2, right?

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem and compute pass@1.

I am kinda confused here: doesn't pass@1 by definition mean using the model's generation only one time? Generating 50 times would mean I am getting pass@1 and pass@10, but yeah...

from bigcode-evaluation-harness.

loubnabnl avatar loubnabnl commented on August 22, 2024

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator
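For completeness, the estimator that passage refers to is pass@k = E[1 − C(n−c, k) / C(n, k)], averaged over problems. A small sketch of the numerically stable form given in the Codex paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated for the problem
    c: number of those samples that pass the unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 50 samples generated for a problem, 12 passing -> pass@1 estimate of 0.24.
print(pass_at_k(n=50, c=12, k=1))
```

Averaging this quantity over all problems (with n = 50 and k = 1) is how a pass@1 score can be computed from 50 samples per problem, which is what the leaderboard setup described above does.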

from bigcode-evaluation-harness.
