Comments (11)

loubnabnl avatar loubnabnl commented on August 22, 2024 1

When I ran the experiments with the required greedy configuration, I got a score of 29.8%, which was the same as the leaderboard. However, I still had the question of why there is a difference of ~2-3% w.r.t. the paper, although #142 (comment) clarifies part of my doubt.

Yes, as explained in my comment, it might be due to small differences in post-processing or inference precision. When comparing models it should be fine as long as you use the same framework and pipeline, which is what the leaderboard is intended for, instead of just comparing scores across research papers that might use different implementations.

However, I also want to know: is there any reason these models' repos (like the mistral-src or codellama repos) do not provide any reproducibility script?

That's up to the Mistral and CodeLLaMa authors to answer ^^

Regarding the differences you're observing with respect to quantization: greedy evaluation might have more noise than top-p sampling with a large number of solutions, e.g. 50, where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

from bigcode-evaluation-harness.

loubnabnl avatar loubnabnl commented on August 22, 2024 1

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem and compute pass@1.

You can also use greedy decoding, which is deterministic for the model under the same setup and gives results within 1 or 2 points of top_p sampling with 50 samples. However, if you want to reduce noise, sampling with a large number of samples (e.g. 50 or 100) might be less noisy than greedy, since you give the model more than one chance to solve each problem.
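For concreteness, here is a minimal sketch of the sampling setup described above using Hugging Face `transformers` (the checkpoint name, prompt, and generation length are placeholders, not the leaderboard's exact configuration; the harness also applies task-specific prompting and post-processing on top of this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the model you are evaluating.
checkpoint = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Leaderboard-style setup: nucleus sampling, 50 completions per problem.
samples = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=50,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(samples, skip_special_tokens=True)
```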

from bigcode-evaluation-harness.

loubnabnl avatar loubnabnl commented on August 22, 2024 1

yes

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024 1

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator

Ahh I see, got it. Thanks for this reference, appreciate it :)

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

That's something worth trying out, will do that. Thanks for clearing some of my doubts.

from bigcode-evaluation-harness.

phqtuyen avatar phqtuyen commented on August 22, 2024

May I also ask what decoding strategy is applied by default? Does it affect the observed performance? Much appreciated.

from bigcode-evaluation-harness.

phqtuyen avatar phqtuyen commented on August 22, 2024

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume that the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume that the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

Yes, you are right. But for pass@1 we want greedy decoding (which should produce a deterministic response), so the temperature is kept at 0 and top_p is not required; None should work.
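As a side note on the temperature=0 / top_p=None point: with Hugging Face `transformers`, greedy decoding is usually selected by turning sampling off rather than by zeroing the temperature (some versions reject temperature=0 as invalid when sampling is enabled). A minimal sketch, reusing the `model`, `tokenizer`, and `inputs` placeholders from the sampling sketch above:

```python
# Greedy decoding: one deterministic completion per problem (n = 1).
greedy = model.generate(
    **inputs,
    do_sample=False,  # sampling off, so temperature/top_p are not used
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```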

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

Oh, I see. Actually, I was following the convention from the CodeLlama paper, but this makes sense. I would like to do that for my evaluation.

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem and compute pass@1

And I believe the temperature is 0.2, right?

from bigcode-evaluation-harness.

Anindyadeep avatar Anindyadeep commented on August 22, 2024

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem and compute pass@1.

I am kinda confused here: doesn't pass@1 by definition mean using the model's generation only one time? Generating 50 times would mean I am getting pass@1 and pass@10, but yeah...

from bigcode-evaluation-harness.

loubnabnl avatar loubnabnl commented on August 22, 2024

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator
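For completeness, the estimator that passage refers to is pass@k = E[1 − C(n−c, k) / C(n, k)], averaged over problems. A small sketch of the numerically stable form given in the Codex paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated for the problem
    c: number of those samples that pass the unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 50 samples generated for a problem, 12 passing -> pass@1 estimate of 0.24.
print(pass_at_k(n=50, c=12, k=1))
```

Averaging this quantity over all problems (with n = 50 and k = 1) is how a pass@1 score can be computed from 50 samples per problem, which is what the leaderboard setup described above does.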

from bigcode-evaluation-harness.
