Hi thank you for this great dataset! I have some questions about how you compute the a


Computation of the accuracy scores when there are compilation and runtime errors (apps, closed)

hendrycks commented on July 29, 2024

Computation of the accuracy scores when there are compilation and runtime errors

from apps.

Comments (7)

xksteven commented on July 29, 2024 1

Thanks for catching the bug. I think we forgot to add a "> 0" on line 29.
I'll push the bug fix soon.

As for your other question: I thought it was a good metric, although it ultimately wasn't used, for measuring how often the model produces code that runs at all rather than gibberish. So compile errors were worse, in my opinion, than runtime errors. Loosely, it can also be seen as measuring grammatical versus semantic errors.
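
To illustrate why the missing "> 0" matters, here is a minimal sketch. It assumes that each test case result is stored as a negative code for an error and as 0 or 1 for a failed or passed test, in line with the values used in the numpy example later in this thread. Reading -2 as a compile error and -1 as a runtime error is an assumption, and the names `per_problem_accuracy` and `error_rates` are hypothetical rather than taken from the repository.

```python
import numpy as np

# Assumed encoding (not confirmed by this thread): -2 = compile error,
# -1 = runtime error, 0 = test failed, 1 = test passed.
COMPILE_ERROR = -2
RUNTIME_ERROR = -1


def per_problem_accuracy(results):
    """Fraction of test cases passed, counting error codes as failures."""
    arr = np.asarray(results)
    # Without "> 0", the negative error codes would be averaged in directly.
    return float(np.mean(arr > 0))


def error_rates(results):
    """Return (compile_error_rate, runtime_error_rate) over all test cases."""
    arr = np.asarray(results)
    return (
        float(np.mean(arr == COMPILE_ERROR)),
        float(np.mean(arr == RUNTIME_ERROR)),
    )


results = [-2, -1, 0, 1, -2]
naive = float(np.mean(results))        # negative, so not a valid accuracy
fixed = per_problem_accuracy(results)  # one pass out of five test cases
compile_rate, runtime_rate = error_rates(results)
```

With these inputs the naive mean is negative, while the thresholded version stays in the range 0 to 1. Reporting the two error rates separately keeps the distinction between code that does not run at all and code that runs but crashes.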


loubnabnl commented on July 29, 2024 1

BTW, APPS is now available on the Hugging Face Hub (https://huggingface.co/datasets/codeparrot/apps), and we're currently adding the evaluation metric.


xksteven commented on July 29, 2024 1

I'll make the changes you suggested, but feel free to make a pull request too. Thanks for looking through it and adding it to Hugging Face!


loubnabnl commented on July 29, 2024 1

Great! ~~I'll open a PR!~~ I saw that you already changed it, thanks!


xksteven commented on July 29, 2024 1

Okay I think now with the examples and documentation it is working correctly and as intended. So I think this issue is good to close now. Feel free to reopen if there's something that was missed.


xksteven commented on July 29, 2024

Also, regarding the comment about the expressions: the following should work, provided they're numpy arrays.

import numpy as np
a = [-2, -1, 0, 1, -2]
a[a == -2]  # outputs -2, which is not what we expect: for a list, a == -2 is False, so this is a[0]

b = np.asarray(a)
b == -2  # outputs array([ True, False, False, False,  True])
# The line below then returns what we expect, an array of length 2.
b[b == -2]  # outputs array([-2, -2])


loubnabnl commented on July 29, 2024

Thank you for your reply and for the fix! Regarding the comment above: tmp_results is defined as a list in the function, so maybe we could use res.extend(np.array(results[index])) here

apps/eval/test_one_solution.py

Line 27 in d5c8e99

res.extend(results[index])

and tmp_results = np.array(res) here

apps/eval/test_one_solution.py

Line 30 in d5c8e99

tmp_results = res
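
The change suggested above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's actual code: the function name `count_outcomes` and the dictionary keys are hypothetical, and reading -2 as a compile error and -1 as a runtime error is assumed. It keeps the `res.extend(results[index])` loop and converts the collected list to a numpy array once, so that the boolean masks select elements instead of indexing a plain list.

```python
import numpy as np


def count_outcomes(results):
    """Count error codes across all problems.

    `results` maps a problem index to a list of per-test-case codes.
    Assumed encoding: -2 = compile error, -1 = runtime error.
    """
    res = []
    for index in results:
        res.extend(results[index])
    # Convert once after collecting: boolean masks need an array, not a list.
    tmp_results = np.array(res)
    return {
        "compile_errors": len(tmp_results[tmp_results == -2]),
        "runtime_errors": len(tmp_results[tmp_results == -1]),
        "total_testcases": len(res),
    }


counts = count_outcomes({0: [-2], 1: [1, 0, -1], 2: [1, 1]})
```

Converting after the loop means the per-problem lists can stay as ordinary Python lists, and only the final masking step depends on numpy.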
