odex's Introduction

Execution-based Evaluation for Open Domain Code Generation

CC BY-SA 4.0

This repository contains the data and code for the work Execution-based Evaluation for Open Domain Code Generation.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

If you find our paper or code useful, please cite the paper

@article{wang2022execution,
  title={Execution-Based Evaluation for Open-Domain Code Generation},
  author={Zhiruo Wang and Shuyan Zhou and Daniel Fried and Graham Neubig},
  journal={arXiv preprint arXiv:2212.10481},
  year={2022}
}

Install

pip install -r requirements.txt

Dataset

We split the dataset by the natural language of its intents: English (en), Spanish (es), Japanese (ja), and Russian (ru).

.
├── README.md
├── data
│   ├── en_test.jsonl
│   ├── es_test.jsonl
│   ├── ja_test.jsonl
│   └── ru_test.jsonl

Each line contains a serialized JSON object; an example looks like:

{
    'task_id': 3844801,
    'intent': "check if all elements in list `myList` are identical", 
    'prompt': "def f_3844801(myList):\n\treturn ",
    'canonical_solution': "all(x == myList[0] for x in myList)",
    'suffix': "",
    'test_start': "\ndef check(candidate):",
    'test': [
        "\n    assert candidate([1,2,3]) == False\n", 
        "\n    assert candidate([1,1,1,1,1,1]) == True\n",
        "\n    assert candidate([1]) == True\n",
        "\n    assert candidate(['k','k','k','k','k']) == True\n",
        "\n    assert candidate([None,'%$#ga',3]) == False\n"
    ],
    'entry_point': "f_3844801",
}

where:

  1. task_id is the ID of the original StackOverflow post from which the sample is constructed;
  2. intent is the natural language description, rewritten by human annotators to be sufficiently specific;
  3. prompt is the function prefix (definition, input arguments, etc.) needed to properly execute the code snippet;
  4. canonical_solution is the reference solution to the coding problem, verified by human annotators;
  5. suffix is the function suffix (return values, if any) needed to properly execute the code;
  6. test_start is the definition of the test function, including library imports if the program requires them;
  7. test is the list of test cases created by human annotators;
  8. entry_point is the name of the function that check calls during evaluation.

To correctly execute the (canonical) code snippets, one needs to install all involved libraries, as listed in the ./library/ directory.
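For execution-based evaluation, each sample can be turned into a self-contained program by concatenating the prompt, a candidate completion, the suffix, and the test harness, and then calling check on the entry point. Below is a minimal sketch of that assembly (illustrative only; the exact logic in eval_passk.py may differ):

import json

def build_test_program(sample: dict, completion: str) -> str:
    # prompt + completion + suffix form the full function definition;
    # test_start + test define check(candidate); the last line invokes it on the entry point
    return (
        sample["prompt"]
        + completion
        + sample["suffix"]
        + sample["test_start"]
        + "".join(sample["test"])
        + f"\ncheck({sample['entry_point']})\n"
    )

with open("data/en_test.jsonl") as f:
    sample = json.loads(next(f))

# The canonical solution should pass all asserts (assuming any libraries the sample
# needs are installed); exec raises an AssertionError if a test fails.
exec(build_test_program(sample, sample["canonical_solution"]), {})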

Evaluating Code Generation Models

We provide code to evaluate two state-of-the-art code generation models: Codex and CodeGen. To perform the NL-to-Code generation task and collect model predictions:

For Codex, run:

python nl2code_codex.py --language en \
--model_name "code-davinci-002" \
--openai_api_key ${YOUR_API_KEY}

Change the model_name argument to "code-cushman-001" or "code-davinci-001" to try other model variants.

For CodeGen, run:

python nl2code_codegen.py --language en \
--model_size 350M --model_data mono 

Other valid options for model_size include: "2B", "6B", and "16B", which correspond to the 2.7B, 6.1B, and 16.1B CodeGen models.

For model_data, other options include "multi" and "nl".

Evaluation

Our default evaluation metric is the execution pass rate. Before evaluating, make sure your environment has all the required libraries installed and that they can be imported as in the code samples. To do this, you can run:

pip install -r ./library/requirements.txt 
python ./library/imports.py

Then we can perform the execution by running:

python eval_passk.py --language en --prediction_path ${MODEL_PRED_PATH}
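Execution results are typically summarized as pass@k; the standard unbiased estimator (Chen et al., 2021) for a problem with n generated samples, of which c pass all tests, is pass@k = 1 - C(n-c, k) / C(n, k). A small illustrative sketch of that computation (eval_passk.py has its own implementation):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # probability that at least one of k samples drawn from the n generations is correct
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 50 generations per problem, 20 of them correct
print(pass_at_k(50, 20, 1))   # 0.4
print(pass_at_k(50, 20, 10))  # ~0.997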

We also support five other non-execution metrics: BLEU, ROUGE, METEOR, ChrF, and CodeBLEU. For example, to evaluate with the BLEU metric, run:

python eval_nonex.py --language en --prediction_path ${MODEL_PRED_PATH} --eval_metric bleu

Specify the eval_metric argument as "rouge", "meteor", "chrf", or "codebleu" to use the other metrics.
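These surface-form metrics only measure token overlap with the canonical solution, which is what motivates the execution-based comparison in the paper. A rough standalone illustration using sacrebleu (not the eval_nonex.py implementation):

import sacrebleu

prediction = "all(x == myList[0] for x in myList)"
reference  = "len(set(myList)) == 1"   # behaves the same on the test cases above, but little surface overlap

# Sentence-level BLEU is low here even though both snippets pass the same tests.
print(sacrebleu.sentence_bleu(prediction, [reference]).score)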

Detailed Analysis

Open-Domain versus Closed-Domain

To evaluate on the subset of open-domain or closed-domain samples, you only need to add one more argument at evaluation time (when running eval_passk.py or eval_nonex.py):

--library_usage "open"   # or "closed"

Few-shot Prompting

To include more prompt-solution pairs for in-context learning, specify num_examples at inference time (when running nl2code_codex.py or nl2code_codegen.py):

--num_examples 1    # 2, 3, ... 
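Conceptually, this prepends completed prompt-solution pairs before the target prompt. A rough sketch of such a construction (a hypothetical format; the actual prompt building in this repo may differ):

def build_fewshot_prompt(shots: list, target: dict) -> str:
    # each in-context example shows its prompt completed with the canonical solution
    demos = [ex["prompt"] + ex["canonical_solution"] + ex["suffix"] for ex in shots]
    # the target contributes only its prompt; the model generates the completion
    return "\n\n".join(demos + [target["prompt"]])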

Number of Input Test Cases

To add exemplar test cases to the prompt inputs, specify num_tests at inference time:

--num_tests 1   # 2, 3, ...

Number of Evaluation Test Cases

To use a different number of test cases for execution-based evaluation, specify num_tests_eval when running eval_passk.py, for example:

python eval_passk.py --language en --prediction_path ${MODEL_PRED_PATH} --num_tests_eval 1 

Semantics of Function Names

Our paper explores three methods to create function names in the wrapping context:

  • "id": f_${ID}, simple string formatting using the StackOverflow post ID
  • "constant": function, using the same string constant for all samples
  • "intent": heuristic-based extraction from the paired NL intent

To experiment with different function names, specify function_name at inference time:

--function_name "intent"   # "id" "constant"
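As a concrete illustration, the three strategies could be realized along these lines (a hedged sketch; the repo's actual heuristic for the "intent" case may differ):

import re

def make_function_name(strategy: str, task_id: int, intent: str) -> str:
    if strategy == "id":
        return f"f_{task_id}"        # e.g. f_3844801
    if strategy == "constant":
        return "function"            # same constant name for every sample
    if strategy == "intent":
        # crude heuristic: keep the first few alphanumeric words of the intent
        words = re.findall(r"[a-zA-Z0-9]+", intent.lower())[:4]
        return "_".join(words) or "function"
    raise ValueError(f"unknown strategy: {strategy}")

print(make_function_name("intent", 3844801, "check if all elements in list `myList` are identical"))
# -> check_if_all_elements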

Metric Correlation

We also provide code to compare execution-based and non-execution evaluation metrics on a sample-wise basis. Taking execution and BLEU scores as an example, one can run:

python metric_corr.py --language en \
--prediction_file ${MODEL_PRED_PATH} \
--eval_metric "bleu"

To get visualizations as violin plots and histograms, add --do_plot_violin or --do_plot_stacked_hist.
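Under the hood, this compares a binary execution outcome against a continuous metric score for each sample. A minimal sketch of such a comparison (illustrative values; metric_corr.py implements the full analysis):

from scipy.stats import spearmanr

# hypothetical per-sample results: execution outcome (1 = all tests passed) and BLEU score
passed = [1, 0, 1, 1, 0, 0, 1, 0]
bleu   = [62.1, 10.4, 55.0, 33.3, 48.7, 5.2, 71.9, 12.0]

rho, pvalue = spearmanr(passed, bleu)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3f})")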

odex's Issues

Hugging Face call is deprecated

I get this when using a codebase based on ODEX:

/Users/gneubig/work/gemini-benchmark/benchmarking/Code/verify.py:13: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
  code_eval_metric = load_metric("code_eval")

Stripping the prompt can improve model performance

It seems that the ODEX prompts fed to the model have trailing whitespace, which degrades the performance of models (CodeGen here) on the benchmark. Adding a strip to the prompt would increase performance. Here are some numbers:

python nl2code_codegen.py --language en --model_size 2B --model_data mono \
 --num_tests_input 0 --num_tests_eval 100 --num_examples 0 --temperature 0.8 \
 --top_p 0.95 --num_return_sequences 50

gives:

Overall Pass@K Scores:
[pass@1] 0.4137 (439)
[pass@2] 0.4662 (439)
[pass@3] 0.4920 (439)
[pass@4] 0.5078 (439)
[pass@5] 0.5188 (439)
[pass@6] 0.5270 (439)
[pass@7] 0.5335 (439)
[pass@8] 0.5387 (439)
[pass@9] 0.5431 (439)
[pass@10] 0.5467 (439)

as opposed to

  "pass@1": 14.28,
  "pass@2": 15.69,
  "pass@5": 16.99,
  "pass@10": 17.54

without stripping (also the numbers reported in the paper).

(thanks @murthyrudra for running the code)
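A minimal sketch of the suggested fix, applied wherever the prompt string is assembled before tokenization (the exact call site is an assumption):

prompt = example["prompt"].rstrip()   # drop trailing whitespace before passing the prompt to the model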

CodeBLEU file missing

The __init__.py in the metric folder imports compute_codebleu; however, codebleu is not included in the repo. Could you provide the corresponding code?

Bug: dataset quality

I have found several description errors and answer errors in your English data:

  • prompt: reverse the list that contains 1 to 10
    • answer: list(reversed(list(range(10))))
    • true answer: should be range(1, 11) instead of range(10)
  • prompt: print a list l and move first 3 elements to the end of the list
    • answer: l[3:] + l[:3]
    • true answer: print(l); return l[3:] + l[:3]

There are still many problems with bugs (semantic ambiguity or incorrect answers).
I hope you can revise the dataset carefully; since it covers several diverse libraries, it can have a large impact on overall progress in code generation.

Unexpected Keyword Argument 'replace_function_name'

Hi, when I run the CodeGen code, I get the following error.

Command:

python nl2code_codegen.py --language en --model_size 350M --model_data mono --output_dir codegen_350M

Error:

Traceback (most recent call last):
  File "/home/rudra/odex/nl2code_codegen.py", line 203, in <module>
    main()
  File "/home/rudra/odex/nl2code_codegen.py", line 175, in main
    scores_dict = evaluate(model, eval_dataloader, tokenizer, args)
  File "/home/rudra/odex/nl2code_codegen.py", line 77, in evaluate
    for i, batch_inputs in enumerate(dataloader): 
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rudra/odex/src/data.py", line 65, in __getitem__
    prompt = create_fewshot_prompt_nl2code(
TypeError: create_fewshot_prompt_nl2code() got an unexpected keyword argument 'replace_function_name'
