CodeQueries Benchmark

CodeQueries is a dataset for evaluating methodologies that answer semantic queries over code. Existing question-answering datasets for programming languages target comparatively simpler tasks, such as predicting binary yes/no answers to a question or operating over a localized context (e.g., a single source-code method). In contrast, in CodeQueries, an entire source-code file is annotated with the spans required to answer a code analysis query about semantic aspects of the code. Given a query and code, a span predictor system is expected to identify both answer spans and supporting-fact spans in the code.

CodeQueries task definition

Dataset statistics are provided in the Codequeries_Statistics file.
More details on the curated dataset for this benchmark are available on HuggingFace.
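
The dataset can also be inspected programmatically with the HuggingFace datasets library. A minimal sketch, assuming the public dataset id thepurpleowl/codequeries and its "ideal" setting (check the dataset card for the exact configuration and field names):

    # Minimal sketch for browsing CodeQueries; the dataset id, setting name,
    # and split are assumptions based on the HuggingFace release.
    from datasets import load_dataset

    ds = load_dataset("thepurpleowl/codequeries", "ideal", split="test")
    example = ds[0]
    print(example.keys())  # inspect the available fields, e.g., the query and span annotations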

Steps


The repo provides scripts to evaluate LLM generations and the two-step setup on the dataset. Follow these steps to use the scripts:

  1. Clone the repo in a virtual environment (a typical command sequence is sketched after this list).
  2. Run setup.sh to set up the workspace.
  3. Run the commands in the following sections to get performance metric values.
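
A typical sequence, assuming the repository lives at github.com/thepurpleowl/codequeries-benchmark (verify the URL before cloning):

    git clone https://github.com/thepurpleowl/codequeries-benchmark.git
    cd codequeries-benchmark
    python3 -m venv venv && source venv/bin/activate
    bash setup.sh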

LLM experiment evaluation


We used the GPT-3.5-Turbo model from OpenAI with different prompt templates (provided in /prompt_templates) to generate the required answer and supporting-fact spans for a query. We generate 10 samples for each input; the generated results are downloaded as part of the setup. The following commands evaluate the LLM results for the different prompts, reporting Pass@k over the samples (a sketch of the estimator follows the commands).
To evaluate the zero-shot prompt,
    python evaluate_generated_spans.py --g=test_dir_file_0shot/logs
To evaluate the few-shot prompt with BM25 retrieval,
    python evaluate_generated_spans.py --g=test_dir_file_fewshot/logs
To evaluate the few-shot prompt with supporting facts,
    python evaluate_generated_spans.py --g=test_dir_file_fewshot_sf/logs --with_sf=True
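
With 10 generations per input, Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); that this script uses exactly this formulation is an assumption. A minimal sketch:

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased Pass@k: probability that at least one of k samples drawn
        # without replacement from n generations is correct, given c correct
        # generations. Assumed to match the metric in the tables below.
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    print(pass_at_k(n=10, c=3, k=5))  # e.g., 10 generations, 3 judged correct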

Two-step setup evaluation


In many cases, the entire contents of a file do not fit in the input to the model. However, not all code is relevant for answering a given query. We identify the relevant code blocks using the CodeQL results during data preparation and implement a two-step procedure (sketched after the steps below) to deal with the problem of scaling to large code:
    Step 1: We first apply a relevance classifier to every block in the given code and select the code blocks that are likely to be relevant for answering the given query.
    Step 2: We then apply the span prediction model to the set of selected code blocks to predict answer and supporting-fact spans.
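
A minimal sketch of this pipeline; the method names are hypothetical illustrations, not the repo's actual API:

    # Hypothetical sketch of the two-step procedure. relevance_model and
    # span_model stand in for the fine-tuned relevance classifier and span
    # predictor checkpoints; is_relevant and predict_spans are illustrative.
    def two_step_predict(query, code_blocks, relevance_model, span_model):
        # Step 1: keep only the blocks classified as relevant to the query.
        relevant_blocks = [block for block in code_blocks
                           if relevance_model.is_relevant(query, block)]
        # Step 2: predict answer and supporting-fact spans over the
        # selected blocks.
        return span_model.predict_spans(query, relevant_blocks)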

To evaluate the two-step setup, run
    python3 evaluate_spanprediction.py --example_types_to_evaluate=<positive/negative> --setting=twostep --span_type=<both/answer/sf> --span_model_checkpoint_path=<model-ckpt-with-low-data/Cubert-1K-low-data or finetuned_ckpts/Cubert-1K> --relevance_model_checkpoint_path=<model-ckpt-with-low-data/Twostep_Relevance-512-low-data or finetuned_ckpts/Twostep_Relevance-512>

Experiment results on sampled test data


LLM experiment

Zero-shot prompting and few-shot prompting with BM25 retrieval are evaluated on answer span prediction; few-shot prompting with supporting facts is evaluated on answer and supporting-fact span prediction.

| Pass@k | Zero-shot (Positive) | Zero-shot (Negative) | Few-shot BM25 (Positive) | Few-shot BM25 (Negative) | Few-shot with supporting facts (Positive) |
|--------|----------------------|----------------------|--------------------------|--------------------------|-------------------------------------------|
| 1      | 9.82                 | 12.83                | 16.45                    | 44.25                    | 21.88                                     |
| 2      | 13.06                | 17.42                | 21.14                    | 55.53                    | 28.06                                     |
| 5      | 17.47                | 22.85                | 27.69                    | 65.43                    | 34.94                                     |
| 10     | 20.84                | 26.77                | 32.66                    | 70.00                    | 39.08                                     |

Two-step setup

In the variant names, Two-step(x, y) indicates the amount of training data used to fine-tune the relevance classifier and the span predictor, respectively (20 examples per query, or all available examples); this reading follows the low-data vs. fully fine-tuned checkpoints in the evaluation command above.

| Variant             | Answer span prediction (Positive) | Answer span prediction (Negative) | Answer & supporting-fact span prediction (Positive) |
|---------------------|-----------------------------------|-----------------------------------|------------------------------------------------------|
| Two-step(20, 20)    | 9.42                              | 92.13                             | 8.42                                                 |
| Two-step(all, 20)   | 15.03                             | 94.49                             | 13.27                                                |
| Two-step(20, all)   | 32.87                             | 96.26                             | 30.66                                                |
| Two-step(all, all)  | 51.90                             | 95.67                             | 49.30                                                |

Experiment results on complete test data


| Variant             | Positive | Negative |
|---------------------|----------|----------|
| Two-step(20, 20)    | 3.74     | 95.54    |
| Two-step(all, 20)   | 7.81     | 97.87    |
| Two-step(20, all)   | 33.41    | 96.23    |
| Two-step(all, all)  | 52.61    | 96.73    |
| Prefix              | 36.60    | 93.80    |
| Sliding window      | 51.91    | 85.75    |
