
TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools

📌 Features:

  • 📜 Contains 154 code snippets to test and benchmark.
  • 🏷️ Offers 845 type annotations across a diverse set of Python functionalities.
  • 📂 Organized into 18 distinct categories targeting various Python features.
  • 🚢 Seamlessly manages the execution of containerized tools.
  • 🔄 Efficiently transforms inferred types into a standardized format (sketched below).
  • 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
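
For orientation, the standardized output is a JSON file listing type annotations, one entry per annotated location. The exact schema is defined by the ground-truth files shipped with the benchmark; the entry below is only an illustrative sketch of a single function-parameter annotation, with field names and values chosen for illustration:

    [
      {
        "file": "main.py",
        "line_number": 1,
        "col_offset": 11,
        "function": "add_one",
        "parameter": "x",
        "type": ["int"]
      }
    ]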

๐Ÿ› ๏ธ Supported Tools

Supported โœ… In-progress ๐Ÿ”ง Planned ๐Ÿ’ก
HeaderGen Intellij PSI MonkeyType
Jedi Pyre Pyannotate
Pyright PySonar2
HiTyper Pytype
Scalpel TypeT5
Type4Py
GPT-4
Ollama


๐Ÿ† TypeEvalPy Leaderboard

Below is a comparison of exact matches across the tools. For ML-based tools, results are also reported for top-n predictions, i.e. the correct type appears among the tool's n most likely predictions.

Rank ๐Ÿ› ๏ธ Tool Top-n Function Return Type Function Parameter Type Local Variable Type Total
1 HeaderGen 1 186 56 322 564
2 Jedi 1 122 0 293 415
3 Pyright 1 100 8 297 405
4 HiTyper 1
3
5
163
173
175
27
37
37
179
225
229
369
435
441
5 HiTyper (static) 1 141 7 102 250
6 Scalpel 1 155 32 6 193
7 Type4Py 1
3
5
39
103
109
19
31
31
99
167
174
157
301
314

(Auto-generated from the analysis run on 20 Oct 2023)


๐Ÿ†๐Ÿค– TypeEvalPy LLM Leaderboard

Below is a comparison showcasing exact matches for LLMs.

Rank ๐Ÿ› ๏ธ Tool Function Return Type Function Parameter Type Local Variable Type Total
1 GPT-4 225 85 465 775
2 codellama:13b-instruct 199 75 425 699
3 GPT 3.5 Turbo 188 73 429 690
4 codellama:34b-instruct 190 52 425 667
5 phind-codellama:34b-v2 182 60 399 641
6 codellama:7b-instruct 171 72 384 627
7 dolphin-mistral 184 76 356 616
8 codebooga 186 56 354 596
9 llama2:70b 168 55 342 565
10 HeaderGen 186 56 321 563
11 wizardcoder:13b-python 170 74 317 561
12 llama2:13b 153 40 283 476
13 mistral:instruct 155 45 250 450
14 mistral:v0.2 155 45 248 448
15 vicuna:13b 153 35 260 448
16 vicuna:33b 133 29 267 429
17 wizardcoder:7b-python 103 48 254 405
18 llama2:7b 140 34 216 390
19 HiTyper 163 27 179 369
20 wizardcoder:34b-python 140 43 178 361
21 orca2:7b 117 27 184 328
22 vicuna:7b 131 17 172 320
23 orca2:13b 113 19 166 298
24 tinyllama 3 0 23 26
25 phind-codellama:34b-python 5 0 15 20
26 codellama:13b-python 0 0 0 0
27 codellama:34b-python 0 0 0 0
28 codellama:7b-python 0 0 0 0

(Auto-generated from the analysis run on 14 Jan 2024)


๐Ÿณ Running with Docker

1๏ธโƒฃ Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

2๏ธโƒฃ Build Docker image

docker build -t typeevalpy .

3๏ธโƒฃ Run TypeEvalPy

๐Ÿ•’ Takes about 30mins on first run to build Docker containers.

๐Ÿ“‚ Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.

Correlation of the generated CSV files to the tables in the ICSE paper:

  • Table 1 in the paper is derived from three auto-generated CSV tables:
    • paper_table_1.csv - exact matches by type category.
    • paper_table_2.csv - exact matches for the 18 micro-benchmark categories.
    • paper_table_3.csv - sound and complete values for the tools.
  • Table 2 in the paper is based on one CSV table:
    • paper_table_5.csv - exact matches with top-n values for machine-learning tools.

Additionally, two CSV tables are generated that are not included in the paper:

  • paper_table_4.csv - sound and complete values for the 18 micro-benchmark categories.
  • paper_table_6.csv - sensitivity analysis.

A short sketch for loading these CSVs is given at the end of this section. To run the full benchmark:
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy

🔧 Optionally, run analysis on specific tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel

๐Ÿ› ๏ธ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:

  • Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.

In config.yaml, configure the following (an illustrative sketch follows this list):

  • openai_key: your key for accessing OpenAI's models.
  • ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
  • prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
  • ollama_models: select a list of model tags from the Ollama library. For smoother runs, make sure each model is pre-downloaded with the ollama pull command (e.g. ollama pull codellama:13b-instruct).
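
An illustrative config.yaml could look like the sketch below. The authoritative reference is config_template.yaml in the src directory; the key names here mirror the list above, the URL assumes Ollama's default port, and the model tags are only examples:

    openai_key: <your-openai-api-key>
    ollama_url: http://localhost:11434
    prompt_id: questions_based_2
    ollama_models:
      - codellama:13b-instruct
      - mistral:instruct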

With the config.yaml configured, run the following command:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama
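
Before kicking off a long run, it can help to verify that the Ollama instance behind ollama_url is reachable and already has the configured models pulled. A minimal sketch, assuming the requests package is available and that the instance exposes Ollama's /api/tags endpoint for listing local models:

    import requests

    # Use the same value as ollama_url in config.yaml (default port assumed).
    OLLAMA_URL = "http://localhost:11434"

    # /api/tags returns the models available locally in the Ollama instance.
    response = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
    response.raise_for_status()
    local_models = [model["name"] for model in response.json().get("models", [])]
    print("Locally available Ollama models:", local_models)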

Running From Source

1. 📥 Installation

  1. Clone the repo

    git clone https://github.com/secure-software-engineering/TypeEvalPy.git
  2. Install Dependencies and Set Up Virtual Environment

    Run the following commands to create and activate a virtual environment and install the dependencies.

    python3 -m venv .env
    source .env/bin/activate
    pip install -r requirements.txt

2. 🚀 Usage: Running the Analysis

  1. Navigate to the src Directory

    cd src
  2. Execute the Analyzer

    Run the following command to start the benchmarking process on all tools:

    python main_runner.py

    or

    Run analysis on specific tools

    python main_runner.py --runners headergen scalpel
    

๐Ÿค Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


โญ๏ธ Show Your Support

Give a โญ๏ธ if this project helped you!


TypeEvalPy Issues

Some feedback on improving TypeEvalPy

Hi @ashwinprasadme and @Samzcodez,

First of all, great job! I have managed to run and evaluate Type4Py on the micro-benchmark with almost one command. The automation is great! That said, I have some comments that might help improve the user experience with TypeEvalPy, especially with the installation process.

  • Currently, one needs to run setup.sh, i.e., a shell script, to install the benchmark. This works, but in general people in the community are a bit hesitant to run a shell script for security reasons. Therefore, I suggest creating a proper Python package with a setup.py file, which installs the required dependencies and adds a command to the user's environment, e.g., typeevalpy. Then the user can run the benchmark with one command, say, typeevalpy run, or even typeevalpy run type4py if one wants to run the benchmark for only one tool. This way, people only install a Python package, which should also be platform-agnostic, at least for Linux distributions.
  • I could not find the .results folder containing the benchmark output. It would also be helpful to log the path to the results folder so that the user can find it easily.
  • After running hityperdl, I also encountered this error below, although I could see the benchmark results at the end of the run.
b"Python is running inside a Docker container\n2023-09-08 14:11:01,441 - runner - INFO - Command returned non-zero exit status: [Errno 2] No such file or directory: '/tmp/micro-benchmark/python_features/builtins/switch/._main_INFERREDTYPES.json' for file: /tmp/micro-benchmark/python_features/builtins/switch/main.py\n"

In general, I like the TypeEvalPy benchmark, and it is pretty easy to use.
