repohyper's Introduction

RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion

Introduction

We introduce RepoHyper, a novel framework that turns code completion into a seamless end-to-end process for use on real-world repositories. Traditional approaches integrate retrieved contexts into Code Language Models (CodeLLMs) and often presume that these contexts are inherently accurate. However, we have identified a gap: the standard benchmarks do not always present relevant contexts.

To address this, RepoHyper proposes three novel steps:

  • Construction of a Code Property Graph, establishing a rich source of context.
  • A novel Search Algorithm for pinpointing the exact context needed.
  • The Expand Algorithm, designed to uncover implicit connections between code elements (akin to the Link Prediction problem on social network mining).

Our comprehensive evaluations show that RepoHyper sets a new standard, outperforming other strong baselines on the RepoBench benchmark.

Installation

pip install -r requirements.txt

Architecture

RepoHyper is a two-stage model. The first stage runs a search-then-expand algorithm on the Repo-level Semantic Graph (RSG), then uses a GNN link predictor to rerank the results retrieved by KNN search and graph expansion. The second stage is any code LLM that takes the retrieved context and predicts the next line of code.
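The first stage can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the repository's implementation: the function names (`knn_search`, `expand`, `retrieve`), the toy embeddings, and the toy graph are all hypothetical, and plain cosine similarity stands in for the trained GNN link predictor as the reranking score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_search(query, embeddings, k):
    """Return the k node ids whose embeddings are most similar to the query."""
    ranked = sorted(embeddings, key=lambda n: cosine(query, embeddings[n]), reverse=True)
    return ranked[:k]

def expand(seeds, graph, hops):
    """Grow the seed set by following outgoing edges for a fixed number of hops."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {nb for n in frontier for nb in graph.get(n, ())} - visited
        visited |= frontier
    return visited

def retrieve(query, embeddings, graph, score_link, k=2, hops=1, top=3):
    """Search, expand, then rerank candidates with a link-scoring function."""
    seeds = knn_search(query, embeddings, k)
    candidates = expand(seeds, graph, hops)
    # score_link is a placeholder for the GNN link predictor (assumption).
    ranked = sorted(candidates, key=lambda n: score_link(query, embeddings[n]), reverse=True)
    return ranked[:top]

# Toy data, invented for illustration only.
embeddings = {
    "utils.load": [1.0, 0.0],
    "utils.save": [0.9, 0.1],
    "model.fit": [0.0, 1.0],
    "model.predict": [0.1, 0.9],
}
graph = {"utils.load": ["model.fit"], "model.fit": ["model.predict"]}

result = retrieve([1.0, 0.0], embeddings, graph, score_link=cosine)
print(result)
```

Here `model.fit` is not among the nearest neighbours of the query, but it is reachable from `utils.load` in one hop, so expansion adds it to the candidate set before reranking.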

Checkpoints

We provide the checkpoints for the GNN model here. The GNN model is trained on the RepoBench-R dataset with gold labels. We also provide the RepoBench-R RSGs to reproduce the results.

Usage

Data preparation

First, clone the RepoBench dataset into the data/repobench folder. Then download all the unique repositories used in this dataset:

python3 -m scripts.data.download_repos --dataset data/repobench --output data/repobench/repos --num-processes 8

The next step is to generate a call graph for each repository using PyCG with the following command. It uses 60 processes to speed things up (maximum RAM usage is around 350GB).

python3 -m scripts.data.generate_call_graph --repos data/repobench/repos --output data/repobench/repos_call_graphs --num-processes 60
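To show what the call graphs look like downstream, here is a small sketch that turns a PyCG-style result into an edge list. It assumes the adjacency-list JSON that PyCG writes by default (each key is a caller, each value a list of callees); how `scripts.data.generate_call_graph` lays out its output files is not documented here, so the helper `load_call_edges` and the sample data are hypothetical.

```python
import json

def load_call_edges(call_graph_json):
    """Convert adjacency-list JSON into a sorted list of (caller, callee) edges."""
    adjacency = json.loads(call_graph_json)
    return sorted(
        (caller, callee)
        for caller, callees in adjacency.items()
        for callee in callees
    )

# Invented sample in the adjacency-list shape (assumption).
sample = json.dumps({
    "pkg.main": ["pkg.utils.load", "pkg.model.fit"],
    "pkg.utils.load": [],
    "pkg.model.fit": ["pkg.utils.load"],
})

edges = load_call_edges(sample)
for caller, callee in edges:
    print(f"{caller} -> {callee}")
```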

Next, generate an embedding for each node and build the adjacency matrix by aligning Tree-sitter functions, classes, and methods with the call graph nodes.

python3 -m scripts.data.repo_to_embeddings --repos data/repobench/repos --call-graphs data/repobench/repos_call_graphs --output data/repobench/repos_graphs --num-processes 60
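The alignment step can be sketched as follows. To stay self-contained, this example uses Python's standard `ast` module in place of Tree-sitter, which is a deliberate substitution, and it omits the embedding step entirely. The helper names, the dotted naming scheme, and the sample call graph are assumptions made for illustration.

```python
import ast

def qualified_definitions(source, module):
    """Collect dotted names of functions, classes, and methods in a module."""
    names = []

    def visit(node, prefix):
        # Only definitions that are direct children are followed, so
        # definitions nested under if/try blocks are skipped in this sketch.
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}.{child.name}"
                names.append(name)
                visit(child, name)

    visit(ast.parse(source), module)
    return names

def adjacency_matrix(nodes, call_graph):
    """Build a 0/1 matrix with a 1 wherever a known node calls another known node."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for caller, callees in call_graph.items():
        for callee in callees:
            # Call graph nodes with no matching definition are dropped.
            if caller in index and callee in index:
                matrix[index[caller]][index[callee]] = 1
    return matrix

source = '''
class Model:
    def fit(self):
        return load()

def load():
    return []
'''

nodes = qualified_definitions(source, "pkg")
# Hypothetical call graph; the second entry points outside the repository.
call_graph = {"pkg.Model.fit": ["pkg.load"], "pkg.load": ["<builtin>.list"]}
matrix = adjacency_matrix(nodes, call_graph)
print(nodes)
print(matrix)
```

Callees that do not correspond to a definition in the repository (the builtin in this sample) simply produce no edge, which keeps the matrix restricted to in-repository nodes.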

The final step is to label which node is optimal for predicting the next line, using the gold snippet from the RepoBench dataset. This step also generates the training data for the GNN by extracting subgraphs using KNN search and RSG expansion.

python3 -m scripts.data.matching_repobench_graphs -search_policy "knn-pattern" --rsg_path "YOUR RSG PATH" --output data/repobench/repos_graphs_labeled 
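One way to picture the labeling is to pick the node whose source text is closest to the gold snippet. The matching criterion used by `scripts.data.matching_repobench_graphs` is not described here, so the sketch below is an assumption: it uses `difflib.SequenceMatcher` from the standard library as the similarity measure, and the function name and sample data are hypothetical.

```python
from difflib import SequenceMatcher

def label_best_node(node_sources, gold_snippet):
    """Return the node id whose source text is most similar to the gold snippet."""
    def similarity(text):
        return SequenceMatcher(None, text, gold_snippet).ratio()

    return max(node_sources, key=lambda name: similarity(node_sources[name]))

# Invented node sources keyed by node id.
node_sources = {
    "pkg.load": "def load(path):\n    return open(path).read()",
    "pkg.save": "def save(path, data):\n    open(path, 'w').write(data)",
}
gold = "def load(path):\n    return open(path).read()"

best = label_best_node(node_sources, gold)
print(best)
```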

Training

We can train the GNN linker separately using the following script:

CUDA_VISIBLE_DEVICES=0 deepspeed train_gnn.py --deepspeed --deepspeed_config ds_config.json --arch GraphSage --layers 1 --data-path data/repobench/repos_graphs_labeled_cosine_radius_unix --output data/repobench/gnn_model --num-epochs 10 --batch-size 16

Evaluation for RepoBench-P

We can evaluate the model using the following script

python3 scripts/evaluate_llm.py --data data/repobench/repos_graphs_matched_retrieved --model "gpt3.5" --num-workers 8
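This section does not state which metrics `scripts/evaluate_llm.py` reports. As a rough sketch of how next-line predictions are commonly scored, the example below computes exact match and a character-level similarity over prediction and reference pairs. The choice of metrics, the function names, and the sample pairs are assumptions, and `difflib.SequenceMatcher` is used as a stand-in similarity measure rather than a true edit distance.

```python
from difflib import SequenceMatcher

def exact_match(prediction, reference):
    """True when the two lines are identical after trimming whitespace."""
    return prediction.strip() == reference.strip()

def similarity(prediction, reference):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

def evaluate(pairs):
    """Average both scores over a list of (prediction, reference) pairs."""
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "similarity": sum(similarity(p, r) for p, r in pairs) / n,
    }

# Invented predictions and references.
pairs = [("return x + 1", "return x + 1"), ("return x", "return y")]
scores = evaluate(pairs)
print(scores)
```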

repohyper's Issues

Guidance on using the analysis tool for type resolution

I hope this message finds you well. I am currently exploring the code from this project that involves generating call graphs for software analysis. My work specifically focuses on the challenges of type resolution.

Despite going through the available documentation and examples, I've found it somewhat challenging to understand how to leverage this tool in the context of type resolution.

Specifically, I'm interested in:

  • Any existing functionality within the tool that facilitates type resolution.
  • Recommendations for extending this tool to support type resolution, if necessary.

Code for the Analysis of a Python Project

Firstly, thanks for sharing this repository.

After checking code and paper, I see that this project utilizes AST for analysis. Given Python's dynamic typing nature, it will be good to know how your code does this analysis. Could you kindly elaborate on how the graph analysis is conducted, especially considering Python's lack of static types? Also, do you provide code for the following?

  • Import Relations Analysis: Identification of relationships through import statements within the project scope, excluding external modules.
  • Invoke Relations Understanding: Determination of caller and callee relationships between functions or methods.
  • Class Hierarchy Relations: Understanding of inheritance relationships between classes.

I would appreciate it if you could point me to the specific parts of the code responsible for these analyses. Also, how could I apply your framework to other Python projects?

Any assistance or documentation on setting up and running this analysis would be highly appreciated.
