HELoC: Hierarchical Contrastive Learning of Code Representations

A PyTorch Implementation of "HELoC: Hierarchical Contrastive Learning of Code Representations"

Map Any Code Snippet into Vector Embedding with HELoC

HELoC is a self-supervised hierarchical contrastive learning model for code representation. Its key idea is to formulate the learning of the AST hierarchy as a pretext task for self-supervised contrastive learning. Cross-entropy and triplet losses are adopted as learning objectives to predict each node's level and to learn the hierarchical relationships between nodes, so that the representation vectors of nodes with greater differences in AST level lie farther apart in the embedding space. With such vectors, the structural similarity between code snippets can be measured more precisely. After pre-training, HELoC can be applied to many source-code-related downstream tasks.
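To make the two objectives concrete, here is a minimal sketch in plain Python of how a cross-entropy term for level prediction and a triplet term for hierarchical distance might be combined. It is not the code in this repository: the function names, the Euclidean distance, the weight parameter, and the choice to scale the margin by the level gap are all assumptions made for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one node's level logits against its true level index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss that pushes the negative at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hierarchical_loss(logits, level, anchor, positive, negative,
                      level_gap, base_margin=1.0, weight=1.0):
    """Level-prediction loss plus a triplet loss for one (anchor, positive, negative) triple.

    Assumption (not taken from the repository): the margin grows linearly
    with the difference in AST level between the anchor and the negative.
    """
    margin = base_margin * level_gap
    return cross_entropy(logits, level) + weight * triplet_loss(
        anchor, positive, negative, margin
    )
```

In this sketch the positive would be a node at the same AST level as the anchor and the negative a node at a different level; a larger level_gap demands a larger separation, which mirrors the stated goal of placing nodes with greater level differences farther apart.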

Requirements

pytorch 1.7.0
python 3.7.8
dgl 0.5.3
flair 0.7
pycparser 2.20
javalang 0.13.0
gensim 3.8.3

Run

pre-training

Usage

We extract the AST node embedding and path embedding in the following two steps:

  1. Generate the initial encoding for the dataset you need by running one of:
       python parsercode.py --lang oj
       python parsercode.py --lang gcj
       python parsercode.py --lang bcb
  2. Run python pre_training.py --dataset_nodeemb [The path to the dataset in which the nodes have been encoded]
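The two steps above can be chained from a small driver script. The sketch below only assembles the command lines documented in this README; the helper names, the dry_run switch, and the example path are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Dataset identifiers accepted by parsercode.py according to this README.
DATASETS = ("oj", "gcj", "bcb")


def parse_command(lang):
    """Build the step-1 command that generates the initial encoding."""
    if lang not in DATASETS:
        raise ValueError(f"unknown dataset {lang!r}, expected one of {DATASETS}")
    # sys.executable reuses the interpreter that runs this script.
    return [sys.executable, "parsercode.py", "--lang", lang]


def pretrain_command(node_emb_path):
    """Build the step-2 command that pre-trains on the encoded nodes."""
    return [sys.executable, "pre_training.py", "--dataset_nodeemb", node_emb_path]


def run_pipeline(lang, node_emb_path, dry_run=True):
    """Run both steps in order; with dry_run=True only print the commands."""
    commands = [parse_command(lang), pretrain_command(node_emb_path)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline if a step exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

For example, run_pipeline("oj", "path/to/encoded_nodes") prints the two commands without executing them; pass dry_run=False from the repository root to actually run them. The path here is a placeholder.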

Application of HELoC in downstream tasks

We evaluate the HELoC model on two tasks: code classification and code clone detection. It is also expected to be helpful in other downstream tasks. For code classification, we evaluate HELoC on two datasets: GCJ and OJ. For code clone detection, we evaluate HELoC on three datasets: BCB, GCJ and OJClone.

Code Classification

Run python cla.py --nodedataset_path [node emb path] --pathdataset_path [path emb path] --pre_model [pre_model]

Code Clone Detection

Run python clo.py --dataset [dataset name] --pair_file [path to the clone pairs] --nodedataset_path [node emb path] --pathdataset_path [path emb path] --pre_model [pre_model]
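As a rough illustration of how code vectors can be compared for clone detection, the sketch below scores a pair of snippet embeddings by cosine similarity and applies a threshold. This is an assumption for illustration only: clo.py may well use a trained classifier instead, and the function names and the 0.8 threshold are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_clone(u, v, threshold=0.8):
    """Flag a pair as a clone when its similarity reaches the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on a validation set of labelled clone pairs.
    """
    return cosine_similarity(u, v) >= threshold
```

Two vectors pointing in the same direction score close to 1 and are flagged as clones, while orthogonal vectors score 0 and are not.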
