
Funcom

Funcom Source Code Summarization Tool - Public Release

This repository contains the public release code for Funcom, a tool for source code summarization. Code summarization is the task of automatically generating natural language descriptions of source code.

Publications related to this work include:

LeClair, A., McMillan, C., "Recommendations for Datasets for Source Code Summarization", in Proc. of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'19), Short Research Paper Track, Minneapolis, USA, June 2-7, 2019.

LeClair, A., Jiang, S., McMillan, C., "A Neural Model for Generating Natural Language Summaries of Program Subroutines", in Proc. of the 41st ACM/IEEE International Conference on Software Engineering (ICSE'19), Montreal, QC, Canada, May 25-31, 2019.
https://arxiv.org/abs/1902.01954

Example Output

Randomly sampled example output from the ast-attendgru model, compared to the human-written reference summaries:

PROTOTYPE OUTPUT - HUMAN REFERENCE
returns the duration of the movie - get the full length of this movie in seconds
write a string to the client - write a string to all the connected clients
this method is called to indicate the next page in the page - call to explicitly go to the next page from within a single draw
returns a list of all the ids that match the given gene - get a list of superfamily ids for a gene name
compares two nodes by their UNK - compare nodes n1 and n2 by their dx entry
this method updates the tree panel - updates the tree panel with a new tree
returns the number of residues in the sequence - get number of interacting residues in domain b
returns true if the network is found - return true if passed inet address match a network which was used
log status message - log the status of the current message as info

Update Notice

This repository is archival for the ICSE'19 paper mentioned above. It is a good place to get started, but you may also want to look at our newer projects:

https://github.com/aakashba/callcon-public

https://github.com/Attn-to-FC/Attn-to-FC

https://github.com/acleclair/ICPC2020_GNN

USAGE

Step 0: Dependencies

We assume Ubuntu 18.04, Python 3.6, Keras 2.2.4, and TensorFlow 1.12. Your mileage may vary on other systems.
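A minimal environment setup matching those versions might look like the following. Treat it as a sketch, not a complete recipe: tensorflow-gpu 1.12 also needs a compatible CUDA/cuDNN toolkit installed system-wide, which pip does not provide.

```shell
# Sketch of a matching Python environment (assumes Python 3.6 and pip are
# already installed); the pins mirror the versions the authors assume.
# tensorflow-gpu 1.12 additionally requires a compatible CUDA/cuDNN install.
pip3 install keras==2.2.4 tensorflow-gpu==1.12.0
```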

Step 1: Obtain Dataset

We provide a dataset of 2.1m Java methods and method comments, already cleaned and separated into training/val/test sets:

https://s3.us-east-2.amazonaws.com/icse2018/index.html

(Note: this paper is now several years old. Please see an update of data here: https://github.com/aakashba/callcon-public)

Extract the dataset to a directory (/scratch/ is the assumed default) so that you have a directory structure:
/scratch/funcom/data/standard/dataset.pkl
etc. in accordance with the files described on the site above.

To be consistent with defaults, create the following directories:
/scratch/funcom/data/outdir/models/
/scratch/funcom/data/outdir/histories/
/scratch/funcom/data/outdir/predictions/
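The directories above can be created in one step. In this sketch, FUNCOM_BASE is an illustrative variable for machines where scratch space lives elsewhere; the repo's scripts themselves assume the /scratch/funcom default.

```shell
# FUNCOM_BASE is a hypothetical convenience variable, not something the
# repo reads; the training/prediction scripts assume /scratch/funcom.
FUNCOM_BASE="${FUNCOM_BASE:-/scratch/funcom}"
mkdir -p "$FUNCOM_BASE/data/standard" \
         "$FUNCOM_BASE/data/outdir/models" \
         "$FUNCOM_BASE/data/outdir/histories" \
         "$FUNCOM_BASE/data/outdir/predictions"
```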

Step 2: Train a Model

you@server:~/dev/funcom$ time python3 train.py --model-type=attendgru --gpu=0

Model types are defined in model.py. The ICSE'19 version is ast-attendgru, if you are seeking to reproduce it for comparison to your own models. Note that history information for each epoch is stored in a pkl file, e.g. /scratch/funcom/data/outdir/histories/attendgru_hist_1551297717.pkl. The integer at the end of the filename is the Unix epoch time at which training started, and is used to connect history, configuration, model, and prediction data. For example, training attendgru to epoch 5 would produce:

/scratch/funcom/data/outdir/histories/attendgru_conf_1551297717.pkl
/scratch/funcom/data/outdir/histories/attendgru_hist_1551297717.pkl
/scratch/funcom/data/outdir/models/attendgru_E01_1551297717.h5
/scratch/funcom/data/outdir/models/attendgru_E02_1551297717.h5
/scratch/funcom/data/outdir/models/attendgru_E03_1551297717.h5
/scratch/funcom/data/outdir/models/attendgru_E04_1551297717.h5
/scratch/funcom/data/outdir/models/attendgru_E05_1551297717.h5

A good baseline for initial work is the attendgru model. Comments in the file (models/attendgru.py) explain its behavior in detail, and it trains relatively quickly: about 45 minutes per epoch using batch size 200 on a single Quadro P5000, with maximum performance on the validation set at epoch 5.
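Since per-epoch metrics live in those history pkl files, you can pick the best checkpoint programmatically. This sketch assumes the pkl holds a Keras History.history-style dict of per-epoch metric lists (a typical Keras workflow, not verified against the repo); `best_epoch` and the `val_loss` key are illustrative names.

```python
import pickle

def best_epoch(hist_path, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of `metric`.

    Assumes the pkl is a pickled dict of per-epoch metric lists, as
    Keras's History.history provides; adjust `metric` if your run logs
    a different validation measure.
    """
    with open(hist_path, "rb") as f:
        history = pickle.load(f)
    values = history[metric]
    # Index of the minimum value, converted to a 1-based epoch number.
    return min(range(len(values)), key=values.__getitem__) + 1
```

You could then feed the matching attendgru_EXX_*.h5 file to predict.py.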

Step 3: Inference / Prediction

you@server:~/dev/funcom$ time python3 predict.py /scratch/funcom/data/outdir/models/attendgru_E05_1551297717.h5 --gpu=0

The only necessary input to predict.py on the command line is the model file, but configuration information is read from the pkl files mentioned above. Output predictions will be written to a file e.g.:

/scratch/funcom/data/outdir/predictions/predict-attendgru_E05_1551297717.txt

Note that CPU prediction is possible in principle, but by default the attendgru and ast-attendgru models use CuDNNGRU instead of the standard GRU layer, which requires a GPU during prediction.

Step 4: Calculate Metrics

you@server:~/dev/funcom$ time python3 bleu.py /scratch/funcom/data/outdir/predictions/predict-attendgru_E05_1551297717.txt

This will output a BLEU score for the prediction file.
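If you want to sanity-check a score without the repo's bleu.py (whose tokenization and smoothing details may differ), a minimal corpus BLEU-4 can be computed directly. `corpus_bleu` below is a standalone sketch with one reference per hypothesis and no smoothing, so its numbers will not exactly match bleu.py's output.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per hypothesis.

    `references` and `hypotheses` are parallel lists of token lists.
    """
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # hypothesis n-gram counts, per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_grams, ref_grams = ngrams(hyp, n), ngrams(ref, n)
            clipped[n - 1] += sum(min(c, ref_grams[g]) for g, c in hyp_grams.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # some n-gram order had zero matches; unsmoothed BLEU is 0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty for hypotheses shorter than their references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)
```

An identical hypothesis/reference pair scores 1.0; completely disjoint token lists score 0.0.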

funcom's People

Contributors

acleclair, mcmillco


funcom's Issues

Several detailed questions about reproduction

Hi, I am really interested in this ICSE paper you published. Recently I planned to reproduce the experiments, but I face some difficulties.

  1. I have reproduced the standard data set under both the attendgru and ast-attendgru methods. Due to the limits of my devices and network, I only trained for 5 epochs under attendgru (about 11 hours) and 3 epochs under ast-attendgru (about 6 hours). I see the original code on GitHub uses 100 epochs, which suggests higher accuracy, so I am not sure whether my BLEU results are OK. Here are my results:

     Model               Ba     B1     B2     B3     B4
     attendgru, E05      19.14  37.88  21.4   14.66  11.3
     attendgru, E03      19.24  38.65  21.77  14.66  11.12
     ast-attendgru, E03  19.37  38.74  21.88  14.75  11.27

  2. I still plan to run the challenge data set, but I cannot find it on the website you provide. I only find a data set named "sbt", which only contains coms.tok and dats.tok. In your paper, you say "The challenge dataset contains two elements for each method: 1) the pre-processed comment, and 2) the SBT-AO representation of the Java code". So I guess "sbt" is the challenge data set. If I am mistaken, could you please tell me where to download the challenge data set? Thanks.

  3. I see that you provide the final trained ast-attendgru model files (.h5) for both the standard and challenge data. But loading a model requires the corresponding history/configuration file, which I cannot find, so I cannot run prediction as the next step.

My development environment:

  • Google Colab env, GPU with high RAM
  • Keras==2.2.5
  • tensorflow-gpu==1.14
  • h5py==2.10.0

A funcom reproduction instance by using Google Colab

Hi, I have run both the standard and challenge experiments using both the attendgru and ast-attendgru methods. The good news is that my results were nearly the same as those reported in your paper. I also wrote a tutorial and put it up here to help others who want to reproduce these interesting experiments.

FileNotFoundError: [Errno 2] No such file or directory: '/scratch/funcom/data/standard/tdats.tok'

Hi,
I followed the instructions in the README.md. The dependencies in step 0 are installed, the datasets in step 1 were downloaded from https://s3.us-east-2.amazonaws.com/icse2018/index.html, and the directories were created following step 1. When I reach step 2 and run the command "time python3 train.py --model-type=attendgru --gpu=0", I get the following error:
[screenshot]

As shown in the screenshot, when I look through the path '/scratch/funcom/data/standard/', there is no file named 'tdats.tok'.

Looking at train.py, around line 104 it opens four files, but only smls.tok and coms.tok are in the path '/scratch/funcom/data/standard/'; the other two files, tdats.tok and sdats.tok, are not there at all.
[screenshot]

Do you have any idea what I should do to resolve the problem?
I would appreciate any suggestions.

Error: no gradients provided

I ran your code according to your instructions, and after compiling the model this error appears. Can you help me fix it?

ValueError: No gradients provided for any variable: ['embedding/embeddings:0', 'embedding_1/embeddings:0', 'gru/gru_cell/kernel:0', 'gru/gru_cell/recurrent_kernel:0', 'gru/gru_cell/bias:0', 'gru_1/gru_cell_1/kernel:0', 'gru_1/gru_cell_1/recurrent_kernel:0', 'gru_1/gru_cell_1/bias:0', 'time_distributed/kernel:0', 'time_distributed/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0'].
