google-research / plur

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.

License: Apache License 2.0

Python 100.00%

machine-learning deep-learning program-synthesis software-engineering research

plur's Introduction

PLUR

Installation

SRC_DIR=${PWD}/src
mkdir -p ${SRC_DIR} && cd ${SRC_DIR}
# For Cubert.
git clone https://github.com/google-research/google-research --depth=1
export PYTHONPATH=${PYTHONPATH}:${SRC_DIR}/google-research
git clone https://github.com/google-research/plur && cd plur
python -m pip install -r requirements.txt
python setup.py install

Test execution on small dataset

cd plur
python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset \
  --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 \
  --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2 \
  --train_data_percentage=40 \
  --validation_data_percentage=30 \
  --test_data_percentage=30

Usage

Basic usage

Data generation (step 1)

Data generation is done by calling plur.plur_data_generation.create_dataset(). The data generation runs in two stages:

1. Convert raw data to plur.utils.GraphToOutputExample.
2. Convert plur.utils.GraphToOutputExample to TFExample.

Stage 1 is unique for each dataset, but stage 2 is the same for almost all datasets.

from plur.plur_data_generation import create_dataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

plur_data_generation.py also provides a command line interface, but it offers less flexibility.

python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2

Data loader (step 2)

After the data is generated, you can use PlurDataLoader to load the data. The data loader loads TFExamples but returns them as numpy arrays.

from plur.plur_data_loader import PlurDataLoader
from plur.utils import constants

dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
split = constants.TRAIN_SPLIT_NAME
batch_size = 32
repeat_count = -1
drop_remainder = True
train_data_generator = PlurDataLoader(dataset_stage_2_directory, split, batch_size, repeat_count, drop_remainder)

for batch_data in train_data_generator:
  # your training loop...
  pass
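
The roles of batch_size, repeat_count and drop_remainder can be illustrated without TensorFlow. The sketch below is not PlurDataLoader's implementation. It is a plain-Python generator that assumes the parameters follow the usual tf.data conventions: repeat_count=-1 repeats indefinitely, and drop_remainder=True discards a final batch that is shorter than batch_size. The name iter_batches is hypothetical.

```python
import itertools


def iter_batches(examples, batch_size, repeat_count=1, drop_remainder=False):
  """Yield lists of examples, mimicking batch/repeat semantics (assumed)."""
  # repeat_count == -1 means "repeat forever", as in tf.data's repeat().
  epochs = itertools.count() if repeat_count == -1 else range(repeat_count)
  for _ in epochs:
    for start in range(0, len(examples), batch_size):
      batch = examples[start:start + batch_size]
      if drop_remainder and len(batch) < batch_size:
        continue  # Discard the short final batch of this pass.
      yield batch


# Ten examples in batches of four: the last two are dropped on each pass.
finite = list(
    iter_batches(list(range(10)), 4, repeat_count=2, drop_remainder=True))
# With repeat_count=-1 the generator never ends, so take a fixed number of
# batches instead of exhausting it.
endless = list(
    itertools.islice(iter_batches(list(range(10)), 4, repeat_count=-1), 5))
```

Because repeat_count = -1 makes iteration endless, a training loop over such a generator needs its own stopping condition, for example a maximum number of steps.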

Training (step 3)

This is where users of the PLUR framework plug in their custom ML models and code to train and generate predictions for PLUR tasks.

We provide the GGNN, Transformer and GREAT models from the PLUR paper. See below for sample commands. For the full set of command line FLAGS, see plur/model_design/train.py.

Training

python3 train.py \
 --data_dir=/tmp/manysstubs4j_dataset/stage_2 \
 --exp_dir=/tmp/experiments/exp12345

Evaluation / Generating predictions

python3 train.py \
 --data_dir=/tmp/manysstubs4j_dataset/stage_2 \
 --exp_dir=/tmp/experiments/exp12345 \
 --evaluate=true

Evaluating (step 4)

Once the training is finished and you have generated natural text predictions on the test data, you can use plur_evaluator.py to evaluate the performance. plur_evaluator.py works in offline mode, meaning that it expects one or more files containing the ground truths, and matching files containing the predictions.

python3 plur_evaluator.py --dataset_name=manysstubs4j_dataset --target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt --prediction_file_pattern=/tmp/manysstubs4j_dataset/predictions.txt

When using multiple evaluation "rounds", the evaluator may create multiple targets and predictions files, named like ...predictions-0-of-5.txt. You can refer to all of them at once by using a glob pattern such as ...predictions-?-of-5.txt in the command above.
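
To see which files such a pattern selects, the matching rule can be tried in isolation. This is a small sketch using Python's fnmatch module. The file names are hypothetical examples of the naming scheme, and the sketch says nothing about how plur_evaluator.py itself expands the pattern.

```python
import fnmatch

# Hypothetical file names following the "-<round>-of-<total>" naming scheme.
file_names = [
    'predictions-0-of-5.txt',
    'predictions-1-of-5.txt',
    'predictions-4-of-5.txt',
    'predictions-10-of-50.txt',
    'targets-0-of-5.txt',
]

# "?" matches exactly one character, so only single-digit rounds of 5 match.
matched = fnmatch.filter(file_names, 'predictions-?-of-5.txt')
print(matched)
```

On the command line, quoting the pattern keeps the shell from trying to expand it before the evaluator sees it.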

For more details about how plur_evaluator works see plur/eval/README.md.

Transforming and filtering data

If you want to change something fundamental in the dataset, apply the change in stage 1 of data generation; otherwise, apply it in stage 2. The idea is that stage 1 should only be run once per dataset (to create the plur.utils.GraphToOutputExample), and stage 2 should be run each time you want to train on different data (to create the TFRecords).

All transformation and filtering functions are applied to plur.utils.GraphToOutputExample; see that class for more information.

For example, a stage 1 transformation is appropriate when your model expects graphs without loops: you write a transformation function that removes loops, which ensures that stage 2 only ever reads loop-free graphs.
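
As a rough illustration of such a transformation, the sketch below removes self-loops from a list of edges. It rests on two assumptions that are not taken from PLUR: "loop" is read as an edge from a node to itself, and edges are represented as hypothetical (source, edge_type, target) tuples. A real transformation function would operate on a GraphToOutputExample instead.

```python
def remove_self_loops(edges):
  """Return the edges whose source and target differ.

  Each edge is a hypothetical (source, edge_type, target) tuple.
  """
  return [(src, edge_type, dst) for src, edge_type, dst in edges if src != dst]


edges = [(0, 'NEXT_TOKEN', 1), (1, 'SELF', 1), (1, 'NEXT_TOKEN', 2)]
cleaned = remove_self_loops(edges)
```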

For example, a stage 2 filter is appropriate when you want to check your model's performance on graphs of different sizes, measured in number of nodes: you write a filter function that drops graphs with too many nodes.

from plur.plur_data_generation import create_dataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
def _filter_graph_size(graph_to_output_example, graph_size=1024):
  return len(graph_to_output_example.get_nodes()) <= graph_size
stage_2_kwargs = dict(
    train_filter_funcs=(_filter_graph_size,),
    validation_filter_funcs=(_filter_graph_size,)
)
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

Advanced usage

plur.plur_data_generation.create_dataset() is just a thin wrapper around plur.stage_1.plur_dataset and plur.stage_2.graph_to_output_example_to_tfexample.

from plur.plur_data_generation import create_dataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

is equivalent to

from plur.stage_1.manysstubs4j_dataset import ManySStubs4jJDataset
from plur.stage_2.graph_to_output_example_to_tfexample import GraphToOutputExampleToTfexample

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
dataset = ManySStubs4jJDataset(dataset_stage_1_directory)
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

dataset = GraphToOutputExampleToTfexample(dataset_stage_1_directory, dataset_stage_2_directory, dataset_name)
dataset.stage_2_mkdirs()
dataset.run_pipeline()

You can check out plur.stage_1.manysstubs4j_dataset for dataset-specific arguments.

from plur.stage_1.manysstubs4j_dataset import ManySStubs4jJDataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'

dataset = ManySStubs4jJDataset(dataset_stage_1_directory, dataset_size='large')
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

Adding a new dataset

All datasets should inherit from plur.stage_1.plur_dataset.PlurDataset and be placed under plur/stage_1/. Subclasses must implement:

download_dataset(): Code to download the dataset. We provide download_dataset_using_git() to download from git and download_dataset_using_requests() to download from a URL, which also works with a Google Drive URL. download_dataset_using_git() downloads the dataset at a specific commit id, and download_dataset_using_requests() checks the sha1sum of the downloaded files. This ensures that the same version of PLUR always downloads the same raw data.
get_all_raw_data_paths(): Returns a list of paths, where each path is a file containing raw data from the dataset.
raw_data_paths_to_raw_data_do_fn(): Returns a beam.DoFn class that overrides process(). process() tells Beam how to open the files returned by get_all_raw_data_paths(). This is also where we define which split (train/validation/test), if any, the data belongs to.
raw_data_to_graph_to_output_example(): Transforms raw data from raw_data_paths_to_raw_data_do_fn() into a GraphToOutputExample.
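
The way the four methods fit together can be sketched without PLUR or Beam installed. Everything below is hypothetical: PlurDatasetSketch is a stand-in for PlurDataset (whose real constructor and method signatures may differ), _ReadLines is a plain class playing the role of the beam.DoFn, the "downloaded" files live in an in-memory dict, and the output is a plain dict rather than a GraphToOutputExample.

```python
import abc


class PlurDatasetSketch(abc.ABC):
  """Hypothetical stand-in for PlurDataset; real signatures may differ."""

  @abc.abstractmethod
  def download_dataset(self):
    """Fetch the raw data."""

  @abc.abstractmethod
  def get_all_raw_data_paths(self):
    """Return a list of paths to raw data files."""

  @abc.abstractmethod
  def raw_data_paths_to_raw_data_do_fn(self):
    """Return an object whose process(path) yields raw records."""

  @abc.abstractmethod
  def raw_data_to_graph_to_output_example(self, raw_data):
    """Convert one raw record into an example."""


class _ReadLines:
  """Plays the role of the beam.DoFn; plain Python to stay runnable."""

  def __init__(self, files):
    self._files = files

  def process(self, path):
    # The split is decided here, while reading the raw data.
    split = 'train' if 'train' in path else 'test'
    for line in self._files[path]:
      yield {'split': split, 'text': line}


class FooDataset(PlurDatasetSketch):

  def __init__(self):
    self._files = {}

  def download_dataset(self):
    # A real dataset would download files; this fills a dict instead.
    self._files = {'train.txt': ['a b', 'c'], 'test.txt': ['d e f']}

  def get_all_raw_data_paths(self):
    return sorted(self._files)

  def raw_data_paths_to_raw_data_do_fn(self):
    return _ReadLines(self._files)

  def raw_data_to_graph_to_output_example(self, raw_data):
    # Placeholder conversion: one node per whitespace-separated token.
    return {'split': raw_data['split'], 'nodes': raw_data['text'].split()}


dataset = FooDataset()
dataset.download_dataset()
do_fn = dataset.raw_data_paths_to_raw_data_do_fn()
examples = [
    dataset.raw_data_to_graph_to_output_example(raw)
    for path in dataset.get_all_raw_data_paths()
    for raw in do_fn.process(path)
]
```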

Then add/change the following lines in plur/plur_data_generation.py:

from plur.stage_1.foo_dataset import FooDataset

flags.DEFINE_enum(
    'dataset_name',
    'dummy_dataset',
    (
        'code2seq_dataset',
        'convattn_dataset',
        'dummy_dataset',
        # [...]
        'retrieve_and_edit_dataset',
        'foo_dataset',
    ),
    'Name of the dataset to generate data.')

# [...]
def get_dataset_class(dataset_name):
  """Get the dataset class based on dataset_name."""
  if dataset_name == 'code2seq_dataset':
    return Code2SeqDataset
  elif dataset_name == 'convattn_dataset':
    return ConvAttnDataset
  elif dataset_name == 'dummy_dataset':
    return DummyDataset
  # [...]
  elif dataset_name == 'retrieve_and_edit_dataset':
    return RetrieveAndEditDataset
  elif dataset_name == 'foo_dataset':
    return FooDataset
  else:
    raise ValueError('{} is not supported.'.format(dataset_name))
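
As the number of datasets grows, the if/elif chain can become long. One alternative design, which is not what plur_data_generation.py does, is a dictionary that maps names to classes. The sketch below uses empty placeholder classes so that it runs on its own; _DATASET_CLASSES is a hypothetical name.

```python
class Code2SeqDataset:
  """Placeholder; the real class lives under plur/stage_1/."""


class FooDataset:
  """Placeholder for the newly added dataset."""


# Hypothetical registry: adding a dataset means adding one entry here.
_DATASET_CLASSES = {
    'code2seq_dataset': Code2SeqDataset,
    'foo_dataset': FooDataset,
}


def get_dataset_class(dataset_name):
  """Get the dataset class based on dataset_name."""
  try:
    return _DATASET_CLASSES[dataset_name]
  except KeyError:
    raise ValueError('{} is not supported.'.format(dataset_name)) from None


# The allowed values for an enum flag could be derived from the same dict.
supported_names = tuple(_DATASET_CLASSES)
```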

Evaluation details

The details of how evaluation is performed are in plur/eval/README.md.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an officially supported Google product.

Citation

Please cite the PLUR paper, Chen et al. https://proceedings.neurips.cc//paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html

plur's Issues

Graph2ToCoPo model

Thanks for the artefact. I have a couple of questions about this artefact. After reading the paper:

(1) One major contribution the paper claims to provide is the open-sourced framework of PLUR that others can use. I am wondering whether you are planning to share the model and training scripts?

(2) I am particularly interested to run your model for Hoppity. Is there a possibility you can share the model even if that is partial?

I am wondering whether you would share the models and steps for the training code?

FailedPreconditionError error during evaluation

At the time of running evaluation for the hoppity dataset, we are encountering the following error:

Traceback (most recent call last):
  File "train.py", line 340, in <module>
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "train.py", line 281, in main
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 132, in evaluate
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 245, in generate_predictions
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 160, in evaluate_chunk
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 329, in _evaluate_chunk
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/plur_data_loader.py", line 459, in __next__
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/data_generation.py", line 33, in __call__
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4635, in __next__
    return nest.map_structure(to_numpy, next(self._iterator))
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 766, in __next__
    return self._next_internal()
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 749, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3017, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7209, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.FailedPreconditionError: {{function_node __wrapped__IteratorGetNext_output_types_10_device_/job:localhost/replica:0/task:0/device:CPU:0}} /arc/project/st-amesbah-1/plur-data/stage_2/tfrecords/test/hoppity_single_ast_diff_dataset-00761-of-01000.tfrecord; Bad file descriptor [Op:IteratorGetNext]

Could you please assist us in debugging this error @smoitra-g?

How to load pretrained embeddings with PLUR?

Can I load pre-trained embeddings, let's say from GraphCodeBERT? I would appreciate it if you could provide me with some pointers.

Query about evaluation script

The README says to run plur_evaluator.py:

python3 plur_evaluator.py \
   --dataset_name=manysstubs4j_dataset \
   --target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt \
   --prediction_file_pattern=/tmp/manysstubs4j_dataset/predictions.txt

Here the ground truth is targets.txt (--target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt).

I have two queries:

What's the format of this file?
Shouldn't this targets.txt file be created after running the data generation script (plur_data_generation.py)?

But after running plur_data_generation.py, I don't find the targets.txt in the /tmp/manysstubs4j_dataset/ folder.

@smoitra87 @VHellendoorn @dan-zheng can you please help me with this query?

Preserving the Characteristics of A Custom Graph

The PLUR framework can accept a custom graph and produce the task output accordingly. The framework is also capable of incorporating relational information that can be used to represent syntax, data flow and control flow.

I had the following query regarding this point while exploring the PLUR framework:

If a graph with various relational edges is given as input, does PLUR preserve the characteristics of the edges when feeding it into the model, or does it create a common representation for all types of graph? In summary, does PLUR preserve the characteristics of a custom graph?

Output Token Generation Capability | Program Repairing

Hello. I am working on a program repairing project using your platform. In the paper, it is written as follows:

The TOCOPO output can be intuitively viewed as a script that describes the task output in terms of tokens, drawn from the output vocabulary, and pointers pointing to some input node, concluding with a DONE token marking the end of the output. Every task can make its own use of these facilities to express a grammar for its output. For example, a classification task can use token outputs, one per expected class; a sequence-prediction task can produce a sequence of tokens; a repair task can use a pointer to point at a particular input node, and a token output to replace that input node.

My query is regarding the highlighted line. Can the output handle multiple errors in an input node, i.e. can it produce multiple tokens to handle the errors? Or is it limited to a single token per node? Can you clarify this point?

Hoppity dataset generation

Hi,

I'm trying to run the data generation script using the cooked graphs from my dataset. However, I see in your code that you use something called hoppity_cg.tar.gz (https://github.com/google-research/plur/blob/main/plur/stage_1/hoppity_single_ast_diff_dataset.py#L61) to get some json files. What is this used for? This was not available in the hoppity repo. Is this some pre-processing that you have done on your end?

UnicodeDecodeError Query

While following the installation process, I ran into this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2625450: character maps to <undefined> [while running 'Read all raw data']

I am attaching the screenshot for convenience. I am not sure if this is a problem with my device. I am running the code on Windows.

Is this a problem with the OS? Do I have to use a Linux-based OS to run PLUR (or is that recommended)?

Thank you.
