
infercode's Introduction



Map Any Code Snippet into Vector Embedding with InferCode.

This is a TensorFlow implementation of "InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees" (ICSE'21). The key idea of InferCode is to train an encoder on a pretext task: predicting the subtrees of a program's AST. The weights learned by the encoder can then be transferred to other downstream tasks. This alleviates the need for large amounts of labeled data to build decent code learning models in software engineering. With this approach, representation learning models for source code can learn from unlabeled data.

We trained our model on a dataset comprising 19 languages: java, c, c++, c#, golang, javascript, lua, php, python, ruby, rust, scala, kotlin, solidity, haskell, r, html, css, bash. We use tree-sitter as the backbone to parse these languages into ASTs. This differs from the implementation reported in our paper, which used srcml as the AST parser. We switched because tree-sitter supports more languages than srcml and also provides a Python binding, which makes it easy to parse any code snippet into an AST from Python code. Details of our old implementation using srcml can be found in old_version.

Set up

Install the PyPI package (current version is 0.0.28):

pip3 install infercode

Usage

InferCode can be tested or used as a command:

infercode <file1>.<ext1> [<file2>.<ext2>...]

where <file> is a file name and <ext> is its file extension. InferCode uses the extension to determine the programming language and select the corresponding parser. The command generates a numpy vector for each file given as an argument.
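The extension-based language selection described above can be illustrated with a small sketch. The mapping table and the `language_for` helper are hypothetical names introduced for illustration; the table the package actually uses may differ.

```python
import os

# Hypothetical mapping for illustration only; the package's real table may differ.
EXTENSION_TO_LANGUAGE = {
    ".java": "java",
    ".c": "c",
    ".py": "python",
    ".rs": "rust",
    ".go": "golang",
}


def language_for(path):
    """Return the language name for a file path, or None if the extension is unknown."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_LANGUAGE.get(ext.lower())
```

A caller could use the returned name to decide which parser to load and skip files whose extension is not recognised.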

You can also use infercode as a Python library for more advanced uses:

from infercode.client.infercode_client import InferCodeClient
import os
import logging
logging.basicConfig(level=logging.INFO)

# Change from -1 to 0 to enable GPU
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"

infercode = InferCodeClient(language="c")
infercode.init_from_config()
vectors = infercode.encode(["for (i = 0; i < n; i++)", "struct book{ int num; char s[27]; }shu[1000];"])

print(vectors)

Then we have the output embeddings:

[[ 0.00455336  0.00277071  0.00299444 -0.00264732  0.00424443  0.02380365
0.00802475  0.01927063  0.00889819  0.01684897  0.03249155  0.01853252
0.00930241  0.02532686  0.00152953  0.0027509   0.00200306 -0.00042401
0.00093602  0.044968   -0.0041187   0.00760367  0.01713051  0.0051542
-0.00033204  0.01757674 -0.00852873  0.00510181  0.02680481  0.00579945
0.00298177  0.00650377  0.01903037  0.00188015  0.00644581  0.02502727
-0.00599149  0.00339381  0.01834774 -0.0012807  -0.00413265  0.01172356
0.01524384  0.00769007  0.01364587 -0.00340345  0.02757765  0.03651286
0.01334631  0.01464784]
[-0.00017088  0.01376707  0.01347563  0.00545072  0.01674811  0.01347677
0.01061796  0.02521674  0.01205592  0.03466582  0.01449588  0.02479498
-0.00011303  0.01174722  0.00444653  0.01382409 -0.00396148 -0.00195686
0.00527923  0.03169966 -0.00935379  0.01904526  0.02334653 -0.00742705
0.00405659  0.0158342  -0.00599484  0.01687686  0.03012032  0.01365279
0.01936428  0.00576922  0.01786506  0.00244599  0.00816536  0.03116215
-0.00721357  0.01265837  0.029279    0.00394636  0.00475944  0.0057507
0.02005564  0.00345545  0.01078242  0.00763404  0.01771503  0.02223164
0.01541999  0.03995579]]

Note that on the first run, the script builds the tree-sitter parsers from source into ~/.tree-sitter/bin, downloads our pretrained model, and stores it in ~/.infercode_data/model_checkpoint.
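To see whether that first-run setup has already happened, one can check for the two directories named above. This is a minimal sketch; `cache_status` is a hypothetical helper, not part of the package, and it only tests that the directories exist, not that their contents are complete.

```python
from pathlib import Path


def cache_status(home=None):
    """Report whether the parser and checkpoint directories exist under `home`."""
    home = Path(home) if home is not None else Path.home()
    locations = {
        "parsers": home / ".tree-sitter" / "bin",
        "checkpoint": home / ".infercode_data" / "model_checkpoint",
    }
    return {name: path.is_dir() for name, path in locations.items()}
```

Calling `cache_status()` with no argument inspects the current user's home directory.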

Comparison with other work

  • There are a few other techniques for code representation learning, but none of them is designed to provide a pretrained model that converts code to vectors. For example, Code2vec (Alon et al.), despite its attractive name, is not well suited to converting code to vectors, since the model was trained to predict method names. If one wants to reuse the Code2vec model to convert code to vectors, its implementation is not ready for this purpose.

  • There are also other pretrained models for code, such as CodeBert, GraphCodeBert, and CuBert, but they did not wrap their code into usable interfaces.

  • None of the above work supports as many languages as InferCode.

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{bui2021infercode,
  title={InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees},
  author={Bui, Nghi DQ and Yu, Yijun and Jiang, Lingxiao},
  booktitle={2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)},
  pages={1186--1197},
  year={2021},
  organization={IEEE}
}

infercode's People

Contributors

bdqnghi, golemxlv, pilottesting, yijunyu


infercode's Issues

Preprocessing of datasets

Hello!

Thanks for your work on InferCode, it's awesome!
My name is Maksim Zubkov, and I am doing my bachelor thesis at JetBrains Research on self-supervised learning techniques for source code. I want to compare the pre-training scheme proposed in your paper with the one I am investigating in my research.

I tried to initialize CodeClassificationData to train the model on my data, but I could not find a script to create files with a .pkl extension. Now it seems I was finally able to run preprocessing. To get there, I followed these steps:

  1. As suggested in the README, I executed: docker run --rm -v $(pwd):/data -w /data --entrypoint /usr/local/bin/subtree -it yijun/fast examples/raw_code examples/subtrees node_types.csv to create .ids.csv files in examples/subtrees.
  2. Then I explored the yijun/fast docker image and found the binary /usr/local/bin/pkl. I ran docker with /usr/local/bin/pkl as the entry point, which produced several .pkl files.
  3. Then I made minor changes to your repo, namely adding some __init__.py files.
  4. The next step was to deal with the fast_pb2.py file, which I simply copied from the graph-ast repo.

Finally, I succeeded in creating the trees object and running put_trees_into_bucket, but could you please answer several questions:

  1. Is this the correct procedure to prepare data for your model? If so, I can create a pull request and add all this information to the README. Or maybe I missed some important point?
  2. I didn't get the difference between /usr/local/bin/pkl and /usr/local/bin/pklpos. Could you please explain it?

If I can help with open-sourcing the InferCode code base, I will be glad to do so.

TypeError: Argument to set_language must be a Language

python test.py
2021-08-29 16:16:20.939654: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /data/anaconda3/envs/infercode/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    infercode.init_from_config()
  File "/data/anaconda3/envs/infercode/lib/python3.8/site-packages/infercode/client/infercode_client.py", line 29, in init_from_config
    self.init_utils()
  File "/data/anaconda3/envs/infercode/lib/python3.8/site-packages/infercode/client/base_client.py", line 112, in init_utils
    self.ast_parser = ASTParser(self.language)
  File "/data/anaconda3/envs/infercode/lib/python3.8/site-packages/infercode/data_utils/ast_parser.py", line 81, in __init__
    self.parser.set_language(lang)
TypeError: Argument to set_language must be a Language

Cannot download dataset

Thanks for your great work. I got the following error while trying to download the dataset. Could you please send me the dataset?

[image: screenshot of the error]

How to specify the output vector size?

Thank you for your wonderful work.

I have a question: the output seems to be a 100-dimensional vector by default. How can I change the output vector size to a specific number?

Configure InferCode parser to parse specific language

Hello,
Is the following the correct way to set up the parser to read Python code, please?

from infercode.client.infercode_client import InferCodeClient
import os
import logging
logging.basicConfig(level=logging.INFO)

# Change from -1 to 0 to enable GPU
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"

infercode = InferCodeClient(language="python")

Also, for Java and other languages, should we use language="julia", language="java", etc.?

Download the pretrained model to generate embeddings

Thank you InferCode team for this work.

Can you please provide a link to your pretrained model? Does this model work with any language, or do we have to train it from scratch for the target language? Also, can we provide a CSV file in which each row contains one program?
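Whether the package reads CSV files directly is not stated here. Since encode in the README takes a list of code strings, one possible approach is to read the programs out of the CSV first. The sketch below assumes a header row with a column named `code`; both the column name and the `read_snippets` helper are hypothetical.

```python
import csv
import io


def read_snippets(csv_file, column="code"):
    """Return the non-empty values of `column`, one code snippet per CSV row."""
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader if row.get(column)]


# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO('code\n"for (i = 0; i < n; i++)"\n"int x = 1;"\n')
snippets = read_snippets(sample)

# With a client set up as in the README, the list could then be passed on:
# vectors = infercode.encode(snippets)
```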

Regards,

Provision of an explicit license

Hello, if possible could you please provide an explicit license for the usage of the code contained in this repository? Many thanks.

FileNotFoundError

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'C:\Users\Lenovo\.tree-sitter\tree-sitter-parsers-Windows'

Hello, how can I solve this problem?

AttributeError: 'InferCodeClient' object has no attribute 'ast_parser'

Hello,

While running your model based on your sample below:

from infercode.client.infercode_client import InferCodeClient
import os
import logging
logging.basicConfig(level=logging.INFO)

# Change from -1 to 0 to enable GPU
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"

infercode = InferCodeClient(language="c")
infercode.init_from_config()
vectors = infercode.encode(["for (i = 0; i < n; i++)", "struct book{ int num; char s[27]; }shu[1000];"])

I looped over multiple code strings. It produced output at first and then stopped unexpectedly after some iterations:

AttributeError: 'InferCodeClient' object has no attribute 'ast_parser'

Can you help with this please?
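The cause of this error is not established in this thread. When looping over many snippets, it can still help to find out which inputs trigger the failure. The sketch below wraps an arbitrary encode callable, processes the input in batches, and records failing batches instead of stopping. `encode_in_batches` is a hypothetical helper; with the README client, the `encode` argument would be `infercode.encode`.

```python
def encode_in_batches(encode, snippets, batch_size=32):
    """Call `encode` on fixed-size batches; record failing batches and continue."""
    results, failures = [], []
    for start in range(0, len(snippets), batch_size):
        batch = snippets[start:start + batch_size]
        try:
            results.extend(encode(batch))
        except Exception as exc:
            # Keep the start index so the offending inputs can be inspected later.
            failures.append((start, repr(exc)))
    return results, failures
```

Note that after a failure the returned vectors no longer line up one-to-one with the input list, so the recorded start indices are needed to match them up.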

Executing infercode with 2 files fails

Related to https://github.com/bdqnghi/infercode/blob/master/infercode/__main__.py .

Executing infercode with 1 file succeeds.

python3.8 -m infercode file1.c

But executing infercode with 2 files fails.

python3.8 -m infercode file1.c file2.c

The error message:

Traceback (most recent call last):
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: Key dense_2/kernel not found in checkpoint
         [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1297, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key dense_2/kernel not found in checkpoint
         [[node save_1/RestoreV2 (defined at WorkSpace/infercode/infercode/client/infercode_client.py:45) ]]

Original stack trace for 'save_1/RestoreV2':
  File "usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "WorkSpace/infercode/infercode/__main__.py", line 61, in <module>
    main()
  File "WorkSpace/infercode/infercode/__main__.py", line 53, in main
    infercode.init_from_config()
  File "WorkSpace/infercode/infercode/client/infercode_client.py", line 45, in init_from_config
    self.saver = tf.train.Saver(save_relative_paths=True, max_to_keep=5)
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 835, in __init__
    self.build()
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 847, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 514, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 334, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 582, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1508, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3528, in _create_op_internal
    ret = Operation(
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 69, in get_tensor
    return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1308, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1626, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 74, in get_tensor
    error_translator(e)
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 35, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/WorkSpace/infercode/infercode/__main__.py", line 61, in <module>
    main()
  File "/WorkSpace/infercode/infercode/__main__.py", line 53, in main
    infercode.init_from_config()
  File "/WorkSpace/infercode/infercode/client/infercode_client.py", line 54, in init_from_config
    self.saver.restore(self.sess, ckpt.model_checkpoint_path)
  File "/home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1313, in restore
    raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key dense_2/kernel not found in checkpoint
         [[node save_1/RestoreV2 (defined at WorkSpace/infercode/infercode/client/infercode_client.py:45) ]]

Original stack trace for 'save_1/RestoreV2':
  File "usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "WorkSpace/infercode/infercode/__main__.py", line 61, in <module>
    main()
  File "WorkSpace/infercode/infercode/__main__.py", line 53, in main
    infercode.init_from_config()
  File "WorkSpace/infercode/infercode/client/infercode_client.py", line 45, in init_from_config
    self.saver = tf.train.Saver(save_relative_paths=True, max_to_keep=5)
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 835, in __init__
    self.build()
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 847, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 514, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 334, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 582, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1508, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3528, in _create_op_internal
    ret = Operation(
  File "home/username/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()
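One possible reading of "Key dense_2/kernel not found in checkpoint" is that the model graph was built more than once in the same process. In TensorFlow 1.x graph mode, layers created again in the same default graph receive suffixed names, which would not match the keys stored in the checkpoint. This is an assumption, not a confirmed diagnosis. Under that assumption, a workaround is to build one client per language and pass it all files of that language in a single encode call. The grouping step can be sketched as follows; `group_by_extension` is a hypothetical helper.

```python
import os
from collections import defaultdict


def group_by_extension(paths):
    """Group file paths by lowercased extension, preserving input order."""
    groups = defaultdict(list)
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        groups[ext].append(path)
    return dict(groups)
```

For `file1.c file2.c`, this yields a single `.c` group, so one client could read both files and encode their contents together. The sketch does not address loading models for several languages in one process.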

Wondering how to use InferCode to predict method names

Hi all,

Thanks for bringing this excellent work up.

I have played with InferCode for a while and am thinking about how to properly predict method names.

I figured out that InferCode can generate the encoded vectors. Hence, I am thinking we may need to train an additional model to predict names from the vectors. Is this the right way to go? If so, would you mind sharing some ideas or insights about what the architecture of the new model should look like?

Any suggestion would be very much appreciated. Thanks!

Infercode fine-tuning

Thank you for your awesome work; I can run it successfully. Now I want to fine-tune the InferCode model for a downstream task (classification of programs by their functionality). Is this the right way to go? If so, would you mind sharing some ideas or insights about what the architecture of the new model should look like?
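The thread does not say how the authors fine-tune the model. A simpler baseline that needs no changes to InferCode is to keep the embeddings frozen and train a small classifier on top of them. The sketch below uses a nearest-centroid classifier as a stand-in; it is not the authors' method, and the toy 2-dimensional vectors and labels are made up for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the embedding vectors of each class into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy data standing in for embeddings returned by encode().
train_vectors = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
train_labels = ["sort", "sort", "search", "search"]
centroids = fit_centroids(train_vectors, train_labels)
```

If such a baseline performs poorly, that would be a reason to look at updating the encoder weights as well.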

Error when running two Java files at the same time

infercode Book.java CoolService.java
[[ 0.38303974 1.4843202 0.61839557 0.14034976 0.02223869 -0.7740696
-0.5937896 -0.58639425 -0.5340922 0.16736332 -0.41043946 -0.11949562
0.92982614 -0.1356623 -0.4808729 0.28710333 -0.34817076 0.5575525
0.15530032 1.6053647 0.5898889 -0.4097566 -0.4019269 -0.6872514
0.55460155 -0.22991975 -0.39823616 -0.48058054 -0.22132947 -0.6536728
-0.27846056 -0.57694393 0.09179881 -0.35203043 -0.3749781 -0.35520336
0.49900222 -0.3916241 -0.78766006 -0.58723456 -0.3593774 -0.4304761
-0.3096843 -0.21838556 2.4091513 0.90175235 -0.5389576 1.4856302
-0.54281265 0.6444931 1.3740956 2.8259175 1.1957991 -0.44408906
-0.18730186 -0.38441432 -0.17307334 0.01391467 -0.37244347 0.25457305
0.15234028 0.9025265 0.5072829 0.01281878 -0.08297866 0.10418227
-0.3482946 -0.24725845 -0.40372172 -0.78977174 -0.3094635 -0.65167886
-0.29972795 1.6762887 0.74273336 -0.80558974 1.6752175 -0.6440183
-0.5221743 0.5496853 1.7443756 -0.04492668 -0.37624377 -0.43881357
0.10893515 -0.62408113 -0.28724825 1.6614692 0.42661044 0.8835053
-0.2010308 0.71462727 -0.36316925 0.09840436 1.3291454 -0.38315466
0.97000957 1.8023095 -0.35803017 -0.8325385 ]]
Traceback (most recent call last):
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: Key dense_2/kernel not found in checkpoint
[[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1297, in restore
sess.run(self.saver_def.restore_op_name,
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key dense_2/kernel not found in checkpoint
[[node save_1/RestoreV2 (defined at /site-packages/infercode/client/infercode_client.py:45) ]]

Original stack trace for 'save_1/RestoreV2':
File "/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/site-packages/infercode/__main__.py", line 61, in <module>
main()
File "/site-packages/infercode/__main__.py", line 53, in main
infercode.init_from_config()
File "/site-packages/infercode/client/infercode_client.py", line 45, in init_from_config
self.saver = tf.train.Saver(save_relative_paths=True, max_to_keep=5)
File "/site-packages/tensorflow/python/training/saver.py", line 835, in __init__
self.build()
File "/site-packages/tensorflow/python/training/saver.py", line 847, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/site-packages/tensorflow/python/training/saver.py", line 875, in _build
self.saver_def = self._builder._build_internal( # pylint: disable=protected-access
File "/site-packages/tensorflow/python/training/saver.py", line 514, in _build_internal
restore_op = self._AddRestoreOps(filename_tensor, saveables,
File "/site-packages/tensorflow/python/training/saver.py", line 334, in _AddRestoreOps
all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
File "/site-packages/tensorflow/python/training/saver.py", line 582, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1508, in restore_v2
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/site-packages/tensorflow/python/framework/ops.py", line 3528, in _create_op_internal
ret = Operation(
File "/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 69, in get_tensor
return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1308, in restore
names_to_keys = object_graph_key_mapping(save_path)
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1626, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 74, in get_tensor
error_translator(e)
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 35, in error_translator
raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gpu/.conda/envs/infercode/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gpu/.conda/envs/infercode/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/infercode/__main__.py", line 61, in <module>
main()
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/infercode/__main__.py", line 53, in main
infercode.init_from_config()
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/infercode/client/infercode_client.py", line 54, in init_from_config
self.saver.restore(self.sess, ckpt.model_checkpoint_path)
File "/home/gpu/.conda/envs/infercode/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1313, in restore
raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key dense_2/kernel not found in checkpoint
[[node save_1/RestoreV2 (defined at /site-packages/infercode/client/infercode_client.py:45) ]]

Original stack trace for 'save_1/RestoreV2':
File "/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/site-packages/infercode/__main__.py", line 61, in <module>
main()
File "/site-packages/infercode/__main__.py", line 53, in main
infercode.init_from_config()
File "/site-packages/infercode/client/infercode_client.py", line 45, in init_from_config
self.saver = tf.train.Saver(save_relative_paths=True, max_to_keep=5)
File "/site-packages/tensorflow/python/training/saver.py", line 835, in __init__
self.build()
File "/site-packages/tensorflow/python/training/saver.py", line 847, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/site-packages/tensorflow/python/training/saver.py", line 875, in _build
self.saver_def = self._builder._build_internal( # pylint: disable=protected-access
File "/site-packages/tensorflow/python/training/saver.py", line 514, in _build_internal
restore_op = self._AddRestoreOps(filename_tensor, saveables,
File "/site-packages/tensorflow/python/training/saver.py", line 334, in _AddRestoreOps
all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
File "/site-packages/tensorflow/python/training/saver.py", line 582, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1508, in restore_v2
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/site-packages/tensorflow/python/framework/ops.py", line 3528, in _create_op_internal
ret = Operation(
File "/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
self._traceback = tf_stack.extract_stack()

Method Name Prediction

I am interested in the InferCode paper and have a question about method name prediction. Do you use the frequency of the normalized method names to filter out rare methods for the Java-small data? Code2vec removes the rare names. Which prediction model do you use? Would you be willing to share the code?

Using F1 score or accuracy

Hello,

If someone uses your source code embeddings for a downstream task that your model supports, do you recommend evaluating with F1 score or accuracy, and why? If the F1 score is low but the accuracy is high, what would you recommend?
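The situation described here, high accuracy with a low F1 score, typically arises on imbalanced data. The worked example below is generic and not specific to InferCode: a classifier that always predicts the majority class scores 0.9 accuracy on a 9:1 split, while its macro-averaged F1 stays below 0.5 because the minority class is never found.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [0] * 9 + [1]   # nine majority-class samples, one minority-class sample
y_pred = [0] * 10        # the classifier always predicts the majority class

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here the per-class F1 is 1.8 / 1.9 for class 0 and 0 for class 1, so the macro average is roughly 0.47 against an accuracy of 0.9.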

Training the model

Awesome work. I am trying to train the model on my own data.
Where is the model saved after fine-tuning? I tried the example from the test script. Also, how can I load the saved model for inference?
Thanks a lot.

Code error for subtrees bucket selection while training

For the old version, I think there is an error at line 242 of infercode/old_version/utils/data/tree_loader.py. For training, the subtrees bucket should be all_subtrees_bucket rather than random_subtrees_bucket.

def make_minibatch_iterator(self):
        buckets = self.random_subtrees_bucket # line 242
        # This part is important
        if not self.is_training:
            print("Using random subtrees buckets...........")
            buckets = self.random_subtrees_bucket
        else:
            print("Using all subtrees buckets...........")
        bucket_ids = list(buckets.keys())
        random.shuffle(bucket_ids)
        ......
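The change proposed in this report can be sketched as follows. This is a simplified, hypothetical stand-in for the loader, not the repository's code: only the bucket selection is modeled, and it assumes that all_subtrees_bucket and random_subtrees_bucket are dictionaries keyed by bucket id.

```python
import random


class TreeLoaderSketch:
    """Hypothetical stand-in for the loader; models only the bucket choice."""

    def __init__(self, all_subtrees_bucket, random_subtrees_bucket, is_training):
        self.all_subtrees_bucket = all_subtrees_bucket
        self.random_subtrees_bucket = random_subtrees_bucket
        self.is_training = is_training

    def select_buckets(self):
        # Proposed fix: default to all subtrees, which is what training should use.
        buckets = self.all_subtrees_bucket
        if not self.is_training:
            # Outside of training, fall back to the randomly sampled subtrees.
            buckets = self.random_subtrees_bucket
        return buckets

    def shuffled_bucket_ids(self):
        bucket_ids = list(self.select_buckets().keys())
        random.shuffle(bucket_ids)
        return bucket_ids


# Placeholder buckets for demonstration only.
all_buckets = {0: ["a", "b", "c"], 1: ["d", "e"]}
random_buckets = {0: ["a"]}

train_loader = TreeLoaderSketch(all_buckets, random_buckets, is_training=True)
eval_loader = TreeLoaderSketch(all_buckets, random_buckets, is_training=False)
print(sorted(train_loader.shuffled_bucket_ids()), eval_loader.shuffled_bucket_ids())
```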

Query about the InferCode datasets

Thanks for your wonderful work; it gave me a lot of inspiration! May I ask about the dataset and the data processing details used in the paper? As you said, the current repo contains only part of the data. To run InferCode, the datasets for "train_path", "train_label_path", "val_path", "val_label_path", "subtree_vocabulary_path", etc. are still missing. Could you please describe these data formats or provide a few samples of those files so that we can run InferCode? Another question: how did you extract the subtrees of an AST? Could you share the details of the extraction code?

Looking forward to your reply!

Thanks a lot!

Sincerely student.

Question: Where is the filtering of AST node types reflected?

Hi, thanks a lot for your work!
My name is Zack. I have read the paper, which says that this work selects several node types during processing, but I cannot find the corresponding section in the code. Could you give me some guidance?

Looking forward to your reply! Thank you!

Comparing 2 source codes based on their InferCode representation

Hello Dear Authors,

Suppose we have InferCode embeddings for the same code written in two different languages (Python and Java). How does InferCode decide that they are similar, given that the ASTs for adding two integers in Java and Python differ in their node types?
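For context, one common way to compare two embedding vectors is cosine similarity. The sketch below assumes that measure; the source does not confirm how InferCode embeddings are meant to be compared, and the two vectors are made-up placeholders rather than real InferCode output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for the embeddings of a Java and a Python snippet.
java_vector = [0.2, 0.1, 0.7]
python_vector = [0.25, 0.05, 0.65]

print(cosine_similarity(java_vector, python_vector))
```

Whether snippets with the same behavior in different languages end up close under such a measure depends on the trained model, which is the point of the question.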

Where are processing_data.sh, CodeClassificationData, CorderModel, yijun/fast and requirements.txt?

Thanks for your wonderful work!
I'm trying to reproduce the results, but I got stuck while following the steps in the README:

  1. In the Data Preparation section, I am told to run source process_data.sh, but I can't find process_data.sh in any version; only the README mentions it. The newest version has a process_data.py, but it does not seem to be implemented yet (only TreeSitterDataProcessor's constructor is called; no processing happens).
  2. I tried skipping the data processing to see whether the existing data (downloaded by python3 download_data.py, named OJ_pycparser_train_test_val.zip) can be used to train the model. I then noticed that the class CodeClassificationData can only be found in the old version of this repo (since 5a89b64); after that, tree_loader.py was deleted.
  3. I then checked out an older version (5a89b64) and found that CorderModel cannot be found. I changed it to InferCodeModel and then encountered path issues in tree_loader.py.

In features_file_path_splits[-4] = subtree_features_directory.split("/")[-2], changing both indexes to 0 solved this issue. Then I put the .ids.csv files together with the .pkl files.

  4. I found #2 and followed the instructions. I managed to run the examples on CPU with tensorflow==1.15.0 installed via conda, but failed with tensorflow-gpu==1.15.0. The README indicates that there is a requirements.txt file, but I couldn't find it in the repo.

So these are my questions:
0. Where are processing_data.sh, CodeClassificationData, and CorderModel? Is the deleted CodeClassificationData still valid? What is the relationship between CorderModel and InferCodeModel?

  1. It seems that you are migrating the AST parser from srcml (yijun/fast) to tree-sitter, and this repo is under heavy development. But results may differ with a different AST parser (as mentioned in the README). So, in order to replicate this study, should I follow the newest version, or is there a minor version meant for replicating the study?
  2. I installed the newest pyarrow==3.0.0, scikit-learn==0.24.1, bidict==0.21.2, and keras-radam==0.15.0 with python==3.7.10; tensorflow==1.15.0 is installed via conda. Is this environment OK? Or better, is there a requirements.txt for reference?
  3. Since I can't find the source code of the yijun/fast image, where can I find the details of how the ASTs are processed?
  4. How do I use the existing .pkl files (in OJ_pycparser_train_test_val.zip)? Are they relevant to replicating the study?

Sincerely student.

OOM error

I'm trying to get the vector representations of decompiled source code.

For example:

# source_paths: list of paths to the decompiled source files
for src in source_paths:
    with open(src, 'r') as f:
        vector = infercode.encode([f.read()])[0]
        print(vector)

When I run the code, some decompiled source files trigger an OOM error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,37778,1410,7,100] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[node network/embedding_lookup_2 (defined at \ProgramData\Anaconda3\envs\infercode\lib\site-packages\infercode-0.0.24-py3.7.egg\infercode\network\infercode_network.py:341) ]]

The source files that trigger the OOM error are around 5000 lines long.

Is there any solution?
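The shape in the error message already shows why the allocation fails. A quick calculation, assuming the "float" in the message is a 32-bit float (4 bytes per element), puts this single tensor at roughly 149 GB. What each dimension represents is not stated in the report.

```python
from math import prod

# Shape copied from the ResourceExhaustedError message above.
shape = (1, 37778, 1410, 7, 100)

# Assumption: "type float" means a 32-bit float, i.e. 4 bytes per element.
BYTES_PER_FLOAT32 = 4

num_elements = prod(shape)
num_bytes = num_elements * BYTES_PER_FLOAT32

print(f"{num_elements:,} elements, about {num_bytes / 1e9:.1f} GB")
```

A tensor of this size will not fit in the memory of a typical machine, so very large inputs would need to be reduced before encoding.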

Question: Is there a reason that the batch size is capped at 5?

Hey, first off, love the repo, thanks for providing it.

I just have a question about the batch size that the model can handle.

When I put in a list of more than 5 pieces of code like so:

from infercode.client.infercode_client import InferCodeClient
import os
import logging
logging.basicConfig(level=logging.INFO)

# Change from -1 to 0 to enable GPU
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

infercode = InferCodeClient(language="java")
infercode.init_from_config()

# Here we pass in 6 identical initializations of i
vectors = infercode.encode(["int i = 0;"] * 6)

I get the following error:

AssertionError                            Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 vectors = infercode.encode(["int i = 0;"] * 6)

File ~\Anaconda3\envs\infercode_new_env\lib\site-packages\infercode\client\infercode_client.py:76, in InferCodeClient.encode(self, batch_code_snippets)
     75 def encode(self, batch_code_snippets):
---> 76     tensors = self.snippets_to_tensors(batch_code_snippets)
     77     embeddings = self.sess.run(
     78         [self.infercode_model.code_vector],
     79         feed_dict={
   (...)
     87         }
     88     )
     89     return embeddings[0]

File ~\Anaconda3\envs\infercode_new_env\lib\site-packages\infercode\client\infercode_client.py:62, in InferCodeClient.snippets_to_tensors(self, batch_code_snippets)
     60 def snippets_to_tensors(self, batch_code_snippets):
     61     batch_tree_indexes = []
---> 62     assert len(batch_code_snippets) <= 5
     63     for code_snippet in batch_code_snippets:
     64         # tree-sitter parser requires bytes as the input, not string
     65         code_snippet_to_byte = str.encode(code_snippet)

AssertionError: 

This stems from an assert in the code that requires the number of inputs to be <= 5:

assert len(batch_code_snippets) <= 5

Is there a reason this is hard-coded? Or would it make sense to make a batch_size parameter that maybe defaults to 5 but is adjustable depending on computational capacity?
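Until the limit is configurable, one possible workaround is to split the input into chunks of at most 5 snippets and encode each chunk separately. The sketch below is a hypothetical helper, not part of the infercode package; it takes the encode function as a parameter and uses a stand-in encoder so that it runs without the library.

```python
def encode_in_batches(encode_fn, snippets, batch_size=5):
    """Call encode_fn on chunks of at most batch_size snippets and collect the vectors in order."""
    vectors = []
    for start in range(0, len(snippets), batch_size):
        batch = snippets[start:start + batch_size]
        # encode_fn is expected to return one vector per snippet in the batch.
        vectors.extend(encode_fn(batch))
    return vectors


# Stand-in encoder for demonstration only; it mimics the assert in snippets_to_tensors.
def fake_encode(batch):
    assert len(batch) <= 5
    return [[float(len(code))] for code in batch]


vectors = encode_in_batches(fake_encode, ["int i = 0;"] * 6)
print(len(vectors))
```

With the real client, one would pass infercode.encode as encode_fn. This assumes that encoding snippets in separate calls gives the same vectors as encoding them together, which the source does not confirm.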

FileNotFoundError:

[WinError 2] The system cannot find the file specified: 'C:\Users\Administrator\.tree-sitter\tree-sitter-parsers-Windows'. I ran rm C:\Users\Administrator\.tree-sitter and then pip install tree-sitter-parsers, but the problem persists. Environment: Windows, miniconda, Python 3.8. The same error occurs on Mac and Ubuntu.

'InferCodeClient' object has no attribute 'ast_parser'

I have successfully used InferCodeClient to generate vectors for C code by following the hints in the README. However, when I change the language to Java with InferCodeClient(language="java"), the encode function raises an error.

Does the InferCode PyPI package currently support languages other than C?

How to initialize two InferCodeClients with different languages?

Hi, I tried to use InferCode to generate vectors for Java and C, and I want to initialize two InferCodeClients as the following code shows, but at line 6, infercode_c_predictor.init_from_config() raises the following exception.

from infercode.client.infercode_client import InferCodeClient

infercode_java_predictor = InferCodeClient(language="java")
infercode_java_predictor.init_from_config()
infercode_c_predictor = InferCodeClient(language="c")
infercode_c_predictor.init_from_config()   # Line 6

Error message:
Exception has occurred: NotFoundError
Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key dense_2/kernel not found in checkpoint
	 [[node save_1/RestoreV2 (defined at /home/******/.local/lib/python3.8/site-packages/infercode/client/infercode_client.py:45) ]]

......

So is there any way to have two InferCodeClients with different languages at the same time?

Thank you for your help in advance.
