
code2vec's Introduction

Code2vec

A neural network for learning distributed representations of code. This is an official implementation of the model described in:

Uri Alon, Meital Zilberstein, Omer Levy and Eran Yahav, "code2vec: Learning Distributed Representations of Code", POPL'2019 [PDF]

October 2018 - The paper was accepted to POPL'2019!

April 2019 - The talk video is available here.

July 2019 - Added a tf.keras model implementation (see here).

An online demo is available at https://code2vec.org/.

See also:

  • code2seq (ICLR'2019) is our newer model. It uses LSTMs to encode paths node-by-node (rather than monolithic path embeddings as in code2vec), and an LSTM to decode a target sequence (rather than predicting a single label at a time as in code2vec). See PDF, demo at http://www.code2seq.org and code.
  • Structural Language Models of Code is a new paper that learns to generate the missing code within a larger code snippet. This is similar to code completion, but is able to predict complex expressions rather than a single token at a time. See PDF, demo at http://AnyCodeGen.org.
  • Adversarial Examples for Models of Code is a new paper that shows how to slightly mutate the input code snippet of code2vec and GNNs models (thus, introducing adversarial examples), such that the model (code2vec or GNNs) will output a prediction of our choice. See PDF (code: soon).
  • Neural Reverse Engineering of Stripped Binaries is a new paper that learns to predict procedure names in stripped binaries, thus using neural networks for reverse engineering. See PDF (code: soon).

This is a TensorFlow implementation, designed to be easy and useful in research, and for experimenting with new ideas in machine learning for code tasks. By default, it learns Java source code and predicts Java method names, but it can be easily extended to other languages, since the TensorFlow network is agnostic to the input programming language (see Extending to other languages). Contributions are welcome. This repo contains two model implementations: the first uses pure TensorFlow and the second uses TensorFlow's Keras (more details).

Table of Contents

  • Requirements
  • Quickstart
  • Configuration
  • Features
  • Extending to other languages
  • Additional datasets
  • Citation

Requirements

On Ubuntu:

  • Python3 (>=3.6). To check the version:

python3 --version

  • TensorFlow - version 2.0.0 (install). To check TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'

  • If you are using a GPU, you will need CUDA 10.0 (download), as this is the version required by TensorFlow 2.0. To check the CUDA version:

nvcc --version

  • For GPU: cuDNN (>=7.5) (download). To check the cuDNN version:

cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
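
  • Optionally, to verify that TensorFlow detects your GPU, you can run (tf.test.is_gpu_available exists in TensorFlow 2.0, though it is deprecated in later releases):

python3 -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'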

Quickstart

Step 0: Cloning this repository

git clone https://github.com/tech-srl/code2vec
cd code2vec

Step 1: Creating a new dataset from java sources

In order to have a preprocessed dataset to train a network on, you can either download our preprocessed dataset, or create a new dataset of your own.

Download our preprocessed dataset of ~14M examples (compressed: 6.3GB, extracted 32GB)

wget https://s3.amazonaws.com/code2vec/data/java14m_data.tar.gz
tar -xvzf java14m_data.tar.gz

This will create a data/java14m/ sub-directory containing the files that hold the training, test and validation sets, and a vocabulary file for various dataset properties.

Creating and preprocessing a new Java dataset

In order to create and preprocess a new dataset (for example, to compare code2vec to another model on another dataset):

  • Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
  • Run the preprocess.sh file:

source preprocess.sh

Step 2: Training a model

You can either download an already-trained model, or train a new model using a preprocessed dataset.

Downloading a trained model (1.4 GB)

We already trained a model for 8 epochs on the data that was preprocessed in the previous step. The number of epochs was chosen using early stopping, as the version that maximized the F1 score on the validation set. This model can be downloaded here or using:

wget https://s3.amazonaws.com/code2vec/model/java14m_model.tar.gz
tar -xvzf java14m_model.tar.gz
Note:

This trained model is in a "released" state, which means that we stripped it of its training parameters; it can thus be used for inference, but cannot be further trained. If you use this trained model in the next steps, use 'saved_model_iter8.release' instead of 'saved_model_iter8' in every command-line example that loads the model (i.e., '--load models/java14_model/saved_model_iter8.release'). To read how to release a model, see Releasing the model.

Downloading a trained model (3.5 GB) which can be further trained

A non-stripped trained model can be obtained here or using:

wget https://s3.amazonaws.com/code2vec/model/java14m_model_trainable.tar.gz
tar -xvzf java14m_model_trainable.tar.gz

This model weighs more than twice as much as the stripped version, and is recommended only if you wish to continue training an already-trained model. To continue training this trained model, use the --load flag to load the trained model, the --data flag to point to the new dataset to train on, and the --save flag to provide a new save path.
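
For example (the --data and --save paths below are placeholders; replace them with your own preprocessed dataset prefix and desired output location):

python3 code2vec.py --load models/java14_model/saved_model_iter8 --data data/my_dataset/my_dataset --save models/my_new_model/saved_model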

A model that was trained on the Java-large dataset

We provide an additional code2vec model that was trained on the "Java-large" dataset (this dataset was introduced in the code2seq paper). See Java-large under Additional datasets below.

Training a model from scratch

To train a model from scratch:

  • Edit the file train.sh to point it to the right preprocessed data. By default, it points to our "java14m" dataset that was preprocessed in the previous step.
  • Before training, you can edit the configuration hyper-parameters in the file config.py, as explained in Configuration.
  • Run the train.sh script:
source train.sh
Notes:
  1. By default, the network is evaluated on the validation set after every training epoch.
  2. The 10 newest checkpoints are kept (older ones are deleted automatically). This can be changed, but keeping more checkpoints consumes more disk space.
  3. By default, the network trains for 20 epochs. These settings can be changed by editing the file config.py (see the sketch below). Training on a Tesla V100 GPU takes about 50 minutes per epoch; training on a Tesla K80 takes about 4 hours per epoch.
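
For example, a minimal sketch of such an edit in config.py (both parameters are documented in the Configuration section below; everything else can stay at its default):

config.NUM_TRAIN_EPOCHS = 10   # train for at most 10 epochs instead of 20
config.MAX_TO_KEEP = 5         # keep only the 5 newest checkpoints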

Step 3: Evaluating a trained model

Once the score on the validation set stops improving over time, you can stop the training process (by killing it) and pick the iteration that performed best on the validation set. Assuming iteration #8 is the chosen model, run:

python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test data/java14m/java14m.test.c2v

While evaluating, a file named "log.txt" is written, containing each test example's name and the model's prediction.

Step 4: Manual examination of a trained model

To manually examine a trained model, run:

python3 code2vec.py --load models/java14_model/saved_model_iter8.release --predict

After the model loads, follow the instructions: edit the file Input.java, enter a Java method or code snippet, and examine the model's predictions and attention scores.

Configuration

Changing hyper-parameters is possible by editing the file config.py.

Here are some of the parameters and their description:

config.NUM_TRAIN_EPOCHS = 20

The maximum number of epochs to train the model. Stopping earlier must be done manually (by killing the process).

config.SAVE_EVERY_EPOCHS = 1

After how many training epochs the model should be saved.

config.TRAIN_BATCH_SIZE = 1024

Batch size in training.

config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE

Batch size during evaluation. This affects only evaluation speed and memory consumption; it does not affect the results.

config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10

Number of words with the highest scores in $\hat{y}$ to consider during prediction and evaluation.

config.NUM_BATCHES_TO_LOG_PROGRESS = 100

Number of batches (during training / evaluating) to complete between two progress-logging records.

config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100

Number of training batches to complete between model evaluations on the test set.

config.READER_NUM_PARALLEL_BATCHES = 4

The number of threads enqueuing examples to the reader queue.

config.SHUFFLE_BUFFER_SIZE = 10000

Size of the reader buffer within which examples are shuffled during training. A bigger buffer provides better randomness, but requires more memory and may harm training throughput.

config.CSV_BUFFER_SIZE = 100 * 1024 * 1024 # 100 MB

The buffer size (in bytes) of the CSV dataset reader.

config.MAX_CONTEXTS = 200

The number of contexts to use in each example.

config.MAX_TOKEN_VOCAB_SIZE = 1301136

The max size of the token vocabulary.

config.MAX_TARGET_VOCAB_SIZE = 261245

The max size of the target words vocabulary.

config.MAX_PATH_VOCAB_SIZE = 911417

The max size of the path vocabulary.

config.DEFAULT_EMBEDDINGS_SIZE = 128

Default embedding size to be used for token and path if not specified otherwise.

config.TOKEN_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE

Embedding size for tokens.

config.PATH_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE

Embedding size for paths.

config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE

Size of code vectors.

config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE

Embedding size for target words.

config.MAX_TO_KEEP = 10

Keep this number of newest trained versions during training.

config.DROPOUT_KEEP_RATE = 0.75

The keep rate of dropout used during training (i.e., a dropout probability of 0.25).

config.SEPARATE_OOV_AND_PAD = False

Whether to treat <OOV> and <PAD> as two different special tokens whenever possible.
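
To make the size relationships above concrete: the vector of each context is the concatenation of one path embedding and two token embeddings, so with the default sizes the code vector (and therefore the target embedding) has 128 + 2 * 128 = 384 dimensions rather than 128. A minimal sketch using the default values listed above:

DEFAULT_EMBEDDINGS_SIZE = 128
TOKEN_EMBEDDINGS_SIZE = DEFAULT_EMBEDDINGS_SIZE                       # 128
PATH_EMBEDDINGS_SIZE = DEFAULT_EMBEDDINGS_SIZE                        # 128
CODE_VECTOR_SIZE = PATH_EMBEDDINGS_SIZE + 2 * TOKEN_EMBEDDINGS_SIZE   # 384
TARGET_EMBEDDINGS_SIZE = CODE_VECTOR_SIZE                             # 384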

Features

Code2vec supports the following features:

Choosing implementation to use

This repo comes with two model implementations: (i) one that uses pure TensorFlow (written in tensorflow_model.py); (ii) one that uses TensorFlow's Keras (written in keras_model.py). The default implementation used by code2vec.py is the pure TensorFlow one. To explicitly choose the desired implementation, specify --framework tensorflow or --framework keras as an additional argument when executing the script code2vec.py. In particular, this argument can be added to any of the code2vec.py usage examples in this file. Note that in order to load a trained model (from file), one should use the same implementation that was used during its training.
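
For example, to run the manual-examination step (Step 4) with the pure TensorFlow implementation selected explicitly:

python3 code2vec.py --framework tensorflow --load models/java14_model/saved_model_iter8.release --predict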

Releasing the model

If you wish to keep a trained model for inference only (without the ability to continue training it) you can release the model using:

python3 code2vec.py --load models/java14_model/saved_model_iter8 --release

This will save a copy of the trained model with the '.release' suffix. A "released" model usually takes about a third of the disk space of the trainable version.

Exporting the trained token vectors and target vectors

Token and target embeddings are available to download:

[Token vectors] [Method name vectors]

These downloadable embeddings are saved without subtoken delimiters ("toLower" is saved as "tolower").

In order to export embeddings from a trained model, use the "--save_w2v" and "--save_t2v" flags:

Exporting the trained token embeddings:

python3 code2vec.py --load models/java14_model/saved_model_iter8.release --save_w2v models/java14_model/tokens.txt

Exporting the trained target (method name) embeddings:

python3 code2vec.py --load models/java14_model/saved_model_iter8.release --save_t2v models/java14_model/targets.txt

This saves the tokens/targets embedding matrices in word2vec format to the specified text file, in which the first line is <vocab_size> <dimension>, and each of the following lines contains <word> <float_1> <float_2> ... <float_dimension>.

These word2vec files can be manually parsed or easily loaded and inspected using the gensim python package:

python3
>>> from gensim.models import KeyedVectors as word2vec
>>> vectors_text_path = 'models/java14_model/targets.txt' # or: 'models/java14_model/tokens.txt'
>>> model = word2vec.load_word2vec_format(vectors_text_path, binary=False)
>>> model.most_similar(positive=['equals', 'to|lower']) # or: 'tolower', if using the downloaded embeddings
>>> model.most_similar(positive=['download', 'send'], negative=['receive'])

The above python commands will result in the closest name to both "equals" and "to|lower", which is "equals|ignore|case". Note: In embeddings that were exported manually using the "--save_w2v" or "--save_t2v" flags, the input token and target words are saved using the symbol "|" as a subtokens delimiter ("toLower" is saved as: "to|lower"). In the embeddings that are available to download (which are the same as in the paper), the "|" symbol is not used, thus "toLower" is saved as "tolower".
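
If you want to query a manually exported embedding file using full identifier names, a small helper can convert a camelCase name into the "|"-delimited subtoken form used by the export. This is a minimal sketch (not part of this repo), assuming simple camelCase/digit splitting; the JavaExtractor's own subtokenization may differ in edge cases:

import re

def to_subtoken_key(identifier):
    # e.g. 'toLower' -> 'to|lower', 'equalsIgnoreCase' -> 'equals|ignore|case'
    subtokens = re.findall(r'[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])', identifier)
    return '|'.join(s.lower() for s in subtokens)

print(to_subtoken_key('toLower'))  # prints: to|lower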

Exporting the code vectors for the given code examples

The flag --export_code_vectors allows exporting the code vectors for the given examples.

If used with the --test <TEST_FILE> flag, a file named <TEST_FILE>.vectors will be saved in the same directory as <TEST_FILE>. Each row in the saved file is the code vector of the code snippet in the corresponding row of <TEST_FILE>.
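
For downstream use, such a .vectors file can be loaded directly. A minimal sketch, assuming (as described above) one code vector per line, written as whitespace-separated floats; the path below is a placeholder for your own <TEST_FILE>.vectors:

import numpy as np

vectors = np.loadtxt('data/java14m/java14m.test.c2v.vectors')  # placeholder path
print(vectors.shape)  # (num_examples, 384) with the default configuration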

If used with the --predict flag, the code vector will be printed to console.

Extending to other languages

This project currently supports Java and C# as the input languages.

April 2020 - an extension for code2vec that addresses obfuscated Java code was developed by @basedrhys, and is available here: https://github.com/basedrhys/obfuscated-code2vec.

January 2020 - an extractor for predicting TypeScript type annotations for JavaScript input using code2vec was developed by @izosak and Noa Cohen, and is available here: https://github.com/tech-srl/id2vec.

June 2019 - an extractor for C# that is compatible with our model was developed by the CMU SEI team (it has since been removed by the CMU SEI team).

June 2019 - an extractor for Python, Java, C, C++ by JetBrains Research is available here: PathMiner.

In order to extend code2vec to work with other languages, a new extractor (similar to the JavaExtractor) should be implemented, and be called by preprocess.sh. Basically, an extractor should be able to output for each directory containing source files:

  • A single text file, where each row is an example.
  • Each example is a space-delimited list of fields, where:
  1. The first "word" is the target label, internally delimited by the "|" character.
  2. Each of the following "words" is a context, where each context has three components separated by commas (","). These components cannot include spaces or commas. We refer to these three components as a token, a path, and another token, but in general other types of ternary contexts can be considered.

For example, a possible novel Java context extraction for the following code example:

void fooBar() {
	System.out.println("Hello World");
}

Might be (in a new context extraction algorithm, which is different from ours since it doesn't use paths in the AST):

foo|Bar System,FIELD_ACCESS,out System.out,FIELD_ACCESS,println THE_METHOD,returns,void THE_METHOD,prints,"hello_world"

Consider the first example context "System,FIELD_ACCESS,out". In the current implementation, the 1st ("System") and 3rd ("out") components of a context are taken from the same "tokens" vocabulary, and the 2nd component ("FIELD_ACCESS") is taken from a separate "paths" vocabulary.
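
To make the expected line format concrete, here is a minimal, repo-independent sketch of parsing one such extractor output line (the function name is ours, for illustration only):

def parse_example(line):
    # '<target> <ctx_1> <ctx_2> ...', where each context is '<comp_1>,<comp_2>,<comp_3>'
    fields = line.strip().split(' ')
    target, raw_contexts = fields[0], fields[1:]
    contexts = [tuple(ctx.split(',')) for ctx in raw_contexts]
    return target, contexts

target, contexts = parse_example('foo|Bar System,FIELD_ACCESS,out System.out,FIELD_ACCESS,println')
# target == 'foo|Bar'; contexts[0] == ('System', 'FIELD_ACCESS', 'out')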

Additional datasets

We preprocessed three additional datasets used by the code2seq paper, using the code2vec preprocessing. These datasets are available in raw format (i.e., .java files) at https://github.com/tech-srl/code2seq/blob/master/README.md#datasets, and are also available to download in a preprocessed format (i.e., ready to train a code2vec model on) here:

Java-small (compressed: 366MB, extracted 1.9GB)

wget https://s3.amazonaws.com/code2vec/data/java-small_data.tar.gz
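
Then extract it (the java-med and java-large archives below are extracted the same way):

tar -xvzf java-small_data.tar.gz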

This dataset is based on the dataset of Allamanis et al. (ICML'2016), with the difference that training/validation/test are split by project rather than by file. This dataset contains 9 Java projects for training, 1 for validation and 1 for testing. Overall, it contains about 700K examples.

Java-med (compressed: 1.8GB, extracted 9.3GB)

wget https://s3.amazonaws.com/code2vec/data/java-med_data.tar.gz

A dataset of the 1000 top-starred Java projects from GitHub. It contains 800 projects for training, 100 for validation and 100 for testing. Overall, it contains about 4M examples.

Java-large (compressed: 7.2GB, extracted 37GB)

wget https://s3.amazonaws.com/code2vec/data/java-large_data.tar.gz

A dataset of the 9500 top-starred Java projects from GitHub that were created since January 2007. It contains 9000 projects for training, 200 for validation and 300 for testing. Overall, it contains about 16M examples.

Additionally, we provide a trained code2vec model that was trained on the Java-large dataset (this model was not part of the original code2vec paper, but was later used as a baseline in the code2seq paper which introduced this dataset). Trainable model (3.5 GB):

wget https://code2vec.s3.amazonaws.com/model/java-large-model.tar.gz

"Released model" (1.4 GB, cannot be further trained).

wget https://code2vec.s3.amazonaws.com/model/java-large-released-model.tar.gz

Citation

code2vec: Learning Distributed Representations of Code

@article{alon2019code2vec,
 author = {Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran},
 title = {Code2Vec: Learning Distributed Representations of Code},
 journal = {Proc. ACM Program. Lang.},
 issue_date = {January 2019},
 volume = {3},
 number = {POPL},
 month = jan,
 year = {2019},
 issn = {2475-1421},
 pages = {40:1--40:29},
 articleno = {40},
 numpages = {29},
 url = {http://doi.acm.org/10.1145/3290353},
 doi = {10.1145/3290353},
 acmid = {3290353},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {Big Code, Distributed Representations, Machine Learning},
}

code2vec's People

Contributors

ameerhajali, asvyatkovskiy, danglotb, dependabot[bot], eladn, itakeshi, joaorura, lidiancracy, mdrafiqulrabin, urialon, yahave


code2vec's Issues

Cannot run preprocess.sh for raw data

Hello Urialon

I tried to run preprocess.sh on raw Java files, but it always gives me the error: No such file or directory: 'java': 'java'

Could you check or give me solutions?

I really look forward to hearing from you

Regards
Dung

Question About Raw Dataset

Hi, thanks for your contributions.
I'm interested in reproducing code2vec from scratch (including preprocessing the raw data into the preprocessed dataset).

So, I want to ask whether you offer the raw dataset or not.
Is crawling it myself the only way to get the raw dataset? 😢

I look forward to hearing from you soon.
Sincerely,
Hyuna Shin

[Request] restore paper-version

Hi, thank you for providing the official implementation of your outstanding paper.

I found that you recently removed the paper-version branch from the repo. 1c92dab
I'd like to investigate your original implementation in detail in comparison with the current master branch.
I would be grateful if you could consider restoring the branch.

Thank you.

Need help with using the pre-trained model to get vector of an out of vocabulary method name

Hi,
I would like to use the pre-trained model to find the most similar name of a given word, as similar to the "Most similar feature" in the website https://code2vec.org .

However, when I input a method name that is out of the method-name vocabulary (I guess), for example "bitcounts", I get the error "word 'bitcounts' not in vocabulary".

Could you show me how to overcome this issue?

My code is as below:

from gensim.models import KeyedVectors as word2vec

vectors_text_path = 'C:/workplace/code2vec/models/java14_model/targets.txt'
model = word2vec.load_word2vec_format(vectors_text_path, binary=False)

print("input method name: ")
method_name = input()

print("most similar: ")
rs = model.wv.similar_by_word(method_name)
print(rs)

Thank you in advance

Questions on code2vec's usage

Hi,

First of all, I would like to thank you for this good job, it is awesome!

Then I would like to ask two questions:

  1. I want to train a model of the naming conventions of a test suite. What should I do to train the model? Should I make the scripts point to the root of the test suite, e.g. src/test/java/ in a typical Maven project? Or should I extract each test method from its test class and create a large fake Java file with all (and only) the test methods, like the Input.java file used when predicting the label of a method?

  2. What would you suggest in the case that we do not have a lot of data to train the model?

Thank you again.

Error on evaluating a trained model

When I follow the steps given in the README.md I get the following error:

[Two screenshots of the error output were attached to the original issue.]

I am using the release model from the s3 link provided, and I also renamed saved_model_iter8.release.data-00000-of-00001 to saved_model_iter8.release as recommended in README.md, but I still get the same error.

Problems when Extending to Python

Hello! I tried to extend code2vec to Python, but I encountered some problems.
I successfully used PathMiner to create the preprocessed data in a triple-component format (Token, Path, Token), but I am not able to run bash train.sh because I still need dataname.dict.c2s.

What is dataname.dict.c2s? And how can I deal with it?

Cannot run Step 4: Manual examination of a trained model

Hello,
I'm trying to run the command
python3 code2vec.py --load models/java14_model/saved_model_iter8.release --predict
but I get the error below:

~/code2vec$ python3 code2vec.py --load models/java14_model/saved_model_iter8.release --predict
2019-11-13 00:48:00.400601: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:48:00.400716: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2019-11-13 00:50:06.233454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-11-13 00:50:06.808643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-13 00:50:06.813225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 950M computeCapability: 5.0
coreClock: 0.928GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2019-11-13 00:50:06.813965: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:50:06.814661: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:50:06.815327: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:50:06.816016: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:50:06.816684: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:50:06.817445: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-11-13 00:50:11.048508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-13 00:50:11.048636: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1592] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-11-13 00:50:11.050286: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-13 00:50:11.541359: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2712000000 Hz
2019-11-13 00:50:11.542624: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55cd55e30800 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-13 00:50:11.542733: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2019-11-13 00:50:12.295303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-13 00:50:12.297222: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55cd56f8ec30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2019-11-13 00:50:12.297333: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 950M, Compute Capability 5.0
2019-11-13 00:50:12.297654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 00:50:12.297710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]
2019-11-13 00:50:12,299 INFO
2019-11-13 00:50:12,300 INFO
2019-11-13 00:50:12,300 INFO ---------------------------------------------------------------------
2019-11-13 00:50:12,300 INFO ---------------------------------------------------------------------
2019-11-13 00:50:12,301 INFO ---------------------- Creating word2vec model ----------------------
2019-11-13 00:50:12,301 INFO ---------------------------------------------------------------------
2019-11-13 00:50:12,301 INFO ---------------------------------------------------------------------
2019-11-13 00:50:12,301 INFO Checking number of examples ...
2019-11-13 00:50:12,302 INFO ---------------------------------------------------------------------
2019-11-13 00:50:12,302 INFO ----------------- Configuration - Hyper Parameters ------------------
2019-11-13 00:50:12,303 INFO CODE_VECTOR_SIZE 384
2019-11-13 00:50:12,303 INFO CSV_BUFFER_SIZE 104857600
2019-11-13 00:50:12,304 INFO DEFAULT_EMBEDDINGS_SIZE 128
2019-11-13 00:50:12,304 INFO DL_FRAMEWORK tensorflow
2019-11-13 00:50:12,304 INFO DROPOUT_KEEP_RATE 0.75
2019-11-13 00:50:12,305 INFO EXPORT_CODE_VECTORS False
2019-11-13 00:50:12,305 INFO LOGS_PATH None
2019-11-13 00:50:12,305 INFO MAX_CONTEXTS 200
2019-11-13 00:50:12,306 INFO MAX_PATH_VOCAB_SIZE 911417
2019-11-13 00:50:12,306 INFO MAX_TARGET_VOCAB_SIZE 261245
2019-11-13 00:50:12,306 INFO MAX_TOKEN_VOCAB_SIZE 1301136
2019-11-13 00:50:12,306 INFO MAX_TO_KEEP 10
2019-11-13 00:50:12,306 INFO MODEL_LOAD_PATH models/java14_model/saved_model_iter8.release
2019-11-13 00:50:12,307 INFO MODEL_SAVE_PATH None
2019-11-13 00:50:12,307 INFO NUM_BATCHES_TO_LOG_PROGRESS 100
2019-11-13 00:50:12,307 INFO NUM_TEST_EXAMPLES 0
2019-11-13 00:50:12,307 INFO NUM_TRAIN_BATCHES_TO_EVALUATE 1800
2019-11-13 00:50:12,307 INFO NUM_TRAIN_EPOCHS 20
2019-11-13 00:50:12,308 INFO NUM_TRAIN_EXAMPLES 0
2019-11-13 00:50:12,308 INFO PATH_EMBEDDINGS_SIZE 128
2019-11-13 00:50:12,308 INFO PREDICT True
2019-11-13 00:50:12,308 INFO READER_NUM_PARALLEL_BATCHES 6
2019-11-13 00:50:12,308 INFO RELEASE False
2019-11-13 00:50:12,309 INFO SAVE_EVERY_EPOCHS 1
2019-11-13 00:50:12,309 INFO SAVE_T2V None
2019-11-13 00:50:12,309 INFO SAVE_W2V None
2019-11-13 00:50:12,309 INFO SEPARATE_OOV_AND_PAD False
2019-11-13 00:50:12,309 INFO SHUFFLE_BUFFER_SIZE 10000
2019-11-13 00:50:12,309 INFO TARGET_EMBEDDINGS_SIZE 384
2019-11-13 00:50:12,310 INFO TEST_BATCH_SIZE 1024
2019-11-13 00:50:12,310 INFO TEST_DATA_PATH None
2019-11-13 00:50:12,310 INFO TOKEN_EMBEDDINGS_SIZE 128
2019-11-13 00:50:12,310 INFO TOP_K_WORDS_CONSIDERED_DURING_PREDICTION 10
2019-11-13 00:50:12,310 INFO TRAIN_BATCH_SIZE 1024
2019-11-13 00:50:12,311 INFO TRAIN_DATA_PATH_PREFIX None
2019-11-13 00:50:12,311 INFO USE_TENSORBOARD False
2019-11-13 00:50:12,311 INFO VERBOSE_MODE 1
2019-11-13 00:50:12,311 INFO _Config__logger <Logger code2vec (INFO)>
2019-11-13 00:50:12,311 INFO context_vector_size 384
2019-11-13 00:50:12,312 INFO entire_model_load_path models/java14_model/saved_model_iter8.release__entire-model
2019-11-13 00:50:12,312 INFO entire_model_save_path None
2019-11-13 00:50:12,312 INFO is_loading True
2019-11-13 00:50:12,312 INFO is_saving False
2019-11-13 00:50:12,312 INFO is_testing False
2019-11-13 00:50:12,313 INFO is_training False
2019-11-13 00:50:12,313 INFO model_load_dir models/java14_model
2019-11-13 00:50:12,313 INFO model_weights_load_path models/java14_model/saved_model_iter8.release__only-weights
2019-11-13 00:50:12,313 INFO model_weights_save_path None
2019-11-13 00:50:12,313 INFO test_steps 0
2019-11-13 00:50:12,313 INFO train_data_path None
2019-11-13 00:50:12,313 INFO train_steps_per_epoch 0
2019-11-13 00:50:12,313 INFO word_freq_dict_path None
2019-11-13 00:50:12,313 INFO ---------------------------------------------------------------------
2019-11-13 00:50:12,314 INFO Loading model vocabularies from: models/java14_model/dictionaries.bin ...
2019-11-13 00:50:28,240 INFO Done loading model vocabularies.
2019-11-13 00:50:30,001 INFO Done creating code2vec model
2019-11-13 00:50:38.261557: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 00:50:38.261595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]
WARNING:tensorflow:From /home/sidekoiii/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Traceback (most recent call last):
File "code2vec.py", line 36, in
predictor = InteractivePredictor(config, model)
File "/home/sidekoiii/code2vec/interactive_predict.py", line 16, in init
model.predict([])
File "/home/sidekoiii/code2vec/tensorflow_model.py", line 321, in predict
self._build_tf_test_graph(reader_output, normalize_scores=True)
File "/home/sidekoiii/code2vec/tensorflow_model.py", line 287, in _build_tf_test_graph
input_tensors = _TFEvaluateModelInputTensorsFormer().from_model_input_form(input_tensors)
File "/home/sidekoiii/code2vec/tensorflow_model.py", line 528, in from_model_input_form
path_source_token_strings=input_row[5],
IndexError: tuple index out of range

However I can run this command successfully
python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test data/java14m/java14m.test.c2v

code vector vs target embedding

Hi Uri,
I'm confused about code vectors and target embeddings. In the Readme:

Exporting the trained target (method name) embeddings:
python3 code2vec.py --load models/java14_model/saved_model_iter8 --save_t2v models/java14_model/targets.txt
Exporting the code vectors for the given code examples
The flag --export_code_vectors allows to export the code vectors for the given examples.
  1. To my understanding, code vectors are generated by combining context vectors with attention weights, and target embeddings are the weights between the code vector layer and the softmax layer. So, they are different. Each method name can be represented by a code vector or a target embedding (if the method name is in the target/tag dictionary). Is my understanding right?
  2. When exporting the code vectors for the given code examples, the output file doesn't contain method names. Is it possible to add method names to the output file?
  3. When using the gensim python package, in your example, you use target embeddings. Can we use code vectors as well?
    Thanks.

Encountered unexpected token when export_code_vectors

Hi, thanks for your fantastic work.

When I use code2vec to export vectors, and I try this code as Input.java:

public JdkCommunicatorLogger(Logger logger,Level logLevel,Level errorLogLevel){   if (logger == null) {     throw new IllegalArgumentException("logger is required");   }   if (logLevel == n      ull)     {     throw new IllegalArgumentException("logLevel is required");   }   if (errorLogLevel == null) {     throw new IllegalArgumentException("exceptionLogLevel is required");   }   this.logger=l      o    gger;   this.logLevel=logLevel;   this.errorLogLevel=errorLogLevel; }

I encountered this error:

Exception in thread "main" com.github.javaparser.ParseProblemException: Encountered unexpected token: "ull" <IDENTIFIER>
    at line 1, column 215.

	at com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)
	at com.github.javaparser.JavaParser.parse(JavaParser.java:210)
	at JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:70)
	at JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:40)
	at JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)
	at JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:39)
	at JavaExtractor.App.main(App.java:33)

So, how can I avoid this error? Or can this be done?

Error while training from scratch

Hello! I've tried to train a model from scratch and received the following error:

Average loss at batch 12300: 0.001828, throughput: 1302 samples/sec
Number of waiting examples in queue: 300000
2018-10-24 00:21:02.697593: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 4294967296 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
2018-10-24 00:21:02.787920: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 4294967296

It seems like something happened right at the end of first epoch.
I'm training a model in cloud with Tesla K80 and ~12 GB of RAM.
Do you have any idea on what could've caused an error?

Probability shown for a codeblock in the code2vec.org and after downloading the project are different

Hi,

I have used the released model you have provided in the repository.
However, if I test it using the same codeblock in code2vec.org and locally, the values are different. I have just followed the instructions provided in the readme to run and predict.
Here's the codeblock to test:

void f(int[] list){
    for(int i =1; i<list.length;i++){
        int currentElement = list[i];
        int k;
        for(k =i-1;k>=0&&list[k]>currentElement;k--){
            list[k+1]=list[k];
        }
        list[k+1]=currentElement;
    }

}

The probability shown in code2vec site is insertionSort 82.35% , sortByInsertionSort
6.49%

The probability shown in the git project is (0.549103) predicted: ['sort', 'by', 'insertion', 'sort'], (0.095443) predicted: ['min', 'key'], (0.074358) predicted: ['insertion', 'sort']

Is the model used in the website and provided in the git repository different? Or I am missing something?

Thank you,
Zayed

Question on how to export_code_vectors

Thanks for the impressive work.

I want to use the word2vec-style vectors for transfer learning on some other downstream tasks: I have some Java code snippets and I want to get the representations of these methods. I found the parameter export_code_vectors, but it seems that it is not implemented. Was it removed, or can you give me some advice on how to use it?

Some questions about Dataset

Hi, when I began to check the dataset "java14m.dict.c2v", I found entries of the form
"[path, -1354628, 30]". "Path" and "30" are "Xs" and "Xt", and "-1354628" is the path. And the number is the node.
I have a question about that:
why is the path < 0? Is it because of "↑" and "↓"? Could you tell me how it is represented?
I'm confused about that, although Definition 2 and Definition 3 mention it.

Thank you~

Question: Is there already a pre trained keras model?

I have read that there is already a Keras implementation, but as far as I understand it is not compatible with the weights of the pre-trained models on S3. Is this correct? I'm wondering because my study group and I want to research whether it is possible to use this network for transfer learning by replacing the final layer. However, we find it easier to work with Keras if possible.

Preprocessing on code2seq dataset

Thanks for open-sourcing this, it's great work.

I have been trying to run it on another dataset, specifically the java-small dataset from your code2seq work which I found at https://urialon.cswp.cs.technion.ac.il/publications/.

The issue is that the preprocess.sh script seems to get "stuck" extracting paths from the training set.

I say "stuck" because I'm not actually sure if the script has frozen or it just seems to take a long time (has currently been running for >3 hours).

Do you have an ETA on how long it took from your code2seq experiments? Or will these scripts not work with those datasets?

Thanks in advance.

Using code2vec for binary classification

Hi,

I am trying to use code2vec for binary classification for a different language. So, instead of method names, I have True and False as the labels. My custom feature extractor still generates the path vectors from the AST in a similar manner. I have a relatively small training dataset.

The pre-process step works fine and generates *.c2v files in data. However, while training, I am facing this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [4128] vs. [6880] [[{{node LogicalAnd}}]]

The sizes of my training, validation, and test sets are 4128, 644, and 641, respectively. Do I need to modify something in common.py for this?

Let me know if you need some other info or have better suggestions.

Thanks in advance.

Size of code vector is 384 and not 128

In the code2vec paper published at POPL'19, it says that the size of the code vector is d, with d=128. Moreover, the vectors of the contexts after the dense layer should, according to the paper, also have size 128. However, when debugging the code I noticed that their size is 3*d=384. Am I missing something?

Thanks for answering :)

When is dictionaries.bin created?

On downloading the trained model, I get dictionaries.bin. However, on training the network with preprocessed data from https://s3.amazonaws.com/code2vec/data/java14m_data.tar.gz, I'm unable to generate dictionaries.bin at the end of the training.

Because of this, I am unable to use Interactive predict for manual examination of the model. Am I missing something here? Please let me know at which phase, dictionaries.bin would be generated.

--release and --export_code_vectors don't work

Hi Sir,
Your work is wonderful. The code is very readable and the instructions are very clear. Thanks a lot.
Recently I'm applying your model to another data set and everything works well except --release and --export_code_vectors. Could you please give me a hand?
I run the code through Google Colab: python 3.6.8, tensorflow-gpu 2.0.0-beta1
The commands I used are as follows:
!python3 code2vec.py --framework keras --load models/test2vec/saved_model --release
!python3 code2vec.py --framework keras --load models/test2vec/saved_model --test data/my_dataset/my_dataset.test.c2v --export_code_vectors
The two commands run without errors, but I cannot see any generated files, i.e., the *.release and *.vectors files are not created. I don't know why.
Could you please give me some hints? Thanks.

JavaExtractor: Incorrect path length calculation

In FeatureExtractor, the length of a path is calculated as int pathLength = sourceStack.size() + targetStack.size() - 2 * commonPrefix; (line 140). I believe this calculation is one more than the actual path length, as it counts the common node twice.

I suggest calculating the length as currentSourceAncestorIndex + currentTargetAncestorIndex + 1

Question about Dataset

Hello, I'm interested in this project.
But when I try to evaluate your network on your java14m dataset, there is a small error in the Dataset.map function in self._create_dataset_pipeline. The error is "TypeError: Tensor is unhashable if Tensor equality is enabled. Instead, use tensor.experimental_ref() as the key.".
Could you give me some help with this question? Thank you.

Questions about JavaExtractor's hash function

Hi, I am trying to extend code2vec to JavaScript. So far, I have been able to extract paths. I have a few questions about the final form of my_dataset.val.c2v.
What hash function was used for paths? Did you use a standard hash function like sha1 or md5?
Do you unhash the hashed string somewhere in the program?
Are the arrows in the path (up, down) really important?

Error when MAX_CONTEXTS in config.py is changed

I am experimenting with code2vec for a binary classification problem for java methods, where the feature extractor generates the path vectors from AST as expected. The value of MAX_CONTEXTS in config.py is first set to 400. The preprocessing step works fine too with MAX_CONTEXTS set to 400 (in preprocess.sh) as the expected *.c2v files are generated.

Then when I start training (train.sh), I get the following error:

File "/home/lv71161/akarmakar/miniconda3/envs/c2v/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__ c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 401 in record 0

Could you tell what might be the issue here?
What am I doing wrong in the case where I set MAX_CONTEXTS = 400?
MAX_CONTEXTS = 200 works perfectly fine!

Here is the screenshot for your reference:
[A screenshot of the error was attached to the original issue.]

This website (https://code2vec.org/) doesn't generate the AST

Hi~
When I used some examples from the java-small dataset, like this one:
@test public void f(){
assert MIDDLE_VERSIONS_ENTITY_NAME.equals(metadata().getEntityBinding(MIDDLE_VERSIONS_ENTITY_NAME).getTable().getName());
}
the website didn't show me the AST. I want to know whether the example must be outside the training dataset.

Thanks~

Where can I find the word_freq_dict_path?

Hi, I tried to run the model using your example. However, if I run python3 code2vec.py --load models/java14_model/saved_model_iter8 --predict I get the error:

Loading word frequencies dictionaries from: None ..

Where can I find the .dict.c2v file that is needed for self.config.word_freq_dict_path?

PS I have already downloaded the weights of the network

BUG: Config.NUM_EXAMPLES does not exist.

Hello,

I tried to play with code2vec, and to specialize an already-trained model with my own dataset.

As suggested in #5, I used the following command line:

python3 -u code2vec.py --load ${trained_model} --data ${data} --test ${test_data} --save ${model_dir}/saved_model

However, I have the following errors:

Traceback (most recent call last):
  File "code2vec.py", line 38, in <module>
    model.train()
  File "/tmp/code2vec/model.py", line 72, in train
    self.config.NUM_EXAMPLES / self.config.BATCH_SIZE * self.config.SAVE_EVERY_EPOCHS), 1)
AttributeError: 'Config' object has no attribute 'NUM_EXAMPLES'

The field NUM_EXAMPLES does not exist in the Config object. I hacked model.py and replaced this value with 1 (I don't know what I'm doing, just to see what happens! :-))

It seems that this field is also used in PathContextReader

Starting training
Traceback (most recent call last):
  File "code2vec.py", line 38, in <module>
    model.train()
  File "/tmp/code2vec/model.py", line 76, in train
    config=self.config)
  File "/tmp/code2vec/PathContextReader.py", line 15, in __init__
    self.batch_size = config.TEST_BATCH_SIZE if is_evaluating else min(config.BATCH_SIZE, config.NUM_EXAMPLES)
AttributeError: 'Config' object has no attribute 'NUM_EXAMPLES'

According to the code in common.py, there is no such field in the Config object:

 def __init__(self):
        self.NUM_EPOCHS = 0
        self.SAVE_EVERY_EPOCHS = 0
        self.BATCH_SIZE = 0
        self.TEST_BATCH_SIZE = 0
        self.READING_BATCH_SIZE = 0
        self.NUM_BATCHING_THREADS = 0
        self.BATCH_QUEUE_SIZE = 0
        self.TRAIN_PATH = ''
        self.TEST_PATH = ''
        self.MAX_CONTEXTS = 0
        self.WORDS_VOCAB_SIZE = 0
        self.TARGET_VOCAB_SIZE = 0
        self.PATHS_VOCAB_SIZE = 0
        self.EMBEDDINGS_SIZE = 0
        self.SAVE_PATH = ''
        self.LOAD_PATH = ''
        self.MAX_TO_KEEP = 0
        self.RELEASE = False
        self.EXPORT_CODE_VECTORS = False

What would be the default value of this field?

Raw dataset~

I am so sorry to bother you again.
I want to know where I can find the raw data. I checked the issues list, but I cannot find it on http://urialon.cswp.cs.technion.ac.il/publications/ - I just see [PDF] [Slides] [Video] [Code and trained model] [BibTeX] in the code2vec block.

Best regards,
Wang Kun

what is TARGET_HISTOGRAM_FILE?

Hi @urialon,
I'm looking at your preprocess.sh file but I don't know how the histo.tgt.c2v file is generated, so I have two questions:
1. How is this file generated?
2. What does this file do?
Thanks.

JavaExtractor: Potential bug in FeatureExtractor

In FeatureExtractor line 182, if (i == 0 || s_ParentTypeToAddChildId.contains(currentNode.getUserData(Common.PropertyKey).getRawType())) the rawType of currentNode is checked. In the rest of the code (lines 156 and 167), the rawType of currentNode.getParentNode() is checked. Is this an intentional difference? If so why?

How can I keep track of the methods' location?

This is what I would like to do:
Given a folder with source code files, generate a file with the vectors that represent each method, but instead of having only the name of the methods I would like some additional data to know the location of the methods (file name, start line, start column, end line, end column)

Is there any way I can achieve this output?

Unsuccessful TensorSliceReader constructor: Failed to find any matching files for models/java14m/saved_model_iter8

Hi! I tried to run the code2vec model and ran into a problem caused by a mistake in the README.

I used trained model from https://s3.amazonaws.com/code2vec/model/java14m_model.tar.gz
When I run:
python3 code2vec.py --load models/java14m/saved_model_iter8 --predict
I saw this:
Unsuccessful TensorSliceReader constructor: Failed to find any matching files for models/java14m/saved_model_iter8

The reason is that the archive contains the model under the name "saved_model_iter8.release".
If we run it as below, everything works fine:
python3 code2vec.py --load models/java14m/saved_model_iter8.release --predict

Please fix the README or rename the model in the archive. Thanks.

AST Visualizer From Demo Website

Hi. Is the source code for the AST visualizer from the demo website available? I mean the script to which you feed the output of the /predict endpoint. It is very nice and I would like to use it for a university project if possible. Thank you!

Code Representation for code clones

Hello, and thank you for your work!
I'm now working on the code clones problem and I would like to use your approach for code representation. When I read your paper I wondered whether I could use your trained model for my problem, but I found nothing about that. How can I use your model? As far as I can see, your output predictions contain only method names and attention scores, but I need just a vector for the input Java code. Is that possible?
Thank you!

Fine-tuning for a different task

We are trying to fine-tune the code2vec model for a different task (a binary classification task).

We have previously tried to generate the codevectors for our code-snippets and feed them to a classifier (i.e. using code2vec as a feature extractor). The results haven't been what we expected so we were looking into fine-tuning code2vec with our data. We tried importing the meta_graph and wanted to retrieve the last tensor before classification in order to "connect" it to a binary output. Something along the lines of what is described in the section 'Using a pre-trained graph in a new graph' in https://blog.metaflow.fr/tensorflow-saving-restoring-and-mixing-multiple-models-c4c94d5d7125

Unfortunately we couldn't figure out the name of the tensor we required to be used in get_tensor_by_name(). We even opened the graph in Tensorboard to examine it but were a bit overwhelmed by the size of the graph.

Do you have any pointers as to how to modify the graph for a different classification task all while using the pre-trained weights from training code2vec with java-large?

Finding semantic code clones

Hi, am I right that I can use your NN for finding clones with similar semantics via vector distance?

Thanks for the answer.

SyntaxError: invalid syntax at File "/code/code2vec/vocabularies.py", line 45

When I try to run the training with tf2.0 in docker with gpu support, I get the following error:

docker run --runtime=nvidia -it -v $(realpath ~/Code):/code -u $(id -u):$(id -g) tensorflow/tensorflow:2.0.0a0-gpu-py3 bash

Traceback (most recent call last):
  File "code2vec.py", line 1, in <module>
    from vocabularies import VocabType
  File "/code/code2vec/vocabularies.py", line 45
    self.word_to_index: Dict[str, int] = {}
                      ^
SyntaxError: invalid syntax

Getting whole AST like the paper

Hi,
I am currently working on a research project regarding code similarity. For that, it would be really helpful for us if we could get the whole AST. From the code, I can get a number of attention paths, but not the whole tree as shown in the paper. Any guidance on that would be really appreciated.

Cannot get code2vec for short methods

When I feed the tool a short method like this one:

@Override
public void andReturn(Object value) {
this.throwWrappedIllegalStateException();
}

I do not get the code2vec output. Why?

Problems using code2vec to generate java code vectors

Hi, I am trying to use code2vec to generate representations (vectors) for some Java code. I am using the following command:

python code2vec.py --load models/java14_model/saved_model_iter8.release --test --export_code_vectors

and a variation:

python code2vec.py --load models/java14_model/saved_model_iter8.release --test <file_with_filepaths> --export_code_vectors

In both cases I got the following errors:

2019-11-02 23:46:57.585327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Traceback (most recent call last):
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Expect 201 fields but have 7 in record
[[{{node IteratorGetNext}}]]
[[IteratorGetNext/_19]]
(1) Invalid argument: Expect 201 fields but have 7 in record
[[{{node IteratorGetNext}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "code2vec.py", line 31, in
eval_results = model.evaluate()
File "/home/gustavoeloi/GU/UNICAMP/tese/code/code2vec/tensorflow_model.py", line 159, in evaluate
self.eval_original_names_op, self.eval_code_vectors],
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Expect 201 fields but have 7 in record
[[node IteratorGetNext (defined at /home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
[[IteratorGetNext/_19]]
(1) Invalid argument: Expect 201 fields but have 7 in record
[[node IteratorGetNext (defined at /home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'IteratorGetNext':
File "code2vec.py", line 31, in
eval_results = model.evaluate()
File "/home/gustavoeloi/GU/UNICAMP/tese/code/code2vec/tensorflow_model.py", line 122, in evaluate
input_tensors = input_iterator.get_next()
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
name=name)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2500, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 793, in _apply_op_helper
op_def=op_def)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3360, in create_op
attrs, op_def, compute_device)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3429, in _create_op_internal
op_def=op_def)
File "/home/gustavoeloi/programs/anaconda3/envs/tf_gpu2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1751, in init
self._traceback = tf_stack.extract_stack()

What should I do? Or, what am I doing wrong?
