hkust-knowcomp / r-net Goto Github PK

View Code? Open in Web Editor NEW

581.0 34.0 212.0 193 KB

Tensorflow Implementation of R-Net

License: MIT License

Python 98.35% Shell 1.65%

squad tensorflow nlp machine-comprehension r-net

r-net's Introduction

R-Net

A Tensorflow implementation of R-NET: MACHINE READING COMPREHENSION WITH SELF-MATCHING NETWORKS. This project is specially designed for the SQuAD dataset.
Should you have any question, please contact Wenxuan Zhou ([email protected]).

Requirements

There have been a lot of known problems caused by using different software versions. Please check your versions before opening issues or emailing me.

General

Python >= 3.4
unzip, wget

Python Packages

tensorflow-gpu >= 1.5.0
spaCy >= 2.0.0
tqdm
ujson

Usage

To download and preprocess the data, run

# download SQuAD and Glove
sh download.sh
# preprocess the data
python config.py --mode prepro

Hyper parameters are stored in config.py. To debug/train/test the model, run

python config.py --mode debug/train/test

To get the official score, run

python evaluate-v1.1.py ~/data/squad/dev-v1.1.json log/answer/answer.json

The default directory for tensorboard log file is log/event

See release for trained model.

Detailed Implementaion

The original paper uses additive attention, which consumes lots of memory. This project adopts scaled multiplicative attention presented in Attention Is All You Need.
This project adopts variational dropout presented in A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.
To solve the degradation problem in stacked RNN, outputs of each layer are concatenated to produce the final output.
When the loss on dev set increases in a certain period, the learning rate is halved.
During prediction, the project adopts search method presented in Machine Comprehension Using Match-LSTM and Answer Pointer.
To address efficiency issue, this implementation uses bucketing method (contributed by xiongyifan) and CudnnGRU. The bucketing method can speedup training, but will lower the F1 score by 0.3%.

Performance

Score

	EM	F1
original paper	71.1	79.5
this project	71.07	79.51

Training Time (s/it)

	Native	Native + Bucket	Cudnn	Cudnn + Bucket
E5-2640	6.21	3.56	-	-
TITAN X	2.56	1.31	0.41	0.28

Extensions

These settings may increase the score but not used in the model by default. You can turn these settings on in config.py.

Pretrained GloVe character embedding. Contributed by yanghanxy.
Fasttext Embedding. Contributed by xiongyifan. May increase the F1 by 1% (reported by xiongyifan).

r-net's People

Contributors

Stargazers

Watchers

Forkers

meinwerk webblearning mennianshi prpankajsingh levstyle jadielam huangziliandy sth4k xiongyifan burglarhobbit romxz jinyeong yanghanxy chenghuige vishwajeet93 yvesharrison kamalkraj hualichenxi xiongfeihtp hexingwei jiths tim5go resec hpk23 wangxueliangustc alexbrandon limisi rubeeny locussam nke001 revodata nealrichardrui chunlinx pchankh jiangziguo cceyda sathishreddy zumbalamambo royshan adripurkayastha javacjh jackalhan jianbotang seanliu96 cutecha rafaelsimonmaia xxcharles zhlj98 bityangke mars-wei agnon1573 kinghuin yafuilee giantleap2012 chiuyeelau onisimchukv z00bean poojithansl paulyangsz los-phoenix zhanzecheng huyingxi liu4lin simplejian aiedward cosecant-csc jackysnake lgdkobe24 zjulins lindw desert0616 konroyzhu weiyegd zhaodaolimeng jasonshiyong peteykun yuhuizhou nipengmath chavesliu apollos wanghm92 shubhampachori12110095 tusharbihani 312shan wolfhu ccmjimmy2007 xuwenshen luckmoon web199195 augmen wangsc522 caoxu915683474 yq911122 ericchansen newenglandml pandagui reloadbrain warynice mercury20160104 db-li

r-net's Issues

Pre trained Model.

Could you please share your trained model. I don't have GPU and don't have that much of time to train in CPU. I would be grateful if could share.

Error: ``Cannot colocate nodes...''

Hi Wenxuan,

Many thanks for this wonderful and educational implementation.
I have encountered an error while trying python config.py --mode debug.

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'global_norm/L2Loss_32' and 'gradients/match/bw/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
Identity:
L2Loss: CPU
[[Node: global_norm/L2Loss_32 = L2LossT=DT_FLOAT, _class=["loc:@gradients/match/bw/CudnnRNN_grad/CudnnRNNBackprop"]]]

All the configs are by default.

Error:No OpKernel was registered to support Op 'CudnnRNNParamsSize'

Hi,
I ran the python3 ./config.py --mode train as the readme suggests, and then the error in the title occurred. I really don't know what went wrong, please help me!!!
the error information is posted as follows:

/usr/lib/python3.4/importlib/_bootstrap.py:321: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  return f(*args, **kwds)
Building model...
2018-01-29 16:26:08.747856: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNParamsSize' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

	 [[Node: match/CudnnRNNParamsSize_1 = CudnnRNNParamsSize[S=DT_INT32, T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", rnn_mode="gru", seed=87654321, seed2=0](match/CudnnRNNParamsSize_1/num_layers, match/CudnnRNNParamsSize_1/num_units, match/CudnnRNNParamsSize_1/input_size)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "config.py", line 127, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "config.py", line 108, in main
    train(config)
  File "/home/yx/ssd/R-Net-master/main.py", line 46, in train
    sess.run(tf.global_variables_initializer())
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNParamsSize' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

	 [[Node: match/CudnnRNNParamsSize_1 = CudnnRNNParamsSize[S=DT_INT32, T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", rnn_mode="gru", seed=87654321, seed2=0](match/CudnnRNNParamsSize_1/num_layers, match/CudnnRNNParamsSize_1/num_units, match/CudnnRNNParamsSize_1/input_size)]]

Caused by op 'match/CudnnRNNParamsSize_1', defined at:
  File "config.py", line 127, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "config.py", line 108, in main
    train(config)
  File "/home/yx/ssd/R-Net-master/main.py", line 35, in train
    model = Model(config, iterator, word_mat, char_mat)
  File "/home/yx/ssd/R-Net-master/model.py", line 43, in __init__
    self.ready()
  File "/home/yx/ssd/R-Net-master/model.py", line 107, in ready
    ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train)
  File "/home/yx/ssd/R-Net-master/func.py", line 24, in __init__
    [gru_bw.params_size()], -0.1, 0.1), validate_shape=False)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1305, in params_size
    direction=self._direction)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1208, in cudnn_rnn_opaque_params_size
    name=name)[0]
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py", line 439, in cudnn_rnn_params_size
    seed=seed, seed2=seed2, name=name)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'CudnnRNNParamsSize' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

	 [[Node: match/CudnnRNNParamsSize_1 = CudnnRNNParamsSize[S=DT_INT32, T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", rnn_mode="gru", seed=87654321, seed2=0](match/CudnnRNNParamsSize_1/num_layers, match/CudnnRNNParamsSize_1/num_units, match/CudnnRNNParamsSize_1/input_size)]]

Inference Module

I'd like to use this project's code in a new version of my open source application, "askTanmay". Can you please create a module that simply takes a question & a paragraph, and returns the output from the neural net? It'd be much appreciated. Thanks!

Using the trained model

How is it possible to use the trained model to answer the new questions?

absl.flags._exceptions.IllegalFlagValueError

I got this exception absl.flags._exceptions.IllegalFlagValueError: flag --bucket_range=[40, 401, 40]: Expect argument to be a string or int, found <class 'list'> at flags.DEFINE_integer("bucket_range", [40, 401, 40], "the range of bucket").
Sorry to say that I'm a beginner. Who can tell me why?

Prepro/training wrapper adopted

Hi, thanks for your code! The results are very impressive, the training procedure is computationally cheap and memory efficient! I'm just letting you know that I adopted your training/preprocessing pipeline for my repository at Fast Reading Comprehension which is also in MIT license. Please let me know should this be an issue. Thanks!

softmax_mask

This repo is written very nicely, thanks for sharing!

My question is not related with this system, just more about understanding RNet. I don't get why you use softmax_mask() before softmax operations. Could you explain more? Thanks again!

Why remove usage of tf.layers.dense and why change to use softmax_cross_entropy_with_logits_v2 ?

I see func.py code change from
using tf.layers.dense to new written dense function in func.py, why to make such decision ?
Seems they do the same thing.
Also change from using softmax_cross_entropy_with_logits to softmax_cross_entropy_with_logits_v2, did it improve performance ? Seems if stop_gradient for label they are the same.

What if there is no answer in the document to the given question?

Hi, have you considered training model on dataset with some documents where correct answer to the question does not exist (I mean other dataset, not SQuAD)? How to change the architecture to make it possible? Should we use another loss function?

Seems cudnn gru hidden size 100 is much faster ?

Thanks a lot for sharing this wonderful code, great help.
One thing interesting is when I set hidden size to 100 instead of 75, ie final output dim is 100 * 2 * 3 = 600 while by default is 75 * 2 * 3 = 450.
hidden size 100 batches/s:[4.39] insts/s:[281.20]
hidden size 75 will be batches/s:[3.35] insts/s:[214.44]
So confusing, why bigger hidden size faster ?

multiple GPUs

Does the training support multiple GPUs?

Best validation loss ?

Can you share the best train/val loss you observed ?

Should use train data only to build the word lookup table

word_counter, char_counter = Counter(), Counter()
    train_examples, train_eval = process_file(
        config.train_file, "train", word_counter, char_counter)
    dev_examples, dev_eval = process_file(
        config.dev_file, "dev", word_counter, char_counter)
    test_examples, test_eval = process_file(
        config.test_file, "test", word_counter, char_counter)

word_counter and char_counter should not be updated in processing dev and test data

Question: MS MARCO implementations

I want to implement it for MS MARCO datasets, the change should be in preprocessing datasets only?

it hard to tell this is the R-Net

They have an essential difference between the model proposed by u and the one proposed by MS.

The key components of R-Net are missed in ur model, though ur model is effective

It's really hard to train the original RNET model

About Test & Evaluation (Unanswered Question Issue)

I have trained your model using GeForce GTX 1080 with reduced batch_size=55. After 60000 steps, I used python config.py --mode test and got the following result:
Exact Match: 69.69964664310955, F1: 78.46608330209344
Then I tried to remap the answer_dict so that the id in the answer_dict can stay consistent with dev-v1.1.json. I stored the remapped answer_dict to the json file, then used the official evaluation script to test the result.
the command I used:
python evaluate-v1.1.py dev-v1.1.json predictions-remapped.json
the result I got:
Exact Match: 67.18070009460737, F1: 75.63031756686154
I noticed that there were many unanswered questions. The output in the console looked like this:

...
Unanswered question 5733fe73d058e614000b6740 will receive score 0.
Unanswered question 57373d0cc3c5551400e51e85 will receive score 0.
Unanswered question 57377ec7c3c5551400e51f07 will receive score 0.
Unanswered question 57377ec7c3c5551400e51f09 will receive score 0.
Unanswered question 5737a0acc3c5551400e51f4b will receive score 0.
...

I have also checked the answer_dict (without remapping the id), the ids(=tot, tot+=1) should be series of continuous integers, however it seems that some of the numbers are missing.

Is this a desired behavior? Or does it have something to do with the preprocessing/testing procedure?

Thank you very much, looking forward to your reply.

bucket range type error

Hello, i got some issue.
When i run 'python config.py --mode prepro' or other mode i got the error below.
How to solve this ?

`Traceback (most recent call last):
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_flag.py", line 166, in _parse
return self.parser.parse(argument)
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_argument_parser.py", line 152, in parse
val = self.convert(argument)
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_argument_parser.py", line 268, in convert
type(argument)))
TypeError: Expect argument to be a string or int, found <class 'list'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "config.py", line 80, in
flags.DEFINE_integer("bucket_range", [40, 401, 40], "the range of bucket")
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/tensorflow/python/platform/flags.py", line 58, in wrapper
return original_function(*args, **kwargs)
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_defines.py", line 315, in DEFINE_integer
DEFINE(parser, name, default, help, flag_values, serializer, **args)
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_defines.py", line 81, in DEFINE
DEFINE_flag(_flag.Flag(parser, serializer, name, default, help, **args),
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_flag.py", line 107, in init
self._set_default(default)
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_flag.py", line 196, in _set_default
self.default = self._parse(value)
File "/home/m13514063/anaconda3/envs/micky/lib/python3.5/site-packages/absl/flags/_flag.py", line 169, in _parse
'flag --%s=%s: %s' % (self.name, argument, e))
absl.flags._exceptions.IllegalFlagValueError: flag --bucket_range=[40, 401, 40]: Expect argument to be a string or int, found <class 'list'>`

Return answer and confidence

Could we modify inference to also return the confidence associated with the answer?

Another idea is to have the user ask for x answers from inference. It could return all x answers, each with their confidences, as a JSON.

I'm working on it myself. Any guidance on how to access the confidence of a given answer?

Tensor reshape error when running the model on test mode

I get an error stating:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 400 values, but the requested shape has 1000   
         [[Node: Reshape_5 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](DecodeRaw_5, Reshape_5/shape)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,1000], [?,100], [?,1000,16], [?,
100,16], [?,1000], [?,1000], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_FLOAT, DT
_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
         [[Node: IteratorGetNext/_109 = _Recv[client_terminated=false, recv_device="/job:localhost/re
plica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device
_incarnation=1, tensor_name="edge_60_IteratorGetNext", tensor_type=DT_INT32, _device="/job:localhost/
replica:0/task:0/device:GPU:0"]()]]

when I run the command python3 config.py --mode test. If I change the files to dev instead of test and pass is_Test=False in the get_record_parser argument, it is working as expected. What could be the reason behind this error?

Native GRU using Forward GRU instead of Backward GRU.

In your func.py file, I believe line 95-96 should be using gru_bw instead of gru_fw.

Currently,
out_bw, _ = tf.nn.dynamic_rnn(gru_fw, inputs_bw, seq_len, initial_state=init_bw, dtype=tf.float32)
But should be,
out_bw, _ = tf.nn.dynamic_rnn(gru_bw, inputs_bw, seq_len, initial_state=init_bw, dtype=tf.float32)

Isn't it?

bucket not correct ?

in key_func t = tf.clip_by_value(buckets, 0, c_len) return tf.argmax(t)
but https://www.tensorflow.org/api_docs/python/tf/argmax say
"Note that in case of ties the identity of the return value is not guaranteed."

if buckets is [40,100] input legnth is 30 then [30, 30] whith tf.argmax might return 0 or 1

So might change to do bucket like
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py

Wrong attention at Question-Passage Matching Layer

In the original paper,
ct = att(uQ, [uPt, vPt−1])
the Q-P attention takes into account the last state's output vPt−1
while the implementation does not.

The Q-P attention and P-P self attention are somewhat different in the paper. The latter does not take into account the last output state of RNN.

French ?

Hello,
What does it take to have it in French ?
Thx

Reprocessing Data Causing Error

The version 2 of your R-Net implementation worked perfectly. Following the preprocess->train->test procedure I got the expected result. However, It seemed that if I preprocessed the data again, then the saved model failed to work correctly as it should be during the test procedure.

I have checked the generated files in the data folder, it seemed that test_eval.json and word_emb.json changed each time after I had executed the preprocess command. Since the eval_examples were not shuffled and the glove.840B.300d.txt remained unchanged, this should not happened. This unexpected behavior really confused me.

CudnnRNNParamsSize Error

Hello,
When i use CPU, it works normally. But when i use GPU, i got this error:

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'CudnnRNNParamsSize' with these attrs. Registered devices: [CPU], Registered kernels:
device='GPU'; T in [DT_DOUBLE]; S in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; S in [DT_INT32]

Can someone help me solve this problem ?
Thanks.

Porting the code to MS MARCO dataset

So I have ported this implementation of RNET on the MS-MARCO dataset, and the rouge-l scores after training are quite low compared to the reported scores on MS-MARCO in the RNET paper.

ROUGE-L dev scores
ROUGE-L train scores

What might have gone wrong?

Here is a link for my implementation reference: rnet-msmarco

character embedding is all zero

Hi,

Thanks for sharing this nice R-NET implementation!

I've got one question regarding the character embedding.
I checked the data/char_emb.json file, and found out that all vectors are with value zero, which is quite weird.
And then I checked the code for creating the character embedding,
in prepro.py， line 100:

        for token in filtered_elements:
            embedding_dict[token] = [0. for _ in range(vec_size)]

It seems that you encode each character to a vector of zeros with length vec_size.
I don't understand the meaning for this, beacuse this makes all characters identical in tf.nn.embedding_lookup.
Could you explain why are you doing in this way? Thank you!

Test with CPU

I was trying to run "config.py --mode test --use_cudnn=False" in CPU mode using a model that was trained with a GPU. But "cudnn_gru" and "native_gru" do not seem to be compatible.

Is there a solution to this problem?

examples in func(build_features)

Thank you and great work!
I noticed the code in func build_features in prepro.py that
start, end = example["y1s"][-1], example["y2s"][-1]
y1[start], y2[end] = 1.0, 1.0
Does it mean although you store the all the answer_spans for an question in y1s and y2s in func process_file you only use the last answer_span for train when building features?

incorporating Context Vectors?

How can I use Context Vectors as in https://github.com/salesforce/cove?

Using fasttext pre-trained embedding

Hi, If the boolean flags for fasttext changed to True, the pre processing does not work properly. The modification for config.py is more that changing the size of embedding or not?

Meaning of "num_steps" and "global_step"

Hi,
When i read your code, I was confused about the variable num_steps and global_step , does num_steps mean epoch? and the final value of global_step is equal to the value of checkpoint in the config.py file?

dense function

There is a variable in dense function:
W = tf.get_variable("W", [dim, hidden])
Is that mean they share the weights? But in the paper, there are different weights.
I'am confused about this.

Error while compiling

FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "config.py", line 86, in
flags.DEFINE_list("bucket_range", [40, 361, 40], "range of bucket")
AttributeError: module 'tensorflow.python.platform.flags' has no attribute 'DEFINE_list'

I am trying to fine tune. But dimensionality mismatched of the tensors. Kindly give some solution.

hyperparameter

i want to try R-Net for question answering on my language.
what hyperparameters that i need to try ?

Inaccurate prediction after reloading Model

This might be related to the known bug #13254 as well:
I can train the model with "CUDNN" option as either 'true' (much faster) or 'false', both completed successfully and test accuracy were expected.
But when I reload the model from the saved state, in both cases, models are providing wrong answers (though no python errors)..

Questions:
1. would this bug impact non-CUDNN model option as well?
2. if yes, is there any work-around?

Why cudnn gru include 'multi_rnn_cell/cell_0/cudnn_compatible_gru_cell' ?

I find vars in the model generated
encoding/bw_0/cudnn_gru/opaque_kernel (DT_FLOAT) [86184]
encoding/bw_0/cudnn_gru/rnn/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/candidate/hidden_projection/bias (DT_FLOAT) [76]
encoding/bw_0/cudnn_gru/rnn/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/candidate/hidden_projection/kernel (DT_FLOAT) [76,76]
encoding/bw_0/cudnn_gru/rnn/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/candidate/input_projection/bias (DT_FLOAT) [76]
encoding/bw_0/cudnn_gru/rnn/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/candidate/input_projection/kernel (DT_FLOAT) [300,76]
encoding/bw_0/cudnn_gru/rnn/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/gates/bias (DT_FLOAT) [152]
encoding/bw_0/cudnn_gru/rnn/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/gates/kernel (DT_FLOAT) [376,152]

Why there is multi_rnn_cell/cell_0/cudnn_compatible_gru_cell ?
For when I only use your cudnn_gru code as encoder in other applications, and check the model saved I find only
rnet/main/encoding/bw_0/cudnn_gru/opaque_kernel (DT_FLOAT) [86184]
no multi_rnn_cell.

TypeError: init() got an unexpected keyword argument 'input_size'

$ python config.py --mode train
Building model...
Traceback (most recent call last):
File "config.py", line 127, in
tf.app.run()
File "/Users//ve_tf1.5_py3/venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
sys.exit(main(argv))
File "config.py", line 108, in main
train(config)
File "/Users//R-Net-tf/main.py", line 35, in train
model = Model(config, iterator, word_mat, char_mat)
File "/Users//R-Net-tf/model.py", line 43, in init
self.ready()
File "/Users//R-Net-tf/model.py", line 92, in ready
).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train)
File "/Users//R-Net-tf/func.py", line 18, in init
num_layers=1, num_units=num_units, input_size=input_size)
TypeError: init() got an unexpected keyword argument 'input_size'

predict new data with model

How to use the model for predict just 1 context and question ?

Error when context is too short

What is the best way to deal with short contexts?

I believe this implementation fails when the context is less than 15 tokens due to outer = tf.matrix_band_part(outer, 0, 15), which is located at https://github.com/HKUST-KnowComp/R-Net/blob/master/inference.py#L111.

Code Snippet

    with tf.variable_scope("predict"):
            outer = tf.matmul(tf.expand_dims(tf.nn.softmax(logits1), axis=2),
                              tf.expand_dims(tf.nn.softmax(logits2), axis=1))
            outer = tf.matrix_band_part(outer, 0, 15)
            self.yp1 = tf.argmax(tf.reduce_max(outer, axis=2), axis=1)
            self.yp2 = tf.argmax(tf.reduce_max(outer, axis=1), axis=1)

Example

For example, consider 2 contexts.

The N64 was released in Japan on June 23, 1996.
The N64, Nintendo's third game console, was released in Japan on June 23, 1996.

When was the N64 released? works for the 2nd context, but the 1st context throws an error.

Modified `main` From `inference.py`

if __name__ == "__main__":
    infer = Inference()
    context = \
        "The N64 was released in Japan on June 23, 1996."
    ques1 = "When was the N64 released?"
    ans1 = infer.response(context, ques1)
    print("Answer 1: {}".format(ans1))

Error Message

/home/eric/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WARNING:tensorflow:From /home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
Traceback (most recent call last):
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call    return fn(*args)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_upper must be negative or less or equal to number of columns (12) got: 15
         [[Node: predict/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](predict/MatMul, predict/MatrixBandPart/num_lower, predict/MatrixBandPart/num_upper)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 231, in <module>
    ans1 = infer.response(context, ques1)
  File "inference.py", line 140, in response
    model.c: context_idxs, model.q: ques_idxs, model.ch: context_char_idxs, model.qh: ques_char_idxs})
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_upper must be negative or less or equal to number of columns (12) got: 15
         [[Node: predict/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](predict/MatMul, predict/MatrixBandPart/num_lower, predict/MatrixBandPart/num_upper)]]

Caused by op 'predict/MatrixBandPart', defined at:
  File "inference.py", line 194, in <module>
    infer = Inference()
  File "inference.py", line 127, in __init__
    self.model = InfModel(self.word_mat, self.char_mat)
  File "inference.py", line 55, in __init__
    self.ready()
  File "inference.py", line 111, in ready
    outer = tf.matrix_band_part(outer, 0, 15)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3464, in matrix_band_part
    num_upper=num_upper, name=name)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op    op_def=op_def)
  File "/home/eric/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): num_upper must be negative or less or equal to number of columns (12) got: 15
         [[Node: predict/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](predict/MatMul, predict/MatrixBandPart/num_lower, predict/MatrixBandPart/num_upper)]]

Which call method is adopted in calss cudnn_gru? 'call' or 'call'?

In API document of tensorflow, there are two ``calls'' methods in tf.contrib.cudnn_rnn.CudnnGRU, which is '_call_' and 'call'. And when I see such codes:
gru_fw = tf.contrib.cudnn_rnn.CudnnGRU(1, num_units)

gru_fw(outputs[-1] * mask_fw, initial_state=(init_fw, ))

I infer that you adopt the second call method as below
call( inputs, initial_state=None, training=True )
Is that right? And why not write as
gru_fw.call((outputs[-1] * mask_fw, initial_state=(init_fw, )))?

About GPU RAM Requirement

Sorry for disturbing you, but may I ask about the minimum GPU RAM required for this model? I will be very grateful if you are willing to share the version of GPU you have used and the time it took to get the result.

training error

As i told before in other issue,
When i use CPU, it works normally. But when i use GPU, i got this error:

JianBoTang solve it with:
I uninstall the tensorflow, and install tensorflow-gpu. it works.
therefore, the tensorflow==1.4.0 in requirements.txt should be modified to tensorflow-gpu==1.4.0

But if i uninstall my tensorflow 1.4.0 (tensorflow-gpu 1.4.0 still remain), i got this error:
ImportError: No module named 'tensorflow.python.ops.rnn_cell'

If i have both tensorflow and tensorflow-gpu: i got 'CudnnRNNParamsSize' error.

why only use the last answer ?

what if a question has multiple proper answers?
start, end = example["y1s"][-1], example["y2s"][-1]
y1[start], y2[end] = 1.0, 1.0

Error when using the pre-trained model in CPU

I tried using R-Net with the pre-trained model (CPU version) at Release version.

But I got an error following:

My system:

Python: 3.5
Tensorflow: 1.7.0

about the shape and mask

    with tf.variable_scope("match"):
        self_att = dot_attention(
            att, att, mask=self.c_mask, hidden=d, keep_prob=config.keep_prob, is_train=self.is_train)
        rnn = gru(num_layers=1, num_units=d, batch_size=N, input_size=self_att.get_shape(
        ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train)
        match = rnn(self_att, seq_len=self.c_len)

in this variable_scope the shape of "att" is [batch_size, c_max_len, d]
the shape of mask is [batch_size, c_max_len].
Is that right

What is the reason switching to TF1.5.0?

I was trying to make the code running for days and just noticed the TF version increased to 1.5.0.
I am currently using TF1.4.0 and wondering if this version update a must or not, is there any possibility I can modify the code and make it fit TF1.4.0.

One more reason asking this is because I was using a version before March 4th now and running using GPU with 24G RAM.But I encountered OOM issue that should not appear and have no idea why.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[18496,100]
	 [[Node: emb/char/bidirectional_rnn/fw/fw/while/fw/gru_cell/mul_1 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](emb/char/bidirectional_rnn/fw/fw/while/fw/gru_cell/split:1, emb/char/bidirectional_rnn/fw/fw/while/Switch_2:1)]]
	 [[Node: match/dot_attention/attention/Tile/multiples/_155 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2927_match/dot_attention/attention/Tile/multiples", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]