
kaggle-web-traffic's Introduction

Kaggle Web Traffic Time Series Forecasting

1st place solution

Main files:

  • make_features.py - builds features from source data
  • input_pipe.py - TF data preprocessing pipeline (assembles features into training/evaluation tensors, performs some sampling and normalisation)
  • model.py - the model
  • trainer.py - trains the model(s)
  • hparams.py - hyperparameter sets
  • submission-final.ipynb - generates predictions for submission

How to reproduce competition results:

  1. Download input files from https://www.kaggle.com/c/web-traffic-time-series-forecasting/data : key_2.csv.zip and train_2.csv.zip, and put them into the data directory.
  2. Run python make_features.py data/vars --add_days=63. It will extract data and features from the input files and put them into data/vars as a TensorFlow checkpoint.
  3. Run the trainer: python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500. This command simultaneously trains 3 models with different seeds (in a single TF graph) and saves 10 checkpoints from step 10500 to step 11500 to data/cpt. Note: training requires a GPU because the model uses cuDNN; CPU training will not work. If you have 3 or more GPUs, add the --multi_gpu flag to speed up training. You can also try other hyperparameter sets (described in hparams.py): --hparam_set=definc, --hparam_set=inst81, etc. Don't be alarmed by NaN losses displayed during training. This is normal, because training runs in blind mode, without any evaluation of model performance.
  4. Run submission-final.ipynb in a standard Jupyter notebook environment and execute all cells. Prediction will take some time, because it has to load and evaluate 30 different sets of model weights. At the end, you'll get a submission.csv.gz file in the data directory.

See also detailed model description

kaggle-web-traffic's People

Contributors

arturus, demmojo

kaggle-web-traffic's Issues

Is this actually "encoder-decoder" vs. standard many-to-many?

Thanks for sharing your code, very helpful.

From your computational graph and model code, it looks like the "decoder" at each timestep takes 2 inputs:

  1. the previous hidden state from the decoder (hidden state of the decoder GRU cell at the previous timestep)
  2. a concatenated vector of inputs = [previous prediction, features, attention] where attention is optional.

The first-timestep decoder cell gets the "encoded state" as the last hidden state of the encoder. But later decoder timesteps do NOT get this encoded representation again. So the computational graph does not look like the one in the original RNN encoder-decoder paper or the one in the seq2seq encoder-decoder section of the Deep Learning book. In other words, this model architecture seems to be more like a standard many-to-many RNN than an encoder-decoder, right? You do not feed in the encoded state "c" again?
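
To make the question concrete, here is a minimal NumPy sketch of the wiring described above. It is an illustration only, not the repository's code: a plain tanh RNN cell stands in for the GRU cell, the weights are random, attention is omitted, and all names (`decode`, `W_in`, `W_h`, `w_out`) are hypothetical. The point it shows is that the encoded state enters only as the initial hidden state; later steps see it only through the recurrence.

```python
import numpy as np

def decode(encoded_state, features, W_in, W_h, w_out, first_input=0.0):
    """Run a toy decoder; a tanh RNN cell stands in for the GRU cell.

    encoded_state: (hidden,) last hidden state of the encoder
    features:      (steps, n_features) known features for each future step
    """
    h = encoded_state            # the encoded state is used ONLY here
    prev_pred = first_input
    preds = []
    for x_t in features:
        # Step input = [previous prediction, features]; no encoded state.
        step_input = np.concatenate(([prev_pred], x_t))
        h = np.tanh(W_in @ step_input + W_h @ h)
        prev_pred = float(w_out @ h)
        preds.append(prev_pred)
    return np.array(preds)

rng = np.random.default_rng(0)
hidden, n_features, steps = 4, 3, 5
W_in = rng.normal(size=(hidden, 1 + n_features))
W_h = rng.normal(size=(hidden, hidden))
w_out = rng.normal(size=hidden)
features = rng.normal(size=(steps, n_features))

# Two different "encoded states" give different forecasts, even though
# the state is injected only once, before the first decoder step.
preds_a = decode(np.zeros(hidden), features, W_in, W_h, w_out)
preds_b = decode(np.ones(hidden), features, W_in, W_h, w_out)
```

In the Cho et al. formulation, by contrast, the context vector would also be concatenated into `step_input` at every step.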

Thanks

Cho Encoder-decoder:
https://arxiv.org/pdf/1406.1078.pdf
Fig 1. on pg. 2

Deep Learning Book:
http://www.deeplearningbook.org/contents/rnn.html
Section 10.4 Encoder-Decoder Sequence-to-Sequence Architectures
pg. 391 Fig 10.12

Your model:
https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md#model-core

Answers for this Kaggle competition

May I ask whether you have the answers (ground truth) for this Kaggle competition?
I am not sure whether Kaggle releases the true results after a competition ends.
Thank you very much.

AttributeError: 'NoneType' object has no attribute 'set_index'

When I execute "python make_features.py data/vars --add_days=63", I get the following error:

Traceback (most recent call last):
  File "make_features.py", line 349, in <module>
    run()
  File "make_features.py", line 273, in run
    df, nans, starts, ends = prepare_data(args.start, args.end, args.valid_threshold)
  File "make_features.py", line 176, in prepare_data
    df = read_x(start, end)
  File "make_features.py", line 75, in read_x
    df = read_all()
  File "make_features.py", line 48, in read_all
    scraped = read_file('2017-08-15_2017-09-11_new')
  File "make_features.py", line 36, in read_file
    df = read_cached(file).set_index('Page')
AttributeError: 'NoneType' object has no attribute 'set_index'

I'm using Python 3.6.3, pandas 0.22.0. Thanks!
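
The traceback shows that `read_cached(file)` returned `None`. One common cause of that pattern is a loader that falls through without finding any input file. The sketch below is a hypothetical stand-in for such a loader, not the repository's implementation: the name `read_cached_strict` and the file layout it searches (`<name>.pkl`, `<name>.csv`, `<name>.csv.zip` under a data directory) are assumptions made for illustration. It raises a descriptive error instead of returning `None`, which makes a missing download easier to spot.

```python
import os
import pandas as pd

def read_cached_strict(name, data_dir="data"):
    """Hypothetical loader: return a DataFrame or raise, never None."""
    cached = os.path.join(data_dir, name + ".pkl")
    if os.path.exists(cached):
        return pd.read_pickle(cached)
    candidates = [os.path.join(data_dir, name + ext)
                  for ext in (".csv", ".csv.zip")]
    for src in candidates:
        if os.path.exists(src):
            df = pd.read_csv(src)   # zip compression is inferred from the extension
            df.to_pickle(cached)    # cache the parsed frame for the next run
            return df
    raise FileNotFoundError(
        "None of these input files exist: " + ", ".join(candidates))
```

With a check like this, a missing input file surfaces as a `FileNotFoundError` naming the paths that were tried, rather than as an `AttributeError` several calls later.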

update to tensorflow 1.8

@Arturus Thanks for sharing. Do you plan to update the TF version? I have run into problems upgrading from 1.4 to 1.8, because the CudnnGRU API changed a lot.

make_features.py

Dear Arturus:
I just ran "python make_features.py data/vars --add_days=63" but came across this error:

  File "make_features.py", line 310, in run
    dow_norm = features_days.dayofweek.values / week_period
AttributeError: 'numpy.ndarray' object has no attribute 'values'

How can I solve it?
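
`.values` exists on a pandas `Index` but not on a NumPy array, and which of the two `.dayofweek` returns depends on the pandas version. A version-tolerant sketch wraps the attribute in `np.asarray`, which accepts either. The `week_period` definition and the cos/sin encoding below are assumptions for illustration; use whatever `make_features.py` actually defines.

```python
import numpy as np
import pandas as pd

features_days = pd.date_range("2017-01-01", periods=14)
week_period = 7 / (2 * np.pi)   # assumed scaling: one week maps to a full circle

# np.asarray works whether .dayofweek is a pandas Index or a NumPy array.
dow = np.asarray(features_days.dayofweek)
dow_norm = dow / week_period

# Assumed encoding: represent the day of week as a point on the unit circle.
dow_features = np.stack([np.cos(dow_norm), np.sin(dow_norm)], axis=-1)
```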

Has anyone else hit an error when running the lag_indexes method in make_features.py?

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: DatetimeIndex(['2015-04-01', '2015-04-02', '2015-04-03', '2015-04-04',\n '2015-04-05',\n ...\n '2015-06-26', '2015-06-27', '2015-06-28', '2015-06-29',\n '2015-06-30'],\n dtype='datetime64[ns]', length=92, freq=None). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
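
The error message itself points to `.reindex` as the replacement for list-like lookups with missing labels. A minimal sketch on toy data is shown below; the series, the three-month lag and the use of -1 as a "missing" marker are assumptions for illustration and are not taken from `make_features.py`.

```python
import numpy as np
import pandas as pd

# Toy series that starts on 2015-07-01, so earlier dates are missing.
index = pd.date_range("2015-07-01", "2015-07-10")
positions = pd.Series(np.arange(len(index)), index=index)

# Dates three months earlier; none of them are in the index.
lagged_dates = index - pd.DateOffset(months=3)

# positions[lagged_dates] raises KeyError on recent pandas versions;
# reindex returns NaN for labels that are not present.
lagged = positions.reindex(lagged_dates)
lag_index = lagged.fillna(-1).astype(np.int64).values
```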

Unknown: CUDNN_STATUS_EXECUTION_FAILED, what is wrong? Thanks

UnknownError Traceback (most recent call last)
/home/xxx/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:

/home/xxx/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1340 return self._call_tf_sessionrun(
-> 1341 options, feed_dict, fetch_list, target_list, run_metadata)
1342

/home/xxx/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1428 self._session, options, feed_dict, fetch_list, target_list,
-> 1429 run_metadata)
1430

UnknownError: 2 root error(s) found.
(0) Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(953): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[{{node model/cudnn_gru/CudnnRNN}}]]
[[ConstantFoldingCtrl/model/absolute_difference/assert_broadcastable/AssertGuard/Switch_0/_30]]
(1) Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(953): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[{{node model/cudnn_gru/CudnnRNN}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

cudnn_gru ValueError when forward_split= True

First of all thank you for upgrading your code and having fixed all issues recently!

When I run train with --no_forward_split everything works ok, however when running train() to eval with forward_split=True I get a ValueError: Variable cudnn_gru_1/opaque_kernel does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?
Any idea on how to fix this or what is causing this issue?

Could it be related to the fact that now we are instantiating two models? train_model and forward_eval_model ?

Thank you

CPU version error

Dear Arturus:
I used tf.nn.dynamic_rnn + GRUCell to replace cuDNN, but came across this error:
"Input tensor 'rnn/while/Exit_3:0' enters the loop with shape (283, 128), but has shape (?, 128) after one iteration. To allow the shape to vary across iterations, use the shape_invariants argument of tf.while_loop to specify a less-specific shape."
when my code runs the loop "_, _, _, targets_ta, outputs_ta = tf.while_loop(cond_fn, loop_fn, loop_init)".
How can I solve it?

Question about .params_size() in model.py

Hello @Arturus! First of all, thank you for sharing your code, and congratulations on first place :)

I'm digging through the code and am stuck on these few lines - https://github.com/Arturus/kaggle-web-traffic/blob/master/model.py#L72

Can you please clarify what params_size is? As stated in the TF docs, it returns the "size of the opaque parameter buffer". What exactly is this: a hyperparameter size, a memory limit, or the batch size that can fit into my particular GPU's memory?

And a follow-up question: why do we need to compare the values from two calls of the same function, build_rnn?

Thank you in advance!
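
As a rough mental model, the opaque buffer is a layer's weights and biases packed into one flat array, so its size depends on the input width and the number of hidden units, not on batch size or GPU memory. The helper below is hypothetical and rests on an assumption about the cuDNN GRU layout: three gates, each with an input weight matrix, a recurrent weight matrix and two bias vectors. It covers a single unidirectional layer and estimates an element count; it is not a substitute for the value TensorFlow reports.

```python
def gru_param_count(input_size, num_units):
    """Estimated weight count for one unidirectional cuDNN-style GRU layer."""
    gates = 3  # reset, update, candidate
    per_gate = (input_size * num_units    # input-to-hidden weights
                + num_units * num_units   # hidden-to-hidden weights
                + 2 * num_units)          # two bias vectors per gate (assumed layout)
    return gates * per_gate

# Example: 10 input features, 4 hidden units.
example_size = gru_param_count(input_size=10, num_units=4)
```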

Run submission-final for only one model

First of all, thanks for the excellent code. Now the problem:
Since I only have one GPU (Nvidia Quadro), I was able to run only one model by means of:

python trainer.py --name s32 --hparam_set=s32 --n_models=1 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500

When I try to execute the submission-final file, I changed the corresponding cell as follows:

for tm in range(1):
    tf.reset_default_graph()
    t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
                           n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

to account for only one model's checkpoints. However, I am getting an error that I cannot solve:

#tf.reset_default_graph()
#preds = predict(paths, default_hparams(), back_offset=0,
#                n_models=3, target_model=0, seed=2, batch_size=2048, asgd=True)
t_preds = []
for tm in range(1):
    tf.reset_default_graph()
    t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
                           n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))


ValueError Traceback (most recent call last)
in ()
6 tf.reset_default_graph()
7 t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
----> 8 n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

~/projects/kaggle-web-traffic/trainer.py in predict(checkpoints, hparams, return_x, verbose, predict_window, back_offset, n_models, target_model, asgd, seed, batch_size)
691 else:
692 var_list = None
--> 693 saver = tf.train.Saver(name='eval_saver', var_list=var_list)
694 x_buffer = []
695 predictions = None

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in __init__(self, var_list, reshape, sharded, max_to_keep, keep_checkpoint_every_n_hours, name, restore_sequentially, saver_def, builder, defer_build, allow_empty, write_version, pad_step_number, save_relative_paths, filename)
1216 self._filename = filename
1217 if not defer_build and context.in_graph_mode():
-> 1218 self.build()
1219 if self.saver_def:
1220 self._check_saver_def()

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in build(self)
1225 if context.in_eager_mode():
1226 raise ValueError("Use save/restore instead of build in eager mode.")
-> 1227 self._build(self._filename, build_save=True, build_restore=True)
1228
1229 def _build_eager(self, checkpoint_path, build_save, build_restore):

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in _build(self, checkpoint_path, build_save, build_restore)
1249 return
1250 else:
-> 1251 raise ValueError("No variables to save")
1252 self._is_empty = False
1253

ValueError: No variables to save

Sorry for the question and thanks in advance for your comment.

Error reproducing competition results

I am trying to reproduce the competition results based on the instructions in the README.

  1. I download and unzip the files from the kaggle competition into the data/ folder

  2. I run the command python make_features.py data/vars --add_days=63 which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl and the directory vars/ in the data/ folder

  3. I run the trainer python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500 and receive the following error:

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'

I am using a p3.2xlarge AWS instance with the Deep Learning AMI with Python 3.6.5 and Tensorflow-gpu==1.12.0

If I downgrade to TF-GPU 1.10, I still get the same error.

How can I resolve this?
Full output from train command

got an unexpected keyword argument 'input_size'

When I run the command "python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500", an error occurs:

/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "trainer.py", line 776, in <module>
    train(**param_dict)
  File "trainer.py", line 514, in train
    all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
  File "trainer.py", line 471, in create_model
    train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 371, in __init__
    transpose_output=False)
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 72, in make_encoder
    static_p_size = cuda_params_size(build_rnn)
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 46, in cuda_params_size
    cuda_model = cuda_model_builder()
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 70, in build_rnn
    dropout=hparams.encoder_dropout if is_train else 0, seed=seed)
TypeError: __init__() got an unexpected keyword argument 'input_size'

After that I checked the documentation and source code of TensorFlow and found that the parameter "input_size" does exist in the definition of CudnnGRU. Can anybody tell me why this happens? Thanks

'CUDNN_STATUS_EXECUTION_FAILED' occurs

Hi, when I run the code on my server (4x V100, CUDA 9.0, cuDNN 7.0), this error occurs.
Could you please help me?
Which versions of CUDA and cuDNN do you use?

`/home/admin/algomodule/test/kaggle-web-traffic# python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500
WARNING:tensorflow:From /home/admin/algomodule/test/kaggle-web-traffic/model.py:144: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-10-02 06:00:37.510047: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-10-02 06:00:37.909980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:37.911006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.047527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.048568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:09.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.179680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.180730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0a.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.319747: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.320794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0b.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.320867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-10-02 06:00:40.205535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 06:00:40.205600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-10-02 06:00:40.205610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2019-10-02 06:00:40.205616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2019-10-02 06:00:40.205631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2019-10-02 06:00:40.205641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2019-10-02 06:00:40.205992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14941 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
2019-10-02 06:00:40.508989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14941 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
2019-10-02 06:00:40.811745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14941 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0a.0, compute capability: 7.0)
2019-10-02 06:00:41.114312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14941 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
1: 0%| | 0/566 [00:00<?, ?it/s]2019-10-02 06:00:47.758076: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.770054: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.782300: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "trainer.py", line 786, in
train(**param_dict)
File "trainer.py", line 599, in train
step = trainer.train_step(sess, epoch)
File "trainer.py", line 251, in train_step
results = self._metric_step(Stage.TRAIN, ops, sess, epoch, summary_every=20)
File "trainer.py", line 235, in _metric_step
results = sess.run(ops)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'm_0/cudnn_gru/CudnnRNN', defined at:
File "trainer.py", line 786, in
train(**param_dict)
File "trainer.py", line 520, in train
all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
File "trainer.py", line 474, in create_model
train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 342, in init
transpose_output=False)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 65, in make_encoder
rnn_out, (rnn_state,) = cuda_model(inputs=rnn_time_input)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call
outputs = self.call(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 412, in call
training)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 487, in _forward
seed=self._seed)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 922, in _cudnn_rnn
outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 115, in cudnn_rnn
is_training=is_training, name=name)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]`

Cannot colocate nodes `m_2/global_norm/L2Loss` and `m_2/gradients/m_2/CudnnRNN_grad/CudnnRNNBackprop`

Environment: Windows 10, TensorFlow 1.4

Cannot colocate nodes m_2/global_norm/L2Loss and m_2/gradients/m_2/CudnnRNN_grad/CudnnRNNBackprop because no device type supports both of those nodes and the other nodes colocated with them

Colocation Debug Info:

Colocation group had the following types and devices:

CudnnRNNBackprop: GPU

Identity:

L2Loss: CPU

 [[Node: m_2/global_norm/L2Loss = L2Loss[T=DT_FLOAT, _class=["loc:@m_2/gradients/m_2/CudnnRNN_grad/CudnnRNNBackprop"], _device="/device:GPU:0"](m_2/gradients/m_2/CudnnRNN_grad/tuple/control_dependency_3)]]

requirements.txt file with all versions

Hi Arturus,

Can you please update the requirements.txt file with the exact versions you used for this project? Please also mention which Python version you used to build it.

No such file or directory: 'data/vars/feeder_meta.pkl'

run

python3.6 trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500

error

Traceback (most recent call last):
  File "trainer.py", line 776, in <module>
    train(**param_dict)
  File "trainer.py", line 416, in train
    inp = VarFeeder.read_vars("data/vars")
  File "/gruntdata/junlong.qjl/kaggle-web-traffic/feeder.py", line 98, in read_vars
    with open(_meta_file(path), mode='rb') as file:
FileNotFoundError: [Errno 2] No such file or directory: 'data/vars/feeder_meta.pkl'

ImportError: cannot import name 'Collection'

When I ran python make_features.py data/vars --add_days=63, I got an error:

Traceback (most recent call last):
  File "make_features.py", line 10, in <module>
    from typing import Tuple, Dict, Collection, List
ImportError: cannot import name 'Collection'

I am using Python 3.5.
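
`typing.Collection` was added in Python 3.6, so this import fails on 3.5. Upgrading Python is the straightforward fix; where that is not possible, a guarded import along these lines is one workaround. Substituting `Sequence` is an assumption that is only reasonable when the name is used purely as a type annotation, and `total_length` is a toy function added to show such a use.

```python
try:
    from typing import Collection   # available from Python 3.6
except ImportError:                 # Python 3.5 fallback
    from typing import Sequence as Collection

def total_length(items: Collection[str]) -> int:
    """Toy function that uses Collection only as an annotation."""
    return sum(len(item) for item in items)
```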

License!

Could you please include a license?

SMAC3 parameter tuning code

In hparams.py, there are many parameter sets which can be used. Running the trainer.py script with the command suggested in the readme uses the parameter set s32, and it is also easy to supply my own parameters and run with those instead.

But can you also provide the script you used for running SMAC3 on your models for tuning? I know you said the different parameter sets did not have too much difference in performance but would be interested to use this SMAC step as well.

Thanks

Can't generate submission as no EMA checkpoints saved

Hi,

Trying to run the scripts as specified in readme. Getting error on generating submission:

INFO:tensorflow:Restoring parameters from data/feeder.cpt
INFO:tensorflow:Restoring parameters from data/cpt/s32/cpt-1620
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
~/miniconda3/envs/basev1/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1322     try:
-> 1323       return fn(*args)
   1324     except errors.OpError as e:

~/miniconda3/envs/basev1/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301                                    feed_dict, fetch_list, target_list,
-> 1302                                    status, run_metadata)
   1303 

~/miniconda3/envs/basev1/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472             compat.as_text(c_api.TF_Message(self.status.status)),
--> 473             c_api.TF_GetCode(self.status.status))
    474     # Delete the underlying status object from memory otherwise it stays alive

NotFoundError: Key m_0/m_0/decoder_output_proj/kernel/ExponentialMovingAverage not found in checkpoint
     [[Node: eval_saver/RestoreV2_6 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_eval_saver/Const_0_0, eval_saver/RestoreV2_6/tensor_names, eval_saver/RestoreV2_6/shape_and_slices)]]
     [[Node: eval_saver/RestoreV2_1/_9 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_26_eval_saver/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

It seems that the --no_eval flag prevents the EMA checkpoints from being saved; could that be? (Specifically, the ema_eval_stages list is always empty unless do_eval is True.)
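
One way to confirm the diagnosis is to list the variable names stored in the checkpoint and look for keys ending in ExponentialMovingAverage, the suffix shown in the NotFoundError. The sketch below assumes a TensorFlow version that provides tf.train.list_variables; both function names are hypothetical and not part of the repository.

```python
EMA_SUFFIX = "ExponentialMovingAverage"


def ema_variable_names(names):
    """Return only the variable names that look like EMA shadow variables."""
    return [name for name in names if name.endswith(EMA_SUFFIX)]


def checkpoint_has_ema(checkpoint_path):
    """Check whether a checkpoint contains any EMA shadow variables."""
    # Imported lazily so the pure helper above works without TensorFlow.
    import tensorflow as tf

    # list_variables yields (name, shape) pairs for every saved variable.
    names = [name for name, _shape in tf.train.list_variables(checkpoint_path)]
    return bool(ema_variable_names(names))
```

For example, checkpoint_has_ema("data/cpt/s32/cpt-1620") returning False would mean the checkpoint holds only the raw weights, which matches the error above.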

Thanks.

How to run SMAC

Can you explain how you ran SMAC to optimize the hyperparameters? How did you create the objective function to be minimized?

Dealing with sparsity

Hi, question about how you dealt with sparsity.

In input_pipe.py, there are parameters like "train_completeness_threshold", which determines how many 0's are allowed. It looks like the default for this value is 1. Further down in the code, there is:
self.max_train_empty = int(round(train_window * (1 - train_completeness_threshold)))
So with the default value of 1, this makes max_train_empty default to 0, i.e. the randomly cropped time series must be completely filled [no missing values] in order to be used in training.
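
The arithmetic can be checked with a small sketch. The first function reuses the expression quoted from input_pipe.py; keep_crop is a hypothetical helper, and it assumes a point counts as empty when its value is 0 and that a crop is rejected once it has more empty points than max_train_empty. The window sizes in the usage note are illustrative only.

```python
def max_train_empty(train_window, train_completeness_threshold):
    # Same expression as the line quoted from input_pipe.py.
    return int(round(train_window * (1 - train_completeness_threshold)))


def keep_crop(crop, train_window, train_completeness_threshold=1.0):
    # Assumption: a point is "empty" when it equals 0.
    zeros = sum(1 for value in crop if value == 0)
    return zeros <= max_train_empty(train_window, train_completeness_threshold)
```

With a hypothetical window of 100 points, a threshold of 1.0 allows 0 empty points and a threshold of 0.9 allows 10.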

So is this what you did to get your best results: did you discard any time series crop that had holes in it?

Of the ~145 thousand time series in train_1.csv, it looks like about 2/3 are dense [no missing values]. Any random crop of a dense series remains dense, and a random crop of a series with holes may still land on a dense portion. So I guess that even with max_train_empty = 0 you still get to use most of the data, right?

How to plot the autocorrelation?

Thanks for sharing. How did you plot the autocorrelation, and which tool did you use? Could you please post the code snippet? I tried the matplotlib API, but I only get discrete points. How can I plot a continuous line? Thanks a lot.
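
Until the author replies, here is a minimal sketch of one way to get a continuous curve. It is not the author's plotting code. It computes the sample autocorrelation for each lag in plain Python and draws it with matplotlib's plot(), which joins consecutive points with line segments. The function names and the "lag (days)" axis label are assumptions, and max_lag is expected to be smaller than the series length.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (max_lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    var = sum(d * d for d in dev)
    if var == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


def plot_autocorrelation(series, max_lag):
    """Draw the autocorrelation as a connected line and return the figure."""
    # Imported lazily so autocorrelation() works without matplotlib installed.
    import matplotlib.pyplot as plt

    acf = autocorrelation(series, max_lag)
    fig, ax = plt.subplots()
    # plot() connects the points; scatter() or stem() would leave them discrete.
    ax.plot(range(max_lag + 1), acf)
    ax.set_xlabel("lag (days)")
    ax.set_ylabel("autocorrelation")
    return fig
```

Calling plot_autocorrelation(values, 400) on a daily series should make weekly or yearly seasonality visible as peaks near the matching lags, provided the series is longer than 400 points.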

License

This code base is awesome and has helped me a lot in learning forecasting with an RNN encoder-decoder. I couldn't find a license file. I would appreciate it if the author, Arturus, could provide one. Thanks!

Which Python version do you use?

"from typing import Collection"
I tried running your code with Python 2.7 and 3.5.
Both fail with an error on this import, so I would like to know which Python version you used.
Python 3.6?
Thanks.

make_features.py @ step 2

Hi,
thank you for sharing your code so we can learn from it.

I have a problem:
when I run python make_features.py data/vars --add_days=63
it shows this error:

Traceback (most recent call last):
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 757, in astype
    dtype=dtype)
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 308, in __new__
    dtype=dtype, **kwargs)
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\datetimes.py", line 303, in __new__
    int_as_wall_time=True)
  File "D:\Miniconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 376, in _from_sequence
    ambiguous=ambiguous, int_as_wall_time=int_as_wall_time)
  File "D:\Miniconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 1720, in sequence_to_dt64ns
    dtype = _validate_dt64_dtype(dtype)
  File "D:\Miniconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 2016, in _validate_dt64_dtype
    .format(dtype=dtype))
ValueError: Unexpected value for 'dtype': 'datetime64[D]'. Must be 'datetime64[ns]' or DatetimeTZDtype'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "make_features.py", line 349, in <module>
    run()
  File "make_features.py", line 273, in run
    df, nans, starts, ends = prepare_data(args.start, args.end, args.valid_threshold)
  File "make_features.py", line 176, in prepare_data
    df = read_x(start, end)
  File "make_features.py", line 75, in read_x
    df = read_all()
  File "make_features.py", line 46, in read_all
    df = read_file('train_2')
  File "make_features.py", line 37, in read_file
    df.columns = df.columns.astype('M8[D]')
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 760, in astype
    raise TypeError(msg.format(name=type(self).__name__, dtype=dtype))
TypeError: Cannot cast Index to dtype M8[D]

How can I solve this?

OS: Windows 10
Python version: 3.6.8
numba: 0.42.0
numpy: 1.16.2
pandas: 0.24.1
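
The first traceback says that pandas accepts only 'datetime64[ns]' here, so the cast to 'M8[D]' on line 37 of make_features.py is rejected by this pandas version. One possible workaround is to parse the column labels with pd.to_datetime instead. The sketch below assumes the labels are date strings such as 2015-07-01. The helper name and the sample frame are hypothetical, and the change has not been checked against the rest of the pipeline.

```python
import pandas as pd


def with_datetime_columns(df):
    """Return a copy of df whose column labels are parsed into a DatetimeIndex."""
    out = df.copy()
    # Replaces df.columns.astype('M8[D]'), which this pandas version refuses.
    out.columns = pd.to_datetime(out.columns)
    return out


# Tiny illustrative frame: one page with two daily hit counts.
sample = pd.DataFrame(
    [[10, 12]],
    index=["hypothetical_page"],
    columns=["2015-07-01", "2015-07-02"],
)
converted = with_datetime_columns(sample)
```

Installing an older pandas release that still accepts the day-resolution cast may be an alternative, but which versions do so is not stated in this thread.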
