arturus / kaggle-web-traffic



1st place solution

License: MIT License

Python 25.85% Jupyter Notebook 74.15%

kaggle-web-traffic kaggle time-series timeseries rnn-encoder-decoder rnn tensorflow cudnn cocob seq2seq

kaggle-web-traffic's Introduction

Kaggle Web Traffic Time Series Forecasting

1st place solution

Main files:

make_features.py - builds features from source data
input_pipe.py - TF data preprocessing pipeline (assembles features into training/evaluation tensors, performs some sampling and normalisation)
model.py - the model
trainer.py - trains the model(s)
hparams.py - hyperparameter sets
submission-final.ipynb - generates predictions for submission

How to reproduce competition results:

Download the input files key_2.csv.zip and train_2.csv.zip from https://www.kaggle.com/c/web-traffic-time-series-forecasting/data and put them into the data directory.
Run python make_features.py data/vars --add_days=63. It extracts data and features from the input files and puts them into data/vars as a TensorFlow checkpoint.
Run the trainer: python trainer.py --name s32 --hparam_set=s32 --n_models=3 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500. This command simultaneously trains 3 models with different seeds (on a single TF graph) and saves 10 checkpoints from step 10500 to step 11500 to data/cpt. Note: training requires a GPU because the model uses cuDNN; CPU training will not work. If you have 3 or more GPUs, add the --multi_gpu flag to speed up training. You can also try other hyperparameter sets (described in hparams.py): --hparam_set=definc, --hparam_set=inst81, etc. Don't be alarmed by the NaN losses displayed during training. This is normal, because training runs in blind mode, without any evaluation of model performance.
Run submission-final.ipynb in a standard Jupyter notebook environment and execute all cells. Prediction will take some time, because it has to load and evaluate 30 different sets of model weights. At the end, you'll get a submission.csv.gz file in the data directory.
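
The checkpoint counts quoted in the steps above fit together as simple arithmetic. A minimal sketch, using only the numbers stated in this README (the variable names are illustrative and are not taken from trainer.py):

```python
# Numbers quoted in the README steps; the names are illustrative only.
n_models = 3                # from --n_models=3
checkpoints_per_model = 10  # "10 checkpoints from step 10500 to step 11500"

# Every checkpoint of every model is a separate set of weights to load.
total_weight_sets = n_models * checkpoints_per_model
print(total_weight_sets)
```

This matches the 30 sets of model weights that the notebook loads during prediction.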


kaggle-web-traffic's Issues

Is there a version that can train the model on a CPU?

Unfortunately, I currently do not have GPU support.

Is this actually "encoder-decoder" vs. standard many-to-many?

Thanks for sharing your code, very helpful.

From your computational graph and model code, it looks like the "decoder" at each timestep takes 2 inputs:

the previous hidden state from the decoder (hidden state of the decoder GRU cell at the previous timestep)
a concatenated vector of inputs = [previous prediction, features, attention] where attention is optional.

The first decoder timestep gets the "encoded state", i.e. the last hidden state of the encoder. But later decoder timesteps do NOT get this encoded representation again. So the computational graph does not look like the one in the original RNN encoder-decoder paper or the one in the seq2seq encoder-decoder section of the Deep Learning book. In other words, it seems this model architecture is more like a standard many-to-many RNN than an encoder-decoder, right? That is, you do not feed in the encoded state "c" again?

Thanks

Cho Encoder-decoder:
https://arxiv.org/pdf/1406.1078.pdf
Fig 1. on pg. 2

Deep Learning Book:
http://www.deeplearningbook.org/contents/rnn.html
Section 10.4 Encoder-Decoder Sequence-to-Sequence Architectures
pg. 391 Fig 10.12

Your model:
https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md#model-core
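
The data flow described in this question can be sketched as a toy loop. This is not the code from model.py: `toy_cell`, `decode`, the averaging "projection", and the starting value of the previous prediction are all hypothetical stand-ins, chosen only to show that the encoded state enters once, as the initial decoder state.

```python
import math


def toy_cell(state, inputs):
    """Stand-in for a GRU cell: mixes the previous state with the current inputs."""
    drive = sum(inputs) / len(inputs)
    return [math.tanh(0.5 * s + 0.5 * drive) for s in state]


def decode(encoded_state, features, attention=None):
    """Run a decoder loop shaped like the one described in the question.

    encoded_state: last hidden state of the encoder, used only for initialisation.
    features: one feature vector per predicted timestep.
    attention: optional attention vectors, one per timestep.
    """
    state = list(encoded_state)  # the encoder state enters here and nowhere else
    prev_prediction = 0.0        # assumed start value, for illustration only
    predictions = []
    for t, feats in enumerate(features):
        step_inputs = [prev_prediction] + list(feats)
        if attention is not None:
            step_inputs += list(attention[t])
        state = toy_cell(state, step_inputs)
        prev_prediction = sum(state) / len(state)  # toy output projection
        predictions.append(prev_prediction)
    return predictions


preds = decode(encoded_state=[0.1, -0.2],
               features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(preds))
```

In the Cho et al. variant that the question refers to, the encoded vector would additionally be appended to `step_inputs` at every timestep.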

Answers for this Kaggle competition

May I ask whether you have the answers (ground truth) for this Kaggle competition?
I am not sure whether Kaggle releases the true results after a competition ends.
Thank you very much.

AttributeError: 'NoneType' object has no attribute 'set_index'

When I execute "python make_features.py data/vars --add_days=63", I get the following error:

Traceback (most recent call last):
  File "make_features.py", line 349, in <module>
    run()
  File "make_features.py", line 273, in run
    df, nans, starts, ends = prepare_data(args.start, args.end, args.valid_threshold)
  File "make_features.py", line 176, in prepare_data
    df = read_x(start, end)
  File "make_features.py", line 75, in read_x
    df = read_all()
  File "make_features.py", line 48, in read_all
    scraped = read_file('2017-08-15_2017-09-11_new')
  File "make_features.py", line 36, in read_file
    df = read_cached(file).set_index('Page')
AttributeError: 'NoneType' object has no attribute 'set_index'

I'm using Python 3.6.3, pandas 0.22.0. Thanks!
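
The traceback shows that `read_cached(file)` returned `None` before `.set_index('Page')` was called; the report does not say why. A minimal sketch of a guard that would surface the problem earlier, with a hypothetical `require_frame` helper and a fake reader standing in for the real `read_cached`:

```python
def require_frame(df, name):
    """Fail early with a readable message if a cached reader returned nothing."""
    if df is None:
        raise FileNotFoundError(
            "no data loaded for %r; check that the input file exists "
            "in the data directory" % name
        )
    return df


def fake_read_cached(name):
    # Hypothetical stand-in for read_cached: pretend nothing could be loaded.
    return None


source_name = "2017-08-15_2017-09-11_new"
try:
    require_frame(fake_read_cached(source_name), source_name)
except FileNotFoundError as err:
    message = str(err)
    print(message)
```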

Update to TensorFlow 1.8

@Arturus Thanks for sharing. Do you have plans to update the TF version? I ran into problems updating from 1.4 to 1.8, because the CudnnGRU API changed a lot.

make_features.py

Dear Arturus:
I just ran "python make_features.py data/vars --add_days=63" but came across this error:

  File "make_features.py", line 310, in run
    dow_norm = features_days.dayofweek.values / week_period
AttributeError: 'numpy.ndarray' object has no attribute 'values'

How can I solve it?
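
The error message says that `features_days.dayofweek` is already a plain NumPy array in this environment, so it has no `.values` attribute. One version-agnostic workaround is to take `.values` only when it exists. The sketch below uses stand-in objects rather than pandas itself, and `to_plain_values` and `WithValues` are hypothetical names:

```python
def to_plain_values(obj):
    """Return obj.values when the attribute exists, otherwise obj itself."""
    return getattr(obj, "values", obj)


class WithValues:
    # Hypothetical stand-in for an index-like object that exposes .values
    def __init__(self, data):
        self.values = data


days_as_array = [0, 1, 2, 3]              # plays the role of a bare array
days_as_index = WithValues([0, 1, 2, 3])  # plays the role of an object with .values

print(to_plain_values(days_as_array) == to_plain_values(days_as_index))
```

With NumPy available, `np.asarray(features_days.dayofweek)` would serve the same purpose, since it accepts both an array and an index-like object.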

Which pandas version was used?

pandas==1.1.2 and pandas==1.0.5 do not work.

Did anyone else get an error when running the lag_indexes method in make_features.py?

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: DatetimeIndex(['2015-04-01', '2015-04-02', '2015-04-03', '2015-04-04',\n '2015-04-05',\n ...\n '2015-06-26', '2015-06-27', '2015-06-28', '2015-06-29',\n '2015-06-30'],\n dtype='datetime64[ns]', length=92, freq=None). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

Unknown: CUDNN_STATUS_EXECUTION_FAILED, what is wrong? Thanks

UnknownError Traceback (most recent call last)
/home/xxx/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:

/home/xxx/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1340 return self._call_tf_sessionrun(
-> 1341 options, feed_dict, fetch_list, target_list, run_metadata)
1342

/home/xxx/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1428 self._session, options, feed_dict, fetch_list, target_list,
-> 1429 run_metadata)
1430

UnknownError: 2 root error(s) found.
(0) Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(953): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[{{node model/cudnn_gru/CudnnRNN}}]]
[[ConstantFoldingCtrl/model/absolute_difference/assert_broadcastable/AssertGuard/Switch_0/_30]]
(1) Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(953): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[{{node model/cudnn_gru/CudnnRNN}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

cudnn_gru ValueError when forward_split= True

First of all thank you for upgrading your code and having fixed all issues recently!

When I run train with --no_forward_split everything works ok, however when running train() to eval with forward_split=True I get a ValueError: Variable cudnn_gru_1/opaque_kernel does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?
Any idea on how to fix this or what is causing this issue?

Could it be related to the fact that now we are instantiating two models? train_model and forward_eval_model ?

Thank you

CPU version error

Dear Arturus:
I used tf.nn.dynamic_rnn + GRUCell to replace cuDNN, but came across this error:
"Input tensor 'rnn/while/Exit_3:0' enters the loop with shape (283, 128), but has shape (?, 128) after one iteration. To allow the shape to vary across iterations, use the shape_invariants argument of tf.while_loop to specify a less-specific shape."
when my code runs the loop "_, _, _, targets_ta, outputs_ta = tf.while_loop(cond_fn, loop_fn, loop_init)".
How can I solve it?

Question about .params_size() in model.py

Hello @Arturus! First, thank you for sharing your code, and congratulations on first place!

I'm digging through the code and am stuck on these few lines - https://github.com/Arturus/kaggle-web-traffic/blob/master/model.py#L72

Can you please clarify what params_size is? According to the TF docs, it returns the "size of the opaque parameter buffer". What does that mean: a hyperparameter size, a memory limit, or the batch size that can fit into my particular GPU's memory?

A follow-up question: why do we need to check the two values returned from the same function, build_rnn?

Thank you in advance!
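
As a rough illustration of what a parameter buffer holds: cuDNN packs all weights and biases of the RNN into one buffer, so its size depends on the layer dimensions and not on the batch size. The sketch below counts GRU parameters under the assumption of three gates with separate input and recurrent bias vectors; `gru_param_count` is a hypothetical helper, and the size TensorFlow actually reports may differ from this estimate.

```python
def gru_param_count(input_size, num_units, num_layers=1):
    """Estimate GRU weights plus biases, assuming two bias vectors per gate."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Input and recurrent weight matrices for the 3 gates.
        weights = 3 * num_units * (layer_input + num_units)
        # Separate input and recurrent biases for the 3 gates.
        biases = 3 * 2 * num_units
        total += weights + biases
        # Deeper layers consume the previous layer's output.
        layer_input = num_units
    return total


print(gru_param_count(input_size=10, num_units=4))
```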

Run submission-final for only one model

First of all, thanks for the excellent code. Now the problem:
Since I only have one GPU (Nvidia Quadro), I was able to run only one model by means of:

python trainer.py --name s32 --hparam_set=s32 --n_models=1 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500

When I try to execute the submission-final file, I changed the corresponding cell as follows:

for tm in range(1):
    tf.reset_default_graph()
    t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
                    n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

to account for only one model's checkpoints. However, I am getting an error that I cannot solve:

#tf.reset_default_graph()
#preds = predict(paths, default_hparams(), back_offset=0,
#                n_models=3, target_model=0, seed=2, batch_size=2048, asgd=True)
t_preds = []
for tm in range(1):
    tf.reset_default_graph()
    t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
                    n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

ValueError Traceback (most recent call last)
in ()
6 tf.reset_default_graph()
7 t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
----> 8 n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

~/projects/kaggle-web-traffic/trainer.py in predict(checkpoints, hparams, return_x, verbose, predict_window, back_offset, n_models, target_model, asgd, seed, batch_size)
691 else:
692 var_list = None
--> 693 saver = tf.train.Saver(name='eval_saver', var_list=var_list)
694 x_buffer = []
695 predictions = None

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in __init__(self, var_list, reshape, sharded, max_to_keep, keep_checkpoint_every_n_hours, name, restore_sequentially, saver_def, builder, defer_build, allow_empty, write_version, pad_step_number, save_relative_paths, filename)
1216 self._filename = filename
1217 if not defer_build and context.in_graph_mode():
-> 1218 self.build()
1219 if self.saver_def:
1220 self._check_saver_def()

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in build(self)
1225 if context.in_eager_mode():
1226 raise ValueError("Use save/restore instead of build in eager mode.")
-> 1227 self._build(self._filename, build_save=True, build_restore=True)
1228
1229 def _build_eager(self, checkpoint_path, build_save, build_restore):

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in _build(self, checkpoint_path, build_save, build_restore)
1249 return
1250 else:
-> 1251 raise ValueError("No variables to save")
1252 self._is_empty = False
1253

ValueError: No variables to save

Sorry for the question and thanks in advance for your comment.

Error reproducing competition results

I am trying to reproduce the competition results based on the instructions in the README.

I download and unzip the files from the kaggle competition into the data/ folder
I run the command python make_features.py data/vars --add_days=63 which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl and the directory vars/ in the data/ folder
I run the trainer python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500 and receive the following error:

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'

I am using a p3.2xlarge AWS instance with the Deep Learning AMI with Python 3.6.5 and Tensorflow-gpu==1.12.0

If I downgrade to TF-GPU 1.10, I still get the same error.

How can I resolve this?
Full output from train command

in make_features.py: TypeError: Cannot cast Index to dtype M8[D]

I'm sure the issue is just that I don't have the correct versions of the dependencies. Can you update requirements.txt to state the dependency versions for pandas, numpy, and scipy? Otherwise I'm afraid I will have to play the guessing game.

got an unexpected keyword argument 'input_size'

When I run the command "python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500", an error occurs:
/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "trainer.py", line 776, in <module>
    train(**param_dict)
  File "trainer.py", line 514, in train
    all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
  File "trainer.py", line 471, in create_model
    train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 371, in __init__
    transpose_output=False)
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 72, in make_encoder
    static_p_size = cuda_params_size(build_rnn)
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 46, in cuda_params_size
    cuda_model = cuda_model_builder()
  File "/content/drive1/Codes/kaggle/Web_traffic_prediction/kaggle-web-traffic-master/model.py", line 70, in build_rnn
    dropout=hparams.encoder_dropout if is_train else 0, seed=seed)
TypeError: __init__() got an unexpected keyword argument 'input_size'
After that I checked the documentation and source code of TensorFlow and found that the parameter "input_size" actually exists in the definition of CudnnGRU. Can anybody tell me why this could happen? Thanks

Are holidays not used in the final version?

Hi. I saw functions like make_holidays in your code, and in the readme you also said that holidays are an important feature. But it seems you stopped using the holidays by country. Why?

'CUDNN_STATUS_EXECUTION_FAILED' occurs

Hi, when I run the code on my server (4x V100, CUDA 9.0, cuDNN 7.0), this error occurs.
Could you please help me?
Which versions of CUDA and cuDNN do you use?

`/home/admin/algomodule/test/kaggle-web-traffic# python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500
WARNING:tensorflow:From /home/admin/algomodule/test/kaggle-web-traffic/model.py:144: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-10-02 06:00:37.510047: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-10-02 06:00:37.909980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:37.911006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.047527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.048568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:09.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.179680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.180730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0a.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.319747: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.320794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0b.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.320867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-10-02 06:00:40.205535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 06:00:40.205600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-10-02 06:00:40.205610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2019-10-02 06:00:40.205616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2019-10-02 06:00:40.205631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2019-10-02 06:00:40.205641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2019-10-02 06:00:40.205992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14941 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
2019-10-02 06:00:40.508989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14941 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
2019-10-02 06:00:40.811745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14941 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0a.0, compute capability: 7.0)
2019-10-02 06:00:41.114312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14941 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
1: 0%| | 0/566 [00:00<?, ?it/s]2019-10-02 06:00:47.758076: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.770054: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.782300: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "trainer.py", line 786, in <module>
train(**param_dict)
File "trainer.py", line 599, in train
step = trainer.train_step(sess, epoch)
File "trainer.py", line 251, in train_step
results = self._metric_step(Stage.TRAIN, ops, sess, epoch, summary_every=20)
File "trainer.py", line 235, in _metric_step
results = sess.run(ops)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'm_0/cudnn_gru/CudnnRNN', defined at:
File "trainer.py", line 786, in <module>
train(**param_dict)
File "trainer.py", line 520, in train
all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
File "trainer.py", line 474, in create_model
train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 342, in __init__
transpose_output=False)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 65, in make_encoder
rnn_out, (rnn_state,) = cuda_model(inputs=rnn_time_input)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 412, in call
training)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 487, in _forward
seed=self._seed)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 922, in _cudnn_rnn
outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 115, in cudnn_rnn
is_training=is_training, name=name)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]`

Cannot colocate nodes 'global_norm/L2Loss' and 'gradients/CudnnRNN_grad/CudnnRNNBackprop'

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'global_norm/L2Loss' and 'gradients/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.

Cannot colocate nodes `m_2/global_norm/L2Loss` and `m_2/gradients/m_2/CudnnRNN_grad/CudnnRNNBackprop`

Windows 10

TensorFlow 1.4

Cannot colocate nodes m_2/global_norm/L2Loss and m_2/gradients/m_2/CudnnRNN_grad/CudnnRNNBackprop because no device type supports both of those nodes and the other nodes colocated with them

Colocation Debug Info:

Colocation group had the following types and devices:

CudnnRNNBackprop: GPU

Identity:

L2Loss: CPU

 [[Node: m_2/global_norm/L2Loss = L2Loss[T=DT_FLOAT, _class=["loc:@m_2/gradients/m_2/CudnnRNN_grad/CudnnRNNBackprop"], _device="/device:GPU:0"](m_2/gradients/m_2/CudnnRNN_grad/tuple/control_dependency_3)]]

The training data is missing

I can't get the training data from the URL https://www.kaggle.com/c/web-traffic-time-series-forecasting/data; it seems to return a 404. Could you please upload the files? Thanks.

requirements.txt file with all versions

Hi Arturus,

Can you please update the requirements.txt file with the specific versions you used for this project? Please also mention which Python version you used to build it.
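
Until pinned versions are published, it can help to record the versions installed in an environment when reporting an error. A minimal sketch using the standard library's `importlib.metadata`, which requires Python 3.8 or newer (newer than the 3.6 mentioned in several reports here); `installed_versions` is a hypothetical helper and the package list is taken from the issues on this page:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["tensorflow", "pandas", "numpy", "scipy"])
for name, version in report.items():
    print(name, version or "not installed")
```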

No such file or directory: 'data/vars/feeder_meta.pkl'

run

python3.6 trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500

error

Traceback (most recent call last):
  File "trainer.py", line 776, in <module>
    train(**param_dict)
  File "trainer.py", line 416, in train
    inp = VarFeeder.read_vars("data/vars")
  File "/gruntdata/junlong.qjl/kaggle-web-traffic/feeder.py", line 98, in read_vars
    with open(_meta_file(path), mode='rb') as file:
FileNotFoundError: [Errno 2] No such file or directory: 'data/vars/feeder_meta.pkl'
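
The traceback shows the trainer reading data/vars/feeder_meta.pkl, and the README says make_features.py writes its output into data/vars, so the file is presumably missing because that step did not run or did not finish. A minimal pre-flight check, assuming only the path from the traceback; `missing_inputs` is a hypothetical helper and is not part of the repository:

```python
from pathlib import Path


def missing_inputs(vars_dir="data/vars", required=("feeder_meta.pkl",)):
    """Return the required files that are absent from vars_dir."""
    base = Path(vars_dir)
    return [name for name in required if not (base / name).is_file()]


missing = missing_inputs()
if missing:
    print("Run make_features.py first; missing:", ", ".join(missing))
```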

ImportError: cannot import name 'Collection'

When I ran python make_features.py data/vars --add_days=63, I got an error:

Traceback (most recent call last):
  File "make_features.py", line 10, in <module>
    from typing import Tuple, Dict, Collection, List
ImportError: cannot import name 'Collection'

I am using Python 3.5.
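
`typing.Collection` was added in Python 3.6, which explains the ImportError on 3.5. Upgrading Python is the straightforward fix. If that is not possible, a fallback import is one option; the sketch below assumes that `Sequence` is an acceptable stand-in for type annotations, which is an assumption and not something the repository states:

```python
try:
    from typing import Collection  # available from Python 3.6
except ImportError:
    # Assumption: Sequence is close enough for annotation purposes on 3.5.
    from typing import Sequence as Collection


def total_length(items: Collection) -> int:
    """Tiny example function annotated with the imported name."""
    return len(items)


print(total_length([1, 2, 3]))
```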

Error when reproduce the result : No OpKernel was registered to support Op 'CudnnRNNParamsSize'

Any tips for solving this error?

Python 3.6 + TensorFlow 1.4.1

GPU : Nvidia Tesla M40

TypeError: Cannot cast Index to dtype M8[D]

I get this error message.

Determined shape must either match input shape along split_dim exactly if fully specified, or be less than the size of the input along split_dim if not fully specified.

This happens during the training stage. Which two dimensions is this message referring to?

License!

Could you please include a license?

SMAC3 parameter tuning code

In hparams.py, there are many parameter sets which can be used. Running the trainer.py script with the command suggested in the readme uses the parameter set s32, and it is easy to input my own parameters and run with those instead.

But can you also provide the script you used for running SMAC3 on your models for tuning? I know you said the different parameter sets did not have too much difference in performance but would be interested to use this SMAC step as well.

Thanks

Can't generate submission as no EMA checkpoints saved

Hi,

I am trying to run the scripts as specified in the readme, and I get an error when generating the submission:

INFO:tensorflow:Restoring parameters from data/feeder.cpt
INFO:tensorflow:Restoring parameters from data/cpt/s32/cpt-1620
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
~/miniconda3/envs/basev1/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1322     try:
-> 1323       return fn(*args)
   1324     except errors.OpError as e:

~/miniconda3/envs/basev1/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301                                    feed_dict, fetch_list, target_list,
-> 1302                                    status, run_metadata)
   1303 

~/miniconda3/envs/basev1/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472             compat.as_text(c_api.TF_Message(self.status.status)),
--> 473             c_api.TF_GetCode(self.status.status))
    474     # Delete the underlying status object from memory otherwise it stays alive

NotFoundError: Key m_0/m_0/decoder_output_proj/kernel/ExponentialMovingAverage not found in checkpoint
     [[Node: eval_saver/RestoreV2_6 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_eval_saver/Const_0_0, eval_saver/RestoreV2_6/tensor_names, eval_saver/RestoreV2_6/shape_and_slices)]]
     [[Node: eval_saver/RestoreV2_1/_9 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_26_eval_saver/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

It seems that the --no_eval flag prevents the EMA checkpoints from being saved. Could that be the cause? (Specifically, the ema_eval_stages list is always empty unless do_eval is True.)

Thanks.

How to run SMAC

Can you explain how you ran SMAC to optimize the hyperparameters? How did you define the objective function to be minimized?

Dealing with sparsity

Hi, I have a question about how you dealt with sparsity.

In input_pipe.py, there are parameters such as train_completeness_threshold, which determines how many zeros are allowed. It looks like the default for this value is 1. Further down in the code, there is:
self.max_train_empty = int(round(train_window * (1 - train_completeness_threshold)))
So with the default value of 1, max_train_empty defaults to 0, i.e. the randomly cropped time series must be completely filled (no missing values) in order to be used in training.

So is this what you did to get your best results: discard any time series crop that had holes in it?

Of the ~145 thousand time series in train_1.csv, it looks like about 2/3 are dense (no missing values). Any random crop of a dense series remains dense, and a random crop of a series with holes may still land on a dense portion. So I guess that even with max_train_empty = 0 you still get to use most of the data, right?
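
To make the arithmetic in the question concrete, here is a small sketch that reuses the formula quoted above. The function crop_is_usable is hypothetical and only illustrates the idea. It assumes missing days are encoded as zeros and that a crop is rejected when it has more empty days than max_train_empty, which may differ from the actual filtering in input_pipe.py. The window lengths are illustrative, not the repository's defaults.

```python
def max_train_empty(train_window, train_completeness_threshold):
    """Formula quoted from input_pipe.py: allowed number of empty days per crop."""
    return int(round(train_window * (1 - train_completeness_threshold)))


def crop_is_usable(crop, train_completeness_threshold=1.0):
    """Hypothetical filter: accept a crop if it has few enough zero days."""
    empty_days = sum(1 for value in crop if value == 0)
    return empty_days <= max_train_empty(len(crop), train_completeness_threshold)


# With a threshold of 1, no empty days are tolerated at all.
print(max_train_empty(100, 1.0))                  # 0
# Lowering the threshold to 0.9 tolerates 10 empty days in a 100-day window.
print(max_train_empty(100, 0.9))                  # 10
# A 4-day crop with one hole fails at threshold 1 but passes at 0.75.
print(crop_is_usable([5, 0, 3, 7]))               # False
print(crop_is_usable([5, 0, 3, 7], 0.75))         # True
```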

Traceback (most recent call last):
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 757, in astype
    dtype=dtype)
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 308, in __new__
    dtype=dtype, **kwargs)
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\datetimes.py", line 303, in __new__
    int_as_wall_time=True)
  File "D:\Miniconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 376, in _from_sequence
    ambiguous=ambiguous, int_as_wall_time=int_as_wall_time)
  File "D:\Miniconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 1720, in sequence_to_dt64ns
    dtype = _validate_dt64_dtype(dtype)
  File "D:\Miniconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 2016, in _validate_dt64_dtype
    .format(dtype=dtype))
ValueError: Unexpected value for 'dtype': 'datetime64[D]'. Must be 'datetime64[ns]' or DatetimeTZDtype'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "make_features.py", line 349, in <module>
    run()
  File "make_features.py", line 273, in run
    df, nans, starts, ends = prepare_data(args.start, args.end, args.valid_threshold)
  File "make_features.py", line 176, in prepare_data
    df = read_x(start, end)
  File "make_features.py", line 75, in read_x
    df = read_all()
  File "make_features.py", line 46, in read_all
    df = read_file('train_2')
  File "make_features.py", line 37, in read_file
    df.columns = df.columns.astype('M8[D]')
  File "D:\Miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 760, in astype
    raise TypeError(msg.format(name=type(self).__name__, dtype=dtype))
TypeError: Cannot cast Index to dtype M8[D]

How can I solve this?

OS: Windows 10
Python version: 3.6.8
numba: 0.42.0
numpy: 1.16.2
pandas: 0.24.1
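
The first traceback gives the reason: this pandas version accepts only datetime64[ns] for a datetime index, so df.columns.astype('M8[D]') in read_file fails. One possible workaround, untested against the rest of the pipeline, is to parse the column labels with pd.to_datetime instead. The sketch below uses a tiny made-up frame whose shape is an assumption about the competition file (pages as rows, date strings as column labels). Pinning an older pandas release may also work, though the version the author used is not stated in this thread.

```python
import pandas as pd

# Hypothetical two-day sample; the real file has one row per page.
df = pd.DataFrame(
    [[10, 12]],
    index=["Some_page"],
    columns=["2015-07-01", "2015-07-02"],
)

# Replacement for: df.columns = df.columns.astype('M8[D]')
df.columns = pd.to_datetime(df.columns)

print(type(df.columns).__name__)
print(df.columns[0])
```

The result is a DatetimeIndex rather than day-resolution values, so any later code that depends on the exact dtype would need to be checked.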