
tcn's Introduction

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN)

This repository contains the experiments from the paper An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling by Shaojie Bai, J. Zico Kolter and Vladlen Koltun.

We specifically target a comprehensive set of tasks that have been repeatedly used to compare the effectiveness of different recurrent networks, and evaluate a simple, generic but powerful (purely) convolutional network on the recurrent nets' home turf.

Experiments are done in PyTorch. If you find this repository helpful, please cite our work:

@article{BaiTCN2018,
	author    = {Shaojie Bai and J. Zico Kolter and Vladlen Koltun},
	title     = {An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling},
	journal   = {arXiv:1803.01271},
	year      = {2018},
}

Domains and Datasets

Update: The code should be directly runnable with PyTorch v1.0.0 or above (v1.3.0 or above is strongly recommended). Older versions of PyTorch are no longer supported.

This repository contains benchmarks for the following tasks, with details explained in each sub-directory:

  • The Adding Problem with various T (we evaluated on T=200, 400, 600)
  • Copying Memory Task with various T (we evaluated on T=500, 1000, 2000)
  • Sequential MNIST digit classification
  • Permuted Sequential MNIST (based on Seq. MNIST, but more challenging)
  • JSB Chorales polyphonic music
  • Nottingham polyphonic music
  • PennTreebank [SMALL] word-level language modeling (LM)
  • Wikitext-103 [LARGE] word-level LM
  • LAMBADA [LARGE] word-level LM and textual understanding
  • PennTreebank [MEDIUM] char-level LM
  • text8 [LARGE] char-level LM

Some of the large datasets are not included in this repo; we use the observations package (easily installed via pip) to download them.

Usage

Each task is contained in its own directory, with the following structure:

[TASK_NAME] /
    data/
    [TASK_NAME]_test.py
    models.py
    utils.py

To run the TCN model on a task, simply run [TASK_NAME]_test.py (e.g. add_test.py). Hyperparameters can be tuned via command-line arguments, which are listed by the -h flag.
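
For example, for the adding problem (the flag names below are the ones that appear in the example output further down this page; treat the exact values as illustrative):

python add_test.py -h
python add_test.py --ksize 7 --levels 8 --nhid 30 --lr 0.004 --seq_len 400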

tcn's People

Contributors

jerrybai1995, kashif


tcn's Issues

Building a non-causal TCN

So in order to make this architecture non-causal, assuming the kernel size is 3, I just need to remove the chomps and make the padding equal to the dilation, right?
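
For reference, a minimal sketch of such a symmetric (non-causal) layer under that assumption (kernel size 3, padding equal to the dilation, no chomp): the output keeps the input length and each position sees both past and future context.

import torch
import torch.nn as nn

dilation = 2
conv = nn.Conv1d(16, 16, kernel_size=3, padding=dilation, dilation=dilation)  # no Chomp1d afterwards
x = torch.randn(8, 16, 100)   # (batch, channels, length)
print(conv(x).shape)          # torch.Size([8, 16, 100]) -- centered (non-causal) receptive field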

Question: layer sizes and time-series classification

Q1
I noticed that in all examples you use the same hidden layer size across your layers. I was wondering if you've tried altering this and instead using a more traditional V shape, where earlier layers have more filters, gradually mapping down to fewer and fewer in deeper layers.

Q2
In the adding problem you use the very last element of the linear layer following the TCN to obtain a single scalar output.

self.linear1 = nn.Linear(num_channels[-1], output_size)
y1 = tcn(x)
out = F.relu(self.linear1(y1[:, :, -1]))

I was wondering if it would make sense to instead learn a combination across all time steps like so:

self.linear1 = nn.Linear(num_channels[-1], output_size)
self.linear2 = nn.Linear(seq_len, 1)
y1 = tcn(x)
y2 = F.relu(self.linear1(y1.transpose(1, 2)))
y3 = self.linear2(y2.transpose(1, 2))[:, :, 0]

Here, we combine all filters from the last layer with a linear layer just like you did, but also combine across all time steps. Is there any reason for not doing so?

Thanks a lot!

figure 4 (b) of paper

Is Figure 4(b) of the paper for T=1000 or T=2000? If it's for T=1000, shouldn't the random-guess baseline loss be 0.02?

Causality

Hi,
In which part of the code do you make sure the network is causal? Is it Chomp1d?

Thanks a lot!
Amir
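
For context, causality comes from the combination of the Conv1d's padding and the chomp: the convolution is padded by (kernel_size - 1) * dilation, and Chomp1d then cuts that many elements off the right end of the output, so the output at time t depends only on inputs at or before t. A minimal sketch of the idea (not the repository's exact code):

import torch
import torch.nn as nn

k, d = 3, 2
pad = (k - 1) * d
conv = nn.Conv1d(16, 16, kernel_size=k, dilation=d, padding=pad)

x = torch.randn(8, 16, 100)
y = conv(x)[:, :, :-pad]   # chomp the trailing outputs -> y[..., t] only sees x[..., :t+1]
print(y.shape)             # torch.Size([8, 16, 100])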

Is the code for Residual Blocks provided?

I was wondering if the repository in its current state has code to build the residual blocks shown in the paper (Figure 1 (b)).

I'm trying to use your paper's insights to build CNN architectures for sequence modelling using keras and was a bit confused about implementing the shown residual block of 2 conv layers followed by a residual connection.
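
For anyone with the same question: the block in Figure 1(b) is essentially two dilated causal convolutions plus a residual connection, with a 1x1 convolution on the skip path only when the input and output channel counts differ. A simplified PyTorch sketch of that structure (not the repository's exact TemporalBlock, which additionally applies weight normalization):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two dilated causal convolutions with a residual connection (simplified sketch)."""
    def __init__(self, n_in, n_out, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation    # assumes kernel_size >= 2
        self.conv1 = nn.Conv1d(n_in, n_out, kernel_size, padding=self.pad, dilation=dilation)
        self.conv2 = nn.Conv1d(n_out, n_out, kernel_size, padding=self.pad, dilation=dilation)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)
        # 1x1 conv matches channel counts on the skip path when they differ
        self.downsample = nn.Conv1d(n_in, n_out, 1) if n_in != n_out else None

    def forward(self, x):
        y = self.drop(self.relu(self.conv1(x)[:, :, :-self.pad]))   # chomp -> causal
        y = self.drop(self.relu(self.conv2(y)[:, :, :-self.pad]))
        res = x if self.downsample is None else self.downsample(x)
        return self.relu(y + res)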

MNIST classification problem

Hi, I want to ask a simple question about the MNIST classification example. Images in MNIST are treated as sequences by flattening them to 1D, and then each image gets a probability distribution. But the images have no relation to one another, so what is the point of the TCN here when each sequence is processed separately? It looks like a fully connected layer. Did I misunderstand the procedure?

output for time series tasks

Hi Jerry, thanks for the nice work. I'm trying to use a TCN for a multivariate time-series task. Suppose two data points with different sequence lengths are x1 = [t1, t2, t3, t4] and x2 = [t1, t2, t3, t4, t5, t6], with corresponding per-timestep labels [y1, y2, y3, y4] and [y1, y2, y3, y4, y5, y6]. For mini-batch input I pad them to the same length: x1 = [0, 0, t1, t2, t3, t4], x2 = [t1, t2, t3, t4, t5, t6].

My question is: given x1 and x2, how should the model output a prediction for each timestep? For example, for x1 we want the model to output [z0, z1, z2, z3, z4, z5] and take only [z2, z3, z4, z5] for the binary cross-entropy loss against the true labels [y1, y2, y3, y4].

My current approach follows the MNIST example, modifying the model's forward method as:

def forward(self, inputs):
    """Inputs dimension (N, C_in, L_in)"""
    seq_len = inputs.shape[2]
    y1 = self.tcn(inputs)
    o = torch.cat([self.linear(y1[:, :, i]) for i in range(0, seq_len)], dim=1)
    return o

This will output the same length as the input, but I'm not sure whether this is the correct understanding of the TCN model, or whether there is a better approach.
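
As a side note, since nn.Linear acts on the last dimension, the per-timestep loop above can be replaced by a single call after a transpose; a sketch under the same assumed names (self.tcn and self.linear as in the snippet above):

def forward(self, inputs):
    """inputs: (N, C_in, L_in) -> (N, L_in, output_size), one prediction per timestep"""
    y1 = self.tcn(inputs)                    # (N, C_out, L_in)
    return self.linear(y1.transpose(1, 2))   # nn.Linear is applied to every timestep at once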

Thank you.

invalid index of a 0-dim tensor

Hello,

When I am running pmnist_test.py, I get the following error:

(Pdb) train_loss.data[0]
*** IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

It happens on line 98.
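
This comes from a PyTorch API change: indexing a 0-dim tensor with [0] is no longer allowed, so the scalar has to be read out with .item(), roughly:

loss_value = train_loss.item()   # replaces the old train_loss.data[0]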

need clarification about paper and implementation

Hi,

As written in the paper, the input and output should have the same length, and the output at time t depends on previous input values. When I look at the implementation of the adding problem, I see that the input is 2*T where T is 200, 300, 400, etc. However, the output is just a scalar. What is the explanation for this?

error

Hi,
I am a student. When running your code, a question came up. Your MNIST pixel task works very well on its standard dataset. If I have my own dataset, what changes should I make to your code, and how should I make them?
Best Regards

what if the input sequence length is not the same?

Hi, thanks for your code. This is not an issue, just my own confusion; sorry to put it here:
I want to use a TCN for speech recognition. I have speech recordings labeled with words, but the recordings do not have the same length (e.g. speech 1 is 10 seconds long, speech 2 is 5 seconds). How can I use a TCN for speech recognition, and how can I handle the varying input lengths?
Thanks, looking forward to your reply.

Recommendation for image to text

My goal is to train a model that can output sequences of text from image inputs. Using the IAM handwriting dataset for example, we would pass the model an image

[image]

and expect it to return "broadcast and television report on his". Historically, the common (i.e. recurrent) way to accomplish this would be an encoder (CNN) + decoder (LSTM) architecture like OpenNMT's implementation. I am interested in replacing the decoder with a TCN, but am unsure how to approach the image data. The CNN encoder will create a batch of N feature maps with reduced spatial dimensions (H', W')

[image]

The issue is a TCN expects 3D tensors (N, L, C) whereas each "timestep" of the image is 2D (N, H, W, C). Following the p-MNIST example in the paper, we could flatten the image into a 1D sequence with length H' x W'. Then the TCN would effectively snake through the pseudo-timesteps like below

[image]

However, if we want one prediction per timestep, it makes much more sense to define a left-to-right sequence instead of a snaking one, since that's the direction the text is depicted in the image. Did you experiment at all with image-to-text models, and if so, how did you choose to represent the images?

I also wonder about the loss function for training a TCN decoder. Assuming you divide the image width into more timesteps than your maximum expected sequence length, it seems like connectionist temporal classification (CTC) would be a good choice. Then you do not have to worry about alignment between the target sequence and model's prediction. For instance, "bbb--ee-cau--sssss----e" would be collapsed to "because" by combining neighboring duplicates and removing blanks. Do you agree or is there a different loss function you would suggest?
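
As a point of reference for the CTC idea, PyTorch provides nn.CTCLoss; a minimal, self-contained usage sketch (all shapes and names here are made up for illustration and are not taken from this repo):

import torch
import torch.nn as nn

T, N, C = 64, 4, 80                                        # decoder timesteps, batch size, alphabet size incl. blank (index 0)
log_probs = torch.randn(T, N, C).log_softmax(2)            # per-timestep log-probabilities, e.g. from a TCN decoder
targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # target character indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)   # alignment-free, as described above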

Repo does not have a license

To encourage exploration of TCN networks, this code needs a license, for example Mozilla Public License 2.0, that allows commercial and non-commercial entities to build upon this work.

Without such an explicit license, this code cannot be used and built upon by other entities without exposing them to legal risk.

Inference on Word CNN

How do I infer the next word/char with the word- or character-based TCN? Should I use get_batch like this?

data, targets = get_batch(data_source, i, args, seq_len=1, evaluation=False)
output = model(data)
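
For greedy next-token prediction, one possibility (the output shape is an assumption, based on the language-model forward shown further down this page) is to take the distribution at the final position:

output = model(data)                           # assumed shape (batch, seq_len, vocab_size)
next_token = output[:, -1, :].argmax(dim=-1)   # most likely next word/char given the context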

Thank you.

Importance of weight_norm

I am looking for an LSTM replacement that can easily be deployed with ONNX. As there are many issues with LSTMs in ONNX, I was thinking about using TCNs. However, it turns out that the PyTorch ONNX export module cannot find a suitable op for the weight_norm operation. My questions are:

  • how important is this to the model?
  • can it be safely removed or replaced with only a small loss in performance? (see the sketch below)
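
Regarding removal before export: after training, weight normalization is only a reparameterization of the convolution weights, so PyTorch's remove_weight_norm can fold it back into a plain weight tensor without changing the network's outputs. A hypothetical helper along those lines (the traversal below is an illustration, not part of this repo):

import torch
from torch.nn.utils import remove_weight_norm

def strip_weight_norm(model):
    """Fold weight_norm back into plain weights on every Conv1d, e.g. before ONNX export."""
    for module in model.modules():
        if isinstance(module, torch.nn.Conv1d):
            try:
                remove_weight_norm(module)
            except ValueError:
                pass  # this conv was not weight-normalized
    return model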

How to use a CNN for seq2seq tasks?

Hi. How do I use a CNN for seq2seq? I understand that a CNN can be used as an encoder. What about the decoder? I looked through the TensorFlow Transformer, where you iteratively generate one symbol at a time by feeding <start symbol, none, none, ...>, then <start symbol, 1st generated symbol, none, none, ...> and so on into the decoder, and use a mask to avoid backward information flow. Is that the same approach here?

Tuning hyperparameters

Hi,
I have a few questions about strategies for tuning the hyperparameters of a TCN. Like any other neural-network-based architecture, the performance of a TCN is sensitive to its hyperparameter values, whose near-ideal settings vary (sometimes significantly) from task to task. It is easy to think about tuning some hyperparameters, like kernel size and the number of levels, but not others. Hence my questions are:

  • Did you experiment with any normalization technique, like standard normalization, for the TCN? Does any normalization technique help with faster convergence to near-optimal performance? Here I am trying to get hold of tricks for TCNs similar to what exists for feedforward networks. For example, in feedforward networks, if I apply standard normalization, then in most cases a learning rate of 1e-4 yields good results.
  • Is there any hyperparameter search technique that works better for TCNs? Again, I am trying to get hold of something similar to what exists for feedforward networks, where searching in a log space usually yields good results. Specifically, I am wondering about ways to tune hyperparameters like the number of hidden units, the learning rate, and the choice of optimizer (a rough search sketch follows below).

Thanks!!
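
Not an authoritative answer, but a minimal log-space random-search sketch over the kinds of knobs mentioned above (the ranges are illustrative assumptions, not recommendations from the paper):

import random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-4, -2),           # log-uniform learning rate
        "nhid": random.choice([25, 50, 100, 150]),    # hidden units per level
        "ksize": random.choice([3, 5, 7]),            # kernel size
        "levels": random.choice([4, 6, 8]),           # number of residual levels
        "dropout": random.choice([0.0, 0.1, 0.2, 0.5]),
    }

configs = [sample_config() for _ in range(20)]
# each config would then be passed as flags to the corresponding [TASK_NAME]_test.py run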

How to reproduce results from the paper?

Is it just the test result of the last epoch using default parameters? I have tried to run add_test.py, and below are the results I get for the 10 epochs.

Test set: Average loss: 0.168699
Test set: Average loss: 0.001142
Test set: Average loss: 0.000922
Test set: Average loss: 0.000345
Test set: Average loss: 0.000143
Test set: Average loss: 0.000188
Test set: Average loss: 0.000121
Test set: Average loss: 0.000028
Test set: Average loss: 0.000244
Test set: Average loss: 0.000042

Which one should I use for benchmarking? In the paper, the result of TCN was 5.8e-5 but it seems like we can use 2.8e-5 or 4.2e-5 here.

confused about mnist_pixel task

Thanks for sharing. I'm confused about the TCN model for mnist_pixel. In the MNIST model, why is it okay to use just y1[:, :, -1] before the linear layer? Why discard the other timesteps?

Why is PyTorch's conv1d guaranteed to be causal?

In your paper, you mention that a TCN should be causal, and in Figure 1 the convolution indeed appears causal.
But in this implementation, the only tweak I see is Chomp1d, and you use conv1d directly.
Can we say that PyTorch's conv1d is causal?

Using TCN as an Encoder

How can we use the TCN as an encoder? How can we use the network to produce a single vector representation of the input like the context vector for RNNs? Do we take the last timestep of the output like here?

o = self.linear(y1[:, :, -1])
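
Taking the last timestep, as in the line above, is the natural choice: with causal convolutions that position has seen the entire input. Pooling over time is a possible alternative, sketched here with the same assumed names (y1 of shape (N, C, L)):

o_last = self.linear(y1[:, :, -1])     # last timestep: has access to the full history
o_mean = self.linear(y1.mean(dim=2))   # alternative: average the representation over time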

RNN/LSTM Baselines?

This is a great set of experiments! I'm wondering if the code for the RNN/LSTM baselines reported in the paper is available somewhere. At present, I only see code for the TCN model.

Thanks!

Results for adding problem not reproducible with default values

Hey guys,

Thanks very much for sharing the code for the paper.
I tried running add_test.py with no changes to the default values, and the loss does not converge. However, for some other random seeds it does converge.

OS: Ubuntu 16.04
PyTorch version: 1.0.0

Output:

Namespace(batch_size=32, clip=-1, cuda=True, dropout=0.0, epochs=10, ksize=7, levels=8, log_interval=100, lr=0.004, nhid=30, optim='Adam', seed=1111, seq_len=400)
Producing data...
Train Epoch:  1 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.531405
Train Epoch:  1 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.166126
Train Epoch:  1 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.171005
Train Epoch:  1 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.170864
Train Epoch:  1 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.166217
Train Epoch:  1 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.172447
Train Epoch:  1 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.166411
Train Epoch:  1 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167221
Train Epoch:  1 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.166980
Train Epoch:  1 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.170149
Train Epoch:  1 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.167781
Train Epoch:  1 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.173033
Train Epoch:  1 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.167806
Train Epoch:  1 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.176322
Train Epoch:  1 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.174221

Test set: Average loss: 0.162485

Train Epoch:  2 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.164098
Train Epoch:  2 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.165515
Train Epoch:  2 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.169491
Train Epoch:  2 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.170406
Train Epoch:  2 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.164345
Train Epoch:  2 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.171381
Train Epoch:  2 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.165580
Train Epoch:  2 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167373
Train Epoch:  2 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.165166
Train Epoch:  2 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.169122
Train Epoch:  2 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.167244
Train Epoch:  2 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.172299
Train Epoch:  2 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.166954
Train Epoch:  2 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.175234
Train Epoch:  2 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.174419

Test set: Average loss: 0.159353

Train Epoch:  3 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.162857
Train Epoch:  3 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.165053
Train Epoch:  3 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168938
Train Epoch:  3 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.170395
Train Epoch:  3 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.163797
Train Epoch:  3 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.171040
Train Epoch:  3 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164996
Train Epoch:  3 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167644
Train Epoch:  3 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164682
Train Epoch:  3 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168531
Train Epoch:  3 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.167023
Train Epoch:  3 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.172070
Train Epoch:  3 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.165484
Train Epoch:  3 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.174374
Train Epoch:  3 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.174143

Test set: Average loss: 0.159425

Train Epoch:  4 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.162641
Train Epoch:  4 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164765
Train Epoch:  4 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168638
Train Epoch:  4 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.170502
Train Epoch:  4 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.163475
Train Epoch:  4 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.170376
Train Epoch:  4 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164857
Train Epoch:  4 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167976
Train Epoch:  4 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164582
Train Epoch:  4 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168683
Train Epoch:  4 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166684
Train Epoch:  4 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.171560
Train Epoch:  4 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.165219
Train Epoch:  4 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.174026
Train Epoch:  4 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.173639

Test set: Average loss: 0.159309

Train Epoch:  5 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.162744
Train Epoch:  5 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164428
Train Epoch:  5 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168446
Train Epoch:  5 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.170423
Train Epoch:  5 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.163086
Train Epoch:  5 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.170119
Train Epoch:  5 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164834
Train Epoch:  5 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167641
Train Epoch:  5 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164558
Train Epoch:  5 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168754
Train Epoch:  5 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166589
Train Epoch:  5 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.171319
Train Epoch:  5 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.165145
Train Epoch:  5 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.173861
Train Epoch:  5 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.173357

Test set: Average loss: 0.159252

Train Epoch:  6 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.163019
Train Epoch:  6 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164251
Train Epoch:  6 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168384
Train Epoch:  6 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.170225
Train Epoch:  6 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.162895
Train Epoch:  6 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.170042
Train Epoch:  6 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164821
Train Epoch:  6 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167372
Train Epoch:  6 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164525
Train Epoch:  6 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168579
Train Epoch:  6 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166480
Train Epoch:  6 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.171241
Train Epoch:  6 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.165065
Train Epoch:  6 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.173829
Train Epoch:  6 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.173157

Test set: Average loss: 0.159345

Train Epoch:  7 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.163180
Train Epoch:  7 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164179
Train Epoch:  7 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168342
Train Epoch:  7 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.169995
Train Epoch:  7 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.162740
Train Epoch:  7 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.169979
Train Epoch:  7 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164783
Train Epoch:  7 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.167070
Train Epoch:  7 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164500
Train Epoch:  7 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168542
Train Epoch:  7 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166456
Train Epoch:  7 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.171119
Train Epoch:  7 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.164998
Train Epoch:  7 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.173744
Train Epoch:  7 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.172998

Test set: Average loss: 0.159525

Train Epoch:  8 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.163299
Train Epoch:  8 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164069
Train Epoch:  8 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168318
Train Epoch:  8 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.169810
Train Epoch:  8 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.162647
Train Epoch:  8 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.169944
Train Epoch:  8 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164746
Train Epoch:  8 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.166869
Train Epoch:  8 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164391
Train Epoch:  8 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168394
Train Epoch:  8 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166326
Train Epoch:  8 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.171027
Train Epoch:  8 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.164934
Train Epoch:  8 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.173725
Train Epoch:  8 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.172872

Test set: Average loss: 0.159705

Train Epoch:  9 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.163319
Train Epoch:  9 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164034
Train Epoch:  9 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168286
Train Epoch:  9 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.169652
Train Epoch:  9 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.162581
Train Epoch:  9 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.169919
Train Epoch:  9 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164722
Train Epoch:  9 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.166746
Train Epoch:  9 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164353
Train Epoch:  9 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168298
Train Epoch:  9 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166287
Train Epoch:  9 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.170995
Train Epoch:  9 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.164889
Train Epoch:  9 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.173729
Train Epoch:  9 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.172800

Test set: Average loss: 0.159804

Train Epoch: 10 [  3168/ 50000 (6%)]	Learning rate: 0.0040	Loss: 0.163326
Train Epoch: 10 [  6368/ 50000 (13%)]	Learning rate: 0.0040	Loss: 0.164024
Train Epoch: 10 [  9568/ 50000 (19%)]	Learning rate: 0.0040	Loss: 0.168285
Train Epoch: 10 [ 12768/ 50000 (26%)]	Learning rate: 0.0040	Loss: 0.169597
Train Epoch: 10 [ 15968/ 50000 (32%)]	Learning rate: 0.0040	Loss: 0.162569
Train Epoch: 10 [ 19168/ 50000 (38%)]	Learning rate: 0.0040	Loss: 0.169920
Train Epoch: 10 [ 22368/ 50000 (45%)]	Learning rate: 0.0040	Loss: 0.164698
Train Epoch: 10 [ 25568/ 50000 (51%)]	Learning rate: 0.0040	Loss: 0.166688
Train Epoch: 10 [ 28768/ 50000 (58%)]	Learning rate: 0.0040	Loss: 0.164336
Train Epoch: 10 [ 31968/ 50000 (64%)]	Learning rate: 0.0040	Loss: 0.168244
Train Epoch: 10 [ 35168/ 50000 (70%)]	Learning rate: 0.0040	Loss: 0.166240
Train Epoch: 10 [ 38368/ 50000 (77%)]	Learning rate: 0.0040	Loss: 0.170987
Train Epoch: 10 [ 41568/ 50000 (83%)]	Learning rate: 0.0040	Loss: 0.164852
Train Epoch: 10 [ 44768/ 50000 (90%)]	Learning rate: 0.0040	Loss: 0.173748
Train Epoch: 10 [ 47968/ 50000 (96%)]	Learning rate: 0.0040	Loss: 0.172743

Test set: Average loss: 0.159894

Why pad temporally at all?

Nice work! I'm researching time-series regression using machine learning, so I'm looking at LSTM-, TCN- and Transformer-based models, and I'm getting good results with your model.

One general question: I'm not sure I understand why we pad each layer of a TCN at all. I understand that it ensures each layer produces a sequence of the same length, so there's a benefit in that your predictions are aligned with your inputs. But it's very similar to initialising an AR(p) model with a vector of zeros when you predict forward: the initial predictions will all be "wrong" until the effect of the initial state has decayed out. LSTMs also have this issue; most applications seem to set the initial state per batch to zero, which results in transient errors at the start of the batch (some authors train a separate model to estimate the initial state, which I've had good success with). I would assume this affects training as well, and it seems sensible to mask out the start of the output sequence when calculating the loss, or the model may try to adapt to "fix" the impact of the wrong initial condition.

Certainly, when I train a regression-based TCN I can observe transient errors at the start of the prediction: the diagram below underpredicts for the first 96 samples (that's one day of 15-minute electricity consumption), then overpredicts for the first week before settling down. Interested in your thoughts.

Also, one general observation: the predictions from the TCN seem noisier than the LSTM's; I thought the long AR window might filter out more noise than it has. It's also quite sensitive to the learning rate; a low learning rate produces a very noisy output sequence.

[image]

Residual block: 1x1 convolution implementation

Hi! I'm learning your TCN architecture and I got stuck understanding the following part.
In the code:
https://github.com/locuslab/TCN/blob/master/TCN/tcn.py#L30-L31

You sequentially add a conv layer and a Chomp1d layer. I see that Chomp1d cuts off the last padding elements (which are redundant, as I understand it).

I assume that Chomp1d corresponds to the 1x1 conv described in the paper. But the code doesn't look like what the paper describes, because:

  1. In TemporalBlock, Chomp1d is added after each conv layer.
  2. In the paper the outputs are summed, whereas in the code the tensor is just cut.

My question is: is the 1x1 convolution from Fig 1(b) the same as Chomp1d? If yes, why does it differ from the scheme in the paper, and does that conceptually matter? If not, did you implement it somewhere, or is it worth trying?

I haven't worked with neural networks much, so I may be missing something common; I'm just trying to figure it out.

Thank you in advance!
Oleh

Installation question

I'm sorry for asking what I expect has a really obvious answer. I downloaded the zip file and I get the following results when I try add_test.py

(base) F:\PersystCode\Python\TCN-master\TCN\adding_problem>python add_test.py
Traceback (most recent call last):
File "add_test.py", line 5, in
from TCN.adding_problem.model import TCN
ModuleNotFoundError: No module named 'TCN'

Clearly I need to install the TCN module. I tried >python tcn.py install, but that didn't seem to do anything.
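
The ModuleNotFoundError just means the repository root (the directory containing the top-level TCN package) is not on Python's import path; nothing needs to be installed. One workaround, assuming the directory layout shown in the traceback above, is to add the root to sys.path before the import:

import os, sys
# repository root is two levels above adding_problem/ in the layout shown above
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))

from TCN.adding_problem.model import TCN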

Cutting off effective history when evaluating char_cnn model.

I don't understand why, at test time (or when evaluating the model on a validation set), we don't compute the loss on the whole sequence rather than only on the part of the sequence that is guaranteed sufficient history.
The model is not evaluated on the whole dataset but only on a sub-part; are the results reliable, or even comparable to other models (LSTM, etc.) that don't use this method?

default for cuda in word_cnn is True, but code expects it to be False

Lines 20-22 make the cuda arg True by default:

 parser.add_argument('--cuda', action='store_false',
                    help='use CUDA (default: True)')

Yet lines 60-62 test as if it expects args.cuda to be False by default:

if torch.cuda.is_available():
    if not args.cuda:
        print("WARNING: You have a CUDA device, so you should probably run with --cuda")

Maybe you could swap in the following for the argparser --cuda argument:

parser.add_argument('--cuda', action='store_false', default=False,
                    help='use CUDA (default: True)')

For people wanting to test drive the code on small machines (i.e. my laptop) it helps remove one snag. I can do a quick PR if you like. Thanks for sharing the code for these experiments, they are super useful!
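
If CPU is meant to be the default and --cuda opts in, the more conventional argparse pattern would be roughly the following (note that this flips the current default rather than just relabeling it):

parser.add_argument('--cuda', action='store_true', default=False,
                    help='use CUDA (default: False)')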

Weight initialization method

I notice that in your code you initialize all the weights as below. Is there any special reason for doing it this way? Why not initialize with Xavier or another method?
def init_weights(self):
    self.conv1.weight.data.normal_(0, 0.01)
    self.conv2.weight.data.normal_(0, 0.01)
    if self.downsample is not None:
        self.downsample.weight.data.normal_(0, 0.01)
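
For comparison, a Xavier-style initialization of the same layers would look roughly like this (whether it actually helps is task-dependent); it assumes the usual import torch.nn as nn:

def init_weights(self):
    nn.init.xavier_uniform_(self.conv1.weight)
    nn.init.xavier_uniform_(self.conv2.weight)
    if self.downsample is not None:
        nn.init.xavier_uniform_(self.downsample.weight)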

Incorrect default hyper-parameters in the code?

I am trying to reproduce the results for the polyphonic music (Nott dataset), but I am having trouble setting the correct TCN hyperparameters. Are the ones shown in Table 2 of the paper the ones I should use? If so, why are the default parameters set differently, e.g. kernel size 5 instead of 6 for Nott (or 3 for JSB)?

Also, when I manually set the parameters as shown in the paper, by using

--dropout 0.2 --clip 0.4 --ksize 6 --levels 4 --nhid 150

as parameters, the TCN model has 2M parameters, but in the paper the model size is given as roughly 1M. So now I am confused about whether this is actually the setting used to produce the paper's results.

scaling TCN to multiple GPUs

I am currently training a model that requires a very large number of levels (a long effective history) and I cannot fit it onto one GPU. How can we parallelize TCNs across multiple GPUs to fit a single model?

Can the TCN module be made faster?

I'm using your TCN module for a language modeling task. My code follows the structure of your char_cnn code. It works, but it is much slower than an LSTM network: each epoch with the TCN takes about 10 times longer. Do you know if the speed can be improved? Here is the forward method from the TCN class:

    def forward(self, x):
        emb = self.drop(self.encoder(x))
        y = self.tcn(emb.transpose(1, 2))
        o = self.decoder(y.transpose(1, 2))
        return o.contiguous()

Perhaps it is the transpose calls that are making the code slow?

mnist_pixel missing "processed" folder

The mnist_pixel script presumes that the data/mnist/processed folder exists.

I also ran into the PyTorch version issue where data[0] should be data.item(), or you get an IndexError: invalid index of a 0-dim tensor.

Use of spatial dropout

Could you clarify if spatial dropout is used? The paper suggests that it is, but the code seems to use standard dropout.

Where are the poly_music datasets located?

I haven't managed to find the datasets referenced in the utils inside the poly_music example.
The observations package provides only a jsb_chorales dataset, and it is not pre-processed for use by this code; I have tried to adapt it, but without success. Can you provide a link to download the pre-processed .mat data files?

Is TCN suitable for time series regression?

Hello,

Thank you for your great paper and sharing!

I'm wondering how to use a TCN to solve time-series regression problems. In my scenario, the data at each moment contains multiple variables, and each variable is a real number. For example, the data at time step 0 is something like vector_0 = <0.1, 0.2, 0.3, ...>, and I want to use the previous k vectors to predict the next vector.

I have developed an LSTM model for this problem. The input shape of the LSTM model is (batch_size, time_steps (k), input_size (length of each vector)), and the prediction is the last output of the LSTM. Then I calculate the MSE loss and backpropagate. How can I use a TCN to solve this problem?
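
For reference, a TCN can be wired up almost exactly like the LSTM described above: feed it (batch_size, input_size, k) with the variables as channels, and read off the last timestep. A rough sketch, assuming the TemporalConvNet module defined in TCN/tcn.py of this repo:

import torch
import torch.nn as nn
from TCN.tcn import TemporalConvNet   # assumed import path within this repo

class TCNRegressor(nn.Module):
    def __init__(self, input_size, hidden=30, levels=4, ksize=7):
        super().__init__()
        self.tcn = TemporalConvNet(input_size, [hidden] * levels, kernel_size=ksize, dropout=0.0)
        self.linear = nn.Linear(hidden, input_size)    # predict the next vector

    def forward(self, x):
        # x: (batch_size, input_size, k) -- channels are the variables, length is the history
        y = self.tcn(x)                  # (batch_size, hidden, k)
        return self.linear(y[:, :, -1])  # the last timestep has seen all k history steps

# training-step sketch:
# pred = model(history)                     # history: (batch_size, input_size, k)
# loss = nn.MSELoss()(pred, next_vector)    # next_vector: (batch_size, input_size)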

Best Regards

Shorter than seq_len sequences, padding from left?

Sorry, silly question..

Let's say my seq_len is 100 and these are time-ordered, so t0 < t1 < ... < t100, and I have a few instances in my batch which are shorter. So I'll naturally pad them with zeros from the left, i.e.:
batch_size=2

  • 0, 0, 0, t1 < t2 < ... < t97
  • 0, 0, 0, 0, t1 < t2 < ... < t96

Am I right in assuming that the TCN will read (and enforce causality) from left to right, i.e. that future time points always have to be to the right of older ones?
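
For reference: nn.Conv1d (and hence the TCN) convolves over the last dimension from left to right, so later timesteps do sit to the right, and left-padding keeps the real data at the most recent positions. A padding sketch, assuming inputs shaped (batch, channels, length):

import torch.nn.functional as F

# x: (batch, channels, current_length); pad only on the left of the time axis
x_padded = F.pad(x, (seq_len - x.size(-1), 0))   # the tuple is (pad_left, pad_right) for the last dimension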

Thanks in advance..

Causal Transposed Convolution

Hi,

Thanks for this great paper ... I am trying to use this architecture in an auto-encoder setting such that the encoder part is a stack of strided, dilated, causal conv layers, and I am now thinking about the decoder part.

In terms of up-sampling using transposed convolutions, does the same intuition apply for causal up-sampling (i.e. excluding reconstructions of the future part)? Or should we generate sample by sample without transposed conv layers?

With many thanks in advance
Best Regards

Does the TCN stride over time or over sequence length?

So, you stated that a TCN can be used as a drop-in replacement for an LSTM.

Let's assume I have a batch of images of shape N x T x C x H x W.
I reshape the images to be of size N x T x (-1). This is the x given to the forward function of the TCN. The TCN is initialized with number of inputs T, and the number of channels is [T] * num_levels_tcn. This means that effectively the TCN slides over (C x H x W), or am I misunderstanding something? I was under the impression (from the figures in the paper) that the TCN would slide over time.
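
For reference, nn.Conv1d (and therefore the TCN in this repo) convolves over the last dimension of an (N, C, L) tensor; to slide over time, the flattened image features should be the channels and T the length. A reshaping sketch under the shapes described above:

# images: (N, T, C, H, W)
N, T = images.shape[:2]
x = images.reshape(N, T, -1).transpose(1, 2)   # (N, C*H*W, T): channels = flattened features, length = T
# the TCN would then be built with num_inputs = C*H*W, and its convolutions slide over the T axis
y = tcn(x)                                     # hypothetical tcn module; output (N, num_channels[-1], T)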

About different sequence lengths

I have sequences of very different lengths: the shortest is about 100 words, the longest about 5000. I attempted to pad them with zeros to a common length of 5000, but the classification result is terrible. However, if I keep the original lengths and just use a batch size of 1, it works well. I don't know why this happens.

Is TCN suitable for text classification?

I have a dataset where the texts are very short, and it's a multi-label task with 100+ classes; most texts have length 2, 3, or 4. I tried to use a TCN, but the result is not good (worse than a plain CNN). My number of levels is 9 and the kernel size is 2. Can you give me some suggestions?
