
Comments (4)

zhihanyue commented on August 17, 2024
  1. Please read the code of model.encode() carefully; you can confirm that the real behavior matches our corrected figure.
  2. I have already explained why your code is inappropriate. Here is what I think you misunderstood:

For example, assume t is the first timestamp in the test set. For the first sample in the test set, the original input is [t-padding, t]. Your input, however, is [t, t], which feeds only a single timestamp into the encoder, resulting in poor performance and a biased distribution.

We confirm that our code ensures the following behavior:

  • Behavior 1: In our implementation, assuming t denotes the first timestamp in the test set, the input of the first sample in the test set is [t-padding, t], and the label of the last sample in the validation set is [t-pred_len, t-1]. Only the information in "the input" is used to generate the predicted values.

From Behavior 1, we conclude that there is no data leakage. If you think Behavior 1 does not hold, please show evidence rather than asking for a line-by-line explanation of the code. I'm sorry, but I don't have that much time.

Note that we stand behind Behavior 1. This means that if you find a specific case where Behavior 1 does not hold, please reopen the issue. Thanks.
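Behavior 1 can be illustrated with plain index arithmetic (a hypothetical sketch, not code from the repository; `input_window` and the numbers are illustrative): for each target timestamp t, the encoder input is the window [t-padding, t], so no timestamp after t is ever read.

```python
# Hypothetical illustration of Behavior 1: for each timestamp t,
# the causal input window is [t - padding, t] (clipped at 0),
# so no index greater than t is ever included.
def input_window(t, padding):
    """Return the (start, end) indices of the causal input for timestamp t."""
    return max(0, t - padding), t

padding = 200
t_first_test = 1000  # assume t is the first timestamp of the test set

start, end = input_window(t_first_test, padding)
assert end == t_first_test           # the window ends exactly at t
assert start == t_first_test - 200   # it reaches back into earlier data
assert end <= t_first_test           # nothing after t is used
```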

from ts2vec.

zhihanyue commented on August 17, 2024

The figure you provided is wrong. We perform causal inference: for timestamp t, the sliding window spans t-T+1 to t. Therefore, there is no information leakage. The following is the corrected figure.
[corrected figure]
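The sliding-window behavior described above can be sketched as follows (a simplified stand-in, not the actual model.encode implementation; the window length T and zero-padding scheme are assumptions for illustration):

```python
import numpy as np

def causal_sliding_windows(x, T):
    """For each timestamp t, build the causal window x[t-T+1 : t+1],
    left-padding with zeros when t-T+1 < 0. No value after t is used."""
    n = len(x)
    windows = np.zeros((n, T), dtype=x.dtype)
    for t in range(n):
        start = max(0, t - T + 1)
        seg = x[start:t + 1]
        windows[t, T - len(seg):] = seg  # right-align so the window ends at t
    return windows

x = np.arange(5, dtype=float)           # [0, 1, 2, 3, 4]
w = causal_sliding_windows(x, T=3)
# w[4] covers timestamps 2..4; w[0] is zero-padded except for x[0]
```

Each row t depends only on x[0..t], which is what makes the encoding causal.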


zhihanyue commented on August 17, 2024

I have checked your other issue xingyu617/SimTS_Representation_Learning#5 and I think I see the source of your misunderstanding. The code you provided is inappropriate.

# original code
# all_repr = model.encode(
#     data[:, :, n_covariate_cols:],
#     causal=True,
#     sliding_length=1,
#     sliding_padding=padding,
#     batch_size=128
# )
# print("all_repr",data.shape, all_repr.shape)
# train_repr = all_repr[:, train_slice]
# valid_repr = all_repr[:, valid_slice]
# test_repr = all_repr[:, test_slice]

# your code
train_data = data[:, train_slice, n_covariate_cols:]
valid_data = data[:, valid_slice, n_covariate_cols:]
test_data = data[:, test_slice, n_covariate_cols:]
print("data shape:", train_data.shape, valid_data.shape, test_data.shape)

train_repr = model.encode(
    train_data,
    causal=True,
    sliding_length=1,
    sliding_padding=padding,
    batch_size=128
)
valid_repr = model.encode(
    valid_data,
    causal=True,
    sliding_length=1,
    sliding_padding=padding,
    batch_size=128
)
test_repr = model.encode(
    test_data,
    causal=True,
    sliding_length=1,
    sliding_padding=padding,
    batch_size=128
)

For example, assume t is the first timestamp in the test set. For the first sample in the test set, the original input is [t-padding, t]. Your input, however, is [t, t], which feeds only a single timestamp into the encoder, resulting in poor performance and a biased distribution.
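The difference between the two call patterns comes down to index arithmetic (a hypothetical sketch; the padding value and slice boundaries are illustrative, not taken from any dataset):

```python
# Why slicing before encoding changes the input windows.
# Let the full series have timestamps 0..999 and the test slice start at 800.
padding = 100
test_start = 800

# Original code: encode the whole series, then slice the representations.
# The window for the first test timestamp reaches back into earlier data:
orig_window = (test_start - padding, test_start)      # (700, 800)

# Modified code: slice the data first, then encode the test slice alone.
# Inside the slice the first test timestamp is local index 0, so its
# causal window is clipped to a single timestamp:
local_t = 0
sliced_window = (max(0, local_t - padding), local_t)  # (0, 0)

assert orig_window == (700, 800)
assert sliced_window == (0, 0)   # effectively [t, t]: only one timestamp
```

The earlier timestamps serve only as causal context for the window ending at t; they are never predicted from future values, so using them is not leakage.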


wiwi commented on August 17, 2024

(quoting zhihanyue's previous comment in full)

You mean that you use padding to separate the training set and the test set. However, I don't think I misunderstood, because the whole sequence is fed into eval_forecasting().
In train.py:

elif task_type == 'forecasting':
            out, eval_res = tasks.eval_forecasting(model, data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols)

Here, 'data' comes from:

elif args.loader == 'forecast_csv_univar':
        task_type = 'forecasting'
        data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols = datautils.load_forecast_csv(args.dataset, univar=True)
        train_data = data[:, train_slice]

If you still think I misunderstand, please point me to the specific code that ensures the training set and the test set are independent. Thanks.
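One way to settle this question empirically (a hypothetical test, not code from either repository; the toy running-mean encoder stands in for a causal model.encode) is to perturb only the test-period values and check that the train-slice representations are unchanged. For a truly causal encoder they must be bit-identical:

```python
import numpy as np

def toy_causal_encode(x):
    """Stand-in for a causal encoder: the representation at t depends
    only on x[0..t] (here, the running mean). A causal model.encode
    is claimed to have the same dependency structure."""
    return np.cumsum(x) / np.arange(1, len(x) + 1)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
train_slice = slice(0, 800)
test_slice = slice(800, 1000)

repr_full = toy_causal_encode(x)

x_perturbed = x.copy()
x_perturbed[test_slice] += 10.0        # change only the test period
repr_perturbed = toy_causal_encode(x_perturbed)

# If encoding is causal, the train representations are unaffected:
assert np.array_equal(repr_full[train_slice], repr_perturbed[train_slice])
```

Running the same perturbation test against the real encoder would directly confirm or refute Behavior 1.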

