bench-vldb20's Introduction

ImputeBench: Benchmark of Imputation Techniques in Time Series

ImputeBench implements over 15 advanced imputation techniques for missing blocks in time series. It evaluates their precision and runtime on various real-world time series datasets using different recovery scenarios. Technical details can be found in our PVLDB 2020 paper: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. The benchmark can be easily extended with new algorithms (C/C++, Python, or Matlab), datasets, and scenarios.

  • Imputation Algorithms: The original benchmark implements the following algorithms (in C++):

    • CDRec: Scalable Recovery of Missing Blocks in Time Series with High and Low Cross-Correlations, KAIS'20
    • DynaMMo: DynaMMo: mining and summarization of coevolving sequences with missing values, KDD'09
    • GROUSE: Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation, PMLR'16
    • ROSL: Robust Orthonormal Subspace Learning: Efficient Recovery of Corrupted Low-Rank Matrices, CVPR'14
    • SoftImpute: Spectral Regularization Algorithms for Learning Large Incomplete Matrices, JMLR'10
    • SPIRIT*: Streaming pattern discovery in multiple time-series, VLDB'05
    • STMVL: ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data, IJCAI'16
    • SVDImpute: Missing value estimation methods for DNA microarrays, BIOINFORMATICS'01
    • SVT: A Singular Value Thresholding Algorithm for Matrix Completion, SIAM J. OPTIM'10
    • TeNMF: Nonnegative Matrix Factorization for Time Series Recovery From a Few Temporal Aggregates, PMLR'17
    • TRMF: Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction, NIPS'16
    • TKCM*: Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series, EDBT'17
  • New Algorithms: We are continuously expanding the benchmark with new algorithms (using their original implementation):

    • DeepMVI: Missing Value Imputation on Multidimensional Time Series, PVLDB'21
    • MPIN: Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation, PVLDB'24
    • IIM*: Learning Individual Models for Imputation, ICDE'19
    • MRNN*: Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks, Trans. On Bio Eng.'19
    • BRITS: BRITS: Bidirectional Recurrent Imputation for Time Series, NeurIPS'18
    • SSA*: Model Agnostic Time Series Analysis via Matrix Estimation, Meas. Anal. Comput. Syst'18
  • Algorithms under Integration:

    • PriSTI: PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation, ICDE'23
    • DAMR: Dynamic Adjacency Matrix Representation Learning for Multivariate Time Series Imputation, SIGMOD'23
    • EDIT: Efficient and Effective Data Imputation with Influence Functions, PVLDB'23
  • Datasets: All the datasets used in this benchmark can be found here.

  • Missingness Patterns: The full list of recovery scenarios can be found here.

  • Notes: The algorithms marked with * cannot handle multiple incomplete time series. They produce results only for the following scenarios: miss_perc, ts_length, and ts_nbr.
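
When scripting runs, the restriction in the note above can be encoded as a small lookup. The following is only a sketch: the function name `supported_scenarios` is hypothetical, the identifiers are the command-line names listed in the Arguments section, and IIM is left out because its command-line name is not given in this document.

```python
# Scenario names as they appear in the -scen column of the Arguments section.
ALL_SCENARIOS = [
    "miss_perc", "ts_length", "ts_nbr",
    "miss_disj", "miss_over", "mcar", "blackout",
]

# The only scenarios for which the algorithms marked with * produce results.
STARRED_SCENARIOS = ["miss_perc", "ts_length", "ts_nbr"]

# Command-line names of the starred algorithms (SPIRIT, TKCM, MRNN, SSA).
# IIM is also starred, but its command-line name is not listed in this README.
STARRED_ALGORITHMS = {"spirit", "tkcm", "m-rnn", "ssa"}


def supported_scenarios(algorithm):
    """Return the scenarios for which `algorithm` is expected to produce results."""
    if algorithm in STARRED_ALGORITHMS:
        return list(STARRED_SCENARIOS)
    return list(ALL_SCENARIOS)


print(supported_scenarios("tkcm"))
print(supported_scenarios("cdrec"))
```

A helper like this makes it easy to skip algorithm and scenario pairs that would not yield any output.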

Prerequisites | Build | Execution | Extension | Contributors | Award | Citation


Prerequisites

  • Ubuntu 20 or Ubuntu 22 (including Ubuntu derivatives, e.g., Xubuntu) or the same distribution under WSL.
  • Clone this repository.
  • Mono: Install mono from https://www.mono-project.com/download/stable/ and restart your terminal.

Build

  • Build the Testing Framework using the installation script located in the root folder (takes several minutes):
    $ sh install_linux.sh
  • To evaluate the new algorithms built in Python (SSA, MRNN, BRITS, DeepMVI, and MPIN), please install the following packages (takes several minutes):
    $ sh install_extra.sh

Execution

    $ source bench-env/bin/activate
    $ cd TestingFramework/bin/Debug/
    $ mono TestingFramework.exe [arguments]

Arguments

-alg      -d            -scen
--------  ------------  ---------
cdrec     airq          miss_perc
dynammo   bafu          ts_length
grouse    chlorine      ts_nbr
rosl      climate       miss_disj
softimp   drift10       miss_over
svdimp    electricity   mcar
svt       meteo         blackout
stmvl     temp          all
spirit    bafu_red
tenmf     drift10_red
tkcm      all
trmf
all
--------  ------------  ---------
New algs
--------  ------------  ---------
ssa
m-rnn
brits

Results

All results and plots will be added to the Results folder. The accuracy results of all algorithms will be sequentially added for each scenario and dataset to: Results/.../.../error/. The runtime results of all algorithms will be added to: Results/.../.../runtime/. The plots of the recovered blocks will be added to the folder Results/.../.../recovery/plots/.
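
As a rough illustration of what an accuracy result measures, the sketch below computes RMSE and MAE over the imputed positions only. It assumes these are the reported metrics (they are the ones named in the project's issue tracker); the function name, the boolean mask convention, and the sample values are hypothetical and do not reflect the format of the files in the Results folder.

```python
import math


def imputation_errors(reference, recovered, missing_mask):
    """Return (rmse, mae) computed only where missing_mask is True."""
    diffs = [
        rec - ref
        for ref, rec, miss in zip(reference, recovered, missing_mask)
        if miss
    ]
    if not diffs:
        raise ValueError("missing_mask selects no positions")
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


# Toy series: positions 1 and 2 were removed and then imputed.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
recovered = [1.0, 2.5, 2.0, 4.0, 5.0]
missing_mask = [False, True, True, False, False]

rmse, mae = imputation_errors(reference, recovered, missing_mask)
print(f"RMSE={rmse:.4f} MAE={mae:.4f}")
```

Restricting the error to the masked positions keeps the observed values, which every algorithm reproduces exactly, from diluting the score.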

Execution examples

  1. Run a single algorithm (cdrec) on a single dataset (drift10) using one scenario (missing percentage)
    $ mono TestingFramework.exe -alg cdrec -d drift10 -scen miss_perc
  2. Run two algorithms (cdrec, spirit) on a single dataset (drift10) using one scenario (missing percentage)
    $ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc
  3. Run point 2 without runtime results
    $ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc -nort
  4. Run the whole VLDB'20 benchmark (all algorithms, all datasets, all scenarios, precision and runtime)
    $ mono TestingFramework.exe -alg all -d all -scen all

Warning: Running the whole benchmark takes a sizeable amount of time (up to 4 days, depending on the hardware) and produces up to 15GB of output files with all recovered data and plots unless stopped early.

  5. Create patterns of missing blocks on one complete dataset (airq) using one scenario (missing percentage)
    $ mono TestingFramework.exe -alg mvexport -d airq -scen miss_perc

Note: You must run each scenario separately on one or multiple datasets. Each time you execute one scenario, the Results folder will be overwritten with the new files.

  6. Additional command-line parameters
    $ mono TestingFramework.exe --help
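
Because the Results folder is overwritten on each execution, a driver script can run the scenarios one at a time and copy the output aside after each run. The sketch below is one possible way to do this and is not part of the benchmark: `build_commands` and `run_and_archive` are hypothetical helpers, the location of the Results folder is left as a parameter to be supplied by the user, and one dataset is passed per command as a conservative assumption. The script is assumed to be started from TestingFramework/bin/Debug/ with bench-env activated.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(algorithms, datasets, scenarios, runtime=True):
    """Return one argument list per (dataset, scenario) pair."""
    commands = []
    for dataset in datasets:
        for scenario in scenarios:
            cmd = [
                "mono", "TestingFramework.exe",
                "-alg", ",".join(algorithms),  # comma-separated, as in the examples above
                "-d", dataset,
                "-scen", scenario,
            ]
            if not runtime:
                cmd.append("-nort")  # skip runtime results, as in the -nort example
            commands.append(cmd)
    return commands


def run_and_archive(commands, results_dir, archive_dir):
    """Run each command, then copy results_dir aside before the next run overwrites it."""
    for index, cmd in enumerate(commands):
        subprocess.run(cmd, check=True)
        dataset = cmd[cmd.index("-d") + 1]
        scenario = cmd[cmd.index("-scen") + 1]
        target = Path(archive_dir) / f"{index:03d}_{dataset}_{scenario}"
        shutil.copytree(results_dir, target)


# Dry run: only print the commands that would be executed.
for cmd in build_commands(["cdrec", "spirit"], ["drift10"], ["miss_perc", "ts_length"]):
    print(" ".join(cmd))
```

Calling `run_and_archive` with the printed commands and the actual path of the Results folder would execute them in sequence; the dry run above is useful for checking the argument lists first.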

Parametrized execution

  • You can parametrize each algorithm using the command -algx. For example, you can run the svdimp algorithm with a reduction value of 4 on the drift10 dataset while varying the number of time series (scenario ts_nbr) as follows:
    $ mono TestingFramework.exe -algx svdimp 4 -d drift10 -scen ts_nbr
  • If you want to run some algorithms with default parameters and some with customized ones, you can use -alg and -algx together. For example, you can run the stmvl algorithm with its default parameters and the cdrec algorithm with a reduction value of 4 on the airq dataset while varying the number of time series (scenario ts_nbr) as follows:
    $ mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr

Remark: The command -algx cannot be executed in a group and thus must precede the name of each algorithm.
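
The remark above means that every parametrized algorithm needs its own -algx flag, while default-parameter algorithms can share a single -alg. A minimal sketch of assembling such an argument list is shown below; `build_args` is a hypothetical helper and not part of the benchmark.

```python
def build_args(default_algs, parametrized_algs, dataset, scenario):
    """Combine -alg (grouped) and -algx (one flag per algorithm) arguments."""
    args = []
    if default_algs:
        # Algorithms with default parameters can be grouped with commas.
        args += ["-alg", ",".join(default_algs)]
    for name, value in parametrized_algs.items():
        # -algx cannot be grouped, so it is repeated for each algorithm.
        args += ["-algx", name, str(value)]
    args += ["-d", dataset, "-scen", scenario]
    return args


command = ["mono", "TestingFramework.exe"] + build_args(
    ["stmvl"], {"cdrec": 4}, "airq", "ts_nbr"
)
print(" ".join(command))
# prints: mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr
```

The printed line matches the second parametrized example above.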


Extension


Contributors

Mourad Khayati and Zakhar Tymchenko.


Award

ImputeBench has received the VLDB 2020 Most Reproducible Paper Award.


Citation

@inproceedings{imputebench2020vldb,
 author    = {Mourad Khayati and Alberto Lerner and Zakhar Tymchenko and Philippe Cudr{\'{e}}{-}Mauroux},
 title     = {Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series},
 booktitle = {Proceedings of the VLDB Endowment},
 volume    = {13},
 number    = {5},
 year      = {2020}
}

bench-vldb20's Issues

BRITS implementation

Hi,

The README file for the repo says that the codebase supports a BRITS implementation, but I was unable to locate the corresponding files. Also, in the Arguments section, -alg doesn't seem to support a brits flag. Clarification on this would be appreciated.

Parikshit

raw results without running the repo?

Hello, this is not an issue at all. I just wanted to know if it is possible for you to share the blackout MAE and RMSE results without the need to run the experiments. I would appreciate it. Regards

Error on Running MRNN

Hi,

I tried to run the MRNN algorithm on the AIRQ dataset in the MCAR setting, but I got the following error:

    Traceback (most recent call last):
      File "", line 1, in
      File "testerMRNN.py", line 14, in mrnn_recov
        _, Recover_testX = M_RNN(trainZ, trainM, trainT, testZ, testM, testT)
      File "M_RNN.py", line 20, in M_RNN
        seq_length = len(trainZ[0,:,0])
    IndexError: too many indices for array

How can I resolve this issue?

export the datasets with missing values

Hi,

Is there a way to export the datasets with the generated missing values into files? What command/function should I call if I just want to add the missing data without running the algorithms? I want to take them into my own framework to compare my results with those of the benchmark.

Thanks!!
