
The Turing Change Point Dataset - A collection of time series for the evaluation and development of change point detection algorithms

Home Page: https://arxiv.org/abs/2003.06222

License: MIT License

change-detection changepoint change-point change-point-detection dataset


Turing Change Point Dataset


Welcome to the host repository of the Turing Change Point Dataset, a set of time series specifically collected for the evaluation of change point detection algorithms on real-world data. This dataset was introduced in this paper. For the repository containing the code used for the experiments, see TCPDBench.


Introduction

Change point detection focuses on accurately detecting moments of abrupt change in the behavior of a time series. While many methods for change point detection exist, past research has paid little attention to evaluating existing algorithms on real-world data. This work introduces a benchmark study and a dataset (TCPD) that are explicitly designed for the evaluation of change point detection algorithms. We hope that our work becomes a proving ground for the evaluation and development of change point detection algorithms that work well in practice.
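As a toy illustration of the task (this is not one of the algorithms evaluated in the paper, just a sketch), a single mean-shift change point can be located by scanning for the split index that minimizes a two-segment squared-error cost:

```python
def sse(seg):
    # sum of squared deviations from the segment mean
    if not seg:
        return 0.0
    mu = sum(seg) / len(seg)
    return sum((v - mu) ** 2 for v in seg)

def detect_single_changepoint(x):
    # return the 0-based split index t that minimizes the two-segment cost
    best_t, best_cost = None, float("inf")
    for t in range(1, len(x)):
        cost = sse(x[:t]) + sse(x[t:])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

series = [0.0] * 20 + [5.0] * 20  # abrupt mean shift at index 20
print(detect_single_changepoint(series))  # -> 20
```

Real detectors (see TCPDBench) handle multiple change points, noise, and model selection; this snippet only conveys the core idea of segment-cost minimization.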

This repository contains the code needed to obtain the time series in the dataset. For the benchmark study, see TCPDBench. Note that work based on the dataset should cite our paper:

@article{vandenburg2020evaluation,
        title={An Evaluation of Change Point Detection Algorithms},
        author={{Van den Burg}, G. J. J. and Williams, C. K. I.},
        journal={arXiv preprint arXiv:2003.06222},
        year={2020}
}

The annotations are stored in the annotations.json file, which is the same as that used in the experiments (see here). Annotations are organised in a JSON object by dataset name and annotator id, and use 0-based indexing. See the TCPDBench repository for more information on extending the benchmark with your own methods or datasets.

Getting Started

Many of the time series in the dataset are included in this repository. However, due to licensing restrictions, some series cannot be redistributed and need to be downloaded locally. We've added a Python script and a Makefile to make this process as easy as possible. There is also a Dockerfile to facilitate reproducibility.

Using Docker

To build the dataset using Docker, first build the docker image:

$ docker build -t tcpd https://github.com/alan-turing-institute/TCPD.git

then build the dataset:

$ docker run -i -t -v /path/to/where/you/want/the/dataset:/TCPD/export tcpd

Using the command line

To obtain the dataset, please run the following steps:

  1. Clone the GitHub repository and change to the new directory:

    $ git clone https://github.com/alan-turing-institute/TCPD
    $ cd TCPD
    
  2. Make sure you have Python (v3.2 or newer) installed, as well as virtualenv:

    $ pip install virtualenv
    
  3. Next, use either of these steps:

    • To obtain the dataset using Make, simply run:

      $ make
      

      This command will download all remaining datasets and verify that they match the expected checksums.

    • If you don't have Make, you can obtain the dataset by manually executing the following commands:

      $ virtualenv ./venv
      $ source ./venv/bin/activate
      $ pip install -r requirements.txt
      $ python build_tcpd.py -v collect
      

      If you wish to verify the downloaded datasets you can run:

      $ python ./utils/check_checksums.py -v -c ./checksums.json -d ./datasets
      
  4. It may be convenient to export all dataset files to a single directory. This can be done using Make as follows:

    $ make export
    

All datasets are stored in individual directories inside the datasets directory and each has its own README file with additional metadata and sources. The data format used is JSON and each file follows the JSON Schema provided in schema.json.
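For illustration, the checksum verification step can be sketched in a few lines of Python. Note that this is a minimal sketch: the hash algorithm and the checksum-mapping layout below are assumptions for illustration, not the actual implementation of utils/check_checksums.py or the layout of checksums.json.

```python
import hashlib
import json
import os
import tempfile

def file_md5(path):
    # hash the file contents in chunks to keep memory use low
    # (MD5 is an assumption; the real script may use a different algorithm)
    h = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(checksums, dataset_dir):
    # compare each recorded checksum against the file on disk;
    # return the names of files that do not match
    failures = []
    for fname, expected in checksums.items():
        if file_md5(os.path.join(dataset_dir, fname)) != expected:
            failures.append(fname)
    return failures

# demo with a temporary "dataset" file
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "example.json")
    with open(path, "wb") as fp:
        fp.write(b'{"name": "example"}')
    checksums = {"example.json": file_md5(path)}
    print(verify(checksums, d))  # -> [] (all files match)
```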

Using the data

For your convenience, example code to load a dataset from the JSON format into a data frame is provided in the examples directory.
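As a rough sketch of what such a loader does, the snippet below parses an inline example. The field names here ("time", "series", "raw", and so on) are assumptions modeled on the dataset format; consult schema.json and the examples directory for the authoritative versions.

```python
import json

# inline example whose field names are assumptions modeled on the
# TCPD JSON format; see schema.json for the authoritative definition
example = json.loads("""
{
  "name": "example",
  "n_obs": 4,
  "n_dim": 1,
  "time": {"index": [0, 1, 2, 3]},
  "series": [{"label": "V1", "type": "float", "raw": [1.0, 1.5, 9.0, 9.2]}]
}
""")

def to_columns(data):
    # turn the per-series "raw" arrays into a {label: values} mapping,
    # keyed alongside the shared time index
    cols = {"t": data["time"]["index"]}
    for s in data["series"]:
        cols[s["label"]] = s["raw"]
    return cols

print(to_columns(example))  # -> {'t': [0, 1, 2, 3], 'V1': [1.0, 1.5, 9.0, 9.2]}
```

The resulting column mapping can be passed directly to a data-frame constructor such as `pandas.DataFrame`.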

Implementations of various change point detection algorithms that use these datasets are available in TCPDBench. A script to plot the datasets and detection results from TCPDBench is also provided in utils/plot_dataset.py.

The annotations are included in the annotations.json file. They are in the format:

{
  "<dataset>": {
      "<annotator_id>": [
          <change point index>,
          <change point index>,
          ...
      ],
      ...
  },
  ...
}

where the annotator_id is a unique ID for the annotator and the change point indices are 0-based. Please also see the documentation in TCPDBench for more information about using the dataset and benchmark in your own work.
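For example, a small helper can tally how many annotators marked each index. The snippet uses an inline sample that follows the structure above; the dataset name and annotator IDs are made up for illustration.

```python
import json
from collections import Counter

# inline sample following the annotations.json structure described above
# (dataset name and annotator IDs are hypothetical)
annotations = json.loads("""
{
  "example_dataset": {
    "1": [20, 50],
    "2": [20],
    "3": [19, 50]
  }
}
""")

def changepoint_votes(per_annotator):
    # count how many annotators marked each 0-based change point index
    votes = Counter()
    for indices in per_annotator.values():
        votes.update(indices)
    return votes

print(dict(changepoint_votes(annotations["example_dataset"])))
# -> {20: 2, 50: 2, 19: 1}
```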

License

The code in this repository is licensed under the MIT license. See the LICENSE file for more details. Individual data files are often distributed under different terms, see the relevant README files for more details. Work that uses this dataset should cite our paper.

Notes

If you find any problems or have a suggestion for improvement of this repository, please let us know as it will help us make this resource better for everyone. You can open an issue on GitHub or send an email to gertjanvandenburg at gmail dot com.

tcpd's People

Contributors

gjjvdburg


tcpd's Issues

Download of Data produces Validation Errors (Data Checksum?)

First of all, thank you for your effort in creating a change point detection benchmark. I really appreciate this as the benchmarking of algorithms is not standardized in most publications. Therefore, I wanted to use your dataset to test different algorithms.

I used a virtual environment to install all required packages as described in the README. Then I run
python build_tcpd.py -v collect within the activated virtual environment to start downloading the data.

After downloading the first files for the Apple dataset (AAPL.csv and apple.json appear in the directory ./datasets/apple), the script throws a ValidationError:

Running collect action for dataset: apple ... Traceback (most recent call last):
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 115, in <module>
    main()
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 110, in main
    func(name, script)
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 85, in collect_dataset
    return run_dataset_func(name, script, "collect")
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 81, in run_dataset_func
    func(output_dir=dir_path)
  File "./datasets\apple\get_apple.py", line 213, in collect
    write_patch(json_path, target_path=json_path)
  File "./datasets\apple\get_apple.py", line 68, in wrapper
    raise ValidationError(target)
tcpd.apple.ValidationError: ./datasets\apple\apple.json

If I delete the Apple dataset from TARGETS in build_tcpd.py, it proceeds with everything for bee_waggle_6. There it also notes that the checksum does not match. The script then throws the next ValidationError at bitcoin.

Running collect action for dataset: bee_waggle_6 ... Warning: Generated dataset ./datasets\bee_waggle_6\bee_waggle_6.json didn't match a known checksum. This is likely due to rounding differences caused by different system architectures. Minor differences in algorithm performance can occur for this dataset. 
ok
Running collect action for dataset: bitcoin ... Traceback (most recent call last):
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 114, in <module>
    main()
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 109, in main
    func(name, script)
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 84, in collect_dataset
    return run_dataset_func(name, script, "collect")
  File "C:\Users\Lucas\PycharmProjects\TCPD\build_tcpd.py", line 80, in run_dataset_func
    func(output_dir=dir_path)
  File "./datasets\bitcoin\get_bitcoin.py", line 131, in collect
    write_json(csv_path, target_path=json_path)
  File "./datasets\bitcoin\get_bitcoin.py", line 66, in wrapper
    raise ValidationError(target)
tcpd.bitcoin.ValidationError: Validating the file './datasets\bitcoin\bitcoin.json' failed.
Please raise an issue on the GitHub page for this project if the error persists.

Could you help me figure out what is going wrong here? I would really appreciate it. Maybe this has something to do with the checksum computation used for validation, which might be different on my PC. The venv is using Python 3.9.13 on a Windows machine.

Best regards
Lucas

Rank plots

How come the summary table of average scores (in the univariate case) and the rank plot differ?
amoc 0.702
binseg 0.706
pelt 0.689
These are the scores for the default cover metric. Why, then, does the rank plot show AMOC after PELT?
[screenshots of the summary table and rank plot attached]

Fails to validate Bee Waggle data

Hi,

upon running "make", I am getting the following error:

tcpd.bee_waggle_6.ValidationError: Validating the file './datasets/bee_waggle_6/bee_waggle_6.json' failed.
Please raise an issue on the GitHub page for this project if the error persists.
make: *** [collect] Error 1

Can I obtain this dataset manually?

Validation issue in ratner stock

I am getting the following error while downloading the datasets:

  File "/home/anand/dirichlet/TCPD/build_tcpd.py", line 115, in <module>
    main()
  File "/home/anand/dirichlet/TCPD/build_tcpd.py", line 110, in main
    func(name, script)
  File "/home/anand/dirichlet/TCPD/build_tcpd.py", line 85, in collect_dataset
    return run_dataset_func(name, script, "collect")
  File "/home/anand/dirichlet/TCPD/build_tcpd.py", line 81, in run_dataset_func
    func(output_dir=dir_path)
  File "./datasets/ratner_stock/get_ratner_stock.py", line 170, in collect
    write_patch(json_path, target_path=json_path)
  File "./datasets/ratner_stock/get_ratner_stock.py", line 65, in wrapper
    raise ValidationError(target)
tcpd.ratner_stock.ValidationError: ./datasets/ratner_stock/ratner_stock.json

Please see the downloaded dataset here

I am also facing issue #3

dataset json verification error

Running collect action for dataset: apple ... Traceback (most recent call last):
  File "build_tcpd.py", line 115, in <module>
    main()
  File "build_tcpd.py", line 110, in main
    func(name, script)
  File "build_tcpd.py", line 85, in collect_dataset
    return run_dataset_func(name, script, "collect")
  File "build_tcpd.py", line 81, in run_dataset_func
    func(output_dir=dir_path)
  File "./datasets/apple/get_apple.py", line 134, in collect
    write_json(csv_path, target_path=json_path)
  File "./datasets/apple/get_apple.py", line 65, in wrapper
    raise ValidationError(target)
tcpd.apple.ValidationError: ./datasets/apple/apple.json
make: *** [collect] Error 1
I am trying to run the make command and it gives this error. What could be the reason?
