Julia implementation of unsupervised learning methods for time series datasets. It provides functionality for clustering and aggregating, detecting motifs, and quantifying similarity between time series datasets.

License: MIT License



timeseriesclustering.jl's Introduction

TimeSeriesClustering is a Julia implementation of unsupervised learning methods for time series datasets. It provides functionality for clustering and aggregating, detecting motifs, and quantifying similarity between time series datasets. The software provides a type system for temporal data, and provides an implementation of the most commonly used clustering methods and extreme value selection methods for temporal data. It provides simple integration of multi-dimensional time-series data (e.g. multiple attributes such as wind availability, solar availability, and electricity demand) in a single aggregation process. The software is applicable to general time series datasets and lends itself well to a multitude of application areas within the field of time series data mining.

The TimeSeriesClustering package was originally developed to perform time series aggregation for energy systems optimization problems. Using representative periods reduces the number of time steps in the optimization model, which significantly reduces the computational complexity of these problems. The package was previously known as ClustForOpt.jl.

The package has three main purposes:

Provide a simple process of finding representative periods (reducing the number of observations) for time-series input data, with implementations of the most commonly used clustering methods and extreme value selection methods.
Provide an interface between representative period data and application (e.g. optimization problem) by having representative period data stored in a generalized type system.
Provide a generalized import feature for time series, where variable names, attributes, and node names are automatically stored and can then be used later when the reduced time series is used in the application at hand (e.g. in the definition of sets of the optimization problem).

In the domain of energy systems optimization, an example problem that uses TimeSeriesClustering for its input data is the package CapacityExpansion, which implements a scalable generation and transmission capacity expansion problem.

The TimeSeriesClustering package follows the clustering framework presented in Teichgraeber and Brandt, 2019. The package is actively developed, and new features are continuously added. For a reproducible version of the methods and data of the original paper by Teichgraeber and Brandt, 2019, please refer to v0.1 (including shape based methods such as k-shape and dynamic time warping barycenter averaging).

This package is developed by Holger Teichgraeber @holgerteichgraeber and Elias Kuepper @YoungFaithful.

Installation

This package runs under Julia v1.0 and higher. Install using:

import Pkg
Pkg.add("TimeSeriesClustering")

Documentation

Documentation (Stable): Please refer to this documentation for details on how to use the current version of TimeSeriesClustering. This is the documentation of the default version of the package. The default version is on the master branch.

Documentation (Development): If you would like to try the development version of TimeSeriesClustering, please refer to this documentation. The development version is on the dev branch.

See NEWS for significant breaking changes when updating from one version of TimeSeriesClustering to another.

Citing TimeSeriesClustering

If you find TimeSeriesClustering useful in your work, we kindly request that you cite the following paper (link):

  @article{Teichgraeber2019joss,
  author = {Teichgraeber, Holger and Kuepper, Lucas Elias and Brandt, Adam R},
  doi = {10.21105/joss.01573},
  journal = {Journal of Open Source Software},
  number = {41},
  pages = {1573},
  title = {TimeSeriesClustering: An extensible framework in Julia},
  volume = {4},
  year = {2019}
  }

If you find this package useful, our paper on comparing clustering methods for energy systems optimization problems may additionally be of interest.

Quick Start Guide

This quick start guide introduces the main concepts of using TimeSeriesClustering. The examples are taken from problems in the domain of scenario reduction for energy systems optimization. For more detail on the different functionalities that TimeSeriesClustering provides, please refer to the subsequent chapters of the documentation or the examples in the examples folder, specifically workflow_introduction.jl.

Generally, the workflow consists of three steps:

load data
find representative periods (clustering + extreme period selection)
optimization

Example Workflow

After TimeSeriesClustering is installed, you can load it with:

using TimeSeriesClustering

The first step is to load the data. The following example loads hourly wind, solar, and demand data for Germany (1 region) for one year.

ts_input_data = load_timeseries_data(:CEP_GER1)

The output ts_input_data is a ClustData data struct that contains the data and additional information about the data.

ts_input_data.data # a dictionary with the data.
ts_input_data.data["wind-germany"] # the wind data (choose solar, el_demand as other options in this example)
ts_input_data.K # number of periods
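To make the idea of a "number of periods" concrete, the sketch below splits one year of hourly values into periods of 24 time steps using only Base Julia. The synthetic series, the helper name to_periods, and the column-per-period matrix layout are assumptions for illustration; they are not taken from the package.

```julia
# Hypothetical helper: split a flat hourly series into periods of length T.
# Each column of the result is one period (assumed layout, for illustration only).
function to_periods(series::AbstractVector, T::Integer)
    K = length(series) ÷ T              # number of complete periods
    return reshape(series[1:K*T], T, K)
end

hourly = [sin(2π * h / 24) for h in 0:8759]   # synthetic year of hourly values
periods = to_periods(hourly, 24)
println(size(periods))                         # (24, 365)
```

With T = 24, a year of 8760 hourly values yields K = 365 daily periods.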

The second step is to cluster the data into representative periods. Here, we use k-means clustering and get 5 representative periods.

clust_res = run_clust(ts_input_data;method="kmeans",n_clust=5)
ts_clust_data = clust_res.clust_data

The ts_clust_data is a ClustData data struct, this time with clustered data (i.e. fewer periods, each of which is representative of a group of original periods).

ts_clust_data.data # the clustered data
ts_clust_data.data["wind-germany"] # the wind data. Note the dimensions compared to ts_input_data
ts_clust_data.K # number of periods
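For readers who want to see what the reduction in the number of periods means, here is a minimal k-means sketch in plain Julia that clusters the columns of a matrix with one period per column. It is not the package's implementation: the function name simple_kmeans, the deterministic initialization, and the synthetic data are all assumptions made for illustration.

```julia
using Statistics  # mean

# Minimal Lloyd-style k-means on the columns of a T x K matrix.
# Illustrative only: deterministic initialization and a fixed iteration cap.
function simple_kmeans(X::AbstractMatrix, n_clust::Integer; iterations::Integer=100)
    K = size(X, 2)
    # Deterministic start: evenly spaced columns serve as initial centers
    init = [1 + (c - 1) * (K ÷ n_clust) for c in 1:n_clust]
    centers = float.(X[:, init])
    assignments = zeros(Int, K)
    for _ in 1:iterations
        # Assignment step: nearest center by squared Euclidean distance
        new_assignments = [argmin([sum(abs2, X[:, k] .- centers[:, c]) for c in 1:n_clust])
                           for k in 1:K]
        new_assignments == assignments && break   # converged
        assignments = new_assignments
        # Update step: each center becomes the mean of its member periods
        for c in 1:n_clust
            members = findall(==(c), assignments)
            if !isempty(members)
                centers[:, c] = vec(mean(X[:, members]; dims=2))
            end
        end
    end
    # Weight of a representative period = number of original periods it stands for
    weights = [count(==(c), assignments) for c in 1:n_clust]
    return centers, assignments, weights
end

# Synthetic data: 365 daily profiles with 24 hourly values each
X = [sin(2π * t / 24) + (k % 5) for t in 1:24, k in 1:365]
centers, assignments, weights = simple_kmeans(X, 5)
println(size(X), " -> ", size(centers))   # (24, 365) -> (24, 5)
```

The weights sum to the original number of periods, which is what allows a downstream model to scale results from the representative periods back to the full year.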

If this package is used in the domain of energy systems optimization, the clustered input data can be used as input to an optimization problem. The optimization problem formulated in the package CapacityExpansion can be used with the data clustered in this example.

timeseriesclustering.jl's Issues

clust_data undefined dictionary entries

On master:

clust_data = load_timeseries_data(:CEP_GER1)
clust_data.data.keys

Leads to the following output:

 #undef                
 #undef                
    "solar-germany"    
 #undef                
    "wind-germany"     
 #undef                
 #undef                
 #undef                
 #undef                
 #undef                
 #undef                
 #undef                
 #undef                
 #undef                
 #undef                
    "el_demand-germany"

This does not seem to be an issue in any of the current implementations, but if we want to work with the keys (e.g. I am trying to find extremes among all wind periods), the #undef entries give errors.
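A note on the output above: for a standard Julia Dict, .keys is the internal storage field, and its unused slots print as #undef. The exported keys function iterates only over entries that exist. A minimal sketch, with a plain Dict and made-up values standing in for the data field:

```julia
# Stand-in for the `data` dictionary; the values are made up for illustration.
data = Dict(
    "solar-germany"     => [0.0, 0.4, 0.7],
    "wind-germany"      => [0.3, 0.5, 0.2],
    "el_demand-germany" => [0.8, 0.9, 0.6],
)

# keys(data) iterates only over entries that actually exist
all_keys = sort(collect(keys(data)))

# Select the wind series and take the maximum of each
wind_keys = filter(k -> startswith(k, "wind"), all_keys)
wind_max = Dict(k => maximum(data[k]) for k in wind_keys)
println(wind_max)
```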

OptVariable field with set name

Would it make sense to add a fourth field to OptVariable: sets?
Axes gives the values within each set, but I feel it would be useful to have something that names the sets explicitly. For example, I want to find the maximum slack variable and get the corresponding day (set K). Right now I have to hardcode the number of the set within axes (note for myself: get_index_inf(::Array{OptVariable}) ).

Maybe call the field axes_name.

Putting this out here, so we don't forget throughout break.
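A rough sketch of what the proposal could look like. The struct below is hypothetical and is not the package's OptVariable; the field layout, the helper argmax_along, and the example values are assumptions used only to show how a lookup by set name avoids a hardcoded dimension number.

```julia
# Hypothetical struct, not the package's OptVariable: it stores a name next to
# each axis so that a set can be looked up by name instead of by position.
struct NamedAxesVariable
    data::Array{Float64}
    axes::Vector{Any}            # values within each set, one entry per dimension
    axes_names::Vector{String}   # name of each set, e.g. "K"
end

# Return the value of the named set at the position where `data` is largest
function argmax_along(v::NamedAxesVariable, set_name::String)
    dim = findfirst(==(set_name), v.axes_names)
    dim === nothing && throw(ArgumentError("unknown set: $set_name"))
    idx = argmax(v.data)         # CartesianIndex for multi-dimensional arrays
    return v.axes[dim][idx[dim]]
end

slack = NamedAxesVariable(
    [0.0 0.2 0.1; 0.0 0.9 0.3],
    Any[["demand", "co2"], [1, 2, 3]],
    ["type", "K"],
)
println(argmax_along(slack, "K"))   # 2
```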

Loading GER_18 data errs

When I do

data_path=normpath(joinpath(dirname(@__FILE__),"..","data","TS_GER_18"))
ts_input_data = load_timeseries_data(data_path; T=24, years=[2016])

I get the following error:

┌ Error: The time_series dena21 has K=0 != K=366 of the previous
└ @ ClustForOpt ~/.julia/dev/ClustForOpt/src/utils/load_data.jl:80
ERROR: LoadError: BoundsError: attempt to access 0-element Array{Float64,1} at index [1:8784]
Stacktrace:
 [1] throw_boundserror(::Array{Float64,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:484
 [2] checkbounds at ./abstractarray.jl:449 [inlined]
 [3] getindex(::Array{Float64,1}, ::UnitRange{Int64}) at ./array.jl:735
 [4] #add_timeseries_data!#13(::Int64, ::Int64, ::Array{Int64,1}, ::Function, ::Dict{String,Array}, ::SubString{String}, ::DataFrames.DataFrame) at /home/holger/.julia/dev/ClustForOpt/src/utils/load_data.jl:84
 [5] #add_timeseries_data! at /home/holger/.julia/dev/ClustForOpt/src/utils/load_data.jl:0 [inlined]
 [6] #add_timeseries_data!#12(::Int64, ::Int64, ::Array{Int64,1}, ::Function, ::Dict{String,Array}, ::SubString{String}, ::String) at /home/holger/.julia/dev/ClustForOpt/src/utils/load_data.jl:58
 [7] (::getfield(ClustForOpt, Symbol("#kw##add_timeseries_data!")))(::NamedTuple{(:K, :T, :years),Tuple{Int64,Int64,Array{Int64,1}}}, ::typeof(ClustForOpt.add_timeseries_data!), ::Dict{String,Array}, ::SubString{String}, ::String) at ./none:0
 [8] #load_timeseries_data#11(::String, ::Int64, ::Array{Int64,1}, ::Array{String,1}, ::Function, ::String) at /home/holger/.julia/dev/ClustForOpt/src/utils/load_data.jl:29
 [9] (::getfield(ClustForOpt, Symbol("#kw##load_timeseries_data")))(::NamedTuple{(:T, :years),Tuple{Int64,Array{Int64,1}}}, ::typeof(load_timeseries_data), ::String) at ./none:0
 [10] top-level scope at none:0

@YoungFaithful do you get the same error on your computer?

When I test the ClustForOpt version #84 in CapacityExpansion, the transmission cases give errors in the test run (which are the only ones that load GER_18 data), this may be one reason.

Performance check for loaded packages

Loading the package takes time. Check every package that is loaded with using and verify that it is needed. Measure the load time of each with @time.

Unable to implement example

Hello,

I am trying to test the ClustForOpt.jl software, however when I try running a variation of the example provided in the readme -

using ClustForOpt
ts_input_data = load_timeseries_data("DAM", "GER")

I get the following error:
MethodError: no method matching load_timeseries_data(::String, ::String)
Closest candidates are:
load_timeseries_data(::Any; region, T, years, att) at /home/nvidia/.julia/packages/ClustForOpt/YNrmS/src/utils/load_data.jl:19

Stacktrace:
[1] top-level scope at In[38]:2
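The closest candidate in the error message takes one positional argument and everything else as keyword arguments. The sketch below reproduces that call pattern with a hypothetical stand-in function; load_demo, its default values, and the argument values are assumptions, and whether "DAM" and "GER" are valid inputs for the package itself is not confirmed here.

```julia
# Hypothetical stand-in that mirrors the shape of the signature in the error
# message: one positional argument, everything else passed as keywords.
load_demo(data_name; region="none", T=24) = (data_name=data_name, region=region, T=T)

ok = load_demo("DAM"; region="GER")     # one positional argument plus a keyword

# Passing the region as a second positional argument raises a MethodError
failed = try
    load_demo("DAM", "GER")
    false
catch err
    err isa MethodError
end
println(ok.region, " ", failed)         # GER true
```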

Different method to distribute demand to nodes

In the heavy testing I figured out that something with the calculation of the different demands isn't exact. Needs investigation

I suppose Strings. Shall I change it to that?

Originally posted by @YoungFaithful in #2

Normalization for entire tech (mean over all nodes)

Testing fails due to job killed by travis

Testing sometimes fails. The error is something like /home/travis/.travis/functions: line 104: 3596 Killed. This seems to be an issue with memory on travis.

Todo is to rewrite the tests so that they use less memory. Potentially make the kmedoids-exact cbc case smaller, and also reduce the amount of data that is loaded.

Workaround for now is to just rerun the build on travis, so far it has worked the second time.

Clust-values < e-13

If values are too small and close to 0, the constraint is neglected by Gurobi. I fear that I'll need to revise my results in that regard. I think a simple round should solve that.
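A minimal sketch of the two options, assuming a threshold of 1e-10 and 8 digits (both values are placeholders, not settings from the package): either snap tiny magnitudes to exactly zero, or round everything to a fixed number of digits.

```julia
# Set entries whose magnitude is below `tol` to exactly zero.
# The threshold is an assumed value, not one taken from the package.
snap_to_zero(x::AbstractArray; tol::Real=1e-10) =
    map(v -> abs(v) < tol ? zero(v) : v, x)

clust_values = [0.25, 3.0e-14, -7.0e-15, 1.0]
snapped = snap_to_zero(clust_values)
println(snapped)                          # [0.25, 0.0, 0.0, 1.0]

# Alternative: round all entries to a fixed number of digits
rounded = round.(clust_values; digits=8)
println(rounded)
```

Snapping leaves all other values untouched, while rounding also changes values that are not close to zero.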

Update to v1.0.1 - findmin and find

From documentation:

findmin, findmax, argmin, and argmax used to always return linear indices. They now return CartesianIndexes for all but 1-d arrays, and in general return the keys of indexed collections (e.g. dictionaries) (#22907).

find has been renamed to findall. findall, findfirst, findlast, findnext now take and/or return the same type of indices as keys/pairs for AbstractArray, AbstractDict, AbstractString, Tuple and NamedTuple objects (#24774, #25545). In particular, this means that they use CartesianIndex objects for matrices and higher-dimensional arrays instead of linear indices as was previously the case. Use LinearIndices(a)[findall(f, a)] and similar constructs to compute linear indices.

--> we should doublecheck if we use any of these, and update accordingly. I will do that.
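A small sketch of the behavior described in the quoted release notes, using a made-up matrix: findmin and findall return CartesianIndex values for matrices, and LinearIndices converts them back to linear indices.

```julia
A = [4 9 2;
     7 1 8]

val, idx = findmin(A)             # idx is a CartesianIndex for a matrix
lin = LinearIndices(A)[idx]       # convert back to a linear index

big = findall(x -> x > 6, A)      # Vector of CartesianIndex values
big_linear = LinearIndices(A)[big]

println(val, " at ", Tuple(idx), ", linear index ", lin)
println(big_linear)               # [2, 3, 6]
```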

Update to JuMP v0.19

Yay, JuMP v0.19 has been released: https://discourse.julialang.org/t/jump-v0-19-has-been-released/20878
We need to update the optimization part of our package accordingly.

Branch `export`

There is a branch called export. Can we delete it or is it still in use for something you are working on, @YoungFaithful?

get_cep_slack_variables()

get_cep_slack_variables() currently gets variables ["SLACK"]. I think it would be an enhancement to have it catch all variables that have variable_type "sv". What do you think?

0-1 normalization

Implement 0-1 normalization as an option in run_clust().
This would entail a change in naming of the ClustInputData struct:
All normalization methods are of the form (A-x)/y, where x is \mu and y is \sigma for z-normalization, and x is min(A), and y is max(A)-min(A) for 0-1 normalization.
Thus, need to replace mean and sdv with more generic terms.
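A sketch of the generic form (A-x)/y described above. The struct name Normalization and the field names shift and scale are hypothetical suggestions for the "more generic terms", not names used in the package.

```julia
using Statistics  # mean, std

# Generic form (A - x) / y with hypothetical field names: `shift` is x, `scale` is y
struct Normalization
    shift::Float64
    scale::Float64
end

z_normalization(A) = Normalization(mean(A), std(A))
zero_one_normalization(A) = Normalization(minimum(A), maximum(A) - minimum(A))

apply(n::Normalization, A) = (A .- n.shift) ./ n.scale
undo(n::Normalization, B) = B .* n.scale .+ n.shift

A = [2.0, 4.0, 6.0, 10.0]
n01 = zero_one_normalization(A)
println(apply(n01, A))            # [0.0, 0.25, 0.5, 1.0]
```

Because both methods share one representation, the same apply and undo functions work for either choice.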

Import Your Own Data

Hi,
I tried importing my data via your instructions and when I run these lines:

my_path=joinpath(homedir(),"Documents","tutorial","TimeClustInput.csv")
your_data_1=load_timeseries_data(my_path; region="none", T=24)

I get this error:
LoadError: MethodError: objects of type Bool are not callable

Sorry for the basic question!
Dan

ClustConfig?

What do you think about ClustConfig like:

"""
        ClustConfig{method::String
        representation::String
        n_clust::Number
        n_init::Number
        n_seg::Number
        iterations::Number
        norm_op::String
        norm_scope::String
        attribute_weights::Dict{String,Float64}}
Collection of cluster configuration
"""
struct ClustConfig
        method::String
        representation::String
        n_clust::Number
        n_init::Number
        n_seg::Number
        iterations::Number
        norm_op::String
        norm_scope::String
        attribute_weights::Dict{String,Float64}
end
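One possible variant, sketched under assumptions: Base.@kwdef generates a keyword constructor, and concrete field types such as Int instead of Number let the compiler specialize on the fields. The struct name ClustConfigSketch and all default values below are placeholders chosen for illustration, not defaults from the package.

```julia
# Sketch of a keyword-constructible variant. All default values are placeholders
# chosen for illustration, not defaults taken from the package.
Base.@kwdef struct ClustConfigSketch
    method::String = "kmeans"
    representation::String = "centroid"
    n_clust::Int = 5
    n_init::Int = 100
    n_seg::Int = 24
    iterations::Int = 300
    norm_op::String = "zscore"
    norm_scope::String = "full"
    attribute_weights::Dict{String,Float64} = Dict{String,Float64}()
end

# Only the fields that differ from the defaults need to be given
config = ClustConfigSketch(n_clust=10, attribute_weights=Dict("wind" => 2.0))
println(config.method, " ", config.n_clust)   # kmeans 10
```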

Replace FullInputData struct

Replace FullInputData by ClustInputData with K=1, T=...

Rewrite battery problem

CO2 emissions constraint in CEP does not contain weighted demand.

kmeansexact optimizer type specification

YoungFaithful on Jul 4 Collaborator

Can we not use:
kmexact_optimizer::DataType=DataType or something else defining the Type?
@holgerteichgraeber

Not sure what you are trying to point to?
The issue we had was that the optimizer object itself does not have a specific type. It's basically Any.
@YoungFaithful
YoungFaithful 21 hours ago Collaborator

I think that changed with the new JuMP update (https://github.com/YoungFaithful/CapacityExpansion.jl/blob/d23ffab3f915d6f02e128774b6d2ae82d1c358f0/src/optim_problems/run_opt.jl#L86)

intra vs. inter -day storage

I've been thinking about naming. Currently, storage can take values "intra" and "inter". I think that is very explicit and I like it. However, I see a challenge for the user that is not too familiar with the model, and may confuse inter and intra (it just happened to me, even though I thought I knew better..).
Maybe naming it storage=simple/seasonal or similar would be more explicit?
This is nothing urgent, but something we should consider before we publish.

CSV 'allowmissing' deprecated

allowmissing is a deprecated keyword argument
└ @ CSV ~/.julia/packages/CSV/MKiwM/src/CSV.jl:157

Register

@JuliaRegistrator register()

Documentation

Created branch dev.
Todos on master:

Documentation clustering
Documentation CEP
Update examples on master to use module

bug - workflow example - OptVariable not defined

Somehow the workflow examples throw an error for me, see below. Do you get the same @YoungFaithful ? OptVariable is defined in datastructs.jl, I can't immediately see why this occurs.

julia> include("workflow_example_attribute_weighting.jl")
ERROR: LoadError: LoadError: LoadError: UndefVarError: OptVariable not defined
Stacktrace:
 [1] top-level scope at none:0
 [2] include at ./boot.jl:317 [inlined]
 [3] include_relative(::Module, ::String) at ./loading.jl:1044
 [4] include(::Module, ::String) at ./sysimg.jl:29
 [5] include(::String) at ./client.jl:392
 [6] top-level scope at none:0
 [7] include at ./boot.jl:317 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1044
 [9] include(::Module, ::String) at ./sysimg.jl:29
 [10] include(::String) at ./client.jl:392
 [11] top-level scope at none:0
 [12] include at ./boot.jl:317 [inlined]
 [13] include_relative(::Module, ::String) at ./loading.jl:1044
 [14] include(::Module, ::String) at ./sysimg.jl:29
 [15] include(::String) at ./client.jl:392
 [16] top-level scope at none:0
in expression starting at /home/holger/.julia/dev/ClustForOpt/src/utils/datastructs.jl:80
in expression starting at /home/holger/.julia/dev/ClustForOpt/src/ClustForOpt_priv_development.jl:30
in expression starting at /home/holger/.julia/dev/ClustForOpt/examples/workflow_example_attribute_weighting.jl:1

OptVariable deprecated call

It seems like some elements got lost in merging the last PR. I tested everything on my machine before pushing, and now I get multiple issues due to older code parts. I don't know how this could have happened.

REQUIRE to Project.toml

REQUIRE is the old way.
Project.toml is the new way to describe dependencies for the project and its tests.
https://github.com/JuliaLang/Pkg.jl/blob/master/Project.toml

32bit system incompatible

Explicit Int64 and Float64 type annotations lead to errors on 32-bit systems.
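A minimal sketch of the integer case, with a hypothetical function name: on a 32-bit system, Int is Int32 and integer literals are Int32, so a method annotated with Int64 does not accept them. Annotating with Integer (or Int) works on both word sizes.

```julia
# `Integer` accepts Int32 and Int64 alike, so the method works on 32-bit and
# 64-bit systems. An `Int64` annotation would reject Int32 arguments.
n_periods(n_steps::Integer, T::Integer) = div(n_steps, T)

println(n_periods(8760, 24))                  # 365
println(n_periods(Int32(8760), Int32(24)))    # 365
```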

Save Clustering Results

As an intermediate step before the optimization, implement saving ClustResults as jld2. save= keyword argument is already provided.

time series with only zeros has clust result of NaN
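One plausible cause, stated as an assumption rather than a confirmed diagnosis: z-normalizing a constant series divides by a standard deviation of zero, which gives 0/0 = NaN. The sketch below reproduces that and shows a hypothetical guard (safe_znorm is not a package function).

```julia
using Statistics  # mean, std

zeros_series = zeros(24)

# Naive z-normalization: the standard deviation is 0, so every entry is 0/0
naive = (zeros_series .- mean(zeros_series)) ./ std(zeros_series)
println(all(isnan, naive))                    # true

# Hypothetical guard: center constant series but skip the scaling
function safe_znorm(x::AbstractVector)
    s = std(x)
    return s == 0 ? x .- mean(x) : (x .- mean(x)) ./ s
end
println(safe_znorm(zeros_series) == zeros(24))   # true
```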

Include time step length delta t in ClustInputData

The time step length is currently assumed to be 1h. Include time step length as a field of ClustInputData, and use as input in the optimization problems.

Automatic testing

Implement automatic testing for clustering methods and extreme value selection.

delete share L and share H in nodes.csv

transfer n_digits_data_round to CapacityExpansion.jl

In run_clust.jl

I think the data rounding is an issue that comes up in the CEP, so it should be handled there. There may be more generic applications of ClustForOpt where this hardcoded value adds to complexity.
I am currently rewriting parts of run_clust.jl. After I push to dev, let's aim at this issue.

Add documentation on testing

Add short documentation on testing in test/runtest.jl to make future additions to the set of tests easier for other users.
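As a starting point for such documentation, a test in Julia's Test standard library could follow the pattern below. The function split_periods is a hypothetical stand-in; a real test would call functions from the package instead.

```julia
using Test

# Hypothetical function under test; a real test would call the package instead.
split_periods(series, T) = reshape(series, T, length(series) ÷ T)

# Group related checks in a named test set so failures are reported together
@testset "period reshaping" begin
    series = collect(1.0:48.0)
    periods = split_periods(series, 24)
    @test size(periods) == (24, 2)
    @test periods[:, 2] == series[25:48]
end
```

New tests can then be added as further @testset blocks, one per feature.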

Implement dbaclust()

-Update TimeWarp.jl

Then implement in our package.
Open questions:
How to implement parallelization in our framework
How to deal with multiple attributes/nodes

Cost of conversion for hydrogen includes electricity production costs

Renaming the package from ClustForOpt to TimeSeriesClustering

We are currently in the process of renaming the package. It is unclear if it has to be re-registered, which would require everyone to add it again to their julia package list: JuliaRegistries/General#2770

@mleprovost I saw that you forked the package, appreciated! Because of the rename, there is a chance of hiccups, but they should be resolved in the next 2-3 days.
In the meantime, you should be able to use the package under its old name as ClustForOpt (just do using ClustForOpt) as is.

Tests without Gurobi

To test our package with Travis or similar, we would need to make the tests independent of Gurobi.

ClustResultAll should only contain one k

Have ClustResultAll only contain one clustering result instead of all from n_clust_ar.

test cases without CapacityExpansion.jl dependency

Let's write the test cases such that they do not depend on CapacityExpansion, because that dependency makes tests fail whenever we update ClustForOpt with breaking changes.

Best_ids as k_ids into ClustData

Seasonal storage is integrated in an alpha phase that passes best_ids "manually" through the functions; this needs to be improved. And it's super exciting! I wonder if you should penalize the charging and discharging of the battery rather than the installed capacity (because battery lifetime is limited to about 6,000 cycles rather than to a fixed 25 years).
It would be a much nicer workflow for the Opt part, if best_ids would be included in ClustData - What's your opinion on that one?
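To illustrate why carrying the assignments along with the clustered data helps, here is a sketch with a hypothetical container (ClusteredSeries is not the package's ClustData). With k_ids stored next to the representative periods, the chronological sequence that seasonal storage needs can be rebuilt without passing extra arguments.

```julia
# Hypothetical container: representative periods plus the cluster id of each
# original period, in chronological order.
struct ClusteredSeries
    centers::Matrix{Float64}   # T x n_clust representative periods
    k_ids::Vector{Int}         # k_ids[k] = cluster assigned to original period k
end

# Rebuild the full-length sequence by replacing each original period
# with its representative period
reconstruct(c::ClusteredSeries) = vec(c.centers[:, c.k_ids])

c = ClusteredSeries([1.0 5.0; 2.0 6.0], [1, 2, 2, 1])
println(reconstruct(c))    # [1.0, 2.0, 5.0, 6.0, 5.0, 6.0, 1.0, 2.0]
```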

Master branch not working

I am trying to test PR #22 and noticed that even the current master branch on holgerteichgraeber/ClustForOpt_priv.jl is not working for any of the three examples we have in the examples folder:

_bat.jl - somehow kwargs was taken out of run_clust() in the last commit, must have slipped through my fingers, that should still be in there. Even when I add it back, it gives an error with gurobi_env. Any idea why?
_attribute_weighting.jl : example is linked to ClustForOpt but should be linked to _priv.
workflow_example.jl : This one is giving me an error in the load_data function.

Could you please check and fix these, @YoungFaithful ? Thank you.
When you have a look, please branch off of holgerteichgraeber/ClustForOpt_priv.jl - master, without any other changes, let's try to get that one running first.