
nilmtk's Introduction


NILMTK: Non-Intrusive Load Monitoring Toolkit

Non-Intrusive Load Monitoring (NILM) is the process of estimating the energy consumed by individual appliances given just a whole-house power meter reading. In other words, it produces an (estimated) itemised energy bill from just a single, whole-house power meter.

NILMTK is a toolkit designed to help researchers evaluate the accuracy of NILM algorithms. If you are a new Python user, we recommend familiarising yourself with pandas, PyTables and other tools from the Python ecosystem first.

⚠️ It may take time for the NILMTK authors to get back to you about queries/issues. However, you are more than welcome to propose changes or offer support. Remember to check existing issue tickets, especially the open ones.

Documentation

NILMTK Documentation

If you are a new user, read the install instructions here. It has come to our attention that some users follow third-party tutorials to install NILMTK. Always check the dates of such tutorials: many are very outdated and don't reflect NILMTK's current version or the recommended/supported setup.

Why a toolkit for NILM?

We quote our NILMTK paper explaining the need for a NILM toolkit:

Empirically comparing disaggregation algorithms is currently virtually impossible. This is due to the different data sets used, the lack of reference implementations of these algorithms and the variety of accuracy metrics employed.

What NILMTK provides

To address this challenge, we present the Non-intrusive Load Monitoring Toolkit (NILMTK); an open source toolkit designed specifically to enable the comparison of energy disaggregation algorithms in a reproducible manner. This work is the first research to compare multiple disaggregation approaches across multiple publicly available data sets. NILMTK includes:

  • parsers for a range of existing data sets (8 and counting)
  • a collection of preprocessing algorithms
  • a set of statistics for describing data sets
  • a number of reference benchmark disaggregation algorithms
  • a common set of accuracy metrics
  • and much more!

Publications

If you use NILMTK in academic work then please consider citing our papers. Here are some of the publications (contributors, please update this as required):

  1. Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, Mani Srivastava. NILMTK: An Open Source Toolkit for Non-intrusive Load Monitoring. In: 5th International Conference on Future Energy Systems (ACM e-Energy), Cambridge, UK. 2014. DOI:10.1145/2602044.2602051. arXiv:1404.3878.
  2. Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, Mani Srivastava. NILMTK: An Open Source Toolkit for Non-intrusive Load Monitoring. In: NILM Workshop, Austin, US. 2014. [pdf]
  3. Jack Kelly, Nipun Batra, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, Mani Srivastava. Demo Abstract: NILMTK v0.2: A Non-intrusive Load Monitoring Toolkit for Large Scale Data Sets. In the first ACM Workshop On Embedded Systems For Energy-Efficient Buildings, 2014. DOI:10.1145/2674061.2675024. arXiv:1409.5908.
  4. Nipun Batra, Rithwik Kukunuri, Ayush Pandey, Raktim Malakar, Rajat Kumar, Odysseas Krystalakos, Mingjun Zhong, Paulo Meira, and Oliver Parson. 2019. Towards reproducible state-of-the-art energy disaggregation. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys '19). Association for Computing Machinery, New York, NY, USA, 193–202. DOI:10.1145/3360322.3360844

Please note that NILMTK has evolved a lot since most of these papers were published! Please use the online docs as a guide to the current API.

Brief history

  • August 2019: v0.4 released with the new API. See also NILMTK-Contrib.
  • June 2019: v0.3.1 released on Anaconda Cloud.
  • Jan 2018: Initial Python 3 support on the v0.3 branch
  • Nov 2014: NILMTK wins best demo award at ACM BuildSys
  • July 2014: v0.2 released
  • June 2014: NILMTK presented at ACM e-Energy
  • April 2014: v0.1 released

For more detail, please see our changelog.

nilmtk's People

Contributors

ahersey, bitdeli-chef, camilomarino, caxefaizan, christophalt, dev-jan, eshapec, falcon027, flopska, hetvishastri, jackkelly, josemao, jpcofr, ladia1, levaphenyl, magusverma, martinneighbours, mukkla, nipunbatra, odysseaskr, oliparson, paperbackraita, pilillo, pmeira, prince7003, rajat-tech-002, raktim2015, rishibaijal, rithwikksvr, tosemml


nilmtk's Issues

Should we only store power (and not energy)?

Some datasets (e.g. HES) store energy and not power.

If a dataset stores energy (not power) then I think we have two options:

  1. on import, we convert energy to power and only store power
  2. don't do any conversions on import, instead store energy

Some datasets (AMPds) store both energy and power. If a dataset stores both energy and power then I think we should first check to see if these columns basically store the same data and, if so, discard the energy data.

The maths for converting from energy to power is trivial because power is just energy usage over time but things can go wrong when there are missing / corrupt samples.

Advantages of only storing power
  • All downstream functions (stats, plotting, metrics etc) only have to know how to process power so the code should be more simple
Disadvantages of only storing power
  • The importer code becomes a little more complex (but the total effect on nilmtk as a whole should be to reduce overall code complexity)
  • What if we make a mistake in the conversion?

Personally, I would lean towards converting from energy to power on import and only storing power data.

Ultimately, when we try to compare two or more datasets; or try to compare appliance power estimates generated by NILM against ground truth, then we will have to convert from energy to power at that point. In other words, we'll have to convert at some point, so why not standardise on always converting on import?

If our import code does make a mistake during the conversion then at least we'll have a single piece of code to fix. And it's not like we're destroying the raw dataset; we can always re-import ;)

What do you think?


(Just for completeness, I should link to the two other places where we started discussing this issue.)

Create an architecture diagram

Now, the wiki page on architecture looks solid. I have added a few more details, specifically the input and output of each module. Effectively, the diagram would just be a better representation of that document.

Calibration

Calibration (as Nipun did in the ICMLA paper). REDD data has real power for appliances and apparent power for mains. This needs to be compensated for in the supervised setting.

Automatically create building wiring diagram

Not all datasets explicitly describe which appliances are wired to which mains channel or whether or not there's a hierarchy in the metering setup. For example, in my dataset, the power consumed by our kitchen lights shows up on three sensors: the whole house mains, the whole house lighting circuit and the sensor for the kitchen ceiling lights.

Having a model of the wiring is essential so that we don't double-count appliances like my kitchen ceiling lights. If we are not aware of the wiring then it'll confuse lots of features like continuity detection and determining the amount of energy not submetered per home.

The proposal is to automatically infer the wiring diagram from the sensor data.

This could work something like this:

Use the switch continuity detection code (#19) to find events which appear simultaneously on multiple channels. Some of these events may be caused by activity from a single appliance showing up on multiple channels. Find relationships where activity on one channel is guaranteed to show up on another channel (this tells us that two channels are probably connected, but not the order). Then, to find directionality, check which channel has activity that cannot be explained by the other channel. Maybe run through the data at least twice: once to form a hypothesised wiring model and again to check that the hypothesis is valid against the data.

It feels like this could be done using some form of Bayesian updating but that's probably overkill.
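A very rough sketch of the coincidence idea, assuming each channel is a pandas Series of power indexed by timestamp (thresholds, bin sizes and function names are all illustrative, not existing nilmtk code):

import pandas as pd

def detect_events(power, threshold=20.0, bin_size='10s'):
    """Boolean series: True for each time bin in which the channel's power
    changed by more than `threshold` watts (i.e. a candidate state change)."""
    changed = power.diff().abs() > threshold
    return changed.resample(bin_size).sum() > 0

def containment_score(child_events, parent_events):
    """Fraction of the candidate child's events that also appear on the
    candidate parent.  A score near 1.0 in one direction only suggests the
    child is wired beneath the parent."""
    parent_events = parent_events.reindex(child_events.index, fill_value=False)
    coincidences = (child_events & parent_events).sum()
    return coincidences / max(int(child_events.sum()), 1)

Comparing containment_score(kitchen_lights, mains) with containment_score(mains, kitchen_lights) would then give a crude test of directionality.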

Standard appliance names

I'd propose that we create and use a set of standard appliance names, e.g. we could standardise on using "tv" instead of "TV" or "T.V." or "television". This would allow easy comparison between datasets. Dataset importers would be responsible for converting from the source dataset's appliance names to our standard names.

Are we happy with this proposal?

If so, is anyone aware of an existing list of appliance names? I think HES has quite an extensive list.

I guess it might also be useful to codify a mapping from each appliance name to standard classes ("hot", "cold", "wet", "computing" etc) used in the industry.
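To make the proposal concrete, the mapping could be as simple as a dict maintained by each dataset importer; all names and categories below are illustrative, not an agreed list:

# Hypothetical mapping from one source dataset's labels to standard names.
APPLIANCE_NAME_MAP = {
    'TV': 'tv',
    'T.V.': 'tv',
    'television': 'tv',
    'Fridge-Freezer': 'fridge freezer',
    'WashingMachine': 'washing machine',
}

# Hypothetical mapping from standard names to industry categories.
APPLIANCE_CATEGORIES = {
    'tv': 'audiovisual',
    'fridge freezer': 'cold',
    'washing machine': 'wet',
    'kettle': 'cooking',
}

def standardise(name):
    """Return the standard appliance name, falling back to a lowercased label."""
    return APPLIANCE_NAME_MAP.get(name, name.lower())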

setup.py doesn't compile with local modules

Hi,

I haven't had the time to check this, but I have my numpy installation in $HOME/.local/lib/python2.7 and the build process fails to find the numpy headers. I tried

python setup.py config --include-dirs ~/.local/lib/python2.7

but it has no effect on the python setup.py build command. Do you know how to allow user installations when building the package?

Switch continuity

@oliparson's suggestion: a graph with sample rate on the x-axis and switch continuity on the y-axis.

Could work something like this:

For each appliance in a building:

Calculate the forward diff of power.
Threshold to find state changes (not trivial though... see the dataset stats issue #31 for ideas on how to find the on/off threshold, also use Hart's edges between steady states... ultimately use BCP). For simple switch continuity, just using on/off transitions as events may suffice. But for auto-generated wiring, we need to detect all state changes.

Now create a series at 1-second periods with the probability of a state change some time during that second: just 1 / (time between the samples either side of the state change). (Could improve this by using data from mains.) The probability needs to be zero when a state change is impossible.
Align all appliances. Resample to the required sample rate by summing probabilities.

To efficiently find candidate times of collisions, convert the matrix to booleans where prob > 0 and then sum every column per row. If the sum is greater than or equal to 2 then that's a candidate collision site.

For each pair of appliances, multiply probabilities to find the probability of both appliances changing state in that time period.

What exactly to report? Mean time to appliance collision?
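A rough sketch of the probability-series idea above, assuming each appliance is a pandas Series of power indexed by timestamp; the threshold and the clipping of summed probabilities are crude simplifications:

import pandas as pd

def state_change_probability(power, threshold=20.0):
    """1-second series with a crude probability that the appliance changed
    state during each second: 1 / (seconds between the surrounding samples)
    at each detected state change, zero elsewhere."""
    diff = power.diff().abs()
    gap = power.index.to_series().diff().dt.total_seconds()
    prob = (1.0 / gap).where(diff > threshold, 0.0)
    return prob.resample('1s').sum().fillna(0.0)

def collision_candidates(probs, sample_period='60s'):
    """Resample all appliances to a common period, flag periods where two or
    more appliances may have switched, and return per-pair probabilities of a
    simultaneous state change."""
    df = pd.DataFrame({name: p.resample(sample_period).sum()
                       for name, p in probs.items()}).fillna(0.0)
    candidates = df.gt(0).sum(axis=1) >= 2
    pairs = {}
    names = list(df.columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs[(a, b)] = (df.loc[candidates, a] * df.loc[candidates, b]).clip(upper=1.0)
    return pd.DataFrame(pairs)

From the per-pair series, a "mean time to appliance collision" could then be reported as, say, total observation time divided by the expected number of collisions.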

MSc projects on nilmtk

At Imperial, we propose projects for MSc students to do over 3 months in the summer. We usually advertise these projects in early January, so I thought we could start a list of things that MSc students could have fun doing over 3 months on nilmtk. These projects need to be fairly research-oriented (i.e. they can't just be implementing a website or something like that). Some ideas:

  • implement Hart's NALM algorithms
  • make a start on the simulator
  • design and implement their own NILM algorithm
  • implement another NILM algorithm from the literature

Any other ideas?

How to handle streaming data?

Use cases:

  • Once a day, at midnight, we update our disaggregation estimates using new mains data from the building and new environmental data. But we don't want to re-run disaggregation since the beginning of history (because that would be very expensive); we just want to do the necessary work to disaggregate the previous 24 hours. But the disaggregation system is very likely to want to keep track of previous estimates so it knows what appliances were running just before midnight the previous day.

Possible solutions:

  • For each building, we just append the new mains data to mains. The disaggregation algorithm can find a "bookmark" of where it left off the last time by looking at the building's appliance_estimates (NILM output).
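A minimal sketch of the bookmark idea, assuming power data lives in a pandas HDFStore with hypothetical keys '<building>/mains' and '<building>/appliance_estimates', and a disaggregator object with a disaggregate() method:

import pandas as pd

def disaggregate_new_data(store, building, disaggregator):
    """Disaggregate only the mains samples recorded since the last stored estimate."""
    mains = store[building + '/mains']
    est_key = building + '/appliance_estimates'
    if est_key in store:
        bookmark = store[est_key].index[-1]        # where we left off last time
        mains = mains[mains.index > bookmark]
    if not mains.empty:
        new_estimates = disaggregator.disaggregate(mains)
        store.append(est_key, new_estimates)       # appended, not rewritten

# with pd.HDFStore('dataset.h5') as store:
#     disaggregate_new_data(store, 'building_1', disaggregator)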

REDD loader

Still to do (I'll update this as I go...)

  • test on other houses. Pay special attention to DualAppliances. The code breaks for Washer Dryer in Home 2, which has only 1 supply mentioned. Add to DUD_CHANNELS if necessary.
  • save appliance metadata, including which meters were used, and what each supply supplies for DualSupply
  • Check if REDD records apparent power for mains and active for circuits? See this discussion
  • set up wiring

Human friendly string representations of dataset, buildings, electricity

Currently, the object explorer does not reveal much

In [14]: pecan_15min_complete.buildings
Out[14]: 
{u'Home_01': <nilmtk.building.Building at 0x431c590>,
u'Home_02': <nilmtk.building.Building at 0x3380090>,
u'Home_03': <nilmtk.building.Building at 0x4ab4550>,
u'Home_04': <nilmtk.building.Building at 0x4ac35d0>,
u'Home_05': <nilmtk.building.Building at 0x5a9b7d0>,
u'Home_06': <nilmtk.building.Building at 0x3468650>,
u'Home_07': <nilmtk.building.Building at 0x5a9b590>,
u'Home_08': <nilmtk.building.Building at 0x1cefdd0>,
u'Home_09': <nilmtk.building.Building at 0x1cfad90>,
u'Home_10': <nilmtk.building.Building at 0x46144d0>}
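One lightweight fix would be to give Building (and DataSet) a __repr__; a minimal sketch, with illustrative attribute names:

class Building:
    def __init__(self, name, electricity=None):
        self.name = name
        self.electricity = electricity or {}       # e.g. {'mains_1': DataFrame, ...}

    def __repr__(self):
        return "<Building '{}': {} electricity channels>".format(
            self.name, len(self.electricity))

# {'Home_01': Building('Home_01')} would then display as
# {'Home_01': <Building 'Home_01': 0 electricity channels>}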

Make it easy for users to install nilmtk

At present, users first have to run git clone and then python setup.py install to install nilmtk.

Should we put nilmtk onto PyPI so that users can install nilmtk just by typing 'pip install nilmtk'?

I think we shouldn't worry about PyPI until we've got all our core functionality done. But, before our first release, we must test that nilmtk installs correctly on Mac, Windows and Linux (I have access to the latter two platforms).

Some useful links:

I wrote a few quick notes on distributing Python packages: http://jack-kelly.com/python_notes

A discussion of the conda package manager : http://technicaldiscovery.blogspot.nl/2013/12/why-i-promote-conda.html

PECAN loader

Done: a quick implementation of the 15-minute data. Still need to do the 1-minute data as well.

On-disk metadata

It occurred to me that we can use the issue queue as a wiki! So we could list all the items we want in the metadata at the top of this issue. Please edit and add more! Let's start off using this as a hierarchical list; then we can think about putting it into a suitable JSON form. What do you think?

Data

For whole dataset

  • full dataset name
  • abbreviated dataset name
  • citation(s)
  • URL(s)
  • name of the dataset
  • date when the paper was published

For each building

  • original building name from original dataset
  • timezone
  • geo location
  • max number of occupants
  • rooms
  • floors
  • date of construction
  • building size

Utility

Electricity
  • mains wiring graph (we could make this concise by first specifying a default, e.g. most appliances are connected to mains split 1, and then just specifying the exceptions to that rule: the kitchen lights are connected to the light circuit which is in turn connected to mains 1). networkx has functions for exporting serialised versions of graphs; see the sketch at the end of this metadata list.
Mains
  • sample rate
  • measured quantities
  • meters used (make and model; used as key into meters database)
Circuits
  • sample rate
  • measured quantities
  • meters used
Appliances
  • use standard names which map into the global appliances database
  • sample rate
  • measured quantities
  • meters used
  • metadata for each specific appliance (screen size, active period, number installed etc)
  • control connections (graphical model)

Meters

(Question: should we store meter metadata in nilmtk or on the electricity disaggregation semantic media wiki, and use semantic-web technologies to link nilmtk to the wiki? That might be kind of cool, and would allow for much richer metadata, like photos and graphs.)

  • model
  • manufacturer
  • url
  • measuring devices (CT, AC-AC voltage sensor, direct voltage sensor, direct current sensor)
  • max sample rate

Either per measurement or globally:

  • relative accuracy
  • bias
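As mentioned in the electricity item above, networkx can serialise a wiring graph for the metadata file; a small sketch (node names are made up):

import networkx as nx
from networkx.readwrite import json_graph

wiring = nx.DiGraph()
wiring.add_edge('mains_1', 'light_circuit_1')         # default: fed from mains split 1
wiring.add_edge('light_circuit_1', 'kitchen_lights')  # exception: lights behind the light circuit
wiring.add_edge('mains_1', 'fridge')

serialised = json_graph.node_link_data(wiring)        # plain dict, ready to drop into JSON metadata
restored = json_graph.node_link_graph(serialised)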

Memory error when using cluster validity measures for CO

Copying the question which I asked on the sklearn mailing list:

Hi,

I am using k-means++ to cluster my data series. From my domain expertise, I know that the number of clusters varies between 2 and 4. To find the optimum number of clusters, I was doing the following (pseudocode):

for num_clusters in [2, 3, 4]:
    labels = cluster_using_kmeans(num_clusters, data)
    silhouette[num_clusters] = find_silhouette_coefficient(data, labels)

Whichever num_clusters gives the optimum silhouette score is the optimum number of clusters.

Problem

I end up with a memory error.
The complete stack trace follows:

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
    135
    136     """
--> 137     distances = pairwise_distances(X, metric=metric, **kwds)
    138     n = labels.shape[0]
    139     A = np.array([_intra_cluster_distance(distances[i], labels, i)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
    485         func = pairwise_distance_functions[metric]
    486         if n_jobs == 1:
--> 487             return func(X, Y, **kwds)
    488         else:
    489             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
    172     # TODO: a faster Cython implementation would do the clipping of negative
    173     # values in a single pass over the output matrix.
--> 174     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    175     distances *= -2
    176     distances += XX

/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
     76         return ret
     77     else:
---> 78         return np.dot(a, b)
     79
     80

MemoryError:
As far as I understand, this is due to the pairwise distance computation. I guess that for the same reason I ran out of memory with DBSCAN.

My data is one-dimensional with shape (262271, 1). I am using scikit-learn version 0.14.1.
My system configuration is the following

RAM: 8 GB
Processor: i7
OS: Ubuntu 64 bit

Questions

Is there a better metric / cluster-validity score for finding the optimum number of states in this case? If so, is it also going to give memory problems, or is there a workaround with the silhouette coefficient?
If one were to use DBSCAN for such datasets, is there some way to avoid memory issues?
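One possible workaround, assuming a reasonably recent scikit-learn: silhouette_score accepts a sample_size argument, which computes the coefficient on a random subset and avoids building the full 262271 x 262271 pairwise-distance matrix. A sketch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(262271, 1)          # stand-in for the real 1-D power data

scores = {}
for n_clusters in (2, 3, 4):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # sample_size scores a random subset instead of all pairs, avoiding MemoryError
    scores[n_clusters] = silhouette_score(X, labels, sample_size=10000, random_state=0)

best_n_clusters = max(scores, key=scores.get)

For 1-D data, a cheaper alternative might be to score clusterings with within-cluster variance or the BIC of a 1-D GMM rather than anything based on pairwise distances.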

Simulator of disaggregated domestic electricity demand at 1 Hz

We're not planning to build the simulator for a while but I wanted to start this feature request so we can keep notes about the simulator.

Some specific feature ideas:

  • Test the effect of:
    • change sample rate
    • multiple appliance on at same time
    • different types of appliance, etc
  • Generating scenarios:
    • takes a simple text script as its input (purely deterministic) (Text scripts would allow researchers to share their “scenarios”.)
    • nice GUI for creating these text scripts
    • sample randomly from the probability distributions of time-of-use of each appliance that we calculate from real datasets, so it can automatically create an endless stream of realistic simulations

References

If we want to go down the route of trying to simulate appliances themselves (rather than stitching together appliance waveforms) then Dr Mark Bilton's PhD thesis might be of interest... he built a simulator for appliances down to a remarkable level of detail. Appliances are modelled pretty much from first principles: each is described in terms of an FSM, the volumes of physical containers (to simulate heating and cooling), U-values of insulators for fridges, etc. Very detailed stuff. http://www.mendeley.com/c/6313531404/g/3295941/bilton-2011-electricity-demand-measurement-modelling-and-management-of-uk-homes/

And here's the Barker et al paper that Nipun mentioned: "Empirical Characterization and Modeling of Electrical Loads in Smart Homes". http://lass.cs.umass.edu/papers/pdf/igcc13-model.pdf

Handle diffuse environmental sensor data

In my own work, I'm interested in using, for example, weather data recorded from the local Met Office weather station to improve disaggregation performance.

The question is: how do we store this diffuse environmental sensor data? It feels that this data doesn't belong in a Building. And, of course, we were scratching our heads a little bit over how to represent external data in buildings (#12).

I wonder if we should have a new class for Environment data (would we call this class 'Environment' or 'External' or 'ExternalSensors' or 'Weather' or something else?).

Some use cases:

  1. Generate correlations of appliance activity against weather variables (e.g. see Jack's UKERC 2013 poster for some examples)
  2. Improve disaggregation performance by using weather data

Some data sources:

  1. weather data from national weather service
  2. crowd-sourced environmental data
  3. external weather data which happens to be recorded with building's power dataset

For the third option: I'd propose that we don't store any external environmental data inside Building. Instead, if a dataset happens to provide external environmental data recorded at the same geo location as a building, then I'd propose that we put that environmental data into our Environment object (tagged with the geo location of the building) and then provide a reference from the building to the environmental object.

I guess our Environment class would need to store:

  • a collection (list? dict?) of sub-objects. each sub-object represents a sensing installation and would need to store geo-location of the sensing station and timeseries for the sensor data.

Then we could pass this environment object into our disaggregator.train() and disaggregator.disaggregate() methods, as well as various nilmtk.stats functions.

What do you think?!

Should `Ambient` have a `temperature` attribute or a `weather` attribute?

It would be great if we could store different weather variables like sunshine, temperature outside, temperature per room etc. So I wonder if, instead of a temperature attribute, maybe Ambient could have a weather attribute which is a DataFrame with columns for some combination of temperature, sunshine etc. What do you think?
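For example (values and column names purely illustrative), the weather attribute could be a DataFrame like:

import pandas as pd

index = pd.date_range('2013-06-01', periods=4, freq='D')
ambient_weather = pd.DataFrame({
    'temperature_outside': [14.2, 15.0, 16.1, 17.3],   # degrees C
    'temperature_kitchen': [19.5, 19.7, 20.1, 20.4],
    'sunshine': [0.0, 0.2, 0.6, 0.8],                   # fraction of the period
}, index=index)
# ambient.weather = ambient_weather   # hypothetical attribute on the Ambient class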

Add all mathematical conventions

In some common place, describe the generic mathematical conventions, something like:

Symbol Meaning
\theta_n Ground truth ...

These can be used consistently across modules, specifically pertinent to metrics.

Column names and joining dataframes from different buildings

Use case:

A user wants to create one large dataframe by joining, say, the mains dataframes from two buildings. At the moment, that is likely to fail because the column names in the two buildings are likely to conflict.

I guess there are three options:

  1. Don't worry about it; it's the user's responsibility to modify the column names so they don't conflict
  2. Add something to all column names to make them globally unique, e.g. by adding the abbreviated dataset name (e.g. REDD) and building name to every column name
  3. Leave column names as they are but provide a function to easily modify column names in a building, e.g. a building.make_column_names_unique(dataframe) method which appends the dataset name and building name to each column name in the provided DataFrame

I'm not sure which is best. I might lean towards 1 or 3. What do you think?
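A sketch of option 3, written as a stand-alone function for clarity (the name follows the suggestion above; nothing here is existing nilmtk API):

import pandas as pd

def make_column_names_unique(df, dataset_name, building_name):
    """Return a copy of `df` with every column prefixed by dataset and building
    name, e.g. 'mains_active' becomes 'REDD_building1_mains_active'."""
    prefix = '{}_{}_'.format(dataset_name, building_name)
    return df.rename(columns=lambda col: prefix + str(col))

# combined = make_column_names_unique(b1.mains, 'REDD', 'building1').join(
#     make_column_names_unique(b2.mains, 'REDD', 'building2'), how='outer')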

Metrics

Just starting a feature request about metrics as a place to store some relevant references.

First, of course, there's Oli's awesome Dec 6th blog post on NILM metrics.

Also, Section 5 of Zeifman & Roth 2011 discusses performance metrics in more depth.

A hodge-podge list of metrics (not all of which we need to implement; it's just a list to aid our memories):

  • ROC
  • F-measure
  • Run time / algorithmic complexity
  • RMS Error
  • Mean Normalized Error
  • Confusion matrices

Steady_State_Detector depends on slicedpy.Normal

Here is the full stack trace. Should be a quick fix!

In [6]: from nilmtk.feature_detectors import steady_state
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-97947d1098b0> in <module>()
----> 1 from nilmtk.feature_detectors import steady_state

/home/nipun/git/nilmtk/nilmtk/feature_detectors.py in <module>()
----> 1 from ._feature_detectors import *

/home/nipun/git/nilmtk/_feature_detectors.pyx in init nilmtk._feature_detectors (nilmtk/_feature_detectors.c:5851)()

ImportError: No module named slicedpy.normal

How to represent a Building in memory?

We were thinking of using just a pandas.DataFrame (a 2D matrix indexed by timestamp) for storing a whole building worth of data. Columns would be labelled things like 'mains', 'kettle', 'fridge' etc.

But how would we answer questions like:

Does the kettle column store active or reactive power?

We could just use column names like 'kettle_active'.

How do we know which columns store mains power?

Complications:

  • multiple phases
  • split phases
  • datasets record different combinations of apparent | active | reactive | voltage

e.g. in REDD, we have two columns labelled 'mains'.

We need a standard way to programmatically figure out which columns hold aggregate data, and to know exactly what those columns record.

At the very least, perhaps we should use more descriptive and standard column names like 'aggregate1_active' etc.

Or should we use two DataFrames per Building: one for mains; one for appliances? Some datasets record mains data and appliance data at different sample rates, so using two DataFrames would make some sense. This would also allow us to easily add an 'appliances_estimated' DataFrame to each building to hold the NILM output for that building, and then we can really easily compare the ground truth to the estimates.

How to store metadata about each Building?

Metadata we might want to store:

  • geo location
  • number of occupants
  • nominal mains voltage
  • which room is each appliance in?
  • etc...

Some options:

Add new attributes to the DataFrame

df = pd.DataFrame([])
df.location = {'country': 'UK', 'postcode': 'SE15'}

There's some discussion of this on StackOverflow.

This seems rather fragile. The main problem is that many DataFrame methods (e.g. resample) return a new DataFrame without our newly added attributes.

Use a dict

building = {'aggregate': DataFrame,
            'appliances_ground_truth': DataFrame,
            'appliances_estimated': DataFrame,
            'location': {'country': 'UK', 'postcode': 'SE15'},
            'nominal mains voltage': 230}

Use a Building class

Use the same attributes as for the dict plus a bunch of useful methods on the Building class like:

  • get_vampire_power()
  • get_diff_between_aggregate_and_appliances()
  • crop(start='1/1/2010', end='1/1/2011'): reduce all timeseries to just these dates
  • plot_appliance_activity(source='ground_truth'): plot a compact representation of all appliance activity

None of these methods would be especially complex, and all could be implemented as stand-alone functions.
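A rough skeleton of what the class option might look like (attribute and method names are illustrative only, not a committed design):

class Building:
    def __init__(self, aggregate, appliances_ground_truth, metadata=None):
        self.aggregate = aggregate                          # DataFrame of mains power
        self.appliances_ground_truth = appliances_ground_truth
        self.appliances_estimated = None                    # filled in by a NILM algorithm
        self.metadata = metadata or {}                      # location, voltage, occupants...

    def get_vampire_power(self):
        """Minimum aggregate power, as a crude estimate of the always-on load."""
        return self.aggregate.min().min()

    def crop(self, start, end):
        """Reduce all timeseries to the given date range."""
        self.aggregate = self.aggregate[start:end]
        self.appliances_ground_truth = self.appliances_ground_truth[start:end]
        if self.appliances_estimated is not None:
            self.appliances_estimated = self.appliances_estimated[start:end]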


Any thoughts?

There's some fascinating discussion of storing metadata with DataFrames on the Pandas issue queue; this post by hugadams (and the following answers) is particularly relevant to us. They recommend using a class.

Compared to a class, a dict is probably simpler, but also more fragile, less versatile and possibly less 'semantically appropriate' (by which I mean that, in the real world we're modelling, the concept of a "Building" is prominent(!) and so it would make sense to use a Building class).

I have no strong feelings either way although I'm probably leaning slightly towards using a class.

Metadata organization

I was thinking that at all levels (Dataset, Building, etc.), when we store the metadata in memory, it should go under a 'metadata' attribute of the object.

For instance, instead of

self.appliance_estimates = None
self.nominal_mains_voltage = None
self.appliances_in_each = {}
self.wiring = None
self.meters = {}
self.appliance_metadata = {}

We have something like

self.metadata = {'appliance_estimates': None, ...}

What do you think?

Dataset Statistics

These are statistics for describing datasets as distinct from NILM performance metrics, which are discussed in issue #30.

There are some stats already implemented in Jack's PDA code that I need to port over.

Here I'll keep a list of stats that we might want... this list isn't yet in priority order... please go ahead and edit this first post to add / edit stats

  • summary statistics for a single channel:
    • gaps : number of gaps; min, mean and max length
    • longest uninterrupted recording
    • total time span, uptime, percentage up
    • generate time series of booleans indicating when this appliance is on. PDA code
    • generate time series of on/off events. PDA code
    • kWh of energy use for a given time window. PDA code
    • variance, min and maxes for energy use across different time windows (e.g. by how much does my daily energy use vary?)
    • automatically determine threshold between "on" and "off"
      ... not trivial... Maybe take a histogram of powers, then cluster with a GMM or maybe DBSCAN; the lowest cluster is "off". Other clusters may represent a naive approximation of power states, but this may not work reliably. May need to first fill all missing values appropriately. The on/off threshold is the low point on the GMM between the lowest component and the next one. Be aware of appliances which never turn off, like ADSL modems; for these the on/off threshold should be 0.
    • Fill missing values, e.g. with my dataset, assume the appliance is off if it hasn't logged a reading for 60 seconds. Could also use data from the aggregate.
  • plot compact "heatmap" showing activity of all appliances in a small graph (PDA function).
  • cluster appliances whose on and/or off times are often similar (broken attempt in PDA)
  • dropout rate (what proportion of the expected samples are missing)? PDA func
  • histograms of:
    • activity (on or off) over time (per day, per week, per month, per year) PDA code
    • power consumption over time
    • on / off duration. PDA code
    • power consumption
  • correlations:
    • between appliances
    • between appliances and weather. PDA code
    • between different utilities
  • summary stats per building
    • gaps over all channels: number of gaps; min, mean and max length.
    • list of time periods when one or more channels has gaps > max_sample_period. (Then write a preprocessing filter to remove these gaps)
    • longest uninterrupted recording
    • total time span, uptime, percentage up
    • number of sub meters
    • measurements (active, voltage, reactive etc) recorded for mains
    • mains sample rate
    • measurements (Active, reactive etc) recorded for appliances
    • appliance sample rate
    • sum of appliance power minus total power. What proportion of total power (or energy) has been submetered? Or, for how much of the dataset do the submetered appliances cover about 80% of mains energy? (Nipun's idea.)
    • top k appliances (energy)
  • summary statistics for entire dataset (like table in Oli's blog):
    • number of houses
    • range of recording durations across houses (e.g. one house was recorded for 1 day; another was recorded for 1 year)
    • number of sub meters per house (range)
    • top k appliances (energy) per dataset (to see if there are common culprits) - Nipun's suggestion

It's not entirely obvious to me where each function should be implemented. For example, many of these could go into our new Electricity class or the Building class. Or perhaps some of these should be stand-alone functions (not in a class) kept in our nimltk.stats module?

Jack to do :
  • think about splitting the max_sample_period code into its own function
  • use Nipun's find_common_measurements in proportion_of_energy_submetered
  • modify activity_distribution so it outputs sensible indices (e.g. 'Monday', 'Tuesday' etc). Just hard-code it for each freq.
  • check TODOs in code
  • building on a point I made in a comment below: if a channel drops out then there are two possible reasons: 1. the sensor is broken (hence we do not know the state of the appliance) or 2. the appliance and sensor are turned off at the plug. Which assumption we adopt will affect some stats results. E.g. when calculating the proportion of energy submetered, should we only use time periods where all channels are functioning and ignore time periods where one or more sensors have dropped out (this is how the function is currently implemented)? I think we should allow the user to select which of these two assumptions best fits their data, by passing in a boolean parameter.

Proposal for loading, converting and storing datasets

I had previously suggested that we have a nilmtk.format_converters submodule. I wonder if, instead, we should have a nilmtk.dataset submodule which does the following:

We'd have stand-alone functions for loading each type of dataset (load_hes, load_blued, etc). These functions would do:

  1. any necessary unit conversions (e.g. HES records deci-watt-hours(!) so load_hes() would convert to Watts)
  2. convert to standardised appliance names
  3. return an object of the DataSet class.

This class would have attributes for storing:

  • power data for all homes (homes x appliances x timeslices). Some options for which data structure to use:
    • a Pandas Panel? Although I think that a 3D array like a Panel would be really space-inefficient because most datasets do not record from each building for exactly the same time period.
    • Perhaps it would make more sense to store individual DataFrames (one per building) in a dict or HDFStore. Using an HDFStore also means we can flush each building to disk as we load it which is important for datasets like HES which take ages to load from the original CSV and we don't want to lose all our converted data if the conversion crashes. Also, we can store metadata in the HDFStore.
  • any metadata associated with the dataset (country/countries, nominal voltage etc)

DataSet would have methods like:

  • .export(filename): export the dataset to our standard format (REDD+?).
  • print_summary_statistics(): print summary stats of dataset (number of houses; uptime; etc)
  • get_building(building_name) which would return a Pandas DataFrame for just that building. Metadata (geo location, nominal mains voltage, number of occupants etc) can be attached as new attributes to the DataFrame (see the StackOverflow discussion on "Adding meta-information/metadata to pandas DataFrame").

So, in usage, converting from, say, hes to REDD+ would be as simple as:

dataset = nilmtk.dataset.load_hes('hes.csv')
dataset.export('hes_converted.csv')

Any objections / improvements?!

0-indexing or 1-indexing ?

In our last conversation about this, we decided to index many of our data structures from 1 (e.g. building_1, mains_1 etc). IIRC, there were three main reasons for indexing from 1:

  1. We were aiming for maximum compatibility with REDD
  2. Humans generally number things from 1
  3. Most (all?) NILM datasets use 1-indexing for buildings and channels

However, there are some problems with indexing from 1:

  1. Our code is Python (possibly with a tiny bit of C/C++) and these languages index everything from 0

So, if we relax our requirement to have strict compatibility with REDD, I wonder if we should use 0-indexing?

I honestly don't know which I prefer! Any thoughts?!

tracebase importer

tracebase is a wonderful dataset of appliances. Unlike other datasets, the aim of this dataset is to provide lots of timeseries of appliances. To quote their paper:

The first contribution of this paper is thus the presentation of our tracebase respository, which contains more than a thousand electrical appliance power consumption traces that have been collected in more than ten households and office spaces

As far as I know, there is no information about which building each appliance came from. So it feels like it would be incorrect to import tracebase into a nilmtk.Building. Instead, perhaps we should import it using the same data structure that we use within a Building for individual appliances: i.e. we import tracebase into a dict where keys are strings of the form fridge1 etc. and values are DataFrames (a rough sketch follows).
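A very rough sketch of such an importer; the directory layout, separator and column names are guesses and would need checking against the real tracebase files:

from pathlib import Path
import pandas as pd

def load_tracebase(directory):
    """Load tracebase traces into {'refrigerator1': DataFrame, 'refrigerator2': ...}."""
    appliances = {}
    counts = {}
    for csv in sorted(Path(directory).glob('*/*.csv')):
        appliance_type = csv.parent.name.lower()   # assume one folder per appliance type
        counts[appliance_type] = counts.get(appliance_type, 0) + 1
        key = '{}{}'.format(appliance_type, counts[appliance_type])
        appliances[key] = pd.read_csv(
            csv, sep=';', header=None,
            names=['timestamp', 'power_1s', 'power_8s'],
            parse_dates=['timestamp'], dayfirst=True)
    return appliances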

Any thoughts?

I'd be very keen that nilmtk should have a tracebase importer as tracebase is a really useful dataset.

Split data into `train` and `test`

In some situations, we'll want to split our data into train and test (and possibly validate). Given that we're thinking of passing a whole Building object into the disaggregator for both training and testing, I wonder if we should do something like this:

train, test = building.train_test_split(test_size=0.4)

Or maybe:

train, test = nilmtk.cross_validation.train_test_split(building, test_size=0.4)

(which is more similar to the way scikit-learn does it).

Thoughts?

Create initial directory structure and files

  • setup.py
  • examples/ directory
  • nilmtk.disaggregate.<ms_nalm | hmm | co>.<MS_NALM | HMM | CO> ← use a Disaggregator abstract base class
  • nilmtk.feature_detectors.<steady_state>
  • nilmtk.appliance.<Appliance | ApplianceFSM...>
  • nilmtk.metrics.<f1 | roc | std_performance_metrics | …>
  • nilmtk.dataset.<dataset | hes | blued | ...><DataSet | HES | BLUED | ...>
  • nilmtk.preprocessing
  • nilmtk.tests
  • nilmtk.plot?? (pretty plotting of appliance estimations, metrics etc) ← look at pymc
  • nilmtk.stats (histograms of on-duration, weather Correlations etc)

Notes from looking at Pandas and scikit-learn:

  • they have tests directories both in the source root and in subfolders
  • they put Cython .pyx and auto-created .c files in the same directory as the submodule

Code example: in wiki or in examples/; but not in both?

Hiya,

@nipunreddevil thanks loads for writing example.py; it's great to have our first bit of code ;)

I think it will be a bit painful to keep both example.py and the example code in the wiki in sync, so I'd suggest that we use one or the other but not both. I really don't mind which we use ;). If we choose example.py then I'll move my detailed comments from the wiki into example.py.

What do you think?

BTW, as you may know, you can pull the wiki contents using git (if you'd prefer to edit the wiki using your favourite text editor). Here's the clone URL: git@github.com:nilmtk/nilmtk.wiki.git

For now, I'll update the code on the wiki and I'll copy your code from example.py to the wiki.

I suppose one worry with having an example.py is that users might expect all the code in the repo to actually run! And it's going to be a long time before all those functions work ;) If we put the code in the wiki then maybe it's clearer that it specifies the design aim rather than actual runnable code. What do you think?

Consistent naming of buildings

Some datasets have House 01, House 02...
Some have Home 01, and so on.

For consistency, is it fine to use Building_01 and so on?

Inconsistency in column headers in PECAN dataset

Home # 9

15 minute dataset

Date & Time use [kW] Grid [kW] AIR1 [kW] AIR* [kVA] BATHROOM1 [kW] BATHROOM1* [kVA] BEDROOM1 [kW] BEDROOM1* [kVA] BEDROOM2 [kW] BEDROOM2* [kVA] DRYE1 [kW] DRYE* [kVA] DRYG1 [kW] DRYG1* [kVA] FURNACE1 [kW] FURNACE1* [kVA] KITCHEN1 [kW] KITCHEN1* [kVA] KITCHEN2 [kW] KITCHEN2* [kVA] OVEN1 [kW] OVEN* [kVA] REFRIGERATOR1 [kW] REFRIGERATOR1* [kVA] SPRINKLER1 [kW] THEATER1 [kW] THEATER1* [kVA] WASHER1 [kW] WASHER1* [kVA]

Should DRYE* [kVA] be DRYE1* [kVA]? Or are they two different appliances: for one DRYE there is an apparent-power measurement, while for the other there is only a real-power measurement?
Similarly with OVEN1 [kW] and OVEN* [kVA].

Design of our standard on-disk dataset format?

Just some basic questions about our on-disk dataset format.

Name?

REDD+ is nicely descriptive (because our format will be an extension of REDD), but I worry that this name might, um, have political implications. We should perhaps approach the REDD authors before committing to this name (if indeed we like the name!).

Standardise units and names

Should our format standardise the following:

  • standardise units (use unix timestamp (UTC) for timestamps, Watts, Volts, etc)
  • standardise column names for aggregate data (e.g. 'mains0_apparent', 'mains1_active')
  • standardise appliance names (perhaps based on the HES appliance naming scheme with apparent/active/reactive appended to the name like kettle_active?)

So, for example, once data is in our format, users will know for sure that the kettle appliance is called 'kettle_active' and not 'Kettle' or 'KETTLE0' or something.

It would be the responsibility of functions like HES.load_building to convert from HES's units to our standard units.

I think we should standardise. It feels like The Right Thing To Do; and it should minimise the complexity of our NILM code.

Storing power data on disk

We can probably just use the basic REDD structure (a directory per house; a labels.dat file for channel labels etc).

I think we should allow users the option of saving either CSV or HDF5. CSV files can be slow to load for large datasets (even from an SSD).

It would be good if our format could be loaded by existing tools written to process REDD. i.e. anything we add (e.g. metadata) should be in separate files.

Storing metadata on disk

It sounds like we're thinking of using JSON. Is that OK? Maybe call the file 'metadata.json'. I guess we need a metadata file for the dataset as a whole (storing country of origin, academic references, URL, nominal mains voltage etc) and a metadata file per house (geo location, number of occupants etc). How does that sound?

The metadata ideas described by @nipunreddevil in the wiki look very sensible. I wonder if we can also use some ideas from Xively's metadata spec and/or SensorML. And this blog has some interesting and up-to-date ideas: Data models for the Internet of Things

How to cope with BLUED?

As @oliparson pointed out yesterday, BLUED doesn't record the power consumption of individual appliances. Instead it records whether appliances are on or off. Trying to estimate individual appliance power consumption is bound to be prone to large uncertainties, so I think we shouldn't do that. Instead, it may be simpler to add a new class of file, channel_X_state.dat, that has two columns: a timestamp and a state (an integer). We could also use this format for exporting appliance estimates from NILM algorithms which know about separate appliance states.

Allow multiple columns per file?

For mains, my dataset records active power, apparent power and voltage for the aggregate. It's much more efficient to store this in a single CSV file with 4 columns rather than 3 CSV files (because the timestamp must be duplicated in each file). Should we allow multiple columns in our on-disk format?

Compact representation?

The standard REDD format is quite space inefficient, especially if we use it for storing appliance state. e.g.:

1303132929 0
1303132930 0
1303132931 0
...
1303170000 0
1303170001 1
...

So perhaps we could allow users to export CSV files as either 'complete' or 'compact' (where only change points would be reported).
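Writing the "compact" form is then just a matter of keeping the change points; a sketch assuming a pandas Series of integer states indexed by timestamp (function name is illustrative):

import pandas as pd

def to_compact(states):
    """Keep only the first sample and the samples whose state differs from the
    previous one, i.e. the change points."""
    return states[states.ne(states.shift())]

full = pd.Series([0, 0, 0, 1, 1],
                 index=[1303132929, 1303132930, 1303132931, 1303170000, 1303170001])
compact = to_compact(full)      # keeps timestamps 1303132929 and 1303170000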

Common interface to disaggregation algorithms

We are striving to have a common interface to all NILM algorithms implemented in nilmtk. In terms of running the disaggregation, this should be fairly simple: we just pass a Building object to the disaggregation algo.

But in terms of training, different NILM algorithms have different requirements with respect to the data that they need for training. I made some tentative attempts to flesh this out in the nilmtk wiki.

As far as I can tell, there are three flavours of NILM algo (with respect to training):

  1. Train from appliance data only (e.g. just from tracebase). This would often be appliance data that was not recorded from the test home. e.g. if you show the system enough examples of different 'fridges' from tracebase then it should be able to spot a fridge in most new houses
  2. Train from aggregate and simultaneously recorded appliance data (this data is common in research but is basically impossible to acquire in large-scale end-user disaggregation deployments)
  3. Train unsupervised from aggregate data only (e.g. using FHMMs)

Some NILM algorithms may be trainable on more than one of these options.

The question is: how do we provide a unified interface (defined in nilmtk.Disaggregator)?

It seems there are two options:

Use a single train() method

train() would take optional arguments aggregate, appliance and building and maybe a how argument which would be set to things like 'unsupervised training on aggregate' or 'supervised training on aggregate' or 'supervised training on appliances only'. Sometimes it'll be obvious which mode of training is required from the parameters provided (e.g. if you just provide appliance data then it must train supervised on appliances only).

e.g. to train just from appliance data from tracebase, we'd do something like:

appliances = {'fridge': ['fridge1.csv', 'fridge2.csv'],
              'kettle': ['kettle1.csv', 'kettle2.csv']}
disaggregator.train(appliances)

But to train supervised on both mains and appliances we'd do disaggregator.train(aggregate, appliances).

Use multiple train methods

Train from appliance data only
appliances = {'fridge': ['fridge1.csv', 'fridge2.csv'],
              'kettle': ['kettle1.csv', 'kettle2.csv']}
disaggregator.train_on_appliances(appliances)

In this situation, splitting the aggregate dataset into 'train' and
'test' sets is not necessary (because there is already a hard
distinction between the training (appliance) data and the test
(aggregate) data).

Train from both aggregate and associated appliance data

If you have a dataset which records individual appliance
activations simultaneously with aggregate power then train it on a
whole Building object:

train, test = nilmtk.cross_validation.train_test_split(building, test_size=0.4)
disaggregator.train_supervised_on_aggregate(train.aggregate, train.appliances)

train_test_split returns two new Building objects.

Train from aggregate data only
disaggregator.train_unsupervised_on_aggregate(train.aggregate)

Any thoughts? I think I'm starting to lean more towards having a single train method (as Nipun has used in his CO_1d class at the moment).

Another related question: say we have prior information (e.g. vectors describing the probability of a specific appliance being on each hour of the day) then how do we provide that prior information to the NILM algorithm? Maybe we could have methods like set_prior_appliance_usage_patterns()?
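For concreteness, a sketch of the single-train-method interface; the method names, including the prior-setting one, are illustrative rather than an agreed API:

from abc import ABC, abstractmethod

class Disaggregator(ABC):
    @abstractmethod
    def train(self, aggregate=None, appliances=None, how=None):
        """Train from whatever combination of data is supplied: appliances only,
        aggregate only (unsupervised) or both (supervised).  `how` can force a
        particular mode when the combination alone is ambiguous."""

    @abstractmethod
    def disaggregate(self, building):
        """Return estimated per-appliance power for the given Building."""

    def set_prior_appliance_usage_patterns(self, priors):
        """Optionally accept prior information, e.g. per-appliance vectors of the
        probability of being on during each hour of the day."""
        self.priors = priors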
