dmbee / seglearn

Python module for machine learning time series:

Home Page: https://dmbee.github.io/seglearn/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

data-science machine-learning python time-series

seglearn's Issues

Using multiple features for neural network model

Hi, I tried to use multiple features with the seglearn example Classifying Segments Directly with a Neural Network. Specifically, I just added FeatureRep() to the pipe definition as below:

pipe = Pype([('seg', Segment(width=100, overlap=0.5, order='C')),
             ('features', FeatureRep()),
             ('crnn', KerasClassifier(build_fn=crnn_model, epochs=1, batch_size=256, verbose=0))])

But then I encountered an error as below:

ValueError: Input 0 of layer sequential_14 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 66]

I'd imagine that I need to modify the input_shape definition of the crnn_model, but I am not quite sure how to do that. I'd appreciate any pointers, thanks.

Imbalanced learn transformers broken

Example no longer runs successfully with the latest imbalanced-learn due to the _check_X_y function

Pype and Pipeline version incompatible

Pype version: 1.2.2
Sklearn version: 1.0.1
Python version: 3.8.10

Problem: __init__() takes 2 positional arguments but 3 were given

Solution: change to super(Pype, self).__init__(steps, memory=memory) in seglearn/pipe.py
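The error message is consistent with a base class constructor that accepts `memory` only as a keyword argument. A minimal sketch with stand-in classes (`BasePipeline`, `BrokenPype` and `FixedPype` are hypothetical, not the scikit-learn or seglearn sources) reproduces the failure and the proposed fix:

```python
class BasePipeline:
    """Stand-in for a base class whose `memory` argument is keyword-only."""

    def __init__(self, steps, *, memory=None):
        self.steps = steps
        self.memory = memory


class BrokenPype(BasePipeline):
    def __init__(self, steps, scorer=None, memory=None):
        # Passing `memory` positionally fails when the parent requires a keyword.
        super().__init__(steps, memory)
        self.scorer = scorer


class FixedPype(BasePipeline):
    def __init__(self, steps, scorer=None, memory=None):
        # Passing `memory` by keyword works with either parent signature.
        super().__init__(steps, memory=memory)
        self.scorer = scorer


try:
    BrokenPype([("step", None)])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed = FixedPype([("step", None)], memory="cache_dir")
```

Passing the argument by keyword is the safer choice in a subclass, because it keeps working whether or not the parent accepts the argument positionally.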

StackedInterp

Create a time series interpolator for datasets in long format, e.g.
[time, var, value]
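A minimal sketch of what such an interpolator could do, assuming numeric variable ids and linear interpolation onto a shared time grid (`long_to_wide` is a hypothetical name, not the eventual seglearn API):

```python
import numpy as np


def long_to_wide(records, t_new):
    """Interpolate long-format rows [time, var, value] onto a shared time grid."""
    records = np.asarray(records, dtype=float)
    variables = np.unique(records[:, 1])
    columns = []
    for var in variables:
        rows = records[records[:, 1] == var]
        rows = rows[np.argsort(rows[:, 0])]  # np.interp needs increasing times
        columns.append(np.interp(t_new, rows[:, 0], rows[:, 2]))
    # One column per variable, one row per point of the new time grid.
    return np.column_stack(columns)


long_data = [
    [0.0, 0, 1.0],
    [2.0, 0, 3.0],
    [0.0, 1, 10.0],
    [1.0, 1, 20.0],
    [2.0, 1, 30.0],
]
wide = long_to_wide(long_data, t_new=np.array([0.0, 1.0, 2.0]))
```

Here variable 0 is sampled only at t = 0 and t = 2, so its value at t = 1 is filled in by interpolation.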

Documentation question

Hello!

I was reading the user guide, and I think there is a typo in the 1st paragraph of the time series section.

I believe you meant to have <y_i,1, y_i,2, ... y_i,t> instead as you do below.

Since you describe y_i as a univariate sequence, I am having trouble seeing how the vector of x vectors fits the description, so I think you may have meant to use little-y (non-bold / non-vector y) notation in that sentence. Or perhaps this means that y_i is the target vector for the corresponding samples <x_i,1, ... x_i,t>, so that y_i corresponds to the X_i sample but is not composed of them. If so, I think the phrasing and the use of the word "with" make it a little confusing.

Can I get the full watch dataset?

Hi, I am working on a similar project on segmenting time series. I am looking for more data to test my models. Your paper https://iopscience.iop.org/article/10.1088/1361-6579/aacfd9/pdf, mentioned that there are 280 time series but there are only 140 series in watch_dataset.npy. Would it be possible for me to get a hold of all 280 time series? Thank you

Passing data to temporal_split and other functions

Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15). The last column is the target so I created two dfs, one for X (1017, 14) and one for y (1017). I tried to pass those values to temporal_split but I always get an error no matter what I do (passing the df, passing them as lists). For example, passing them as list gives:

KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000],\n dtype='int64', length=1001)] are in the [columns]"

If, on the other hand, I pass them as df I get:

AttributeError: 'DataFrame' object has no attribute 'ts_data'

The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train)
I tried to put the date column in the df as well as in the index but the error is still there.
What's wrong?

Info of the Dataframe:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017 entries, 896 to 1912
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1017 non-null   datetime64[ns]
 1   id            1017 non-null   object        
 2   price         1017 non-null   float64       
 3   month         1017 non-null   int64         
 4   year          1017 non-null   int64         
 5   event_name_1  1017 non-null   int64         
 6   event_type_1  1017 non-null   int64         
 7   event_name_2  1017 non-null   int64         
 8   event_type_2  1017 non-null   int64         
 9   snap_CA       1017 non-null   int64         
 10  dow           1017 non-null   int64         
 11  is_weekend    1017 non-null   int64         
 12  is_holiday    1017 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(10), object(1)
memory usage: 111.2+ KB

I tried to use it with a date column, with a date index, and as a list. The same goes for y: I tried it as a Series, as a DataFrame with a date column or date index, and as a list, both with and without the date column. As you can see, there are no NaN values.

Feature Request: Event Segmentation

Hi, seglearn looks very interesting, and I was hoping it could provide functionality for segmenting time series data by events (i.e. determined by index). For example, let's say I have a list of events that are interesting based on a time series, and I would like a window of the previous X observations per event. I would like to use the index to find the location in the time series and get the last X observations, then iterate over the rest of the indexes to get all windows.

Here is the code I am using to segment my data into (samples, length, dimensions) using the index:

self.Xs = []
for i in self.Xtbl.index:
    start = self.X.index.get_loc(i) - sliding_window
    end = self.X.index.get_loc(i)
    window = self.X.iloc[start:end, ].to_numpy()
    self.Xs.append(window)
self.Xs = np.stack(self.Xs)
self.ys = self.ytbl.copy()

I would like to use seglearn with this functionality in a pipeline to fit. However, the Segment class doesn't appear to segment based on events. Thanks for your consideration.
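A standalone version of the same idea might look like this. It takes integer positions rather than index labels, and skips events that do not have enough history, since a negative start would otherwise yield an empty or wrongly sized slice. `event_windows` is a hypothetical helper, not a seglearn transform:

```python
import numpy as np


def event_windows(X, event_positions, width):
    """Return the `width` observations preceding each event position."""
    windows = []
    kept = []
    for pos in event_positions:
        start = pos - width
        if start < 0:
            continue  # not enough history before this event
        windows.append(X[start:pos])
        kept.append(pos)
    # Result has shape (n_events_kept, width, n_variables).
    return np.stack(windows), np.array(kept)


X = np.arange(20).reshape(10, 2)  # 10 observations, 2 variables
Xs, kept = event_windows(X, event_positions=[2, 5, 9], width=3)
```

Returning the kept positions alongside the windows makes it possible to drop the matching targets for any skipped events.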

Dataframe of multiple multivariate time series

I have z different time series with different lengths. Each time series has a different number of timestamped time points, and each time point has m features and an observed float outcome. My aim is to model a regressor (given the m features, what is the outcome?). I trained a regressor while ignoring the temporal dimension of the dataset (training on all data points, using the m features to predict the outcome), but the results were poor.
(Multiple multivariate time series with different length and sampling frequency)

My aim is to add a temporal dimension for each time point, for example by adding new features in a rolling fashion: for each time point, the mean of past feature values, the std of past feature values, etc. I could not find any example of adding new features to a data frame of multiple multivariate time series with different lengths and sampling frequencies. Can you help me?

How are multivariate time series handled?

Hi!
I'm classifying multivariate time series with seglearn and it works great!
Right now I'm trying to learn more about this topic, and I would like to know how the multivariate aspect is handled in seglearn. In the docs, the sliding window method is mentioned, but I'm not able to find any more information on my own.
I'd be very thankful if someone could help me out :)

If it is easier to discuss a specific case, this is what I'm doing at the moment:

Data:
n_samples = 6000
n_dimensions = 128
Various time lengths for each sample (300 -700)

Labels:
0 / 1

Classifier:

Pype([('segment', PadTrunc(width=700)),
      ('features', FeatureRep()),
      ('scaler', StandardScaler()),
      ('svc', svm.LinearSVC(class_weight='balanced', max_iter=2000))])

To my understanding, I get 11 features per dimension with FeatureRep(), which would result in an 11x128 matrix per sample, which won't match the standard svm input.
In other words, how are the 128D-timeseries condensed into something that can be fed into the svm?
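The `[None, 66]` shape reported in the neural network issue on this page equals 11 × 6, which would be consistent with features being computed per channel and concatenated into one flat row per segment, if that data had 6 channels. Under that assumption, 128 channels would give 11 × 128 = 1408 columns, a 2D matrix that an SVM can accept. A rough sketch of the idea with four stand-in features (not seglearn's actual feature set or implementation):

```python
import numpy as np


def feature_rep_sketch(X, feature_funcs):
    """Apply each function along the time axis and concatenate the results."""
    # X has shape (n_segments, width, n_channels).
    # Each function returns (n_segments, n_channels); stacking them side by
    # side gives (n_segments, n_features * n_channels).
    return np.column_stack([func(X) for func in feature_funcs])


feature_funcs = [
    lambda X: X.mean(axis=1),
    lambda X: X.std(axis=1),
    lambda X: X.min(axis=1),
    lambda X: X.max(axis=1),
]

X = np.random.default_rng(0).normal(size=(5, 700, 128))
features = feature_rep_sketch(X, feature_funcs)  # shape (5, 4 * 128)
```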

What the segment class is used for

Hello,
I can't understand the usage of the Segment class: in what cases do I need to use this transform, and how does it help?
I also couldn't find an example of how to incorporate contextual variables.
When I run it on toy data, it is very unclear what happened, since X is unchanged but y was reduced to a single value:

# Single multivariate time series with 3 samples of 4 variables
X = [np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])]
# Time series target
y = [np.array([True, False, False])]
print("X: " , X)
print("y: " ,y)
segment = Segment(width=3, overlap=1)
X, y, _ = segment.fit_transform(X, y)
print('After segmentation:')
print("X:", X)
print("y: ", y)

X : [array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])]
y : [array([ True, False, False])]
After segmentation:
X: [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]]
y:  [False]
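The output above is what a sliding window gives when the window width equals the series length: only one window fits, so X keeps its values (gaining a leading segment axis) and y collapses to one label per window. The sketch below assumes the label is taken from the last sample of each window, which matches the `[False]` shown but is an assumption about seglearn's default. `sliding_segments` is a hypothetical helper:

```python
import numpy as np


def sliding_segments(x, y, width, step):
    """Cut one series into fixed-width windows; label each by its last target."""
    starts = range(0, len(x) - width + 1, step)
    X_seg = np.stack([x[s:s + width] for s in starts])
    y_seg = np.array([y[s + width - 1] for s in starts])
    return X_seg, y_seg


x = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
y = np.array([True, False, False])

X_short, y_short = sliding_segments(x, y, width=3, step=1)  # one window only
X_more, y_more = sliding_segments(x, y, width=2, step=1)    # two windows
```

With a width smaller than the series length, the transform starts to do visible work: each window becomes one training example with its own label.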

Migrate to CircleCI 2.0

Need to migrate as CircleCI 1.0 support has stopped. Integration testing should include Windows, Linux, Python 2x and 3x.

run length encoder

Add a new preprocessing transform for run-length encoding of the target, for datasets that encode a categorical target as a time series variable,

e.g. PAMAP2 and MHEALTH

http://archive.ics.uci.edu/ml/datasets/pamap2+physical+activity+monitoring
http://archive.ics.uci.edu/ml/datasets/mhealth+dataset
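Run-length encoding collapses each stretch of identical consecutive labels into a single run. A minimal sketch using the standard library (`run_length_encode` is a hypothetical name, and the activity labels are made up):

```python
from itertools import groupby


def run_length_encode(labels):
    """Collapse consecutive repeats into (label, start, length) runs."""
    runs = []
    start = 0
    for label, group in groupby(labels):
        length = sum(1 for _ in group)
        runs.append((label, start, length))
        start += length
    return runs


target = ["walk", "walk", "walk", "sit", "sit", "walk"]
runs = run_length_encode(target)
```

Each run's start and length can then be used to slice the sensor data into one time series per activity bout.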

RollingSplit

Implement data splitter with rolling splits similar to sklearn.model_selection.TimeSeriesSplit but with compatibility for data sets with more than a single time series and contextual data.
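A rough sketch of such a splitter, assuming each series is cut into `n_splits + 1` equal folds and the training window grows by one fold per split. `rolling_splits` is hypothetical, leftover samples at the end of a series are dropped, and contextual data would simply travel with its series:

```python
def rolling_splits(series_lengths, n_splits):
    """Yield, per split, a (train, test) pair of index ranges for every series."""
    for k in range(1, n_splits + 1):
        split = []
        for n in series_lengths:
            fold = n // (n_splits + 1)
            train = range(0, k * fold)              # everything before the test fold
            test = range(k * fold, (k + 1) * fold)  # the next fold in time
            split.append((train, test))
        yield split


splits = list(rolling_splits(series_lengths=[10, 20], n_splits=3))
```

Because the test fold always follows the training window in time, no split trains on samples that come after the ones it is evaluated on.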

will it work for multivariate time series prediction both regression and classification

Great code, thanks. Could you clarify whether it will work for multivariate time series prediction, both regression and classification:
1. where all values are continuous?
2. where the values are a mixture of continuous and categorical, for example 2 dimensions with continuous values and 3 dimensions with categorical values?

   color   weight  gender  height  age
1  black   56      m       160     34
2  white   77      f       170     54
3  yellow  87      m       167     43
4  white   55      m       198     72
5  white   88      f       176     32

question on ts_data

I have a simple 2D array of data that has samples in the first dimension (rows) and sensors in the 2nd. I understand this is called the "wide" format. It's fairly large: 2 million-ish samples from 100+ sensors. It's in order but there is no date/time column. I have a separate set of labels that contains the same number of samples. I don't understand how to convert this into the required format; I get the error "object has no attribute 'ts_data'". What value is supposed to go in the to-be-created ts_data column?
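One reading of the error is that the functions expect a collection of time series rather than a bare 2D array. Another issue on this page passes `X = [np.array(...)]` and `y = [np.array(...)]` to `Segment`, which suggests wrapping the wide array in a list as a single series. A minimal sketch under that assumption (`as_single_series` is a hypothetical helper, not part of seglearn, and the arrays are stand-ins):

```python
import numpy as np


def as_single_series(data, labels):
    """Wrap one wide-format recording as a one-element list of time series."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    if data.ndim != 2 or len(data) != len(labels):
        raise ValueError("expected (n_samples, n_sensors) data and n_samples labels")
    return [data], [labels]


data = np.zeros((1000, 4))          # stand-in for the sensor array
labels = np.zeros(1000, dtype=int)  # one label per sample
X, y = as_single_series(data, labels)
```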

exponential decaying features

to capture long term dynamics
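One possible form of such a feature is an exponentially weighted running mean, where older samples contribute with geometrically decaying weight. A minimal sketch (`exp_decay_mean` and `alpha` are illustrative names, not an existing seglearn feature):

```python
def exp_decay_mean(values, alpha):
    """Exponentially weighted running mean; smaller alpha means longer memory."""
    out = []
    state = values[0]  # start from the first observation
    for v in values:
        state = alpha * v + (1.0 - alpha) * state
        out.append(state)
    return out


smoothed = exp_decay_mean([0.0, 1.0, 1.0, 1.0], alpha=0.5)
```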

Question about data representation

How can I work with seglearn if I have a data representation that is presented here.

I have two cases. In the first case I have a variable that is time dependent so I would like to extract features from the previous values in order to build the X matrix.

2011-01-01 01:00:00    1.073392
2011-01-01 02:00:00    0.274406
2011-01-01 03:00:00    1.446233
2011-01-01 04:00:00   -0.035727

In the second case I have the same problem, but with one or more additional time-dependent variables that I want to use to predict the third one.

Future warning for transform

Running your example code "Continuous Target and Time Series Regression" I get many future warnings related to "transform.py" as reported below:
"FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future."
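The warning text points at a generator being passed to a NumPy stacking function. Whether that is exactly what `transform.py` does is an assumption, but the usual remedy is to materialize the generator as a list first:

```python
import numpy as np

segments = [np.arange(3) + i for i in range(4)]

# Deprecated since NumPy 1.16: np.stack(s * 2 for s in segments)
# Passing a list (or tuple) avoids the warning:
stacked = np.stack([s * 2 for s in segments])
```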

Pandas

Switch out TS_Data class for pandas df as the standard class for multivariate ts with context data.

Multiple time series prediction

I need to classify and forecast multiple time series, but I don't know how to proceed.

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38
yysrzzl,-16.11,-10.98,-9.92,6.95,10.97,7.37,-49.69,-41.9,-42.43,-44.76,114.57,56.95,34.41,22.79,-41.46,-34.25,-24.03,-20.23,-43.22,-39.03,-38.56,-35.7,101.88,35.52,23.73,31.44,-41.7,-27.24,-7.21,-3.11,39.26,75.3,42.99,32.5,-1.6,-6.45,-7.22,-2.7
jzcsyl,2.89,5.0,0.28,1.46,1.46,-2.61,-2.91,-8.0,-22.86,-57.59,-6.46,-24.47,-42.62,-176.56,-19.62,-106.14,-351.34,,,,,23.85,-18.36,24.24,24.2,47.8,2.71,35.01,36.91,37.48,-3.78,2.35,6.64,4.16,0.14,1.41,2.65,4.04
zcfl,72.1,72.89,72.08,73.31,74.16,76.26,77.25,77.97,80.83,76.78,79.47,81.65,83.45,90.87,92.35,95.44,97.93,139.85,141.63,147.12,150.07,97.15,97.53,96.22,96.2,94.34,94.19,91.83,91.72,91.27,82.03,81.66,81.18,57.57,56.61,54.44,54.91,53.59
ldbl,2.45,2.26,2.03,2.03,2.37,2.17,2.26,2.23,2.06,2.07,2.14,2.24,2.11,2.23,1.97,2.18,2.56,2.03,2.11,2.2,2.38,3.53,3.6,8.17,8.54,10.19,10.72,14.99,16.86,17.34,7.66,7.66,7.96,4.27,4.17,4.11,4.25,4.27

Reverse Transform of Segment.

Hello, I have an issue: the shape of the predictions is not the same as that of the original labels because of the segmentation process. Is there a function to convert the predictions back to the shape they had before segmentation?

Is FeatureRep() with sequences of different length possible?

Hi!
I've got multidimensional time-series data, in which the different samples are of different temporal length (between 600 - 1800).

When using PadTrunc() the shorter samples get zeros appended to them, which will affect the features in the following step, for example the mean. And I would thus like to avoid the PadTrunc() step.

Pype([('pad', PadTrunc(width=1000)),
      ('features', FeatureRep('default')),
          ...

In theory I see no problem with calculating the features from samples of different lengths, but I can't manage to do it in practice.

PadTrunc() reshapes the data to (n_samples, truncation_width, n_channels) , which seems to be the assumed shape for the rest of the pipeline. Without PadTrunc() there is no way to get samples of different lengths in to an array of this type.

Is it possible to do this in seglearn somehow, and if not, what is the consensus with PadTrunc() in order to avoid affecting the features?

Thanks.
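One way around padding, outside the `Pype`/`FeatureRep` pipeline, is to reduce each series over its own time axis so that every sample yields a fixed-length feature vector. This is a workaround sketch with two stand-in features, not a seglearn API:

```python
import numpy as np


def features_per_series(series_list):
    """Compute per-channel mean and std for each series, whatever its length."""
    rows = [np.concatenate([s.mean(axis=0), s.std(axis=0)]) for s in series_list]
    return np.stack(rows)  # shape (n_series, 2 * n_channels)


rng = np.random.default_rng(0)
series_list = [rng.normal(size=(n, 3)) for n in (600, 1200, 1800)]
F = features_per_series(series_list)
```

Since no zeros are appended, statistics such as the mean reflect only the observed samples of each series.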

list of features in FeatureRep

Hi,
I searched and couldn't find the full list of possible features in FeatureRep and their implementation.
Where can I find it?

Thanks!

pip install broken

Description
After installing seglearn, whether via pip install seglearn or by cloning the GitHub repo as described, I encounter an error when trying to import the package in a clean Python environment.

steps to reproduce

1. create clean environment

$ conda activate test
$ pip install seglearn matplotlib tensorflow

2. try to import seglearn in python

import matplotlib.pyplot as plt
from tensorflow.python.keras.layers import Dense, LSTM, Conv1D
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import train_test_split

from seglearn.datasets import load_watch
from seglearn.pipe import Pype
from seglearn.transform import Segment

3. error output

ModuleNotFoundError Traceback (most recent call last)
~/OneDrive/for JingJin/seglearn.py in
5 from tensorflow.python.keras.wrappers.scikit_learn import KerasClassifier
6 from sklearn.model_selection import train_test_split
----> 7 from seglearn.datasets import load_watch
8 from seglearn.pipe import Pype
9 from seglearn.transform import Segment

~/OneDrive/for JingJin/seglearn.py in
6 from sklearn.model_selection import train_test_split
7
----> 8 from seglearn.datasets import load_watch
9 from seglearn.pipe import Pype
10 from seglearn.transform import Segment

ModuleNotFoundError: No module named 'seglearn.datasets'; 'seglearn' is not a package


Thank you very much for your help.
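The traceback shows that the script being run is itself named `seglearn.py`, and the message `'seglearn' is not a package` is what Python reports when a name resolves to a single module file rather than a package directory. If that is the cause here, renaming the script should help. A quick way to check which file an import resolves to (`module_origin` is a hypothetical helper; `json` is used only as a safe example):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level import of `name` would load, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


origin = module_origin("json")
```

For an installed package the returned path ends in `__init__.py` inside the package directory; a path pointing at a script in the working directory would indicate shadowing.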

Interp bug

Interp transform has a bug, which deletes the last variable column for multivariate time series. The fix is pushed to dev branch and will shortly be merged into master.

Package in conda-forge channel

Are you planning to add this package to the conda-forge channel?

Pype broken with scikit-learn 0.24

When using Pype with scikit-learn version 0.24 I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/seglearn/pipe.py", line 59, in __init__
    super(Pype, self).__init__(steps, memory)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 74, in inner_f
    return f(**kwargs)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 118, in __init__
    self._validate_steps()
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 157, in _validate_steps
    self._validate_names(names)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 70, in _validate_names
    invalid_names = set(names).intersection(self.get_params(deep=False))
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 137, in get_params
    return self._get_params('steps', deep=deep)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 29, in _get_params
    out = super().get_params(deep=deep)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py", line 195, in get_params
    value = getattr(self, key)
AttributeError: 'Pype' object has no attribute 'scorer'

Example to reproduce the error:

python=3.8
scikit-learn=0.24
seglearn=1.2.1

from seglearn.transform import SegmentX
from seglearn.pipe import Pype

pipe = Pype([('segment', SegmentX())])   # will crash on creation

From a quick look at seglearn's source code:

seglearn/seglearn/pipe.py

Lines 58 to 64 in 9000eee

def __init__(self, steps, scorer=None, memory=None):
    super(Pype, self).__init__(steps, memory)
    self.scorer = scorer
    self.N_train = None
    self.N_test = None
    self.N_fit = None
    self.history = None

One solution could be to move the call to the superclass __init__ to the end:

def __init__(self, steps, scorer=None, memory=None):
    self.scorer = scorer
    self.N_train = None
    self.N_test = None
    self.N_fit = None
    self.history = None
    super(Pype, self).__init__(steps, memory)

Postprocessing

New feature: ReconstructTs

This should go in a new postprocessing module and reconstruct time series target labels from predictions on segments, mapped back to the original data samples.

This could be implemented using interpolation (nearest neighbor) for categorical targets and anything else for continuous targets.

I don't think this can be integrated in the current pipeline atm. Another option would be to design another pipeline class that has this implemented as its last step.
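For categorical targets, the nearest-neighbor idea can be sketched as follows. `reconstruct_labels` and the segment center indices are hypothetical; how segment positions would be tracked inside a pipeline is left open:

```python
import numpy as np


def reconstruct_labels(seg_pred, seg_centers, n_samples):
    """Assign each original sample the prediction of the nearest segment center."""
    sample_idx = np.arange(n_samples)
    # Distance from every sample to every segment center.
    dist = np.abs(sample_idx[:, None] - np.asarray(seg_centers)[None, :])
    nearest = dist.argmin(axis=1)  # ties go to the earlier segment
    return np.asarray(seg_pred)[nearest]


seg_pred = np.array([0, 1, 1])  # one predicted class per segment
seg_centers = [2, 6, 10]        # hypothetical center index of each segment
y_full = reconstruct_labels(seg_pred, seg_centers, n_samples=12)
```

The result has one label per original sample, so it can be compared directly against the unsegmented target.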
