fraunhoferportugal / tsfel
An intuitive library to extract features from time series.
Home Page: https://tsfel.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
Hi. Thanks for the amazing work.
I would like to use tsfel on Colab.
Could you please tell me how to access the Colab file?
It would be nice to have additional examples of TSFEL apart from HAR.
Dear all,
Playing with your tool I wanted to obtain the spectral features for a given signal
cfg = tsfel.get_features_by_domain(domain='spectral')
len(cfg['spectral'].keys())
26
26 spectral features, nice!
But, when I calculated those features
#Fs previously obtained from data
X = tsfel.time_series_features_extractor(cfg, data,fs=Fs)
X.size
335
335 elements!
I would like to iterate over several time series and build a feature matrix. To do that, I need to know the size of X a priori so that I can preallocate an array to store the values.
I know that certain features of the signal are computed in time slots (such as FFT_mean_coeff), but it is really time-consuming to work out by hand how many results per feature I should expect.
Thus, is there any option to know a priori how many elements will be in the X series?
Thanks for the useful library.
I would like to use it to extract features from a time-series dataset composed of experiments characterized by 30 synchronous times-series, each collected at 1 Hz.
By running your code I get 11.700 features.
Is there any way to automatically reduce the features to the subset your tool evaluates as the most important ones?
Since I have 400 experiments, running a feature selection process afterwards does not seem a good option.
Thanks
Hello! First of all, thank you for developing this great library.
I've been using it lately, and the dataset I'm working on has a low, non-integer sampling frequency of 1/60 Hz. Since tsfel treats the sampling frequency as an integer, it truncates that value to zero, which makes it impossible to extract some features because a "ZeroDivisionError" is raised.
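The failure described in this report can be illustrated without TSFEL. The sketch below assumes, as the report suggests, that the sampling frequency is cast to an integer before it is used as a divisor. The helper `seconds_per_sample` is hypothetical and is not part of TSFEL.

```python
def seconds_per_sample(fs):
    """Return the sampling period in seconds for a sampling frequency in Hz."""
    return 1 / fs


fs = 1 / 60                       # one sample per minute
period = seconds_per_sample(fs)   # works with the float value

truncated = int(fs)               # assumption: an integer cast somewhere upstream
try:
    seconds_per_sample(truncated)
    failed = False
except ZeroDivisionError:
    failed = True

print(truncated, failed)
```

Keeping `fs` as a float end to end avoids the truncation in this sketch.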
Hello there,
Not an issue, just a suggestion.
I saw that you created a function in signal_processing called correlated_features, which identifies highly correlated features given a threshold and then returns a list of features to drop. There is an example of how to use it in the notebook TSFEL_HAR_Example.ipynb, but you don't use it in a pipeline.
I've used that function as a base to create a class that can be used in pipelines. Please note that I think you should update your function to take the absolute number of the correlation as currently you are only dropping high positively correlated features (with correlation bigger than positive threshold). However, correlations have range [-1,1], so I would have thought you also want to drop high negatively correlated features (i.e. with correlation smaller than negative threshold) . See my code below.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CorrelationThreshold(BaseEstimator, TransformerMixin):
    """Feature selector that removes highly correlated features.

    This feature selection algorithm looks only at the features (X), not the
    desired outputs (y), and can thus be used for unsupervised learning.

    Parameters
    ----------
    threshold : float, default=0.95
        Features whose absolute training-set correlation with an earlier
        feature is higher than this threshold will be removed.

    Attributes
    ----------
    to_drop : list
        Names of the features removed during fit.
    to_keep : list
        Names of the features kept during fit.
    """

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.to_drop = None
        self.to_keep = None

    def fit(self, X, y=None):
        """Find the features to drop.

        Parameters
        ----------
        X : pandas.DataFrame, shape (n_samples, n_features)
            Sample vectors from which to compute correlations.
        y : any, default=None
            Ignored. This parameter exists only for compatibility with
            sklearn.pipeline.Pipeline.

        Returns
        -------
        self
        """
        corr_matrix = np.absolute(X.corr())
        # Keep only the upper triangle so that each pair is checked once.
        # The builtin bool replaces np.bool, which newer NumPy versions removed.
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        self.to_drop = [column for column in upper.columns if any(upper[column] > self.threshold)]
        # Preserve the original column order of X.
        self.to_keep = [column for column in X.columns if column not in self.to_drop]
        return self

    def transform(self, X, y=None):
        X_selected = X[self.to_keep]
        return X_selected

    def get_support(self):
        return self.to_keep
Please note that this is a topic under discussion in the sklearn community too: #13405, #14698. One thing that should be further discussed, analysed, and improved is how to choose which of the highly correlated variables should be dropped. I think something like the f-value correlation with the y label could be appropriate, although it is only applicable to supervised learning problems.
Hello! Thank you in advance for providing this useful package.
I am currently using your package (tsfel) to extract features from a dataframe with physiological signals as columns.
I have a very specific question regarding the overlap of the moving window. I was wondering how the overlap is rounded when an uneven windowsize is used. For example, if I have a windowsize of 1281 and an overlap of 0.7, it means that my window will shift (1-0.7)*1281 = 384.3 samples down in my dataframe for each feature extraction step. Is this rounded up or down (ceil or floor)? so 384 or 385?
Kind regards,
Maarten
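The arithmetic in this question can be written out as below. This is only an illustration of the candidate step sizes; it does not show which rounding TSFEL actually applies, which is what the question asks.

```python
import math

window_size = 1281
overlap = 0.7

# Fractional shift between consecutive windows, in samples.
exact_step = (1 - overlap) * window_size

step_floor = math.floor(exact_step)   # rounding down
step_ceil = math.ceil(exact_step)     # rounding up
step_round = round(exact_step)        # rounding to nearest

print(exact_step, step_floor, step_ceil, step_round)
```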
I've noticed a small inconsistency between the usage of np.fft.fft and the numpy documentation.
Namely in tsfel/feature_extraction/features_utils.py
in line 52 following (calc_fft).
The numpy documentation at https://numpy.org/doc/stable/reference/generated/numpy.fft.fft.html#numpy.fft.fft
uses fftfreq to extract the center of each frequency bin. While the TSFEL implementation uses np.linspace.
The output of both functions differs slightly, likely because of how the zero-frequency component is accounted for.
Example:
fs = 10
signal_length = 50
a = np.linspace(0, fs // 2, signal_length // 2) # used in calc_fft
b = np.fft.rfftfreq(signal_length, d=1/fs) # numpy documentation
a, b, a.shape, b.shape
yields:
(array([0. , 0.20833333, 0.41666667, 0.625 , 0.83333333,
1.04166667, 1.25 , 1.45833333, 1.66666667, 1.875 ,
2.08333333, 2.29166667, 2.5 , 2.70833333, 2.91666667,
3.125 , 3.33333333, 3.54166667, 3.75 , 3.95833333,
4.16666667, 4.375 , 4.58333333, 4.79166667, 5. ]),
array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4,
2.6, 2.8, 3. , 3.2, 3.4, 3.6, 3.8, 4. , 4.2, 4.4, 4.6, 4.8, 5. ]),
(25,),
(26,))
The two parts would be equivalent when using np.linspace(0, fs // 2, signal_length // 2 + 1).
As noted the output differs slightly. If I'm reading this correctly, the linspace is calculated for 25 frequencies including the zero-frequency component, while the rfftfreq is calculated for 25 frequencies plus zero-frequency component.
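The equivalence claimed in this report can be checked with a short sketch. The helper `linspace_bins` is a hypothetical name introduced here, and it uses `fs / 2` rather than `fs // 2` so that odd sampling frequencies are not truncated. The check also shows that the match only holds for even signal lengths, because for odd lengths the last `rfftfreq` bin lies below `fs / 2`.

```python
import numpy as np


def linspace_bins(fs, signal_length):
    """Frequency axis built with linspace, using signal_length // 2 + 1 points."""
    return np.linspace(0, fs / 2, signal_length // 2 + 1)


fs = 10

# Even length: both constructions give the same bin centres.
even_match = np.allclose(linspace_bins(fs, 50), np.fft.rfftfreq(50, d=1 / fs))

# Odd length: same number of bins, but the values differ.
odd_match = np.allclose(linspace_bins(fs, 51), np.fft.rfftfreq(51, d=1 / fs))

print(even_match, odd_match)
```

Using `np.fft.rfftfreq` directly sidesteps the even/odd distinction.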
It takes a long time to extract all domain features, so how can I extract only the relevant features from all three domains?
Please help!!!
Hi, I am curious whether we could use this to extract features from multi-dimensional time series with variable lengths?
Provide a parameter, instead of the interactive input prompt, for removing highly correlated features in correlation_report.
Hi folks! Thanks for this fantastic contribution. I'm excited to test the capabilities of this package.
I am having a hard time extracting the features constructed by tsfel for a univariate time series rolled by date. For example, I have a pandas dataframe with m dates and n features, and I want to compute the tsfel feature set for a fixed window size. As a result, I should get a dataframe with m rows (dates) and n times y columns, where y is the number of variables derived by tsfel. Any comments are welcome.
Thanks in advance!
I am trying to use tsfel.dataset_features_extractor on the data file https://github.com/numenta/NAB/blob/master/data/realAWSCloudwatch/ec2_cpu_utilization_24ae8d.csv
The following message is printed:
Features files saved in:
But there are no results in the directory. I expect some sort of error message or warning if nothing is produced. It is not obvious how I can debug this.
I'm using:
X_train = tsfel.time_series_features_extractor(cfg_file, X_train_sig, fs=fs, window_size=window_size)
Is there a minimum window size of 12?
If I use a smaller window_size, I get errors like the one below (I'm running through Julia, so the messages are not so nice):
File "..../tsfel/feature_extraction/calc_features.py", line 297, in time_series_features_extractor for i, feat in enumerate(features):
hypotenuse
tsfel/tsfel/feature_extraction/features.py
Line 201 in 4e07830
I use time_series_features_extractor to extract features from multiple time series, stored in a dataframe with columns [ts1, ts2]. I have several such tables; I run time_series_features_extractor on each and want to concatenate the results along axis=1. But I find that the number of extracted features differs between tables. Can you tell me in what situation a feature will be dropped from the result? It seems 780 features will be extracted by default.
Hi,
I am using tsfel to analyse 564 time series. I want to extract the features of each time course and get a dataframe containing the features for all time courses (they should have the same columns, and each row represents a specific time course).
So I used a loop for this. My dataset has NaN values in some time courses.
My code looks like this, but it only shows that the feature extraction started; it never finishes and does not return the features.
Any suggestions on this?
Many thanks!
Good day
I have experimental results from modal hammer testing of a bolted beam structure. These results are in the form of a frequency response function (FRF), with amplitude as the y axis and frequency as the x axis.
I would love to use your code in the spectral domain for feature extraction of my signals. The issue that I'm facing is that your code calculates the fft within the functions. Is there a way to use your spectral domain code without calculating the fft in the function codes? (As in, allow me to input an fft as the input signal to the code)
In essence, if I use your code as is, I would be calculating the fft of an fft.
I would greatly appreciate feedback.
Thank you
Hi,
I have another question related to the input format of the data for tsfel.
Let's say I have the following dataframe (time_df):
id timestamp ch0 ch1 ch2
0 1 0.5 0.8 0.9
0 2 0.9 0.9 0.8
...
0 100 0.8 0.8 0.8
1 1 0.9 0.1 0.1
....
1 50 0.9 0.9 0.9
2,
3, etc..
Where id is the measurement number, and ch0, ch1, ch2 are the channels recorded.
Let's say for measurement 0, I had 100 points per channel; for measurement 1, I had 50 points, and for measurement 2, I had 70 points.
Just to put it another way:
for measurement 0 - an array of 3x100
for measurement 1 - an array of 3x50
for measurement 2- an array of 3x70
The sampling frequency is the same ( let's say 1 sample/second)
When I used tsfresh to generate features, I just provided the columns with IDs and timestamps:
ts_features = extract_features(time_df, column_id='id', column_sort='timestamp')
It boils down to the question:
How do I need to reshape the data frame to use tsfel to generate each measurement's features, regardless of the measurement lengths (they may be the same, or may be different)?
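One possible way to approach this, sketched without TSFEL itself, is to split the long table into one array per measurement id and extract one feature row per measurement. Everything here is illustrative: the records are toy values in the layout shown above, and `toy_features` is a hypothetical stand-in for the real extractor call.

```python
import numpy as np

# Toy records in the layout (id, timestamp, ch0, ch1, ch2).
records = [
    (0, 1, 0.5, 0.8, 0.9),
    (0, 2, 0.9, 0.9, 0.8),
    (0, 3, 0.8, 0.8, 0.8),
    (1, 1, 0.9, 0.1, 0.1),
    (1, 2, 0.9, 0.9, 0.9),
]


def split_by_id(rows):
    """Group rows by measurement id and return {id: array of shape (n_points, n_channels)}."""
    groups = {}
    for meas_id, timestamp, *channels in rows:
        groups.setdefault(meas_id, []).append((timestamp, channels))
    return {
        meas_id: np.array([ch for _, ch in sorted(items, key=lambda item: item[0])])
        for meas_id, items in groups.items()
    }


def toy_features(signal):
    """Hypothetical stand-in for a feature extractor: per-channel mean and std."""
    return np.concatenate([signal.mean(axis=0), signal.std(axis=0)])


per_measurement = split_by_id(records)
# One row per measurement, regardless of how many points each one has.
feature_matrix = np.vstack([toy_features(per_measurement[i]) for i in sorted(per_measurement)])
print(feature_matrix.shape)
```

Because each measurement is reduced to a fixed-length row, differing measurement lengths do not affect the shape of the resulting matrix.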
There is an unhandled exception in line 770 in features.py. The corresponding function has the following code:
@set_domain("domain", "statistical")
def ecdf_slope(signal, p_init=0.5, p_end=0.75):
    """Computes the slope of the ECDF between two percentiles.

    Possibility to return infinity values.

    Feature computational cost: 1

    Parameters
    ----------
    signal : nd-array
        Input from which ECDF is computed
    p_init : float
        Initial percentile
    p_end : float
        End percentile

    Returns
    -------
    float
        The slope of the ECDF between two percentiles
    """
    signal = np.array(signal)
    # check if signal is constant
    if np.sum(np.diff(signal)) == 0:
        return np.inf
    else:
        x_init, x_end = ecdf_percentile(signal, percentile=[p_init, p_end])
        return (p_end - p_init) / (x_end - x_init)
Unfortunately, the case x_end - x_init == 0 is not handled, which occasionally results in:
PathToAnaconda\lib\site-packages\tsfel\feature_extraction\features.py:770: RuntimeWarning: divide by zero encountered in double_scalars
return (p_end - p_init) / (x_end - x_init)
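A guard for the zero-denominator case could look like the sketch below. `guarded_slope` is a hypothetical helper, not TSFEL code, and returning infinity is an assumption that simply mirrors what the quoted function already does for constant signals; another sentinel may be preferable.

```python
import math


def guarded_slope(p_init, p_end, x_init, x_end):
    """Slope between two ECDF points, returning infinity when both x values coincide."""
    dx = x_end - x_init
    if dx == 0:
        # Assumption: treat a vertical ECDF segment like the constant-signal case.
        return math.inf
    return (p_end - p_init) / dx


print(guarded_slope(0.5, 0.75, 1.0, 2.0))   # regular case
print(guarded_slope(0.5, 0.75, 2.0, 2.0))   # both percentiles fall on the same value
```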
I'm using TSFEL on a Windows 10 machine and end up with the following error message whenever I enable the progress bar for feature extraction:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2588' in position 12: character maps to <undefined>
I'm not really sure why this problem occurs but this Stackoverflow thread provides some suggestions on how to mitigate the issue.
If I set the verbose parameter to 0, everything works as expected.
I have a question about the following usage scenario of tsfel:
I have a pandas data frame with the following structure:
time value
2020-01-01 1.2
2020-01-02 1.3
2020-01-04 1.1
2020-01-07 1.0
2020-01-08 1.5
As you may see, the sampling frequency is not constant (sometimes it is once per day, sometimes it is once per few days).
Can you please let me know if tsfel can handle this kind of data to extract time-series features?
Hello,
I would like to generate features for each observation of my time series and not only window by window.
Does this possibility exist in tsfel, and do you know how to do it?
Thanks in advance
Hello! thank you for the tsfel package.
Do you plan to push recently made updates to the PyPI? I see that there have been a number of changes to tsfel since 14th Feb.
If not, would you recommend that we use the development branch or stick with v0.1.4?
Hi, thanks for this tool! It's a huge help. I'm struggling with how I go about extracting some but not all features from the spectral domain.
To extract all we use something like this:
cfg_file = tsfel.get_features_by_domain('spectral')
data = tsfel.time_series_features_extractor(cfg_file, data, fs=fs)
Which function can we use to extract a list of chosen features?
Thanks!
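One way this might be done is by editing the configuration dictionary before passing it to the extractor. The sketch below runs on a hand-written stand-in dictionary, not on real TSFEL output, and it assumes that each feature entry carries a "use" flag set to "yes" or "no"; that layout and the helper `select_features` are assumptions to verify against the installed version.

```python
import copy

# Stand-in for a configuration dictionary (assumed layout: domain -> feature -> settings).
cfg = {
    "spectral": {
        "Max power spectrum": {"use": "yes"},
        "Spectral centroid": {"use": "yes"},
        "Spectral entropy": {"use": "yes"},
    }
}


def select_features(config, wanted):
    """Return a copy of config in which only the features named in wanted are enabled."""
    selected = copy.deepcopy(config)
    for domain in selected.values():
        for name, settings in domain.items():
            settings["use"] = "yes" if name in wanted else "no"
    return selected


chosen = select_features(cfg, {"Spectral centroid"})
enabled = [name for name, s in chosen["spectral"].items() if s["use"] == "yes"]
print(enabled)
```

Working on a deep copy leaves the original dictionary untouched, so several feature subsets can be derived from one configuration.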
Hello, what would the process be for editing the dictionary?
Originally posted by @espjose in #89 (comment)
I think this is due to recent updates in numpy.
diff_sig = np.diff(signal,axis=0)
fixes the issues
What should the dataset look like in order to create features based on a class of the data? Let's say I have this data, where a class is 1 or 2:
energy.current energy.power energy.powerFactor class time
0 0.080 12.5 0.67 1 1
1 0.081 12.6 0.67 1 2
2 0.083 12.7 0.66 1 3
3 0.083 12.7 0.66 1 4
4 0.080 12.5 0.67 2 1
5 0.081 12.6 0.67 2 2
6 0.083 12.7 0.66 2 3
7 0.083 12.7 0.66 2 4
How do I provide this data to the library for it to be able to generate features by class?
I would like to get a simple demo of tsfel running, so I downloaded the notebooks and ran into:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-14-0cfaaee0f6b7> in <module>
2 googleSheet_name = "MH Copy of Features_dev"
3 # Extract excel info
----> 4 cfg_file = tsfel.extract_sheet(googleSheet_name)
5
6 # Get features
~/Library/Python/3.6/lib/python/site-packages/tsfel/utils/gSheetsFilters.py in extract_sheet(gsheet_name, **kwargs)
111
112 assert len(list_of_features) <= (len_json), \
--> 113 "To insert a new feature, please add it to data/features.json with the code in src/utils/features.py"
114
115 # adds a new feature in Google sheet if it is missing from features.json
AssertionError: To insert a new feature, please add it to data/features.json with the code in src/utils/features.py
I'm not sure the first demo should use Google sheets - maybe that table could be inserted into the Notebook?
Dear all,
Thank you for your incredible work.
I am giving a try with your software, analyzing some neural recordings that I have.
Because of limited computational power, I am running my code through Google's Colaboratory. However, since this tool has a fixed running time, a verbose option would really help when running
tsfel.time_series_features_extractor(cfg,data)
as it may help to calculate the amount of time that certain feature calculation should take
Hi,
First, thank you for such a wonderful library; it is a real life saver. I wanted to clarify the sampling frequency. In my time series dataset I have one observation timestamped every 10 minutes, so should I set the sampling frequency 'fs' to about 1/600, i.e. roughly 0.001667?
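The conversion from a sampling interval to a sampling frequency in Hz is the reciprocal of the interval in seconds. The helper name `sampling_frequency` below is hypothetical and only illustrates the arithmetic.

```python
def sampling_frequency(interval_seconds):
    """Return the sampling frequency in Hz for a fixed interval between samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 1.0 / interval_seconds


fs_10min = sampling_frequency(10 * 60)    # one sample every 10 minutes
fs_hourly = sampling_frequency(60 * 60)   # one sample every hour

print(round(fs_10min, 6), fs_hourly)
```

Keeping the value as a float matters here, since any integer cast of a frequency below 1 Hz yields 0.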
Would you be able to share an example ? How to use this library with a timeseries dataframe?
The library is great; I am just trying to get my head around the fs and window_size parameters in order to feed it my hourly data.
Basically, how do I have to map the data from time to frequency?
My schema looks like the one below:
hourly recording :
id, time, feature1, feature2 , target
Thanks!
Hey! I really like the many feature functions that are implemented in this library 😄
When toying around with your library, I found a bug (some unexpected behavior).
When calling lpcc on some data with default arguments (i.e., n_coeff=12), it returns an array of length 13.
import numpy as np
from tsfel.feature_extraction import lpcc
# Calculate lpcc feature on some dummy data
len(lpcc(np.arange(500))) # returns 13
However, when calling the get_number_features function, it returns that only 12 features will be returned.
from tsfel import get_features_by_domain, get_number_features
# Get number of features for lpcc feature configuration
feat_dict = get_features_by_domain("spectral")
get_number_features({"dummy": {"dummy": sorted(feat_dict.values())[0]["LPCC"]}}) # returns 12
This difference arises because the get_number_features function uses the n_coeff parameter to determine the number of features. But in the lpc function the output is prepended with one value ([1]), making the output of lpc length 13 and thus also the output of lpcc.
So the number of features for lpcc will always be n_coeff + 1. Is there a way to encode this in the feature configuration (dictionary)?
Hi guys,
When running the following, it extracts two sorts of statistical features. For example, I get 0_Mean but also 1_Mean. It seems that the first is the regular mean, but I cannot figure out from the documentation what the latter mean is. The same question applies to all other features, which are returned as 0_featureName and 1_featureName.
Could you clarify?
import tsfel
cfg_file = tsfel.get_features_by_domain("statistical")
X_train = tsfel.time_series_features_extractor(cfg_file, dataset, fs=1, window_size=10)
Best, Patrick
I apologize in advance if this is not the right place for this post. I am new to Github.
TSFEL repeats basic calculations, such as FFT, for different types of features, which slows down the feature calculation unnecessarily. I work with time series with over 160 million data points and calculating FFT costs 2 minutes each time. Wouldn't it be better to store and retrieve the FFT for the following features?
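The idea in this suggestion can be sketched as follows: compute the spectrum once and pass it to every feature that needs it. This is not how TSFEL is organised internally; the function names and the two example features are chosen here purely for illustration.

```python
import numpy as np


def spectrum(signal, fs):
    """Compute the one-sided frequency axis and magnitude spectrum once."""
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    magnitude = np.abs(np.fft.rfft(signal))
    return freqs, magnitude


def spectral_centroid(freqs, magnitude):
    """Magnitude-weighted mean frequency."""
    return np.sum(freqs * magnitude) / np.sum(magnitude)


def dominant_frequency(freqs, magnitude):
    """Frequency of the largest spectral peak."""
    return freqs[np.argmax(magnitude)]


fs = 100
t = np.arange(100) / fs
signal = np.sin(2 * np.pi * 5 * t)   # 5 Hz test tone, one second long

# The FFT runs a single time; both features reuse the result.
freqs, magnitude = spectrum(signal, fs)
features = {
    "centroid": spectral_centroid(freqs, magnitude),
    "dominant": dominant_frequency(freqs, magnitude),
}
print(features)
```

For very long signals the saving grows with the number of spectral features, since each additional feature then costs only a pass over the stored spectrum.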
Hey guys,
Thanks for this awesome library. I saw that in this commit, ECDF slope feature extraction functionality was removed from the library. It would be great if you guys can explain what was the issue with it.
Hi, I found it to be a great library for time-series feature extraction, and a lightweight one too.
I was wondering: if one wants to create or sample synthetic series based on TSFEL features extracted from univariate real series, what would be the way forward, and is this going to be a feature in a future release? Any suggestions or recommendations in this regard will be highly appreciated.
Many Thanks.
It would be nice if TSFEL had a progress bar during the feature extraction process, so that we can monitor the estimated time needed to complete the feature extraction.
First of all, thank you very much for this great library!
The following code in the "distance" function in features.py:
diff_sig = np.diff(signal)
return np.sum([np.sqrt(1 + diff_sig ** 2)])
should be changed to:
diff_sig = np.diff(signal)
diff_sigFloat = diff_sig.astype(float)
return np.sum([np.sqrt(1 + diff_sigFloat ** 2)])
The reason for that is that otherwise an integer overflow might occur for larger numbers, which results in negative numbers, which, in turn, results in an "invalid value" error.
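The overflow described in this report can be reproduced with a small sketch. It assumes a 32-bit integer input, which is one way the problem can arise; the variable names are chosen for illustration only.

```python
import numpy as np

# A jump of 50,000 between two int32 samples: its square exceeds the int32 range.
signal = np.array([0, 50_000], dtype=np.int32)
diff_sig = np.diff(signal)

squared_int = diff_sig ** 2                  # wraps around silently
squared_float = diff_sig.astype(float) ** 2  # no overflow after the cast

with np.errstate(invalid="ignore"):
    distance_int = np.sum(np.sqrt(1 + squared_int))   # square root of a negative number
distance_float = np.sum(np.sqrt(1 + squared_float))

print(squared_int, distance_int, distance_float)
```

Casting to float before squaring, as proposed, avoids the wrap-around in this example.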
The X_train data is reduced from 208 rows to just 1 row, resulting in errors in the further execution of the code. What could be going wrong?
Here is the code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train)
cfg = tsfel.get_features_by_domain()
# Get features
X_train = tsfel.time_series_features_extractor(cfg, X_train, fs=fs)
X_test = tsfel.time_series_features_extractor(cfg, X_test, fs=fs)
print(X_train)
print(X_test)
corr_features = tsfel.correlated_features(X_train)
X_train.drop(corr_features, axis=1, inplace=True)
X_test.drop(corr_features, axis=1, inplace=True)
Hi everyone,
I'm looking to extract features from 3 IMUs, each containing a 3-axis accelerometer and gyroscope. I have created a dataframe to combine the data from all of them: 3 IMUs x 2 sensors (Acc, Gyr) x 3 axes (xyz) = 18 columns + Timestamp.
I started out calling tsfel.get_features_by_domain on the entire dataframe, but that never progressed from 0% complete. Then I reduced the problem considerably:
cfg = tsfel.get_features_by_domain(domain='statistical',
                                   json_path='features.json')
data = df.loc[s_times[0]:f_times[0], 'Neck.Acc.X'][:101].to_list()
df_tsfel = tsfel.time_series_features_extractor(
    # configuration file with features to be extracted
    dict_features=cfg,
    # dataframe window to calculate features window on
    signal_windows=data,
    # sampling frequency of original signal
    fs=100,
    # sliding window size
    window_size=100
)
It surely can't get any simpler than this, and still it doesn't leave 0%.
There must be something wrong with the way I set things up. Can someone help, please? I'm currently writing a paper and will need to give up using this package if I don't manage to sort this out...
Some extra info:
When I am extracting features, the histogram feature constantly raises a warning. Could you check what is happening?
Hi,
I am wondering if there is a gui version of tsfel.
Also, what should the input data look like? Is this able to compute multiple timeseries at the same time and extract their features and cluster them based on the features?
Hi tsfel developers!
Nice package! Congratulations on this.
I am one of the authors of tsfresh. I would be very much interested in understanding if there is something we can learn from each other? Is there a functionality tsfresh could provide to tsfel? Or the other way round? Would you even think it makes sense to combine efforts?
Happy to hear your opinion :-)
When I open the feature list in the tsfel docs, it shows errors in place of the code area.
System Message: WARNING/2 (/home/docs/checkouts/readthedocs.org/user_builds/tsfel/checkouts/latest/docs/descriptions/feature_list.rst, line 6)
failed to import tsfel.feature_extraction.features
System Message: WARNING/2 (/home/docs/checkouts/readthedocs.org/user_builds/tsfel/checkouts/latest/docs/descriptions/feature_list.rst, line 6)
toctree references unknown document ‘descriptions/_generated/tsfel.feature_extraction.features’
I have a series with one sample every 5 minutes, so the sample rate is about 0.0033 Hz:
X_train = tsfel.time_series_features_extractor(cfg_file, X_train_sig, fs=0.0033, window_size=12)
but I get an error:
Name: value, dtype: float64
*** Feature extraction started ***
/Users/markh/Library/Python/3.6/lib/python/site-packages/scipy/signal/spectral.py:1800: RuntimeWarning: divide by zero encountered in double_scalars
scale = 1.0 / (fs * (win*win).sum())
Must the sample rate be > 1?
Dear authors,
First of all, congratulations on this great project, which is very helpful for the whole community.
I have an issue related to the number of extracted feature samples.
I execute this call:
X = ts.time_series_features_extractor(cfg, tmp_data, fs = 32, window_size=32, overlap=0, verbose = 0)
on my accelerometer data frame of dimension 160 x 3: 160 samples and three columns ['X','Y','Z'].
From this call, X has a dimension of 1 x 789. It returns a single row of features for all the 160 x 3 accelerometer samples.
However, this does not seem right. Since window_size = 32 (a 1-second time frame), it should return an X with dimension 5 x 789.
How is this possible?
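The expectation of 5 rows follows from the usual sliding-window count, sketched below. The helper `expected_windows` is hypothetical; it counts only full windows and assumes the step between windows is truncated to an integer, which may not match what TSFEL does internally.

```python
def expected_windows(n_samples, window_size, overlap=0.0):
    """Number of full windows that fit into a signal of n_samples."""
    step = int(window_size * (1 - overlap))  # assumption: fractional steps are truncated
    if step < 1:
        raise ValueError("overlap too large for this window size")
    if n_samples < window_size:
        return 0
    return (n_samples - window_size) // step + 1


n_rows = expected_windows(160, 32, overlap=0)
print(n_rows)
```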