
ppscore's Introduction

ppscore - a Python implementation of the Predictive Power Score (PPS)

If you don't know yet what the Predictive Power Score is, please read the following blog post:

RIP correlation. Introducing the Predictive Power Score

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).

Installation

You need Python 3.6 or above.

From the terminal (or Anaconda prompt in Windows), enter:

pip install -U ppscore

Getting started

The examples refer to the newest version (1.2.0) of ppscore. See changes

First, let's create some data:

import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]

Based on the dataframe we can calculate the PPS of x predicting y:

pps.score(df, "x", "y")

We can calculate the PPS of all the predictors in the dataframe against a target y:

pps.predictors(df, "y")

Here is how we can calculate the PPS matrix between all columns:

pps.matrix(df)

Visualization of the results

For the visualization of the results you can use seaborn or your favorite viz library.

Plotting the PPS predictors:

import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")

Plotting the PPS matrix:

(This needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data)

import seaborn as sns
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

API

ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=123, invalid_score=0, catch_errors=True)

Calculate the Predictive Power Score (PPS) for "x predicts y"

  • The score always ranges from 0 to 1 and is data-type agnostic.

  • A score of 0 means that the column x cannot predict the column y better than a naive baseline model.

  • A score of 1 means that the column x can perfectly predict the column y given the model.

  • A score between 0 and 1 states the ratio of how much potential predictive power the model achieved compared to the baseline model.

Parameters

  • df : pandas.DataFrame
    • Dataframe that contains the columns x and y
  • x : str
    • Name of the column x which acts as the feature
  • y : str
    • Name of the column y which acts as the target
  • sample : int or None
    • Number of rows for sampling. The sampling decreases the calculation time of the PPS. If None there will be no sampling.
  • cross_validation : int
    • Number of iterations during cross-validation. For example, if the number is 4, a pattern can only be detected when an observation occurs at least 4 times. Increasing this number therefore also increases the minimum number of required observations. This matters because below that limit sklearn throws an error and the PPS cannot be calculated
  • random_seed : int or None
    • Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling. If the value is set, the results will be reproducible. If the value is None a new random number is drawn at the start of each calculation.
  • invalid_score : any
    • The score that is returned when a calculation is not valid, e.g. because the data type was not supported.
  • catch_errors : bool
    • If True, all errors are caught and reported as unknown_error, which is convenient. If False, errors are raised, which is helpful for inspecting and debugging them.

Returns

  • Dict:
    • A dict that contains multiple fields about the resulting PPS. The dict enables introspection into the calculations that were performed under the hood (see the example below)
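
For illustration, here is a minimal sketch of calling score and reading a few fields of the returned dict. The fields ppscore, case and is_valid_score are referenced elsewhere in this README; the exact set of keys may vary between ppscore versions.

import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame({"x": np.random.uniform(-2, 2, 10_000)})
df["y"] = df["x"] ** 2 + np.random.uniform(-0.5, 0.5, 10_000)

score = pps.score(df, "x", "y")
# "ppscore" is the final score, "case" the chosen prediction case,
# "is_valid_score" indicates whether the score could be calculated
print(score["ppscore"], score["case"], score["is_valid_score"])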

ppscore.predictors(df, y, output="df", sorted=True, **kwargs)

Calculate the Predictive Power Score (PPS) for all columns in the dataframe against a target (y) column

Parameters

  • df : pandas.DataFrame
    • The dataframe that contains the data
  • y : str
    • Name of the column y which acts as the target
  • output : str - potential values: "df", "list"
    • Control the type of the output. Either return a df or a list with all the PPS score dicts
  • sorted : bool
    • Whether or not to sort the output dataframe/list by the ppscore
  • kwargs :
    • Other keyword arguments that are forwarded to the pps.score method, e.g. sample, cross_validation, random_seed, invalid_score, catch_errors

Returns

  • pandas.DataFrame or list of PPS dicts:
    • Either a df or a list of all the PPS dicts, controlled by the output argument (see the example below)
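
A small usage sketch, assuming the df from the getting-started example; any keyword arguments are simply forwarded to pps.score:

# top predictors of "y", using all rows and a fixed random seed
predictors_df = pps.predictors(df, y="y", sample=None, random_seed=42)
print(predictors_df.head())

# the same result as a list of PPS dicts instead of a DataFrame
predictors_list = pps.predictors(df, y="y", output="list")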

ppscore.matrix(df, output="df", sorted=False, **kwargs)

Calculate the Predictive Power Score (PPS) matrix for all columns in the dataframe

Parameters

  • df : pandas.DataFrame
    • The dataframe that contains the data
  • output : str - potential values: "df", "list"
    • Control the type of the output. Either return a df or a list with all the PPS score dicts
  • sorted : bool
    • Whether or not to sort the output dataframe/list by the ppscore
  • kwargs :
    • Other keyword arguments that are forwarded to the pps.score method, e.g. sample, cross_validation, random_seed, invalid_score, catch_errors

Returns

  • pandas.DataFrame or list of PPS dicts:
    • Either a df or a list of all the PPS dicts, controlled by the output argument

Calculation of the PPS

If you are uncertain about some details, feel free to jump into the code to have a look at the exact implementation

There are multiple ways to calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:

  • The score is calculated using only 1 feature trying to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
  • The score is calculated on the test sets of a 4-fold cross-validation (the number is adjustable via cross_validation). For classification, StratifiedKFold is used; for regression, a normal KFold. Please note that this sampling might not be valid for time series data sets
  • All rows which have a missing value in the feature or the target column are dropped
  • In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via sample. However, in most scenarios the results will be very similar
  • There is no grid search for optimal model parameters
  • The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible, set the random seed (random_seed); see the sketch after this list.
  • If the score cannot be calculated, the package will not raise an error but return an object where is_valid_score is False. The reported score will be invalid_score. We chose this behavior because we want to give you a quick overview where significant predictive power exists without you having to handle errors or edge cases. However, when you want to explicitly handle the errors, you can still do so.
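
Putting several of these points together, here is a minimal sketch of a fully reproducible score on the getting-started data; the parameter values are only illustrative, and the dict fields are the ones named elsewhere in this README:

result = pps.score(
    df, "x", "y",
    sample=None,          # use all rows instead of the default 5,000-row subset
    cross_validation=4,   # number of cross-validation folds
    random_seed=123,      # fixes sampling and shuffling, so the result is reproducible
)
if result["is_valid_score"]:
    print(result["ppscore"])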

Learning algorithm

As a learning algorithm, we currently use a Decision Tree because the Decision Tree has the following properties:

  • can detect any non-linear bivariate relationship
  • good predictive power in a wide variety of use cases
  • low requirements for feature preprocessing
  • robust model which can handle outliers and does not easily overfit
  • can be used for classification and regression
  • can be calculated quicker than many other algorithms

We differentiate the exact implementation based on the data type of the target column (a rough sketch follows this list):

  • If the target column is numeric, we use the sklearn.tree.DecisionTreeRegressor
  • If the target column is categoric, we use the sklearn.tree.DecisionTreeClassifier
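
The selection might look roughly like this; it is only an illustration of the dtype-based choice, not the package's actual internal code:

from pandas.api.types import is_numeric_dtype
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def choose_model(target_series):
    # numeric targets lead to regression, categoric targets to classification
    if is_numeric_dtype(target_series):
        return DecisionTreeRegressor()
    return DecisionTreeClassifier()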

Please note that we prefer generally good performance on a wide variety of use cases over better performance in some narrow use cases. If you have a proposal for a better or different learning algorithm, please open an issue.

However, please note why we actively decided against the following algorithms:

  • Correlation or Linear Regression: cannot detect non-linear bivariate relationships without extensive preprocessing
  • GAMs: might have problems with very unsmooth functions
  • SVM: potentially bad performance if the wrong kernel is selected
  • Random Forest/Gradient Boosted Tree: slower than a single Decision Tree
  • Neural Networks and Deep Learning: slower calculation than a Decision Tree and also needs more feature preprocessing

Data preprocessing

Even though the Decision Tree is a very flexible learning algorithm, we need to perform the following preprocessing steps if a column represents categoric values - that means it has the pandas dtype object, category, string or boolean (see the sketch after this list).

  • If the target column is categoric, we use the sklearn.preprocessing.LabelEncoder
  • If the feature column is categoric, we use the sklearn.preprocessing.OneHotEncoder
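
A self-contained sketch of what this preprocessing amounts to (the package's internal code may differ in detail):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

toy = pd.DataFrame({"feature": ["a", "b", "a"], "target": ["yes", "no", "yes"]})

# categoric target: encode the classes as integers
y_encoded = LabelEncoder().fit_transform(toy["target"])

# categoric feature: one-hot encode it before feeding it to the Decision Tree
X_encoded = OneHotEncoder().fit_transform(toy[["feature"]])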

Choosing the prediction case

This logic was updated in version 1.0.0.

The choice of the case (classification or regression) has an influence on the final PPS and thus it is important that the correct case is chosen. The case is chosen based on the data types of the columns. That means, for example, that if you want to change the case from regression to classification, you have to change the data type from float to string (see the example after this list).

Here are the two main cases:

  • A classification is chosen if the target has the dtype object, category, string or boolean
  • A regression is chosen if the target has the dtype float or int
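
For example, to force a classification for a numeric column, you could convert it to strings first. Here is a quick sketch on the getting-started data; in practice, binning the values is usually more sensible than casting raw floats:

# regression case: y is a float column
pps.score(df, "x", "y")

# classification case: the same information as a string column
df["y_as_string"] = df["y"].round(0).astype(str)
pps.score(df, "x", "y_as_string")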

Cases and their score metrics​

Each case uses a different evaluation score for calculating the final predictive power score (PPS).

Regression

In the case of a regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and never smaller than 0):

PPS = 1 - (MAE_model / MAE_naive)
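
As a plain-Python illustration of this normalization (including the clipping at 0 mentioned above):

def regression_pps(mae_model, mae_naive):
    # 1 means a perfect model (MAE of 0), 0 means the model is no better than
    # always predicting the median of the target column
    return max(0, 1 - mae_model / mae_naive)

print(regression_pps(mae_model=0.2, mae_naive=0.8))  # 0.75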

Classification

If the task is a classification, we compute the weighted F1 score (wF1) as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The weighted F1 takes into account the precision and recall of all classes weighted by their support as described here. As a baseline score (F1_naive), we calculate the weighted F1 score for a model that always predicts the most common class of the target column (F1_most_common) and a model that predicts random values (F1_random). F1_naive is set to the maximum of F1_most_common and F1_random. The PPS is the result of the following normalization (and never smaller than 0):

PPS = (F1_model - F1_naive) / (1 - F1_naive)
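
The analogous illustration for classification, with the naive baseline taken as the better of the two naive models described above:

def classification_pps(f1_model, f1_most_common, f1_random):
    f1_naive = max(f1_most_common, f1_random)
    # 1 means a perfect weighted F1, 0 means no improvement over the naive baseline
    return max(0, (f1_model - f1_naive) / (1 - f1_naive))

print(classification_pps(f1_model=0.9, f1_most_common=0.6, f1_random=0.4))  # 0.75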

Special cases

There are various cases in which the PPS can be defined without fitting a model to save computation time or in which the PPS cannot be calculated at all. Those cases are described below.

Valid scores

In the following cases, the PPS is defined but we can save ourselves the computation time:

  • feature_is_id means that the feature column is categoric (see above for classification) and that all categories appear only once. Such a feature can never predict a target during cross-validation and thus the PPS is 0.
  • target_is_id means that the target column is categoric (see above for classification) and that all categories appear only once. Thus, the PPS is 0 because an ID column cannot be predicted by any other column as part of a cross-validation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
  • target_is_constant means that the target column only has a single value and thus the PPS is 0 because any column and baseline can perfectly predict a column that only has a single value. Therefore, the feature does not add any predictive power and we want to communicate that.
  • predict_itself means that the feature and target columns are the same and thus the PPS is 1 because a column can always perfectly predict its own value. Also, this leads to the typical diagonal of 1 that we are used to from the correlation matrix.

Invalid scores and other errors

In the following cases, the PPS is not defined and the score is set to invalid_score:

  • target_is_datetime means that the target column has a datetime data type which is not supported. A possible solution might be to convert the target column to a string column.
  • target_data_type_not_supported means that the target column has a data type which is not supported. A possible solution might be to convert the target column to another data type.
  • empty_dataframe_after_dropping_na occurs when there are no valid rows left after rows with missing values have been dropped. A possible solution might be to replace the missing values with valid values.
  • Last but not least, unknown_error occurs for all other errors that might raise an exception. This case is only reported when catch_errors is True. If you want to inspect or debug the underlying error, please set catch_errors to False.

Citing ppscore


About

ppscore is developed by 8080 Labs - we create tools for Python Data Scientists. If you like ppscore, you might want to check out our other project bamboolib - a GUI for pandas DataFrames.

ppscore's People

Contributors

8080labs, florianwetschoreck, jdm288, suryathiru, tkrabel


ppscore's Issues

Quick Win: Performance Improvements

Since most of the sklearn predictors (especially Decision Trees) release the GIL, one could simply wrap the for x in features / for y in features loops into a single thread pool and collect the results at the end.
Also, I think the predict-itself "task" could return earlier, even before dropping NaN values, simply by additionally checking whether the columns are identical, or even by simply appending a 1 (by definition) instead of calling the score function directly.
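
A rough sketch of the thread-pool idea using only the public pps.score API; the helper below is hypothetical and not part of ppscore:

from concurrent.futures import ThreadPoolExecutor
from itertools import permutations

import ppscore as pps

def parallel_scores(df, max_workers=4):
    # all ordered feature pairs; the diagonal (x predicts x) is 1 by definition
    pairs = list(permutations(df.columns, 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: pps.score(df, pair[0], pair[1]), pairs))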

[Suggestion]: Plot the Decision Tree for pps.score

Would it be possible to create the ability to print the Decision Classifier Tree using pps.score?
There might be a way to do this directly from my end, but I am currently unsure of how to do this.
Thanks, Lauren

[Question] Factor analysis or its equivalent in PPScore

Question

At the moment this is just a query: is there anything equivalent to "Factor analysis" in PPScore, or does someone know of steps to come up with one given a dataset?

It's possible this is a missing feature but before I make such claims I wanted to inquire and start a discussion.

Additional context

Found these resources related to this topic/concept:

Failure to handle dtype=category

I am having a similar issue to #11. I recreated the sample dataset from their example but with 100 rows and made features 2 and 3 both categorical. When I run pps.matrix with features 2 and 3 as dtype=object, I get the expected outcome. However, if I convert those same features to dtype=category, the output of pps.matrix for those features is only 1's and 0's. Is this intentional behavior? I appreciate your help. Thank you.

This works:

# importing the libs
import ppscore as pps
import numpy as np
import pandas as pd

# creating a sample df
col1 = np.random.randn(100) * 10
col2 = np.random.choice(['cat1', 'cat2', 'cat3', 'cat4'], 100)
col3 = np.random.choice(['yes', 'no'], 100)

df = pd.DataFrame({
    'feature1': col1,
    'feature2': col2,
    'feature3': col3,
})

# calculating the pps matrix
pps.matrix(df)

This does not:

df['feature2'] = pd.Categorical(df.feature2)
df['feature3'] = pd.Categorical(df.feature3)

pps.matrix(df)

PPS is not useful for data with high noise to signal ratio

Note: "Issue" is not the correct tag for this as it is more of a comment, but anyone working with noisy data should consider it.

I'm interested in using PPS for feature selection. I work with data that has a high noise to signal ratio and the PPS score is consistently 0 despite changes to parameters such as sample size and number of cross-validation folds. One can easily reproduce this result by changing the "error" term in the example from the "Getting started" section to be uniform -5 to 5 instead of -0.5 to 0.5 and leaving the "x" value as uniform -2 to 2. Does anyone have a similar experience or any insight into the usefulness of PPS for this type of data?
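
The modification described above as a small snippet, assuming numpy, pandas and ppscore are imported as in the getting-started example:

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-5, 5, 1_000_000)  # much more noise than the original -0.5 to 0.5
df["y"] = df["x"] * df["x"] + df["error"]

print(pps.score(df, "x", "y")["ppscore"])  # typically close to 0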

I think a more constructive approach to identifying when a relationship exists between data would be to not RIP correlation. Despite some downsides, it still has its place. Instead, please consider the benefits of using multiple scores that measure relatedness between data when working on your data science projects.

While Pearson correlation measures linear relationships between data, there are several other correlation measures that do not make that assumption and should be examined before abandoning sound mathematical/statistical methods. Spearman correlation measures rank correlation (i.e., it does not assume a linear relationship) and Kendall's tau (and several variants of it such as Goodman and Kruskal's gamma) measures ordinal association between data with a non-parametric construction. It would be interesting to see how these other correlation measures stack up against PPS in the canonical "0 correlation" scenarios.
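
For reference, pandas already exposes these rank-based measures; for a numeric DataFrame such as the one from the getting-started example:

# rank-based correlation matrices that do not assume a linear relationship
spearman_matrix = df.corr(method="spearman")
kendall_matrix = df.corr(method="kendall")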

Also, on a slightly different topic: One inherent benefit of PPS that its authors have not taken advantage of is its ability to easily be extended to measure predictive power of combinations of features (i.e., interaction terms). The decision tree model used under the hood already supports multiple input features so why not allow "x" to be "x's"? Traditional correlation measures (that I mention above) require the user to assume the form of the interaction between features (is it x1 * x2? x1 / x2? etc.) in order to test the relationship while an extended PPS would not require that assumption. I think this is a very powerful use case for PPS despite its challenge of being visualized in a traditional correlation matrix format.

Sorry for the rant. I do not mean to disparage PPS and the work the authors have done. I think it does have its merits as a practical solution to real world data science problems that we are all facing in our work and with some improvements it can be even more useful.

Improving time using pyspark?

I was wondering if you explored the option of using pyspark to reduce the running time.
Since all the regression/classification models are independent of each other, it seems like a good candidate for parallelisation.
Would love to discuss more about this if you are interested.

Code Improvisation

I have improvised the Error Handling Code in ppscore. As the same error handling statements were being used repetitively, I have created a class called Check_Error using which it will be quite easy to call the necessary 'raise error' statements. I have tested it too and only 1 test failed but it also failed in the main repo.

As Hacktoberfest has started, I would like to contribute this code as a contribution to open source, so it would be great if you could label this issue and the pull request 'hacktoberfest'.

A pull request has been created on the master branch.

pps.matrix and pps.predictors are giving different score values for the same dataset against the same target feature

Hi All,

I found this useful for extracting non-linear relations. What I understood is that the relation estimated by ppscore between two variables or features is not impacted by the other available variables, whether I use pps.matrix() or pps.predictors(). Is my understanding correct?

I found it weird that pps.matrix tends to give different scores than pps.predictors() for the same features and the same target column.
Can you check what causes the difference, and whether it is valid?

Numerical columns treated as categorical

Hi guys,

I heard of PPS, through your article and was curious to test it. I have tried implementing it on some data I've been working on.

Unfortunately, I get numerous error messages when calculating the pps matrix:

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

My guess is pps is considering my data to be categorical and therefore trying to apply classification with a huge number of labels.

Looking at how pps determines if the data is numerical or categorical, I cannot find the reason it would consider my data categorical:

  • The dtypes are int or float
  • The number of unique values is higher than 15 (except for 1 column which is equal to 15, but changing the NUMERIC_AS_CATEGORIC_BREAKPOINT constant to 10 does not resolve the problem)

Also, if I try to force the pps score to be calculated using task = 'regression', I get the following error:

'DataFrame' object has no attribute 'dtype'

Here is my code :

import pandas as pd
import ppscore as pps

df = pd.read_csv('seattle_building_energy_benchmark.csv', sep = ';')

df.dtypes

df.nunique()

pps.NUMERIC_AS_CATEGORIC_BREAKPOINT = 10

for col in df.columns: 
    print(col)
    pps.score(df, x = 'YearBuilt', y = col, task = None)

for col in df.columns: 
    print(col)
    pps.score(df, x = 'YearBuilt', y = col, task = 'regression')

pps.matrix(df)

Is there something I am missing? If not, would you like me to share the data with you? (I do not know which sharing method is more convenient for you)

pps.score not executing

Hello,

I can get pps.matrix to work just fine but I'm having trouble running pps.score.

I get "TypeError: unhashable type: 'numpy.ndarray'"

This is the code I've used

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import ppscore as pps

dataset = pd.read_csv('D:/location/test.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

pps_matrix_results = pps.matrix(dataset)
pps_score_results = pps.score(dataset, x,y)

What am I doing wrong?

There should be an option to override the attribute type like PyCaret

Numerical attributes do not always call for regression; they could be categorical attributes for which a classification model should be used.
PyCaret offers such features, where we can list numerical and categorical features. Something of that kind would greatly increase the usability of this library.

Time-series validation workflow

Hello folks,

I really like the idea of your package and the approach. I was just curious how difficult it might be to introduce a custom CV validation (or even just a ts-meaningful) validation.

I could probably assist in that with a bit of guidance from you :)

Thanks,

Anton.

Test target variable against all others

Hello!

I wonder if there is a possibility to test a single target variable against all others and further print the coefficient in descending order?

pps.score(df, "x", "y")

What do I put instead of "x" in that case?
Sorry, I'm rather new to Python.

Does sampling 5000 rows from a dataset lead to consistent ppscore matrices?

In the readme it is mentioned,

In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (ppscore.RANDOM_SEED). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar.

What datasets were tested on to make this claim?
It seems highly unlikely that sampling 5000 rows from a dataset with millions of rows would lead to consistent ppscore matrices.
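
One way to probe this empirically is to compute the matrix on a few independent 5,000-row samples and compare the resulting scores, e.g. with a rough sketch like this:

# compare PPS matrices computed on different random 5,000-row subsets
matrices = [
    pps.matrix(df, sample=5_000, random_seed=seed)[["x", "y", "ppscore"]]
    for seed in (1, 2, 3)
]
# large differences between the ppscore columns would indicate unstable sampling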

PPScore drops rows with missing values: this can be improved.

Can missing values be treated as a separate category? I sometimes see errors like "after dropping missing no valid rows are there".

I see in the doc
"All rows which have a missing value in the feature or the target column are dropped"

this is not desirable as missingness may be a predictive factor which lightgbm and xgboost can handle.

feat: Return result for all possible combinations of columns

Feature Request 🎉

Description

To find the score, two columns (namely the params x and y in ppscore.score()) need to be supplied in call to score().
It would be nice to have a functionality where the user can get an array of results corresponding to all columns in a single line of code.

For instance, if comparisons between columns 'A', 'B', 'C', and 'D' are to be made, it should be possible to do so with a single line of code. Results could be obtained as a list of comparisons between A-B, A-C, A-D, B-C, B-D, and C-D.

Possible Solution

This can be achieved in two ways -

  1. create another function which iterates over all possible combinations and yields/returns a list
  2. add logic to existing implementation of score for the same.

Regarding test cases, it'd be easier to write a new one for a new function (as per 1), rather than modifying the existing implementation.

Remarks

@8080labs I'd like to work on it.

Detection of linear patterns and decoupling of concerns

When the PPS is applied toward linear relationships with the same error but different slopes, the score varies a lot e.g. from 0.1 to 0.7 depending on the slope.

This might not be the behaviour that we expect intuitively and normalizing the target does not help.
The reason for this is that the ppscore calculates the ratio of the variance of the predictor to the variance of the baseline. If the slope is steep, the ratio is higher because the baseline makes more errors. If the slope is flat, the variances are nearly the same.

The underlying problem is that the current metric and calculation of the ppscore couples two questions:

  1. Is there a valid pattern? e.g. statistical significance or predictive power after cross-validation
  2. Is the variance of the pattern low? (compared to baseline variance)

If either of those two criteria is weak or not met, the ppscore will be low, too.
Only if both are true, the ppscore will be high.

The problem with the linear cases is that the pattern is valid BUT the variance of the pattern is not low because there is a lot of noise - even if the pattern is statistically significant. (High error to signal ratio)
For this scenario (and maybe also for others), we might want to find a calculation that decouples those two concerns

Some rough code:

import pandas as pd
import numpy as np
import seaborn as sns
import bamboolib as bam

import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
df["0.3_linear_x"] = 0.3*df["x"]+df["error"] #0.11 pps
df["0.5_linear_x"] = 0.5*df["x"]+df["error"] #0.4 pps
df["1_linear_x"] = 1*df["x"]+df["error"] # 0.68 pps

# normalized linear to [0,1] via +2 and /4
df["1_linear_x_norm"] = (df["1_linear_x"] + 2)/4 #0.68 pps, too

Readme / docs unclear about using ppscore on time series data

I would just love to use this on time series data, and out of the box it seems to do pretty well, but I don't know if I'm interpreting the score right. However, I read over the readme at this link: https://github.com/8080labs/ppscore#calculation-of-the-pps

'''
The score is calculated on the test sets of a 4-fold cross-validation (number is adjustable via cross_validation). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
'''

I don't understand, should I set sample=None for time series or should I modify the cross_validation kwarg for timeseries data?

Getting different results from the ones in article

Hi, first of all, thank you for a great tool!

I tried reproducing the experiment from jupyter notebooks, but got different results, both for quadratic function and titanic dataset.

For the quadratic function, I am getting 0.67 instead of the value of 0.88 mentioned in the article. Although that discrepancy might have been caused by the randomness of data, the discrepancy I get with Titanic dataset is bigger:
in the article, you mention the PPS score between TicketId and TicketPrice of 0.67, whereas reproducing your notebooks, I am getting a score of 0.27.

You can see the steps to reproduce the discrepancy in this notebook, please skip to line 26.

I have python version 3.6.8, sklearn 0.22 and ppscore version 0.0.2, you can see them in the mentioned above notebook.

Can be much faster

Nice idea! From a blog post I read it's not very fast.
If you use randomized trees, you'll make it very much faster, doing hundreds of columns and many thousands of rows in a few seconds. Moreover, because you are fitting a single column, the randomized trees don't have any drawback compared to very advanced trees.
Can think more about it, if you need help.

How to understand the baseline_score?

Hi,

I'm having trouble understanding the baseline score. From what I can see in the code, the baseline is the score when we are being naive and choosing the median of y as our predicted value.

That would mean that if the baseline_score is higher than the model_score, then using the median as the predicted value would be better than the prediction from the decision tree. Is this correct?

Btw. Thanks for releasing

Data preprocessing and information leakage

Hello, before anything else, thanks for the package; it is very useful and the overall approach is innovative and generates a lot of efficiency. I have a comment regarding the "state" of the data to run the pps analysis on: it seems (I may be mistaken) that any transformation of the data (standardization, for example) will lead to large data leakage into the K-fold cross-validations. Is that correct? The module could use sklearn's pipelines and standard transforms to possibly increase the information generated; would this be of value to the module?

Handling of categorical columns

As it is written in your code, if the column has fewer than 15 unique values, it is considered categorical during fitting.

But, I am facing the following warning during the execution:

Reproducible sample:

# importing the libs
import ppscore as pps
import numpy as np
import pandas as pd

# creating a toy df
col1 = np.random.randn(10) * 10          # feature column 1
col2 = np.random.randn(10) * 10          # feature column 2
col3 = np.random.choice(range(4), 10)    # feature column 3 - at most 4 unique values

df = pd.DataFrame({
    'feature1': col1,
    'feature2': col2,
    'feature3': col3,
})

# calculating the pps matrix
pps.matrix(df)

Warning that I observed:

anaconda3\envs\DL_py37\lib\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: Unknown label type: 'continuous'

It is not a new issue, and an example of a solution (found on Kaggle) is to temporarily convert the data type of the feature3 column to categorical with LabelEncoder.

Thought on a possible enhancement of the PPS

Currently, the PPS score is already very useful and I regularly use it for feature selection and general insights whenever I encounter a new data set. Recently I had an idea to maybe increase the capabilities of the metric.

As mentioned in the article RIP correlation. Introducing the Predictive Power Score, When using the PPS one should keep in mind that it only captures direct relations, and not combinations of input features.

To address this weakness, would it be an idea to give the underlying decision tree 2 variables instead of one? This will take a significantly longer time, but it gives combinations of variables a chance and might also be able to give additional information about the input features.

For example, if I have target variable 'y' and input features 'x1, x2, x3, x4'. I apply the pps and find the scores 0.4, 0, 0.4 and 0.6 respectively. Now, as a follow up I try all combinations and discover the following:

  • If I combine x1 and x2, I get a predictive power of 0.5, I now know that this combination increases the PPS by 0.5 - 0.4 = 0.1
  • If I combine x1 and x4, I get a predictive power score of 0.6. The increase is now 0.6 - 0.6 = 0. Implying that even though x1 has a pps of 0.4, I might as well use x4 and drop x1.

This requires a slightly different implementation of the algorithm, and before committing to developing the implementation I was wondering if this train of thought makes any sense. Opinions on such an additional feature?
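
Outside of the current ppscore API, the idea can already be prototyped directly with sklearn; below is a rough sketch for the regression case that mirrors the MAE normalization described in the README above (the helper function and its name are hypothetical):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def multi_feature_pps(df, features, target, cv=4):
    # MAE of a Decision Tree that sees several features at once
    mae_model = -cross_val_score(
        DecisionTreeRegressor(), df[features], df[target],
        cv=cv, scoring="neg_mean_absolute_error",
    ).mean()
    # naive baseline: always predict the median of the target
    mae_naive = (df[target] - df[target].median()).abs().mean()
    return max(0, 1 - mae_model / mae_naive)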

Numpy arrays and Unknown label type: 'continuous'

Hi,
I'm quickly experimenting by implementing ppscore in my pipeline for the assessment of functional connectivity between brain regions, and I noticed two things:
1/ I think we should be able to use pps.matrix() even on a 2D numpy array when we don't have explicit column names: as of now, it is raising the error AttributeError: 'numpy.ndarray' object has no attribute 'columns'
2/ I got a strange error telling me that "continuous" is an unknown label. File "/home/clementpoiret/anaconda3/envs/nilearn/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets raise ValueError("Unknown label type: %r" % y_type) ValueError: Unknown label type: 'continuous'
Code to reproduce the error:

import numpy as np
import pandas as pd
import ppscore as pps

X = pd.DataFrame(np.random.randn(10, 10))
pps.matrix(X)

The error is solved by passing task='regression'. I have sklearn 0.23.0
Maybe an additional comment: maybe that the diagonal of the resulting matrix should be 1, because it makes sense that the predictive power of a vector on itself is 1, no?

warning message

It shows this warning (the CSV file is attached as for_test.txt):

C:\Users....\sklearn\model_selection\_split.py:667: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=4.

ppscore interpretation

Hi @FlorianWetschoreck, @tkrabel , @SuryaThiru

Quick question: how to properly interpret ppscore?

Say you have a dataset, 3000 rows x 30 columns; you then apply pps.matrix(), then sort values by ppscore. Is there a "rule of thumb" or rational guideline to categorize ppscore levels?
Like the following:

  • If ppscore is in range 0.6 - 1.0, means strong (so X feature has a strong predictive power on Y)
  • If ppscore is in range 0.4 - 0.6, means moderate (so X feature has a moderate predictive power on Y)
  • If ppscore is lower than 0.4, means weak (so X feature has a weak predictive power on Y)
    Note: the ranges and categories I gave are totally arbitrary

I read this article - https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598 - but couldn't find an answer there

Thanks a million, Fernando

Adding the ability to use different evaluation metrics

Hello,

First of all, PPScore is so good, great job! Running only the PPS score to reduce the number of features gives better results than using other feature elimination techniques, and running it first before other feature elimination techniques ALWAYS yields better results than running pretty much any other feature elimination technique alone.

I just wanted to ask how to change from the F1 score to ROC, specifically Precision-Recall Curves (as I have moderately imbalanced classes), for my binary classification problem.

Thank you for your help in the matter.

Regards,

Achilleas

Comparing RF feature importance to PPS

Has anyone compared RF feature importance to PPS?

My one concern with PPS is that it is a univariate calculation. Even in the matrix, the numbers correspond to a single feature's predictive power on the target.

My inkling is that feature importance can cut across multiple variables when determining which variables provide the most power in that sense.

Just curious. Thanks!

ENH: maybe use AUPRC as classification score

Vinicius reached out and proposed to maybe use another evaluation metric for classification problems:

Have you used the average_precision metric (average_precision = AUPRC = area under precision-recall curve)? I think that it's a better metric for imbalanced class problems.
Read the post: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
On multiclass problems, it's possible to use the same options as F1_Score: macro, micro, weighted, etc.

This might be a viable alternative to the F1 Score.

Next steps (can be taken by anyone)

  • Create experiment notebook which clearly shows the differences/advantages of the AUPRC in certain situations (what is the benefit? how does it change the PPS? What is a suitable naive baseline of the AUPRC?)

  • Decide about the potential usage of AUPRC

  • Decide about possible integration into ppscore (change the default? add as an option? ...?)

ppscore changes to 0 for multiple variables after upgrade

Hi,

The ppscore is very good, I have tried it out previously and yielded good results.

I was using the 0.0.2 version and very recently (about 2 weeks ago) I tried it out on a dataset which gave a score to multiple variables. This week I had to upgrade because I received an AttributeError: module 'ppscore' has no attribute 'predictors'

After the upgrade I only had 2 variables which yielded ppscores, with all the other variables going to value 0.

Nothing else in the code was changed, so I assume this was due to the upgrade. I suspect the 0 value is not correct because I have visually confirmed some distributions and applied statistical tests to some variables (KS and chi-square) and they seemed to be somewhat predictive (plus they previously had scores).

Any clue on what might have happened? Many thanks

[SUGGEST] Release a version that supports GPU

Thank you for creating a powerful metric to calculate the score for choosing features.

I hope that you can release a version which supports calculating this score on the GPU. It would reduce computation time.

Thank you.

Comparing Negative result of correlation to ppscore.

I have a simple question regarding ppscore. When I calculated the correlation between two datasets (columns), the result was -0.248, which means that when one increases the other decreases. But when I calculated the ppscore of the same columns, the result was 0.37 from x to y and 0 from y to x. It clearly indicates that x can predict y with a ppscore of 0.37 and y cannot predict x.

But what I actually want to know is the relation between the 2 datasets: whether it is directly proportional (positive) or inversely proportional (negative).

Thank you,

Scikit-learn dependency < 1.0.0

Scikit-Learn recently released their first stable version.
There aren't many breaking changes, so I think it would be nice if this could be reflected in the dependencies of PPScore.

Would be happy to help with this as well.

Problems installing ppscore via reticulate

I want to use ppscore via R by means of reticulate (which I know from Keras in R; see link for details) and do the following in a fresh R session:

library(reticulate)
conda_list()

This shows the Python environments actually available on my computer:

1 Miniconda3 C:\Miniconda3\python.exe
2 r-reticulate C:\Miniconda3\envs\r-reticulate\python.exe

I choose use_condaenv("r-reticulate"), then py_install("ppscore", pip = TRUE), which produces:

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
All requested packages already installed.
Collecting ppscore
Using cached ppscore-0.0.2.tar.gz (38 kB)
Building wheels for collected packages: ppscore
Building wheel for ppscore (setup.py): started
Building wheel for ppscore (setup.py): finished with status 'done'
Created wheel for ppscore: filename=ppscore-0.0.2-py2.py3-none-any.whl size=9634 sha256=6d04a943bc87ef27f697de2cedef048966e2a56f4f41667949d37f3f4fcebc2c
Stored in directory: c:\users\lf\appdata\local\pip\cache\wheels\fd\39\a8\130eda2ee307e849923caf5b555b0d113ec7f7e8c7de731f9f
Successfully built ppscore
Installing collected packages: ppscore
Successfully installed ppscore-0.0.2

Now, I want to import ppscore, ppscore <- import('ppscore'), which produces

Error in py_module_import(module, convert = convert) :
ModuleNotFoundError: No module named 'sklearn'

Ok - so I do py_install(sklearn), which gives

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
All requested packages already installed.
Error in conda_install(envname, packages = packages, conda = conda, python_version = python_version, :
Object 'sklearn' not found

This is somewhat surprising since when I do a pip list under Anaconda prompt I get

[screenshot of the pip list output]

which clearly shows that package sklearn is present. And now I am at a loss what I could further do to successfully import and use (!) ppscore. I have to add that I am a complete newcomer to Python, so I am not at all familiar with Python environments.

I really would appreciate any useful hints - thanks in advance,
Leo

Is there a paper available?

Hi there,

This appears to be quite a valuable metric and I am curious, have you published a pre-print, or paper on the details surrounding the experiments and calculations?

Thanks.
