The clust-learn from danielw2904

Clust-learn

A Python package for extracting information from large and high-dimensional mixed-type data through explainable cluster analysis.

Introduction
Overall architecture
Implementation
Installation
Version and license information
Bug reports and future work
User guide & API
Citing

1. Introduction

clust-learn enables users to run end-to-end explainable cluster analysis to extract information from large and high-dimensional mixed-type data, and it does so by providing a framework that guides the user through data preprocessing, dimensionality reduction, clustering, and classification of the obtained clusters. It is designed to require very few lines of code, and with a strong focus on explainability.

2. Overall architecture

clust-learn is organized into four modules, one for each component of the methodological framework presented here:

Figue 1 shows the package layout with the functionalities covered by each module along with the techniques used, the explainability strategies available, and the main functions and class methods encapsulating these techniques and explainability strategies.

3. Implementation

The package is implemented with Python 3.9 using open source libraries. It relies heavily on pandas and scikit-learn. Read the complete list of requirements here.

It can be installed manually or from pip/PyPI (see Section 4. Installation).

4. Installation

The package is on PyPI. Simply run:

pip install clust-learn

5. Version and license information

Version: 0.1.2
Author: Miguel Alvarez-Garcia ([email protected])
License: GPLv3

6. Bug reports and future work

Please report bugs and feature requests through creating a new issue here.

7. User guide & API

clust-learn is organized into four modules:

Data preprocessing
Dimensionality reduction
Clustering
Classifier

The four modules are designed to be used sequentially to ensure robust and explainable results. However, each of them is independent and can be used separately to suit different use cases.

7.i. Data preprocessing

Data preprocessing consists of a set of manipulation and transformation tasks performed on the raw data before it is used for its analysis. Although data quality is essential for obtaining robust and reliable results, real-world data is often incomplete, noisy, or inconsistent. Therefore, data preprocessing is a crucial step in any analytical study.

7.i.a. Data imputation

compute_missing()

compute_missing(df, normalize=True)

Calculates the pct/count of missing values per column.

Parameters

df : pandas.DataFrame
normalize : boolean, default=True

Returns

missing_df : pandas.DataFrame
- DataFrame with the pct/counts of missing values per column.

missing_values_heatmap()

missing_values_heatmap(df, output_path=None, savefig_kws=None)

Plots a heatmap to visualize missing values (light color).

Parameters

df : pandas.DataFrame
- DataFrame containing the data.
output_path : str, default=None
- Path to save figure as image.
savefig_kws : dict, default=None
- Save figure options.

impute_missing_values()

impute_missing_values(df, num_vars, cat_vars, num_pair_kws=None, mixed_pair_kws=None, cat_pair_kws=None, graph_thres=0.05, k=8, max_missing_thres=0.33)

This function imputes missing values following this steps:

One-to-one model based imputation for strongly related variables.
Cluster based hot deck imputation where clusters are obtained as the connected components of an undirected graph G=(V,E), where V is the set of variables and E the pairs of variables with mutual information above a predefined threshold.
Records with a proportion of missing values above a predefined threshold are discarded to ensure the quality of the hot deck imputation.
Hot deck imputation for the remaining missing values considering all variables together.

Parameters

df : pandas.DataFrame
- Data frame containing the data with potential missing values.
num_vars : str, list, pandas.Series, or numpy.array
- Numerical variable name(s).
cat_vars : str, list, pandas.Series, or numpy.array
- Categorical variable name(s).
{num,mixed,cat}_pair_kws : dict, default=None
- Additional keyword arguments to pass to compute imputation pairs for one-to-one model based imputation, namely:
  - For numerical pairs, corr_thres and method for setting the correlation coefficient threshold and method. By default, corr_thres=0.7 and method='pearson'.
  - For mixed-type pairs, np2_thres for setting the a threshold on partial eta square with 0.14 as default value.
  - For categorical pairs, mi_thres for setting a threshold on mutual information score. By default, mi_thres=0.6.
graph_thres : float, default=0.05
- Threshold to determine if two variables are similar based on mutual information score, and therefore are an edge of the graph from which variable clusters are derived.
k : int, default=8
- Number of neighbors to consider in hot deck imputation.
max_missing_thres: float, default=0.33
- Maximum proportion of missing values per observation allowed before final general hot deck imputation - see step 3 of the missing value imputation methodology in section 2.1.

Returns

final_pairs : pandas.DataFrame
- DataFrame with pairs of highly correlated variables (var1: variable with values to impute; var2: variable to be used as independent variable for model-based imputation), together proportion of missing values of variables var1 and var2.

plot_imputation_distribution_assessment()

plot_imputation_distribution_assessment(df_prior, df_posterior, imputed_vars, sample_frac=1.0, prior_kws=None, posterior_kws=None, output_path=None, savefig_kws=None)

Plots a distribution comparison of each variable with imputed variables, before and after imputation.

Parameters

df_prior : pandas.DataFrame
- DataFrame containing the data before imputation.
df_posterior : pandas.DataFrame
- DataFrame containing the data after imputation.
imputed_vars : list
- List of variables with imputed variables.
sample_frac : float, default=1.0
- If < 1 a random sample of every pair of variables will be plotted.
{prior,posterior}_kws : dict, default=None
- Additional keyword arguments to pass to the kdeplot.
output_path : str, default=None
- Path to save figure as image.
savefig_kws : dict, default=None
- Save figure options.

7.i.b. Outliers

remove_outliers()

remove_outliers(df, variables, iforest_kws=None)

Removes outliers using the Isolation Forest algorithm.

Parameters

df : pandas.DataFrame
- DataFrame containing the data.
variables : list
- Variables with potential outliers.
iforest_kws : dict, default=None
- IsolationForest algorithm hyperparameters.

Returns

df_inliers : pandas.DataFrame
- DataFrame with inliers (i.e. observations that are not outliers).
df_outliers : pandas.DataFrame
- DataFrame with outliers.

7.ii. Dimensionality reduction

All the functionality of this module is encapsulated in the DimensionalityReduction class so that the original data, the instances of the models used, and any other relevant information is self-maintained and always accessible.

DimensionalityReduction class

dr = DimensionalityReduction(df, num_vars=None, cat_vars=None, num_algorithm='pca', cat_algorithm='mca', num_kwargs=None, cat_kwargs=None)

Parameter	Type	Description
`df`	`pandas.DataFrame`	Data table containing the data with the original variables
`num_vars`	`string`, `list`, `pandas.Series`, or `numpy.array`	Numerical variable name(s)
`cat_vars`	`string`, `list`, `pandas.Series`, or `numpy.array`	Categorical variable name(s)
`num_algorithm`	`string`	Algorithm to be used for dimensionality reduction of numerical variables. By default, PCA is used. The current version also supports SPCA
`cat_algorithm`	`string`	Algorithm to be used for dimensionality reduction of categorical variables. By default, MCA is used. The current version doesn’t support other algorithms
`num_kwargs`	`dictionary`	Additional keyword arguments to pass to the model used for numerical variables
`cat_kwargs`	`dictionary`	Additional keyword arguments to pass to the model used for categorical variables
Attribute	Type	Description
`n_components_`	`int`	Final number of extracted components
`min_explained_variance_ratio_`	`float`	Minimum explained variance ratio. By default, 0.5
`num_trans_`	`pandas.DataFrame`	Extracted components from numerical variables
`cat_trans_`	`pandas.DataFrame`	Extracted components from categorical variables
`num_components_`	`list`	List of names assigned to the extracted components from numerical variables
`cat_components_`	`list`	List of names assigned to the extracted components from categorical variables
`pca_`	`sklearn.decomposition.PCA`	PCA instance used to speed up some computations and for comparison purposes

Parameter	Type	Description
`df`	`pandas.DataFrame`	Data frame containing the data to be clustered
`algorithms`	`string` or `list`	Algorithms to be used for clustering. The current version supports k-means and agglomerative clustering
`normalize`	`bool`	Whether to apply data normalization for fair comparisons between variables. In case dimensionality reduction is applied beforehand, normalization should not be applied
Attribute	Type	Description
`dimensions_`	`list`	List of columns of they input data frame
`instances_`	`dict`	Pairs of algorithm name and its instance
`metric_`	`string`	The cluster validation metric used. Four metrics available: ['inertia', 'davies_bouldin_score', 'silhouette_score', 'calinski_harabasz_score']
`optimal_config_`	`tuple`	Tuple with the optimal configuration for clustering containing the algorithm name, number of clusters, and value of the chosen validation metric
`scores_`	`dict`	Pairs of algorithm name and a list of values of the chosen validation metric for a cluster range

Parameter	Type	Description
`df`	`pandas.DataFrame`	Data frame containing the data
`predictor_cols`	`list` of `string`	List of columns to use as predictors
`target`	`numpy.array` or `list`	Values of the target variable
`num_cols`	`list`	List of numerical columns from predictor_cols
`cat_cols`	`list`	List of categorical columns from predictor_cols
Attribute	Type	Description
`filtered_features_`	`list`	List of columns of the input data frame
`model_`	Instance of `TransformerMixin` and `BaseEstimator` from `sklearn.base`	Trained classifier
`X_train_`	`numpy.array`	Train split of predictors
`X_test_`	`numpy.array`	Test split of predictors
`y_train_`	`numpy.array`	Train split of target
`y_test_`	`numpy.array`	Test split of target
`grid_result_`	`sklearn.model_selection.GridSearchCV`	Instance of fitted estimator for hyperparameter tuning

danielw2904 / clust-learn Goto Github PK

clust-learn's Introduction

Clust-learn

Table of contents

1. Introduction

2. Overall architecture

3. Implementation

4. Installation

5. Version and license information

6. Bug reports and future work

7. User guide & API

7.i. Data preprocessing

7.i.a. Data imputation

compute_missing()

missing_values_heatmap()

impute_missing_values()

plot_imputation_distribution_assessment()

7.i.b. Outliers

remove_outliers()

7.ii. Dimensionality reduction

DimensionalityReduction class

Methods

transform()

num_main_contributors(()

cat_main_contributors(()

cat_main_contributors_stats()

plot_num_explained_variance()

plot_cat_explained_variance()

plot_num_main_contributors()

plot_cat_main_contributor_distribution()

7.iii. Clustering

Clustering class

Methods

compute_clusters()

describe_clusters()

describe_clusters_cat()

compare_cluster_means_to_global_means()

anova_tests()

chi2_test()

plot_score_comparison()

plot_optimal_components_normalized()

plot_clustercount()

plot_cluster_means_to_global_means_comparison()

plot_distribution_comparison_by_cluster()

plot_clusters_2D()

plot_cat_distribution_by_cluster()

7.iv. Classifier

Classifier class

Methods

train_model()

hyperparameter_tuning_metrics()

confusion_matrix()

classification_report()

plot_shap_importances()

plot_shap_importances_beeswarm()

plot_confusion_matrix()

plot_roc_curves()

8. Citing

clust-learn's People

Contributors

Recommend Projects

Recommend Topics

Recommend Org