reval_clustering's Introduction

reval: stability-based relative clustering validation method to determine the best number of clusters

Determining the number of clusters that best partitions a dataset can be a challenging task because of 1) the lack of a priori information within an unsupervised learning framework; and 2) the absence of a unique clustering validation approach to evaluate clustering solutions. Here we present reval: a Python package that leverages stability-based relative clustering validation methods to determine best clustering solutions, as described in [1].

Statistical software, in both R and Python, usually computes internal validation metrics that can be leveraged to select the number of clusters that best fits the data, whereas open-source software solutions that easily implement relative clustering techniques are lacking. The advantage of a relative approach over internal validation lies in the fact that internal metrics exploit characteristics of the data itself to produce a result, whereas relative validation converts an unsupervised clustering problem into a supervised classification problem, hence enabling generalizability and replicability of the results.
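The idea of turning clustering into a classification problem can be illustrated with a minimal sketch. It is not reval's API or implementation: it assumes scikit-learn's KMeans and KNeighborsClassifier, synthetic Gaussian blobs, and two hypothetical helpers (align_labels, misclassification_error). It also omits the normalization against random labelings described in [1], so the values are only a rough indication of stability.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def align_labels(reference, labels, n_clusters):
    """Relabel `labels` so that they agree with `reference` as much as possible."""
    # Contingency table: rows index values in `labels`, columns values in `reference`.
    counts = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    for lab, ref in zip(labels, reference):
        counts[lab, ref] += 1
    # Kuhn-Munkres (Hungarian) assignment; negate to maximize matched counts.
    rows, cols = linear_sum_assignment(-counts)
    mapping = dict(zip(rows, cols))
    return np.array([mapping[lab] for lab in labels])


def misclassification_error(X_train, X_test, n_clusters, seed=0):
    """Hypothetical helper: disagreement between classifier and test clustering."""
    # 1) Cluster the training set and treat its cluster labels as classes.
    train_labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X_train)
    classifier = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)
    # 2) Cluster the test set independently, then predict its labels
    #    with the classifier trained on the training partition.
    test_clusters = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X_test)
    test_predicted = classifier.predict(X_test)
    # 3) Cluster labels are arbitrary, so find the best permutation first.
    aligned = align_labels(test_predicted, test_clusters, n_clusters)
    return float(np.mean(aligned != test_predicted))


# Three well-separated synthetic blobs (assumed data, for illustration only).
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)
errors = {k: misclassification_error(X_train, X_test, k) for k in range(2, 6)}
for k, err in errors.items():
    print(f"k={k}: misclassification error = {err:.3f}")
```

A low error for a given number of clusters suggests that the partition learned on one half of the data carries over to the other half, which is the intuition behind preferring that solution.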

Requirements

python>=3.6

Installing

From GitHub:

git clone https://github.com/IIT-LAND/reval_clustering
cd reval_clustering
pip install -r requirements.txt

PyPI alternative (latest version v0.1.0):

pip install reval

Documentation

Code documentation can be found here. The documentation includes Python code descriptions, reval usage examples, performance on benchmark datasets, and common issues related to a dataset's number of features and samples.

Manuscript

The functionalities of the reval package are presented in our recent work that, as of now, can be found as a preprint. The experiments presented in the manuscript are in the Python file ./working_examples/manuscript_examples.py of the GitHub repository. For reproducibility, all experiments were run with reval v0.1.0.

References

[1] Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16(6), 1299-1323.

Cite as

Landi, I., Mandelli, V., & Lombardo, M. V. (2021). reval: A Python package to determine best clustering solutions with stability-based relative clustering validation. Patterns, 2(4), 100228.

BibTeX alternative

@article{LANDI2021100228,
title = {reval: A Python package to determine best clustering solutions 
         with stability-based relative clustering validation},
journal = {Patterns},
volume = {2},
number = {4},
pages = {100228},
year = {2021},
issn = {2666-3899},
doi = {https://doi.org/10.1016/j.patter.2021.100228},
url = {https://www.sciencedirect.com/science/article/pii/S2666389921000428},
author = {Isotta Landi and Veronica Mandelli and Michael V. Lombardo},
keywords = {stability-based relative validation, 
            clustering, 
            unsupervised learning, 
            clustering replicability}
}


reval_clustering's Issues

Fix problems with running AgglomerativeClustering

Dear all,
I tried to run AgglomerativeClustering in reval on a toy dataset and on Gaussian blobs, but the clustering algorithm failed to find any cluster. To debug this, I re-ran AgglomerativeClustering starting from the fit method and worked backwards through the FindBestClustCV class. Since FindBestClustCV is a child class of RelativeValidation, I focused on the RelativeValidation class and found that the problem lies in the call to the Kuhn-Munkres algorithm in the test and rescale_score methods. The Kuhn-Munkres algorithm function in the utils module requires array-like structures of type int32 or int as input, whereas the clustering labels I obtained from the train method of the RelativeValidation class were int64. Consequently, I modified line 84 in the relative_validation module (test method) as follows, to convert any numpy array into an int32 numpy array:

bestperm = kuhn_munkres_algorithm(np.int32(classlab_ts), np.int32(clustlab_ts))

Accordingly, I also adjusted line 130 (rescale_score_ method) in the same way:

me_ts = zero_one_loss(pred_lab, kuhn_munkres_algorithm(np.int32(pred_lab), np.int32(labts)))

I suggest modifying these lines so that the code no longer fails because of the integer type of the clustering and classifier label arrays.
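As a small illustration of the cast suggested above, the following sketch normalizes label arrays to int32 before they are passed on. The helper name as_int32_labels is hypothetical and not part of reval; the sketch assumes only NumPy and adds checks so that non-integer or out-of-range labels are rejected instead of being silently converted.

```python
import numpy as np


def as_int32_labels(labels):
    """Hypothetical helper: return labels as a 1-D int32 array, refusing lossy casts."""
    arr = np.asarray(labels)
    if arr.ndim != 1:
        raise ValueError("labels must be one-dimensional")
    if not np.issubdtype(arr.dtype, np.integer):
        raise TypeError(f"expected integer labels, got dtype {arr.dtype}")
    # Guard against values that would overflow when narrowed to int32.
    info = np.iinfo(np.int32)
    if arr.size and (arr.min() < info.min or arr.max() > info.max):
        raise OverflowError("label values do not fit in int32")
    # copy=False avoids a copy when the array is already int32.
    return arr.astype(np.int32, copy=False)


# Example: int64 labels, as reported in this issue for the train method output.
labels64 = np.array([0, 2, 1, 1, 0], dtype=np.int64)
labels32 = as_int32_labels(labels64)
print(labels64.dtype, "->", labels32.dtype)
```

Applying such a conversion in one place, just before the label-matching step, would avoid repeating np.int32(...) at every call site.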
