This project evaluates the performance of different clustering algorithms on multiple datasets. The algorithms compared are K-means, DBSCAN, and HDBSCAN. The datasets used are Breast Cancer, Wine Quality, Abalone, and Ecoli. Clustering quality is measured with the Silhouette Score, Rand Score, and Dunn Index.
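Of the three metrics, the Dunn Index is the one most often implemented by hand. The following is a minimal pure-Python sketch of one common definition (smallest distance between points in different clusters, divided by the largest cluster diameter). The function name `dunn_index` and the toy data are illustrative assumptions; the notebook may compute a different variant.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Dunn Index: min inter-cluster distance / max intra-cluster diameter.

    Hypothetical helper for illustration; higher values mean compact,
    well-separated clusters.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn Index needs at least two clusters")

    # Largest distance between two points of the same cluster (diameter).
    max_diameter = max(
        (
            dist(a, b)
            for members in clusters.values()
            for a, b in combinations(members, 2)
        ),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Toy data: two pairs of points, 5 units apart (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))  # 5.0
```

DBSCAN and HDBSCAN mark noise points with the label -1. This sketch would treat them as one more cluster, so it is usually sensible to filter them out before calling it.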
The datasets used in this project are:
- Breast Cancer Dataset: Contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
- Wine Quality Dataset: Contains various chemical tests of wines and their quality ratings.
- Abalone Dataset: Contains physical measurements of abalones and their age.
- Ecoli Dataset: Contains features related to E. coli proteins and their cellular localization sites.
To run this project, ensure you have Python and Poetry installed on your system.
Clone the project:
git clone [email protected]:hericlesferraz/comparative_study_clustering_techniques.git
Enter the project root:
cd comparative_study_clustering_techniques
Install dependencies:
poetry install
Run each cell of the Jupyter notebook located at the path below and enjoy the results:
comparative_study_clustering_techniques/jupyter/comparative_study_clustering_techniques.ipynb
Plot of the metric results for each algorithm:
Plot of the elbow method:
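For readers unfamiliar with the elbow method, the idea is to run K-means for several values of k, record the within-cluster sum of squares (inertia), and look for the k where the curve stops dropping sharply. The sketch below uses a small pure-Python K-means on toy data. The helper `kmeans_inertia`, the toy blobs, and the seeds are assumptions for illustration, not the notebook's code.

```python
import random
from math import dist


def kmeans_inertia(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's K-means and return the within-cluster sum of squares.

    Hypothetical helper: a single random initialisation, no restarts.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    # Inertia: squared distance from each point to its closest center.
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)


# Toy data: three tight blobs (illustrative only).
rng = random.Random(42)
blob_centers = [(0, 0), (10, 0), (5, 8)]
points = [
    (cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
    for cx, cy in blob_centers
    for _ in range(20)
]

# Inertia for k = 1..6; plotting these values against k gives the elbow curve.
inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

Because this sketch uses one random initialisation per k, it can settle in a poor local optimum. A library implementation with several restarts gives a smoother curve.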