Giter VIP home page Giter VIP logo

alg_think_mod_3's Introduction

Comparison of Clustering Algorithms

In this assignment for Coursera's Algorthmic Thinking course we were to create two functions for clustering data into subsets of similar items. We were then to analyze the performance of the functions in three areas:

  • Efficiency
  • Automation
  • Quality

For the purposes of having data to perform operations on, we used data that correlated cancer risk by zip code in the United States.

Clustering Algorithms

The two algorithms being analyzed were hierarchical clustering and k-means clustering.

Hierarchical Clustering

The first function we were to create was one which performed the hierarchical clustering of data by first initializing all starting data points as a cluster, and then iteratively finding the two clusters with the closest centers and merging them. This is done until we have only the number of desired clusters remaining.

K-means Clustering

The second method involved initializing the desired number of clusters as empty clusters. For our purposes, we initialized the the desired empty clusters at locations that correlated to the highest population counties. The data is then clustered to those initial points by distance and the center is re-calculated. After this, the steps of clustering and re-centering is done a desired number of iterations.

Efficiency

Plot of the function runtimes for hierarchical and k-means clustering algorithms

Clear winner here is the k-means clustering function. Asymptotically we can say that it's running at O(n) while the hierarchical algorithm is running at O(n2log2n).

Automation

Visualizing the Algorithms' data

Plot of hierarchical data for the 111 counties with the highest cancer rates clustered into nine regions: hierarchical Plot

Plot of k-means data for the 111 counties with the highest cancer rates clustered into nine regions K-means Plot

Looking at the west coast we can already see some issues with the k-means graph clustering because of the initial given population centers in southern California should be one cluster and not two. Mathematically, we can compute the distortion as the sum of the error for each cluster. So, for the two plots we find that:

  • Hierarchical Distortion: 1.752 x 1011
  • K-means Distortion: 2.712 x 1011

Which leads me to believe that k-means requires a greater level of trial-and-error from a human trying to find a good set of starting points, whereas hierarchical can be left to run on its own, at least for this small dataset.

Quality

Distortion Comparison

Plot of distortions for 6 to 20 clusters on 111 counties. Plot of Distortions for 111 Counties

Plot of distortions for 6 to 20 clusters on 290 counties. Plot of Distortions for 290 Plot

Plot of distortions for 6 to 20 clusters on 896 counties. Plot of Distortions for 896 Plot

Initially, the hierarchical method clearly has the lower distortion, but as the dataset size increases we see that strength virtually evaporate. Considering that I think scalability is important, I think that both the aspect of function speed and the positive trend in distortion values as dataset size increases would lead me to choose k-means as the superior clustering method.

alg_think_mod_3's People

Contributors

el-zoltan avatar

Watchers

Zoltan Albert avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.