
xvxvdee / cps803-finalproject


This repo contains code and data for solving a practical clustering problem using the Daily Kos blog entries dataset. The dataset consists of 3420 documents, a vocabulary of 6906 terms, and 467,714 words. The goal is to cluster the documents into meaningful groups based on their content.



CPS803-FinalProject

Solving a Practical Clustering Problem: Exploring the Daily Kos Dataset

This project is a machine learning assignment from Toronto Metropolitan University. The goal is to apply the KMeans algorithm to cluster a bag-of-words dataset from the Daily Kos political blog.

Data

The data consists of two files: the bag-of-words file in sparse format and the vocabulary. The repository's sample dataset contains 3420 documents, a vocabulary of 6906 terms, and 467,714 words. The vocabulary was created by tokenizing each document and removing stop words. A token was added to the vocabulary only if it occurred more than ten times.
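As a rough illustration of how the two files fit together, the sketch below reads a sparse bag-of-words listing and expands it back into token lists. It assumes the common layout of three header lines (document count, vocabulary size, number of entries) followed by `docID wordID count` triples with 1-based word IDs; the source does not confirm this layout. The function names and the toy input are hypothetical.

```python
def parse_bag_of_words(lines):
    """Read 'docID wordID count' triples that follow three header lines."""
    n_docs, n_terms, _ = (int(value) for value in lines[:3])
    counts = {}
    for line in lines[3:]:
        if not line.strip():
            continue  # skip blank lines
        doc_id, word_id, count = (int(value) for value in line.split())
        counts.setdefault(doc_id, {})[word_id] = count
    return n_docs, n_terms, counts


def build_posts(counts, vocabulary):
    """Expand the per-document counts back into lists of tokens."""
    posts = {}
    for doc_id in sorted(counts):
        tokens = []
        for word_id in sorted(counts[doc_id]):
            # Assumption: word IDs are 1-based line numbers in the vocabulary file.
            tokens.extend([vocabulary[word_id - 1]] * counts[doc_id][word_id])
        posts[doc_id] = tokens
    return posts


# Invented toy input, not taken from the repository.
vocabulary = ["bush", "iraq", "kerry"]
lines = ["2", "3", "3", "1 1 2", "1 2 1", "2 3 1"]
n_docs, n_terms, counts = parse_bag_of_words(lines)
posts = build_posts(counts, vocabulary)
print(posts)
```

A bag of words keeps counts only, so the rebuilt posts recover how often each term appears but not the original word order.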

Methods

The pipeline for preprocessing the data includes:

  • Building each post by using the bag of words file
  • Cleaning the content by replacing underscores, eliminating words with numbers, and stemming the vocabulary
  • Vectorizing the text using the TF-IDF vectorizer
  • Reducing the dimensionality using PCA
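The cleaning and weighting steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the project's code: underscores are assumed to be replaced with spaces, stemming and PCA are left out because they need an external library, and the weight is the textbook `tf * log(N / df)` form. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply a smoothed IDF and L2 normalization by default, so their values will differ. The function names and sample tokens are hypothetical.

```python
import math
import re
from collections import Counter


def clean(tokens):
    """Drop tokens that contain digits and split tokens on underscores."""
    cleaned = []
    for token in tokens:
        if re.search(r"\d", token):
            continue
        # Assumption: an underscore is replaced with a space, giving separate words.
        cleaned.extend(token.replace("_", " ").split())
    return cleaned


def tf_idf(posts):
    """Return one {term: weight} dict per post using tf * log(N / df)."""
    n_posts = len(posts)
    doc_freq = Counter()
    for tokens in posts:
        doc_freq.update(set(tokens))  # count each term once per post
    weights = []
    for tokens in posts:
        term_counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_posts / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Invented toy posts, not taken from the dataset.
posts = [clean(["bush", "iraq_war", "top10"]), clean(["bush", "election"])]
weights = tf_idf(posts)
print(weights)
```

A term that appears in every post, such as `bush` here, gets weight zero, which is why TF-IDF pushes the clustering toward terms that distinguish posts from one another.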

To cluster the bag of words, the KMeans algorithm was applied. To choose the number of clusters, the Elbow method was used: the Sum of Squared Errors (SSE) is computed for a range of values of k, and the value at which the SSE curve stops dropping sharply is taken as a candidate.
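The idea behind the Elbow method can be shown with a small, self-contained k-means written from scratch. This is an illustrative sketch, assuming plain Lloyd's iterations with random initial centroids; the project itself most likely relied on a library implementation, and the helper names, iteration count, seed, and toy points below are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, n_iter=20, seed=0):
    """Run Lloyd's algorithm and return the final sum of squared errors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(squared_distance(point, c) for c in centroids) for point in points)


# Invented toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

On this toy data the SSE falls steeply from k = 1 to k = 2 and only slightly afterwards, so the elbow sits at k = 2. On real text data the bend is usually much less distinct.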

Results

The optimal number of clusters was found to be four, based on the analysis of the SSE plot and the words in each cluster. The clusters were labeled as follows:

  • Cluster 0: General politics and news
  • Cluster 1: Iraq war and foreign policy
  • Cluster 2: US elections and candidates
  • Cluster 3: Bush administration and criticism
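One simple way to inspect the words in each cluster is to count the most frequent tokens per cluster label, as sketched below. The source does not say how the words were examined, so this is only an assumption about one possible approach; the function name, tokens, and labels are invented for illustration.

```python
from collections import Counter


def top_terms(posts, labels, n_terms=2):
    """Return the most frequent tokens for every cluster label."""
    counts = {}
    for tokens, label in zip(posts, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {
        label: [term for term, _ in counter.most_common(n_terms)]
        for label, counter in sorted(counts.items())
    }


# Invented toy posts and cluster assignments, not project output.
posts = [
    ["iraq", "war", "iraq"],
    ["war", "troops", "iraq"],
    ["kerry", "poll", "kerry"],
    ["poll", "kerry", "vote"],
]
labels = [1, 1, 2, 2]
summary = top_terms(posts, labels)
print(summary)
```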

Conclusions

The project demonstrated the use of KMeans to cluster a text-based dataset and the importance of weighing other evidence, such as the words that characterize each cluster, alongside the Elbow method when choosing the number of clusters. The clusters showed some meaningful patterns and topics that reflected the nature of the Daily Kos blog.
