
CMPE 239 HW6 Report

Shiyu Mou, 011551358. Rank and score: No. 4, with a score of 0.67.

Homework goal

Text Clustering

The objectives of this assignment are the following:

  • Implement the K-Means algorithm
  • Deal with text data (news records) in document-term sparse matrix format.
  • Design a proximity function for text data
  • Think about the Curse of Dimensionality
  • Think about the best metrics for evaluating clustering solutions.

Approach

Data Pre-processing

  • The input data (provided as training data) consists of 8580 text records in sparse document-term format. No labels are provided.
  • First I read the dataset and converted it into a CSR matrix to speed up processing. After conversion the matrix has shape (8580, 126356); the values are word frequencies.
  • In text data (news records), raw frequency does not always indicate importance: many words appear frequently but are irrelevant. So we scale the CSR matrix by IDF.
  • Since we will use cosine similarity to compute distances, I then L2-normalized the IDF-scaled CSR matrix. (A sketch of these steps follows this list.)
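A minimal sketch of this pipeline, assuming each line of the input file lists alternating term-id/frequency pairs (the file name and parser are illustrative, not from the report):

import numpy as np
from scipy.sparse import csr_matrix

def load_csr(path):
    # Parse one document per line; assumed format: "term_id freq term_id freq ..."
    indptr, indices, data = [0], [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            indices.extend(int(t) for t in parts[0::2])
            data.extend(float(v) for v in parts[1::2])
            indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, max(indices) + 1))

def idf_scale(mat):
    # idf(t) = log(N / df(t)); reweight every stored value by its column's IDF
    df = np.bincount(mat.indices, minlength=mat.shape[1])
    idf = np.log(mat.shape[0] / np.maximum(df, 1))
    out = mat.copy()
    out.data *= idf[out.indices]
    return out

def l2_normalize(mat):
    # Scale each row to unit length so dot products equal cosine similarity
    norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=1))).ravel()
    out = mat.copy()
    out.data /= np.repeat(np.maximum(norms, 1e-12), np.diff(out.indptr))
    return out

X = l2_normalize(idf_scale(load_csr('train.dat')))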

Feature selection and Dimensionality reduction

I did try TruncatedSVD to reduce the dimensionality of the matrix, but the kernel kept dying when I ran that code. Since my results were already reasonable, I did not pursue it further.
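For reference, a sketch of the attempted reduction (scikit-learn's TruncatedSVD operates directly on sparse matrices; n_components=100 is an assumed value, not one from the report):

from sklearn.decomposition import TruncatedSVD

# Project the 126356-dimensional term space onto a low-rank approximation
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)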

K-Means Algorithm

The pseudocode is:

# Initialize centroids
centroids = initCentroids(X, k)
for it in range(iterations):
    # Cluster assignment step: assign each data point to the
    # closest centroid; idx[i] is the index of the centroid
    # assigned to example i
    idx = findCentroids(X, centroids)
    # Move-centroid step: recompute each centroid as the mean
    # of the points in its cluster
    centroids = computeMeans(X, idx, k)

Since the rows of the CSR matrix are already normalized, computing cosine similarity reduces to a matrix product:

sims = x1.dot(x2.T)

Once I had the similarity between every point and every centroid, I sorted each point's similarities (quicksort) and labeled the point with the index of the most similar centroid, i.e., the one at the smallest cosine distance. findCentroids() returns the list of labels, one per document.

After the assignment step, computeMeans() computes the mean within each cluster (the points sharing a label), and these means become the new centroids.

We repeat these two steps for the chosen number of iterations.
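A hedged sketch of these steps, assuming X is the L2-normalized CSR matrix; the helper names follow the pseudocode above, and the random initialization is one common choice rather than the report's exact method:

import numpy as np

def initCentroids(X, k, seed=0):
    # Pick k distinct documents as the starting centroids
    rng = np.random.RandomState(seed)
    picks = rng.choice(X.shape[0], size=k, replace=False)
    return X[picks].toarray()

def findCentroids(X, centroids):
    # One sparse-dense product gives all cosine similarities at once
    sims = X.dot(centroids.T)
    # Highest similarity = smallest cosine distance
    return np.asarray(sims).argmax(axis=1)

def computeMeans(X, idx, k):
    centroids = np.zeros((k, X.shape[1]))
    for c in range(k):
        members = X[idx == c]
        if members.shape[0]:
            mean = np.asarray(members.mean(axis=0)).ravel()
            # Re-normalize so dot products remain cosine similarities
            centroids[c] = mean / max(np.linalg.norm(mean), 1e-12)
    return centroids

def kmeans(X, k, iterations=20):
    centroids = initCentroids(X, k)
    for _ in range(iterations):
        idx = findCentroids(X, centroids)
        centroids = computeMeans(X, idx, k)
    return idx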

Evaluation

Since a clustering problem has no ground-truth labels, we need another way to evaluate the clustering result.

The method I selected is the Calinski-Harabasz score, a method for identifying clusters of points in a multidimensional Euclidean space. See the paper here: Link

For the implementation I used sklearn.metrics.calinski_harabaz_score to compute it.
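Usage is a one-liner. Older scikit-learn versions spell the function calinski_harabaz_score, as above; newer ones use calinski_harabasz_score. The metric expects a dense feature array, so this sketch scores against X_reduced from the TruncatedSVD sketch (densifying the full 126356-dimensional matrix would be very memory-hungry); k=7 is an illustrative value:

from sklearn.metrics import calinski_harabaz_score

labels = kmeans(X, k=7)
score = calinski_harabaz_score(X_reduced, labels)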

Parameters finding

There are only two parameters. The first is k; since its values were given in the instructions, we do not need to tune it.

The second, which is quite important, is the number of iterations. In every iteration we recompute the within-cluster means as the new centroids and reassign each point to its closest centroid. So we can expect that after a number of iterations the score stops growing, or grows very slowly, once the clusters have stabilized.

[Figure: evaluation scores for k from 3 to 21]
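A sketch of that sweep, using the kmeans() helper above (the step of 2 is an assumption; use step 1 to test every k):

for k in range(3, 22, 2):
    labels = kmeans(X, k)
    print(k, calinski_harabaz_score(X_reduced, labels))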
