text-clustering's Introduction

Learning Clustering (Bahasa Indonesia)

Code

Source: http://brandonrose.org/clustering, modified by kirra

Data sources

  1. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  2. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.

Steps

  1. tokenizing and stemming each article (Bahasa Indonesia)
  2. transforming the corpus into vector space using tf-idf
  3. calculating cosine distance between each pair of documents as a measure of similarity
  4. clustering the documents using the k-means algorithm
  5. using multidimensional scaling to reduce dimensionality within the corpus
  6. plotting the clustering output using matplotlib and mpld3
  7. conducting a hierarchical clustering on the corpus using Ward clustering
  8. plotting a Ward dendrogram
  9. topic modeling using Latent Dirichlet Allocation (LDA)
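Steps 2 to 4 above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the notebook's code: the function names (`tfidf_vectors`, `cosine_distance`, `kmeans`) and the tiny sample corpus are hypothetical, the documents are assumed to be already tokenized and stemmed (step 1), and the k-means seeding (the first k documents) is chosen only to keep the example deterministic. The notebook itself may well rely on a library for these steps.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized tf-idf vectors (sparse dicts)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf, so a term found in every document keeps a small weight.
        vec = {t: n * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, n in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; valid because both vectors have unit length."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def sq_euclidean(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors are used as initial centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_euclidean(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = Counter()
            for v in members:
                total.update(v)  # adds the weights term by term
            centroids[c] = {t: w / len(members) for t, w in total.items()}
    return labels


# Hypothetical, already-stemmed sample documents (two finance, two sport).
docs = [
    ["harga", "saham", "naik", "bursa"],
    ["tim", "sepak", "bola", "menang", "liga"],
    ["bursa", "saham", "turun", "harga"],
    ["liga", "sepak", "bola", "gol", "tim"],
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)  # prints [0, 1, 0, 1]
```

Documents that share no terms have a cosine distance of exactly 1.0, while the two finance documents (indices 0 and 2) are much closer to each other, which is why k-means groups them together.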

How to use

  1. download the news collections (Kompas and Tempo) and extract them into the "data" folder
  2. create a Python virtual environment >>> virtualenv env
  3. activate the virtual environment >>> source env/bin/activate
  4. install all dependencies >>> pip install -r requirements.txt
  5. run Jupyter >>> jupyter notebook
  6. open the file "Clustering.ipynb"

Example visualization

[Three example visualization images]

Sources for visualization

  1. http://adilmoujahid.com/posts/2015/01/interactive-data-visualization-d3-dc-python-mongodb/
  2. http://bl.ocks.org/lmatteis/efd9be8f472e673eef6ce9d1951256a9
  3. https://bl.ocks.org/bricedev/8b2da06ddef27d94cde9
  4. https://bl.ocks.org/jyucsiro/767539a876836e920e38bc80d2031ba7
  5. https://bl.ocks.org/emeeks/df6ea0128724289337ef

text-clustering's People

Contributors

ardhimaarik, dependabot[bot], erdearik
