xcluster

xcluster contains algorithms and evaluation tools for extreme clustering, i.e., clustering problems in which both the number of points and the number of clusters are large. Most notably, xcluster contains an implementation of PERCH (Purity Enhancing Rotations for Cluster Hierarchies). PERCH is an online extreme clustering algorithm that incrementally builds a tree with data points at its leaves. While inserting each data point, PERCH performs rotations to keep the tree accurate and as balanced as possible. In empirical experiments, PERCH produces purer trees faster than other algorithms; theoretical analysis shows that for separable data, PERCH builds trees with perfect dendrogram purity regardless of the order in which the data arrive. Technical details of the algorithm and analysis are forthcoming.
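
This README does not define dendrogram purity. As a rough illustration, the Python sketch below uses the definition common in the hierarchical clustering literature: over all pairs of points that share a ground-truth label, average the fraction of leaves under the pair's lowest common ancestor that carry that label. The `Node` class and all function names are hypothetical and are not part of xcluster; the project's own evaluation code may compute the measure differently. This brute-force version is only suitable for small trees.

```python
from itertools import combinations


class Node:
    """Hypothetical tree node; a leaf has no children and carries a label."""

    def __init__(self, children=(), label=None):
        self.children = list(children)
        self.label = label
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        """Return all leaves in the subtree rooted at this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def lca(a, b):
    """Return the lowest common ancestor of two nodes in the same tree."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def dendrogram_purity(root):
    """Average, over same-label leaf pairs, the purity of the pair's LCA subtree."""
    total, pairs = 0.0, 0
    for a, b in combinations(root.leaves(), 2):
        if a.label != b.label:
            continue
        under = lca(a, b).leaves()
        total += sum(1 for leaf in under if leaf.label == a.label) / len(under)
        pairs += 1
    if pairs == 0:
        raise ValueError("no two leaves share a label")
    return total / pairs


def leaf(label):
    return Node(label=label)


# Each class sits in its own subtree: every same-label pair has a pure LCA.
pure = Node([Node([leaf("A"), leaf("A")]), Node([leaf("B"), leaf("B")])])
# Classes are interleaved: same-label pairs only meet at the root.
mixed = Node([Node([leaf("A"), leaf("B")]), Node([leaf("A"), leaf("B")])])

print(dendrogram_purity(pure))   # 1.0
print(dendrogram_purity(mixed))  # 0.5
```

In the second tree, both same-label pairs meet only at the root, where half of the four leaves match, so the purity drops to 0.5.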

Setup

If you plan to run the Python code, download and install Anaconda's Python 3:

https://docs.continuum.io/anaconda/install

If you plan to run the Python code, install numba:

conda install numba

Set environment variables:

source bin/setup.sh

Install Maven if you don't already have it:

./bin/util/install_mvn.sh

Build Scala code:

./bin/build.sh

Download data:

./bin/download_data.sh

Run

Scala

Run Test on Separated Data:

./bin/test/test_perch_dendrogram_purity.sh

Run PERCH on Small Scale Data (glass dataset):

# Hierarchical clustering
./bin/hierarchical/glass/run_perch.sh

# Flat clustering
./bin/flat/glass/run_perch.sh

Run PERCH on ALOI (see notes below for suggested system environment):

# Hierarchical clustering
./bin/hierarchical/aloi/run_perch.sh

# Flat clustering
./bin/flat/aloi/run_perch.sh

Python

Run Test on Separated Data:

./bin/test/test_perch_dendrogram_purity_py.sh

Run PERCH on Small Scale Data (glass dataset):

# Hierarchical clustering
./bin/hierarchical/glass/run_perch_py.sh

Run PERCH on ALOI:

# Hierarchical clustering
./bin/hierarchical/aloi/run_perch_py.sh

Notes

  • The ALOI scripts are configured for a machine with about 24 cores and 60 GB of memory. Most of that computation goes to computing dendrogram purity; the PERCH algorithm itself runs efficiently with far fewer resources (even 1 thread and a few gigabytes of memory).
  • To run the experiment shell scripts as is, you need Perl installed on your system; the scripts use it to shuffle the data. If you can't run Perl, replace that step with another shuffling method of your choice.
  • The scripts in this project rely on environment variables set by the setup script (bin/setup.sh). Source it in every shell session in which you run this project.
  • Java Version 1.8 and Scala 2.11.7 are used in this project. Java 1.8 must be installed on your system. It is not necessary to have Scala installed.
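
If Perl is not available, one possible replacement for the shuffling step is a few lines of Python. This is a minimal sketch and not the project's own method: it assumes the data file stores one data point per line and fits in memory, and the function name, file paths, and `seed` parameter are hypothetical.

```python
import random


def shuffle_lines(in_path, out_path, seed=None):
    """Write the lines of in_path to out_path in random order.

    Assumes one data point per line (an assumption, not confirmed here).
    Returns the number of lines written.
    """
    with open(in_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    # Ensure the last line ends with a newline so that no two
    # records are fused together after reordering.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    # A dedicated Random instance makes the order reproducible for a given seed.
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
    return len(lines)
```

For example, `shuffle_lines("data/points.tsv", "data/points.shuffled.tsv", seed=42)` would write a reproducibly shuffled copy; both paths are placeholders. You would still need to edit the relevant shell script so that it calls this function in place of the Perl command.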
