ClusOpt Core

This package is used by ClusOpt for its CPU-intensive tasks, but it can easily be imported into any Python data stream clustering project. It is written mainly in C/C++ with Python bindings, and features:

  • CluStream (based on the MOA implementation)
  • StreamKM++ (a wrapper around the original paper authors' implementation)
  • Distance matrix computation (in-place implementation using Boost threads)
  • Silhouette score (custom in-place implementation inspired by the BIRCH clustering vector)

Prerequisites

  • Python >= 3.6
  • pip
  • boost-thread
  • gcc >= 6

On Debian-based systems, boost-thread can be installed with:

apt install libboost-thread-dev
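Before building, it can help to confirm that the listed prerequisites are present. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and the check is deliberately shallow: it does not verify the gcc version or look for boost-thread.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6)):
    """Return a dict mapping each checked prerequisite to True or False."""
    return {
        # The interpreter running this script must be recent enough.
        "python": sys.version_info >= min_python,
        # pip and gcc are only looked up on PATH; versions are not checked.
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        "gcc": shutil.which("gcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```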

Usage

See the examples folder for more.

CluStream online clustering

from clusopt_core.cluster import CluStream
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

k = 32

dataset, _ = make_blobs(n_samples=64000, centers=k, random_state=42, cluster_std=0.1)

model = CluStream(
    m=k * 10,  # number of microclusters
    h=64000,  # horizon
    t=2,  # radius factor
)

chunks = np.split(dataset, len(dataset) // 4000)  # 16 chunks of 4000 samples

model.init_offline(chunks.pop(0), seed=42)

for chunk in chunks:
    model.partial_fit(chunk)

clusters, _ = model.get_macro_clusters(k, seed=42)

plt.scatter(*dataset.T, marker=",", label="datapoints")

plt.scatter(*model.get_partial_cluster_centers().T, marker=".", label="microclusters")

plt.scatter(*clusters.T, marker="x", label="macro clusters", color="black")

plt.legend()
plt.show()

Output:

[Figure: CluStream clustering results]
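The comments in the example name three parameters: the number of microclusters (m), the horizon (h), and a radius factor (t). As a rough illustration of what a radius factor does in CluStream-style algorithms, the pure-Python sketch below keeps a microcluster as a cluster feature vector (count, linear sum, squared sum) and absorbs a point only if it lies within t times the microcluster's RMS radius. The class and method names (`MicroCluster`, `insert`, `absorbs`) are hypothetical and are not part of clusopt_core; the package's C/C++ implementation may differ in its details.

```python
import math


class MicroCluster:
    """Hypothetical cluster-feature summary: count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim

    def insert(self, point):
        # All three statistics are additive, so an update costs O(dim).
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # RMS deviation from the center: sqrt(sum_i (E[x_i^2] - E[x_i]^2)).
        var = sum(
            sq / self.n - (ls / self.n) ** 2
            for ls, sq in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(max(var, 0.0))

    def absorbs(self, point, t):
        # Accept the point if it lies within t * radius of the center.
        center = self.center()
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))
        return dist <= t * self.radius()


mc = MicroCluster(dim=2)
for p in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    mc.insert(p)

print(mc.center(), mc.radius())
print(mc.absorbs((1.5, 1.5), t=2))  # close to the center
print(mc.absorbs((5.0, 5.0), t=2))  # far outside t * radius
```

Because the summary is additive, the raw points never need to be stored, which is what makes this kind of structure suitable for streams.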

Benchmarks

Some functions in clusopt_core are faster than their scikit-learn counterparts; see the benchmark folder for more information.

Silhouette

Each bar is labeled with a tuple of (no_samples, dimension, no_groups). In the benchmarked configurations, the clusopt_core implementation is faster regardless of these three factors.

[Figure: clusopt silhouette versus scikit-learn silhouette execution time]
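For reference, the quantity being timed is the mean silhouette coefficient. For each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the sample's score is (b - a) / max(a, b). The sketch below is a naive pure-Python version meant only to make the definition concrete. It is neither the clusopt_core implementation nor scikit-learn's, the function name `silhouette_naive` is made up for this example, and it assumes at least two clusters with distinct points.

```python
import math


def silhouette_naive(points, labels):
    """Mean silhouette coefficient using O(n^2) distances; illustration only."""

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Group sample indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_naive(points, labels)
print(round(score, 3))
```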

Distance Matrix

Each bar shows the dataset dimension. The clusopt_core implementation is faster when the dataset dimension is small (roughly below 150), even when scikit-learn uses 4 processes.

[Figure: clusopt distance matrix versus scikit-learn pairwise distance execution time]
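To make the benchmarked operation concrete, here is a minimal pure-Python sketch of a pairwise Euclidean distance matrix. It is an illustration under stated assumptions, not the package's threaded C/C++ code: the function name `distance_matrix` is hypothetical, and the only optimization shown is computing each symmetric pair once.

```python
import math


def distance_matrix(points):
    """Return the full symmetric Euclidean distance matrix as nested lists."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            # Distances are symmetric, so each pair is computed only once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


m = distance_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
print(m[0][1], m[0][2], m[1][2])
```

The work grows with the number of pairs, n(n-1)/2, times the dimension of each point, so both factors affect the running time.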

Installation

You can install it directly from PyPI with

pip install clusopt-core

or you can clone this repo and install from the directory

pip install ./clusopt_core
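After either install route, a quick way to confirm that the package is visible to the current interpreter is to look it up without importing it. This is a small sketch using only the standard library; the helper name `is_installed` is hypothetical, while the module name `clusopt_core` is the one imported in the usage example.

```python
import importlib.util


def is_installed(module_name="clusopt_core"):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("clusopt_core installed:", is_installed())
```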

Acknowledgments

Thanks to:

  • Marcel R. Ackermann et al. for the StreamKM++ algorithm
  • The University of Waikato for the MOA framework

Contributors

giuliano-macedo

clusopt_core's Issues

Test in ARM-based system

GCC should compile the code without problems on any architecture, but to be sure, this test should be run and the results documented in README.md.

Test in windows

I'm pretty sure that with MSYS2 all the source files will compile fine, but this test must be run and the findings documented in README.md.

Docs

Most of the code has Google-style docstrings; however, I haven't set up a documentation tool such as Sphinx.
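For readers unfamiliar with the convention, a Google-style docstring looks like the sketch below. The function `euclidean` is a made-up example and does not come from the package. Sphinx can parse this style through its `sphinx.ext.napoleon` extension.

```python
def euclidean(p, q):
    """Compute the Euclidean distance between two points.

    Args:
        p: Sequence of floats, the first point.
        q: Sequence of floats, the second point; same length as ``p``.

    Returns:
        The distance as a float.

    Raises:
        ValueError: If the points have different lengths.
    """
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


print(euclidean((0.0, 0.0), (3.0, 4.0)))
```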
