zigasajovic / consensus_clustering

An implementation of Consensus clustering in Python

License: MIT License

Python 100.00%

consensus-clustering python clustering

consensus_clustering's Introduction

Consensus clustering

This repository contains a Python implementation of consensus clustering, following the paper Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.

ConsensusCluster

The class containing the implementation.

Attributes

cluster : the class used to perform the clustering (like KMeans from sklearn)
- NOTE: the class must accept an n_clusters parameter when instantiated and provide a fit_predict method, which is called on the data.
L : smallest number of clusters to try
K : largest number of clusters to try
H : number of resamplings for each number of clusters
resample_proportion : proportion of the data to sample in each resampling
Mk : consensus matrices for each k (shape = (K, data.shape[0], data.shape[0]))
- NOTE: every consensus matrix is retained, as specified in the paper
Ak : area under CDF for each number of clusters
- (see paper: section 3.3.1. Consensus distribution.)
deltaK : changes in areas under CDF
- (see paper: section 3.3.1. Consensus distribution.)
bestK : number of clusters that was found to be best
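
As a rough illustration of how Ak and deltaK could be derived from a consensus matrix, the sketch below follows one reading of the paper's section 3.3.1: the sorted off-diagonal entries x_1..x_m define an empirical CDF, and the area is the sum of (x_i - x_{i-1}) * CDF(x_i). The names `area_under_cdf` and `relative_changes` are hypothetical, plain nested lists stand in for arrays, and the repository's own computation may differ in detail (any special-casing of the smallest k is left out).

```python
import bisect


def area_under_cdf(consensus):
    """Area under the empirical CDF of the upper-triangle consensus values."""
    n = len(consensus)
    # Collect each pair (i, j) with i < j once, then sort the values.
    x = sorted(consensus[i][j] for i in range(n) for j in range(i + 1, n))
    m = len(x)
    area = 0.0
    for i in range(1, m):
        # CDF(x_i): fraction of entries less than or equal to x_i.
        cdf = bisect.bisect_right(x, x[i]) / m
        area += (x[i] - x[i - 1]) * cdf
    return area


def relative_changes(areas):
    """Relative change between the areas of consecutive numbers of clusters."""
    return [(b - a) / a for a, b in zip(areas, areas[1:])]


# Hypothetical 4-example consensus matrix: two clean clusters {0, 1} and {2, 3}.
clean = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
print(area_under_cdf(clean))
print(relative_changes([0.5, 0.75]))
```

The number of clusters whose relative change is largest would then be a natural candidate for bestK, under the assumptions above.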

Methods

ConsensusCluster.__init__

Parameters:
    * cluster : the class used to perform the clustering (like KMeans from sklearn)
      * NOTE: the class must accept an `n_clusters` parameter when instantiated
        and provide a `fit_predict` method, which is called on the data.
    * L : smallest number of clusters to try
    * K : largest number of clusters to try
    * H : number of resamplings for each number of clusters
    * resample_proportion : proportion of the data to sample in each resampling
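
To make the interface requirement on `cluster` concrete, here is a minimal sketch of a class that satisfies it. `QuantileClusterer` is a hypothetical toy, not part of this repository or of sklearn; it only shows the two things the note asks for: a constructor taking `n_clusters` and a `fit_predict` method returning one label per example.

```python
class QuantileClusterer:
    """Hypothetical toy clusterer: groups rows by the rank of their first feature."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters

    def fit_predict(self, data):
        # Rank the rows by their first feature, then cut the ranking
        # into n_clusters groups of (nearly) equal size.
        order = sorted(range(len(data)), key=lambda i: data[i][0])
        labels = [0] * len(data)
        for rank, i in enumerate(order):
            labels[i] = rank * self.n_clusters // len(data)
        # A plain list is returned here; real estimators such as sklearn's
        # return a NumPy array, which the actual class may rely on.
        return labels


data = [[0.1], [0.2], [5.0], [5.1]]
labels = QuantileClusterer(n_clusters=2).fit_predict(data)
print(labels)
```

Any class with this shape, including sklearn estimators such as KMeans, should fit the description given for the `cluster` parameter.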

ConsensusCluster.fit

Fits all attributes of the class to data

Parameters:
    * data : data.shape == (n_examples, n_features)
    * verbose : whether to print progress while fitting
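
The resampling loop behind `fit` can be pictured roughly as follows. This is a dependency-free sketch of the consensus matrix as defined in the paper (the number of times two examples are clustered together, divided by the number of times they are sampled together), not the repository's code. `consensus_matrix` and `RangeBinClusterer` are hypothetical names, and plain lists stand in for NumPy arrays.

```python
import random
from itertools import combinations


class RangeBinClusterer:
    """Hypothetical toy clusterer: bins rows by the value range of their first feature."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters

    def fit_predict(self, rows):
        xs = [r[0] for r in rows]
        lo, hi = min(xs), max(xs)
        if hi == lo:
            return [0] * len(xs)
        k = self.n_clusters
        return [min(k - 1, int((x - lo) / (hi - lo) * k)) for x in xs]


def consensus_matrix(data, cluster, k, n_resamples, proportion, seed=0):
    """Consensus matrix for one value of k, built from n_resamples subsamples."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were both sampled
    size = int(n * proportion)
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)    # subsample without replacement
        labels = cluster(n_clusters=k).fit_predict([data[i] for i in idx])
        # pa/pb are positions in the subsample, i/j the original indices.
        for (pa, i), (pb, j) in combinations(enumerate(idx), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[pa] == labels[pb]:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [
            1.0 if i == j
            else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


data = [[0.0], [0.1], [0.2], [9.0], [9.1], [9.2]]
M = consensus_matrix(data, RangeBinClusterer, k=2, n_resamples=20, proportion=0.8)
for row in M:
    print(row)
```

Mapping subsample positions back to original indices before counting is the step that keeps every entry between 0 and 1; pairs that were never sampled together are set to 0 here, which is a choice made for this sketch.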

ConsensusCluster.predict

Predicts the clustering on the consensus matrix, for the best number of clusters found

Returns:
    * Cluster labels for each example

ConsensusCluster.predict_data

Predicts the clustering on the data, for the best number of clusters found

Parameters:
    * data : data.shape == (n_examples,n_features)

Returns:
    * Cluster labels for each example

consensus_clustering's Issues

Pip Package

Is it possible to also publish this package on PyPI using setuptools?

I can submit a PR for this if needed.

There may be something wrong?

Can an element of Mk[i_] be greater than the corresponding element of Is?

Resulting matrix has very large value?

I am using KMeans for the ConsensusCluster class (other clustering classes behave the same way). I call fit and then predict right after it. The resulting matrix has 1s on the diagonal, but there are very large values like 500000.0 in some places in the matrix.
Is that a bug? Thanks.

Here is full code:

        c = ConsensusCluster(cluster.KMeans, 2, 3, 4, 1)
        c.fit(self.d1)
        _, similarity1 = c.predict()

BTW, I modified the source code so that the matrix for the best k is also returned by predict. This similarity1 is the matrix that has the issue.

License for repo

Hi - I would like to use this implementation.

Can you add a license for this code?

Inconsistent details in the README

Hi,

In the README, it is written:

NOTE: needs fit_predict method called with parameter n_clusters

However, SciPy's k-means does not have a fit_predict method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html)

Did you mean scikit-learn models maybe?

Set n_jobs of the clustering model

Amazing implementation.

One important suggestion: it is currently not possible to set the n_jobs of the cluster model. This would be nice to add.

E.g., just as you set the input argument n_clusters=k in Mh = self.cluster_(n_clusters=k).fit_predict(resample_data), you could similarly pass the n_jobs input argument (most sklearn models have this argument).
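
One possible workaround that needs no change to the library, assuming the class is only ever invoked as `self.cluster_(n_clusters=k)` as in the line quoted above, is to pre-bind the extra keyword arguments with `functools.partial`. `ToyClusterer` below is a hypothetical stand-in for an estimator class that accepts `n_jobs`; whether a given sklearn model accepts that argument depends on the model and version.

```python
from functools import partial


class ToyClusterer:
    """Hypothetical stand-in for an estimator class with an extra keyword argument."""

    def __init__(self, n_clusters, n_jobs=None):
        self.n_clusters = n_clusters
        self.n_jobs = n_jobs

    def fit_predict(self, data):
        # Placeholder labelling, only here to complete the interface.
        return [i % self.n_clusters for i in range(len(data))]


# Bind n_jobs once; the result can still be called with n_clusters alone.
cluster = partial(ToyClusterer, n_jobs=2)
model = cluster(n_clusters=3)  # same call shape as self.cluster_(n_clusters=k)
print(model.n_clusters, model.n_jobs)
```

The partial object could then be passed wherever the `cluster` class is expected.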

name 'bisect' is not defined

Hey there :)

I'm using your script and I get an error saying:
name 'bisect' is not defined

This is the code I'm using:

kmeans_=KMeans
cc = ConsensusCluster(cluster=kmeans_, L= 10, K= 30, H=10)
cc = cc.fit(np.array(data), verbose = True)

Thanks in advance
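
For context, this kind of NameError means that `bisect` is referenced in a scope where the standard-library module of that name has not been imported; the report does not show where the reference occurs. A minimal sketch of importing and using the module, independent of this repository's code, on a hypothetical sorted label list:

```python
import bisect

# Sorted cluster labels; find the slice that holds label 1.
sorted_labels = [0, 0, 1, 1, 1, 2]
start = bisect.bisect_left(sorted_labels, 1)
stop = bisect.bisect_right(sorted_labels, 1)
print(sorted_labels[start:stop])
```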