
mixdir's Introduction

mixdir

The goal of mixdir is to cluster high dimensional categorical datasets.

It can

  • handle missing data
  • infer a reasonable number of latent classes (try mixdir(select_latent=TRUE); see the sketch after this list)
  • cluster datasets with more than 70,000 observations and 60 features
  • propagate uncertainty and produce a soft clustering
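
As referenced above, a minimal sketch of letting mixdir pick the number of classes, assuming n_latent then acts as an upper bound on the number of occupied classes (the data and subsetting are taken from the example further down):

# Start generously and let the model switch off superfluous classes
library(mixdir)
data("mushroom")
res <- mixdir(mushroom[1:1000, 1:5], n_latent = 10, select_latent = TRUE)

# How many classes are actually occupied
table(res$pred_class)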

A detailed description of the algorithm and the features of the package can be found in the accompanying paper. If you find the package useful, please cite

C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data", 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

Installation

install.packages("mixdir")

# Or, to get the latest version from GitHub
devtools::install_github("const-ae/mixdir")

Example

Clustering the mushroom data set.

# Loading the library and the data
library(mixdir)
set.seed(1)

data("mushroom")
# High-dimensional dataset: 8124 mushrooms and 23 different features
mushroom[1:10, 1:5]
#>    bruises cap-color cap-shape cap-surface    edible
#> 1  bruises     brown    convex      smooth poisonous
#> 2  bruises    yellow    convex      smooth    edible
#> 3  bruises     white      bell      smooth    edible
#> 4  bruises     white    convex       scaly poisonous
#> 5       no      gray    convex      smooth    edible
#> 6  bruises    yellow    convex       scaly    edible
#> 7  bruises     white      bell      smooth    edible
#> 8  bruises     white      bell       scaly    edible
#> 9  bruises     white    convex       scaly poisonous
#> 10 bruises    yellow      bell      smooth    edible

Calling the clustering function mixdir on a subset of the data:

# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000,  1:5], n_latent=3)

Analyzing the result

# Latent class of the first 10 mushrooms
head(result$pred_class, n=10)
#>  [1] 3 1 1 3 2 1 1 1 3 1
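
A quick way to see how the 1000 mushrooms are spread over the classes (plain base R, not part of the package output):

# Cluster sizes per latent class
table(result$pred_class)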

# Soft clustering for the first 10 mushrooms
head(result$class_prob, n=10)
#>               [,1]         [,2]         [,3]
#>  [1,] 3.103495e-07 1.055098e-05 9.999891e-01
#>  [2,] 9.998594e-01 4.683764e-06 1.359291e-04
#>  [3,] 9.998944e-01 3.111462e-06 1.025194e-04
#>  [4,] 5.778033e-04 7.114603e-08 9.994221e-01
#>  [5,] 3.662625e-07 9.999992e-01 4.183025e-07
#>  [6,] 9.996461e-01 8.764031e-08 3.537838e-04
#>  [7,] 9.998944e-01 3.111462e-06 1.025194e-04
#>  [8,] 9.997331e-01 5.822320e-08 2.668420e-04
#>  [9,] 5.778033e-04 7.114603e-08 9.994221e-01
#> [10,] 9.999999e-01 5.850067e-09 9.845112e-08
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
                  labels_col = paste("Class", 1:3))

# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)
#> $bruises
#>      bruises           no 
#> 0.9998223256 0.0001776744 
#> 
#> $`cap-color`
#>        brown         gray          red        white       yellow 
#> 0.0001775934 0.0001819672 0.0001776373 0.4079822666 0.5914805356 
#> 
#> $`cap-shape`
#>      bell    convex      flat    sunken 
#> 0.3926736 0.4767291 0.1304197 0.0001776 
#> 
#> $`cap-surface`
#>   fibrous     scaly    smooth 
#> 0.0568571 0.4871396 0.4560033 
#> 
#> $edible
#>       edible    poisonous 
#> 0.9998223174 0.0001776826
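
Assuming category_prob is a nested list (features, then classes, then a named vector of answer probabilities), as the purrr::map() call above suggests, single entries can be pulled out directly:

# Probability that a class-1 mushroom has a yellow cap (~0.59 above)
result$category_prob[["cap-color"]][[1]]["yellow"]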

# The most predictive features for each class
find_predictive_features(result, top_n=3)
#>       column    answer class probability
#> 19 cap-color    yellow     1   0.9993990
#> 22 cap-shape      bell     1   0.9990947
#> 1    bruises   bruises     1   0.7089533
#> 48    edible poisonous     3   0.9980468
#> 15 cap-color       red     3   0.8462032
#> 9  cap-color     brown     3   0.6473043
#> 5    bruises        no     2   0.9990364
#> 11 cap-color      gray     2   0.9978218
#> 32 cap-shape    sunken     2   0.9936162
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))
#>          [,1]         [,2]         [,3]
#> [1,] 0.999399 0.0003004692 0.0003004907
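
The same call should accept several observed answers at once; this is a sketch assuming predict() takes any subset of features as a named vector, as the single-feature call above suggests:

# Posterior class probabilities given two observed answers
predict(result, c(`cap-color`="yellow", bruises="bruises"))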

# Note the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
#>         column  answer class probability
#> 1      bruises bruises     1   0.9998223
#> 43      edible  edible     1   0.9998223
#> 19   cap-color  yellow     1   0.5914805
#> 3      bruises bruises     3   0.9995546
#> 27   cap-shape  convex     3   0.7460615
#> 9    cap-color   brown     3   0.6746224
#> 44      edible  edible     2   0.9995310
#> 5      bruises      no     2   0.9713177
#> 35 cap-surface fibrous     2   0.7355413
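
One way to read the difference (an interpretation, not a quote from the documentation): a typical answer is one that members of a class are likely to give, P(answer | class), while a predictive answer is one that, if observed, makes the class likely, P(class | answer). The two are linked by Bayes' rule; the following sketch assumes the fitted object stores the marginal class weights in result$lambda:

# P(class | yellow cap) from P(yellow cap | class) and the class weights
p_yellow_given_class <- sapply(result$category_prob[["cap-color"]], `[`, "yellow")
p_class_given_yellow <- p_yellow_given_class * result$lambda /
  sum(p_yellow_given_class * result$lambda)
p_class_given_yellow  # should roughly reproduce the predict() output above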

Dimensionality Reduction

# Defining Features
def_feat <- find_defining_features(result, mushroom[1:1000,  1:5], n_features = 3)
print(def_feat)
#> $features
#> [1] "cap-color" "bruises"   "edible"   
#> 
#> $quality
#> [1] 74.35146

# Plotting the most important features gives an immediate impression
# of how the clusters differ
plot_features(def_feat$features, result$category_prob)
#> Loading required namespace: ggplot2
#> Loading required namespace: tidyr
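
To decide how many features are worth keeping, one option (a sketch relying only on the quality score that find_defining_features already returns) is to compare that score across subset sizes and look for diminishing returns:

# Quality of the best 1- to 4-feature subsets
sapply(1:4, function(k)
  find_defining_features(result, mushroom[1:1000, 1:5], n_features = k)$quality)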

Underlying Model

The package implements a variational inference algorithm to fit a Bayesian latent class model (LCM).
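
As a toy illustration of that model family (the symbols, priors, and parameter names below are illustrative, not the package's internals): each observation i draws a latent class z_i from mixing weights lambda, and every categorical feature j is then drawn from that class's answer distribution U[j, z_i, ].

# Simulate from a simple latent class model and let mixdir recover it
set.seed(42)
n <- 500; n_latent <- 3; n_features <- 5; n_answers <- 4

# One draw from a Dirichlet distribution (base R only)
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) }

lambda <- rdirichlet1(rep(1, n_latent))                  # class weights
U <- array(dim = c(n_features, n_latent, n_answers))     # answer probabilities
for (j in seq_len(n_features))
  for (k in seq_len(n_latent))
    U[j, k, ] <- rdirichlet1(rep(0.5, n_answers))

z <- sample(n_latent, n, replace = TRUE, prob = lambda)  # latent class per observation
X <- as.data.frame(sapply(seq_len(n_features), function(j)
  vapply(z, function(k) sample(LETTERS[1:n_answers], 1, prob = U[j, k, ]),
         character(1))), stringsAsFactors = TRUE)

# mixdir only sees X and tries to recover z and the class profiles
fit <- mixdir(X, n_latent = 3)
table(fit$pred_class, z)   # recovered vs. true classes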


mixdir's Issues

Error: missing value where TRUE/FALSE needed

Hi!

I am running the mixdir command on a matrix that has patients in the columns and features (introns) in the rows; the entries are categorized as No/Weak/Strong/Very_Strong (indicating the degree of inclusion of the intron). The dimensions of the matrix are 186 x 947. Here is the head of the matrix:

mat_data[1:6, 1:6]


                         C5RR0ACXX_6_18 C5RR0ACXX_7_23 C5RRKACXX_6_27 C5RRKACXX_7_4
chr1;1922388;1922984;-   "No"           "No"           "No"           "No"         
chr1;6146423;6146664;-   "No"           "No"           "No"           "No"         
chr1;15567269;15567777;+ "No"           "No"           "No"           "No"         
chr1;16029299;16029730;+ "No"           "No"           "No"           "No"         
chr1;16132273;16132377;- "No"           "No"           "No"           "No"         
chr1;20483719;20484366;- "No"           "No"           "No"           "No"         
                         C5RRKACXX_8_13 C5RT7ACXX_2_2
chr1;1922388;1922984;-   "No"           "No"         
chr1;6146423;6146664;-   "No"           "No"         
chr1;15567269;15567777;+ "No"           "No"         
chr1;16029299;16029730;+ "No"           "No"         
chr1;16132273;16132377;- "No"           "No"         
chr1;20483719;20484366;- "No"           "No"

What I'm trying to do is cluster the patients according to the inclusion of the introns (the features). When I run the mixdir command, the following error appears:

res <- mixdir(mat_data, max_iter = 100, select_latent = TRUE)

Error in if (iter != 1 && !is.infinite(elbo) && elbo - elbo_hist[iter - : missing value where TRUE/FALSE needed

Curiously, if I restrict the analysis to the first 50 rows, sometimes the error appears and sometimes it does not.

Thanks for your help.

Best regards,
Juan Luis
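
A hedged note for readers who hit a similar error (not an official answer from the maintainer): mixdir expects one row per observation, so with patients in the columns the matrix would usually be transposed and converted to a data frame first; whether that also resolves the ELBO comparison error above is not confirmed.

# Hypothetical reshaping so that patients become rows (illustration only)
patient_data <- as.data.frame(t(mat_data), stringsAsFactors = FALSE)
res <- mixdir(patient_data, max_iter = 100, select_latent = TRUE)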

question about reproducibility

Dear authors,
Thanks for the implementation of the algorithm; it is very useful.
How can I check, for example, for cluster reproducibility?
I noticed that without setting a seed, the clustering can change significantly.
Do I need to increase the number of repetitions?
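
A hedged sketch of the two knobs that usually matter here, assuming the repetitions argument restarts the variational optimization with different initializations and keeps the best run (check ?mixdir to confirm on the installed version):

# Fix the random initialization and restart the fit several times
set.seed(1)
res <- mixdir(mushroom[1:1000, 1:5], n_latent = 3, repetitions = 10)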
