The search-framework-based-on-hierarchical-clustering from vardhanshah

Here, It is an implementation of hierarchical clustering, basically tree of clusters.

It is like tree, each node(cluster) can have number of childern(sub-clusters). Node(cluster) qualifies to have number of childern(cluster) and maximum number of children(sub-clusters) it can have, all can be defined in config (more on config file below !).

Input-Data-Set needed to be vectors, vectors of d-Dimension, where d>0 (positive number).

After the hierarchical clustering has been constructed, one of the uses of it can be for searching the neareset vector for a candid-vector. (Neareset-vector is decided by euclidian-distance.)

candid-vector needed to have dimension d, same as input-vectors.

So I used this hierarchical clustering for searching, In code execuction, you can also check seaching accuracy, by comparing it with brute-force-searching accuracy.

Clustering is done with help of K-Means algorithm. Intial values of k-means clustering has been choosen with help of K-means++ algorithm.

Implementation is written in python3

matplotlib is required (if you want to plot the cluster or mean path)

You need to define hyper-parameters of algorithm in config file. config file should be located in current directory.

In config file,

all parameter inside config file should preserve below order,

Parameters:
	Input_vectors_file
	Input_candid_vectors_file
	Output_file
	Minimum_cluster_size
	number_of_sub_clusters
	mean plot verbose
	cluster_plot_verbose
	coverge_loop_times
	optimization_loop_times

No specific name of the parameters is required, but order should be preserved

In above parameters,

Input_vectors_file denotes, file in which number of input vectors given

Input_candid_vectors_file denotes, file in which number of candid-vectors defined

Output_file denotes, file in which output of search result will be written

Minimum_cluster_size denotes, minimum cluster size needed for further clustering of a node, set the value of it an integer

number_of_sub_clusters denotes, Maximum possible sub-clusters will occur for each node, set the value of it an integer

mean_plot_verbose denotes, plotting of inital means choosen and path they take to converge, set the value of it true/false

cluster_plot_verbose denotes, plotting of clusters for each node, set the value of it true/false

Note: set mean_plot_verbose true, only when data size of input is small (because the plotting will occur for each 
	node, so for big size-data, number of nodes can be high) and dimension of input-vector >= 2

Note: set cluster_plot_verbose true, only for data size of input is not so big, cause as the number of inputs increase.
	plotting time incereases also and plotting will occur for each sub-cluster

converger_loop_times denotes, maximum number of iterations given to mean to converge a value

Now, first mean in K-means++ initialization algorithms is decided randomly, so there is chance of getting at local-optimum in 
clustering procedure, getting kind of bad cluster (now this has low impact for searching procedure). but it can be avoided by, one 
of the methods of optimization, which is intializing mean and finding stable mean, repeat this process number of times.

optimization_loop_times denotes above procedure.

Note: if the number of optimization_loop_times increases by k, then K-Means hierarchical clustering construction is also 
	increased by k of original time.

Example Config file:

input_file_vectors = vectors
input_file_candid_vectors = candid_vector
output_file = ans_cluster
minimum cluster size = 250
number of sub clusters = 2
mean plot verbose = false
cluster plot verbose = false
converge loop times = 1000
optimization loop times = 1

with this paramters,

Tree of cluster will be created, where each node will have two children, which means, each cluster will have two sub-clusters
sub-cluster will be created until cluster-size>=250

To execute:

python3 cluster_tree_constructor.py

output:

constructing hierarchical clustering...


Possible clusters:  2 level:  0
Vectors:  1000
502
498
Possible clusters:  2 level:  1
Vectors:  502
221
281
Possible clusters:  2 level:  2
Vectors:  281
166
115
Possible clusters:  2 level:  1
Vectors:  498
218
280
Possible clusters:  2 level:  2
Vectors:  280
142
138
hierarchical cluster has been constructed


starting searching procedure
searching procedure has been ended.
time taken by searching: 0.2417609691619873s


want to check accuracy by comparing it with brute - force solution[y\N]:  y
bruter-force searching has been begun
brute_force searching has been done
time taken by brute-force searching: 1.2946381568908691s


comparing both files for differences
total number of differences: 43

vardhanshah / search-framework-based-on-hierarchical-clustering Goto Github PK

search-framework-based-on-hierarchical-clustering's Introduction

search-framework-based-on-hierarchical-clustering's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent