atislabs / SCARGC.jl

A Julia implementation of Stream Classification Algorithm Guided by Clustering – SCARGC

License: MIT License
The `resizeData()` function is responsible for resizing the data when the feature count is smaller than the value of K. When that happens, the function pads the labeled data with 1s. The function receives the labeled data, the stream data, the feature count, and the K value, and returns the updated values.
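A minimal sketch of how such a function could look, assuming instances are stored as rows (the function name and signature follow the description above, but the exact implementation is an assumption):

```julia
# Hypothetical sketch of resizeData: if the data has fewer features
# than K, pad both matrices with columns of 1s so KMeans can run.
function resizeData(labeledData, streamData, features, K)
    if features < K
        extra = K - features
        labeledData = hcat(labeledData, ones(size(labeledData, 1), extra))
        streamData  = hcat(streamData,  ones(size(streamData, 1), extra))
        features = K
    end
    return labeledData, streamData, features
end
```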
The function opens a file, stores its values in a matrix, and returns this matrix to be used throughout the code.
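A simple way to do this in Julia is with the standard-library `DelimitedFiles` module; the function name and the comma delimiter here are illustrative assumptions:

```julia
using DelimitedFiles  # standard library

# Sketch: read a delimited text file into a Float64 matrix.
function loadData(path::AbstractString)
    return readdlm(path, ',', Float64)
end
```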
The penultimate step in the closed-loop update is to update the labels using the pool data and the updated centroids. These labels replace the ones previously stored in `poolData`. Then, `labeledData` and `labeledDataLabels` can also be updated with the pool values.
The initial centroids can be found in two different ways:
The value of K is the same as the classes count
In this case, the initial centroid of each class is the mean of each feature;
The value of K is different from the classes count
In this case, the KMeans method is used to find the initial centroids.
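The first case (K equal to the class count) can be sketched as below; the function name is an assumption, and the second case would delegate to a KMeans implementation such as Clustering.jl's `kmeans` (which treats instances as columns), left out here to keep the sketch self-contained:

```julia
using Statistics  # standard library, for mean

# Sketch: one initial centroid per class, computed as the per-feature
# mean of that class's rows (instances as rows).
function findCentroids(labeledData, labels, K)
    classes = unique(labels)
    if K != length(classes)
        # when K differs from the class count, SCARGC falls back to
        # KMeans (e.g. Clustering.jl's kmeans) to find the centroids
        error("K != class count: plug in a KMeans implementation here")
    end
    return vcat((mean(labeledData[labels .== c, :], dims=1) for c in classes)...)
end
```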
The `fit()` function is responsible for dividing a dataset into different arrays, each with a different purpose. The function receives a dataset and the percentage of the data to be used as labeled values. After the division, the function returns the labels array, the labeled data and their labels, the data used as stream and their labels, and the number of features.
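A possible sketch of this split, assuming the last column of the dataset holds the labels and the labeled portion is taken from the top of the dataset (both are assumptions for illustration):

```julia
# Hypothetical sketch of fit: split a dataset (last column = label)
# into a labeled portion and a stream portion.
function fit(dataset, percentTraining)
    rows, cols = size(dataset)
    features = cols - 1
    cut = round(Int, rows * percentTraining / 100)

    labeledData       = dataset[1:cut, 1:features]
    labeledDataLabels = dataset[1:cut, cols]
    streamData        = dataset[cut+1:end, 1:features]
    streamLabels      = dataset[cut+1:end, cols]
    labels            = unique(dataset[:, cols])

    return labels, labeledData, labeledDataLabels, streamData, streamLabels, features
end
```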
The concordant label count needs to compare the labels between the pool data and the recently calculated labels. However, it is being calculated over `poolData[:, 1:sizePoolData[2] - 1]`, which means we are comparing labels with non-label data. The correct expression in this case is `poolData[:, sizePoolData[2]]`, which means "take, from every row, the last column". This bug has a strong impact on the algorithm's result.
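The corrected count could look like the sketch below, assuming `poolData` stores one instance per row with the label in the last column (the function name is an assumption):

```julia
# Sketch of the corrected concordant-label count: compare the label
# column of poolData (its last column) against the new labels.
function concordantLabelCount(poolData, newLabels)
    sizePoolData = size(poolData)
    storedLabels = poolData[:, sizePoolData[2]]  # last column = labels
    return count(storedLabels .== newLabels)
end
```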
Apparently, the Clustering.jl package isn't returning the expected values from its KMeans model. PyCall will help by using scikit-learn's KMeans, which produces better results.
The basic package structure refers to items such as the folders and the `.toml` and Travis files.
The `resizeData()` function isn't necessary. The `kmeans()` function considers the instances as columns, while we consider the instances as rows. So, when we tried to apply KMeans to our data with a K value higher than the number of features, it was as if we were trying to divide the dataset into more clusters than we have instances.
Using the created functions, it's now time to build the main function: the SCARGC implementation using Nearest Neighbor as the classifier.
The update function is responsible for updating the centroids, the labeled data, and the labeled data labels. This happens if the concordance between the labeled data stored in the pool and the newly calculated labels is still below the pool size (concordance/maxPoolSize < 1 means there is still something to change, because it hasn't reached 100% yet) or if there are fewer labels in the labeled data than in the pool.
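The trigger condition described above can be sketched as a small predicate (the names are assumptions based on this description):

```julia
# Sketch of the update trigger: fire while concordance is below 100%
# of the pool, or while the labeled data has fewer labels than the pool.
function shouldUpdate(concordance, maxPoolSize, labeledLabelCount)
    return concordance / maxPoolSize < 1 || labeledLabelCount < maxPoolSize
end
```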
The nearest neighbor function must calculate the Euclidean distance between the test instance and all of the labeled data to find the smallest distance. The function returns the output label and the data of the neighbor with the smallest distance.
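A self-contained sketch of that 1-nearest-neighbor step, assuming instances as rows (the function name is an assumption):

```julia
using LinearAlgebra  # standard library, for norm

# Sketch of the nearest-neighbor classifier: return the label and the
# instance of the labeled example closest (Euclidean) to testInstance.
function knnClassification(labeledData, labels, testInstance)
    bestIdx = argmin([norm(labeledData[i, :] .- testInstance)
                      for i in 1:size(labeledData, 1)])
    return labels[bestIdx], labeledData[bestIdx, :]
end
```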
Synthetic data are going to be helpful in tests and experiments.
Throughout the algorithm, the centroids are updated. The function needs to find the label for each centroid in the current iteration. For this, the nearest neighbor logic is used: the label is obtained from the label of the "same" centroid in the previous iteration.
(...) given the current centroids (q1, q2, ..., qk) from the most recent unlabeled clusters Ct and the past centroids (p1, p2, ..., pk) from the previously labeled clusters Ct−1, where qi and pi are n-dimensional data, each centroid pi has a label yi and each centroid qi needs a label ŷi. This label is obtained by the simple nearest neighbor algorithm. For this, each new centroid qi is associated to its closest past centroid, according to the Euclidean distance. In other words, after calculating the distance of a centroid qi to each past centroid, the label ŷi given to qi is the same as the label yi of the nearest past centroid.
Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A.: Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. SIAM International Conference on Data Mining (SDM), pp. 873-881, 2015.
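The centroid-labeling step quoted above can be sketched as follows, assuming centroids are stored as rows (the function name is an assumption based on the paper's description):

```julia
using LinearAlgebra  # standard library, for norm

# Sketch: give each current centroid the label of its closest past
# centroid, using the Euclidean distance.
function labelCurrentCentroids(pastCentroids, pastLabels, currentCentroids)
    newLabels = similar(pastLabels, size(currentCentroids, 1))
    for i in 1:size(currentCentroids, 1)
        dists = [norm(currentCentroids[i, :] .- pastCentroids[j, :])
                 for j in 1:size(pastCentroids, 1)]
        newLabels[i] = pastLabels[argmin(dists)]
    end
    return newLabels
end
```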