atislabs / SCARGC.jl

A Julia implementation of Stream Classification Algorithm Guided by Clustering – SCARGC

License: MIT License
The `resizeData()` function is responsible for resizing the data when the feature count is smaller than the value of K. When that happens, the function pads the labeled data with 1s. The function receives the labeled data, the stream data, the feature count, and the K value, and returns the updated values.
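A minimal sketch of how such a function could look, assuming instances are stored as rows (the function name and signature follow the description above, but the exact implementation is an assumption):

```julia
# Hypothetical sketch of resizeData: if the data has fewer features
# than K, pad both matrices with columns of 1s so KMeans can run.
function resizeData(labeledData, streamData, features, K)
    if features < K
        extra = K - features
        labeledData = hcat(labeledData, ones(size(labeledData, 1), extra))
        streamData  = hcat(streamData,  ones(size(streamData, 1), extra))
        features = K
    end
    return labeledData, streamData, features
end
```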
The function opens a file, stores its values in a matrix, and returns this matrix to be used throughout the code.
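A simple way to do this in Julia is with the standard-library `DelimitedFiles` module; the function name and the comma delimiter here are illustrative assumptions:

```julia
using DelimitedFiles  # standard library

# Sketch: read a delimited text file into a Float64 matrix.
function loadData(path::AbstractString)
    return readdlm(path, ',', Float64)
end
```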
The penultimate step in the closed-loop update is to update the labels using the pool data and the updated centroids. These labels replace the ones previously stored in `poolData`. Then, `labeledData` and `labeledDataLabels` can also be updated with the pool values.
The initial centroids can be found in two different ways:
The value of K is the same as the classes count
In this case, the initial centroid of each class is the mean of each feature;
The value of K is different from the classes count
In this case, the KMeans method is used to find the initial centroids.
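The first case (K equal to the class count) can be sketched as below; the function name is an assumption, and the second case would delegate to a KMeans implementation such as Clustering.jl's `kmeans` (which treats instances as columns), left out here to keep the sketch self-contained:

```julia
using Statistics  # standard library, for mean

# Sketch: one initial centroid per class, computed as the per-feature
# mean of that class's rows (instances as rows).
function findCentroids(labeledData, labels, K)
    classes = unique(labels)
    if K != length(classes)
        # when K differs from the class count, SCARGC falls back to
        # KMeans (e.g. Clustering.jl's kmeans) to find the centroids
        error("K != class count: plug in a KMeans implementation here")
    end
    return vcat((mean(labeledData[labels .== c, :], dims=1) for c in classes)...)
end
```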
The `fit()` function is responsible for dividing a dataset into different arrays, each with a different purpose. The function receives a dataset and the percentage of the data to be used as labeled values. After the division, the function returns the labels array, the labeled data and their labels, the data used as stream and their labels, and the number of features.
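A possible sketch of this split, assuming the last column of the dataset holds the labels and the labeled portion is taken from the top of the dataset (both are assumptions for illustration):

```julia
# Hypothetical sketch of fit: split a dataset (last column = label)
# into a labeled portion and a stream portion.
function fit(dataset, percentTraining)
    rows, cols = size(dataset)
    features = cols - 1
    cut = round(Int, rows * percentTraining / 100)

    labeledData       = dataset[1:cut, 1:features]
    labeledDataLabels = dataset[1:cut, cols]
    streamData        = dataset[cut+1:end, 1:features]
    streamLabels      = dataset[cut+1:end, cols]
    labels            = unique(dataset[:, cols])

    return labels, labeledData, labeledDataLabels, streamData, streamLabels, features
end
```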
The concordant label count needs to compare the labels between the pool data and the recently calculated labels. However, it is being calculated over `poolData[:, 1:sizePoolData[2] - 1]`, which means we are comparing labels with non-label data. The correct expression in this case is `poolData[:, sizePoolData[2]]`, which means "take, from every row, the last column". This bug has a strong impact on the algorithm's result.
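The corrected count could look like the sketch below, assuming `poolData` stores one instance per row with the label in the last column (the function name is an assumption):

```julia
# Sketch of the corrected concordant-label count: compare the label
# column of poolData (its last column) against the new labels.
function concordantLabelCount(poolData, newLabels)
    sizePoolData = size(poolData)
    storedLabels = poolData[:, sizePoolData[2]]  # last column = labels
    return count(storedLabels .== newLabels)
end
```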
Apparently, the Clustering.jl package isn't returning the expected values from its KMeans model. PyCall will help by using scikit-learn's KMeans, which produces better results.
The basic package structure refers to items such as the folders and the `.toml` and Travis files.
The `resizeData()` function isn't necessary. The `kmeans()` function considers the instances as columns, while we consider the instances as rows. So, when we tried to apply KMeans to our data with a K value higher than the number of features, it was as if we were trying to divide the dataset into more clusters than we have instances.
Using the created functions, it's now time to build the main function: the SCARGC implementation using Nearest Neighbor as the classifier.
The update function is responsible for updating the centroids, the labeled data, and the labeled data labels. This happens if the concordance between the labeled data stored in the pool and the newly calculated labels is still below the pool size (concordance/maxPoolSize < 1 means there is still something to change, because it hasn't reached 100% yet) or if there are fewer labels in the labeled data than in the pool.
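The trigger condition described above can be sketched as a small predicate (the names are assumptions based on this description):

```julia
# Sketch of the update trigger: fire while concordance is below 100%
# of the pool, or while the labeled data has fewer labels than the pool.
function shouldUpdate(concordance, maxPoolSize, labeledLabelCount)
    return concordance / maxPoolSize < 1 || labeledLabelCount < maxPoolSize
end
```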
The nearest neighbor function must calculate the Euclidean distance between the test instance and all of the labeled data to find the smallest distance. The function returns the output label and the data of the neighbor with the smallest distance.
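A self-contained sketch of that 1-nearest-neighbor step, assuming instances as rows (the function name is an assumption):

```julia
using LinearAlgebra  # standard library, for norm

# Sketch of the nearest-neighbor classifier: return the label and the
# instance of the labeled example closest (Euclidean) to testInstance.
function knnClassification(labeledData, labels, testInstance)
    bestIdx = argmin([norm(labeledData[i, :] .- testInstance)
                      for i in 1:size(labeledData, 1)])
    return labels[bestIdx], labeledData[bestIdx, :]
end
```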
Synthetic data are going to be helpful in tests and experiments.
Throughout the algorithm, the centroids are updated. The function needs to find the label for each centroid in the current iteration. For this, the nearest neighbor logic is used: the label is obtained from the label of the "same" centroid in the previous iteration.
(...) given the current centroids (q1, q2, ..., qk) from the most recent unlabeled clusters Ct and the past centroids (p1, p2, ..., pk) from the previously labeled clusters Ct−1, where qi and pi are n-dimensional data, each centroid pi has a label yi and each centroid qi needs a label ŷi. This label is obtained by the simple nearest neighbor algorithm. For this, each new centroid qi is associated to its closest past centroid, according to the Euclidean distance. In other words, after calculating the distance of a centroid qi to each past centroid, the label ŷi given to qi is the same as the label yi of the nearest past centroid.
Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A.: Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. SIAM International Conference on Data Mining (SDM), pp. 873-881, 2015.
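The centroid-labeling step quoted above can be sketched as follows, assuming centroids are stored as rows (the function name is an assumption based on the paper's description):

```julia
using LinearAlgebra  # standard library, for norm

# Sketch: give each current centroid the label of its closest past
# centroid, using the Euclidean distance.
function labelCurrentCentroids(pastCentroids, pastLabels, currentCentroids)
    newLabels = similar(pastLabels, size(currentCentroids, 1))
    for i in 1:size(currentCentroids, 1)
        dists = [norm(currentCentroids[i, :] .- pastCentroids[j, :])
                 for j in 1:size(pastCentroids, 1)]
        newLabels[i] = pastLabels[argmin(dists)]
    end
    return newLabels
end
```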