scargc.jl's People

Contributors

marinhogabriel

scargc.jl's Issues

Implement resizing function

The resizeData() function is responsible for resizing the data when the feature count is smaller than the value of K. If that happens, the function completes the labeled data by padding it with 1s.

The function receives the labeled data, the stream data, the feature count, and the K value, and returns the updated values.
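A minimal sketch of this padding logic in Python/NumPy (the name resize_data and the argument layout are illustrative, not the actual Julia implementation):

```python
import numpy as np

def resize_data(labeled_data, stream_data, feature_count, k):
    """Pad the data with columns of 1s when there are fewer
    features than K, then return the updated values."""
    if feature_count < k:
        pad_width = k - feature_count
        labeled_data = np.hstack([labeled_data,
                                  np.ones((labeled_data.shape[0], pad_width))])
        stream_data = np.hstack([stream_data,
                                 np.ones((stream_data.shape[0], pad_width))])
        feature_count = k
    return labeled_data, stream_data, feature_count
```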

Implement function to find the initial centroids

The initial centroids can be found in two different ways:

  1. The value of K equals the class count
    In this case, the initial centroid of each class is the mean of each feature;

  2. The value of K differs from the class count
    In this case, the KMeans method is used to find the initial centroids.
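A hedged sketch of the two cases in Python/NumPy; the `kmeans` argument stands in for whatever clustering backend is used and is not part of the original code:

```python
import numpy as np

def initial_centroids(data, labels, k, kmeans=None):
    """If K equals the number of classes, each class centroid is the
    per-feature mean of that class's instances; otherwise delegate
    to a K-means routine supplied by the caller."""
    classes = np.unique(labels)
    if k == len(classes):
        return np.vstack([data[labels == c].mean(axis=0) for c in classes])
    # K differs from the class count: fall back to clustering.
    return kmeans(data, k)
```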

Create fit function

The fit() function is responsible for dividing a dataset into different arrays, each with a different purpose.
The function receives a dataset and the percentage of the data to be used as labeled values.
After the division, the function returns the labels array, the labeled data and their labels, the data used as the stream and their labels, and the number of features.
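A sketch of that split in Python/NumPy, assuming the last column of the dataset holds the label (that layout is an assumption, not stated in the issue):

```python
import numpy as np

def fit(dataset, percent_labeled):
    """Split a dataset (last column assumed to be the label) into the
    labeled portion and the portion treated as the stream."""
    n_rows, n_cols = dataset.shape
    feature_count = n_cols - 1                 # last column holds the label
    n_labeled = int(n_rows * percent_labeled / 100)
    features = dataset[:, :feature_count]
    labels = dataset[:, feature_count]
    return (np.unique(labels),                     # labels array
            features[:n_labeled], labels[:n_labeled],   # labeled data + labels
            features[n_labeled:], labels[n_labeled:],   # stream data + labels
            feature_count)
```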

Concordant label count isn't being calculated correctly

The concordant label count needs to compare the labels between the pool data and the recently calculated labels. However, it's being calculated over poolData[:, 1:sizePoolData[2] - 1], which means we're comparing labels with non-label data. The correct expression here is poolData[:, sizePoolData[2]], which means "take, from every row, the last column".
This bug has a strong impact on the algorithm's results.
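In NumPy terms (where Julia's poolData[:, sizePoolData[2]] corresponds to taking the last column), the corrected comparison looks roughly like this; the function name is illustrative:

```python
import numpy as np

def concordant_count(pool_data, new_labels):
    """Count how many pooled labels agree with the freshly computed ones.
    pool_data stores features in the leading columns and the label in
    the last column, so only that last column is compared."""
    pool_labels = pool_data[:, -1]     # last column: the stored labels
    return int(np.sum(pool_labels == new_labels))
```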

Use PyCall.jl to use scikit-learn KMeans

Apparently, the Clustering.jl package isn't returning the expected values from its KMeans model. PyCall.jl will help by calling scikit-learn's KMeans, which produces better results.

Remove function to resize data

The resizeData() function isn't necessary. The kmeans() function considers the instances to be columns, while we consider the instances to be rows. So, when we tried to apply KMeans to our data with a K value higher than the number of features, it was as if we were trying to divide the dataset into more clusters than we have instances.

Implement SCARGC for 1NN classifier

Using the functions already created, it's now time to write the main function: the SCARGC implementation using the nearest neighbor classifier.
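A much-simplified sketch of the SCARGC loop in Python/NumPy. All names are illustrative, and the pool clustering step is replaced here by assigning pooled instances to their nearest past centroid (the real algorithm runs K-means on the pool):

```python
import numpy as np

def one_nn(x, data, labels):
    """1-NN: label of the closest stored instance by Euclidean distance."""
    distances = np.linalg.norm(data - x, axis=1)
    return labels[int(np.argmin(distances))]

def scargc_1nn(labeled, labeled_y, stream, centroids, centroid_y, pool_size):
    predictions = []
    pool = []
    for x in stream:
        predictions.append(one_nn(x, labeled, labeled_y))
        pool.append(x)
        if len(pool) == pool_size:
            pool_arr = np.asarray(pool)
            # Stand-in for K-means: group pooled instances by nearest
            # past centroid, then recompute each centroid as a mean.
            assign = np.array([int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
                               for p in pool_arr])
            new_centroids = centroids.copy()
            for j in range(len(centroids)):
                members = pool_arr[assign == j]
                if len(members):
                    new_centroids[j] = members.mean(axis=0)
            # Each new centroid inherits the label of its nearest past centroid.
            new_y = np.array([centroid_y[int(np.argmin(
                np.linalg.norm(centroids - q, axis=1)))] for q in new_centroids])
            # Replace the labeled set with the freshly labeled pool.
            labeled, labeled_y = pool_arr, np.array([new_y[a] for a in assign])
            centroids, centroid_y = new_centroids, new_y
            pool = []
    return np.array(predictions)
```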

Create update function

The update function is responsible for updating the centroids, the labeled data, and the labeled data labels. This happens if the concordance between the labeled data stored in the pool and the newly calculated labels is still different from the pool size (concordance/maxPoolSize < 1 means there's still something to change because concordance hasn't reached 100% yet), or if there are fewer labels in the labeled data than in the pool.
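The trigger condition can be sketched as a small predicate; the parameter names here are illustrative, not the actual Julia ones:

```python
def should_update(concordance, max_pool_size, n_labeled_classes, n_pool_classes):
    """Update centroids/labeled data while agreement between pooled labels
    and newly computed labels is below 100%, or while the labeled data
    covers fewer labels than the pool does."""
    return (concordance / max_pool_size) < 1 or n_labeled_classes < n_pool_classes
```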

Implement nearest neighbor classifier function

The nearest neighbor function must calculate the Euclidean distance between the test instance and all of the labeled data to find the smallest distance.
The function returns the output label and the data from the neighbor with the smallest distance.
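In Python/NumPy terms, a minimal sketch of that classifier (names are illustrative):

```python
import numpy as np

def nearest_neighbor(test_instance, labeled_data, labels):
    """Return the output label and the nearest instance itself,
    chosen by the smallest Euclidean distance to the test instance."""
    distances = np.linalg.norm(labeled_data - test_instance, axis=1)
    nearest = int(np.argmin(distances))
    return labels[nearest], labeled_data[nearest]
```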

Create function to find the labels for current centroids

Throughout the algorithm, the centroids are updated. The function needs to find the label for each centroid in the current iteration.
For this, the nearest neighbor logic is used: the label is obtained from the label of the "same" centroid in the previous iteration.

(...) given the current centroids (q1, q2, ..., qk) from the most recent unlabeled clusters Ct and the past centroids (p1, p2, ..., pk) from the previously labeled clusters Ct−1, where qi and pi are n-dimensional data, each centroid pi has a label yi and each centroid qi needs a label ŷi. This label is obtained by the simple nearest neighbor algorithm. For this, each new centroid qi is associated to its closest past centroid, according to the Euclidean distance. In other words, after calculating the distance of a centroid qi to each past centroid, the label ŷi given to qi is the same as the label yi of the nearest past centroid.

Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A.: Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. SIAM International Conference on Data Mining (SDM), pp. 873-881, 2015.
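The quoted step can be sketched directly in Python/NumPy; the function name is illustrative:

```python
import numpy as np

def label_current_centroids(past_centroids, past_labels, current_centroids):
    """Assign to each current centroid q_i the label y_i of its
    nearest past centroid p_j, by Euclidean distance."""
    labels = []
    for q in current_centroids:
        distances = np.linalg.norm(past_centroids - q, axis=1)
        labels.append(past_labels[int(np.argmin(distances))])
    return np.array(labels)
```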
