
minetext's Introduction

minetext

minetext's People

Contributors

caiomelo8, gustavomts


minetext's Issues

Generating word cloud of empty clusters

I'm using minetext in my application, specifically the clustering and visualization modules. From the clustering module I use the distance and kmedoids classes; from the visualization module I use the wordcloud_visualization class.

The K-medoids algorithm, as implemented here, can generate empty clusters. When I ran the clustering and then generated the word clouds from it with generate_pure_word_cloud_from_clusters(), an empty cluster made the WordCloud class raise this error:

~minetext\visualization\wordcloud_visualization.py in generate_word_cloud(corpus, save_dir)
     38 
     39 def generate_word_cloud(corpus, save_dir):
---> 40     wc = WordCloud().generate(corpus)
     41     plt.imshow(wc, interpolation="bilinear")
     42     plt.axis("off")

C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate(self, text)
    563         self
    564         """
--> 565         return self.generate_from_text(text)
    566 
    567     def _check_generated(self):

C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_text(self, text)
    545         """
    546         words = self.process_text(text)
--> 547         self.generate_from_frequencies(words)
    548         return self
    549 

C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
    351         if len(frequencies) <= 0:
    352             raise ValueError("We need at least 1 word to plot a word cloud, "
--> 353                              "got %d." % len(frequencies))
    354         frequencies = frequencies[:self.max_words]
    355 

ValueError: We need at least 1 word to plot a word cloud, got 0.

This happens because, when the cluster is empty, the corpus built from it is empty too.

A suggestion would be to check whether the cluster has at least one non-empty document. If it doesn't, a proper error could be raised right away, instead of waiting for the WordCloud class to detect it.
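The check suggested above could look roughly like the sketch below. The function name `build_cluster_corpus` and the assumption that a cluster's documents arrive as a list of strings are mine, not taken from minetext, so treat it as an illustration only.

```python
def build_cluster_corpus(documents):
    """Join the non-empty documents of one cluster into a corpus string.

    Hypothetical helper: `documents` is assumed to be a list of strings.
    Raises ValueError early when there is nothing to plot.
    """
    non_empty = [doc.strip() for doc in documents if doc and doc.strip()]
    if not non_empty:
        raise ValueError(
            "Cannot generate a word cloud: the cluster has no non-empty documents."
        )
    return " ".join(non_empty)


# Usage: skip (or report) empty clusters before calling the word cloud code.
skipped = []
for cluster_id, docs in {3: ["example text"], 5: []}.items():
    try:
        corpus = build_cluster_corpus(docs)
    except ValueError:
        skipped.append(cluster_id)
```

A corpus that contains only stopwords might still reach WordCloud's own check, so this guard probably covers the empty-cluster case but not every possible cause of that error.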

Medoids being relocated to another cluster

When the clustering method of the K-medoids class runs, its for loop iterates over all the documents passed to it, including those that have already been randomly chosen as a cluster's medoid.

As a result, the algorithm calculates the distance between that document and every assigned medoid again, and assigns the document to the cluster with the closest medoid.
Normally that wouldn't be a problem, because each medoid is the closest medoid to itself. But when two different medoids have exactly the same content, it becomes an issue: the result depends on the order in which the distances are calculated.

Consider the example:

Medoids

[1] {id: 10, text: 'example text', cluster: 3}
[2] {id: 98, text: 'example text', cluster: 5}

Suppose you're currently deciding which cluster to assign medoid [2] to as a document, since it is also part of the documents collection.

The distance from the document {id: 98, text: 'example text', cluster: 5}, which is also a medoid, to the first medoid in the medoids list, {id: 10, text: 'example text', cluster: 3}, is 0. The distance from the document to medoid [2] (that is, to itself) is also 0, but medoid [1] is still considered the closest, because the distance to medoid [2] isn't strictly smaller.

So the document is assigned to cluster 3, the cluster of medoid [1]. It is still the medoid of cluster 5, but it now sits in a different cluster. If no other document gets assigned to cluster 5, that cluster ends up empty.

This depends on ordering, to such an extent that in some runs no medoid is relocated to another cluster at all. Still, it can cause problems.
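The tie described above can be reproduced with a small standalone sketch. It does not use minetext's actual classes: the dictionary layout follows the example in this issue, while `assign_to_closest_medoid` and `toy_distance` are hypothetical names standing in for the real assignment loop and distance function.

```python
def toy_distance(a, b):
    # Stand-in metric (assumption): 0 for identical texts, 1 otherwise.
    return 0 if a == b else 1


def assign_to_closest_medoid(documents, medoids, distance):
    """Assign every document, medoids included, to its closest medoid.

    The strict `<` comparison means the first medoid in the list wins ties.
    """
    clusters = {m["cluster"]: [] for m in medoids}
    for doc in documents:
        best, best_dist = None, None
        for medoid in medoids:
            d = distance(doc["text"], medoid["text"])
            if best_dist is None or d < best_dist:
                best, best_dist = medoid, d
        clusters[best["cluster"]].append(doc["id"])
    return clusters


medoids = [
    {"id": 10, "text": "example text", "cluster": 3},
    {"id": 98, "text": "example text", "cluster": 5},
]
documents = medoids + [{"id": 7, "text": "another text"}]

clusters = assign_to_closest_medoid(documents, medoids, toy_distance)
print(clusters)  # {3: [10, 98, 7], 5: []}
```

Document 98 lands in cluster 3 even though it is the medoid of cluster 5, and cluster 5 is left empty.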

Normalize Levenshtein distance results

Normalize the result of the Levenshtein distance calculation to a value between 0 and 1.

This is done by dividing the number of operations needed to transform the source string into the target string by the length of the longer string (when the two lengths differ).
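A minimal sketch of that normalization, assuming a plain dynamic-programming Levenshtein with unit costs. The function names are illustrative and not minetext's API, and returning 0.0 for two empty strings is my own choice for the edge case.

```python
def levenshtein(source, target):
    """Count the insertions, deletions and substitutions that turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(source, target):
    """Divide the edit distance by the length of the longer string."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as identical
    return levenshtein(source, target) / longest


print(normalized_levenshtein("kitten", "sitting"))  # 3 edits / 7 characters
```

Because the edit distance can never exceed the length of the longer string, the result always falls in the range [0, 1].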

Add the cluster medoid to the cluster documents

Currently, in the K-medoids algorithm, a cluster's medoid is not included in the cluster's documents when the medoids are first assigned.

So, whenever all the documents of a cluster are needed, including the medoid, the two have to be fetched separately.

The medoid should be added to the cluster's documents, because it is a document itself and, being the medoid, it won't change clusters midway through the clustering.
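One way this could work is sketched below: every cluster starts with its own medoid, and medoids are skipped when the remaining documents are assigned. The dictionary layout and the names `cluster_with_seeded_medoids` and `exact_match_distance` are assumptions made for illustration, not the existing minetext code.

```python
def exact_match_distance(a, b):
    # Stand-in metric (assumption): 0 for identical texts, 1 otherwise.
    return 0 if a == b else 1


def cluster_with_seeded_medoids(documents, medoids, distance):
    """Seed each cluster with its medoid, then assign the non-medoid documents."""
    medoid_ids = {m["id"] for m in medoids}
    # Each cluster's document list starts with the medoid itself.
    clusters = {m["cluster"]: [m["id"]] for m in medoids}
    for doc in documents:
        if doc["id"] in medoid_ids:
            continue  # medoids stay in their own cluster
        closest = min(medoids, key=lambda m: distance(doc["text"], m["text"]))
        clusters[closest["cluster"]].append(doc["id"])
    return clusters


medoids = [
    {"id": 10, "text": "example text", "cluster": 3},
    {"id": 98, "text": "example text", "cluster": 5},
]
documents = medoids + [{"id": 7, "text": "example text"}]

seeded = cluster_with_seeded_medoids(documents, medoids, exact_match_distance)
print(seeded)  # {3: [10, 7], 5: [98]}
```

With this layout no cluster can be empty after the first assignment, since each one holds at least its medoid, and a single lookup returns all of a cluster's documents.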
