gustavoaires / minetext


A Python framework for text mining. Specially Twitter text.

License: MIT License

Python 100.00%

data-science data-analysis data-mining text-mining python naive-bayes dbscan twitter natural-language-processing

minetext's Introduction

minetext

minetext's People

Contributors

Stargazers

Watchers

Forkers

acrale gustavomts rsmahabir

minetext's Issues

Generating word cloud of empty clusters

I'm using minetext in my application, specifically the clustering and visualization modules. In the clustering module, I'm using the distance and kmedoids classes; in the visualization module, I'm using the wordcloud_visualization class.

The K-medoids algorithm can produce empty clusters. When I ran the clustering and then generated the word clouds from it with generate_pure_word_cloud_from_clusters(), any empty cluster made the WordCloud class raise this error:

~minetext\visualization\wordcloud_visualization.py in generate_word_cloud(corpus, save_dir)
     38 
     39 def generate_word_cloud(corpus, save_dir):
---> 40     wc = WordCloud().generate(corpus)
     41     plt.imshow(wc, interpolation="bilinear")
     42     plt.axis("off")

C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate(self, text)
    563         self
    564         """
--> 565         return self.generate_from_text(text)
    566 
    567     def _check_generated(self):

C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_text(self, text)
    545         """
    546         words = self.process_text(text)
--> 547         self.generate_from_frequencies(words)
    548         return self
    549 

C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
    351         if len(frequencies) <= 0:
    352             raise ValueError("We need at least 1 word to plot a word cloud, "
--> 353                              "got %d." % len(frequencies))
    354         frequencies = frequencies[:self.max_words]
    355 

ValueError: We need at least 1 word to plot a word cloud, got 0.

This happens because the cluster is empty, so the corpus built from it is empty too.

A suggestion would be to check that the cluster has at least one non-empty document in it. If it does not, minetext could raise a proper error itself instead of waiting for the WordCloud class to detect the problem.
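The suggested check could look roughly like the sketch below. The function name build_cluster_corpus and the assumption that each document is a dict with a "text" key are hypothetical; they are not taken from the minetext source.

```python
def build_cluster_corpus(documents):
    """Join the non-empty texts of one cluster into a single corpus string.

    Assumption: each document is a dict with a "text" key. The real
    minetext document structure may differ.
    """
    texts = []
    for doc in documents:
        text = (doc.get("text") or "").strip()
        if text:
            texts.append(text)
    if not texts:
        # Fail early with a message that names the actual cause.
        raise ValueError(
            "Cannot generate a word cloud: the cluster has no non-empty documents."
        )
    return " ".join(texts)
```

The returned string could then be passed to generate_word_cloud(corpus, save_dir), and a caller that prefers to skip empty clusters could catch the ValueError.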

Medoids being relocated to another cluster

When the clustering method of the K-medoids class runs, its for loop iterates over all the documents passed in, including those that have already been randomly chosen as a cluster's medoid.

As a result, the algorithm recalculates the distance between each of those documents and every medoid, and assigns the document to the cluster with the closest medoid.
Normally that wouldn't be a problem, since a medoid is always the closest medoid to itself. But when two different medoids have exactly the same content, it becomes an issue: the outcome depends on the order in which the distances are calculated.

Consider the example:

Medoids

[1] {id: 10, text: 'example text', cluster: 3}
[2] {id: 98, text: 'example text', cluster: 5}

Suppose the algorithm is currently deciding which cluster to assign medoid [2] to as a document, since it is part of the document collection.

The distance from the document {id: 98, text: 'example text', cluster: 5}, which is also a medoid, to the first medoid in the list, {id: 10, text: 'example text', cluster: 3}, is 0. The distance from the document to medoid [2] (that is, to itself) is also 0, but because that value isn't strictly smaller, medoid [1] remains the closest medoid.

The document is therefore assigned to cluster 3, the cluster of medoid [1]. It is still the medoid of cluster 5, but it now sits in another cluster. No other document ends up assigned to cluster 5, which leaves it empty.

This depends on ordering, to such an extent that there are cases in which no medoid is relocated to another cluster. Still, it can cause problems.
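One way to avoid the tie is to pin every medoid to its own cluster before assigning the remaining documents. The sketch below assumes documents shaped like the example above; the function assign_clusters and the distance callback are hypothetical and do not reflect the actual minetext implementation.

```python
def assign_clusters(documents, medoids, distance):
    """Return a {document id: cluster id} mapping.

    Medoids are pinned to their own cluster first, so a tie with an
    identical medoid earlier in the list cannot move them.
    """
    assignment = {m["id"]: m["cluster"] for m in medoids}
    for doc in documents:
        if doc["id"] in assignment:
            continue  # this document is a medoid; leave it in its own cluster
        # min() keeps the first medoid on ties, which is fine for non-medoids.
        closest = min(medoids, key=lambda m: distance(doc["text"], m["text"]))
        assignment[doc["id"]] = closest["cluster"]
    return assignment


medoids = [
    {"id": 10, "text": "example text", "cluster": 3},
    {"id": 98, "text": "example text", "cluster": 5},
]
documents = medoids + [{"id": 7, "text": "another text"}]

# Toy distance for illustration only: 0 for identical texts, 1 otherwise.
result = assign_clusters(documents, medoids, lambda a, b: 0 if a == b else 1)
```

With this approach, document 98 stays in cluster 5 even though medoid [1] is at distance 0 and comes first in the list.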

Normalize Levenshtein distance results

Normalize the values returned by the Levenshtein distance calculation so that they fall between 0 and 1.

This is done by dividing the number of operations needed to transform the source string into the target string by the length of the longer string (when the lengths differ).
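A minimal sketch of that normalization follows. The function names are hypothetical and the edit distance is a plain dynamic-programming version, not the one in minetext's distance module. Returning 0.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
def levenshtein(source, target):
    """Count the insertions, deletions and substitutions needed to turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(source, target):
    """Divide the edit distance by the length of the longer string, giving a value in [0, 1]."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return levenshtein(source, target) / longest
```

For example, "kitten" and "sitting" need 3 operations and the longer string has 7 characters, so the normalized distance is 3/7.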

Add the cluster medoid to the cluster documents

Currently, the K-medoids algorithm does not include a cluster's medoid in the cluster's documents when the medoids are first assigned.

So anyone who needs all of a cluster's documents, including the medoid, has to fetch the medoid and the documents separately.

The medoid should be added to the cluster's documents, because it is a document itself and, as such, it won't change clusters midway through the clustering.
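As a rough illustration, each cluster's document list could be seeded with its medoid when the clusters are created. The init_clusters helper and the dict-based document shape are assumptions for this sketch, not the structure minetext actually uses.

```python
def init_clusters(medoids):
    """Return {cluster id: [documents]} with each medoid already in its own cluster."""
    return {medoid["cluster"]: [medoid] for medoid in medoids}


medoids = [
    {"id": 10, "text": "first medoid", "cluster": 3},
    {"id": 98, "text": "second medoid", "cluster": 5},
]
clusters = init_clusters(medoids)

# Other documents are appended as they are assigned.
clusters[3].append({"id": 7, "text": "some document", "cluster": 3})
```

Reading clusters[3] then returns the medoid together with the other documents in one list.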
