gustavoaires / minetext
A Python framework for text mining. Specially Twitter text.
License: MIT License
I'm using minetext in my application, specifically the clustering and visualization modules. In the clustering module, I'm using the distance and kmedoids classes; in the visualization module, I'm using the wordcloud_visualization class.
It is well known that the K-medoids algorithm, by its nature, can produce empty clusters. So, when I ran the clustering and then generated the word cloud from it with generate_pure_word_cloud_from_clusters(), the WordCloud class raised this error whenever there was an empty cluster:
~minetext\visualization\wordcloud_visualization.py in generate_word_cloud(corpus, save_dir)
38
39 def generate_word_cloud(corpus, save_dir):
---> 40 wc = WordCloud().generate(corpus)
41 plt.imshow(wc, interpolation="bilinear")
42 plt.axis("off")
C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate(self, text)
563 self
564 """
--> 565 return self.generate_from_text(text)
566
567 def _check_generated(self):
C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_text(self, text)
545 """
546 words = self.process_text(text)
--> 547 self.generate_from_frequencies(words)
548 return self
549
C:\ProgramData\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
351 if len(frequencies) <= 0:
352 raise ValueError("We need at least 1 word to plot a word cloud, "
--> 353 "got %d." % len(frequencies))
354 frequencies = frequencies[:self.max_words]
355
ValueError: We need at least 1 word to plot a word cloud, got 0.
This happens because, when the cluster is empty, the corpus built from it is also empty.
A suggestion would be to check that the cluster has at least one non-empty document in it. If it does not, a proper error could be raised at that point, instead of waiting for the WordCloud class to detect it.
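The check suggested above could look roughly like the sketch below. The names build_cluster_corpus and EmptyClusterError are hypothetical and not part of minetext, and a cluster is assumed to be a plain list of document strings, which may not match the library's real structure.

```python
class EmptyClusterError(ValueError):
    """Hypothetical error raised when a cluster has no usable text."""


def build_cluster_corpus(cluster_id, documents):
    """Join the non-empty documents of one cluster into a single corpus.

    `documents` is assumed to be a list of strings; minetext's real
    cluster structure may differ.
    """
    # Keep only documents that contain something other than whitespace.
    non_empty = [doc for doc in documents if doc and doc.strip()]
    if not non_empty:
        raise EmptyClusterError(
            "Cluster %s has no non-empty documents; cannot build a word cloud."
            % cluster_id
        )
    return " ".join(non_empty)
```

A caller could catch this error and skip the cluster, so the failure is reported with the cluster number instead of surfacing from inside the wordcloud package.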
When the clustering method of the K-medoids class runs, the for loop iterates over all the documents passed in, including those that have already been randomly chosen as a cluster's medoid.
As such, the algorithm calculates the distance between each of those documents and all the assigned medoids again, and assigns it to the cluster with the closest medoid.
That wouldn't be a problem in itself, as a medoid would be the "closest medoid to itself". But when two different medoids have exactly the same content, it can become an issue: the result depends on the order in which their distances are calculated.
Consider the example:
[1] {id: 10, text: 'example text', cluster: 3}
[2] {id: 98, text: 'example text', cluster: 5}
Suppose you're currently deciding which cluster to assign medoid [2] to as a document, since it is also part of the document collection.
The distance from the document {id: 98, text: 'example text', cluster: 5}, which is also a medoid, to the first medoid in the medoids list, {id: 10, text: 'example text', cluster: 3}, is 0. The distance from the document to medoid [2] (that is, to itself) is also 0, but because that value isn't smaller, medoid [1] is still considered the closest medoid.
As a result, the document is assigned to cluster 3, that is, the cluster of document [1]. Furthermore, the document is still the medoid of cluster 5 while sitting in another cluster. If no other document is assigned to cluster 5, it ends up empty.
This is a matter of ordering, to such an extent that there can be cases in which no medoid is moved to another cluster. Still, it can cause problems.
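The tie described above can be reproduced with a small standalone sketch. It is not minetext's code: assign_to_closest and toy_distance are hypothetical, and the sketch assumes the assignment loop uses a strict `<` comparison, so the first medoid in the list wins a tie.

```python
def assign_to_closest(documents, medoids, distance):
    """Assign every document to the cluster of its closest medoid.

    `medoids` maps a cluster number to the medoid document. A strict `<`
    comparison is assumed, so the first medoid wins a tie.
    """
    clusters = {cluster: [] for cluster in medoids}
    for doc in documents:
        best_cluster, best_distance = None, float("inf")
        for cluster, medoid in medoids.items():
            d = distance(doc["text"], medoid["text"])
            if d < best_distance:
                best_cluster, best_distance = cluster, d
        clusters[best_cluster].append(doc["id"])
    return clusters


def toy_distance(a, b):
    # Stand-in distance for the demonstration: 0 for identical texts, 1 otherwise.
    return 0 if a == b else 1


docs = [
    {"id": 10, "text": "example text"},
    {"id": 98, "text": "example text"},
    {"id": 7, "text": "another text"},
]
medoids = {3: docs[0], 5: docs[1]}  # two medoids with identical text

clusters = assign_to_closest(docs, medoids, toy_distance)
print(clusters)  # {3: [10, 98, 7], 5: []}
```

Document 98 is the medoid of cluster 5, yet it lands in cluster 3 and cluster 5 stays empty.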
Make the Levenshtein distance calculation return values normalized between 0 and 1.
This is done by dividing the number of operations needed to transform the source string into the target string by the length of the longer string (when their lengths differ).
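A minimal sketch of that normalization follows. The function names are hypothetical and the edit distance is a plain dynamic-programming version, not minetext's own implementation. Returning 0.0 for two empty strings is an assumption made here to avoid a division by zero.

```python
def levenshtein(source, target):
    """Count the insertions, deletions and substitutions needed to turn
    `source` into `target`."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(source, target):
    """Edit distance divided by the length of the longer string."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 0.0  # assumption: two empty strings are identical
    return levenshtein(source, target) / longest
```

Because the edit distance can never exceed the length of the longer string, the result always falls between 0 (identical strings) and 1 (nothing in common).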
Currently, in the K-medoids algorithm, the cluster medoids are not included in the cluster documents when they're first assigned.
So, whenever all the documents of a cluster are needed, including the medoid, the medoid has to be fetched separately.
The medoid should be added to the cluster documents, because it is a document itself and, as such, it won't change clusters midway through the clustering.
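One possible way to do this is to seed each cluster with its medoid and skip medoids during assignment. The sketch below is only an illustration under assumed data structures: init_clusters, assign_non_medoids and the closest_cluster callback are hypothetical names, and medoids is assumed to map a cluster number to a document id.

```python
def init_clusters(medoids):
    """Create one cluster per medoid, with the medoid already inside it.

    `medoids` is assumed to map a cluster number to a document id.
    """
    return {cluster: [medoid_id] for cluster, medoid_id in medoids.items()}


def assign_non_medoids(document_ids, medoids, closest_cluster):
    """Assign only the documents that are not medoids.

    `closest_cluster` is a callable returning the cluster number of the
    closest medoid for a given document id.
    """
    clusters = init_clusters(medoids)
    medoid_ids = set(medoids.values())
    for doc_id in document_ids:
        if doc_id in medoid_ids:
            continue  # the medoid is already placed in its own cluster
        clusters[closest_cluster(doc_id)].append(doc_id)
    return clusters
```

With this layout every cluster contains at least its medoid, so the documents of a cluster can be read from a single list.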