
document_cluster's People

Contributors

bmabey, brandomr, fil

document_cluster's Issues

BeautifulSoup used for no reason?

synopses_wiki = open('synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)

synopses_wiki = synopses_clean_wiki

It seems the HTML has already been stripped in synopses_list_wiki.txt, so running the text through BeautifulSoup is redundant. I mention it because the BeautifulSoup pass slows things down significantly.
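A quick hedged way to test that claim: if get_text() returns each freshly loaded synopsis unchanged, the file really is plain text and the BeautifulSoup pass can be dropped (run this on synopses_wiki right after loading, before the cleaning loop reassigns it):

from bs4 import BeautifulSoup

# True means the parse is a no-op for every synopsis in the file.
already_clean = all(
    BeautifulSoup(text, 'html.parser').get_text() == text
    for text in synopses_wiki
)
print(already_clean)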

Get top n words that are nearest to cluster centroid

I cannot understand how taking the indices of the words with the highest tf-idf weights per cluster center gives you the top words nearest to the cluster centroid. Also, I want to ask: is the cluster centroid the center of each cluster?
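A minimal self-contained sketch of the idea, using a toy corpus and a recent scikit-learn (the variable names are illustrative, not necessarily the tutorial's). In k-means the centroid is indeed the center of each cluster: the mean of its document vectors. Since each document vector holds tf-idf weights per term, the centroid's largest components name the terms that characterize the cluster:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the sword fight", "jamie and johnson", "sword and shield", "johnson speaks again"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each row of cluster_centers_ is a centroid: the mean tf-idf vector of that
# cluster's documents. argsort()[:, ::-1] ranks term indices from heaviest to
# lightest, so the first n indices give the cluster's top n words.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print("Cluster %d:" % i, ", ".join(terms[ind] for ind in order_centroids[i, :3]))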

Printing Clusters (Top terms & titles)

I've followed all the steps down to the final one where you print the top terms per cluster, together with the film titles.
I'm using a slightly different dataset (blog titles and blog post content), but in essence my data is the same as yours. My data is already in a dataframe, so where you call 'synopses', I call df.Content. The one step I couldn't do was grouping the rank by clusters, as that obviously doesn't apply to me. I want ten clusters from my data.

Here, you create a dictionary:

films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])

But as I already have a dataframe, I reindexed it using clusters. The problem is that only the first ten blog post titles are being used, as this screenshot shows:

[screenshot omitted: only the first ten blog post titles appear]

As this is my first attempt at k-means (although I've been experimenting with my data for three weeks), I'm not yet clever enough to work out what's going wrong. Any ideas? Thanks in advance!
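One hedged guess at the cause (a sketch, not a confirmed diagnosis): pandas' reindex aligns on existing labels rather than relabeling, so with a default 0..n-1 index, df.reindex(clusters) keeps selecting rows 0 through 9 whenever the cluster ids run 0 through 9, which would show exactly the first ten titles. Assigning the index directly relabels without selecting (df and clusters here are the poster's own names):

import pandas as pd

# Relabel the existing rows by cluster id (no row selection happens):
df.index = pd.Index(clusters)

# whereas df = df.reindex(clusters) *selects* the rows labelled 0..9
# over and over, which matches the symptom described above.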

"'float' object has no attribute encode" when trying to get 6 most frequent words for a cluster

Hi Brandon,
Thank you so very much for this tutorial. It is helping me a lot. I'd like to ask about the following line of code:

print(' %s' % frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')

When I run it, the interpreter throws this error: "AttributeError: 'float' object has no attribute 'encode'."

I'm working with Python 2.7, by the way. My tokenized list of words looks like this: norms = [u'jamie', u'johnson', u'sword', u'middle'].

dic = {'id': ids, 'norm': norms, 'cause': causes, 'cluster': clusters}
frame = pd.DataFrame(dic, index = [clusters], columns = ['id', 'norm', 'cause'])

I tried the line without the encoding part, frame.ix[terms[ind].split(' ')].values.tolist()[0][0], but it gives me NaN for each of the 6 most frequent words. I also tried frame.ix[terms[ind].split(' ')].values.tolist()[0][1], converting to str with frame.ix[str(terms[ind]).split(' ')].values, and even import sys; reload(sys); sys.setdefaultencoding("utf-8"). These were probably pointless things to do, since frame.ix[terms[ind].split(' ')].values holds a float. I don't understand this line. Do you know, by any chance, a good pandas tutorial that explains indexing and sorting on clusters, or how to deal with this "'float' object has no attribute 'encode'" situation?

Thank you so much for your reply! And have a great day.
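A hedged sketch of what is probably happening: frame.ix does a label lookup, and (in the pandas of that era) any term missing from the frame's index comes back as NaN. NaN is a float, so calling .encode() on it raises exactly this error. Guarding the looked-up value sidesteps it:

value = frame.ix[terms[ind].split(' ')].values.tolist()[0][0]
if isinstance(value, float):  # NaN: the term is not in the frame's index
    print(' (not found)', end=',')
else:
    print(' %s' % value.encode('utf-8', 'ignore'), end=',')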

"'float' object has no attribute encode" when trying to get 6 most frequent words for a cluster

Hi Brandon,
Thank you so very much for this tutorial. It is helping me a lot. I'd like to ask you about the following line of code: print(' %s' % frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',') When I run it, the compiler throws this error: "AttributeError: 'float' object has no attribute 'encode'."

I'm working with Python2.7, by the way. My tokenized list of words looks like this: norms = [u'jamie', u'johnson', u'sword', u'middle'].
dic = {'id': ids, 'norm': norms, 'cause': causes, 'cluster': clusters}
frame = pd.DataFrame(dic, index = [clusters] , columns = ['id', 'norm', 'cause']

I tried this line <<< frame.ix[terms[ind].split(' ')].values.tolist()[0][0, end='' >> (i.e. without the encoding part), but it gives me NaN for each value of the 6 most frequent words. And <<frame.ix[terms[ind].split(' ')].values.tolist()[0][1]>>.....And converting it to str: <<<frame.ix[str(terms[ind]).split(' ')].values) and also <<<<import sys; reload(sys); sys.setdefaultencoding("utf-8")>>>. These were probably pointless things to do... since << frame.ix[terms[ind].split(' ')].values>>> is a float object. I don`t understand this line. Do you know, by any chance, a good tutorial for pandas that might explain indexing and sorting on clusters for me or how to deal with this "float object has no attribute encode" situation?

Thank you so much for your reply! And have a great day.

Unable to get the top n words nearest to the cluster centroid.

Thank you so much for posting such a detailed tutorial!

I am trying to use this to cluster news content.
I have 275,449 news items that I need to cluster. The structure of my data is pretty similar to yours: I have a content ID and a description (I don't have the ranking concept that you have in your data).

I followed all the steps in your guide, but when I tried to print the top n words nearest to each cluster centroid, it gave me weird output: the same combination of words for every cluster, wrapped in b'...' prefixes and quotes.

In fact, I tried running this on a very small test dataset of just 10 records, but ended up with the same output.

Cluster 0 words: b'good', b'weather', b'game',

Cluster 0 ContentID: 1, 6,

Cluster 1 words: b'weather', b'good', b'game',

Cluster 1 ContentID: 3, 5, 8, 10,

Cluster 2 words: b'game', b'weather', b'good',

Cluster 2 ContentID: 2, 7,

Cluster 3 words: b'weather', b'good', b'game',

Cluster 3 ContentID: 4, 9,

Could you please help me fix this?

Appreciate your help on this!
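A hedged sketch of the likely cause of the b'...' output, assuming the notebook is running under Python 3: the tutorial's print statement calls .encode('utf-8', 'ignore'), which on Python 3 returns a bytes object, and printing bytes renders the b'...' literal. Dropping the encode prints plain strings. (With only 10 records and a tiny vocabulary, the same top terms appearing in several clusters is also expected.)

word = 'weather'
print(word.encode('utf-8'))  # b'weather'  <- the output reported above
print(word)                  # weather     <- on Python 3, skip the encode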

Issue in the visualisation section when iterating through the dataframe

Hi, thanks for a great article. I noticed an issue with the following code.
I had to change this:

#add label in x,y position with the label as the film title
for i in range(len(df)):
    ax.text(df.ix[i]['x'], df.ix[i]['y'], df.ix[i]['title'], size=8)  

to this:

#add label in x,y position with the label as the film title
for i, r in df.iterrows():
    ax.text(r['x'], r['y'], r['title'], size=8)

This is because earlier in the code, you set the dataframe index to the clusters:
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
Thanks again

Why are distances calculated twice?

Hi,

Thanks for the great tutorial on document clustering. I am pretty new to text analytics and wanted to ask whether there is a reason that distances are calculated twice for hierarchical document clustering.
First, here on the tfidf_matrix using cosine distance:

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

and a second time here over dist, through the ward function, which computes euclidean distances before doing the Ward linkage:

linkage_matrix = ward(dist)

Is this something done specifically for text clustering?
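A self-contained sketch of the distinction, with toy data standing in for the tf-idf matrix: scipy's ward() accepts either a condensed distance vector or an m-by-n array of observations. A square 2-D dist falls into the second case, so ward(dist) computes euclidean distances between the rows of the cosine-distance matrix rather than linking on the cosine distances themselves:

import numpy as np
from scipy.cluster.hierarchy import ward, linkage
from scipy.spatial.distance import squareform

# Toy stand-in for 1 - cosine_similarity(tfidf_matrix): rows are unit
# vectors, so dist is symmetric with a (near-)zero diagonal.
rng = np.random.default_rng(0)
X = rng.random((5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
dist = 1 - X @ X.T

# What the tutorial does: ward() treats the square 2-D array as
# observations and computes euclidean distances between its rows.
linkage_matrix = ward(dist)

# To link on the cosine distances directly, pass the condensed form.
# Ward formally assumes euclidean input, so 'average' is the safer
# linkage method in that case.
linkage_matrix_avg = linkage(squareform(dist, checks=False), method='average')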

Thanks again

requirements.txt file?

When trying to run the notebook I ran into the following error:


---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-2-8a8f6d507950> in <module>()
     12 synopses_clean_wiki = []
     13 for text in synopses_wiki:
---> 14     text = nltk.clean_html(text)
     15     #strips html formatting
     16     text = text.replace('\xef', '')

/Users/bmabey/.virtualenvs/rbl-data/lib/python2.7/site-packages/nltk/util.pyc in clean_html(html)
    344 
    345 def clean_html(html):
--> 346     raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
    347 
    348 def clean_url(url):

NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

I'm guessing that I have a newer version of nltk installed. Could you push a requirements.txt file for this notebook so an appropriate virtual env can be created?
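Until a requirements.txt pins the old nltk, the error message itself points at a workaround; a minimal drop-in sketch (the function name just mirrors the removed API):

from bs4 import BeautifulSoup

def clean_html(html):
    # Replacement for the removed nltk.clean_html(), per the error message.
    return BeautifulSoup(html, 'html.parser').get_text()

text = clean_html('<p>An old <b>synopsis</b> with markup.</p>')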

BTW, I was trying to run this notebook so I could use my new pyLDAvis project to visualize the LDA model you create at the end.

cosine_similarity(x,y)

I attempted to apply the method to clustering tweets. I may be misunderstanding how this works, but running cosine_similarity(matrix) only worked when my data was very small (500 tweets). Once I went to 150,000 tweets, I received memory errors. Following the documentation at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html, I switched to cosine_similarity(matrix[len - 1], matrix), which I found in another example (since lost).

Is there a reason your code calls it without passing X and Y separately?
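For what it's worth, a hedged sketch of why the full call fails and one way around it: cosine_similarity(X) materializes a dense n-by-n matrix, roughly 180 GB of float64 at n = 150,000. Computing one chunk of rows against the whole matrix keeps memory bounded (the chunk size below is an arbitrary choice):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["tweet one", "another tweet", "yet more tweet text"] * 3
X = TfidfVectorizer().fit_transform(docs)

chunk = 2
for start in range(0, X.shape[0], chunk):
    block = cosine_similarity(X[start:start + chunk], X)  # chunk-by-n block
    # Consume `block` here (e.g. keep only the top-k most similar rows)
    # before the next iteration, so the full n-by-n matrix never exists.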

Am I doing something wrong here

Hi,
I just went over your document clustering tutorial and it is really amazing! Great work!
I am trying to cluster e-mails, so I have been altering the code a bit to fit my purpose.
When I print the words in each cluster, I get the same word repeated within a single cluster (cluster 0: word1, word2, word1, word3, word4, etc.), or the same word appearing in two or more clusters.

  1. Would you assume there is a problem with my code, or is this theoretically possible? Scratching my head at the moment.
  2. The dataset you worked with had titles and synopses; for me, the synopses are the contents of the emails, and I was thinking of adding a "category" of spam/ham instead of the titles to have more informative data points.
  3. For the hierarchical document clustering I changed the labels to 'terms' instead of 'titles'. Does that make sense?

IPython notebook stuff just gets in the way?

It took me a while to figure out what an ipython notebook was and how to open it and run the code.

Then I tried to create a pull request for my earlier issue, but when I opened the notebook in Jupyter it upgraded the notebook file to a more recent format, so my diff would have been huge.

I wanted to use this project in my own code, but it looks like I have to copy and paste snippets out of the notebook in order to do that?

I guess it's kind of cool to have that literate programming style, but mostly the notebook stuff just seems to get in the way and make life difficult. If you got rid of it and just had normal Python code, it seems like this project would be significantly easier to work with.
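For what it's worth, converting the notebook to a plain script is one command (a sketch; the filename here is a placeholder, not the repo's actual notebook name):

jupyter nbconvert --to script your_notebook.ipynb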

ValueError: Circular reference detected

Hello brandomr, I'm trying to use your code to do my own clustering, but when I try to use mpld3 to visualize my clusters, an error occurs: ValueError: Circular reference detected. I don't know what is happening. Is there a fault in your code, or is my mpld3 version wrong?
I'm quite anxious about this; can you help me with it?
