
document_cluster's People

Contributors

bmabey, brandomr, fil

document_cluster's Issues

BeautifulSoup used for no reason?

synopses_wiki = open('synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)

synopses_wiki = synopses_clean_wiki

It seems the HTML has already been stripped in synopses_list_wiki.txt, so running the text through BeautifulSoup is redundant. I mention it because the BeautifulSoup pass slows things down significantly.
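A quick hedged way to test that claim: if get_text() returns each freshly loaded synopsis unchanged, the file really is plain text and the BeautifulSoup pass can be dropped (run this on synopses_wiki right after loading, before the cleaning loop reassigns it):

from bs4 import BeautifulSoup

# True means the parse is a no-op for every synopsis in the file.
already_clean = all(
    BeautifulSoup(text, 'html.parser').get_text() == text
    for text in synopses_wiki
)
print(already_clean)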

Get top n words that are nearest to cluster centroid

I cannot understand how taking the indices of the words with the highest tf-idf weights per cluster center gives you the top words nearest to the cluster centroid. Also, I want to ask: is the cluster centroid the center of each cluster?
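A minimal self-contained sketch of the idea, using a toy corpus and a recent scikit-learn (the variable names are illustrative, not necessarily the tutorial's). In k-means the centroid is indeed the center of each cluster: the mean of its document vectors. Since each document vector holds tf-idf weights per term, the centroid's largest components name the terms that characterize the cluster:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the sword fight", "jamie and johnson", "sword and shield", "johnson speaks again"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each row of cluster_centers_ is a centroid: the mean tf-idf vector of that
# cluster's documents. argsort()[:, ::-1] ranks term indices from heaviest to
# lightest, so the first n indices give the cluster's top n words.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print("Cluster %d:" % i, ", ".join(terms[ind] for ind in order_centroids[i, :3]))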

Printing Clusters (Top terms & titles)

I've followed all the steps down to the final one where you print the top terms per cluster, together with the film titles.
I'm using a slightly different dataset (blog titles and blog post content), but in essence my data is the same as yours. My data is already in a dataframe, so where you call 'synopses', I call df.Content. The one step I couldn't do was grouping the rank by clusters, as that obviously doesn't apply to me. I want ten clusters from my data.

Here, you create a dictionary:

films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])

But as I already have a dataframe, I reindexed it using clusters. The problem is that only the first ten blog post titles are being used, as this screenshot shows:

[screenshot omitted: only the first ten blog post titles appear]

As this is my first attempt at k-means (although I've been experimenting with my data for three weeks), I'm not yet clever enough to work out what's going wrong. Any ideas? Thanks in advance!
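One hedged guess at the cause (a sketch, not a confirmed diagnosis): pandas' reindex aligns on existing labels rather than relabeling, so with a default 0..n-1 index, df.reindex(clusters) keeps selecting rows 0 through 9 whenever the cluster ids run 0 through 9, which would show exactly the first ten titles. Assigning the index directly relabels without selecting (df and clusters here are the poster's own names):

import pandas as pd

# Relabel the existing rows by cluster id (no row selection happens):
df.index = pd.Index(clusters)

# whereas df = df.reindex(clusters) *selects* the rows labelled 0..9
# over and over, which matches the symptom described above.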

"'float' object has no attribute encode" when trying to get 6 most frequent words for a cluster

Hi Brandon,
Thank you so very much for this tutorial. It is helping me a lot. I'd like to ask about the following line of code:

print(' %s' % frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')

When I run it, the interpreter throws this error: "AttributeError: 'float' object has no attribute 'encode'."

I'm working with Python 2.7, by the way. My tokenized list of words looks like this: norms = [u'jamie', u'johnson', u'sword', u'middle'].

dic = {'id': ids, 'norm': norms, 'cause': causes, 'cluster': clusters}
frame = pd.DataFrame(dic, index = [clusters], columns = ['id', 'norm', 'cause'])

I tried the line without the encoding part, frame.ix[terms[ind].split(' ')].values.tolist()[0][0], but it gives me NaN for each of the 6 most frequent words. I also tried frame.ix[terms[ind].split(' ')].values.tolist()[0][1], converting to str with frame.ix[str(terms[ind]).split(' ')].values, and even import sys; reload(sys); sys.setdefaultencoding("utf-8"). These were probably pointless things to do, since frame.ix[terms[ind].split(' ')].values holds a float. I don't understand this line. Do you know, by any chance, a good pandas tutorial that explains indexing and sorting on clusters, or how to deal with this "'float' object has no attribute 'encode'" situation?

Thank you so much for your reply! And have a great day.
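A hedged sketch of what is probably happening: frame.ix does a label lookup, and (in the pandas of that era) any term missing from the frame's index comes back as NaN. NaN is a float, so calling .encode() on it raises exactly this error. Guarding the looked-up value sidesteps it:

value = frame.ix[terms[ind].split(' ')].values.tolist()[0][0]
if isinstance(value, float):  # NaN: the term is not in the frame's index
    print(' (not found)', end=',')
else:
    print(' %s' % value.encode('utf-8', 'ignore'), end=',')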

"'float' object has no attribute encode" when trying to get 6 most frequent words for a cluster

Hi Brandon,
Thank you so very much for this tutorial. It is helping me a lot. I'd like to ask you about the following line of code: print(' %s' % frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',') When I run it, the compiler throws this error: "AttributeError: 'float' object has no attribute 'encode'."

I'm working with Python2.7, by the way. My tokenized list of words looks like this: norms = [u'jamie', u'johnson', u'sword', u'middle'].
dic = {'id': ids, 'norm': norms, 'cause': causes, 'cluster': clusters}
frame = pd.DataFrame(dic, index = [clusters] , columns = ['id', 'norm', 'cause']

I tried this line <<< frame.ix[terms[ind].split(' ')].values.tolist()[0][0, end='' >> (i.e. without the encoding part), but it gives me NaN for each value of the 6 most frequent words. And <<frame.ix[terms[ind].split(' ')].values.tolist()[0][1]>>.....And converting it to str: <<<frame.ix[str(terms[ind]).split(' ')].values) and also <<<<import sys; reload(sys); sys.setdefaultencoding("utf-8")>>>. These were probably pointless things to do... since << frame.ix[terms[ind].split(' ')].values>>> is a float object. I don`t understand this line. Do you know, by any chance, a good tutorial for pandas that might explain indexing and sorting on clusters for me or how to deal with this "float object has no attribute encode" situation?

Thank you so much for your reply! And have a great day.

Unable to get the top n words nearest to the cluster centroid.

Thank you so much for posting such a detailed tutorial!

I am trying to use this to cluster news content.
I have 275,449 news items that I need to cluster. The structure of my data is pretty similar to yours: I have a content ID and a description (I don't have the ranking concept that you have in your data).

I followed all the steps in your guide, but when I tried to print the top n words nearest to each cluster centroid, it gave me weird output: the same combination of words for every cluster, wrapped in b'...' prefixes and quotes.

In fact, I tried running this on a very small test dataset of just 10 records, but ended up with the same output.

Cluster 0 words: b'good', b'weather', b'game',

Cluster 0 ContentID: 1, 6,

Cluster 1 words: b'weather', b'good', b'game',

Cluster 1 ContentID: 3, 5, 8, 10,

Cluster 2 words: b'game', b'weather', b'good',

Cluster 2 ContentID: 2, 7,

Cluster 3 words: b'weather', b'good', b'game',

Cluster 3 ContentID: 4, 9,

Could you please help me fix this?

Appreciate your help on this!
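A hedged sketch of the likely cause of the b'...' output, assuming the notebook is running under Python 3: the tutorial's print statement calls .encode('utf-8', 'ignore'), which on Python 3 returns a bytes object, and printing bytes renders the b'...' literal. Dropping the encode prints plain strings. (With only 10 records and a tiny vocabulary, the same top terms appearing in several clusters is also expected.)

word = 'weather'
print(word.encode('utf-8'))  # b'weather'  <- the output reported above
print(word)                  # weather     <- on Python 3, skip the encode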

Issue in the visualisation section when iterating through the dataframe

Hi, thanks for a great article. I noticed an issue with the following code.
I had to change this:

#add label in x,y position with the label as the film title
for i in range(len(df)):
    ax.text(df.ix[i]['x'], df.ix[i]['y'], df.ix[i]['title'], size=8)  

to this:

#add label in x,y position with the label as the film title
for i, r in df.iterrows():
    ax.text(r['x'], r['y'], r['title'], size=8)

This is because earlier in the code, you set the dataframe index to the clusters:
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
Thanks again

Why are distances calculated twice?

Hi,

Thanks for the great tutorial on document clustering. I am pretty new to text analytics and wanted to ask whether there is a reason that distances are calculated twice for hierarchical document clustering.
First, here on the tfidf_matrix using cosine distance:

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

and a second time here over dist, through the ward function, which computes euclidean distances before doing the Ward linkage:

linkage_matrix = ward(dist)

Is this something done specifically for text clustering?
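A self-contained sketch of the distinction, with toy data standing in for the tf-idf matrix: scipy's ward() accepts either a condensed distance vector or an m-by-n array of observations. A square 2-D dist falls into the second case, so ward(dist) computes euclidean distances between the rows of the cosine-distance matrix rather than linking on the cosine distances themselves:

import numpy as np
from scipy.cluster.hierarchy import ward, linkage
from scipy.spatial.distance import squareform

# Toy stand-in for 1 - cosine_similarity(tfidf_matrix): rows are unit
# vectors, so dist is symmetric with a (near-)zero diagonal.
rng = np.random.default_rng(0)
X = rng.random((5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
dist = 1 - X @ X.T

# What the tutorial does: ward() treats the square 2-D array as
# observations and computes euclidean distances between its rows.
linkage_matrix = ward(dist)

# To link on the cosine distances directly, pass the condensed form.
# Ward formally assumes euclidean input, so 'average' is the safer
# linkage method in that case.
linkage_matrix_avg = linkage(squareform(dist, checks=False), method='average')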

Thanks again

requirements.txt file?

When trying to run the notebook I ran into the following error:


---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-2-8a8f6d507950> in <module>()
     12 synopses_clean_wiki = []
     13 for text in synopses_wiki:
---> 14     text = nltk.clean_html(text)
     15     #strips html formatting
     16     text = text.replace('\xef', '')

/Users/bmabey/.virtualenvs/rbl-data/lib/python2.7/site-packages/nltk/util.pyc in clean_html(html)
    344 
    345 def clean_html(html):
--> 346     raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
    347 
    348 def clean_url(url):

NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

I'm guessing that I have a newer version of nltk installed. Could you push a requirements.txt file for this notebook so an appropriate virtual env can be created?
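Until a requirements.txt pins the old nltk, the error message itself points at a workaround; a minimal drop-in sketch (the function name just mirrors the removed API):

from bs4 import BeautifulSoup

def clean_html(html):
    # Replacement for the removed nltk.clean_html(), per the error message.
    return BeautifulSoup(html, 'html.parser').get_text()

text = clean_html('<p>An old <b>synopsis</b> with markup.</p>')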

BTW, I was trying to run this notebook so I could use my new pyLDAvis project to visualize the LDA model you create at the end.

cosine_similarity(x,y)

I attempted to apply the method to clustering tweets. I may be misunderstanding how this works, but running cosine_similarity(matrix) only worked when my data was very small (500 tweets). Once I went to 150,000 tweets, I received memory errors. Following the documentation at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html, I switched to cosine_similarity(matrix[len - 1], matrix), which I found in another example (since lost).

Is there a reason your code calls it without passing X and Y separately?
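For what it's worth, a hedged sketch of why the full call fails and one way around it: cosine_similarity(X) materializes a dense n-by-n matrix, roughly 180 GB of float64 at n = 150,000. Computing one chunk of rows against the whole matrix keeps memory bounded (the chunk size below is an arbitrary choice):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["tweet one", "another tweet", "yet more tweet text"] * 3
X = TfidfVectorizer().fit_transform(docs)

chunk = 2
for start in range(0, X.shape[0], chunk):
    block = cosine_similarity(X[start:start + chunk], X)  # chunk-by-n block
    # Consume `block` here (e.g. keep only the top-k most similar rows)
    # before the next iteration, so the full n-by-n matrix never exists.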

Am I doing something wrong here

Hi,
I just went over your document clustering tutorial and it is really amazing! Great work!
I am trying to cluster e-mails, so I have been altering the code a bit to fit my purpose.
When I print the words in each cluster, I get the same word repeated within a single cluster (cluster 0: word1, word2, word1, word3, word4, etc.), or the same word appearing in two or more clusters.

  1. Would you assume there is a problem with my code, or is this theoretically possible? Scratching my head at the moment.
  2. The dataset you worked with had titles and synopses; for me, the synopses are the contents of the emails, and I was thinking of adding a "category" of spam/ham instead of the titles to have more informative data points.
  3. For the hierarchical document clustering I changed the labels to 'terms' instead of 'titles'. Does that make sense?

IPython notebook stuff just gets in the way?

It took me a while to figure out what an ipython notebook was and how to open it and run the code.

Then I tried to create a pull request for my earlier issue, but when I opened the notebook in Jupyter it upgraded the notebook file to a more recent format, so my diff would have been huge.

I wanted to use this project in my own code, but it looks like I have to copy and paste snippets out of the notebook in order to do that?

I guess it's kind of cool to have that literate programming style, but mostly the notebook stuff just seems to get in the way and make life difficult. If you got rid of it and just had normal Python code, it seems like this project would be significantly easier to work with.
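For what it's worth, converting the notebook to a plain script is one command (a sketch; the filename here is a placeholder, not the repo's actual notebook name):

jupyter nbconvert --to script your_notebook.ipynb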

ValueError: Circular reference detected

Hello brandomr, I'm trying to use your code to do my own clustering, but when I try to use mpld3 to visualize my clusters, an error occurs: ValueError: Circular reference detected. I don't know what is happening. Is there a fault in your code, or is my mpld3 version wrong?
I'm quite anxious about this; can you help me with it?
