WordsClusterBySynonyms

Words clustering using synonyms

This class clusters words using the synonym definitions available in NLTK. Let's see an example.

import pandas as pd
import WordsClusterBySynonyms as wcbs

In this case we use a list of Italian verbs.

verbs = [
    'cogliere', 'intagliare', 'ragguagliare', 'dilazionare', 'tuffare',
    'dissipare', 'indisporre', 'complottare', 'contraddire', 'sconoscere',
    'sgocciolare', 'ridimensionare', 'ammansire', 'stuzzicare', 'rintuzzare',
    ...
    'autenticare', 'programmare', 'assassinare', 'immalinconire', 'esalare',
    'istigare', 'abiurare', 'curare', 'tranciare', 'tracciare', 'vagolare',
    'raddolcire', 'sfinire', 'confrontare', 'indispettire', 'fare', 'avere', 'vivere'
]
verbs = pd.DataFrame(verbs)
verbs.columns = ['verbs']

WordsClusterBySynonyms requires a dataframe, the name of the target column, and the language.

The first function inside WordsClusterBySynonyms is get_synonyms_pandas. It generates the synonyms of each word and stores them in a new column of the dataframe.

wc = wcbs.WordsClusterBySynonyms(verbs, 'verbs', lang='ita')
df = wc.get_synonyms_pandas()
wc.plot_hist(df)

hist_all.jpg

Using set_threshold you can repeat get_synonyms_pandas with a threshold on the number of synonyms for each word.

df = wc.set_threshold(20, df)

Using plot_hist you can check whether your list contains words with a huge number of associated synonyms. These words are a problem because, with our definition of distance, they tend to create a few huge clusters.

wc.plot_hist(df)

hist_no_higher.jpg
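To make the effect of the threshold concrete, here is a minimal sketch in plain Python. It assumes that the threshold drops every word with more synonyms than the given value; the actual behaviour of set_threshold may differ. The synonym lists are made up for illustration and apply_threshold is a hypothetical helper, not part of the class.

```python
# Made-up synonym lists standing in for the synonyms column of the dataframe.
synonyms = {
    "fare": ["creare", "produrre", "eseguire", "compiere", "realizzare"],
    "curare": ["assistere", "medicare"],
    "tracciare": ["disegnare", "segnare", "delineare"],
}


def apply_threshold(synonyms, threshold):
    """Keep only the words with at most `threshold` synonyms (assumed behaviour)."""
    return {word: syns for word, syns in synonyms.items() if len(syns) <= threshold}


# Number of synonyms per word: the quantity a histogram like plot_hist would show.
counts = {word: len(syns) for word, syns in synonyms.items()}
kept = apply_threshold(synonyms, threshold=3)

print(counts)
print(sorted(kept))
```

In this toy data "fare" has the most synonyms and is the only word removed, which is the kind of word that would otherwise pull many others into one large cluster.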

DISTANCE

Given two different words A and B with associated lists of synonyms sa and sb, A is equal to B if sa is equal to sb, and A is totally different from B if the intersection of sa and sb is empty.

The formula we used is:

formula
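As an illustration, the sketch below shows one distance with the two properties described above: it is 0 when the synonym lists are equal and 1 when they share nothing. It assumes the distance is one minus the size of the intersection divided by the min or max of the two list sizes. That form is an assumption and not necessarily the formula used by the class; the names synonym_distance, sa, sb and criteria are illustrative, and the word lists are made up.

```python
def synonym_distance(sa, sb, criteria=min):
    """Hypothetical distance between two synonym lists (assumed form)."""
    set_a, set_b = set(sa), set(sb)
    if not set_a or not set_b:
        # No synonyms on one side: treat the two words as totally different.
        return 1.0
    shared = len(set_a & set_b)
    # criteria is min or max and picks which list size normalises the overlap.
    return 1.0 - shared / criteria(len(set_a), len(set_b))


short = ["dire", "parlare"]
long = ["dire", "narrare", "esporre", "riferire"]

same = synonym_distance(short, ["parlare", "dire"])        # equal lists
disjoint = synonym_distance(["dire"], ["correre"])         # empty intersection
partial_min = synonym_distance(short, long, criteria=min)  # 1 - 1/2
partial_max = synonym_distance(short, long, criteria=max)  # 1 - 1/4

print(same, disjoint, partial_min, partial_max)
```

Under this assumed form, min is more lenient than max: a word with few synonyms that are all shared with a word with many synonyms ends up close to it, which is one reason words with very long synonym lists can attract large clusters.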

You can choose between min and max as the criteria, or pass your own definition of distance:

    def mydistance_name():
        ...
        return ...

    wc.create_distance_matrix(mydistance=mydistance_name, criteria=None, verbose=True)
matrix = wc.create_distance_matrix(criteria=min, verbose=True)
wc.plot_eps_ncluster(matrix, ntot=10, min_samples=6)

plot_eps_clusters.jpg plot_eps_not_clustered.jpg
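To show what a custom distance could look like, here is a minimal sketch that plugs a Jaccard distance into a pairwise matrix. The signature that create_distance_matrix expects from a custom distance is not shown above, so this sketch assumes a function taking two synonym lists. jaccard_distance and pairwise_matrix are hypothetical helpers and the word lists are made up.

```python
def jaccard_distance(sa, sb):
    """Jaccard distance between two synonym lists: 1 - |intersection| / |union|."""
    set_a, set_b = set(sa), set(sb)
    union = set_a | set_b
    if not union:
        # Two empty lists carry no evidence of similarity.
        return 1.0
    return 1.0 - len(set_a & set_b) / len(union)


def pairwise_matrix(synonym_lists, distance):
    """Build a symmetric matrix of distances with zeros on the diagonal."""
    n = len(synonym_lists)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(synonym_lists[i], synonym_lists[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


lists = [["dire", "parlare"], ["dire", "narrare"], ["correre"]]
matrix = pairwise_matrix(lists, jaccard_distance)

for row in matrix:
    print(row)
```

Only the upper triangle is computed and then mirrored, since a distance of this kind is symmetric.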

The function run_cluster uses the DBSCAN implemented in sklearn. You can find the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

result = wc.run_cluster(0.3, 6, matrix)
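For reference, this is a minimal sketch of running sklearn's DBSCAN directly on a distance matrix. It assumes that the matrix is passed as precomputed distances and that the first two arguments of run_cluster correspond to eps and min_samples; the internals of run_cluster are not shown here and may differ. The word names and the toy matrix are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups of three words and one isolated word.
words = ["a1", "a2", "a3", "b1", "b2", "b3", "lone"]
n = len(words)

matrix = np.full((n, n), 0.9)   # distance between words of different groups
matrix[:3, :3] = 0.1            # distances inside the first group
matrix[3:6, 3:6] = 0.1          # distances inside the second group
matrix[6, :] = 1.0              # the isolated word is far from everything
matrix[:, 6] = 1.0
np.fill_diagonal(matrix, 0.0)   # every word is at distance 0 from itself

# metric="precomputed" tells DBSCAN that the input already holds distances.
labels = DBSCAN(eps=0.3, min_samples=3, metric="precomputed").fit_predict(matrix)

n_noise = int((labels == -1).sum())
for word, label in zip(words, labels):
    print(word, label)
```

DBSCAN marks the points it cannot assign to any cluster with the label -1, so the isolated word shows up as noise while each group gets its own label.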

The plot below shows a cluster in a wordcloud-like format, where a smaller size corresponds to a lower distance.

wc.plot_cluster_k(matrix, 'contraddire')

contraddire.jpg

This class seems to work better for verbs and adjectives, but in general the quality of this method depends crucially on the quality of the underlying synonym structure.

I wrote this class together with https://github.com/aborgher
