Similarity-Check-Advanced

Checks a keyword list for near-duplicates, then generates a de-duplicated keyword list and a color-coded spreadsheet showing removed keywords, retained keywords, and their similarity.

Keyword De-duplication with Agglomerative Clustering

This Python script helps you remove extremely similar keywords from a list using the Agglomerative Clustering algorithm. It leverages the power of sentence transformers to generate embeddings for each keyword and then clusters them based on similarity. The script also generates an Excel output file that shows which keywords were removed and their similarity to the retained keyword.

Requirements

  • Python 3.6 or higher
  • Install required packages: pip install -r requirements.txt

Usage

  1. Place your list of keywords in a CSV file with no header (e.g., keywords.csv).

  2. Run the script:

    python deduplication.py

  3. Follow the prompts to set the similarity threshold and output file name.

  4. The script will generate two output files:

    • unique_keywords.csv: A list of unique keywords after removing similar ones.
    • keywords_output.xlsx (or the name you entered at the prompt): An Excel file with color-coded information on removed keywords, retained keywords, and their similarity.
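
The input and the first output file are both plain single-column CSV files with no header. The sketch below shows one way to read and write that format using only the standard library. The function names read_keywords and write_keywords are hypothetical and chosen for illustration; the script itself may load and save its files differently.

```python
import csv


def read_keywords(path):
    """Read a headerless, single-column CSV into a list of keywords.

    Blank rows and surrounding whitespace are dropped (an assumption made
    for this sketch, not confirmed behavior of the script).
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def write_keywords(path, keywords):
    """Write one keyword per row, with no header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for keyword in keywords:
            writer.writerow([keyword])
```

For example, write_keywords("unique_keywords.csv", read_keywords("keywords.csv")) would copy the cleaned list unchanged; in practice the de-duplication step sits between the two calls.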

How does it work?

  1. Read keywords from a CSV file - The script reads your keywords from the headerless CSV file into a list.

  2. Generate keyword embeddings - The script uses the sentence-transformers library (its SentenceTransformer class) to create an embedding for each keyword. Sentence transformers are pre-trained neural network models that convert text into numerical vectors (embeddings) that capture semantic information. The script uses the paraphrase-distilroberta-base-v1 model for this purpose.

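  3. Compute similarity matrix - The script calculates a pairwise similarity matrix for the keyword embeddings using the cosine similarity metric. Cosine similarity can range from -1 to 1, but for keyword embeddings the values usually fall between 0 and 1: values near 0 mean the keywords are unrelated, and values near 1 mean they are nearly identical in meaning. When choosing a threshold, stay in the higher range so that only close duplicates are removed.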

  4. Perform Agglomerative Clustering - Based on the similarity matrix, the script applies the Agglomerative Clustering algorithm to group similar keywords together. This algorithm builds a hierarchical clustering structure: each keyword starts in its own cluster, and the most similar pair of clusters is merged repeatedly until no remaining pair meets the specified similarity threshold.

  5. Set similarity threshold - You can define a similarity threshold (default is 0.8) during the execution of the script. If the similarity between two keywords is higher than this threshold, they will be considered part of the same cluster and one of them will be removed.

  6. Generate output files - The script creates two output files for you:

    • unique_keywords.csv - Contains the list of unique keywords after removing similar ones.

    • keywords_output.xlsx (or the name you entered) - An Excel file with color-coded information on removed keywords, retained keywords, and their similarity. Removed keywords are highlighted in yellow.
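
Steps 3 to 5 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script's implementation: the short hand-written vectors stand in for real sentence-transformer embeddings, average linkage is an assumed choice (the README does not say which linkage the script uses), and the names cosine_similarity, cluster_by_threshold, and deduplicate are hypothetical. Keeping the first keyword of each cluster is also an assumption.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_by_threshold(vectors, threshold=0.8):
    """Average-linkage agglomerative clustering on cosine similarity.

    Starts with one cluster per vector and repeatedly merges the pair of
    clusters with the highest mean pairwise similarity, stopping once no
    pair is more similar than `threshold`. Returns lists of indices.
    """
    n = len(vectors)
    sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for (p, c1), (q, c2) in combinations(enumerate(clusters), 2):
            avg = sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))
            if avg > best_sim:
                best_pair, best_sim = (p, q), avg
        if best_pair is None:
            break  # no pair of clusters exceeds the threshold
        p, q = best_pair  # p < q, so deleting q leaves index p valid
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def deduplicate(keywords, vectors, threshold=0.8):
    """Keep the first keyword of each cluster; report the rest as removed.

    Returns (kept, removed), where each removed entry is a tuple of
    (removed keyword, retained keyword, similarity between the two).
    """
    kept, removed = [], []
    for cluster in cluster_by_threshold(vectors, threshold):
        cluster = sorted(cluster)
        keeper = cluster[0]
        kept.append(keywords[keeper])
        for idx in cluster[1:]:
            score = cosine_similarity(vectors[idx], vectors[keeper])
            removed.append((keywords[idx], keywords[keeper], score))
    return kept, removed


# Toy vectors standing in for embeddings (illustrative values only).
keywords = ["buy shoes", "buy shoes online", "garden hose"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
kept, removed = deduplicate(keywords, vectors, threshold=0.8)
print(kept)  # ['buy shoes', 'garden hose']
```

Here the first two vectors point in almost the same direction, so they end up in one cluster and "buy shoes online" is reported as removed in favor of "buy shoes". The removed list carries the same three pieces of information the Excel output is described as showing. Raising the threshold makes merging stricter, so fewer keywords are removed.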
