Relevant topic modeling

This repository contains scripts for similarity search and topic modeling.

Preparation

Creating and activating virtual environment (optional)

Creating virtual environment

python -m venv venv

Activating virtual environment

Windows

venv\Scripts\activate.bat

Linux

source venv/bin/activate

Requirements installation

pip install -r requirenments.txt

Full process

Example:

python scripts/process.py example/test.csv example/queries.txt -o data/
process.py usage
usage: process.py [-h] [-o OUTPUT] [-l LANG] [-s] [-m MODEL] [-t THRESHOLD] [-sm SPACY_MODEL] [-gpt GPT_MODEL]
                  input queries

positional arguments:
  input                 path to input file
  queries               path to file with regex queries for relevant sentences search

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to directory where output files will be stored (default: ../data/)
  -l LANG, --lang LANG  language of documents (default: en)
  -s, --smart           use smart paragraphisation
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
  -t THRESHOLD, --threshold THRESHOLD
                        threshold to determine relevant sentences (default: 0.5)
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spacy model for lemmatization (default: en_core_web_lg)
  -gpt GPT_MODEL, --gpt_model GPT_MODEL
                        model for topic representation and summary (default: None)
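
The per-step scripts documented in the sections that follow read and write the same data directory, which suggests the full process can also be run as three separate steps. The sketch below only assembles those three commands from the examples in this README; whether `process.py` chains them in exactly this way is an assumption, and the helper names `build_pipeline` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys


def build_pipeline(input_csv, queries, data_dir="data/"):
    """Return the three per-step commands as argument lists, in order."""
    py = sys.executable  # reuse the interpreter of the active virtual environment
    return [
        [py, "scripts/split.py", input_csv, "-o", data_dir],
        [py, "scripts/similarity.py", queries, "-i", data_dir, "-o", data_dir],
        [py, "scripts/topic_modeling.py", "-i", data_dir, "-o", data_dir],
    ]


def run_pipeline(commands):
    """Run each step in turn; check=True stops at the first failing step."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Build the commands and print them without executing anything.
commands = build_pipeline("example/test.csv", "example/queries.txt")
for cmd in commands:
    print(" ".join(cmd[1:]))
```

Calling `run_pipeline(commands)` from the repository root would execute the steps one after another.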

Paragraphs and sentences split process

Example:

python scripts/split.py example/test.csv -o data/
split.py usage
usage: split.py [-h] [-o OUTPUT] [-l LANG] [-s] [-m MODEL] input

positional arguments:
  input                 path to input file

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to directory where output files will be stored (default: ../data/)
  -l LANG, --lang LANG  language of documents (default: en)
  -s, --smart           use smart paragraphisation
  -m MODEL, --model MODEL
                        model for smart paragraphisation (default: sentence-transformers/sentence-t5-xl)
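
To illustrate what a paragraph and sentence split produces, here is a minimal rule-based sketch. It is not the method used by `split.py`: the smart paragraphisation option relies on an embedding model that is not reproduced here, and the function names and splitting rules below are assumptions chosen for illustration.

```python
import re


def split_paragraphs(text):
    """Split on one or more blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


text = "First sentence. Second one!\n\nNew paragraph here? Yes."
for p_id, paragraph in enumerate(split_paragraphs(text)):
    for s_id, sentence in enumerate(split_sentences(paragraph)):
        # Each sentence keeps the index of the paragraph it came from.
        print(p_id, s_id, sentence)
```

A rule like this breaks on abbreviations such as "e.g.", which is one reason a language-aware splitter is usually preferred for real documents.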

Similarity score computing

Example:

python scripts/similarity.py example/queries.txt -i data/ -o data/
similarity.py usage
usage: similarity.py [-h] [-i INPUT] [-o OUTPUT] [-e EMBEDDINGS] [-m MODEL] queries

positional arguments:
  queries               path to file with regex queries for relevant sentences search

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to directory with paragraphs.csv and sentences.csv (default: ../data/)
  -o OUTPUT, --output OUTPUT
                        path to directory where files will be stored (default: ../data/)
  -e EMBEDDINGS, --embeddings EMBEDDINGS
                        is there embeddings
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
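
A similarity score between a sentence and a query is commonly computed as the cosine of the angle between their embedding vectors. The sketch below shows that computation in plain Python on toy two-dimensional vectors standing in for model embeddings. Taking the maximum over all queries is an assumption, as are the function names; the README does not state how `similarity.py` combines scores.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_scores(sentence_vecs, query_vecs):
    """For each sentence, keep its highest similarity to any query (assumed rule)."""
    return [max(cosine(s, q) for q in query_vecs) for s in sentence_vecs]


# Toy vectors; real embeddings would come from the model passed with -m.
sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vecs = [[1.0, 0.0]]
scores = best_scores(sentence_vecs, query_vecs)
print(scores)
```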

Topic modeling

Example:

python scripts/topic_modeling.py -i data/ -o data/
topic_modeling.py usage
usage: topic_modeling.py [-h] [-i INPUT] [-o OUTPUT] [-t THRESHOLD] [-sm SPACY_MODEL] [-m MODEL] [-gpt GPT_MODEL]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to directory with sentences_sim.csv, optionaly with sentences_embeddings.npy, documents.csv (default:
                        ../data/)
  -o OUTPUT, --output OUTPUT
                        path to directory where files will be stored (default: ../data/)
  -t THRESHOLD, --threshold THRESHOLD
                        threshold to determine relevant sentences (default: 0.5)
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spacy model for lemmatization (default: en_core_web_lg)
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
  -gpt GPT_MODEL, --gpt_model GPT_MODEL
                        model for topic representation and summary (default: None)
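
The threshold option decides which sentences count as relevant before topics are built. The sketch below shows one way such a filter could work on CSV data. The column names `sentence` and `similarity`, the function name, and the use of `>=` rather than `>` are all assumptions; the actual layout of sentences_sim.csv is not described in this README.

```python
import csv
import io


def relevant_rows(csv_text, threshold=0.5, score_column="similarity"):
    """Keep rows whose score is at or above the threshold (assumed comparison)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[score_column]) >= threshold]


# Hypothetical sample data, not taken from the repository.
sample = "sentence,similarity\nCats purr.,0.81\nStocks fell.,0.12\nDogs bark.,0.50\n"
kept = relevant_rows(sample)
print([row["sentence"] for row in kept])
```

Raising the threshold keeps fewer, more closely matching sentences; lowering it admits more loosely related ones into the topic model.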

Contributors

iothor
