BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!

The corresponding Medium posts can be found here and here.
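
To give a feel for the c-TF-IDF step mentioned above, here is a minimal pure-Python sketch. It assumes one common formulation, (t / w) * log(m / total_t), where t is a word's count within a topic, w is the number of words in that topic, m is the number of documents, and total_t is the word's count across all topics. The function names `c_tf_idf` and `top_words` and the toy data are illustrative only; this is not BERTopic's implementation, which may differ in tokenization and weighting.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a class-based TF-IDF (assumed formula).

    docs_per_topic maps a topic id to the list of documents assigned to it.
    """
    # Treat every topic as one long document by pooling its word counts.
    counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_per_topic.items()
    }
    m = sum(len(docs) for docs in docs_per_topic.values())  # number of documents
    total = Counter()  # word counts across all topics
    for topic_counts in counts.values():
        total.update(topic_counts)

    scores = {}
    for topic, topic_counts in counts.items():
        w = sum(topic_counts.values())  # number of words in this topic
        scores[topic] = {
            word: (t / w) * math.log(m / total[word])
            for word, t in topic_counts.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring words of a topic."""
    ranked = sorted(scores[topic].items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical toy clusters, not taken from the 20 newsgroups data.
toy = {
    0: ["the windows drive", "the windows disk", "the windows file"],
    1: ["the hockey game", "the hockey team", "the hockey season"],
}
scores = c_tf_idf(toy)
print(top_words(scores, 0, 1), top_words(scores, 1, 1))
```

In this toy example, "the" is spread evenly over both topics and scores zero, while "windows" and "hockey" are frequent within a single topic and rank first there.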

Installation

BERTopic can be installed from PyPI:

pip install bertopic

To use the visualization options, install BERTopic as follows:

pip install bertopic[visualization]

To use Flair embeddings, install BERTopic as follows:

pip install bertopic[flair]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with the Google Colab notebook here.

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

>>> topic_model.get_topic_freq().head()
Topic	Count
-1	7288
49	3992
30	701
27	684
11	568

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
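
The two outputs above can be post-processed with ordinary Python. The sketch below assumes the frequency table has been converted to plain (topic, count) pairs and reuses the values printed above; the helpers `largest_topic` and `label` are hypothetical names, not part of BERTopic.

```python
# Values copied from the outputs shown above; in practice they would come from
# get_topic_freq() and get_topic(49).
topic_freq = [(-1, 7288), (49, 3992), (30, 701), (27, 684), (11, 568)]
topic_49 = [
    ("windows", 0.006152228076250982),
    ("drive", 0.004982897610645755),
    ("dos", 0.004845038866360651),
    ("file", 0.004140142872194834),
    ("disk", 0.004131678774810884),
    ("mac", 0.003624848635985097),
    ("memory", 0.0034840976976789903),
    ("software", 0.0034415334250699077),
    ("email", 0.0034239554442333257),
    ("pc", 0.003047105930670237),
]


def largest_topic(freq, outlier=-1):
    """Return the (topic, count) pair with the highest count, skipping outliers."""
    real = [(topic, count) for topic, count in freq if topic != outlier]
    return max(real, key=lambda pair: pair[1])


def label(words, n=3):
    """Build a short label from the first n words of a (word, score) list."""
    return "_".join(word for word, _ in words[:n])


print(largest_topic(topic_freq))
print(label(topic_49))
```

Skipping topic -1 before taking the maximum matters here, because the outlier group is larger than any real topic.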

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Visualize Topics

After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

Embedding Models

The parameter embedding_model takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

Sentence-Transformers
You can select any model from sentence-transformers here and pass it through BERTopic with embedding_model:

from bertopic import BERTopic
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model)

Flair
Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

You can select any 🤗 transformers model here.

Custom Embeddings
You can also use previously generated embeddings by passing them to fit_transform():

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)
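
As a rough sketch of where such embeddings might come from, the code below encodes the documents once with sentence-transformers and reuses the vectors when fitting. It assumes that `embeddings` holds one vector per document, in the same order as `docs`. The helper names `check_embeddings` and `fit_with_precomputed` are hypothetical, and the model name is simply the one used earlier in this README.

```python
def check_embeddings(docs, embeddings):
    """Raise ValueError unless there is one equally sized vector per document.

    Returns (number of vectors, vector dimension) when the check passes.
    """
    if len(embeddings) != len(docs):
        raise ValueError(f"{len(docs)} documents but {len(embeddings)} embeddings")
    dims = {len(vector) for vector in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embeddings have inconsistent dimensions: {sorted(dims)}")
    return len(embeddings), (dims.pop() if dims else 0)


def fit_with_precomputed(docs, model_name="distilbert-base-nli-mean-tokens"):
    """Encode docs once, then reuse the vectors when fitting BERTopic."""
    # Imported lazily so check_embeddings works without these packages installed.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embeddings = SentenceTransformer(model_name).encode(docs, show_progress_bar=False)
    check_embeddings(docs, embeddings)

    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics, embeddings


# Quick sanity check on made-up two-dimensional vectors.
print(check_embeddings(["first doc", "second doc"], [[0.1, 0.2], [0.3, 0.4]]))
```

Computing the embeddings separately is mainly useful when you want to try several BERTopic settings on the same documents without re-encoding them each time.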

Overview

| Methods | Code |
|---|---|
| Fit the model | topic_model.fit(docs) |
| Fit the model and predict documents | topic_model.fit_transform(docs) |
| Predict new documents | topic_model.transform([new_doc]) |
| Access single topic | topic_model.get_topic(12) |
| Access all topics | topic_model.get_topics() |
| Get topic freq | topic_model.get_topic_freq() |
| Visualize Topics | topic_model.visualize_topics() |
| Visualize Topic Probability Distribution | topic_model.visualize_distribution(probabilities[0]) |
| Update topic representation | topic_model.update_topics(docs, topics, n_gram_range=(1, 3)) |
| Reduce nr of topics | topic_model.reduce_topics(docs, topics, nr_topics=30) |
| Find topics | topic_model.find_topics("vehicle") |
| Save model | topic_model.save("my_model") |
| Load model | BERTopic.load("my_model") |
| Get parameters | topic_model.get_params() |

Citation

To cite BERTopic in your work, please use the following bibtex reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.5.0},
  doi          = {10.5281/zenodo.4430182},
  url          = {https://doi.org/10.5281/zenodo.4430182}
}
