diem-ai / topic-modeling

Retrieves real-time breaking news from https://www.reuters.com/breakingviews and builds topic models using Latent Dirichlet Allocation and Latent Semantic Analysis

License: MIT License

Jupyter Notebook 99.66% Python 0.34%

natural-language-processing gensim nltk latent-dirichlet-allocation latent-semantic-analysis

topic-modeling's Introduction

Topic Modeling with Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA)

Collects the top 500 news items from https://www.reuters.com/breakingviews
The goal is to break text documents down into topics by word and to see how topics are modelled with different approaches. We want to find "topics": collections of words that appear in similar documents.
Two popular libraries for LDA/LSA are scikit-learn and gensim. I chose gensim for this project.

Project Notes

Dataset

Retrieve the 500 latest breaking news items from https://www.reuters.com/breakingviews
Clean the data with beautifulsoup and save it to a CSV file (data/breakingnews.csv) for analysis and model building
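
The cleaning step above could look roughly like the sketch below. It is not the code from get_historical_news.py: the function names, the column names and the sample article are hypothetical, and it only assumes that scraped articles arrive as HTML fragments.

```python
import csv
import io

from bs4 import BeautifulSoup


def clean_html(raw_html: str) -> str:
    """Strip tags from one HTML fragment and collapse runs of whitespace."""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return " ".join(text.split())


def write_news_csv(rows, handle) -> None:
    """Write (title, raw HTML) pairs as cleaned rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["title", "text"])  # hypothetical column names
    for title, raw_html in rows:
        writer.writerow([title, clean_html(raw_html)])


# Hypothetical sample standing in for scraped articles.
sample = [("Example headline", "<p>Markets <b>rallied</b> on Monday.</p>")]
buffer = io.StringIO()
write_news_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a handle such as open("data/breakingnews.csv", "w", newline="", encoding="utf-8").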

Code

get_historical_news.py: pulls historical news from https://www.reuters.com/breakingviews
accessory_function.py: a collection of functions imported in the notebooks
- clean raw data
- sort returned values
- write pickle file
- read pickle file
model_preparation.ipynb:
- Read breakingnews.csv and remove special characters
- Visualize the most frequent words with WordCloud
- Create a dictionary from the processed data and save it as dictionary.pkl (/data/dictionary.pkl)
- Create a corpus from the processed data and save it in /data/processed_data.pkl
- Create a bag of words (BOW) and save it in /data/bow.pkl
- Create a TF-IDF representation and save it in /data/tfidf.pkl
Topic Modeling-LDA.ipynb:
- Build an LDA model with bag of words from processed_data.pkl, bow.pkl and dictionary.pkl
- Build an LDA model with TF-IDF from processed_data.pkl, tfidf.pkl and dictionary.pkl
- Print the top 5 topics of each model and interpret the results
- Visualize the topics and their words with pyLDAvis
- Calculate perplexity and topic coherence for the two models and compare them
Topic Modeling-LSA.ipynb:
- Build an LSA model with bag of words from processed_data.pkl, bow.pkl and dictionary.pkl
- Build an LSA model with TF-IDF from processed_data.pkl, tfidf.pkl and dictionary.pkl
- Print top 5 topics of each model and interpret the results

View notebooks with Colab

model_preparation.ipynb: https://colab.research.google.com/drive/1VLf69UIoJ79TuMq2fh4BOnd8qeTzYUBI?authuser=1#scrollTo=aiERuDAhef71
Topic Modeling-LDA.ipynb: https://colab.research.google.com/drive/1RhSUArIbix4oF3ZbfSC94lHlTjhtBLUr?authuser=1#scrollTo=f28PG4o7x4SB
Topic Modeling-LSA.ipynb: https://colab.research.google.com/drive/1ZLiw8up2og9UVa2D6A8Wqa_P3YbJgWX7?authuser=1#scrollTo=rfjSPpiY299P

Project tasks:

Clean the dataset and lemmatize
Create a dictionary from the processed data
Create a corpus and LDA/LSA models with bag of words
Create a corpus and LDA/LSA models with TF-IDF
Calculate the perplexity and topic coherence of the two models
Visualize topics with the help of pyLDAvis

Requirements

Python >= 3.7
Jupyter Notebook

Dependencies

pandas
matplotlib
seaborn
pyLDAvis
scikit-learn
numpy
gensim
SciPy
nltk
string (Python standard library, no installation needed)
beautifulsoup4
wordcloud
requests

Run locally:

Clone the project: git clone https://github.com/diem-ai/topic-modeling.git
Install the latest versions of the libraries listed under Requirements and Dependencies
Run get_historical_news.py to collect the 500 latest news items: python get_historical_news.py
Comment out the Colab setup and change the data path in the notebooks
Run model_preparation.ipynb to produce the data
Run Topic Modeling-LDA.ipynb for LDA topic modeling
Run Topic Modeling-LSA.ipynb for LSA topic modeling
