Topic-Detection-unsupervised-from-urls

This program scrapes text from a list of URLs (supplied as a CSV file), computes TF-IDF features for every term, and stores them in a pandas dataframe. It also derives the 'target' variable from these features. We then use TensorFlow to build a basic network that classifies each document by topic.

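**Packages you need to install:** pandas, tensorflow, bs4, numpy, scikit-learn. The time and csv modules are also used, but they are part of Python's standard library and need no installation.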

Commands to install them:

pip install pandas

pip install -U scikit-learn

pip install bs4

pip install --upgrade tensorflow

If you face issues with the TensorFlow installation, follow https://www.tensorflow.org/install/pip

Input: Provide the path to the CSV file that holds your list of URLs as filename.
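Reading the input file could look like the sketch below. The function name read_urls and the assumption that the URLs sit in the first column of the CSV are ours for illustration; the original script may lay out the file differently.

```python
import csv


def read_urls(filename):
    """Return the URLs found in the first column of a CSV file.

    Assumption: one URL per row, in the first column. Rows that do not
    start with http:// or https:// (for example a header row) are skipped.
    """
    urls = []
    with open(filename, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # blank line in the file
            candidate = row[0].strip()
            if candidate.lower().startswith(("http://", "https://")):
                urls.append(candidate)
    return urls
```

Calling read_urls(filename) then gives a plain Python list that the scraping step can loop over.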

This program scrapes "significant" paragraphs from the provided list of URLs using Beautiful Soup. The paragraphs are then collated into a list, which we pass to a function named tfidf_dataframe. This function builds a pandas dataframe holding the TF-IDF score of every term in these paragraphs. We call each URL's content a document, to avoid confusion with paragraphs, since each URL contains multiple paragraphs. scikit-learn's TF-IDF vectoriser is used for this purpose; it weights each term by how relevant it is to a document, rather than just counting the raw frequency of every word.
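The paragraph extraction step can be sketched as follows. This is a minimal illustration, not the original code: the helper names and the 10-word cut-off for a "significant" paragraph are assumptions, and the HTML is passed in as a string so the example runs without network access (in practice it would be downloaded from each URL first).

```python
from bs4 import BeautifulSoup

MIN_WORDS = 10  # assumed threshold for a "significant" paragraph


def significant_paragraphs(html, min_words=MIN_WORDS):
    """Return the text of every <p> tag that has at least min_words words."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for tag in soup.find_all("p"):
        # Collapse runs of whitespace so the word count is reliable.
        text = " ".join(tag.get_text(" ", strip=True).split())
        if len(text.split()) >= min_words:
            paragraphs.append(text)
    return paragraphs


def document_from_html(html):
    """Join the significant paragraphs of one page into a single document."""
    return " ".join(significant_paragraphs(html))
```

Running document_from_html over every downloaded page yields one string per URL, which is the list of documents the TF-IDF step expects.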

""" What's TF-IDF? TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a blog that deep-dives into TF-IDF and how it is used in information retrieval: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089 """

We are also able to remove all stop words, so the classifier does not have to process unnecessary words such as "this" and "that". We are hoping that the classifier looks at words such as "bitcoin", "ethereum", "BTC", "crypto", etc.

E.g., the TF-IDF scores for the terms 'wallet' and 'bitcoin' for the first URL:

wallet 0.07957242026609590

bitcoin 0.18931356854040600
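To see where a score like this comes from, here is the textbook TF-IDF calculation in plain Python: the term frequency within a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is only an illustration. scikit-learn's vectoriser uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from the ones this sketch produces.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores
```

A term that appears in every document gets a score of zero, which is why very common words carry little weight.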

The task of classifying documents into multiple classes (meaning it's not a binary classification task) is a hard problem. It becomes even harder when we add the constraint of no supervision, meaning we have no labelled training dataset that the classifier can "learn from".

So a "creative" way of solving this was to create the training dataset from the word frequencies (which we can extract using the TF-IDF vectoriser). This approach seems reasonable, but the only way to fully validate it is to manually compare the generated labels against the documents, which wasn't done.

So for a term like "bitcoin", we calculate the TF-IDF score in every document. Documents with a "significant" presence of the term have higher TF-IDF scores. For each category, we can then calculate a cumulative score for each document.

So for document A, we have (each value is labelled category + "_score" to make it easier to understand):

bitcoin_score 0.18907582749645400

technology_score (no presence) -1.0

trading_score 0.34438080201746400

art_score 0.028341746599359800

crypto_score 0.2914181223972970

The "bitcoin" score is lower than the "trading" and "crypto" scores, and "trading" is the highest of all, so we file this document under "trading". I followed this scoring approach for all the documents. I also had a label for "others".
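The scoring and labelling steps can be sketched like this. The keyword lists in CATEGORY_TERMS are hypothetical, and the rule for "others" (no category term present in the document) is our assumption, since the README does not say exactly how that label was assigned.

```python
import pandas as pd

# Hypothetical keyword lists; the original categories may use other terms.
CATEGORY_TERMS = {
    "bitcoin": ["bitcoin", "btc"],
    "technology": ["technology", "software"],
    "trading": ["trading", "exchange"],
    "art": ["art", "painting"],
    "crypto": ["crypto", "ethereum"],
}


def category_scores(tfidf, category_terms=CATEGORY_TERMS):
    """Sum the TF-IDF columns of each category; -1.0 marks no presence."""
    scores = pd.DataFrame(index=tfidf.index)
    for category, terms in category_terms.items():
        present = [term for term in terms if term in tfidf.columns]
        if present:
            total = tfidf[present].sum(axis=1)
        else:
            total = pd.Series(0.0, index=tfidf.index)
        scores[category + "_score"] = total.where(total > 0, -1.0)
    return scores


def assign_labels(scores):
    """Pick the highest-scoring category, or "others" if none is present."""
    best = scores.idxmax(axis=1).str.replace("_score", "", regex=False)
    return best.where(scores.max(axis=1) > 0, "others")


# Document A from above, plus a document with no category terms at all.
example = pd.DataFrame({
    "bitcoin_score": [0.189, -1.0],
    "technology_score": [-1.0, -1.0],
    "trading_score": [0.344, -1.0],
    "art_score": [0.028, -1.0],
    "crypto_score": [0.291, -1.0],
})
print(assign_labels(example).tolist())  # ['trading', 'others']
```

The returned series can be stored as the 'target' column of the TF-IDF dataframe.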

Our target column looked something like this (a cursory look revealed that a lot of the documents belonged to the category "others"):

trading others bitcoin technology crypto trading technology bitcoin others crypto bitcoin crypto others technology technology crypto crypto bitcoin

Once this was done, we created a TensorFlow dataset from the dataframe and split the data into training, validation, and test sets. We only have 800 or so documents for training, which is far too few for this task.
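A minimal sketch of this step is shown below, assuming the dataframe has numeric TF-IDF columns plus a string 'target' column. The helper names, the 10% test fraction, and the batch size are illustrative choices, not values taken from the original code; a validation set can be carved out of the training part in the same way.

```python
import pandas as pd
import tensorflow as tf


def split_dataframe(df, test_fraction=0.1, seed=42):
    """Shuffle the rows and return (train, test) dataframes."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled.iloc[n_test:], shuffled.iloc[:n_test]


def df_to_dataset(df, label_ids, target="target", batch_size=32, shuffle=True):
    """Wrap features and integer labels in a batched tf.data.Dataset."""
    features = df.drop(columns=[target]).to_numpy(dtype="float32")
    # Map the string labels ("bitcoin", "others", ...) to integer class ids.
    labels = df[target].map(label_ids).to_numpy(dtype="int64")
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(df))
    return dataset.batch(batch_size)
```

The label_ids mapping can be built once from the full dataframe, for example {name: i for i, name in enumerate(sorted(df["target"].unique()))}, so that the training and test sets share the same class ids.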

With 783 training examples and 88 test examples, the model performed poorly.

Copied from the TensorFlow blog: """ Key Point: You will typically see best results with deep learning with much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future.

"""

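Following the suggestion in the quote above, a random forest baseline on the same TF-IDF features could look like this. It is a sketch that was not part of the original experiment; the function name, the number of trees, and the 10% test split are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def random_forest_baseline(features, labels, test_size=0.1, seed=0):
    """Fit a random forest and return (model, accuracy on the held-out split)."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(x_train, y_train)
    # score() reports plain accuracy on the held-out documents.
    return model, model.score(x_test, y_test)
```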
How to improve:

More data. Think 10,000+ documents.
Some supervision. Even 20-30% labelled data can improve performance; the scoring approach above can also be used to generate more labels.
We can use the raw text in addition to the TF-IDF frequencies as additional features.
We can also think of somehow embodying relationships in the text as features. For example, here we treat each word as independent of every other word; we could figure out a way to also capture and quantify the relationships between words.

Repository: aardra12 / topic-detection-unsupervised-from-urls
