
Topic-Detection-unsupervised-from-urls

This program scrapes text from a list of URLs (provided in a CSV file), computes TF-IDF features for every term, and puts them into a dataframe. It also derives the 'target' variable from these features. We then use TensorFlow to build a basic network that classifies each URL by topic.

**Packages you need to install:** pandas, tensorflow, bs4, numpy, scikit-learn (`time` and `csv` are part of the Python standard library and need no installation)

Commands to install them:

pip install pandas

pip install -U scikit-learn

pip install bs4

pip install --upgrade tensorflow

If you face issues with the TensorFlow installation, follow https://www.tensorflow.org/install/pip

Input: provide the path to the CSV file containing your list of URLs as `filename`.

This program scrapes "significant" paragraphs from the provided list of URLs using Beautiful Soup. Once that is done, the paragraphs are collated into a list. We then run the list of paragraphs through a function called `tfidf_dataframe`. This function creates a pandas dataframe capturing the TF-IDF of all the terms in these paragraphs (we will call each URL's content a document, so as not to confuse it with paragraphs, since each URL contains multiple paragraphs). Scikit-learn's TfidfVectorizer is used for this purpose, and it captures relevant term weights rather than just counting ALL the frequencies of all words.
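A minimal sketch of this scraping and feature-extraction step is shown below. The helper name `fetch_paragraphs`, the use of `requests`, and the paragraph-length filter are assumptions for illustration; the repo's actual `tfidf_dataframe` may differ in detail.

```python
import csv

import pandas as pd
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

def fetch_paragraphs(url):
    """Collect the 'significant' <p> text of a page into one document."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    paras = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    # Keep only reasonably long paragraphs (assumed significance filter).
    return " ".join(p for p in paras if len(p) > 200)

def tfidf_dataframe(documents):
    """One row per document, one column per term, values are TF-IDF scores."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    return pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())

with open("filename.csv") as f:  # "filename" is the CSV of URLs from above
    urls = [row[0] for row in csv.reader(f)]

docs = [fetch_paragraphs(u) for u in urls]
df = tfidf_dataframe(docs)
```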

""" Whats' TFIDF? TF-IDF stands for Term Frequency โ€” Inverse Document Frequency Here is a blog that deep-dives into TFIDF and how it is used in information retrieval: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089 """

We are also able to remove ALL stop words, so we will not be taxing the classifier by making it run through unnecessary words such as "this" and "that". We are hoping that the classifier looks at words such as "bitcoin", "ethereum", "BTC", "crypto", etc.

E.g., the TF-IDF scores for the terms 'wallet' and 'bitcoin' in the first URL's document:

wallet 0.07957242026609590

bitcoin 0.18931356854040600

The task of classifying documents into multiple classes (i.e., not a binary classification task) is a hard problem. It becomes extremely complex when we add the constraint of no supervision, meaning we have no labeled training dataset we can feed the classifier to "learn" from.

So a "creative" way of solving this was to create the training dataset using the word frequencies that we can extract with the TF-IDF vectorizer. We can assume that this approach is rather good, but the only way to fully validate it is to manually compare the results (which wasn't done).

So for a term like "bitcoin", we calculate the TF-IDF scores across all the documents. The documents with a "significant" presence of the term have higher TF-IDF scores, so for each category we can calculate a cumulative score per document.

So for document A, we have (each label is the category name plus "_score", to make it easier to read):

bitcoin_score 0.18907582749645400

technology_score (no presence) -1.0

trading_score 0.34438080201746400

art_score 0.028341746599359800

crypto_score 0.2914181223972970

So the score for the "bitcoin" occurrences was lower than those for "trading" and "crypto", and "trading" scored highest, so we can file this document under "trading". I followed this scoring approach for all the documents. I also had a label for "others".
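A sketch of this pseudo-labelling step is below. The keyword lists per category and the "others" threshold are illustrative assumptions, not the repo's exact values.

```python
# Map each category to the terms whose TF-IDF scores are summed for it.
CATEGORY_TERMS = {
    "bitcoin":    ["bitcoin", "btc"],
    "technology": ["technology", "blockchain", "software"],
    "trading":    ["trading", "exchange", "price"],
    "art":        ["art", "nft"],
    "crypto":     ["crypto", "ethereum", "wallet"],
}

def label_documents(df, threshold=0.05):
    """Assign each document the category with the highest cumulative
    TF-IDF score; fall back to 'others' if nothing clears the threshold."""
    labels = []
    for _, row in df.iterrows():
        scores = {cat: sum(row.get(t, 0.0) for t in terms)
                  for cat, terms in CATEGORY_TERMS.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else "others")
    return labels

df["target"] = label_documents(df)
```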

Our target column looked something like this (a cursory look revealed that a lot of the documents belonged to the category "others"):

trading others bitcoin technology crypto trading technology bitcoin others crypto bitcoin crypto others technology technology crypto crypto bitcoin

Once this was done, we created a TensorFlow dataset from the dataframe and split the data into training, validation, and test sets. We only have around 800 documents for this training, which is far too few for this kind of segmentation.

With a split of 783 training examples and 88 test examples, we obtained pretty bad performance.
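A minimal sketch of how the dataframe might be turned into a tf.data pipeline and a small dense network follows. The layer sizes, dropout rate, and training settings are assumptions, not the repo's exact configuration.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Encode the string labels as integers.
classes = sorted(df["target"].unique())
y = df["target"].map(classes.index).values
X = df.drop(columns=["target"]).values.astype("float32")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

train_ds = tf.data.Dataset.from_tensor_slices(
    (X_train, y_train)).shuffle(1000).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)

# A small dense network over the TF-IDF feature columns.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(len(classes), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=test_ds, epochs=10)
```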

Copied from the TensorFlow tutorial: """ Key Point: You will typically see best results with deep learning with much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future.

"""

How to improve:

  1. More data. Think 10,000+ documents.
  2. Some supervision. Even labeling 20-30% of the data can improve performance; we can also use the scoring approach above to bootstrap additional labels.
  3. We can use raw text in addition to the TF-IDF frequencies as additional features.
  4. We can also think of somehow embodying relationships in the text as features. Here we consider each word as independent of every other; word embeddings are one way to capture and quantify such relationships (see the sketch after this list).
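A sketch of point 4: feeding the raw text through an embedding layer so the model can learn relationships between words instead of treating each term independently. The vocabulary size, sequence length, and embedding dimension are illustrative assumptions.

```python
import tensorflow as tf

# Turn raw documents into integer token sequences.
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=20000, output_sequence_length=200)
vectorize.adapt(docs)  # `docs` is the list of scraped documents from earlier

# Embedding + pooling lets the network learn word relationships.
embed_model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(20000, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(len(classes), activation="softmax"),
])
embed_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
```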
