
topicmodeling_nlp's Introduction



This repo contains two personal projects, outlined below. All files are attached in the repo.

1. Sentiment Analysis

  • An example that builds intuition for the bag-of-words model in NLP, using Kirill Eremenko's restaurant reviews dataset.
  • The model could be improved further; the aim here is to convey the intuition behind it.

2. Topic Modelling

Objective

  • NB: If the notebook does not load, check the Python script 👉🏽 Kmeans Topic Modelling
  • In this project, we group customer reviews from a Twitter corpus based on recurring patterns. The goal is to get a sense of the specific topic in each cluster, i.e. what the customers are complaining about. The Twitter corpus contains a lot of noise, which we try to minimize in order to make sense of the data.

Data

  • The data is noisy Twitter data about Vodafone, a telecom company: 21047 tweets with 4 attributes (username, date, tweet and mention), stored in tweets.csv.

Methodology

  • The ML technique used in this project is k-means clustering, an unsupervised model, applied here to extract patterns from the tweets.
  1. Data Cleaning with Pattern Removal

    • Remove mentions (words starting with @)
    • Replace non-alphabetic characters with a space
    • Convert uppercase letters to lowercase so that the same word is always treated as one token
    • Collapse repeated spaces and remove words shorter than 2 characters
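The four cleaning steps above can be sketched with the standard `re` module. This is a minimal illustration, not the project's actual code: the function name `clean_tweet`, the exact regular expressions, and the sample tweet are assumptions.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the four cleaning steps to a single tweet (hypothetical helper)."""
    # 1. Remove @mentions (assumed pattern: '@' followed by word characters).
    text = re.sub(r"@\w+", " ", text)
    # 2. Replace every non-alphabetic character with a space.
    text = re.sub(r"[^A-Za-z]", " ", text)
    # 3. Lowercase everything.
    text = text.lower()
    # 4. split() collapses runs of whitespace; drop words shorter than 2 characters.
    words = [word for word in text.split() if len(word) >= 2]
    return " ".join(words)


# Made-up tweet, for illustration only.
sample = "@some_user My 4G is DOWN again!!  a joke"
cleaned = clean_tweet(sample)
```

Here `cleaned` is `"my is down again joke"`: the mention, the digit, the punctuation and the one-letter words are gone.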
  2. Tokenizing Data and Identifying Special Instances of Tweets. This separates the words and removes punctuation.

    • Create a list of words for each row of the clean text, making each word a standalone token; this also removes any full stops at the end of the text.
    • Drop empty rows in the clean data
    • Drop duplicate and empty tweets in the data set and reset the index
  3. Vectorizer. This is similar to tokenization, except that it takes the whole vocabulary of the documents and converts it into a bag-of-words matrix. For instance:
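A rough sketch of the tokenizing step, assuming the tweets have already been cleaned into plain lowercase strings. The project most likely works on a data frame; plain Python lists are used here only to keep the example self-contained, and `tokenize_and_dedupe` is a hypothetical name.

```python
def tokenize_and_dedupe(cleaned_tweets):
    """Split each cleaned tweet into words, skipping empty and duplicate tweets."""
    seen = set()
    tokenized = []
    for text in cleaned_tweets:
        tokens = text.split()  # one standalone word per list element
        key = tuple(tokens)
        if not tokens or key in seen:
            continue  # drop empty rows and exact duplicates
        seen.add(key)
        tokenized.append(tokens)
    # The result is a new list, so positions start again from 0
    # (the equivalent of resetting the index).
    return tokenized


# Made-up cleaned tweets, for illustration only.
rows = ["my is down again", "", "my is down again", "no signal today"]
tokens = tokenize_and_dedupe(rows)
```

With this input, `tokens` holds two entries: the empty row and the repeated tweet are dropped.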

      [Hi my name is celdrick]
      [Hi my friend is Joyce]
    
      # Collect the entire vocabulary into a structured format with a fixed input length
      [Hi my name is celdrick friend Joyce]
    
      # Count vectorizer converts each sentence to a row of counts; count vectorizer is preferred to TF-IDF here because the data set is small.
      [1, 1, 1, 1, 1, 0, 0]
      [1, 1, 0, 1, 0, 1, 1]
    • Implement the count vectorizer with parameters such as stop_words, analyzer, ngram_range, min_df and max_df, then convert the matrix to an array for modeling
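The two-sentence illustration above can be reproduced with a small hand-rolled bag-of-words sketch. The parameter names in the last bullet match scikit-learn's `CountVectorizer`, which the project presumably uses (an assumption); that class lowercases text and sorts its vocabulary alphabetically, so its column order would differ from the first-seen order used here. The helper names `build_vocabulary` and `count_vectorize` are hypothetical.

```python
def build_vocabulary(documents):
    """Map each distinct word to a column index, in first-seen order."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def count_vectorize(documents, vocab):
    """Turn each document into a fixed-length row of word counts."""
    rows = []
    for doc in documents:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:  # words outside the vocabulary are ignored
                row[vocab[word]] += 1
        rows.append(row)
    return rows


docs = ["Hi my name is celdrick", "Hi my friend is Joyce"]
vocab = build_vocabulary(docs)
matrix = count_vectorize(docs, vocab)
```

Every row has the same length as the vocabulary, which is what lets a clustering model treat each tweet as a point in the same space.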
  4. Model Building and Evaluation

    • Since this is a clustering problem, k-means is used.
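To show what k-means does with the count vectors, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not the project's implementation, which presumably calls a library such as scikit-learn (an assumption). The deterministic initialisation, the function name `kmeans`, and the toy count vectors are all illustrative choices.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    # Assumed initialisation: the first k distinct points become the centroids.
    centroids = []
    for p in points:
        if list(p) not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Toy count vectors: columns are counts of two hypothetical words per tweet.
points = [[2, 0], [3, 0], [2, 1], [0, 2], [0, 3], [1, 2]]
labels, centroids, inertia = kmeans(points, k=2)
```

On this toy data the first three tweets end up in one cluster and the last three in the other. The returned inertia gives one way to compare different values of k, alongside the word cloud inspection used in this project.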

Results

  • The optimal number of clusters for my model is 6 (6 centroids); this can be improved with experience. Output: clustered_tweets.csv
  • Word cloud analysis helped to visualize prominent patterns when deciding the number of clusters
  • The best number of clusters/centroids lies between 2 and 8

Recommendations

  • Since customer reviews are subjective, try a bigger data set with more reviews. We could keep monitoring the system's performance and varying the number of clusters, since choosing the clusters comes with experience and depends on the domain problem. NB: If the Jupyter file does not render at this time, check the .py file.
