
graduation_project's Introduction

Graduation Project

@Aliti-Coding, @mats-bb, @duggurd

Contents

1. Sentiment Analysis

The basic premise is to create a model that can evaluate the sentiment of a piece of text, i.e. which underlying emotions the text conveys. A decision needs to be made whether to classify purely as "good" or "bad", or to classify a wider range of emotions [1]. Furthermore, we need to decide whether to treat the problem as a binary classification problem or as a regression problem with graded emotion scores.

Additionally, a text-summarization algorithm is to be implemented. So far we have looked at term frequency–inverse document frequency (TF-IDF) [2]. This algorithm is to be applied to each "document", i.e. each piece of text, to extract its most important keyword or keywords.

The project targets English text.
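As a rough illustration of TF-IDF keyword extraction, the sketch below scores each term by its frequency within a document multiplied by the log of the inverse document frequency. The function names, the tokenizer and the sample documents are assumptions made for this example, not the project's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        results.append([term for term, _ in ranked[:top_k]])
    return results


# Made-up sample documents, for illustration only.
docs = [
    "the movie was great and the acting was great",
    "the movie was boring and the plot was weak",
    "the service was slow and the food was cold",
]
keywords = tfidf_keywords(docs, top_k=2)
```

Terms that occur in every document (here "the", "was", "and") get a score of zero, so they drop out without a separate stop-word list.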

1.1. Social Media Platforms

Apply the trained model to analyze different social media platforms for their intrinsic emotions, comparing the platforms with each other and analyzing each platform over time. This includes looking at the intrinsic emotions of specific tags and topics.

1.2. News Outlets

Proof-of-concept case: apply the model to articles from different news outlets and compare their sentiment for the same covered event.

Classifying the political orientation of different news outlets and individual articles would be a natural extension, but it is out of scope for this project.

Use TF-IDF on articles covering the same event to summarize each article. We can then compare the articles and see which words each outlet uses.

1.3. Business Value

The main goal of the project is to implement a sentiment analysis model that can be used on any piece of English text. A creator on YouTube, for example, could use the model to analyse comments over time and pinpoint potential problems with content without needing to read all the comments.

The same principle could be applied to a company's product and service reviews, to summarize the sentiment and to give deeper insight into where a potential problem lies and what causes it, or into which actions make a difference.

2. Method

[Fast iterative approach]

Figure 1. Proposed iterative method until production ready.

The foundation and core of the project is the data.

3. Data

Two sets of data are required to fulfill our vision for the project.

Firstly, a set of training data, which has two requirements. One, each sample is a piece of English text, and two, each sample has a label that encodes the sentiment of the text. For a binary classification approach the label could simply be a yes or no for whether the text is negative. For a regression problem, reviews with accompanying scores could be used.
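The two labeling options can be sketched as follows. This assumes a 1 to 10 rating scale and a cut-off of 7 for "positive"; both are illustrative choices, not decisions the project has made.

```python
def to_binary_label(rating, threshold=7):
    """Map a review rating to 1 (positive) or 0 (negative).

    The threshold is a hypothetical cut-off chosen for this example.
    """
    return 1 if rating >= threshold else 0


def to_regression_target(rating, min_rating=1, max_rating=10):
    """Scale a rating linearly into the range [0, 1]."""
    return (rating - min_rating) / (max_rating - min_rating)


# Made-up (text, rating) pairs, for illustration only.
reviews = [
    ("A wonderful film with a strong cast.", 9),
    ("Dull and far too long.", 3),
]
binary_labels = [to_binary_label(rating) for _, rating in reviews]
regression_targets = [to_regression_target(rating) for _, rating in reviews]
```

The binary form discards how strongly positive or negative a review is, while the regression form keeps that gradient.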

Secondly, the data to use for analysis. For the social media case, that would be posts from different social media platforms, and for the news outlet case, news articles from different sources covering the same topic or event.

3.1. Data Sources

[Data sources]

So far we have been looking at two main sources for training data: IMDb, which provides a large number of reviews with ratings, and a pre-labeled Twitter dataset.

Another source that could be used to derive the sentiment of a piece of text is a dictionary of words with associated emotions, like the NRC lexicons, which label words with both binary sentiment and multiple emotions. A pure lexicon approach could be problematic, as context is important when it comes to sentiment. A statement can carry a completely different underlying sentiment in different circumstances, and a seemingly neutral sentence can be negative in some contexts.
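A minimal sketch of the context problem with a pure lexicon approach. The tiny lexicon is a hypothetical stand-in written for this example; it is not taken from the NRC lexicons.

```python
import re

# Hypothetical toy lexicon; real lexicons are far larger.
TOY_LEXICON = {
    "great": 1, "love": 1, "good": 1,
    "bad": -1, "terrible": -1, "boring": -1,
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the per-word sentiment values, ignoring all context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


plain = lexicon_score("I love this phone, the camera is great")
sarcastic = lexicon_score("Oh great, the battery died again. Love that.")
negated = lexicon_score("This was not good")
```

The sincere and the sarcastic sentence get the same positive score, and the negated sentence is also scored as positive, because the lexicon sees only individual words.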

On the analysis side, we have looked at ways to collect social media posts from Twitter, Facebook, Reddit and YouTube for the social media case. For the news outlet case we have looked at Al Jazeera, Reuters, Fox News, CNN and ABC News (politically diverse news outlets), where web scraping will be necessary to collect the data.

3.2. Database

The database will be hosted in Azure Cloud. It needs at least two tables, one for social media posts and one for news articles. Depending on the method used to train the model, a third table might be necessary to store the training data. The social media data from all platforms needs to be normalized to fit one schema, and the same applies to the news articles.
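One possible shape for the two tables is sketched below. It uses an in-memory SQLite database purely as a local stand-in for the Azure-hosted database, and every table and column name is an assumption made for this example.

```python
import sqlite3

# Hypothetical schema: one table per data source type.
SCHEMA = """
CREATE TABLE IF NOT EXISTS social_media_posts (
    id INTEGER PRIMARY KEY,
    platform TEXT NOT NULL,
    author TEXT,
    posted_at TEXT NOT NULL,
    body TEXT NOT NULL,
    sentiment REAL
);
CREATE TABLE IF NOT EXISTS news_articles (
    id INTEGER PRIMARY KEY,
    outlet TEXT NOT NULL,
    url TEXT UNIQUE,
    published_at TEXT NOT NULL,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    sentiment REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Posts from every platform go into the same table; the platform column
# records where each post came from.
conn.execute(
    "INSERT INTO social_media_posts (platform, author, posted_at, body) "
    "VALUES (?, ?, ?, ?)",
    ("reddit", "example_user", "2023-03-01T12:00:00+00:00", "Great episode!"),
)
conn.commit()

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

The nullable `sentiment` column allows rows to be ingested first and scored by the model later.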

4. Machine Learning

Multiple types of models could be used. The choice will depend on the training data, how much of it is available, and the target value or values to predict. If a lot of data is available, a deep learning model such as an LSTM or a Transformer would probably be the best choice. If there are fewer training samples, or if a less complex model is preferred, a classical machine learning model such as a Support Vector Machine could be used.

5. Visualization

Ideas:

  • Sentiment over time
  • Location based
  • Per person based
  • Platform (Twitter, Reddit, etc.) based
  • News outlet based
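As one example of preparing data for these views, the "sentiment over time" and "platform based" ideas could share a single aggregation step. The sketch below assumes records of the form (platform, date, score); the record layout and sample values are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_sentiment(records):
    """Average the sentiment score per (platform, 'YYYY-MM') bucket."""
    buckets = defaultdict(list)
    for platform, day, score in records:
        buckets[(platform, day.strftime("%Y-%m"))].append(score)
    return {key: mean(scores) for key, scores in sorted(buckets.items())}


# Made-up sample records, for illustration only.
records = [
    ("reddit", date(2023, 1, 5), 0.8),
    ("reddit", date(2023, 1, 20), 0.4),
    ("reddit", date(2023, 2, 3), -0.2),
    ("twitter", date(2023, 1, 9), 0.1),
]
trend = monthly_sentiment(records)
```

The resulting mapping has one value per platform and month, which is the shape a line chart per platform would need.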

6. Sources

  1. https://medicalxpress.com/news/2023-02-joy-caf-tweets-reveal-cities.html
  2. Wikipedia. tf–idf. https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
  3. T. Shaik, X. Tao, C. Dann, H. Xie, Y. Li, L. Galligan. Sentiment analysis and opinion mining on educational data: A survey. Natural Language Processing Journal 2 (2023) 100003. https://arxiv.org/abs/2302.04359.
  4. Media Bias Chart. https://www.allsides.com/media-bias/media-bias-chart.
  5. Nhan Cach Dang, María N. Moreno-García, Fernando De la Prieta. Sentiment Analysis Based on Deep Learning: A Comparative Study. arxiv.org. https://arxiv.org/abs/2006.03541


graduation_project's Issues

Test a smaller pre-trained transformer

Try a transformer model with fewer than 66 million parameters. A pre-trained model is probably the better choice.

BERT-tiny

About 4 million parameters.

BERT-mini

About 12 million parameters.

From scratch

Create our own smaller model with BertConfig/DistilBertConfig.
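Before building a model from scratch, it can help to estimate how many parameters a given configuration would have. The sketch below counts the weights of a standard BERT encoder with a pooler and no task-specific head. The default vocabulary size (30522), the maximum of 512 positions, and the layer and hidden sizes in the examples are assumptions taken from the published BERT miniatures, not values from this project.

```python
def estimate_bert_parameters(num_layers, hidden_size, intermediate_size=None,
                             vocab_size=30522, max_positions=512,
                             type_vocab_size=2):
    """Estimate the parameter count of a BERT-style encoder with pooler."""
    if intermediate_size is None:
        # BERT conventionally uses a feed-forward size of 4x the hidden size.
        intermediate_size = 4 * hidden_size
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # one weight vector and one bias vector

    # Word, position and token-type embeddings plus their LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + layer_norm
    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (h -> i and i -> h) plus LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    # Dense pooler layer on top of the first token.
    pooler = h * h + h

    return embeddings + num_layers * (attention + feed_forward) + pooler


# Assumed sizes: 2 layers x 128 hidden, and 4 layers x 256 hidden.
tiny_params = estimate_bert_parameters(num_layers=2, hidden_size=128)
mini_params = estimate_bert_parameters(num_layers=4, hidden_size=256)
```

With these assumed sizes the estimate comes out at roughly 4.4 million and 11.2 million parameters. At this scale the embedding table accounts for most of the parameters, so a smaller vocabulary would reduce model size more than removing a layer.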

Visualization

platform: Power BI

Ideas:

  • Sentiment over time
  • Location based
  • Per person based
  • Platform (Twitter, Reddit, etc.) based
  • News outlet based

Text transformation class/package

We should consider creating a module or package that focuses exclusively on text transformations, since large parts of our project apply some form of transformation to text. A unified and generalized approach in one module or package is probably the best option.
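One way such a module could look is a small pipeline class that chains plain text-to-text functions. The class name, the step functions and the sample input are all hypothetical and only meant to illustrate the idea.

```python
import re
import string


class TextPipeline:
    """Apply a sequence of text-to-text functions in order."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def __call__(self, text):
        for step in self.steps:
            text = step(text)
        return text

    def transform_many(self, texts):
        """Run the pipeline on every text in an iterable."""
        return [self(text) for text in texts]


def strip_urls(text):
    """Replace http(s) URLs with a space."""
    return re.sub(r"https?://\S+", " ", text)


def strip_punctuation(text):
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text):
    """Reduce runs of whitespace to single spaces and trim the ends."""
    return " ".join(text.split())


clean = TextPipeline(str.lower, strip_urls, strip_punctuation, collapse_whitespace)
result = clean("Check THIS out: https://example.com  It's great!!!")
```

Because each step is an ordinary function, the same steps can be reused for both training data and inference data, and new transformations can be added without changing the class. Note that removing punctuation also removes apostrophes, so "it's" becomes "its".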

monday

create a trained model

Train for Norwegian language

  • Find Norwegian data to train the model on.
  • Train the model on Norwegian reviews/text.
  • Find Norwegian inference data to test the model on.

We might be able to use the same self-supervised pre-training method that BERT was originally trained with, which does not need text with corresponding scores.
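BERT's original pre-training relied largely on masked language modeling, where the training targets come from the text itself. The sketch below is a simplified version of that masking step, using the 15% selection rate and the 80/10/10 replacement split from the BERT paper. The whitespace tokenization, the function name and the sample sentence are assumptions for illustration.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocabulary, mask_prob=0.15, rng=None):
    """Create (inputs, labels) for masked language modeling.

    Selected positions keep the original token as the label; all other
    positions get None and would be ignored by the loss.
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)              # 80%: mask the token
            elif roll < 0.9:
                inputs.append(rng.choice(vocabulary))  # 10%: random token
            else:
                inputs.append(token)                   # 10%: keep unchanged
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels


# Unlabeled Norwegian text is enough; no sentiment scores are needed here.
sentence = "filmen var veldig bra og skuespillerne var flinke".split()
vocabulary = sorted(set(sentence))
inputs, labels = mask_tokens(sentence, vocabulary, rng=random.Random(0))
```

A model pre-trained this way on Norwegian text would still need a smaller labeled dataset for fine-tuning on sentiment.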

Machine Learning - TF-IDF

Implement tf-idf labeling or similar.

  1. Decide on an algorithm to be used for text-summarization/keyword extraction.
  2. Implement algorithm and apply to data.

Clean training data

Need to remove punctuation and other redundant characters from training data.
Related to:

Database Architecture

Discuss database design and architecture based on requirements.

  • Tables
  • Database type (cloud?)
  • Ingestion method (python, manually, csv?...)
  • Query method (python, export as csv?...)

Go Object Oriented

Turn source code into object oriented code where valuable.

Think "how did we use the code?" and "how would we use the code further into the future?". Write code that encapsulates these two ideas.

Inference Data

Decide on and collect data for inference (YouTube, Twitter, etc.). This is the data that is to be used for visualization.

  1. Extract data based on target business ideas/values
  2. Transform and add features according to training data format
  3. Ingest into database?
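The transform step above could be sketched as mapping each platform's raw payload onto one shared record type. The field names of the raw dictionaries below are hypothetical placeholders, not the actual response formats of the YouTube or Reddit APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Post:
    """Common record shape for posts from any platform."""
    platform: str
    author: str
    posted_at: datetime
    text: str


def from_youtube_comment(raw):
    """Convert a (hypothetical) YouTube comment payload to a Post."""
    return Post(
        platform="youtube",
        author=raw["authorName"],
        posted_at=datetime.fromisoformat(raw["publishedAt"]),
        text=raw["text"],
    )


def from_reddit_comment(raw):
    """Convert a (hypothetical) Reddit comment payload to a Post."""
    return Post(
        platform="reddit",
        author=raw["author"],
        posted_at=datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc),
        text=raw["body"],
    )


# Made-up payloads, for illustration only.
posts = [
    from_youtube_comment({
        "authorName": "viewer1",
        "publishedAt": "2023-03-01T12:00:00+00:00",
        "text": "Loved this video",
    }),
    from_reddit_comment({
        "author": "redditor1",
        "created_utc": 1677672000,
        "body": "Not my favourite episode",
    }),
]
```

Once every source is converted to the same record type, cleaning, scoring and ingestion only need to handle one format.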
