Tianyi-Yan-twitter

Personal Information:

First name: Tianyi (Lorena)
Last name: Yan
Email address: [email protected]

Steps to run the project

Put the data source file in the same directory as main.py and load_tweets_data.py
Install the required libraries (see the Reference section at the end of this ReadMe file)
Download the GloVe word embeddings (glove.6B.50.txt) from https://nlp.stanford.edu/ and put the file directly under the project directory
Run main.py

Classification or regression:

I chose classification. The tweets are labeled by valence, which takes only the discrete values 0, 2, and 4, so I treated valence as a categorical variable rather than a continuous one.

Data preprocessing:

Use Pandas to load the csv data source file
Use Keras to tokenize the tweets
Use Keras to pad all tweet sequences
Load pretrained word embeddings and get word embeddings for all words in each tweet. GloVe word embeddings are used in this case.
Perform a simple one-hot encoding of the valence target value. (Although the data source file contains only 0s and 4s for valence, I still converted 0, 2, and 4 to binary vectors and used the encoded valences as the final labels.)
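The last two preprocessing steps can be sketched with NumPy alone. This is a minimal illustration, not the code from main.py: the function names, the label order (0, 2, 4 mapped to indices 0, 1, 2), and the toy data are assumptions. The extra row in the embedding matrix follows the Keras tokenizer convention that word indices start at 1 and index 0 is reserved for padding.

```python
import numpy as np

# Assumed label order: valence 0, 2, 4 map to class indices 0, 1, 2.
VALENCE_TO_INDEX = {0: 0, 2: 1, 4: 2}


def one_hot_valence(valences):
    """Turn valence labels (0, 2, 4) into one-hot rows of length 3."""
    indices = np.array([VALENCE_TO_INDEX[v] for v in valences])
    return np.eye(len(VALENCE_TO_INDEX), dtype=np.int64)[indices]


def build_embedding_matrix(glove_lines, word_index, dim=50):
    """Build a (vocab_size + 1, dim) matrix from GloVe text lines.

    Row i holds the vector of the word whose tokenizer index is i.
    Row 0 (padding) and words missing from GloVe stay all zeros.
    """
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        row = word_index.get(word)
        if row is not None and len(values) == dim:
            matrix[row] = np.asarray(values, dtype=np.float32)
    return matrix


# Toy data standing in for the real labels and the GloVe file.
labels = one_hot_valence([0, 4, 4, 2])
toy_glove = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"]
toy_word_index = {"good": 1, "bad": 2, "meh": 3}
embedding_matrix = build_embedding_matrix(toy_glove, toy_word_index, dim=3)
```

With the real file, `glove_lines` would be the open file handle of the downloaded embeddings and `dim` would be 50.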

Post preprocessing:

| Role | Column | Domain | Type |
| --- | --- | --- | --- |
| Feature | tweet | array of int (word embeddings) | array of int (word embeddings) |
| Target | valence | array of binary int (one-hot encoding) | array of binary int (one-hot encoding) |

Machine Learning model:

Type of model: supervised neural network
Input layer:
1. Embedding layer whose input size is the length of word_index from the tokenizer (i.e., the number of unique tokenized words) and whose output size is the embedding dimension, 50.
2. Reason: word indices need to be converted to GloVe word embeddings before further processing.
Hidden layers:
1. Several LSTM layers, each with 50 units (matching the output size of the embedding layer), with dropout rates of 0.2 and 0.5 and Tanh as the activation function.
2. Reason:
  - LSTM: tweets are sequences of words, so word order matters.
  - Dropout: reduces overfitting.
  - Tanh: the standard LSTM activation; its zero-centered output helps convergence. The LSTM's gating, rather than Tanh itself, is what mitigates the vanishing gradient problem.
Output layer:
1. A Dense layer with 3 units and SoftMax as the activation function
2. Reason:
  - Size of 3: there are 3 categories for valence: 0, 2, and 4
  - SoftMax: with 3 classes, SoftMax normalizes the outputs into a probability distribution over the classes.
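The architecture described above might look roughly like the following in Keras. This is a sketch under stated assumptions, not the model from main.py: the `build_model` name, the use of exactly two LSTM layers, passing the dropout rates through the LSTM `dropout` argument, and freezing the embedding weights are all guesses, since this README only says "several" layers with rates 0.2 and 0.5. The loss and optimizer are the ones named in this README.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(embedding_matrix, max_len):
    """Embedding -> stacked LSTMs -> 3-way softmax."""
    vocab_size, embedding_dim = embedding_matrix.shape
    model = keras.Sequential([
        keras.Input(shape=(max_len,)),
        # Initialise from pretrained vectors; freezing them is an assumption.
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=False,
        ),
        # Two LSTM layers is an assumption; the text only says "several".
        layers.LSTM(embedding_dim, activation="tanh", dropout=0.2,
                    return_sequences=True),
        layers.LSTM(embedding_dim, activation="tanh", dropout=0.5),
        # One output unit per valence class (0, 2, 4).
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model


# Random stand-in for a (vocab_size + 1, 50) GloVe matrix.
toy_matrix = np.random.rand(100, 50).astype("float32")
model = build_model(toy_matrix, max_len=20)
```

Every LSTM layer except the last needs `return_sequences=True` so that the next LSTM receives one vector per word instead of a single summary vector.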

Loss and Optimizer:

Loss: categorical_crossentropy
Optimizer: RMSprop
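To make the loss concrete, here is a small NumPy sketch of what SoftMax followed by categorical cross-entropy computes for one-hot labels. It is an illustration of the formulas only; Keras does this internally, and the function names and toy numbers are made up for the example.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())


# Two toy samples: raw scores for the classes (valence 0, 2, 4).
logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
y_true = np.array([[1, 0, 0], [0, 0, 1]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
```

Because the labels are one-hot, the loss for each sample reduces to the negative log of the probability assigned to the correct class.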

Split data:

I used train_test_split from sklearn to shuffle the data and split it into training and test sets, and then split the training set again into training and validation sets. The resulting ratio of training, validation, and test data is 3:1:1.
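One way to obtain a 3:1:1 ratio with two calls to train_test_split is to hold out 20% for testing and then 25% of the remaining 80% for validation. The exact arguments used in main.py are not stated here, so the `test_size` values, the `random_state`, and the toy arrays below are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the padded tweets and their labels.
X = np.arange(100).reshape(100, 1)
y = np.arange(100) % 2

# First split: 80% kept for training/validation, 20% for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# Second split: 25% of the remaining 80% is 20% of the whole data set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, shuffle=True, random_state=42
)
```

With 100 samples this leaves 60 for training, 20 for validation, and 20 for testing.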

Evaluation

Metrics: Accuracy
Result: ≈ 80.629% (see the screenshot in the Word document for further details)
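For one-hot labels, accuracy is the share of samples whose highest-probability class matches the labeled class. The sketch below shows that computation in NumPy; the `accuracy` helper and the toy predictions are illustrative and are not the project's results.

```python
import numpy as np


def accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the predicted class equals the true class."""
    predicted = y_prob.argmax(axis=1)
    actual = y_true_one_hot.argmax(axis=1)
    return float((predicted == actual).mean())


# Four toy samples; the third one is misclassified.
y_true = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]])
y_prob = np.array([
    [0.70, 0.10, 0.20],
    [0.20, 0.10, 0.70],
    [0.60, 0.10, 0.30],
    [0.90, 0.05, 0.05],
])

acc = accuracy(y_true, y_prob)
```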

Reference:

GloVe word embedding: https://nlp.stanford.edu/projects/glove/
Libraries used:
- pandas
- NumPy
- TensorFlow (Keras)
- scikit-learn
