An empirical study of how different machine learning models perform on a Google Play Store reviews dataset. A simple binary classification approach is adopted: identify whether a review is "energy-related" or not. The following models are compared:
- Naive Bayes
- Linear Classifier
- Support Vector Machine
- Extreme Gradient Boosting
- Shallow Neural Networks
- Deep Neural Networks
- Convolutional Neural Network (CNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Bidirectional RNN
- Recurrent Convolutional Neural Network (RCNN)
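As a minimal sketch of the binary setup, the baseline below pairs count-vector features with Multinomial Naive Bayes using scikit-learn. The toy reviews and labels are illustrative only, not the real dataset.

```python
# Illustrative baseline: count vectors + Naive Bayes for the
# "energy-related or not" binary task. Toy data, not the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = [
    "app drains my battery very fast",
    "great interface and easy to use",
    "phone gets hot and battery dies",
    "love the new update, works smoothly",
]
labels = [1, 0, 1, 0]  # 1 = energy-related, 0 = not

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)
pred = model.predict(["battery drains quickly"])[0]  # → 1 (energy-related)
```

The same pipeline shape applies to the other classical models: swap `MultinomialNB` for an SVM, linear classifier, or gradient-boosting estimator.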
- Count Vectors: A matrix in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of that term in that document.
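The count-vector matrix described above can be built with scikit-learn's `CountVectorizer` (an assumed tool choice; the source does not name a library). The three-document corpus is a toy stand-in.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["battery drains fast", "fast charging works", "battery works fine"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
# X is a sparse matrix: rows = documents, columns = unique terms,
# cells = raw frequency counts. Here X.shape == (3, 6).
```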
- Word-Level TF-IDF: A matrix representing the TF-IDF score of every term in each document.
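Word-level TF-IDF can be sketched the same way with `TfidfVectorizer` (again an assumed library choice) on the same toy corpus; each cell now holds a weighted score instead of a raw count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["battery drains fast", "fast charging works", "battery works fine"]
tfidf = TfidfVectorizer(analyzer="word")
X = tfidf.fit_transform(corpus)
# Same (documents x terms) layout as count vectors, but cells hold
# tf-idf weights; with the default L2 norm no cell exceeds 1.0.
```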
- N-gram-Level TF-IDF: N-grams are sequences of N consecutive terms; this matrix represents the TF-IDF scores of word n-grams.
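Switching to n-grams only changes the `ngram_range` argument. The choice of bigrams and trigrams below is illustrative; the source does not state which N was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["battery drains fast", "fast charging works", "battery works fine"]
# ngram_range=(2, 3): features are word bigrams and trigrams,
# e.g. "battery drains" and "battery drains fast".
tfidf_ngram = TfidfVectorizer(analyzer="word", ngram_range=(2, 3))
X = tfidf_ngram.fit_transform(corpus)  # (3 docs x 9 n-gram features)
```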
- Character-Level TF-IDF: A matrix representing the TF-IDF scores of character-level n-grams in the corpus.
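Character-level n-grams come from the same vectorizer with `analyzer="char_wb"` (n-grams drawn from within word boundaries); the 2–3 character range below is an assumed setting, not one stated in the source.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["battery drains fast", "fast charging works", "battery works fine"]
# Features are character 2- and 3-grams such as "ba", "bat", "att".
tfidf_char = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = tfidf_char.fit_transform(corpus)
```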
Combinations Tried:
- Count Vectors + Word Level TF-IDF
- Count Vectors + N-gram Level TF-IDF
- Count Vectors + Character Level TF-IDF
- Word Level TF-IDF + N-gram Level TF-IDF
- Word Level TF-IDF + Character Level TF-IDF
- N-gram Level TF-IDF + Character Level TF-IDF
- Count Vectors + Word Level TF-IDF + N-gram Level TF-IDF
- Count Vectors + Word Level TF-IDF + Character Level TF-IDF
- Count Vectors + N-gram Level TF-IDF + Character Level TF-IDF
- Word Level TF-IDF + N-gram Level TF-IDF + Character Level TF-IDF
- Count Vectors + Word Level TF-IDF + N-gram Level TF-IDF + Character Level TF-IDF
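The combinations above amount to concatenating the per-document feature rows. A minimal sketch, assuming scikit-learn vectorizers and SciPy sparse matrices, for the three-way "Count Vectors + Word-Level TF-IDF + Character-Level TF-IDF" case:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["battery drains fast", "fast charging works", "battery works fine"]
count_X = CountVectorizer().fit_transform(corpus)
word_X = TfidfVectorizer(analyzer="word").fit_transform(corpus)
char_X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(corpus)

# Stack the sparse matrices horizontally so each document's row
# concatenates all three feature sets into one wide feature vector.
combined = hstack([count_X, word_X, char_X])
```

The other combinations follow the same pattern with a different list of matrices passed to `hstack`.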
Word Embeddings:
A way of representing words and documents as dense vectors. The position of a word in the vector space is learned from text and is based on the words that surround it when it is used.
- GloVe (wiki-news-300d-2M.vec)
- FastText (wiki-news-300d-2M.vec)
- Word2Vec (wiki-news-300d-2M.txt)
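Pretrained `.vec` files like those listed above are plain text: a `<word count> <dimensions>` header followed by one `word v1 v2 ...` line per word. The sketch below parses that format into a `{word: vector}` lookup; a tiny 3-dimensional in-memory file stands in for the real 300-dimensional download.

```python
import io

import numpy as np

# Toy stand-in for a real .vec file such as wiki-news-300d-2M.vec.
fake_vec = io.StringIO(
    "3 3\n"
    "battery 0.1 0.2 0.3\n"
    "drains 0.4 0.5 0.6\n"
    "fast 0.7 0.8 0.9\n"
)
header = fake_vec.readline()  # "<word count> <dimensions>"
embeddings = {}
for line in fake_vec:
    parts = line.rstrip().split(" ")
    # First token is the word; the rest are its vector components.
    embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
```

With a real file, replace `fake_vec` with `open("wiki-news-300d-2M.vec", encoding="utf-8")`; the lookup then feeds the embedding layer of the neural models.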
Details are in the Final Report Sheet.