This is an exercise for sentiment analysis based on python-nltk.
Requirements: Python >= v3.6
Check this repo
git clone https://github.com/qiuyujx/food-review-nlp-exercise.git
Install foodrw
cli tool:
$ python setup.py install
You can always run foodrw --help
to see all the avilable options.
This will train a model with default name "default_model".
$ foodrw -t "data\train.txt"
Note: If default_model exists, this will overwrite it.
Train a model with customized model name. The code below will train a model named "my_model"
$ foodrw -t "data\train.txt" -m "my_model"
Classify a review text. It come with its probability (relavence scores) of being Positive/Negtive
$ foodrw -q "An extensive menu provides lots of options for breakfast."
Output:
This review is classified as **Positive**, with probability 0.971040.
Classify a file with multiple reviews. The file must contain one reviews each new line.
$ foodrw -f "data\test.txt"
Optionally, you can specify the name of the model want to use, as well as customize the path of the classified file name.
$ foodrw -f "data\test.txt" -m test_model -d "data\my_classified.txt"
- Remove punctuation
E.g.
Wow... Loved this place.
-> Wow Loved this place
- Tokenisation
Separate words and put them into an array/list
E.g.
Wow Loved this place
-> ['Wow', 'Loved', 'this', 'place']
- Remove stop words
Stop words are those used very commonly and meaningless, such as "the", "a", "this" and "that".
E.g.
['Wow', 'Loved', 'this', 'place']
-> ['Wow', 'Loved', 'place']
- POS (part of speech) labelling & filtering
POS indicates the category to which a word is assigned in accordance with its syntactic functions
E.g.
['Wow', 'Loved', 'place']
-> [('Wow', 'NNP'), ('Loved', 'VBD'), ('place', 'NN')]
- Words lemmatization
Recover word into its original form, such as plural to single or past tense to present tense
E.g.
[('Wow', 'NNP'), ('Loved', 'VBD'), ('place', 'NN')]
-> [('wow', 'NNP'), ('love', 'VBD'), ('place', 'NN')]
- Bag of words & review vectors generating
All the words from all reviews are collected into a list with their occurrence frequencies. These words are filtered based on their frequencies. That is, words are reserved only if they occurred frequently. Finally, a feature (every words) vector is generated.
For every single review, we'll look at words it contains. If any words in the feature list occur, the corresponding boolean value of the word is set to True, otherwise remain to be False.
Therefore, each review is converted into a feature vector. For example, the review Wow... Loved this place.
is converted into the vector as follows:
{
'wow': True,
'love': True,
'place': True,
'stopped': False,
'late': False,
'recommendation': False,
'selection': False,
...
'undercooked': False
}
- Training (Naive Bayes)
Naive Bayes Classifier Algorithm is a family of probabilistic algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of a feature.
We classify whether the review Wow... Loved this place.
is a positive review or a negative review.
We then have to calculate:
P(positive | wow love place) — the probability that the tag of a sentence is positive given that the sentence is Wow... Loved this place.
P(negative | wow love place) — the probability that the tag of a sentence is negative given that the sentence is Wow... Loved this place.
TODO: More algorithms such as logistic regression can be added.
- Relevance score
Because we are using Naive Bayes, the probability of the classification can be simply used as its "relevance score".
For example:
This review is classified as **Positive**, with probability 0.983745.
In the above output, the probability 0.983745
means this review has 98.37% probability to be a positive review.
Currently we are extracting words with POS noun, verb, adjective and adverb. This is because the training set is tend to be small. To avoid too many "empty" vectors, which means that no word can be found in the feature list, I involved all of these POS. However, when we have a relatively larger training set, we may only need to consider adjectives and adverbs which potentially make more sense in expressing positive/negative sentiments.
- Unit testing
- More algorithms other than Naive Bayes, such as Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Stochastic Gradient Descent and Support Vector Classifier.