This project is aimed at detecting sarcasm in news headlines. I have also developed a web app where a user can type a news headline (or any other text) and check whether it is sarcastic or not.
- Setting up your environment
- Project Motivation
- File Descriptions
- Instructions
- Results
- Conclusion
- Licensing, Authors, and Acknowledgements
Python 3.* is required to run this project. If you are running Anaconda, you'll also need the following extra libraries:
- Pandas
- Numpy
- Matplotlib
- Plotly
- Wordcloud
- NLTK
- Keras
Install the above dependencies using `pip install <dependency>`.
Recent advances in natural language sentence generation research have seen increasing interest in measuring negativity and positivity from the sentiment of words or phrases. However, the accuracy and robustness of results are often affected by untruthful sentiments that are sarcastic in nature, and this is often left untreated. Sarcasm detection is an important process that can help filter out noisy data (i.e., sarcastic sentences) from the training inputs used for natural language sentence generation.
All my work is in the notebook. The data folder contains two files, both of which are required by the notebook.
To run the web app:
- Go to the app directory: `cd app`
- Run the run.py file: `python run.py`
- Open your browser and go to http://localhost:3001
- Test normal headlines from here: https://www.sciencedaily.com/news/computers_math/artificial_intelligence/
- Test sarcastic headlines from here: https://bestlifeonline.com/funniest-newspaper-headlines-of-all-time/
Since I'm using Plotly, which uses iframes for visualizations, you won't be able to see them in the notebook on GitHub. Please download and open the HTML file in Firefox or Chrome.
I'm using accuracy to compare the models.
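Accuracy here is simply the fraction of predictions that match the true labels. A minimal sketch using scikit-learn (an assumption for illustration; the notebook may compute it differently), with toy labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]  # toy ground truth: 1 = sarcastic, 0 = normal
y_pred = [1, 0, 0, 1, 0]  # one mistake out of five predictions

# Fraction of matching entries: 4 correct out of 5
print(accuracy_score(y_true, y_pred))  # → 0.8
```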
- I tried three different models but was still not able to breach the 90% mark. The deep learning approach seems promising, but we are short of the data needed for more advanced deep learning models; even a basic LSTM started overfitting within just 10 epochs. So one thing that would really help is more data.
- Among the traditional machine learning methods, Naive Bayes works better than Random Forest in my case. Naive Bayes is a good algorithm for text classification. When dealing with text, it's very common to treat each unique word as a feature, and since the typical person's vocabulary is many thousands of words, this makes for a large number of features. The relative simplicity of the algorithm and its independent-features assumption make Naive Bayes a strong performer for classifying text.
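The word-as-feature setup described above can be sketched in a few lines with scikit-learn (an illustrative assumption; the headlines below are made up, not from the project's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy headlines (hypothetical examples): 0 = normal, 1 = sarcastic
headlines = [
    "scientists discover new planet in nearby system",
    "local man heroically eats entire pizza alone",
    "stock market closes higher after fed announcement",
    "area dog unsure why everyone keeps leaving house",
]
labels = [0, 1, 0, 1]

# CountVectorizer turns each unique word into a feature;
# MultinomialNB then treats those word counts as independent features.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(headlines, labels)

print(model.predict(["man eats entire pizza"]))
```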
- For the webapp, I'm going with Naive Bayes for the following reasons:
- It requires less model training time
- Naive Bayes model size is low and quite constant with respect to the data
- Naive Bayes can quickly adapt to changes in data, whereas we would have to rebuild the Random Forest every time
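The last point can be illustrated with scikit-learn's incremental `partial_fit` (again an assumption for illustration, not necessarily how the web app's model is trained): a Naive Bayes model can absorb a new batch of labelled headlines in place, with no full retrain.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer has no fitted vocabulary, so it suits streaming updates;
# alternate_sign=False keeps the counts non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
clf = MultinomialNB()

# Initial batch (hypothetical headlines): 0 = normal, 1 = sarcastic
batch1 = ["markets rally on good news", "man wins argument with cat"]
clf.partial_fit(vectorizer.transform(batch1), [0, 1], classes=[0, 1])

# Later, new labelled data arrives: update the same model in place
batch2 = ["new vaccine trial shows promise", "toddler negotiates bedtime extension"]
clf.partial_fit(vectorizer.transform(batch2), [0, 1])

print(clf.predict(vectorizer.transform(["man wins argument with cat"])))
```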