This is an SMS/email spam classifier that identifies whether a given text message is a potential advert, fraud, or scam and separates it from genuine text messages.
The dataset used in this project was fetched from Kaggle.
Link to dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
Big tech companies like Google build a spam classifier into their email systems to detect whether a received email is important or is spam sent by another company for advertising.
Whenever a user logs into another site or uses a product with the same email account, the company pushes promotions with or without the user's consent.
To deal with this massive problem, classification and detection are crucial for giving the user a good, hassle-free experience.
We have to break down the MLA into the following steps:
- Data Cleaning
- EDA
- Text Preprocessing
- Model Building
- Evaluation
- Improvement
- Deployment
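The Text Preprocessing step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it assumes lowercasing, tokenizing, and stop-word removal, and the names `STOP_WORDS` and `preprocess` are hypothetical. A real pipeline would more likely rely on NLTK's stop-word list and a stemmer such as `PorterStemmer`.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# NLTK's English stop-word corpus is far more complete.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "you", "your", "for", "of", "in"}


def preprocess(text: str) -> str:
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)


if __name__ == "__main__":
    print(preprocess("WINNER!! You have won a FREE prize. Call 0800 now to claim."))
```

The cleaned string, rather than the raw message, is what would then be passed to the vectorizer in the Model Building step.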
pip install nltk
pip install pandas
pip install scikit-learn
pip install numpy
pip install streamlit
The collections module ships with the Python standard library and needs no separate install.
After comparing the candidate models, Multinomial Naive Bayes is the best-performing algorithm, with the following metrics:
---------------------------
Accuracy Score: 0.9691
Confusion Matrix:
[[888 0]
[ 32 114]]
Precision Score: 1.0
These results were obtained with the following hyperparameters:
- max_features of the TF-IDF vectorizer set to 3000
- default parameters for MNB
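The reported scores can be re-derived from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, ordered ham then spam, with spam as the positive class); the variable names are illustrative only.

```python
# Confusion matrix as reported above, assuming scikit-learn's layout:
# rows = true class, columns = predicted class, order = [ham, spam].
confusion = [[888, 0],
             [32, 114]]

tn, fp = confusion[0]  # true ham: correctly kept, wrongly flagged as spam
fn, tp = confusion[1]  # true spam: missed, correctly flagged

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

Under that assumption, a precision of 1.0 means no legitimate message was flagged as spam, while 32 spam messages were missed, which corresponds to a recall of roughly 0.78.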
Run the main.ipynb file from top to bottom.
Enter the following command in the terminal:
streamlit run app.py
Accuracy has been calculated over different scenarios. However, we can further fine-tune the model using ensemble learning methods like VotingClassifier.
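A soft-voting ensemble along these lines might look like the sketch below. It is only an illustration under stated assumptions: the toy messages stand in for the Kaggle dataset, and the choice of Logistic Regression and Bernoulli Naive Bayes alongside Multinomial Naive Bayes is hypothetical, not something this project has evaluated.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages for illustration only; the real project trains on the Kaggle dataset.
texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent call now to win cash",
    "claim your free ringtone now",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after work",
    "thanks for the lunch today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Soft voting averages the predicted class probabilities of all estimators.
voting = VotingClassifier(
    estimators=[
        ("mnb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("bnb", BernoulliNB()),
    ],
    voting="soft",
)

# Same TF-IDF setting as the single-model setup: max_features=3000.
model = make_pipeline(TfidfVectorizer(max_features=3000), voting)
model.fit(texts, labels)

preds = model.predict(["free prize call now", "lunch at home today"])
print(preds)
```

Whether such an ensemble actually beats plain Multinomial Naive Bayes would have to be checked on the held-out data, with particular attention to whether precision stays at 1.0.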
Note that this is merely a prototype and is not optimized.