Dataset: Natural Language Processing with Disaster Tweets
In this disaster tweets classification task, I build several machine learning models to train and predict which tweets are about a real disaster (label = 1) and which ones aren't (label = 0).
This task is implemented on Kaggle in Python, with CPU and GPU T4 x2 accelerators.
1-data-preprocessing-disaster-tweets.ipynb: This notebook contains the data preprocessing steps for the training set and test set.
2-ml-disaster-tweets.ipynb: This notebook constructs and trains three machine learning models (XGBoost, SVM, and Random Forest) to classify the disaster tweets, and evaluates their performance with validation accuracy.
3-bert-disaster-tweets.ipynb: This notebook trains a BERT model to classify the disaster tweets and evaluates its performance with validation accuracy.
4-zero-shot-classification-disaster-tweets.ipynb: In this notebook, I explore zero-shot classification using the Hugging Face library.
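A minimal sketch of the classical-model pipeline in 2-ml-disaster-tweets.ipynb, assuming TF-IDF features. The toy tweets, parameters, and the use of scikit-learn's SVC and RandomForestClassifier are illustrative, not the notebook's exact settings; XGBoost (xgboost.XGBClassifier) would slot into the same loop and is omitted only to keep the sketch dependency-light.

```python
# Sketch: TF-IDF features + two of the classical models (SVM, Random Forest),
# compared by validation accuracy on a toy stand-in for the tweet data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative tweets and labels (1 = real disaster, 0 = not).
texts = [
    "Forest fire near La Ronge Sask. Canada",
    "Residents asked to shelter in place after explosion",
    "I love this new song, it's fire!",
    "My exam results are a total disaster lol",
] * 10
labels = [1, 1, 0, 0] * 10

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

models = {
    "SVM": SVC(kernel="linear", C=1.0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: validation accuracy = {scores[name]:.3f}")
```

The same pattern (fit on the training split, score on the held-out split) is what the validation-accuracy comparison in the notebook rests on.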
disaster tweets 3 tokenizers data (Testset): Contains the test set preprocessed with 1-data-preprocessing-disaster-tweets.ipynb.
- test_e.csv: preprocessed test set with TreebankWordTokenizer
- test_u.csv: preprocessed test set with WordPunctTokenizer
- test_s.csv: preprocessed test set with WhitespaceTokenizer
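The three files above differ only in the tokenizer applied. A minimal sketch of that distinction, assuming the notebook uses NLTK's implementations of these tokenizers; the example tweet and any further cleaning steps are illustrative.

```python
# Sketch: applying the three tokenizers from the preprocessing notebook to one
# tweet, to show how they split the text differently (e.g. WordPunctTokenizer
# separates "#" from "wildfire", WhitespaceTokenizer keeps "#wildfire" whole).
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer

tweet = "Forest fire near La Ronge, Sask. Canada! #wildfire"

tokenizers = {
    "TreebankWordTokenizer": TreebankWordTokenizer(),  # -> test_e.csv
    "WordPunctTokenizer": WordPunctTokenizer(),        # -> test_u.csv
    "WhitespaceTokenizer": WhitespaceTokenizer(),      # -> test_s.csv
}
for name, tok in tokenizers.items():
    print(name, "->", tok.tokenize(tweet))
```

These classes are pure regex-based tokenizers, so no `nltk.download()` call is needed for them.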
Prediction: Contains the predicted results on the test set, submitted at Kaggle.
- test_prediction.csv: The best result, predicted by the BERT model.
- SVM_penn_tokens_prediction.csv: The result predicted by the SVM model (data tokenized by TreebankWordTokenizer).
- zero_shot_submission.csv: The result predicted by a pre-trained model in a zero-shot manner.
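Each of these files is a Kaggle submission. A minimal sketch of producing one with pandas, assuming the competition's id/target submission format; the ids and labels below are placeholders, not real model output.

```python
# Sketch: writing a Kaggle submission file like test_prediction.csv.
# The competition's sample_submission.csv has columns "id" and "target".
import pandas as pd

test_ids = [0, 2, 3, 9]     # placeholder tweet ids from the test set
predictions = [1, 1, 0, 0]  # placeholder 0/1 labels from a trained model

submission = pd.DataFrame({"id": test_ids, "target": predictions})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```

The resulting CSV can be uploaded directly on the competition's submission page.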
- Tutorial documents
- tf + fine-tuned model: Disaster NLP: Keras BERT using TFHub
- tf + fine-tuned model, with data preprocessing such as removing emoji and punctuation: NLP - EDA, Bag of Words, TF IDF, GloVe, BERT
- pt + bert + self-defined architecture, cross-validation; useful as a reference for the code structure: BERT Baseline
- tf + cross-validation + ClassificationReport, EDA, text cleaning: NLP with Disaster Tweets - EDA, Cleaning and BERT
- Zero-shot + pre-trained model: Zero Shot Classification with huggingface