In this problem, we aim to correctly predict the tags of questions uploaded to the Stack Overflow forum. As such, the problem we have to solve is a multi-label classification problem. The dataset used can be found here: https://www.kaggle.com/datasets/stackoverflow/stacksample.
The dataset consists of the text of 10% of the questions and answers from the Stack Overflow programming Q&A website.
This is organized as three tables:
• Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
• Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
• Tags contains the tags on each of these questions.
In the first part of the problem, I performed some exploratory data analysis on the dataset in order to understand the quality and distributions of the data. I then "cleaned" the text data with several techniques:
• Transforming to Lowercase
• Discarding HTML tags & Punctuation
• Lemmatization
• Removing Stopwords
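The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library: the stopword set here is a tiny stand-in (in practice a full list such as NLTK's English stopwords would be used), and the lemmatization step is omitted since it requires an external resource like NLTK's WordNetLemmatizer.

```python
import re
import string

# Tiny illustrative stopword set -- an assumption; the real pipeline
# would use a full list (e.g. NLTK's English stopwords).
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "it", "how"}

def clean_text(text: str) -> str:
    """Apply the cleaning steps listed above (lemmatization omitted)."""
    text = text.lower()                                               # lowercase
    text = re.sub(r"<[^>]+>", " ", text)                              # discard HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]          # remove stopwords
    return " ".join(tokens)

print(clean_text("<p>How to sort a List in Python?</p>"))
# sort list python
```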
In the next part, I extracted the well-known TF-IDF features from the data. Next, I binarized the tags so that the labels of every example were represented by a list of binary values, with 1 indicating the presence of a tag and 0 its absence. The length of the list for every example was equal to the number of candidate tags (100, my choice). I used an 80-20 train-test split.
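The feature extraction and label binarization described above can be sketched with scikit-learn. The toy texts and tags below are placeholders, not the real data, and the real pipeline kept only the 100 most frequent tags:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the cleaned question texts and their tag lists
texts = ["how sort list python", "join two tables sql", "python read csv pandas"]
tags = [["python"], ["sql"], ["python", "pandas"]]

# TF-IDF features: one row per question, one column per vocabulary term
X = TfidfVectorizer().fit_transform(texts)

# One binary column per tag: 1 = tag present, 0 = absent
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)
print(mlb.classes_)   # columns are sorted tag names: pandas, python, sql
print(Y.tolist())     # [[0, 1, 0], [0, 0, 1], [1, 1, 0]]

# 80-20 train-test split, as used in the project
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
```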
Three different Machine Learning classifiers were used in order to test their performance. The classifiers were:
• Logistic Regression
• Multinomial Naive Bayes
• Linear Support Vector Classifier
The metrics used to evaluate their performance were:
• ROC - AUC
• F1 Score
• Jaccard Similarity Score
The classifier that performed best was the Linear SVC, with:
• F1 Score: 61%
• ROC-AUC Score: 74.8%
• Jaccard Similarity Score: 49.5%
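A multi-label classifier like the Linear SVC can be fitted and scored roughly as below, by wrapping it in a one-vs-rest scheme (one binary classifier per tag). The toy data and the `micro` averaging choice are assumptions for illustration; evaluating on the training set here is only to keep the sketch self-contained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, jaccard_score, roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned texts and tags (two tags only)
texts = ["python list sort", "python dict loop", "python pandas csv",
         "sql join tables", "sql select where", "sql group by"]
tags = [["python"]] * 3 + [["sql"]] * 3

X = TfidfVectorizer().fit_transform(texts)
Y = MultiLabelBinarizer().fit_transform(tags)

# One binary LinearSVC per tag column
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
pred = clf.predict(X)

micro_f1 = f1_score(Y, pred, average="micro")
micro_jac = jaccard_score(Y, pred, average="micro")
# ROC-AUC needs continuous scores, not hard predictions
micro_roc = roc_auc_score(Y, clf.decision_function(X), average="micro")
print(micro_f1, micro_jac, micro_roc)
```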
For this part, a pretrained BERT transformer model from Hugging Face was used to perform the same classification. I added a trainable classifier head on top of the pretrained model so that the output has the correct shape and matches our labels. The final model was trained for only 10 epochs and already outperformed the baseline classifiers. I used PyTorch for all of these processes. With some extra tuning and training, the results could likely improve further. The performance was evaluated using the same metrics. The scores on the test set are presented below:
• F1 Score: 70.7%
• ROC- AUC Score: 88.6%
• Jaccard Similarity Score: 64%
For the final part of the problem, I created a Dockerfile so that the project can be reproduced on any machine. Using the Dockerfile, a Docker image can be built and a container launched with all the required modules and libraries installed, in order to run the Jupyter Notebook. The required libraries are listed in the requirements.txt file.
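Such a Dockerfile might look roughly like the sketch below. This is an assumption about its contents, not the actual file; only the requirements.txt and Jupyter Notebook are mentioned in the project, while the base image, port, and flags are illustrative choices:

```dockerfile
# Hypothetical sketch -- not the project's actual Dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install the project's dependencies from requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the notebook and the rest of the project
COPY . .

# Expose Jupyter's default port and start the notebook server
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```

It could then be built and run with `docker build -t so-tags .` followed by `docker run -p 8888:8888 so-tags`.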