Stack-Overflow-Tag-Prediction

Description

In this project, we try to predict the tags of questions posted on the Stack Overflow forum. Because a single question can carry several tags at once, this is a multi-label classification problem. The dataset used can be found here: https://www.kaggle.com/datasets/stackoverflow/stacksample.

The dataset consists of the text of 10% of questions and answers from the Stack Overflow programming Q&A website.

This is organized as three tables:

• Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.

• Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.

• Tags contains the tags on each of these questions.

Exploratory Data Analysis (EDA) - Data Sanitizing

In the first part of the problem, I performed exploratory data analysis on the dataset to understand the quality and distribution of the data. I then cleaned the text data using several techniques:

• Transforming to Lowercase

• Discarding HTML tags & Punctuation

• Lemmatization

• Removing Stopwords
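The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the `clean_text` helper, the tiny `STOPWORDS` set, and the pluggable `lemmatize` argument (identity by default) are all assumptions made for the example. A real run would more likely use a full stopword list and a proper lemmatizer.

```python
import html
import re
import string

# Tiny illustrative stopword list (assumption); a real run would likely use
# a fuller list such as the one shipped with an NLP library.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "to", "of", "and", "i", "how", "do"}

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, lemmatize=lambda token: token):
    """Lowercase, strip HTML tags and punctuation, lemmatize, drop stopwords."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)        # drop HTML tags such as <p> or <b>
    text = html.unescape(text)          # decode entities such as &amp;
    text = text.translate(PUNCT_TABLE)  # replace punctuation with spaces
    tokens = [lemmatize(tok) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


cleaned = clean_text("<p>How do I sort a <b>List</b> in Python?</p>")
print(cleaned)
```

One caveat with this kind of cleaning: stripping all punctuation also collapses tokens like "c++" or "c#" into "c", which may matter for a tag prediction task.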

Feature Extraction - Base Machine Learning Classifiers

In the next part, I extracted TF-IDF features from the cleaned text. I then binarized the tags, so that the labels of each example are represented by a list of binary values, where 1 indicates that the tag is present and 0 that it is absent. The length of the list for every example is equal to the number of possible tags (100, my choice). I used an 80-20 train-test split.
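A minimal sketch of this step with scikit-learn is shown below, on toy data. The example texts, the variable names, and `TOP_N = 4` (instead of 100) are illustrative assumptions. Fitting the vectorizer on the training split only is a choice made here to avoid leakage; the write-up does not say in which order the notebook does it.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the cleaned question texts and their tag lists.
texts = [
    "sort list python",
    "merge dictionary python",
    "center div css",
    "flexbox layout css html",
    "read file python",
    "html form submit javascript",
    "javascript array filter",
    "css grid html",
    "python pandas dataframe",
    "javascript promise async",
]
tags = [
    ["python"], ["python"], ["css"], ["css", "html"], ["python"],
    ["html", "javascript"], ["javascript"], ["css", "html"],
    ["python", "pandas"], ["javascript"],
]

TOP_N = 4  # the write-up keeps 100 tags; 4 keeps the toy example readable

# Keep only the TOP_N most frequent tags and drop the rest from each question.
counts = Counter(tag for tag_list in tags for tag in tag_list)
top_tags = sorted(tag for tag, _ in counts.most_common(TOP_N))
filtered = [[t for t in tag_list if t in top_tags] for tag_list in tags]

# One binary column per kept tag: 1 = tag present, 0 = tag absent.
mlb = MultiLabelBinarizer(classes=top_tags)
Y = mlb.fit_transform(filtered)

# 80-20 train-test split.
X_train_txt, X_test_txt, Y_train, Y_test = train_test_split(
    texts, Y, test_size=0.2, random_state=42
)

# TF-IDF features, fitted on the training texts only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

print(list(mlb.classes_))
print(Y[3])
print(X_train.shape, X_test.shape)
```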

Three different machine learning classifiers were trained and compared. The classifiers were:

• Logistic Regression

• Multinomial Naive Bayes

• Linear Support Vector Classifier
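None of these three classifiers handles multi-label targets by itself in scikit-learn, so they need some wrapper. The sketch below assumes a one-vs-rest strategy (one binary classifier per tag), which is a common choice but is not confirmed by the write-up. The data is synthetic and stands in for the TF-IDF matrix and the binarized tags.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic non-negative features (MultinomialNB requires non-negative input,
# which TF-IDF values also satisfy).
X = rng.random((60, 8))
# Three synthetic labels, each driven by one of the first three features.
Y = (X[:, :3] > 0.5).astype(int)

models = {
    "Logistic Regression": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "Multinomial Naive Bayes": OneVsRestClassifier(MultinomialNB()),
    "Linear SVC": OneVsRestClassifier(LinearSVC()),
}

predictions = {}
for name, model in models.items():
    model.fit(X, Y)
    predictions[name] = model.predict(X)  # binary matrix with the same shape as Y
    print(name, predictions[name].shape)
```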

The metrics used to evaluate their performance were:

• ROC-AUC

• F1 Score

• Jaccard Similarity Score
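As a rough illustration, all three metrics are available in `sklearn.metrics` and accept binary indicator matrices. The averaging modes below (`micro` for F1 and ROC-AUC, `samples` for Jaccard) are assumptions, since the write-up does not state which ones were used. ROC-AUC is also often computed from continuous scores rather than hard 0/1 predictions; hard predictions are used here only to keep the example small.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score, roc_auc_score

# Four questions, three possible tags (rows = questions, columns = tags).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# Micro-averaged F1: pools true/false positives and negatives over all tags.
f1 = f1_score(y_true, y_pred, average="micro")

# Sample-averaged Jaccard: |intersection| / |union| of tag sets per question,
# then the mean over questions.
jaccard = jaccard_score(y_true, y_pred, average="samples")

# Micro-averaged ROC-AUC over the flattened label matrix.
auc = roc_auc_score(y_true, y_pred, average="micro")

print(f"F1: {f1:.3f}  Jaccard: {jaccard:.3f}  ROC-AUC: {auc:.3f}")
```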

The classifier that performed best was the Linear SVC, with the following scores:

• F1 Score: 61%

• ROC-AUC Score: 74.8%

• Jaccard Similarity Score: 49.5%

LLM Approach - Pretrained BERT transformer model from Hugging Face

For this part, a pretrained BERT transformer model from Hugging Face was used to perform the same classification. I added a trainable classifier head on top of the pretrained model so that the output has the correct shape and matches our labels. The final model was trained for only 10 epochs and already outperforms the base classifiers. I used PyTorch for all of these steps. With some extra tuning and training, the results could likely improve further. The performance was evaluated using the same metrics. The scores on the test set are presented below:

• F1 Score: 70.7%

• ROC-AUC Score: 88.6%

• Jaccard Similarity Score: 64%
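The general idea of a BERT encoder with a trainable classifier head can be sketched in PyTorch as below. This is not the notebook's actual model: the `BertTagClassifier` class, the use of the [CLS] embedding, the dropout rate, the learning rate, and `BCEWithLogitsLoss` are all assumptions. To keep the sketch runnable without downloading weights, it builds a tiny randomly initialised encoder from a `BertConfig`; real training would load a pretrained checkpoint with `BertModel.from_pretrained(...)` instead.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel

NUM_TAGS = 100  # number of tags kept in the write-up


class BertTagClassifier(nn.Module):
    """BERT encoder followed by a trainable linear head, one logit per tag."""

    def __init__(self, encoder, num_tags):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(encoder.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]  # embedding of [CLS]
        return self.head(self.dropout(cls_embedding))    # raw logits


# Tiny random encoder (assumption, for offline illustration only).
config = BertConfig(vocab_size=1000, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertTagClassifier(BertModel(config), NUM_TAGS)

# Fake batch: 4 questions of 16 token ids each, with random multi-hot labels.
input_ids = torch.randint(0, 1000, (4, 16))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (4, NUM_TAGS)).float()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()  # independent sigmoid per tag (multi-label)

# One training step.
model.train()
logits = model(input_ids, attention_mask)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(logits.shape, loss.item())
```

At prediction time, applying a sigmoid to the logits and thresholding (for example at 0.5) would give a binary tag vector that can be scored with the same metrics as the base classifiers.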

Docker Containerization

For the final part of the problem, I created a Dockerfile so that the project can be reproduced on any machine. Using the Dockerfile, a Docker image can be built and a container with all the modules and libraries installed can be launched to run the Jupyter Notebook. The required libraries are listed in the requirements.txt file.
