A document-tagging system that uses natural language processing (NLP) techniques and transformer-based pre-trained models to categorize documents efficiently. The system automatically generates relevant tags or categories from large volumes of unstructured text. Users upload documents, which are preprocessed and then classified into multiple relevant categories by the NLP models. Download Problem Statement PDF
Here I build a CI/CD pipeline that generates relevant tags from the given text data.
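Because a document can receive several tags at once, the classifier output has to be turned into a set of labels rather than a single class. The sketch below shows one common way to do that, assuming the model emits one raw score (logit) per tag and that each tag is decided independently with a sigmoid and a fixed threshold. The function name `logits_to_tags`, the 0.5 threshold, and the example tag names are illustrative assumptions, not taken from this project's code.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logits_to_tags(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.5,  # assumed cut-off; tune on validation data
) -> List[str]:
    """Return every label whose sigmoid probability reaches the threshold."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    return [
        label
        for label, score in zip(labels, logits)
        if sigmoid(score) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical tag vocabulary and model scores for one document.
    example_labels = ["finance", "sports", "technology"]
    example_logits = [2.0, -1.0, 0.3]
    print(logits_to_tags(example_logits, example_labels))
```

Raising the threshold trades recall for precision: fewer tags are emitted, but each one is more likely to be correct.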
DagsHub
WebApp
High Level Design Document (HLD)
Low Level Design Document (LLD)
- Step-01: Load the raw or custom data provided by the user from an AWS S3 bucket and save it to a designated directory.
- Step-02: Preprocess the raw data: handle missing and duplicate values, apply text preprocessing and vectorization, separate X and Y, create a tensor dataset, and split it into train, validation, and test sets.
- Step-03: Create the model (default: bert-base-uncased) and train it. Then save the fine-tuned model and tokenizer to a designated directory.
- Step-04: Evaluate the model on the test dataset and log the results to DagsHub using MLflow.
- Step-05: Create a web application for generating tags and host the entire application on AWS.
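The data-cleaning part of Step-02 can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes each record is a dict with hypothetical `text` and `tags` fields, drops rows with missing values and duplicate texts, encodes the tags as multi-hot vectors (the Y), and produces a seeded train/validation/test split. Tokenization and tensor creation are left out because they depend on the chosen model's tokenizer.

```python
import random
from typing import Dict, List, Sequence, Tuple


def clean_records(records: Sequence[dict]) -> List[dict]:
    """Drop rows with missing text or tags, and rows whose text is a duplicate."""
    seen = set()
    cleaned = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        tags = [t for t in (rec.get("tags") or []) if t]
        if not text or not tags:
            continue  # missing value
        key = text.lower()
        if key in seen:
            continue  # duplicate document
        seen.add(key)
        cleaned.append({"text": text, "tags": tags})
    return cleaned


def encode_tags(records: Sequence[dict]) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted tag vocabulary and one multi-hot row per record."""
    labels = sorted({t for rec in records for t in rec["tags"]})
    index: Dict[str, int] = {label: i for i, label in enumerate(labels)}
    y = []
    for rec in records:
        row = [0] * len(labels)
        for t in rec["tags"]:
            row[index[t]] = 1
        y.append(row)
    return labels, y


def split_indices(
    n: int, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices reproducibly and split them into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    raw = [
        {"text": "Stocks rallied today", "tags": ["finance"]},
        {"text": "stocks rallied today", "tags": ["finance"]},  # duplicate
        {"text": None, "tags": ["sports"]},                     # missing text
        {"text": "New GPU released", "tags": ["technology", "finance"]},
    ]
    rows = clean_records(raw)
    labels, y = encode_tags(rows)
    print(labels, y, split_indices(len(rows)))
```

Fixing the seed keeps the split reproducible across pipeline runs, which matters when evaluation results are compared between experiments.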
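For the evaluation in Step-04, the README does not say which metrics are logged. As an assumption, the sketch below computes micro-averaged precision, recall, and F1, a common choice for multi-label tagging, from multi-hot ground-truth and prediction rows. The function name `micro_prf` is hypothetical.

```python
from typing import Dict, List, Sequence


def micro_prf(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 over all (document, tag) pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each row must have the same number of labels")
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # tag correctly predicted
            elif t == 0 and p == 1:
                fp += 1  # tag predicted but not present
            elif t == 1 and p == 0:
                fn += 1  # tag present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = [[1, 0, 1], [0, 1, 0]]
    preds = [[1, 0, 0], [0, 1, 1]]
    print(micro_prf(truth, preds))
```

The returned dict has the shape that MLflow's `mlflow.log_metrics` accepts, so inside an active run it could be passed along directly; the tracking URI and credentials for DagsHub would need to be configured separately.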
# For Windows OS:
docker pull dibyendubiswas1998/document_tagging
docker run -p 8080:8080 dibyendubiswas1998/document_tagging
# For Ubuntu OS:
sudo docker pull dibyendubiswas1998/document_tagging
sudo docker run -p 8080:8080 dibyendubiswas1998/document_tagging