Document-Interaction-Assistant

Table of Contents

About The Project
Salient Features
Description
Data Preprocessing
Document Classification Model
Results
Information extraction model
Team

About the project

We present a model that identifies the set of documents contained in a PDF or image made up of multiple documents. The input PDF is first split into individual pages, and a CNN model classifies each page into its document category. The data in each document is then extracted using OCR (optical character recognition). The approach currently targets five document types, including voter ID, driver's license, PAN, and Aadhar. Each page of the input PDF must contain a single document; the only exception is the front and back of the same document. Initially, our classification model achieved an accuracy of 0.7342 on the training set and 0.7736 on the validation set, with gains of 0.6923 and losses of 0.8340.

To improve performance, we explored the pre-trained VGG16 and VGG19 models. We added layers on top of the pre-trained models and tuned their hyperparameters. As a result, VGG16 achieved a validation loss of 0.3677 and a validation accuracy of 0.8769.

In addition to this, we incorporated two more features:

1. Read Aloud:

Utilizes text-to-speech technology for accessibility.
Translates text into spoken words.
Supports auditory learners and those with visual impairments.
Enhances accessibility and consumability.

2. Document Summarization:

Aids time-constrained users by condensing lengthy papers.
Uses Hugging Face Transformers library for NLP models.
Provides clear and instructive document synopses.
Maximizes time efficiency by distilling crucial insights.

Salient Features

Hyperparameter tuning, regularization (early stopping, dropout), document split
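As a rough illustration of the document split feature, the sketch below groups per-page predictions into documents. It assumes that the front and back of one document may arrive as consecutive pages with the same predicted label; the function name `group_pages` and the label strings are hypothetical and not taken from the project code.

```python
from itertools import groupby


def group_pages(page_labels):
    """Group consecutive pages with the same predicted label into one document.

    Returns a list of (label, page_indices) tuples; page indices are 0-based.
    """
    documents = []
    page_index = 0
    for label, run in groupby(page_labels):
        run_length = len(list(run))
        pages = list(range(page_index, page_index + run_length))
        documents.append((label, pages))
        page_index += run_length
    return documents


# Hypothetical per-page predictions from the classifier.
labels = ["aadhar", "aadhar", "pan", "voter_id", "voter_id"]
for label, pages in group_pages(labels):
    print(label, pages)
```

One limitation of this simple grouping is that two different documents of the same type placed back to back would be merged into a single document.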

Tech stack used

Models: CNN, VGG16, VGG19, and the Tesseract OCR engine
Google TTS (text-to-speech), Hugging Face Transformers for text summarization
Framework: Keras

User Flow

Data Description

When we began searching for a suitable dataset, we found no publicly available dataset of identity documents, since they hold sensitive personal information. We did come across a dataset on Kaggle consisting of six folders: Aadhar Card, PAN Card, Voter ID, single-page Gas Bill, Passport, and Driver's License. We added a few more images to each folder: some were our own documents that we scanned manually, and the rest came from Google Images. The documents we classify and extract information from are drawn from these folders.

Data Preprocessing

Originally, we augmented the data with random horizontal and vertical flips to increase dataset size and diversity. We now use image data generators for both the train and test sets.
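A minimal sketch of random-flip augmentation is shown below. It is an illustration only: images are represented as plain lists of pixel rows, and the function names are hypothetical rather than the project's own.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows (top to bottom)."""
    return [row[:] for row in image[::-1]]


def random_flip(image, rng):
    """Return a copy of `image`, applying each flip with probability 0.5."""
    flipped = [row[:] for row in image]
    if rng.random() < 0.5:
        flipped = flip_horizontal(flipped)
    if rng.random() < 0.5:
        flipped = flip_vertical(flipped)
    return flipped


image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(0)  # seeded so the run is repeatable
augmented = [random_flip(image, rng) for _ in range(4)]
print(len(augmented))
```

In Keras, `ImageDataGenerator(horizontal_flip=True, vertical_flip=True)` applies the same kind of flips on the fly while batches are generated.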

Document Classification Model

CNN model

Various hyperparameters like the number of layers, neurons in each layer, number of filters, kernel size, the value of p in dropout layers, number of epochs, batch size, etc. were changed until satisfactory training and validation accuracy was achieved.
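One way to organize this kind of search is to enumerate candidate configurations from a small search space, as sketched below. The parameter names and values are hypothetical placeholders, and the source does not state that an exhaustive grid was used; each configuration would then be used to build and train a model.

```python
from itertools import product

# Hypothetical search space; the values actually tried are not given in the source.
search_space = {
    "num_filters": [32, 64],
    "kernel_size": [3, 5],
    "dropout_p": [0.25, 0.5],
    "batch_size": [16, 32],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 2 * 2 * 2 * 2 = 16 combinations
print(configs[0])
```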

CNN Model results

VGG16

The VGG architecture combines small convolution filters with a deep structure, which allows it to capture fine details. This is crucial for distinguishing between ID documents that often differ only subtly.

Four additional layers were added to the pre-trained model.

Before settling on our final model, shown above, we tweaked the pre-trained architecture until satisfactory results were achieved.

[Figure: Comparative results of identity document classification models]

Information extraction model

Following are the steps of OCR done on images:
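As one illustration of a post-OCR step, identifier fields can be pulled out of the recognized text with regular expressions. The sketch below assumes the OCR text has already been produced (for example by Tesseract); the function name `extract_fields` and the field names are hypothetical. It relies only on the standard PAN format (five letters, four digits, one letter) and the 12-digit Aadhar number.

```python
import re

# PAN: five uppercase letters, four digits, one uppercase letter.
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
# Aadhar: 12 digits, often printed in groups of four.
AADHAR_PATTERN = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")


def extract_fields(ocr_text):
    """Return any identifier numbers found in the OCR output."""
    fields = {}
    pan = PAN_PATTERN.search(ocr_text)
    if pan:
        fields["pan_number"] = pan.group()
    aadhar = AADHAR_PATTERN.search(ocr_text)
    if aadhar:
        # Store the number without the spaces between digit groups.
        fields["aadhar_number"] = re.sub(r"\s", "", aadhar.group())
    return fields


# Made-up OCR output for demonstration.
sample = "INCOME TAX DEPARTMENT\nPermanent Account Number\nABCDE1234F"
print(extract_fields(sample))
```

Real OCR output is noisy, so in practice patterns like these usually need extra tolerance for misread characters.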

Ongoing Improvements:

Interactive Summarization and Query Answering
Advanced Handwritten Text Extraction
Global Accessibility with Multilingual Support
Wider document classification systems covering legal documents
Exploring advanced CNN architectures

Team

Kanika Kanojia GitHub Linkedin
Princy Singhal GitHub Linkedin
