This project is an OCR-based application for extracting financial data from PDF documents. The application processes each document through a series of steps: document upload, preprocessing, OCR, table reconstruction, error detection, verification, and storage in MongoDB.
- Document upload and preprocessing
- Table structure detection and cell identification
- OCR for numeric data (trocr) and Arabic text in table headers (easyocr)
- Table reconstruction based on detected cells
- Error detection for sum and difference mismatches
- User and system verification of financial data
- Storage of verified data in MongoDB
- Clone the repository:
git clone https://github.com/ArifMariem/financial-ocr-app.git
- Install dependencies:
pip install -r requirements.txt
- Set up MongoDB and configure connection settings in config.py.
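The contents of config.py are not shown here, so the following is only a minimal sketch of what such a file could contain. The variable names, environment variables, and default values are all assumptions for illustration, not the project's actual settings.

```python
import os

# Hypothetical settings: every name and default below is an assumption.
# Reading from the environment keeps credentials out of version control.
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
MONGO_DB_NAME = os.environ.get("MONGO_DB_NAME", "financial_ocr")
MONGO_COLLECTION = os.environ.get("MONGO_COLLECTION", "verified_tables")
```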
- Upload a PDF document containing financial data.
- The application detects the table structure and identifies cells.
- Numeric data is OCR processed using trocr, and Arabic text in table headers is processed using easyocr.
- Reconstructed tables are generated based on the detected cells.
- The application checks the financial data for errors by verifying that the sums and differences of the component rows match the corresponding total row.
- Users verify the correctness of the data both manually and through system checks.
- Verified financial data is stored in MongoDB for future reference.
- Document Upload: Users initiate the process by uploading the document containing tabular data.
- Image Compression: To optimize processing speed and resource utilization, the uploaded images undergo compression without compromising data integrity.
- Gray Scale Conversion: The images are converted to grayscale to simplify subsequent image processing operations.
- Binarization: Utilizing thresholding techniques, the grayscale images are transformed into binary images, enhancing the contrast between foreground and background.
- Image Inversion: Inverting the binary images ensures that table structures are appropriately highlighted for subsequent detection processes.
- Horizontal and Vertical Lines Detection: Employing algorithms for line detection, the system identifies both horizontal and vertical lines within the document, aiding in table structure recognition.
- Lines Intersection Detection: Intersection points of detected lines are identified, contributing to the accurate delineation of table cells.
- Contours Detection: Contours are extracted from the binary images, outlining distinct shapes and structures present in the document.
- Cell Segmentation: Based on the identified contours, the system segments the document into individual cells, laying the groundwork for subsequent optical character recognition (OCR) processes.
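The binarization, inversion, line detection, and intersection steps above can be illustrated with a small dependency-free sketch. This is not the project's implementation: in practice these steps would more likely be done with OpenCV (listed in the requirements), and the function names, the threshold of 128, and the list-of-lists image representation are assumptions chosen to keep the example short.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)
Line = Tuple[int, int, int]


def binarize_inverted(gray: Image, threshold: int = 128) -> Image:
    """Threshold and invert in one pass: dark ink becomes 1, light paper 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def horizontal_lines(binary: Image, min_len: int) -> List[Line]:
    """Return (row, x_start, x_end) for each run of 1s at least min_len long."""
    lines = []
    for y, row in enumerate(binary):
        start = None
        for x, px in enumerate(row + [0]):  # trailing 0 closes a run at the edge
            if px and start is None:
                start = x
            elif not px and start is not None:
                if x - start >= min_len:
                    lines.append((y, start, x - 1))
                start = None
    return lines


def vertical_lines(binary: Image, min_len: int) -> List[Line]:
    """Return (column, y_start, y_end) by scanning the transposed image."""
    transposed = [list(col) for col in zip(*binary)]
    return horizontal_lines(transposed, min_len)


def intersections(h_lines: List[Line], v_lines: List[Line]) -> List[Tuple[int, int]]:
    """Return sorted (x, y) points where a horizontal and a vertical line cross."""
    points = []
    for y, x0, x1 in h_lines:
        for x, y0, y1 in v_lines:
            if x0 <= x <= x1 and y0 <= y <= y1:
                points.append((x, y))
    return sorted(points)
```

For a single bordered cell, the four intersection points are the cell's corners, which is what makes them useful for delineating cells before contour-based segmentation.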
- Binarization
- Top Hat Transformation: Applying the top-hat transformation helps in highlighting subtle details and fine structures within the document. This is particularly useful for improving the visibility of smaller elements, such as text and lines.
- Text Region Detection: The system identifies regions within the document that contain text using MSER (Maximally Stable Extremal Regions). This step aims to isolate areas where cells and textual information coexist.
- Region Filtering: The detected regions are filtered based on predefined criteria to focus on areas likely to contain cells. This helps eliminate unnecessary noise and ensures that only relevant portions of the document are considered for further analysis.
- Empty Cells Regions: Special attention is given to regions that appear to be empty cells. By distinguishing between regions containing text and those that are seemingly empty, the system improves its ability to accurately recognize and segment cells.
- Dotted Line Detection: Dedicated algorithms detect dotted lines within the cells. This step is crucial for recognizing boundaries or divisions within the cells that may be represented by dotted lines, enhancing the precision of cell segmentation.
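The region filtering step can be sketched as a simple pass over candidate bounding boxes. The criteria used here (minimum and maximum area, maximum aspect ratio) and their default values are assumptions for illustration; the project's actual "predefined criteria" are not specified. The `Region` class and `filter_regions` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """Axis-aligned bounding box of a detected region, in pixels."""
    x: int
    y: int
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height

    @property
    def aspect_ratio(self) -> float:
        return self.width / self.height


def filter_regions(
    regions: List[Region],
    min_area: int = 50,
    max_area: int = 50_000,
    max_aspect: float = 15.0,
) -> List[Region]:
    """Keep regions whose size and shape are plausible for a table cell."""
    kept = []
    for region in regions:
        if region.width <= 0 or region.height <= 0:
            continue  # degenerate box
        if not (min_area <= region.area <= max_area):
            continue  # speckle noise or a page-sized blob
        if region.aspect_ratio > max_aspect:
            continue  # long thin strip, more likely a ruling line than a cell
        kept.append(region)
    return kept
```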
The Optical Character Recognition (OCR) phase involves the extraction of textual information from the preprocessed document, with a tailored approach to handle both numerical strings and Arabic text in table headers.
- Fine-Tuning of trocr: The trocr OCR model undergoes a fine-tuning process specific to the characteristics of financial data. This step optimizes the model for accurately detecting numerical strings that represent financial information within the tables.
- Numerical String Detection: Leveraging the fine-tuned trocr model, the system identifies and extracts numerical strings containing crucial financial data. This process ensures precise recognition of numeric values within the document.
- EasyOCR for Arabic Text Detection in Headers: The easyocr module is specifically applied to detect and extract Arabic text within the headers of tables. This ensures accurate recognition of textual information in the Arabic language, contributing to the comprehensive extraction of financial data.
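Once a numeric string has been recognized, it usually needs to be validated and converted into a number before any arithmetic checks can run. The sketch below shows one way this could be done. The accepted formats (space or comma as thousands separator, a dot as decimal separator, parentheses or a leading minus for negatives) are assumptions, not documented behavior of the project, and `parse_amount` is a hypothetical helper.

```python
import re
from decimal import Decimal
from typing import Optional

# Digits in groups of three, optionally separated by a space or comma,
# with an optional fractional part after a dot.
_DIGITS = re.compile(r"\d{1,3}(?:[ ,]?\d{3})*(?:\.\d+)?")


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert an OCR'd numeric string to a Decimal, or None if malformed."""
    cleaned = text.strip()
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        # Accounting notation: (1,200) means -1200
        negative, cleaned = True, cleaned[1:-1]
    elif cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    if not _DIGITS.fullmatch(cleaned):
        return None  # e.g. a letter O misread in place of a zero
    value = Decimal(re.sub(r"[ ,]", "", cleaned))
    return -value if negative else value
```

Returning `None` instead of raising makes it easy to flag a cell for manual verification rather than aborting the whole document.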
Detected cells are organized to reconstruct tables, providing a clear representation of the financial data.
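As a rough illustration of this reconstruction step, the sketch below groups cells into rows by their vertical position and orders each row left to right. The cell tuple layout, the `row_tolerance` value, and the function name are assumptions; the project may use a different strategy, for example one based on the detected line intersections.

```python
from typing import List, Tuple

Cell = Tuple[int, int, int, int, str]  # x, y, width, height, recognized text


def reconstruct_table(cells: List[Cell], row_tolerance: int = 10) -> List[List[str]]:
    """Group cells into rows by y position, then order each row by x."""
    rows: List[List[Cell]] = []
    for cell in sorted(cells, key=lambda c: (c[1], c[0])):
        # A cell joins the current row if its top edge is close to the
        # top edge of the first cell in that row.
        if rows and abs(cell[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return [[c[4] for c in sorted(row, key=lambda c: c[0])] for row in rows]
```

The tolerance absorbs the few pixels of vertical jitter that detected cell boxes typically show within the same row.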
The application identifies errors in the financial data, such as discrepancies in row sums and differences. Users and system checks are performed to ensure data accuracy.
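A minimal sketch of such a check follows, assuming each component row carries a sign (+1 for rows that are added, -1 for rows that are subtracted) and that the result must equal the total row within a small tolerance. The function name, the sign convention, and the tolerance of 0.01 are assumptions made for this example.

```python
from decimal import Decimal
from typing import List, NamedTuple, Sequence


class Mismatch(NamedTuple):
    column: int
    expected: Decimal
    found: Decimal


def find_mismatches(
    rows: Sequence[Sequence[Decimal]],
    signs: Sequence[int],
    total_row: Sequence[Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> List[Mismatch]:
    """Compare the signed sum of the component rows with the total row."""
    mismatches = []
    for col, found in enumerate(total_row):
        expected = sum(
            (sign * row[col] for row, sign in zip(rows, signs)), Decimal(0)
        )
        if abs(expected - found) > tolerance:
            mismatches.append(Mismatch(col, expected, found))
    return mismatches
```

Each returned `Mismatch` identifies the column to show the user during verification, along with the computed and recognized totals.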
Verified financial data is stored in MongoDB, facilitating easy access and retrieval.
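The storage step might look roughly like the sketch below. The document fields and helper names are hypothetical, since the project's schema is not described. `store_record` accepts any object exposing `insert_one`, which is the method a PyMongo collection provides.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_record(filename: str, table: List[List[str]], verified_by: str) -> Dict[str, Any]:
    """Assemble the document to store; the field names are assumptions."""
    return {
        "source_file": filename,
        "table": table,
        "verified_by": verified_by,
        "stored_at": datetime.now(timezone.utc),
    }


def store_record(collection: Any, record: Dict[str, Any]) -> Any:
    """Insert the record and return the id assigned to it."""
    return collection.insert_one(record).inserted_id
```

With PyMongo installed, the collection could be obtained with something like `MongoClient(uri)[db_name][collection_name]`, where the URI and names come from config.py.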
The application is deployed using FastAPI, a modern, fast web framework for building APIs with Python. The deployment process involves the following steps:
- FastAPI Integration: The application is developed using FastAPI to create a robust and efficient API that serves as the backend for the OCR-based financial data extraction.
- Asynchronous Programming for Real-Time Updates: Asynchronous techniques are employed to enhance the user experience by providing real-time updates on processing progress. This allows users to track the status of document processing as OCR runs in the background.
- Concurrent Error Detection and Correction: The system is designed to detect and correct errors concurrently with the OCR process. This ensures a proactive approach to handling errors in real time, improving the overall accuracy and reliability of the financial data extraction.
- Visual Studio Code Integration: Visual Studio Code (VSCode) is the chosen Integrated Development Environment (IDE) for both development and deployment. VSCode's rich set of extensions and integrated terminal make it seamless to manage the codebase and deploy the FastAPI application.
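The idea of tracking progress while OCR runs in the background can be sketched with plain asyncio. This example deliberately leaves out FastAPI so that it stays self-contained; the function names, the shared progress dictionary, and the fake per-page OCR step are all assumptions for illustration.

```python
import asyncio
from typing import Dict, List, Tuple


async def process_document(pages: List[str], progress: Dict[str, int]) -> List[str]:
    """Pretend to OCR each page, updating shared progress after every page."""
    results = []
    progress["total"] = len(pages)
    for index, page in enumerate(pages, start=1):
        # Placeholder for real OCR work; awaiting yields to the event loop
        # so other coroutines can read the progress in between pages.
        await asyncio.sleep(0)
        results.append(f"ocr:{page}")
        progress["done"] = index
    return results


async def watch_progress(
    progress: Dict[str, int], snapshots: List[int], task: "asyncio.Task"
) -> None:
    """Record the number of finished pages until the OCR task completes."""
    while not task.done():
        snapshots.append(progress.get("done", 0))
        await asyncio.sleep(0)
    snapshots.append(progress.get("done", 0))


async def main() -> Tuple[List[str], List[int]]:
    progress: Dict[str, int] = {}
    snapshots: List[int] = []
    task = asyncio.create_task(process_document(["p1", "p2", "p3"], progress))
    await watch_progress(progress, snapshots, task)
    return await task, snapshots
```

Calling `asyncio.run(main())` returns the OCR results together with the progress snapshots. Real OCR inference is blocking, so in a web backend it would typically be moved off the event loop, for example with `asyncio.to_thread`, while a status endpoint reads the progress state.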
- Python 3.9.13
- trocr
- easyocr
- MongoDB
- OpenCV
For questions or support, contact us at [email protected].
- Special thanks to the developers of trocr and easyocr for their valuable OCR libraries.