This is a more accurate version that uses pypdfium2 and pytesseract (a Python wrapper for the Tesseract OCR engine) to render each PDF page to an image, split it into structured blocks, and convert those blocks back to text. It works well for PDFs with complex structured formatting, keeping the text coherent for the LLM.
photon48 / templerv2
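The PDF-to-image-to-text flow described above could look roughly like the following. This is a minimal sketch, not the repository's actual code: the function names `blocks_to_text` and `pdf_to_text`, the 300 DPI default, and the choice to group words by Tesseract's block, paragraph, and line numbers are all assumptions made for illustration. It assumes pypdfium2 (version 4 or later), pytesseract, Pillow, and a Tesseract binary are installed.

```python
from collections import defaultdict


def blocks_to_text(data):
    """Rebuild text block by block from a pytesseract image_to_data dict.

    Hypothetical helper: words are grouped by (block, paragraph, line),
    lines are joined with newlines, and blocks with blank lines.
    """
    lines = defaultdict(list)  # (block, paragraph, line) -> words in reading order
    for block, par, line, word in zip(
        data["block_num"], data["par_num"], data["line_num"], data["text"]
    ):
        if word.strip():  # skip the empty entries emitted for non-word levels
            lines[(block, par, line)].append(word)

    blocks = defaultdict(list)  # block -> list of line strings
    for (block, _par, _line), words in sorted(lines.items()):
        blocks[block].append(" ".join(words))
    return "\n\n".join("\n".join(blocks[b]) for b in sorted(blocks))


def pdf_to_text(path, dpi=300):
    """Render every page with pypdfium2, OCR it, and return the joined text."""
    # Imported lazily so blocks_to_text can be used without the OCR stack.
    import pypdfium2 as pdfium
    import pytesseract

    pdf = pdfium.PdfDocument(path)
    pages = []
    for i in range(len(pdf)):
        # PDF user space is 72 units per inch, so scale = dpi / 72.
        image = pdf[i].render(scale=dpi / 72).to_pil()
        data = pytesseract.image_to_data(
            image, output_type=pytesseract.Output.DICT
        )
        pages.append(blocks_to_text(data))
    return "\n\n".join(pages)
```

Calling `pdf_to_text("report.pdf")` (a placeholder path) would return one string with blocks separated by blank lines, which can then be passed to the LLM. Keeping Tesseract's block boundaries, rather than taking the flat output of `image_to_string`, is one plausible way to preserve multi-column or boxed layouts.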