
convert-to-pdf-service's Introduction

Convert documents to PDF

A Docker-powered service for converting files supported by LibreOffice to PDF.


Dependencies and requirements

  • Redis server for managing queues
  • Docker
  • Docker Compose
    • Note: on macOS, Docker Compose is installed with Docker

Quick Start

Start the service:

./run start

This script starts the service with the default configuration.

Default configuration values are as follows:

REDIS_HOST=localhost
REDIS_PORT=6379
SERVICE_HOST=127.0.0.1
SERVICE_PORT=5060
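
As a rough illustration of how these defaults might be consumed, the sketch below reads the four values from a mapping (the process environment by default) and falls back to the defaults listed above. It assumes the settings are supplied as environment variables; `ServiceConfig` and `load_config` are hypothetical names, not part of the service.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical container for the four documented settings."""
    redis_host: str = "localhost"
    redis_port: int = 6379
    service_host: str = "127.0.0.1"
    service_port: int = 5060


def load_config(env=None) -> ServiceConfig:
    """Build a config from a mapping (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    defaults = ServiceConfig()
    return ServiceConfig(
        redis_host=env.get("REDIS_HOST", defaults.redis_host),
        # Environment values are strings, so ports are converted explicitly.
        redis_port=int(env.get("REDIS_PORT", defaults.redis_port)),
        service_host=env.get("SERVICE_HOST", defaults.service_host),
        service_port=int(env.get("SERVICE_PORT", defaults.service_port)),
    )
```

Calling `load_config({"SERVICE_PORT": "8080"})` would override only the service port and keep the other defaults.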

Development and testing

A Python virtual environment is needed for some of the development tasks:

./run install_venv

Start the service for testing (with a Redis server included):

./run start

Check that the service is up and get general information, including the supported languages:

curl localhost:5060/info

Test that conversion to PDF works:

curl -X POST -F 'file=@./src/test_files/sample-english.pdf' localhost:5060 --output english.pdf
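
The same request can be sketched in Python using only the standard library. This is a minimal sketch, assuming the service returns the converted PDF as the response body (as the `--output` flag in the curl command suggests); `build_multipart` and `convert_to_pdf` are hypothetical helper names.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, content: bytes):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def convert_to_pdf(source_path: str, output_path: str, url: str = "http://localhost:5060") -> None:
    """POST a file to the service and write the response body to output_path."""
    with open(source_path, "rb") as source:
        body, content_type = build_multipart("file", os.path.basename(source_path), source.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # Assumption: a successful response body is the converted PDF.
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as output:
        output.write(response.read())
```

For example, `convert_to_pdf("./src/test_files/sample-english.pdf", "english.pdf")` mirrors the curl command above.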

To list all available commands, just run ./run. Some useful commands:

./run test
./run linter
./run check_format
./run formatter

Contents

Asynchronous OCR

  1. Upload the file to the service

    curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/upload/[namespace]


The endpoint sends a message to the Redis queue to be processed asynchronously by the worker.

  2. Retrieve the converted PDF

Upon completion of the OCR process, a message is placed in the ocr_results Redis queue. For now, this response uses Uwazi-specific terminology. To check whether the process for a specific file has completed:

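from rsmq import RedisSMQ

# Connect to the results queue (replace the bracketed placeholders with your Redis host and port)
queue = RedisSMQ(host=[redis host], port=[redis port], qname='ocr_results', quiet=True)
# With exceptions(False), execute() returns False instead of raising when no message is available
results_message = queue.receiveMessage().exceptions(False).execute()

# results_message['message'] contains the following information:
# {
#   "namespace": "namespace",
#   "task": "pdf_name.pdf",
#   "success": true,
#   "error_message": "",
#   "file_url": "http://localhost:5050/processed_pdf/[namespace]/[pdf_name]"
# }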


Once the message reports success, download the converted file from its file_url:

curl -X GET http://localhost:5050/processed_pdf/[namespace]/[pdf_name]
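
The message payload shown above can be decoded into a small typed object before acting on it. The following is a sketch only: it assumes the payload is the JSON document listed in the comment, delivered as a string or bytes; `ConversionResult`, `parse_result`, and `is_done` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class ConversionResult:
    """Fields of the results message, as listed in the payload above."""
    namespace: str
    task: str
    success: bool
    error_message: str
    file_url: str


def parse_result(raw) -> ConversionResult:
    """Decode the JSON payload of a results message (str or bytes)."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    return ConversionResult(
        namespace=data["namespace"],
        task=data["task"],
        success=bool(data["success"]),
        error_message=data.get("error_message", ""),
        file_url=data.get("file_url", ""),
    )


def is_done(result: ConversionResult, namespace: str, task: str) -> bool:
    """True when the message reports success for the given namespace and file."""
    return result.success and result.namespace == namespace and result.task == task
```

With the queue code above, this might be used as `parse_result(results_message["message"])` after checking that `results_message` is not False.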

HTTP server

The container HTTP server is written in Python 3.10 and uses the FastAPI web framework.

The endpoint code can be found in the file ./src/api/app.py.

Queue processor

The container Queue processor is written in Python 3.10 and is in charge of communication with the Redis queue.

The code can be found in the file ./src/worker/queue_processor.py; it uses the RedisSMQ library to interact with the Redis queues.

convert-to-pdf-service's People

Contributors

daneryl, dependabot[bot], elreplicante, rafapolit


convert-to-pdf-service's Issues

Install fonts for documents with non-Latin characters

To properly convert documents containing non-Latin characters, some fonts need to be installed on the Docker image:

fonts-indic fonts-noto fonts-noto-cjk fonts-arabeyes fonts-kacst fonts-freefont-ttf

LibreOffice always returns result code 0

We cannot detect LibreOffice subprocess errors from the process result code, as it is always 0. One way to check for errors is to capture the subprocess's stderr and check whether anything was written to it.
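
The stderr check described in this issue could look roughly like the sketch below. It is not the service's implementation: `ConversionError` and `run_checked` are hypothetical names, and the function treats any output on stderr as a failure, which is an assumption about how LibreOffice reports errors.

```python
import subprocess
import sys


class ConversionError(RuntimeError):
    """Hypothetical error raised when a command reports a problem on stderr."""


def run_checked(command, timeout=120):
    """Run a command, capture its stderr, and raise if anything was written to it."""
    completed = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    stderr = completed.stderr.strip()
    # The exit code alone is not trusted, so stderr content also counts as failure.
    if completed.returncode != 0 or stderr:
        raise ConversionError(stderr or f"exit code {completed.returncode}")
    return completed.stdout
```

A conversion call might then pass a command such as `["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, input_path]`, where `out_dir` and `input_path` are placeholders. Since LibreOffice may also print harmless warnings to stderr, checking that the expected output file exists could be a useful complement.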

Re-write README

The README still describes the legacy service implementation. Rewrite it to:

  • Show the new compose up methods
  • Remove the 'sync' method
  • Fix routes and names

Pin all dependencies

To have repeatable builds for the Docker image, it is good practice to pin dependencies.
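
One way to enforce pinning for Python dependencies is a small check that flags requirement lines without an exact version. This is a sketch under the assumption that dependencies are listed in a pip-style requirements file; `unpinned_requirements` is a hypothetical helper, and system packages installed in the image would need a separate check.

```python
import re

# Matches "name==version", optionally with extras such as "name[extra]==version".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+")


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to an exact version."""
    found = []
    for line in lines:
        requirement = line.split("#", 1)[0].strip()
        if not requirement or requirement.startswith("-"):
            continue  # skip blanks, comments, and pip options such as "-r other.txt"
        if not PINNED.match(requirement):
            found.append(requirement)
    return found
```

Running it over the lines of a requirements file returns an empty list when every dependency is pinned.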
