
pd3f's Introduction

pd3f

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, newlines and/or spaces. It uses language models to guess what the original text looked like.
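
For illustration, here is a minimal sketch of calling pd3f-core directly. The exact keyword arguments are assumptions based on the pipeline described above, and a running Parsr service is required.

# Hedged sketch: call pd3f-core directly; keyword arguments are assumptions.
from pd3f import extract

text, tables = extract("example.pdf", lang="de", tables=True)
print(text)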

pd3f is especially useful for languages with long words, such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.
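
A minimal sketch of talking to the API with Python requests, assuming the default local setup where the service listens on port 1616 (as in the client example quoted in the issues below):

# Minimal sketch, assuming the API listens on http://localhost:1616.
import requests

with open("example.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:1616", files={"pdf": f}, data={"lang": "de"}
    )
job_id = response.json()["id"]
# Poll /update/<job_id> until the JSON response contains a "text" field.
result = requests.get(f"http://localhost:1616/update/{job_id}").json()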

Documentation

Check out the full Documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information from them, so the results of this tool may not satisfy you. There will be more work to improve this software, but altogether it's unlikely that it will successfully extract all information anytime soon.

Here are some things that will be improved.

statistics about how long processing (per page) took in the past (see the sketch after this list)

  • calculate runtime based on job.started_at and job.ended_at
  • get the average runtime of jobs and store the data in a Redis list
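
A rough sketch of how this could look with RQ's job timestamps and a Redis list; the key name and helper functions are assumptions.

# Hedged sketch: derive a job's runtime from RQ's timestamps and keep a rolling
# history in a Redis list. RUNTIME_KEY and the helper names are assumptions.
from redis import Redis
from rq.job import Job

RUNTIME_KEY = "pd3f:runtimes"  # assumed Redis key

def record_runtime(job: Job, redis: Redis, max_entries: int = 1000) -> None:
    if job.started_at is None or job.ended_at is None:
        return
    runtime = (job.ended_at - job.started_at).total_seconds()
    redis.lpush(RUNTIME_KEY, runtime)
    redis.ltrim(RUNTIME_KEY, 0, max_entries - 1)  # keep only recent runtimes

def average_runtime(redis: Redis) -> float:
    values = [float(v) for v in redis.lrange(RUNTIME_KEY, 0, -1)]
    return sum(values) / len(values) if values else 0.0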

more information about the PDF (see the sketch after this list)

  • NER
  • entity linking
  • extract keywords
  • use textacy
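
An illustrative sketch of this kind of enrichment, using spaCy directly (textacy builds on spaCy); the model name is an assumption.

# Hedged sketch: NER and crude keyword candidates with spaCy. The German model
# name is an assumption; textacy could be layered on top for keyterm extraction.
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Die Bundesnetzagentur hat ihren Sitz in Bonn.")

entities = [(ent.text, ent.label_) for ent in doc.ents]  # named entities
keywords = [chunk.text for chunk in doc.noun_chunks]     # simple keyword candidates
print(entities, keywords)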

add more languages

  • check if Flair has a model (see the sketch after this list)
  • what to do if there is no fast model?
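
A hedged sketch of the check: try to load Flair's character language model for a language code and see whether it exists. The "<lang>-forward" naming follows Flair's convention for many languages.

# Hedged sketch: probe whether Flair publishes a character LM for a language.
# Loading raises an error if no such model exists.
from flair.embeddings import FlairEmbeddings

def has_flair_model(lang: str) -> bool:
    try:
        FlairEmbeddings(f"{lang}-forward")
        return True
    except Exception:
        return False

print(has_flair_model("it"))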

Python client

  • simple client based on requests (see the sketch after this list)
  • send whole folders
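
A sketch of such a client, under the assumption that the API behaves as in the example quoted in the issues below (POST a PDF to /, then poll /update/<id>). The folder name is arbitrary.

# Hedged sketch of a simple requests-based client that sends every PDF in a
# folder and waits for the results.
import time
from pathlib import Path

import requests

BASE = "http://localhost:1616"

def submit(pdf_path: Path, lang: str = "de") -> str:
    with open(pdf_path, "rb") as f:
        r = requests.post(BASE, files={"pdf": (pdf_path.name, f)}, data={"lang": lang})
    return r.json()["id"]

def wait_for_text(job_id: str) -> str:
    while True:
        j = requests.get(f"{BASE}/update/{job_id}").json()
        if "text" in j:
            return j["text"]
        time.sleep(1)

for pdf in Path("my_pdfs").glob("*.pdf"):
    print(pdf.name, wait_for_text(submit(pdf))[:80])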

Markdown / HTML export

  • go beyond text

use pdf-scripts / allow more processing (see the sketch after this list)

  • reduce size
  • repair PDF
  • detect if scanned
  • force to OCR again
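
A hedged sketch of what some of these steps could look like with OCRmyPDF's Python API; the options shown exist in OCRmyPDF, but how they would be wired into pd3f is an assumption.

# Hedged sketch: re-OCR and shrink a PDF with OCRmyPDF. force_ocr re-runs OCR
# even if a text layer exists; optimize=3 applies the most aggressive size
# reduction; deskew straightens scanned pages.
import ocrmypdf

ocrmypdf.ocr(
    "input.pdf",
    "output.pdf",
    language="deu",
    force_ocr=True,
    optimize=3,
    deskew=True,
)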

improve logs / get better feedback

  • show uncertainty of ML model
  • allow different log levels (see the sketch after this list)
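
A minimal sketch of making the log level configurable through an environment variable (the variable name is an assumption):

# Hedged sketch: read the desired log level from LOG_LEVEL (assumed name).
import logging
import os

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO").upper())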

Related Work

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be built. Right now, Docker + Poetry cannot cache the installs, so rebuilding the images every time is tedious.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

pd3f's People

Contributors

0xflotus, jfilter, paoloq


pd3f's Issues

Update to Parsr 1.2.2

Description

Parsr 1.2.2 supports a new ImageMagick policy which limits its disk usage when storing temporary data.
It would be useful to update to it, in order not to be forced into periodic manual disk cleanup.

Footnote numbering in output text generates non-numeric values

Using the experimental mode for converting footnotes to endnotes, the positions of the numeric superscript values in the input PDF are detected with very high accuracy. However, in the output TXT, the numeric values of the input superscripts are never reproduced. Instead, various non-numeric output values are generated.

Examples:
input: safeguarded.4 output: safeguarded.' (output value is the apostrophe symbol)
input: times.6 output: times.° (output value is the degrees symbol)
input: individual.14 output: individual.'* (output values are apostrophe, asterisk)
input: ahead15 output: ahead.!° (output values are exclamation mark, degrees symbol)

Using the settings in the web UI:

language | en
fast mode | False
extract tables | False
experimental mode | True

Improve Dockerfile handling

Don't push pd3f-ocr and the dashboard separately to Docker Hub. Instead, build the images (via inheritance + small adjustments) when running docker-compose up for the first time.

demo is down

Hi, I wanted to try out the demo on the website, but when I click on submit, a server error 500 is returned. It would be lovely to try out a difficult PDF I have been having trouble with and see if this fixes my issue.

Download of lm-mix-german-forward-v0.2rc.pt fails with status code 301

language: de
[ ] fast (but less accurate)
[ ] extract tables
[x] deduplicate page header/footer & transform footnotes to endnotes (experimental)

The text output when processing PDFs via docker-compose is:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 636.96
INFO:pd3f.doc_info:median line height: 20
INFO:pd3f.doc_info:median line space: 9.210000000000036
INFO:pd3f.doc_info:counter width: [(638.93, 21), (638.02, 19), (639.09, 19), (638.95, 17), (638.05, 16)]
INFO:pd3f.doc_info:counter height: [(20, 1861), (21, 602), (19, 395), (24.14, 31), (22, 21)]
INFO:pd3f.doc_info:counter lineheight: [(9.0, 487), (10.0, 454), (9.789999999999964, 158), (10.210000000000036, 138), (9.210000000000036, 127)]
ERROR:rq.worker:Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/rq/worker.py", line 936, in perform_job
rv = job.perform()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 684, in perform
self._result = self._execute()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 690, in _execute
return self.func(*self.args, **self.kwargs)
File "./app.py", line 273, in do_the_job
text, tables = extract(
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 53, in extract
e = Export(
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 171, in __init__
self.export()
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 239, in export
cleaned_header, cleaned_footer, new_footnotes = self.export_header_footer()
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 198, in export_header_footer
headers = remove_duplicates(headers, self.lang)
File "/usr/local/lib/python3.8/site-packages/pd3f/doc_info.py", line 136, in remove_duplicates
if single_score(only_text(r), lang) <= single_score(
File "/usr/local/lib/python3.8/site-packages/pd3f/dehyphen_wrapper.py", line 65, in single_score
scorer = get_scorer(lang)
File "/usr/local/lib/python3.8/site-packages/pd3f/dehyphen_wrapper.py", line 30, in get_scorer
scorer = FlairScorer(lang=lang)
File "/usr/local/lib/python3.8/site-packages/dehyphen/scorer.py", line 26, in __init__
self.lms = [FlairEmbeddings(x).lm for x in model_names]
File "/usr/local/lib/python3.8/site-packages/dehyphen/scorer.py", line 26, in
self.lms = [FlairEmbeddings(x).lm for x in model_names]
File "/usr/local/lib/python3.8/site-packages/flair/embeddings/token.py", line 567, in __init__
model = cached_path(base_path, cache_dir=cache_dir)
File "/usr/local/lib/python3.8/site-packages/flair/file_utils.py", line 90, in cached_path
return get_from_cache(url_or_filename, dataset_cache)
File "/usr/local/lib/python3.8/site-packages/flair/file_utils.py", line 166, in get_from_cache
raise IOError(
OSError: HEAD request failed for url https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-mix-german-forward-v0.2rc.pt with status code 301.

Stuck on find /to-ocr -name '*.pdf' -type f

Hey there, I'm running pd3f on a workstation and accessing via SSH tunnel to my local machine.

I'm using the browser GUI to (hopefully 🤞) OCR a book scan with several hundred pages. The book scan is already in PDF format - the file size is around 30MB.

In the log output of the web GUI I am seeing:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 174.0
INFO:pd3f.doc_info:median line height: 9.0
INFO:pd3f.doc_info:median line space: 4.159999999999968
INFO:pd3f.doc_info:counter width: [(409.44, 1036), (8, 1036), (409.68, 1014), (410.16, 982), (409.2, 974)]
INFO:pd3f.doc_info:counter height: [(10, 19582), (9, 11277), (8, 10001), (7, 1238), (9.24, 1180)]
INFO:pd3f.doc_info:counter lineheight: [(4.159999999999968, 3830), (4.160000000000025, 2457), (4.159999999999997, 2251), (4.399999999999977, 2118), (2.759999999999991, 1806)]
INFO:pd3f.export:export page #0

It's been at least 20 minutes since I started the conversion, so I'm surprised to see the tool is still on page #0.

In the terminal I'm seeing the following:

ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:00] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:01] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:02] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:03] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:04] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:05] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:06] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:07] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:08] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1

I'm new to pd3f but it looks like the ocr worker is stuck in a loop waiting to receive a file?

Any suggestions for troubleshooting are much appreciated.

Stuck forever in "INFO:root:setting up ocr"

Hi, after a couple of successful conversions, pd3f has run into the problem of getting stuck at "INFO:root:setting up ocr" forever, even after I ran docker-compose run --rm worker rqscheduler --host redis --burst.

Improve layout detection

First of all: pd3f is working great!
It's a wonderful tool. Thank you so much for creating it.

There's this small issue, though, that text blocks / columns aren't recognized as such. So articles written in columns and similar layouts are currently not recognized within their blocks.
Thus, highlighting or searching for text that spans more than one line is broken in these cases.

I'm not sure why this is the case, since I thought that Tesseract actually has proper layout analysis integrated ("Page Segmentation Mode", whose default should be "Fully automatic page segmentation, but no OSD").

Rotated PDFs don't work properly

When I use GNOME's Document Viewer app to orient a PDF correctly, pd3f doesn't detect the rotation (which is probably just stored as a metadata parameter) and produces scrambled text.

test files:

original.pdf
rotated.pdf
result

output:

uswnen Inu Is n aw 3x3, uabnys n m pun uaßnyau a n p je uornom 'wyeyasaß ses xXaypu n n g Jau n ap n spe ya n aqey OS 'Usas3]| Jaww n y9OU USII37 3Salp IS Uayjos pun 'ulas
[...]

OCR worker won't start on Windows

When running on Windows, the ocr-worker container won't start because ocr_folder.sh will have Windows line endings.

A temporary solution is to change the line endings of the file.
A permanent solution would be nice.

Folder of CSV files not found for document ID

I got this error when I tried to convert a PDF:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
ERROR:rq.worker:Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/rq/worker.py", line 936, in perform_job
rv = job.perform()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 684, in perform
self._result = self._execute()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 690, in _execute
return self.func(*self.args, **self.kwargs)
File "./app.py", line 273, in do_the_job
text, tables = extract(
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 50, in extract
input_json, tables_csv = run_parsr(
File "/usr/local/lib/python3.8/site-packages/pd3f/parsr_wrapper.py", line 86, in run_parsr
for page, table in parsr.get_tables_info():
File "/usr/local/lib/python3.8/site-packages/parsr_client/parsr_client.py", line 223, in get_tables_info
return [(table.rsplit('/')[-2], table.rsplit('/')[-1]) for table in ast.literal_eval(self.get_table(request_id=request_id).columns[0])]
File "/usr/local/lib/python3.8/ast.py", line 59, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/usr/local/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "", line 1
Error: Folder of CSV files not found for document ID 8fe98ff6ca95d82ef0fb214959ea78
^
SyntaxError: invalid syntax

It was just in the case of a single document, though.
Otherwise it's working perfectly.
Thank you so much for pd3f! 🙏

Example API call goes to Waiting forever

I can't seem to run the example script to get any result from the Docker image running pd3f. It seems to stay in the waiting state forever.
I cloned the repo, ran the ./dev.sh script and used the code below:

import time

import requests

files = {
    "pdf": (
        "CreditCardStatement (1).pdf.pdf",
        open(r"./test/pdfs/Admit Card.pdf", "rb"),
    )
}
response = requests.post("http://localhost:1616", files=files, data={"lang": "de"})
id = response.json()["id"]

while True:
    r = requests.get(f"http://localhost:1616/update/{id}")
    j = r.json()
    if "text" in j:
        break
    print("waiting...")
    time.sleep(1)
print(j["text"])

TERMINAL OUTPUT:

waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...

Terminal output for docker:

./dev.sh

[+] Running 2/0
 ✔ Network pd3f_default                Created 0.0s
 ⠋ Container pd3f-ocr_worker-1         Creating 0.0s
[+] Running 10/5
 ✔ Network pd3f_default                Created 0.0s
 ✔ Container pd3f-ocr_worker-1         Created 0.1s
 ✔ Container pd3f-parsr-1              Created 0.1s
 ✔ Container pd3f-redis-1              Created 0.1s
 ✔ Container pd3f-worker-1             Created 0.0s
 ✔ Container pd3f-web-1                Created 0.0s
 ! ocr_worker The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s
 ! parsr The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested      0.0s
 ! worker The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested     0.0s
 ! web The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested        0.0s
Attaching to ocr_worker-1, parsr-1, redis-1, web-1, worker-1
ocr_worker-1  | + mkdir -p /to-ocr
ocr_worker-1  | + sleep 1
redis-1       | 1:C 02 Mar 2024 14:23:56.137 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis-1       | 1:C 02 Mar 2024 14:23:56.137 # Redis version=6.2.14, bits=64, commit=00000000, modified=0, pid=1, just started
redis-1       | 1:C 02 Mar 2024 14:23:56.137 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
redis-1       | 1:M 02 Mar 2024 14:23:56.137 * monotonic clock: POSIX clock_gettime
redis-1       | 1:M 02 Mar 2024 14:23:56.137 * Running mode=standalone, port=6379.
redis-1       | 1:M 02 Mar 2024 14:23:56.137 # Server initialized
redis-1       | 1:M 02 Mar 2024 14:23:56.138 * Ready to accept connections
parsr-1       | Starting par.sr API : node api/server/dist/index.js
worker-1      | 14:23:56 Worker rq:worker:82fe312997394624a7e13aea0ea16aa5: started, version 1.5.2
worker-1      | 14:23:56 *** Listening on default...
worker-1      | 14:23:56 Cleaning registries for queue: default
web-1         |  * Serving Flask app "/app/app.py" (lazy loading)
web-1         |  * Environment: development
web-1         |  * Debug mode: on
web-1         |  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
parsr-1       | [2024-03-02T14:23:57] INFO  (parsr-api/12 on c03dbcc7bc0f): Api listening on port 3001!
web-1         |  * Restarting with stat
web-1         |  * Debugger is active!
web-1         |  * Debugger PIN: 334-469-980
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f

Configuring where queue and temporary results are stored

Hi,

I don't know if this is currently possible or not; maybe it just needs some easy docker change that I haven't figured out yet, or changing the location of a temporary directory in some script. In that case, perhaps it should be better documented.

I tried to extract text from ~20k PDFs over a weekend, but only managed to do so for 489 before running out of RAM on a computer with 32 GiB of RAM. Some of the docker containers seemed to have a ton of stuff under /tmp, which I think was a tmpfs.

./dev.sh fails with build failed

Some errors occurred during the build:

Get:113 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 qpdf amd64 9.1.1-1ubuntu0.1 [475 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 54.9 MB in 15s (3649 kB/s)
(Reading database ... 9619 files and directories currently installed.)
Preparing to unpack .../gcc-10-base_10.5.0-1ubuntu1~20.04_amd64.deb ...

and also

mv: cannot move '/etc/kernel/postinst.d/apt-auto-removal' to a subdirectory of itself, '/etc/kernel/postinst.d/apt-auto-removal.dpkg-remove'
dpkg: error processing archive /var/cache/apt/archives/apt_2.0.10_amd64.deb (--unpack):
 new apt package pre-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 /var/cache/apt/archives/apt_2.0.10_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
The command '/bin/sh -c apt-get update && apt-get upgrade -y' returned a non-zero code: 100
ERROR: Service 'ocr_worker' failed to build : Build failed

Timeout or similar on some weird PDFs

I've queued about 2000 PDFs. However, it seems to stop and get stuck about halfway through.

It's hard to diagnose which PDF it is, because that requires a painful divide and conquer. (Is there a log I can check so I can easily replicate it for you?)

A potential workaround would be a simple timeout. Things that time out can be removed from the queue, and the user can try them again later with a longer timeout. (This would also help identify problematic PDFs for future debugging.)
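
Since pd3f queues work with RQ, a per-job timeout could be set when enqueueing; a hedged sketch follows (the worker function's signature and the timeout value are assumptions).

# Hedged sketch: RQ supports a per-job timeout via job_timeout. The signature
# of do_the_job and the 30-minute value are assumptions.
from redis import Redis
from rq import Queue

def do_the_job(pdf_path: str) -> str:
    ...  # placeholder for pd3f's actual worker function in app.py

queue = Queue(connection=Redis(host="redis"))
job = queue.enqueue(do_the_job, "problematic.pdf", job_timeout="30m")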

/list/ in web UI

Could you add a /list/ command to the web UI that shows all the available IDs, and perhaps their state?
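
A hedged sketch of what such an endpoint could look like on top of the existing Flask app and RQ queue; the registries are standard RQ, everything else is an assumption.

# Hedged sketch of a /list/ route that reports job IDs and their state.
from flask import Flask, jsonify
from redis import Redis
from rq import Queue
from rq.registry import FinishedJobRegistry, StartedJobRegistry

app = Flask(__name__)
queue = Queue(connection=Redis(host="redis"))

@app.route("/list/")
def list_jobs():
    return jsonify(
        {
            "queued": queue.job_ids,
            "started": StartedJobRegistry(queue=queue).get_job_ids(),
            "finished": FinishedJobRegistry(queue=queue).get_job_ids(),
        }
    )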

Support for Italian language

Feature Description

It would be nice to have support for the Italian language, since it's already supported by both Tesseract OCR and Flair.
