
pd3f's Introduction

pd3f

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, newlines and/or spaces. It uses language models to guess what the original text looked like.
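
For illustration, here is a minimal sketch of calling pd3f-core directly. The exact keyword arguments are assumptions based on the pipeline described above, and a running Parsr service is required.

# Hedged sketch: call pd3f-core directly; keyword arguments are assumptions.
from pd3f import extract

text, tables = extract("example.pdf", lang="de", tables=True)
print(text)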

pd3f is especially useful for languages with long words, such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.
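
A minimal sketch of talking to the API with Python requests, assuming the default local setup where the service listens on port 1616 (as in the client example quoted in the issues below):

# Minimal sketch, assuming the API listens on http://localhost:1616.
import requests

with open("example.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:1616", files={"pdf": f}, data={"lang": "de"}
    )
job_id = response.json()["id"]
# Poll /update/<job_id> until the JSON response contains a "text" field.
result = requests.get(f"http://localhost:1616/update/{job_id}").json()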

Documentation

Check out the full Documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information from them, so the results of this tool may not satisfy you. There will be more work to improve this software, but altogether it's unlikely that it will successfully extract all information anytime soon.

Here are some things that will be improved.

statistics about how long processing (per page) took in the past (see the sketch after this list)

  • calculate runtime based on job.started_at and job.ended_at
  • get the average runtime of jobs and store the data in a Redis list
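
A rough sketch of how this could look with RQ's job timestamps and a Redis list; the key name and helper functions are assumptions.

# Hedged sketch: derive a job's runtime from RQ's timestamps and keep a rolling
# history in a Redis list. RUNTIME_KEY and the helper names are assumptions.
from redis import Redis
from rq.job import Job

RUNTIME_KEY = "pd3f:runtimes"  # assumed Redis key

def record_runtime(job: Job, redis: Redis, max_entries: int = 1000) -> None:
    if job.started_at is None or job.ended_at is None:
        return
    runtime = (job.ended_at - job.started_at).total_seconds()
    redis.lpush(RUNTIME_KEY, runtime)
    redis.ltrim(RUNTIME_KEY, 0, max_entries - 1)  # keep only recent runtimes

def average_runtime(redis: Redis) -> float:
    values = [float(v) for v in redis.lrange(RUNTIME_KEY, 0, -1)]
    return sum(values) / len(values) if values else 0.0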

more information about the PDF (see the sketch after this list)

  • NER
  • entity linking
  • extract keywords
  • use textacy
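
An illustrative sketch of this kind of enrichment, using spaCy directly (textacy builds on spaCy); the model name is an assumption.

# Hedged sketch: NER and crude keyword candidates with spaCy. The German model
# name is an assumption; textacy could be layered on top for keyterm extraction.
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Die Bundesnetzagentur hat ihren Sitz in Bonn.")

entities = [(ent.text, ent.label_) for ent in doc.ents]  # named entities
keywords = [chunk.text for chunk in doc.noun_chunks]     # simple keyword candidates
print(entities, keywords)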

add more languages

  • check if Flair has a model (see the sketch after this list)
  • what to do if there is no fast model?
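
A hedged sketch of the check: try to load Flair's character language model for a language code and see whether it exists. The "<lang>-forward" naming follows Flair's convention for many languages.

# Hedged sketch: probe whether Flair publishes a character LM for a language.
# Loading raises an error if no such model exists.
from flair.embeddings import FlairEmbeddings

def has_flair_model(lang: str) -> bool:
    try:
        FlairEmbeddings(f"{lang}-forward")
        return True
    except Exception:
        return False

print(has_flair_model("it"))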

Python client

  • simple client based on requests (see the sketch after this list)
  • send whole folders
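
A sketch of such a client, under the assumption that the API behaves as in the example quoted in the issues below (POST a PDF to /, then poll /update/<id>). The folder name is arbitrary.

# Hedged sketch of a simple requests-based client that sends every PDF in a
# folder and waits for the results.
import time
from pathlib import Path

import requests

BASE = "http://localhost:1616"

def submit(pdf_path: Path, lang: str = "de") -> str:
    with open(pdf_path, "rb") as f:
        r = requests.post(BASE, files={"pdf": (pdf_path.name, f)}, data={"lang": lang})
    return r.json()["id"]

def wait_for_text(job_id: str) -> str:
    while True:
        j = requests.get(f"{BASE}/update/{job_id}").json()
        if "text" in j:
            return j["text"]
        time.sleep(1)

for pdf in Path("my_pdfs").glob("*.pdf"):
    print(pdf.name, wait_for_text(submit(pdf))[:80])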

Markdown / HTML export

  • go beyond text

use pdf-scripts / allow more processing (see the sketch after this list)

  • reduce size
  • repair PDF
  • detect if scanned
  • force to OCR again
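
A hedged sketch of what some of these steps could look like with OCRmyPDF's Python API; the options shown exist in OCRmyPDF, but how they would be wired into pd3f is an assumption.

# Hedged sketch: re-OCR and shrink a PDF with OCRmyPDF. force_ocr re-runs OCR
# even if a text layer exists; optimize=3 applies the most aggressive size
# reduction; deskew straightens scanned pages.
import ocrmypdf

ocrmypdf.ocr(
    "input.pdf",
    "output.pdf",
    language="deu",
    force_ocr=True,
    optimize=3,
    deskew=True,
)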

improve logs / get better feedback

  • show uncertainty of ML model
  • allow different log levels (see the sketch after this list)
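
A minimal sketch of making the log level configurable through an environment variable (the variable name is an assumption):

# Hedged sketch: read the desired log level from LOG_LEVEL (assumed name).
import logging
import os

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO").upper())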

Related Work

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be built. Right now, Docker + Poetry cannot cache the installs, so rebuilding the images every time is tedious.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

pd3f's People

Contributors

0xflotus, jfilter, paoloq


pd3f's Issues

Update to Parsr 1.2.2

Description

Parsr 1.2.2 supports a new ImageMagick policy which limits its disk usage when storing temporary data.
It would be useful to update to it, in order not to be forced into periodic manual disk cleanup.

Footnote numbering in output text generates non-numeric values

Using the experimental mode for converting footnotes to endnotes, the positions of the numeric superscript values in the input PDF are detected with very high accuracy. However, in the output TXT, the numeric values of the input superscripts are never reproduced. Instead, various non-numeric output values are generated.

Examples:
input: safeguarded.4 output: safeguarded.' (output value is the apostrophe symbol)
input: times.6 output: times.° (output value is the degrees symbol)
input: individual.14 output: individual.'* (output values are apostrophe, asterisk)
input: ahead15 output: ahead.!° (output values are exclamation mark, degrees symbol)

Using the settings in the web UI:

language | en
fast mode | False
extract tables | False
experimental mode | True

Improve Dockerfile handling

Don't push pd3f-ocr and the dashboard separately to Docker Hub. Instead, build the images (via inheritance + small adjustments) when running docker-compose up for the first time.

demo is down

Hi, I wanted to try out the demo on the website, but when I click on submit, a server error 500 is returned. It would be lovely to try out a difficult PDF I have been having trouble with and see if this fixes my issue.

Download of lm-mix-german-forward-v0.2rc.pt fails with status code 301

language: de
[ ] fast (but less accurate)
[ ] extract tables
[x] deduplicate page header/footer & transform footnotes to endnotes (experimental)

The text output when processing PDFs via docker-compose is:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 636.96
INFO:pd3f.doc_info:median line height: 20
INFO:pd3f.doc_info:median line space: 9.210000000000036
INFO:pd3f.doc_info:counter width: [(638.93, 21), (638.02, 19), (639.09, 19), (638.95, 17), (638.05, 16)]
INFO:pd3f.doc_info:counter height: [(20, 1861), (21, 602), (19, 395), (24.14, 31), (22, 21)]
INFO:pd3f.doc_info:counter lineheight: [(9.0, 487), (10.0, 454), (9.789999999999964, 158), (10.210000000000036, 138), (9.210000000000036, 127)]
ERROR:rq.worker:Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/rq/worker.py", line 936, in perform_job
rv = job.perform()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 684, in perform
self._result = self._execute()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 690, in _execute
return self.func(*self.args, **self.kwargs)
File "./app.py", line 273, in do_the_job
text, tables = extract(
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 53, in extract
e = Export(
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 171, in __init__
self.export()
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 239, in export
cleaned_header, cleaned_footer, new_footnotes = self.export_header_footer()
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 198, in export_header_footer
headers = remove_duplicates(headers, self.lang)
File "/usr/local/lib/python3.8/site-packages/pd3f/doc_info.py", line 136, in remove_duplicates
if single_score(only_text(r), lang) <= single_score(
File "/usr/local/lib/python3.8/site-packages/pd3f/dehyphen_wrapper.py", line 65, in single_score
scorer = get_scorer(lang)
File "/usr/local/lib/python3.8/site-packages/pd3f/dehyphen_wrapper.py", line 30, in get_scorer
scorer = FlairScorer(lang=lang)
File "/usr/local/lib/python3.8/site-packages/dehyphen/scorer.py", line 26, in __init__
self.lms = [FlairEmbeddings(x).lm for x in model_names]
File "/usr/local/lib/python3.8/site-packages/dehyphen/scorer.py", line 26, in
self.lms = [FlairEmbeddings(x).lm for x in model_names]
File "/usr/local/lib/python3.8/site-packages/flair/embeddings/token.py", line 567, in __init__
model = cached_path(base_path, cache_dir=cache_dir)
File "/usr/local/lib/python3.8/site-packages/flair/file_utils.py", line 90, in cached_path
return get_from_cache(url_or_filename, dataset_cache)
File "/usr/local/lib/python3.8/site-packages/flair/file_utils.py", line 166, in get_from_cache
raise IOError(
OSError: HEAD request failed for url https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-mix-german-forward-v0.2rc.pt with status code 301.

Stuck on find /to-ocr -name '*.pdf' -type f

Hey there, I'm running pd3f on a workstation and accessing via SSH tunnel to my local machine.

I'm using the browser GUI to (hopefully 🤞) OCR a book scan with several hundred pages. The book scan is already in PDF format - the file size is around 30MB.

In the log output of the web GUI I am seeing:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 174.0
INFO:pd3f.doc_info:median line height: 9.0
INFO:pd3f.doc_info:median line space: 4.159999999999968
INFO:pd3f.doc_info:counter width: [(409.44, 1036), (8, 1036), (409.68, 1014), (410.16, 982), (409.2, 974)]
INFO:pd3f.doc_info:counter height: [(10, 19582), (9, 11277), (8, 10001), (7, 1238), (9.24, 1180)]
INFO:pd3f.doc_info:counter lineheight: [(4.159999999999968, 3830), (4.160000000000025, 2457), (4.159999999999997, 2251), (4.399999999999977, 2118), (2.759999999999991, 1806)]
INFO:pd3f.export:export page #0

It's been at least 20 minutes since I started the conversion, so I'm surprised to see the tool is still on page #0.

In the terminal I'm seeing the following:

ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:00] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:01] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:02] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:03] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:04] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:05] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:06] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:07] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:08] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1

I'm new to pd3f but it looks like the ocr worker is stuck in a loop waiting to receive a file?

Any suggestions for troubleshooting are much appreciated.

Stuck forever in "INFO:root:setting up ocr"

Hi, after a couple of successful conversions, pd3f has run into the problem of getting stuck at "INFO:root:setting up ocr" forever, even after I ran docker-compose run --rm worker rqscheduler --host redis --burst.

Improve layout detection

First of all: pd3f is working great!
It's a wonderful tool. Thank you so much for creating it.

There's this small issue, though, that text blocks / columns aren't recognized as such. So articles written in columns and similar layouts are currently not recognized within their blocks.
Thus, highlighting or searching for text that spans more than one line is broken in these cases.

I'm not sure why this is the case, since I thought that Tesseract actually has proper layout analysis integrated ("Page Segmentation Mode", whose default should be "Fully automatic page segmentation, but no OSD").

Rotated PDFs don't work properly

When I use GNOME's Document Viewer app to orient a PDF correctly, pd3f doesn't detect the rotation (which is probably just stored as a metadata parameter) and produces scrambled text.

test files:

original.pdf
rotated.pdf
result

output:

uswnen Inu Is n aw 3x3, uabnys n m pun uaßnyau a n p je uornom 'wyeyasaß ses xXaypu n n g Jau n ap n spe ya n aqey OS 'Usas3]| Jaww n y9OU USII37 3Salp IS Uayjos pun 'ulas
[...]

OCR worker won't start on Windows

When running on Windows, the ocr-worker container won't start because ocr_folder.sh will have Windows line endings.

A temporary solution is to change the line endings of the file.
A permanent solution would be nice.

Folder of CSV files not found for document ID

I got this error when I tried to convert a PDF:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
ERROR:rq.worker:Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/rq/worker.py", line 936, in perform_job
rv = job.perform()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 684, in perform
self._result = self._execute()
File "/usr/local/lib/python3.8/site-packages/rq/job.py", line 690, in _execute
return self.func(*self.args, **self.kwargs)
File "./app.py", line 273, in do_the_job
text, tables = extract(
File "/usr/local/lib/python3.8/site-packages/pd3f/export.py", line 50, in extract
input_json, tables_csv = run_parsr(
File "/usr/local/lib/python3.8/site-packages/pd3f/parsr_wrapper.py", line 86, in run_parsr
for page, table in parsr.get_tables_info():
File "/usr/local/lib/python3.8/site-packages/parsr_client/parsr_client.py", line 223, in get_tables_info
return [(table.rsplit('/')[-2], table.rsplit('/')[-1]) for table in ast.literal_eval(self.get_table(request_id=request_id).columns[0])]
File "/usr/local/lib/python3.8/ast.py", line 59, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/usr/local/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "", line 1
Error: Folder of CSV files not found for document ID 8fe98ff6ca95d82ef0fb214959ea78
^
SyntaxError: invalid syntax

It was just in the case of a single document, though.
Otherwise it's working perfectly.
Thank you so much for pd3f! 🙏

Example API call goes to Waiting forever

I can't seem to run the example script to get any result from the Docker image running pd3f. It seems to stay in the waiting state forever.
I cloned the repo, ran the ./dev.sh script and used the code below:

import time

import requests

files = {
    "pdf": (
        "CreditCardStatement (1).pdf.pdf",
        open(r"./test/pdfs/Admit Card.pdf", "rb"),
    )
}
response = requests.post("http://localhost:1616", files=files, data={"lang": "de"})
id = response.json()["id"]

while True:
    r = requests.get(f"http://localhost:1616/update/{id}")
    j = r.json()
    if "text" in j:
        break
    print("waiting...")
    time.sleep(1)
print(j["text"])

TERMINAL OUTPUT:

waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...
waiting...

Terminal output for docker:

./dev.sh

[+] Running 2/0
 ✔ Network pd3f_default                Created 0.0s
 ⠋ Container pd3f-ocr_worker-1         Creating 0.0s
[+] Running 10/5
 ✔ Network pd3f_default                Created 0.0s
 ✔ Container pd3f-ocr_worker-1         Created 0.1s
 ✔ Container pd3f-parsr-1              Created 0.1s
 ✔ Container pd3f-redis-1              Created 0.1s
 ✔ Container pd3f-worker-1             Created 0.0s
 ✔ Container pd3f-web-1                Created 0.0s
 ! ocr_worker The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s
 ! parsr The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested      0.0s
 ! worker The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested     0.0s
 ! web The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested        0.0s
Attaching to ocr_worker-1, parsr-1, redis-1, web-1, worker-1
ocr_worker-1  | + mkdir -p /to-ocr
ocr_worker-1  | + sleep 1
redis-1       | 1:C 02 Mar 2024 14:23:56.137 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis-1       | 1:C 02 Mar 2024 14:23:56.137 # Redis version=6.2.14, bits=64, commit=00000000, modified=0, pid=1, just started
redis-1       | 1:C 02 Mar 2024 14:23:56.137 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
redis-1       | 1:M 02 Mar 2024 14:23:56.137 * monotonic clock: POSIX clock_gettime
redis-1       | 1:M 02 Mar 2024 14:23:56.137 * Running mode=standalone, port=6379.
redis-1       | 1:M 02 Mar 2024 14:23:56.137 # Server initialized
redis-1       | 1:M 02 Mar 2024 14:23:56.138 * Ready to accept connections
parsr-1       | Starting par.sr API : node api/server/dist/index.js
worker-1      | 14:23:56 Worker rq:worker:82fe312997394624a7e13aea0ea16aa5: started, version 1.5.2
worker-1      | 14:23:56 *** Listening on default...
worker-1      | 14:23:56 Cleaning registries for queue: default
web-1         |  * Serving Flask app "/app/app.py" (lazy loading)
web-1         |  * Environment: development
web-1         |  * Debug mode: on
web-1         |  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
parsr-1       | [2024-03-02T14:23:57] INFO  (parsr-api/12 on c03dbcc7bc0f): Api listening on port 3001!
web-1         |  * Restarting with stat
web-1         |  * Debugger is active!
web-1         |  * Debugger PIN: 334-469-980
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker-1  | + sleep 1
ocr_worker-1  | ++ find /to-ocr -name '*.pdf' -type f

Configuring where queue and temporary results are stored

Hi,

I don't know if this is currently possible or not; maybe it just needs some easy docker change that I haven't figured out yet, or changing the location of a temporary directory in some script. In that case, perhaps it should be better documented.

I tried to extract text from ~20k PDFs over a weekend, but only managed to do so for 489 before running out of RAM on a computer with 32 GiB of RAM. Some of the docker containers seemed to have a ton of stuff under /tmp, which I think was a tmpfs.

./dev.sh fails with build failed

Some errors occurred during the build:

Get:113 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 qpdf amd64 9.1.1-1ubuntu0.1 [475 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 54.9 MB in 15s (3649 kB/s)
(Reading database ... 9619 files and directories currently installed.)
Preparing to unpack .../gcc-10-base_10.5.0-1ubuntu1~20.04_amd64.deb ...

and also

mv: cannot move '/etc/kernel/postinst.d/apt-auto-removal' to a subdirectory of itself, '/etc/kernel/postinst.d/apt-auto-removal.dpkg-remove'
dpkg: error processing archive /var/cache/apt/archives/apt_2.0.10_amd64.deb (--unpack):
 new apt package pre-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 /var/cache/apt/archives/apt_2.0.10_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
The command '/bin/sh -c apt-get update && apt-get upgrade -y' returned a non-zero code: 100
ERROR: Service 'ocr_worker' failed to build : Build failed

Timeout or similar on some weird PDFs

I've queued about 2000 PDFs. However, it seems to stop and get stuck about halfway through.

It's hard to diagnose which PDF it is, because that requires a painful divide and conquer. (Is there a log I can check so I can easily replicate it for you?)

A potential workaround would be a simple timeout. Things that time out can be removed from the queue, and the user can try them again later with a longer timeout. (This would also help identify problematic PDFs for future debugging.)
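
Since pd3f queues work with RQ, a per-job timeout could be set when enqueueing; a hedged sketch follows (the worker function's signature and the timeout value are assumptions).

# Hedged sketch: RQ supports a per-job timeout via job_timeout. The signature
# of do_the_job and the 30-minute value are assumptions.
from redis import Redis
from rq import Queue

def do_the_job(pdf_path: str) -> str:
    ...  # placeholder for pd3f's actual worker function in app.py

queue = Queue(connection=Redis(host="redis"))
job = queue.enqueue(do_the_job, "problematic.pdf", job_timeout="30m")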

/list/ in web UI

Could you add a /list/ command to the web UI that shows all the available IDs, and perhaps their state?
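
A hedged sketch of what such an endpoint could look like on top of the existing Flask app and RQ queue; the registries are standard RQ, everything else is an assumption.

# Hedged sketch of a /list/ route that reports job IDs and their state.
from flask import Flask, jsonify
from redis import Redis
from rq import Queue
from rq.registry import FinishedJobRegistry, StartedJobRegistry

app = Flask(__name__)
queue = Queue(connection=Redis(host="redis"))

@app.route("/list/")
def list_jobs():
    return jsonify(
        {
            "queued": queue.job_ids,
            "started": StartedJobRegistry(queue=queue).get_job_ids(),
            "finished": FinishedJobRegistry(queue=queue).get_job_ids(),
        }
    )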

Support for Italian language

Feature Description

It would be nice to have support for the Italian language, since it's already supported by both Tesseract OCR and Flair.
