Comments (7)
Back in June when I last tried this, I also resorted to using OCRmyPDF after trying zotero-ocr.
I used the Docker version and OCRed my PDF with the following command:
docker run -i --rm jbarlow83/ocrmypdf -O3 --jobs 2 - - <jp.pdf >jp_ocr.pdf
from zotero-ocr.
I'm using https://github.com/ocrmypdf/OCRmyPDF as a manual workaround. Maybe it will be sufficient for your case.
I second this request.
pdftoppm is producing PNGs. If it could be switched to JPEG, perhaps as an option, that would shrink the PDF.
Having just tested the plugin for the first time, I really feel this needs to be prioritised. Here are my file sizes for comparison:
Original file size: 59.1 MiB
OCR'ed file size: 438.7 MiB
(cf. combined size of the generated image files: 321.3 MiB)
It seems that not only the file format but also the resolution and possibly the colour space of the image files could use some tweaking. I doubt that most scanned or otherwise rasterised PDFs come as high as 300 dpi, so exporting PNGs at that resolution will definitely increase file size. Assuming that these files are only used to generate OCR information (i.e., colour elements from the original PDF will remain intact in the OCR'ed file), a compromise could be to export the page images as greyscale, which could roughly halve the file size and might also reduce image noise.
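To put rough numbers on the resolution and colour-space point, here is a minimal Python sketch of the uncompressed size of one rasterised page. The A4 page size and 8 bits per channel are my assumptions, not something taken from the plugin, and real PNG sizes will be smaller and depend on how well each page compresses, so treat it only as a guide to how the size scales.

```python
# Rough, uncompressed size of one rasterised page.
# Assumptions (not from the plugin): A4 page, 8 bits per channel.

A4_INCHES = (8.27, 11.69)  # width, height in inches


def raw_page_bytes(dpi: int, channels: int, page_inches=A4_INCHES) -> int:
    """Uncompressed bytes for one page at `dpi` with `channels` 8-bit channels."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    return width_px * height_px * channels


if __name__ == "__main__":
    for dpi in (300, 200, 150):
        rgb = raw_page_bytes(dpi, channels=3)   # colour
        grey = raw_page_bytes(dpi, channels=1)  # greyscale
        print(f"{dpi} dpi: RGB {rgb / 2**20:.1f} MiB, grey {grey / 2**20:.1f} MiB")
```

Before compression, greyscale is a third of the RGB size, and halving the resolution cuts the pixel count to about a quarter, so resolution is likely the bigger lever of the two.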
Exporting pages as JPGs can also contribute to smaller file sizes. If I save the same greyscale image as PNG and JPG (90% quality), the latter is only a third of the PNG file size. But lowering JPG quality might also impact the readability of the text. Issue #23 suggests making image resolution configurable by the user, which could be really helpful in reducing interim image file sizes, but at the same time it makes the process more fiddly, as I know I would end up trying different resolutions to balance OCR quality against file size...
It may be necessary instead to post-process the generated PDF; I can't tell from the Poppler documentation whether it offers any help with file compression. I resorted to an online PDF service, which reduced the 438.7 MiB PDF to 46.9 MiB with the OCR intact, but it would be nice to save the bandwidth and process the file locally, especially since I have close to a hundred PDFs in my Zotero library that need the OCR treatment...
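For processing a whole folder locally, a minimal Python sketch along these lines might do as a stopgap. It only wraps the OCRmyPDF Docker command quoted earlier in this thread (same image name and flags); the folder names and the `_ocr` output naming are hypothetical, and I have not tried it on a large library.

```python
import subprocess
from pathlib import Path

# Image name and flags are copied from the docker command quoted earlier in this thread.
IMAGE = "jbarlow83/ocrmypdf"


def build_command(optimize: int = 3, jobs: int = 2) -> list[str]:
    """Docker invocation that reads a PDF on stdin and writes the OCRed PDF to stdout."""
    return ["docker", "run", "-i", "--rm", IMAGE,
            f"-O{optimize}", "--jobs", str(jobs), "-", "-"]


def output_path(pdf: Path, out_dir: Path) -> Path:
    """Hypothetical naming scheme: paper.pdf -> out_dir/paper_ocr.pdf."""
    return out_dir / f"{pdf.stem}_ocr.pdf"


def ocr_folder(in_dir: Path, out_dir: Path) -> None:
    """Run the OCR command once per PDF in `in_dir`, writing results to `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(in_dir.glob("*.pdf")):
        target = output_path(pdf, out_dir)
        if target.exists():
            continue  # skip files already processed in an earlier run
        with pdf.open("rb") as src, target.open("wb") as dst:
            result = subprocess.run(build_command(), stdin=src, stdout=dst)
        if result.returncode != 0:
            target.unlink(missing_ok=True)  # do not keep a partial output
            print(f"failed: {pdf.name}")


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever the PDFs live.
    ocr_folder(Path("pdfs"), Path("pdfs_ocr"))
```

The OCRed copies are written to a separate folder so the originals stay untouched, and they would still have to be re-attached in Zotero by hand.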
Yeah, the size can become quite large. Tesseract itself creates the PDF from the input we give it. Tesseract would also run on JPG images, but the quality of the OCR output also depends on the input images and their colours.
The -O3 option from OCRmyPDF looks good, and this tool also uses Tesseract under the hood. Maybe one could consider switching the workflow to it...
Reducing the resolution like in pull request #41 would reduce the size a lot. Using JPEG 2000 files with lossy compression would allow really small PDF files. Ideally that should be implemented in Tesseract.