ub-mannheim / zotero-ocr

Zotero Plugin for OCR

License: GNU Affero General Public License v3.0

Shell 4.96% JavaScript 85.83% Fluent 0.44% HTML 8.78%

zotero-ocr's Introduction

Zotero OCR

This Zotero plugin adds OCR for the PDFs selected in Zotero. It can add a new PDF that includes the recognized text, a note containing only the recognized text, and HTML (hOCR) file(s). Tesseract OCR performs the text recognition itself.

Prerequisites

Tesseract OCR is installed
- for Windows see https://github.com/UB-Mannheim/tesseract/wiki
- for Linux, Mac see https://tesseract-ocr.github.io/tessdoc/Installation.html
pdftoppm from the poppler library is downloaded and installed
- some hints for the installation: https://github.com/UB-Mannheim/zotero-ocr/wiki/Install-pdftoppm

Installation

To install the extension:

Download the XPI file of the latest release.
In Zotero, go to Tools → Add-ons and drag the .xpi onto the Add-ons window.
If necessary, adjust the path to Tesseract in the add-on options.

Configuration

The configuration can be accessed under Tools → Zotero OCR Preferences (Zotero 6) or under Zotero → Settings (Zotero 7).

By default, the fields for the paths to the OCR engine and pdftoppm are empty, which means that the usual locations are searched. If that does not work, locate the tools on your machine and enter the full paths, including the names of the executables themselves.
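
The lookup described above can be sketched as follows. This is a minimal illustration in Python, not the plugin's own (JavaScript) code: the function name resolve_tool is hypothetical, and the "usual locations" are approximated here by a search of the PATH, which is an assumption.

```python
import os
import shutil


def resolve_tool(configured_path, tool_name):
    """Return the path of an executable, or None if none is found.

    configured_path: full path from the preferences ("" means not set).
    tool_name: bare name to search for on PATH, e.g. "tesseract".
    """
    if configured_path:
        # A configured path must point at the executable file itself,
        # not at the directory that contains it.
        if os.path.isfile(configured_path) and os.access(configured_path, os.X_OK):
            return configured_path
        return None
    # Empty setting: fall back to a PATH search (an assumed stand-in
    # for the plugin's "usual locations").
    return shutil.which(tool_name)
```

For example, resolve_tool("", "pdftoppm") returns whatever pdftoppm is first on the PATH, while a configured directory such as a bin folder yields None because it is not the executable itself.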

The default language/script to use with Tesseract must be one of the installed models. If you leave the field empty, the English model (eng) is used, which is always installed with Tesseract.
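
The fallback rule for the language field can be illustrated with a short sketch. The function choose_language and its behaviour on a missing model (raising an error) are assumptions made for illustration; only the empty-field default of eng comes from the text above. Tesseract's -l option accepts several models joined by "+", which the sketch takes into account.

```python
def choose_language(configured, installed, default="eng"):
    """Return the string to pass to Tesseract's -l option.

    configured: content of the language field (may be empty).
    installed: names of the installed models, e.g. ["deu", "eng", "osd"].
    """
    requested = configured.strip()
    if not requested:
        # Empty field: use the English model.
        return default
    # Tesseract accepts several models joined by "+", e.g. "deu+eng".
    missing = [part for part in requested.split("+") if part not in installed]
    if missing:
        # Hypothetical policy: refuse instead of silently guessing.
        raise ValueError("models not installed: " + ", ".join(missing))
    return requested
```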

These options are stored as Zotero preferences, which are also accessible through the Config Editor.

Build and release

Run the build.sh script, which creates a new .xpi file.

For a new release, run the script release.sh. It runs the build.sh script, commits the code changes for the new release and adds a tag. Push the updated local master branch and the tag to GitHub. Then publish a new release on GitHub and attach the .xpi file there.

Development

After any code changes, build a new extension file with ./build.sh <version>. Then, in Zotero, go to Tools, Add-ons, Install Add-on From File... and choose the newly created .xpi file. Zotero 6 will restart with the newly built add-on version. Zotero 7 does not require a restart and will activate it immediately.

If an error occurs, you will see more details in the Help, Report Error... dialog. For additional debugging messages, enable Help, Debug Output Logging in Zotero.

License

Zotero OCR is free and Open Source software. The source code is released under GNU Affero General Public License v3.

zotero-ocr's Issues

Resulting PDF is invalid (on M1 Macbook)

Not sure if this is the right place for a help request so feel free to move or delete this post. I've already posted over at the Zotero forum.

I cannot get this to work.

I've installed tesseract and poppler with Homebrew, installed the zotero plugin and set the path in the Zotero plugin to:

/opt/homebrew/Cellar/tesseract/4.1.1
/opt/homebrew/Cellar/poppler/21.03.0_1/bin

I can confirm that this is where those files live.

I've also copied pdftoppm to /Applications/Zotero.app/Contents/MacOS/pdftoppm, following a recommendation on the Zotero forum, although I have tried running the OCR both with and without this step.

When I run the plugin, an ocr file appears but when I try to open it I get the following error:

Format Error: Not a PDF or corrupted.

PDF.js v2.8.146 (build: 7dd64325d)
Message: Invalid PDF structure.

Help? ...
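
A quick way to narrow down an "Invalid PDF structure" report is to check whether the output file has the basic PDF framing at all. The sketch below is a heuristic only, with a hypothetical helper name (looks_like_pdf): it checks for the "%PDF-" header near the start and the "%%EOF" marker near the end, so an empty or truncated output file fails it. It does not validate the PDF and says nothing about why the file is broken.

```python
import os


def looks_like_pdf(path, window=1024):
    """Cheap framing check: '%PDF-' near the start, '%%EOF' near the end."""
    with open(path, "rb") as handle:
        head = handle.read(window)
        handle.seek(0, os.SEEK_END)       # jump to the end to learn the size
        size = handle.tell()
        handle.seek(max(0, size - window))
        tail = handle.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```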

Can't config the path to OCR engine on linux

tesseract-ocr is the engine used by Zotero OCR to recognize and extract content, but the installation guide only shows the path for Windows machines.

I tried whereis tessarect-ocr to locate the path for the engine and got /usr/share/tesseract-ocr as a result, but when I entered that in the Zotero preferences, it said no executable was found.

Does anyone know how to configure this?

thanks
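
A path such as /usr/share/tesseract-ocr may be a data directory rather than the program, and the preferences need the executable file itself. The following sketch classifies a candidate path so that this kind of mix-up becomes visible. The helper name describe_path is hypothetical, and this is an illustration, not the check the plugin performs.

```python
import os


def describe_path(path):
    """Classify a candidate path for the OCR engine setting."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        # A directory (for example a data folder) cannot be executed.
        return "directory"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "executable"
```

On many Linux systems the executable is found with shutil.which("tesseract") from Python, or with which tesseract in a shell. The path reported there is the one to enter.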

No pdftoppm.exe executable found

I right-clicked a PDF and chose OCR, but it says no exe was found in the Zotero folder. I installed Tesseract, and both are the newest version.

Is there a way to reduce the size of output pdf files?

I tested this OCR tool on some PDFs I downloaded from Academia.edu and the results were great. However, there's a problem: it increased the file size by A LOT (e.g., an 11.8 MB file turned into a 107 MB PDF).

I was hoping to use this tool to create searchable and conveniently highlightable PDFs using scans from physical books I have, but scanned files are normally huge on their own. When I ran zotero-ocr on one of my scans (257 MB) I ended up with a file that's over 2GB in size (it won't even open). :(

Is there something I can do to decrease the file sizes?

(I use Zotero 6.0.9 on Windows and have installed the latest version of zotero-ocr)
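
One plausible contributor, offered as an assumption and not a diagnosis: the pages are rendered to images before OCR, and the pixel count of a rendered page grows with the square of the resolution. The sketch below computes the uncompressed size of one page image. The function name and the 3 bytes per pixel (24-bit colour) are illustrative assumptions, and PNG or PDF compression will make the real files smaller than these figures.

```python
def raster_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size in bytes of one page rendered at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


letter_300 = raster_bytes(8.5, 11, 300)   # US Letter page at 300 DPI
letter_150 = raster_bytes(8.5, 11, 150)   # same page at half the resolution
print(letter_300, letter_150, letter_300 / letter_150)
```

Halving the resolution cuts the raw pixel data to a quarter, which is why a configurable resolution is a natural lever for output size.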

Arabic language "Saudi Arabia"

Does it support the Arabic language?

tesseract.js with WebAssembly

Thanks for this package! I just tried out this extension yesterday and found it great.

Just out of curiosity, is it possible to compile Tesseract into WebAssembly by using tesseract.js so that end users don't need to install the Tesseract package on their local machine?

I know little about JS and I am not sure if this is viable. But I am asking because I think the current installation might be a bit troublesome for non-programmer/developer who doesn't know how to install homebrew, how to use apt-get install, etc. I can see that many people ask questions on Zotero forum regarding the installation and I just wonder how to make it easier.

Thanks again.

Idea: Output searchable pdf and hocr/html version

Besides outputting a searchable PDF, we can also output an hOCR file, which can be attached to the Zotero item as an HTML file. I would suggest already including hocrjs as a script, so that this HTML representation can be visualized in an interesting way. This feature requires storing and keeping the images in a subfolder.

'Overwrite the initial PDF' duplicates Zotero pdf record

When running z-ocr with the options shown below, a single pdf is generated but the record in Zotero is duplicated.

I'm using ZotFile, as you can see from the filepaths.

The links aren't broken. Both records refer to the same pdf.

It's easy enough to fix manually, of course, but a bit messy.

producing scanned pdf from a wrong file

System is a MacBook Air, M1, Monterey, Zotero 5.0.97-beta.59+abe8c39c5, latest add-on version.

When I choose an attached PDF and click OCR selected PDF, after a while it does produce a scanned PDF file, but not the one I selected. No matter which one I choose to scan, I always get the same file.

Nothing happens when used

Hey devs,
first of all: Nice idea to implement OCR for Zotero! I have a problem with it: I installed the self-built .xpi file (based on commit 63dacf0) in Zotero (desktop version 5.0.77) and set the path to /usr/bin/tesseract, which is where my Tesseract is installed. But now, when I right-click on an entry and select "OCR selected PDF(s)", nothing happens.

Expected behaviour:
Something like: write the words or a set of the words in the textfield of the associated .pdf

Current behaviour:
Nothing noticeable happens, CPU usage stays low.

I'm on Manjaro Linux (Arch derivative) with kernel 5.3.8 with tesseract 4.1.0. My installed languages are deu, eng, osd and I tried to OCR an english pdf.

Am I doing something completely wrong?

Thanks in advance,
Lukas

Upload xpi file of a release with special content-type

Currently, when clicking on an XPI file in Firefox, Firefox first tries to install it as a Firefox plugin. Thus, one usually needs to right-click the file and save it before installing it in Zotero. The idea is to use a special content type which can then also be handled by the Zotero connector, see also zotero/zotero-connectors#297.

PDF does not auto-link to group libraries

Note: the process works fine, it's just that the OCR'd PDF does not link to group libraries so it seems like nothing at all has happened.

It works fine in 'my library.'

Group Library:

My Library:

Determine available languages and provide a choice for them

Currently, we use a fixed language as deu or eng for OCR with Tesseract. But in a lot of cases it is even better to choose script/Latin, or for old texts script/Fraktur. Also other languages or scripts should be available to choose from.

There are several things to consider here:

1. How can we find out the available languages for the currently installed tesseract? It is possible to run commands like tesseract --list-langs from the extension, but we cannot access the output or pipe the output somewhere from Zotero. Should we just ship a one-liner script (a shell script for Linux/Mac and a bat file for Windows) which calls the command above and pipes its output to a file that we can then analyze? Other ideas?
2. It is possible to have some general options and define a standard model there. In the settings pane you can then also change this model depending on the languages you have installed (see 1.).
3. It is possible to analyze the language field of each Zotero entry to choose a different option. This would allow, for example, using the deu model for German texts and the eng model for English texts. However, this might not always be that simple. For example, for older German texts one should maybe use the script/Fraktur model instead, and even the script/Latin model is quite often better for texts that include names in foreign languages etc.
4. Maybe it is better to ask before each call which language to choose etc. Then you can manually select all the entries which can be recognized with the same language. Moreover, one could possibly have some more Tesseract options to toggle on/off etc. What do you think?
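
The parsing half of the first point can be sketched outside Zotero. The example assumes the usual output shape of tesseract --list-langs: a header line beginning with "List of available languages" followed by one model name per line. The function names are hypothetical, and the subprocess call only illustrates what a helper script could do; it does not solve the problem of reading process output from within Zotero.

```python
import subprocess


def parse_list_langs(output):
    """Extract model names from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line; everything else is taken as a model name.
    return [line for line in lines
            if not line.startswith("List of available languages")]


def installed_languages(tesseract="tesseract"):
    """Run Tesseract and return its installed models (needs Tesseract installed)."""
    result = subprocess.run([tesseract, "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Precaution (assumption): fall back to stderr if stdout is empty.
    return parse_list_langs(result.stdout or result.stderr)
```

The resulting list could populate a choice in the settings pane or be used to validate a per-item language before OCR starts.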

CC @stweil @luerhard

OCR option not in Z7 context menu

I'm using Zotero 7 on a mac, trying to OCR a pdf on Zotero 7.
This was working fine a week ago in Zotero 6, with the latest version of the addon installed and all the same settings (i.e. tesseract pathways etc).
I was waiting for OCR and translate addons to work with Z7 - now that they both do (thank you!!) I've moved across.
The addon appears to be installed properly, and the settings correct in the Zotero preferences (as far as I can tell). So has something changed in the latest version of Zotero or the addon that means it's not appearing in the context menu?

Idea: pdfimages from poppler (xpdf) to convert pdf to images

Zotero already uses pdfinfo and pdftotext from the poppler (xpdf) library. Modified versions of these two command-line tools are built in https://github.com/zotero/cross-poppler and shipped with Zotero. After the Zotero installation, these two tools are in the Zotero program folder, e.g. C:\Program Files (x86)\Zotero\.

If we also have pdfimages in this folder, then it can be used similarly from within Zotero. For a first test, I would suggest downloading it manually and placing it in that folder. If we see that it all works together, we can then think about whether we can install it during the Zotero plugin installation.

Example call:

pdfimages.exe -png ../../Funke_1996_Meth.pdf out/funke

Is XPDF an alternative of poppler tools?

Zotero OCR Fails with Linked Attachments

When I run Zotero OCR with PDFs inside the Zotero storage folder, things go as expected. When instead Zotero has only a link to a PDF stored elsewhere, the OCR fails to finish.

OCR run on a linked PDF can create

a Zotero note for the OCR text, but this note remains empty and never gets text added to it,
a .ocr.pdf file, but this file fails to open with the error "[PXCLib]: Required value not found,"
a .ocr.txt file, but this file remains empty.

Just looking at the behavior on the screen (which may or may not be transparent to what's happening under the hood), it looks like pdftoppm.exe fails to launch when the OCR process is attempted on a linked PDF.

If Zotero OCR could

handle linked PDFs the same as it does stored PDFs, that would be ideal, or
produce an error when OCR is attempted on a linked PDF that would ask the user to make the linked PDF a stored PDF in order to proceed with the conversion, that would be helpful.

Thanks so much for considering this request!

Zotero 6.0.9
Zotero OCR 0.6.0
Poppler 22.04
Tesseract 5.2.0.20220712 (64-bit)

Corrupted PDF

I got Tesseract and poppler to work, but the process always stops on the fourth page for multiple PDFs, and the resulting PDF is corrupted when I try to open it in Zotero, in the browser or with a different PDF viewer.

Error Message
PDF.js v2.8.o (build: fdde957)
Message: Invalid PDF structure.

Issue with missing .dll Files

I'm way out of my depth here, so I'm not sure what to do:

I've followed the instructions and the Tesseract install was no problem.

I found a StackOverflow tutorial to install the Poppler files.

I've entered the paths to Tesseract and Poppler in the Zotero extension window, but when I try OCR'ing a file in Zotero via right-click I get a series of error messages about missing .dll Files: freetype.dll, libcurl.dll, zlib.dll, openjp2.dll.

It says to reinstall, but I'm not sure what to reinstall.

Any advice?
Thanks!

No line breaks in the attached note

This happens although there are line breaks etc. in the recognized txt file.

Changed language to chi_sim_vert, OCR does not respond

① installed tesseract as well as poppler
② entered the path of two exe's in the zotero ocr plugin
③ in order to ocr the Chinese pdf, entered chi_sim_vert in the language field

④ ran the plugin: only a black window popped up, and no results were produced

Documents for configuring Linux and Mac have moved, suggestion to include them in the repo's docs

Hi,

The Linux and Mac configuration documents have moved.
Can we include a simple screenshot of the right parameters to use for the configuration?

Thanks

More flexibility with pdftoppm dependency

Today, I installed the plugin on macOS and ran into the problem that pdftoppm was not found initially. The easiest fix was to install it with brew install poppler, which pulled in a lot of other things as well. However, that was still not enough, because the plugin expected the program to be in the Zotero directory, which I finally solved by copying it there.

How can we make this process easier for macOS?

It would be possible to look for other locations of pdftoppm similar to what we already do for tesseract. Is there also an easier installation method for the poppler tools? For windows it was enough to download one file and move it to the correct directory.

Issue with Farsi OCR

The OCR actually works great (it's all in the txt file created during OCR), but the text recognized does not get linked into the PDF document properly - as in, most of it is not there in the actual PDF file. Can't even search for it! Let me know what further information you may need to troubleshoot this.

Difficulty to OCR with the Zotero Beta (latest) on Mac M1

I am trying to use the plugin with Zotero Beta (latest version) on an Apple M1.

I have Tesseract and the poppler tools configured. But there seems to be some issue, as I am not able to OCR any file. Here is the error I see when looking at the Debug Output.

(1)(+0000007): NS_ERROR_FAILURE Exception: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.init] [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: chrome://zotero/content/xpcom/utilities_internal.js :: Zotero.Utilities.Internal.exec< :: line 762" data: no] Zotero.Utilities.Internal.exec<@chrome://zotero/content/xpcom/utilities_internal.js:762:5 From previous event: oncommand@chrome://zotero/content/standalone/standalone.xul:1:1

Either I have made some mistake in the configuration or in how I am setting the path (though I double-checked and the paths seem fine), or maybe this is a Beta or an M1 issue?

Appreciate your advice. Thanks in advance.

Zotero 7 Support

Per the Zotero Forums, all Zotero plugins will need to be rewritten to support the next major version.

Add files directly and not only the links to them

An Academic Workflow: Zotero & Obsidian | by Alexandra Phelan | Medium

https://medium.com/@alexandraphelan/an-academic-workflow-zotero-obsidian-56bf918d51ab

Document names with spaces don't work on Windows

In a test on Windows, the OCR failed when the name of the PDF includes space characters.

The same test still has to be done for Linux and macOS.
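
Whatever the cause turns out to be in the plugin, the general pitfall is easy to show: a command assembled as one string and split on whitespace breaks a file name that contains spaces, whereas an argument list keeps it intact. The sketch uses Python and a hypothetical helper, pdftoppm_args. The -png and -r options are regular pdftoppm options, and the resolution value is only an example.

```python
def pdftoppm_args(pdf_path, out_prefix, dpi=300, exe="pdftoppm"):
    """Build an argument list; each element reaches the program unsplit."""
    return [exe, "-png", "-r", str(dpi), pdf_path, out_prefix]


args = pdftoppm_args("my scan.pdf", "out/page")

# Naive alternative: join into one string and split on whitespace again.
naive = " ".join(args).split()

print(args)   # the file name is one element
print(naive)  # the file name has been torn into "my" and "scan.pdf"
```

Passing the list directly to a process API, without going through a shell or a string split, avoids the need for quoting on every platform.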

Idea: Give option to force OCR

I frequently encounter PDFs downloaded from sources like JSTOR that already have an OCR layer embedded, but one that is not actually accurate. The issue can simply be bad OCR (rare), but more often it is a layer that does not exactly overlap with the visible text. When highlighted, it looks fine in a reader, but when the text is extracted via tools like ZotFile or copied, the output is different from what it should be. In these cases I like to do a run of ocrmypdf with the --force-ocr flag, which drops the existing layer and creates a new one, which tends to be far more accurate. I tried to do the same thing with zotero-ocr, but it does not seem to process PDFs with existing OCR layers. Skipping them seems an efficient default, but it would be nice to have an option to enable a forcing mode, which creates an entirely new layer.

Greetings from Freiburg!

Issue: Gets stuck after producing PNGs

Zotero OCR initially worked. Now it only produces pdfinfo and image-list text files and PNGs (see image).

Stopped working after I moved data directory location and Zotfile location.

Tried to reinstall OCR and restore default locations, still doesn't work.

MacOS 12.1. (Intel, not M1)

pdftoppm not Found

Using v.0.3.0 on linux with Zotero v 5.0.89. I'm using the standard Ubuntu 20.04 installation of tesseract and pdftoppm (both located in /usr/bin).

To reproduce:

Install latest version
Right click on a PDF to OCR

This results in the following error message:
Cannot find /usr/local/Zotero_linux-x86_64/pdftoppm

Work-around is to create symbolic link to the system pdftoppm in /usr/local/Zotero_linux-x86_64.
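
The workaround amounts to one symbolic link. A minimal sketch, with a hypothetical helper name (link_tool) and the directories passed in as parameters, because the actual Zotero location differs between installations. Writing into a system directory such as /usr/local usually requires elevated privileges.

```python
import os


def link_tool(system_tool, zotero_dir, name="pdftoppm"):
    """Create zotero_dir/name as a symlink to system_tool, if not present."""
    link_path = os.path.join(zotero_dir, name)
    # lexists() is also true for a dangling link, so nothing is overwritten.
    if not os.path.lexists(link_path):
        os.symlink(system_tool, link_path)
    return link_path
```

With the paths from this report, the call would be link_tool("/usr/bin/pdftoppm", "/usr/local/Zotero_linux-x86_64").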

Nested option in preferences is always enabled

The nested option in the preferences is always enabled, although there is a function to disable it (and it worked at some point). One then sees the following error:

[JavaScript Error: "TypeError: document.getElementById(...) is null" {file: "chrome://zoteroocr/content/zoteroocr.js" line: 21}]
Zotero.OCR</this.updatePDFOverwritePref/<@chrome://zoteroocr/content/zoteroocr.js:21:76

PDFs very large, compressing in Preview removes OCR'd text

I love this plug-in, but I am running into two issues:

Scanned PDF size increase x10–20: I just scanned a 14 MB file and the scanned PDF was 200 MB.
When I compress this file in Mac's Preview, the text is no longer searchable. It gets turned into this: "􏰀􏰀􏰀􏰀􏰀"

When I compress the file with Acrobat's online tool, I can compress it without losing the text, but overall this makes the plugin much less useful. The issue seems to be related to the tools the plugin relies on, but I couldn't find an easy solution when I tried googling poppler and tesseract.

Suggestions? thanks for the plugin!

No bin.exe executable found

Cannot find bin.exe in pdftoppm.

OCR Produces corrupted file

Running the plugin produces a corrupted PDF file that cannot be opened in Zotero nor in other PDF readers. When the settings are set to "overwrite the initial PDF" the process fails after page 5, but with this setting disabled the plugin will complete the process. However, in both cases the output PDF is corrupted.

Current settings:

No output?

I have popplertoppm.exe copied to my Zotero directory and the extension finds Tesseract correctly. When I right-click a PDF attachment ("test.pdf") and select "OCR selected PDF(s)", nothing happens. The folder where the PDF is saved has a new "test.info.txt" file, but it's 0KB and completely blank. This is 10 minutes later for a very small PDF. Tried with a couple others and had the same problem. Any suggestions?

Nothing happens (Linux)

I have pdftoppm and tesseract in appropriate directories but get no output when I right click > OCR selected pdf. I'm on arch linux but I'm pretty new to the distro. Cannot find any solution that works in the previously solved issues. Thanks!

Unclear when working

The addon gives the user no feedback that it's actually doing anything, potentially leading to action spamming and then making multiple copies of everything

Zotero-ocr not compatible with latest beta 6

Attempting to install zotero-ocr-0.5.0.xpi on
Zotero 6.0-beta.1+8c846468f
gives the message that this add-on is not compatible with this version of Zotero.
See this discussion
https://forums.zotero.org/discussion/94617/all-plugins-removed-from-zotero-6-0-beta-1-c6d03753f#latest

Make keeping some output optional

Currently, all files created during the process are kept, including the PNG files and the PDF. For my test file, that results in 35.8 MB for a 274 KB file. Assuming that scaling, if I perform OCR on all files of my 4 GB Zotero library (which I would want, in order to search it), it will grow to about 520 GB. I would suggest getting rid of all the PNG files and adding an option for whether the *.ocr.pdf should be created at all (it takes around 15 MB of the 35.8 MB).
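
The estimate can be reproduced with a few lines of arithmetic. The sketch assumes 1 MB = 1000 KB and that every file in the library grows by the same factor as the single test file, which is the same simplification the estimate itself makes. The function name is hypothetical.

```python
def projected_size_gb(library_gb, original_mb, result_mb):
    """Scale a library size by the growth factor seen on one test file."""
    growth = result_mb / original_mb
    return library_gb * growth


# Everything kept: 274 KB (0.274 MB) became 35.8 MB.
all_kept = projected_size_gb(4, 0.274, 35.8)

# Only the roughly 15 MB .ocr.pdf kept, PNG files deleted.
pdf_only = projected_size_gb(4, 0.274, 15.0)

print(round(all_kept), round(pdf_only))
```

The first figure comes out at roughly 520 GB, matching the estimate. Dropping the PNG files alone would bring it down to roughly 220 GB under the same assumptions.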

Originally posted by @luerhard in #6 (comment)

Error message

I installed zotero-ocr via xpi. I was able to set the tesseract path in the preference window and I can select "OCR selected pdf" from the context menu. However, nothing happens. This is the terminal output:

Possibly unhandled rejection:

ZoteroPane is not defined

ReferenceError: ZoteroPane is not defined
    Zotero.OCR</this.recognize<@chrome://zoteroocr/content/zoteroocr.js:7:7
From previous event:
    oncommand@chrome://zotero/content/standalone/standalone.xul:1:1

keeps running without finishing job

I have installed Tesseract but when I go to OCR a document the log keeps going for hours but the OCR'd pdf is never produced.

Debug ID D405980180.

Make resolution of extracted images configurable

It would be really appreciated, if possible, the addition of a setting to raise the default image quality of the produced OCRed pdf, since the default quality is suboptimal in many instances.

E.g. (Top: Original, Bottom: OCRed version)

Automatically OCR new pdfs

This is not really an issue and more of a request for improvement so I'm going to start by thanking you for implementing a great addon! Would there be an easy way for you to implement an automatic evaluation of whether a new pdf is OCR-ized already or not, and if not, run the OCR on it?

On the user end, I feel that that would make the process very streamlined & nice to have (at least as an option)
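
One way such a check could work, sketched under assumptions: extract text from the first pages with pdftotext (a poppler tool) and treat the PDF as needing OCR when almost nothing comes back. The function names, the page limit and the character threshold are all hypothetical, and the heuristic misjudges PDFs whose existing text layer is present but inaccurate.

```python
import subprocess


def needs_ocr(extracted_text, min_chars=20):
    """Treat a PDF as un-OCRed when almost no text could be extracted."""
    # Remove all whitespace (including page-break form feeds) before counting.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def extract_text(pdf_path, pdftotext="pdftotext", last_page=3):
    """Return the text of the first pages (needs pdftotext installed)."""
    result = subprocess.run(
        [pdftotext, "-l", str(last_page), pdf_path, "-"],  # "-" writes to stdout
        capture_output=True, text=True, check=True)
    return result.stdout
```

A new attachment would then be queued for OCR when needs_ocr(extract_text(path)) is true.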

plugin does not find tesseract

I get this message on debian:

No /usr/bin/tesseract executable found.

but the file exists.

couldn't open 'nameToUnicode'

couldn't open 'nameToUnicode' ,
I/O Error: Couldn't open 'nameToUnicode' file 'node_modules\node-poppler\src\lib\win32\poppler-22.04.0\share\poppler\nameToUnicode\Bulgarian'
I/O Error: Couldn't open 'nameToUnicode' file 'node_modules\node-poppler\src\lib\win32\poppler-22.04.0\share\poppler\nameToUnicode\Greek'
I/O Error: Couldn't open 'nameToUnicode' file 'node_modules\node-poppler\src\lib\win32\poppler-22.04.0\share\poppler\nameToUnicode\Thai'

Automatic installation on ArchLinux

I have created a package so that the plugin can be easily installed on ArchLinux.

This can be achieved by running yay -S zotero-extension-ocr (see https://aur.archlinux.org/packages/zotero-extension-ocr). Should this be mentioned in the installation doc?

Updating PDF

This works very well for me, but when I choose "override PDF" it does not tell Zotero that it has modified the PDF. This means that Zotero won't sync over the modified file to my other devices (I use the Zotero full sync, not linked files). Am I overlooking an option?

Generates linked pdf (ocr.pdf) immediately without OCR scan

The plugin was working amazingly for me, but for some reason it stopped working recently. When I invoke it in Zotero, it generates an ocr.pdf file immediately, but nothing is in that file. I reinstalled the plugin, but this did not help. Perhaps there has been an issue with one of the dependencies? Is there any way to debug what's happening? I don't know at what step it is failing.