zapolnoch / node-tesseract-ocr

A Node.js wrapper for the Tesseract OCR API

License: MIT License

JavaScript 100.00%

tesseract ocr text-recognition image-to-text

node-tesseract-ocr's Introduction

Tesseract OCR for Node.js

Installation

First, you need to install the Tesseract project. Instructions for installing Tesseract for all platforms can be found on the project site. On Debian/Ubuntu:

apt-get install tesseract-ocr

After you've installed Tesseract, install the npm package:

npm install node-tesseract-ocr

Usage

const tesseract = require("node-tesseract-ocr")

const config = {
  lang: "eng", // default
  oem: 3,
  psm: 3,
}

async function main() {
  try {
    const text = await tesseract.recognize("image.jpg", config)
    console.log("Result:", text)
  } catch (error) {
    console.log(error.message)
  }
}

main()

You can also pass a URL:

const img = "https://tesseract.projectnaptha.com/img/eng_bw.png"
const text = await tesseract.recognize(img)

or a Buffer:

const tesseract = require("node-tesseract-ocr")
const fs = require("fs/promises")

async function main() {
  // Read the whole image into memory and hand the Buffer to Tesseract
  const img = await fs.readFile("image.jpg")
  const text = await tesseract.recognize(img)

  console.log("Result:", text)
}

main()

If you want to process multiple images in a single run, then pass an array:

const images = ["./samples/file1.png", "./samples/file2.png"]
const text = await tesseract.recognize(images)

In the config object you can pass any OCR options. You can also pass any control parameters there, or use ready-made sets of config files (such as hocr):

await tesseract.recognize("image.jpg", {
  load_system_dawg: 0,
  tessedit_char_whitelist: "0123456789",
  presets: ["tsv"],
})
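As a rough illustration of how such a config object could end up on the Tesseract command line, here is a minimal sketch. It is not the library's actual implementation: `configToArgs` and `OCR_OPTIONS` are hypothetical names, and the option list is taken from the `tesseract --help-extra` output.

```javascript
// Hypothetical helper, not part of node-tesseract-ocr:
// turns a config object into Tesseract command-line arguments.
const OCR_OPTIONS = ["tessdata-dir", "user-words", "user-patterns", "dpi", "psm", "oem"]

function configToArgs(config) {
  const args = []
  for (const [key, value] of Object.entries(config)) {
    if (key === "presets") continue // config file names are appended last
    if (key === "lang") args.push("-l", String(value))
    else if (OCR_OPTIONS.includes(key)) args.push(`--${key}`, String(value))
    else args.push("-c", `${key}=${value}`) // any other key is treated as a control parameter
  }
  // Tesseract requires options to occur before any config file name.
  return args.concat(config.presets || [])
}

const example = { lang: "eng", psm: 6, tessedit_char_whitelist: "0123456789", presets: ["tsv"] }
console.log(configToArgs(example).join(" "))
// prints: -l eng --psm 6 -c tessedit_char_whitelist=0123456789 tsv
```

Returning an array rather than one string keeps each value a separate argument, which matters when a value contains spaces.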

Alternatives

If you want to use Tesseract in the browser, choose the Tesseract.js package, which compiles the original Tesseract from C to JavaScript and WebAssembly. You can also use it in Node.js, but performance may not be as good.

node-tesseract-ocr's Issues

Support --dpi OCR option

node-tesseract-ocr lacks support for Tesseract's --dpi option, which is not yet included in the package's ocrOptions.

From tesseract --help-extra:

> tesseract --help-extra
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  --dpi VALUE           Specify DPI for input image.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.
…

Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed

My code

const tesseract = require("node-tesseract-ocr")

const config = {
  lang: "eng",
  oem: 1,
  psm: 3,
}

const img = "https://tesseract.projectnaptha.com/img/eng_bw.png"

tesseract
  .recognize(img, config)
  .then((text) => {
    console.log("Result:", text)
  })
  .catch((error) => {
    console.log(error.message)
  })

Error

Command failed: tesseract stdin stdout -l eng --oem 1 --psm 3
/bin/sh: 1: tesseract: not found

events.js:291
      throw er; // Unhandled 'error' event
      ^

Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed
    at doWrite (_stream_writable.js:399:19)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Socket.Writable.write (_stream_writable.js:318:11)
    at IncomingMessage.ondata (_stream_readable.js:718:22)
    at IncomingMessage.emit (events.js:314:20)
    at addChunk (_stream_readable.js:297:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at IncomingMessage.Readable.push (_stream_readable.js:213:10)
    at HTTPParser.parserOnBody (_http_common.js:135:24)
    at TLSSocket.socketOnData (_http_client.js:474:22)
Emitted 'error' event on Socket instance at:
    at errorOrDestroy (internal/streams/destroy.js:108:12)
    at Socket.onerror (_stream_readable.js:754:7)
    at Socket.emit (events.js:314:20)
    at errorOrDestroy (internal/streams/destroy.js:108:12)
    at onwriteError (_stream_writable.js:418:5)
    at onwrite (_stream_writable.js:445:5)
    at doWrite (_stream_writable.js:399:11)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Socket.Writable.write (_stream_writable.js:318:11)
    at IncomingMessage.ondata (_stream_readable.js:718:22) {
  code: 'ERR_STREAM_DESTROYED'
}

I have been waiting for more than an hour for this image.

Can I pass in a binary image?

await tesseract.recognize("image.jpg")
I want to pass in a binary image directly instead of image.jpg. Thank you for your answer!

Can't get 'digits' preset to work

This is my source:

const tesseract = require('node-tesseract-ocr')


const config = {
  lang: 'eng',
  oem: 1,
  psm: 6,
  // tessedit_char_whitelist: '0123456789',
  presets: ['digits']
}

tesseract
  .recognize('./image4.png', config)
  .then(text => {
    console.log('Result:', text)
  })
  .catch(err => {
    console.log('error:', err)
  })

This is the image:

The output I'm getting is, "Result: x 4606 : -4809 Z: 698".

Expected output: "Result: 4606-4809698". (I don't know if it should be delimited or not, but there shouldn't be any letters.) I have tried a number of things:

Using the whitelist in config directly,

const config = {
  lang: 'eng',
  oem: 1,
  psm: 6,
  tessedit_char_whitelist: '0123456789'
}

I've also tried without a digits file and with. With a digits file, the filename is just 'digits' with no extension and I've tried putting it in the same location as my app.js (same dir where I run the script above) and I've also tried putting the digits file here: './tessdata/configs/digits'.

My digits file contains,

tessedit_char_whitelist 0123456789

What am I doing wrong?
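If the whitelist has no effect with a given Tesseract build (some Tesseract 4.0 builds reportedly ignore tessedit_char_whitelist with the LSTM engine, oem 1), one workaround is to filter the recognized text afterwards. The sketch below is only that, a workaround: `keepOnly` is a hypothetical helper, and it removes unwanted characters after recognition instead of steering the recognizer.

```javascript
// Hypothetical post-processing helper: keep only the characters listed in `allowed`.
function keepOnly(text, allowed) {
  return [...text].filter((ch) => allowed.includes(ch)).join("")
}

// Using the output reported above, keeping digits and the hyphen:
console.log(keepOnly("x 4606 : -4809 Z: 698", "0123456789-"))
// prints: 4606-4809698
```

Filtering cannot correct a digit that was misread as a letter; it only drops it, so the result should still be validated against the expected format.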

V8 crashes while batch OCRing

I've just switched to node-tesseract-ocr from the much slower Tesseract.js. Unfortunately I started consistently getting the following error. It occurs at about, but not exactly, the same place in a batch of images. I've tried OCRing them all individually, and in that case it does not crash.

`Fatal error in , line 0
Check failed: result.second.

FailureMessage Object: 0x7ffd149a3550
1: 0xb6f151 [node]
2: 0x1bf56f4 V8_Fatal(char const*, ...) [node]
3: 0xfc3f61 v8::internal::GlobalBackingStoreRegistry::Register(std::shared_ptr<v8::internal::BackingStore>) [node]
4: 0xd151c8 v8::ArrayBuffer::GetBackingStore() [node]
5: 0xb1df77 [node]
6: 0xd4a18e [node]
7: 0xd4b5af v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [node]
8: 0x15e7959 [node]
Trace/breakpoint trap (core dumped)`

Node version v16.13.0.
node-tesseract-ocr version 2.2.1
tesseract --version tesseract 4.1.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 Found AVX2 Found AVX Found FMA Found SSE Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
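The cause of the crash is not established in this report. Since single-image runs reportedly succeed, one possible workaround is to split the batch into smaller groups and process them one after another. In the sketch below, `chunk` and `recognizeInGroups` are hypothetical helpers, and the recognizer is passed in as a parameter (for example `tesseract.recognize`) so the code does not assume anything about the library beyond what the README shows.

```javascript
// Split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer")
  }
  const groups = []
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size))
  }
  return groups
}

// Run the injected `recognize` function on one group at a time.
async function recognizeInGroups(images, size, recognize) {
  const results = []
  for (const group of chunk(images, size)) {
    results.push(await recognize(group)) // wait for each group before starting the next
  }
  return results
}

console.log(chunk(["a.png", "b.png", "c.png"], 2))
// prints: [ [ 'a.png', 'b.png' ], [ 'c.png' ] ]
```

A call such as `recognizeInGroups(files, 10, (group) => tesseract.recognize(group, config))` would then return one result per group; the group size of 10 is an arbitrary starting point.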

Error during processing

Code

const tesseract = require("node-tesseract-ocr");

const config = {
  lang: "swe",
  oem: 1,
  psm: 3,
};

tesseract
  .recognize(`${__dirname}/doc.png`, config)
  .then((text) => {
    console.log("Result:", text);
  })
  .catch((error) => {
    console.log(error);
  });

Output

Error: Command failed: tesseract "/Users/albingroen/Developer/demos/node-ocr-test/doc.png" stdout -l swe --oem 1 --psm 3
Error during processing.

    at ChildProcess.exithandler (child_process.js:319:12)
    at ChildProcess.emit (events.js:376:20)
    at maybeClose (internal/child_process.js:1055:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:288:5) {
  killed: false,
  code: 1,
  signal: null,
  cmd: 'tesseract "/Users/albingroen/Developer/demos/node-ocr-test/doc.png" stdout -l swe --oem 1 --psm 3'
}

Tesseract version (macOS)

tesseract 5.0.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0
 Found NEON
 Found libarchive 3.5.2 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0
 Found libcurl/7.77.0 SecureTransport (LibreSSL/2.8.3) zlib/1.2.11 nghttp2/1.42.0

Am I missing something?

Error: Unhandled 'error' event write EOF ... Emitted 'error' event on Socket instance at:

On Windows 10 and Windows 11, the basic example does not work:

npm install node-tesseract-ocr
npm install fs.promises

const tesseract = require("node-tesseract-ocr");
const fs = require("fs/promises");

async function main() {
  const img = await fs.readFile("image.png");
  const text = await tesseract.recognize(img);

  console.log("Result:", text);
}

main();

The basic example generates an error:

node index.js
node:events:491
      throw er; // Unhandled 'error' event
      ^

Error: write EOF
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -4095,
  code: 'EOF',
  syscall: 'write'
}

Node.js v18.12.1

Buffer Array

Is there any way to pass an array of buffers to process them all at once?
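The README documents arrays of file paths and single Buffers, but not arrays of Buffers. Until that is supported, a small wrapper can call the recognizer once per Buffer. This is a sketch under that assumption: `recognizeBuffers` is a hypothetical helper, and `recognize` is injected (for example `(buf) => tesseract.recognize(buf, config)`).

```javascript
// Hypothetical wrapper: recognize each Buffer separately and collect the results in order.
function recognizeBuffers(buffers, recognize) {
  // Promise.all preserves input order and rejects as soon as any single recognition fails.
  return Promise.all(buffers.map((buffer) => recognize(buffer)))
}

// Usage with a stand-in recognizer that only reports the buffer size:
recognizeBuffers([Buffer.from("ab"), Buffer.from("cde")], async (buf) => buf.length)
  .then((sizes) => console.log(sizes)) // prints: [ 2, 3 ]
```

Note that this starts all recognitions at once; with the real recognizer that likely means one Tesseract process per Buffer, so a sequential loop may be the safer choice for large inputs.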

Run e2e tests on Windows

I want to run e2e tests on Windows. Tesseract is successfully installed from Chocolatey, but then I get an error in my Travis Job log:

'tesseract' is not recognized as an internal or external command, operable program or batch file.

I found reports of a similar problem, but couldn't solve the problem. Can somebody help me with the refreshenv?

PR is welcome!

Remove the hard dependency on the Tesseract

After installing the node-tesseract-ocr package, we could download the Tesseract OCR binaries automatically, as is done in node-sass.

Running in Windows gives error but running in Ubuntu with same script is fine

Hello,

Here is the working script which works on Linux.

const tesseract = require('node-tesseract-ocr');

const config = {
  lang: 'eng',
  oem: 1,
  psm: 3
}
 
tesseract
  .recognize('image.jpg', config)
  .then(text => {
    console.log('Result:', text)
  })
  .catch(err => {
    console.log('error:', err)
  })

Here is the non-working script I used on Windows (same package.json).

const tesseract = require('node-tesseract-ocr');

const config = {
  lang: 'eng',
  oem: 1,
  psm: 3
}
 
tesseract
  .recognize('image.jpg', config)
  .then(text => {
    console.log('Result:', text)
  })
  .catch(err => {
    console.log('error:', err)
  })

Identical.

Here is the error message:

error: { Error: Command failed: tesseract image.jpg stdout -l eng --oem 1 --psm 3
'tesseract' is not recognized as an internal or external command,
operable program or batch file.

    at ChildProcess.exithandler (child_process.js:294:12)
    at ChildProcess.emit (events.js:198:13)
    at maybeClose (internal/child_process.js:982:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:259:5)
  killed: false,
  code: 1,
  signal: null,
  cmd: 'tesseract image.jpg stdout -l eng --oem 1 --psm 3' }

What is going on? How can I solve this error?
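The message comes from the Windows shell and means that no `tesseract` executable was found in any directory listed in the PATH of the Node.js process. A quick way to confirm this from Node.js is to scan PATH directly. The sketch below uses only built-in modules; `findOnPath` is a hypothetical helper, not part of the library.

```javascript
const fs = require("fs")
const path = require("path")

// Return the full path of `command` if it is found in a PATH directory, otherwise null.
function findOnPath(command, env = process.env) {
  const dirs = (env.PATH || env.Path || "").split(path.delimiter).filter(Boolean)
  // On Windows, executables are resolved through the extensions listed in PATHEXT.
  const exts = process.platform === "win32"
    ? (env.PATHEXT || ".EXE;.CMD;.BAT").split(";")
    : [""]
  for (const dir of dirs) {
    for (const ext of exts) {
      const candidate = path.join(dir, command + ext)
      try {
        fs.accessSync(candidate, fs.constants.X_OK)
        if (fs.statSync(candidate).isFile()) return candidate
      } catch {
        // Not in this directory, keep looking.
      }
    }
  }
  return null
}

const found = findOnPath("tesseract")
console.log(found ? `tesseract found at ${found}` : "tesseract is not on PATH for this process")
```

If this reports that Tesseract is missing even though it is installed, the installation directory probably has to be added to PATH, and the terminal or IDE restarted so that the Node.js process picks up the change.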

How do we use --user-words?

Command injection

If :";echo TEST;": is passed into tesseract.recognize, then TEST is printed to the terminal (on Linux; I have not tested this on Windows).
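Injection of this kind is possible whenever the command is assembled as a single string and run through a shell. A common mitigation in Node.js is `child_process.execFile` with an argument array, which does not spawn a shell by default. The following is a minimal sketch of that technique, not the library's code; `buildArgs` and `recognizeFile` are hypothetical names.

```javascript
const { execFile } = require("child_process")
const { promisify } = require("util")

const execFileAsync = promisify(execFile)

// Each array element reaches tesseract as exactly one argument.
function buildArgs(imagePath, lang = "eng") {
  return [imagePath, "stdout", "-l", lang]
}

// Requires the tesseract binary on PATH. No shell is involved, so characters
// such as ";", quotes and spaces in imagePath are not interpreted.
async function recognizeFile(imagePath, lang) {
  const { stdout } = await execFileAsync("tesseract", buildArgs(imagePath, lang))
  return stdout
}

console.log(buildArgs('";echo TEST;"'))
// The payload stays a single, inert file name argument.
```

The same approach also keeps a file name containing spaces intact as one argument, without manual quoting.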

Error: cannot open input file: stdin

CODE:

const tesseract = require("node-tesseract-ocr");

const config = {
    lang: "eng",
    oem: 1,
    psm: 3,
};

tesseract
.recognize("https://tesseract.projectnaptha.com/img/eng_bw.png", config)
.then((text) => {
    console.log("Result:", text);
})
.catch((error) => {
    console.log(error.message);
});

ERROR:

Command failed: tesseract stdin stdout -l eng --oem 1 --psm 3
read_params_file: Can't open 1
read_params_file: Can't open -psm
read_params_file: Can't open 3
Tesseract Open Source OCR Engine v3.02 with Leptonica
Cannot open input file: stdin

ADDITIONAL INFORMATION:

Platform: Windows 11
Node Version: v18.12.1
node-tesseract-ocr Version: ^2.2.1
Tesseract Version: 3.02

I am trying to extract text from images using Node.js, but I am getting this error. What am I doing wrong here?
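The log suggests that Tesseract 3.02 does not understand the `--oem` and `--psm` flags (their values are treated as config file names) and cannot read from stdin, so upgrading Tesseract is the likely fix. A script can detect an old installation early by parsing the version banner. The sketch below is a hypothetical helper that only parses text; obtaining the banner (for example from `tesseract --version`, which older releases may print to stderr) is left to the caller.

```javascript
// Extract the major and minor version from a Tesseract banner, or return null.
function parseTesseractVersion(output) {
  // Handles both "tesseract 5.0.1" and "Tesseract Open Source OCR Engine v3.02 with Leptonica".
  const match = /tesseract\b[^\d\n]*?v?(\d+)\.(\d+)/i.exec(output)
  if (!match) return null
  return { major: Number(match[1]), minor: Number(match[2]) }
}

const version = parseTesseractVersion("Tesseract Open Source OCR Engine v3.02 with Leptonica")
console.log(version)
// prints: { major: 3, minor: 2 }

if (version && version.major < 4) {
  console.warn("Tesseract 4 or newer is probably needed for --oem, --psm and stdin input")
}
```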

How to use in a mobile app without downloading the Tesseract library?

Hi,
I would like to use this tesseract wrapper in a mobile app. The server would use Express and be hosted online.

But it says in the README that for using this module, we first need to download Tesseract Library to the computer.
Of course the people using the app will not download this library to their phone.

Is it possible to install the Tesseract library on the Express server instead?

Thanks

Support Localization of Words via `tsv` or `hocr` flag

By adding tsv or hocr to the end of the tesseract command, you can get the positions of words; an example of tsv output is shown below:

We should support this as a return type, possibly converted to JSON (this might be cleanest).
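Converting TSV output to JSON can be done in a few lines once the text is available (for instance via the `tsv` preset shown in the README). The sketch below assumes the standard Tesseract TSV header (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text); `parseTsv` is a hypothetical helper and the sample row is made up for illustration.

```javascript
// Parse Tesseract TSV output into an array of plain objects.
function parseTsv(tsv) {
  const [headerLine, ...rows] = tsv.split(/\r?\n/).filter((line) => line.trim() !== "")
  if (!headerLine) return []
  const columns = headerLine.split("\t")
  return rows.map((row) => {
    const cells = row.split("\t")
    const entry = {}
    columns.forEach((name, i) => {
      const cell = cells[i] ?? ""
      // Every column except "text" is numeric.
      entry[name] = name === "text" ? cell : Number(cell)
    })
    return entry
  })
}

// Made-up sample: one word-level row (level 5) with its bounding box.
const sample = [
  "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
  "5\t1\t1\t1\t1\t1\t36\t92\t60\t20\t96\tHello",
].join("\n")

const words = parseTsv(sample).filter((entry) => entry.level === 5)
console.log(words[0].text, words[0].left, words[0].top)
// prints: Hello 36 92
```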

Error

(node:1382) UnhandledPromiseRejectionWarning: Error: Command failed: tesseract stdin stdout
/bin/sh: 1: tesseract: not found

at ChildProcess.exithandler (child_process.js:308:12)
at ChildProcess.emit (events.js:314:20)
at ChildProcess.EventEmitter.emit (domain.js:483:12)
at maybeClose (internal/child_process.js:1022:16)
at Socket.<anonymous> (internal/child_process.js:444:11)
at Socket.emit (events.js:314:20)
at Socket.EventEmitter.emit (domain.js:483:12)
at Pipe.<anonymous> (net.js:675:12)

(node:1382) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:1382) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
events.js:291
throw er; // Unhandled 'error' event
^

Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed
at doWrite (_stream_writable.js:399:19)
at writeOrBuffer (_stream_writable.js:387:5)
at Socket.Writable.write (_stream_writable.js:318:11)
at IncomingMessage.ondata (_stream_readable.js:718:22)
at IncomingMessage.emit (events.js:314:20)
at IncomingMessage.EventEmitter.emit (domain.js:483:12)
at IncomingMessage.Readable.read (_stream_readable.js:507:10)
at flow (stream_readable.js:1007:34)
at resume (_stream_readable.js:988:3)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
at errorOrDestroy (internal/streams/destroy.js:108:12)
at Socket.onerror (_stream_readable.js:754:7)
at Socket.emit (events.js:314:20)
at Socket.EventEmitter.emit (domain.js:483:12)
at errorOrDestroy (internal/streams/destroy.js:108:12)
at onwriteError (_stream_writable.js:418:5)
at onwrite (_stream_writable.js:445:5)
at doWrite (_stream_writable.js:399:11)
at writeOrBuffer (_stream_writable.js:387:5)
at Socket.Writable.write (_stream_writable.js:318:11) {
code: 'ERR_STREAM_DESTROYED'
}

cmd fail when filename contains spaces (Windows)

Input directory is C:\Users\rc08281\code\code-challenges\ocr\work\input
Found 1 files in input directory
{ Error: Command failed: tesseract C:\Users\rc08281\code\code-challenges\ocr\work\converted\WI -Exemption -Invalid -Handwritten.png stdout -c 0=e -c 1=n -c 2=g
read_params_file: Can't open stdout
read_params_file: Can't open c
read_params_file: Can't open 0=e
read_params_file: Can't open c
read_params_file: Can't open 1=n
read_params_file: Can't open c
read_params_file: Can't open 2=g
Could not set option: 0=e
Could not set option: 1=n
Could not set option: 2=g
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.

The filename is "WI -Exemption -Invalid -Handwritten.png"

Can you please enclose it in quotes when you spawn convert?

This tesseract has no URL support

Is there anything I can do to add URL support to this package? I keep getting this error:

Command failed: tesseract "https://tesseract.projectnaptha.com/img/eng_bw.png" stdout -l eng --oem 1 --psm 3
Error, this tesseract has no URL support
Error during processing.

tesseract version I have is: 5.0.0-alpha-20201231-171-g04173
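The error is printed by the Tesseract binary itself, which in this build was compiled without URL support. One way around it is to download the image in Node.js and pass the resulting Buffer, which the README lists as a supported input. The sketch below assumes Node.js 18 or newer for the global `fetch`; `isHttpUrl` and `loadImage` are hypothetical helpers.

```javascript
// True only for http: and https: URLs; file paths return false.
function isHttpUrl(input) {
  try {
    const { protocol } = new URL(input)
    return protocol === "http:" || protocol === "https:"
  } catch {
    return false // not parseable as an absolute URL, e.g. "image.jpg"
  }
}

// Download remote images into a Buffer; leave local paths untouched.
async function loadImage(input) {
  if (!isHttpUrl(input)) return input
  const response = await fetch(input)
  if (!response.ok) throw new Error(`Download failed with status ${response.status}`)
  return Buffer.from(await response.arrayBuffer())
}

console.log(isHttpUrl("https://tesseract.projectnaptha.com/img/eng_bw.png")) // prints: true
console.log(isHttpUrl("image.jpg")) // prints: false
```

The result of `await loadImage(img)` can then be passed to `tesseract.recognize` in place of the URL.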

crash when whitelisting some characters

This crashes the program:

options.tessedit_char_whitelist = "(";
tesseract.recognize(buffer, options);

also )

Support Stream

Tesseract supports image recognition from stdin.
Can we add the ability for tesseract.recognize() to accept not only a filename but also a Buffer or stream?

Tesseract v4.x support

Hi, I moved from tesseractjs to node-tesseract-ocr for performance reasons.
I tried to install it on a Linux machine. It seems to work fine with Tesseract 3.x, but I'm having issues with Tesseract 4.
I'm passing an ArrayBuffer image as input, so I wonder if that's the issue or if v4.x requires additional config.
Thanks,
Christian

'tesseract' is not recognized as an internal or external command, operable program or batch file.

I have added the path to the environment variables, but I am still getting this error while using the node-tesseract-ocr library in my Node.js code.

tesseract
  .recognize(imgPath, config)
  .then((text) => {
    console.log("Result:", text)
    var words = text.split(" ");
    console.log(words);
    var code = words[0].substring(0, 6);
    console.log("the verification code is: " + code);
    return code;
  })
  .catch((error) => {
    console.log(error.message)
  })

Can someone help me?

pipeInput example

How can I use pipeInput?
Can you share an example?

Thanks in advance