tesstrain_package's Introduction

tesstrain.py

Utilities for working with Tesseract >= 4 using artificial training data.

About

This repository contains a standalone fork of the official/upstream code at https://github.com/tesseract-ocr/tesstrain/tree/main/src to allow easier packaging for PyPI.

Installation

This package requires the Tesseract training tools to be available on your system. Additionally, a supported Python version (at least 3.6) is required for running.

You can install this package from PyPI:

python -m pip install tesstrain

Alternatively, you may use pip install . to install the package from a source checkout.

Running

  • Use the terminal interface to directly interact with the tools: python -m tesstrain --help.
  • Call it from your own code using the high-level interface tesstrain.run().

License

This package is subject to the terms of the Apache-2.0 license.

tesstrain_package's People

Contributors

stefan6419846, shreeshrii, stweil, dependabot[bot], zdenop, nagadomi, zhuangzhuang, bharatr21, armyke

tesstrain_package's Issues

Option `--exposures` has no effect

Hi,

Please excuse me if this is the wrong place to report the issue (should it go here, in this repository, or in https://github.com/tesseract-ocr/tesstrain?).

In short, the option --exposures is not respected.

According to arguments.py, that reads:

    parser.add_argument(
        "--exposures",
        metavar="EXPOSURES",
        action="append",
        nargs="+",
        help="A list of exposure levels to use (e.g. -1,0,1).",
    )

the option can be used more than once and accepts more than one value. Therefore I tried the following:

$ python -m tesstrain --exposures 1 5 ...

Then, in tesseract.log, the following line indicates that the values provided on the command line are overridden:

... - DEBUG - tesstrain.language_specific - exposures = [0] (was [['1', '5']])

A more complex case:

$ python -m tesstrain --exposures 1 5 ... --exposures -1 ...

then in tesseract.log:

... DEBUG - tesstrain.language_specific - exposures = [0] (was [['1', '5'], ['-1']])

I would expect the following files to be generated: *.exp1.{tif,box,lstmf}, *.exp5.{tif,box,lstmf}, and *.exp-1.{tif,box,lstmf}, but only the files for exposure=0 are present: *.exp0.{tif,box,lstmf}

In the code, the value provided on the command line is overridden by these lines, https://github.com/stefan6419846/tesstrain_package/blob/main/tesstrain/language_specific.py#L1327-L1328, which read:
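The nested lists in the log follow from how argparse combines action="append" with nargs="+": each occurrence of the option contributes one inner list of strings. A minimal sketch that reproduces only the argument definition quoted above (the standalone parser here is for illustration and is not the package's own parser):

```python
import argparse

# Standalone parser that mirrors the quoted add_argument() call.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--exposures",
    metavar="EXPOSURES",
    action="append",
    nargs="+",
    help="A list of exposure levels to use (e.g. -1,0,1).",
)

# Two occurrences of the option, as in the more complex case above.
args = parser.parse_args(["--exposures", "1", "5", "--exposures", "-1"])

# One inner list per occurrence; the values are strings, not ints.
print(args.exposures)
```

Any code that consumes this value therefore has to flatten the lists and convert the strings to integers.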

if not EXPOSURES:
    EXPOSURES = [0]

If I understand correctly, the root cause is this line, https://github.com/stefan6419846/tesstrain_package/blob/main/tesstrain/language_specific.py#L920, which does not use the values from the command line:

EXPOSURES: List[int] = []

If it can be changed to something like:

EXPOSURES: List[int] = [v for vs in ctx.exposures for v in vs]

this should fix the issue.
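One caveat with the one-liner is that it yields strings rather than integers and fails if the option was not given at all (the value is then None). A slightly more defensive sketch follows; the helper name flatten_exposures is hypothetical and not part of tesstrain, and accepting comma-separated values is an assumption based on the "-1,0,1" example in the help text:

```python
from typing import List, Optional, Sequence


def flatten_exposures(raw: Optional[Sequence[Sequence[str]]]) -> List[int]:
    """Flatten argparse's nested lists into a flat list of ints.

    Hypothetical helper: falls back to [0] when nothing was given,
    matching the default applied in language_specific.py.
    """
    if not raw:
        return [0]
    values: List[int] = []
    for group in raw:
        for item in group:
            # Assumption: also accept comma-separated values such as "-1,0,1".
            values.extend(int(part) for part in item.split(",") if part.strip())
    return values or [0]


print(flatten_exposures([["1", "5"], ["-1"]]))
print(flatten_exposures(None))
```

With such a helper, the assignment would become something like EXPOSURES = flatten_exposures(ctx.exposures), assuming ctx.exposures holds the parsed command-line value.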

Thanks again,

BR, Nikolai

Opening file for writing should have mode="w" or "w+"

Hi

If I am not mistaken, this repository contains the Python tesstrain package that is installed via pip install tesstrain. If not, please disregard this issue.

I have installed tesstrain==0.1.2 and have an issue running the command, namely

FileNotFoundError: [Errno 2] No such file or directory: 'path/to/output_dir/eng.training_files.txt'

This is due to this line: https://github.com/stefan6419846/tesstrain_package/blob/main/tesstrain/generate.py#L379

with open(lstm_list, encoding="utf-8", newline="\n") as fd:
    fd.write("\n".join(dir_listing))

As can be seen, the output file is opened for reading only, because read mode is the default for open() (https://docs.python.org/3.10/library/functions.html#open).
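The behaviour can be reproduced outside the package. The sketch below uses a temporary directory and placeholder file names (both are assumptions for illustration): opening a file that does not exist yet with the default mode raises FileNotFoundError, while mode="w" creates it.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder path; the file does not exist yet.
    target = Path(tmp) / "eng.training_files.txt"

    # Default mode is "r": the open() call itself fails for a missing file.
    default_mode_error = None
    try:
        with open(target, encoding="utf-8", newline="\n") as fd:
            fd.write("a.lstmf")
    except FileNotFoundError as exc:
        default_mode_error = exc

    # With mode="w" the file is created and can be written.
    with open(target, mode="w", encoding="utf-8", newline="\n") as fd:
        fd.write("\n".join(["a.lstmf", "b.lstmf"]))

    written = target.read_text(encoding="utf-8")

print(type(default_mode_error).__name__)
print(written)
```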

In the other repository, this line uses the correct mode: https://github.com/tesseract-ocr/tesstrain/blob/main/src/tesstrain/generate.py#L375

with pathlib.Path(lstm_list).open(mode="w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(dir_listing))

  1. Can this be fixed and published to PyPI?
  2. It would be nice to sync the two repositories (this one and the code in https://github.com/tesseract-ocr/tesstrain/src), or to delete one of them to avoid confusion.

Thanks in advance!

BR, Nikolai

Seeing `tesseract - read_params_file: Can't open ...` in the logs

Hello

I have been running the tesstrain tool and have seen it print lines like this to the log file:

[2024-03-26 02:01:39,355] - DEBUG - tesseract - read_params_file: Can't open /path/to/tessdata_best/configs/

The directory exists (it is under TESSDATA_PREFIX). The message is not an error: the script continues running and completes successfully. The message is misleading, however, because it suggests that something is wrong. Would it be possible to silence it?

The above message is generated when the config argument in the following call is an empty string:

            future = executor.submit(
                run_command,
                "tesseract",
                img_file,
                pathlib.Path(img_file).with_suffix(""),
                *box_config,
                config,   # <-- this is "" and causes a message being printed
                env=tessdata_environ,
            )
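One possible way to avoid the message would be to leave out empty arguments before the command is submitted. The sketch below is an assumption, not the package's actual fix: drop_empty_args is a hypothetical helper, and the argument values are placeholders rather than the real contents of box_config.

```python
from typing import Iterable, List


def drop_empty_args(args: Iterable[object]) -> List[str]:
    """Hypothetical helper: stringify arguments and skip empty strings."""
    return [str(arg) for arg in args if str(arg) != ""]


# Placeholder values standing in for img_file, the output base name,
# the box configuration and an empty config.
command = drop_empty_args(["tesseract", "eng.exp0.tif", "eng.exp0", "lstm.train", ""])
print(command)
```

Whether tesseract then runs without the message would still need to be checked against the real call.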

Best regards,
Nikolai
