The ocr_generate_text_data from codeachievedream

ocr_generate_text_data's Introduction

OCR diversified text data generation project

Introduction

The project code is mainly used to generate text training data for OCR. The project ships a large set of font files, corpus files, and lists of visually similar characters: more than 20 fonts and up to 2 GB of corpus data, including company, address, and novel corpora, to cover a range of needs.

Features

The data generated by the project is used to train several recognition models, such as models for ID cards, business cards, and bills. A Comprehensive Synthetic Chinese String Dataset is also provided (extraction code: fh6h), and the accuracy of the trained recognition models can reach more than 99%. The generated text is very similar to the 360w data (360w means 3.6 million), so it can supplement that data where it falls short. The project includes image rotation, perspective transformation, enhancement, and similar functions, which makes the output flexible and diverse. Because the corpus files and background images are too large, the corpus data is shared through a network disk (extraction code: awfn).

Run

The selected background images are stored in the back_ground folder of the project. To process the backgrounds, use the text.py file in back_ground to enhance them so that they match the scene to be recognized.

The font_file/font_all/ directory in the project stores the font files. After adding a new font, run check_font.py to extract the characters the font supports. If the corresponding file in font_in_all is blank, the font may fail in use and produce blank output.
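The blank-file check described above can be automated. The following is a minimal sketch, not part of the project: the function name `find_blank_charset_files` is hypothetical, and it assumes that font_in_all holds one plain file per font and that "blank" means empty or whitespace only.

```python
import os


def find_blank_charset_files(charset_dir):
    """Return the names of files in charset_dir that are empty or whitespace only.

    Hypothetical helper: assumes one character-list file per font.
    """
    blank = []
    for name in sorted(os.listdir(charset_dir)):
        path = os.path.join(charset_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # Read as bytes so the check does not depend on the file's text encoding.
        with open(path, "rb") as fh:
            if not fh.read().strip():
                blank.append(name)
    return blank
```

Calling `find_blank_charset_files("font_in_all")` after running check_font.py would list the fonts that probably need to be replaced or removed; the directory path here is an assumption about where the script is run from.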

The text_file folder in the project stores the corpus data; newly added corpus text can be placed in this directory.
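To illustrate how a corpus file can be turned into short text sequences for rendering, here is a small sketch. It does not reproduce the logic of main.py or word.py: the function name `sample_text_lines`, its parameters, and the UTF-8 default are all assumptions made for illustration.

```python
import random


def sample_text_lines(corpus_path, length, count, seed=None, encoding="utf-8"):
    """Draw `count` random substrings of `length` characters from a corpus file.

    Hypothetical helper: loads every usable line into memory, so it only
    suits small corpus files, not the full 2 GB corpus.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    rng = random.Random(seed)
    with open(corpus_path, encoding=encoding) as fh:
        # Keep only lines long enough to yield a substring of the requested length.
        lines = [ln.strip() for ln in fh if len(ln.strip()) >= length]
    if not lines:
        return []
    samples = []
    for _ in range(count):
        line = rng.choice(lines)
        start = rng.randrange(len(line) - length + 1)
        samples.append(line[start:start + length])
    return samples
```

Passing a fixed `seed` makes the sampled sequences reproducible, which helps when comparing models trained on different generated sets.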

More augmentation methods can be added in Imagaug_image.py in the project. For details, refer to the imgaug library.

Configure the paths in main.py and run it to generate the data. To add text sequences for the required scene, modify word.py.

Picture example


ocr_generate_text_data's Issues

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0

When I replace the fonts with different fonts, I get an error. What could be causing this? Can fanti be removed from the code?

Traceback (most recent call last):
  File "main.py", line 184, in <module>
    local_font = LocalFont()
  File "//localfont.py", line 14, in __init__
    self.all_font = self.get_all_font(self.path)
  File "/localfont.py", line 29, in get_all_font
    font_txt = open(os.path.join(path, font), 'r').read()
  File "/Users//opt/anaconda3/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 11: invalid start byte
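One possible cause, offered as an assumption rather than a confirmed diagnosis: the traceback shows `open(..., 'r')` decoding with the platform default, UTF-8, while the file being read may be stored in another encoding. Byte 0xb0 is a continuation byte in UTF-8, so it cannot start a character, but it is a valid lead byte in GBK/GB2312. A minimal sketch of a more tolerant reader follows; `read_text_with_fallback` is a hypothetical helper, not a function in the project, and the GBK fallback is a guess about the file.

```python
def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Return the decoded contents of `path`, trying each encoding in order.

    Hypothetical helper: the encoding list is an assumption about the files.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: decode with the first encoding and replace undecodable bytes.
    return raw.decode(encodings[0], errors="replace")
```

If this is the cause, replacing the `open(os.path.join(path, font), 'r').read()` call in get_all_font with `read_text_with_fallback(os.path.join(path, font))` would avoid the error. It is also worth checking that the directory contains only text files, since a stray binary file such as a font would fail to decode in the same way.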

