
ocr_dataset's Introduction

Todo

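  • Provide Baidu Cloud links to the datasets
  • Convert the datasets to a unified format (detection and recognition)
    • icdar2015
    • MLT2019
    • COCO-Text_v2
    • ReCTS
    • SROIE
    • ArT
    • LSVT
    • Synth800k
    • icdar2017rctw
    • MTWI 2018
    • Baidu Chinese Scene Text Recognition
    • mjsynth
    • Synthetic Chinese String Dataset (3.6 million Chinese samples)
    • English recognition data bundle
  • Provide reading scripts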

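Download

After downloading the datasets, remember to change the paths in the annotation files to your own paths.

Shared via Baidu Netdisk: all datasets in one archive… Link: https://pan.baidu.com/s/1TkTWql2XxqPLDnFmVvHsUA?pwd=4358 Extraction code: 4358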
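The path update mentioned above can be scripted. The following is a minimal sketch, assuming the annotation files are UTF-8 text (`.txt` or `.json`) in which image paths share a common prefix; the function name `rewrite_path_prefix` and both prefixes are hypothetical and not part of this repository.

```python
from pathlib import Path


def rewrite_path_prefix(annotation_file: str, old_prefix: str, new_prefix: str) -> int:
    """Replace every occurrence of old_prefix with new_prefix in one file.

    The file is treated as plain UTF-8 text, so the same function can be
    applied to .txt and .json annotations. Returns the number of
    replacements that were made.
    """
    path = Path(annotation_file)
    text = path.read_text(encoding="utf-8")
    count = text.count(old_prefix)
    if count:
        # Only rewrite the file when something actually changes.
        path.write_text(text.replace(old_prefix, new_prefix), encoding="utf-8")
    return count
```

Because this is a plain string replacement, it is worth keeping a backup of the annotation files and choosing a prefix that is long enough not to match unrelated text.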

Datasets

| Dataset | Homepage | Task | Data | Annotation format | Notes |
| --- | --- | --- | --- | --- | --- |
| ICDAR2015 | https://rrc.cvc.uab.es/?ch=4 | Detection & recognition | Language: English; train: 1,000; test: 500 | x1, y1, x2, y2, x3, y3, x4, y4, transcription | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| MLT2019 | https://rrc.cvc.uab.es/?ch=15 | Detection & recognition | Language: mixed; train: 10,000; test: 10,000 | x1,y1,x2,y2,x3,y3,x4,y4,script,transcription | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; script: language the text belongs to; transcription: text inside the box |
| COCO-Text_v2 | https://bgshih.github.io/cocotext/ | Detection & recognition | Language: mixed; train: 43,686; validation: 10,000; test: 10,000 | json | |
| ReCTS | https://rrc.cvc.uab.es/?ch=12&com=introduction | Detection & recognition | Language: mixed; train: 20,000; test: 5,000 | { "chars": [ {"points": [x1,y1,x2,y2,x3,y3,x4,y4], "transcription": "trans1", "ignore": 0}, {"points": [x1,y1,x2,y2,x3,y3,x4,y4], "transcription": "trans2", "ignore": 0} ], "lines": [ {"points": [x1,y1,x2,y2,x3,y3,x4,y4], "transcription": "trans3", "ignore": 0} ] } | points: x1,y1,x2,y2,x3,y3,x4,y4; chars: character-level annotations; lines: line-level annotations; transcription: text inside the box; ignore: 0 = keep, 1 = ignore |
| SROIE | https://rrc.cvc.uab.es/?ch=13 | Detection & recognition | Language: English; train: 699; test: 400 | x1, y1, x2, y2, x3, y3, x4, y4, transcription | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| ArT (includes Total-Text and SCUT-CTW1500) | https://rrc.cvc.uab.es/?ch=14 | Detection & recognition | Language: mixed; train: 5,603; test: 4,563 | { "gt_1": [ {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans1", "language": "Latin", "illegibility": false}, {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans2", "language": "Chinese", "illegibility": false} ] } | points: x1,y1,x2,y2,x3,y3,x4,y4…xn,yn; transcription: text inside the box; language: language of the text; illegibility: whether the text is blurred |
| LSVT | https://rrc.cvc.uab.es/?ch=16 | Detection & recognition | Language: mixed; fully annotated: train 30,000, test 20,000; text-only annotation: 400,000 | { "gt_1": [ {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans1", "illegibility": false}, {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans2", "illegibility": false} ] } | points: x1,y1,x2,y2,x3,y3,x4,y4…xn,yn; transcription: text inside the box; illegibility: whether the text is blurred |
| Synth800k | http://www.robots.ox.ac.uk/~vgg/data/scenetext/ | Detection & recognition | Language: English; 800,000 | imnames, wordBB, charBB, txt | imnames: file names; wordBB: 2×4×n, word boxes in each image; charBB: 2×4×n, character boxes in each image; txt: strings in each image |
| icdar2017rctw | https://blog.csdn.net/wl1710582732/article/details/89761818 | Detection & recognition | Language: mixed; train: 8,034; test: 4,229 | x1,y1,x2,y2,x3,y3,x4,y4,<recognition difficulty>,transcription | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| MTWI 2018 | Recognition: https://tianchi.aliyun.com/competition/entrance/231684/introduction Detection: https://tianchi.aliyun.com/competition/entrance/231685/introduction | Detection & recognition | Language: mixed; train: 10,000; test: 10,000 | x1, y1, x2, y2, x3, y3, x4, y4, transcription | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| Baidu Chinese Scene Text Recognition | https://aistudio.baidu.com/aistudio/competition/detail/20 | Recognition | Language: mixed; train: not counted; test: not counted | h,w,name,value | h: image height; w: image width; name: image name; value: text in the image |
| mjsynth | http://www.robots.ox.ac.uk/~vgg/data/text/ | Recognition | Language: English; 9,000,000 | - | - |
| Synthetic Chinese String Dataset (3.6 million Chinese samples) | Link: https://pan.baidu.com/s/1jefn4Jh4jHjQdiWoanjKpQ Extraction code: spyi | Recognition | Language: mixed; 300k | - | - |
| English recognition data bundle (https://github.com/clovaai/deep-text-recognition-benchmark) | Training: MJSynth and SynthText; validation: IIIT, SVT, IC03, IC13, IC15, SVTP, CUTE. Link: https://pan.baidu.com/s/1KSNLv4EY3zFWHpBYlpFCBQ Extraction code: rryk | Recognition | Language: English | - | - |
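Several datasets in the table (ICDAR2015, SROIE, MTWI 2018) use the line format `x1, y1, x2, y2, x3, y3, x4, y4, transcription`. As a rough illustration of reading one such line, here is a minimal sketch; the function name `parse_quad_line` is hypothetical and this is not the repository's own reading script.

```python
from typing import List, Tuple


def parse_quad_line(line: str) -> Tuple[List[Tuple[int, int]], str]:
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,transcription' into (points, text)."""
    # Drop a leading UTF-8 byte order mark and the trailing newline, if any.
    line = line.lstrip("\ufeff").rstrip("\r\n")
    # Split at most 8 times so commas inside the transcription are preserved.
    parts = line.split(",", 8)
    if len(parts) < 9:
        raise ValueError(f"expected 8 coordinates and a transcription: {line!r}")
    # float() tolerates surrounding spaces and values such as "12.0".
    coords = [int(float(value)) for value in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    return points, parts[8]
```

Limiting the split to eight commas matters because a transcription may itself contain commas. Formats with an extra field before the transcription, such as `script` in MLT2019 or the difficulty flag in icdar2017rctw, would need one more split and are not handled by this sketch.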

Data generation tools

https://github.com/TianzhongSong/awesome-SynthText

Dataset reading scripts

ocr_dataset's People

Contributors

wenmuzhou

ocr_dataset's Issues

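About the all-in-one archive

Thank you very much for sharing, and for all the work!
What format is the all-in-one archive on Baidu Cloud in, and how should the data be read?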

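About the LSVT 2019 labels

Thanks to the author for open-sourcing this project. I read your code, which converts .txt labels to JSON. Is there a way to convert JSON labels such as those of LSVT 2019 back into .txt labels?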
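The source does not include such a converter. As a rough sketch of the direction asked about, the code below flattens labels in the `{"gt_1": [{"points": ..., "transcription": ...}]}` layout shown in the dataset table into ICDAR-style text lines. The function name `json_to_txt_lines`, the `skip_illegible` option, and the output line layout are assumptions made for illustration.

```python
import json
from typing import Dict, List


def json_to_txt_lines(labels: Dict[str, list], skip_illegible: bool = False) -> Dict[str, List[str]]:
    """Map each image id to a list of 'x1,y1,...,xn,yn,transcription' lines."""
    result: Dict[str, List[str]] = {}
    for image_id, boxes in labels.items():
        lines = []
        for box in boxes:
            if skip_illegible and box.get("illegibility", False):
                continue
            # Flatten [[x1, y1], [x2, y2], ...] into "x1,y1,x2,y2,...".
            coords = ",".join(str(v) for point in box["points"] for v in point)
            lines.append(f"{coords},{box['transcription']}")
        result[image_id] = lines
    return result


if __name__ == "__main__":
    sample = json.loads(
        '{"gt_1": [{"points": [[1, 2], [3, 4], [5, 6]], '
        '"transcription": "abc", "illegibility": false}]}'
    )
    for image_id, lines in json_to_txt_lines(sample).items():
        # One .txt file per image would be written here; printing keeps the sketch side-effect free.
        print(image_id, lines)
```

Since polygons in this format can have any number of points, a reader of the resulting .txt cannot rely on a fixed column count, and a transcription that starts with digits and commas would be ambiguous. Writing the point count as a leading field is one way around that.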

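A pitfall when converting data

I found this with the ArT dataset. If text boxes are described by different numbers of points, for example box A has 5 points and box B has 7, then line 26 of DetCollateFN.py raises an error. Tensors and NumPy arrays require every entry along a dimension to have the same number of points; otherwise NumPy turns each entry into a list object, which cannot be converted to a tensor. Converting the list directly to a tensor does not work either.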
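One possible workaround for the ragged polygons described above, which is not necessarily what the repository does, is to pad every polygon in a batch to the same number of points and keep a mask that marks the real ones. The sketch below uses plain Python lists; `pad_polygons` and `pad_value` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def pad_polygons(
    polygons: List[List[Sequence[float]]], pad_value: float = 0.0
) -> Tuple[List[List[List[float]]], List[List[int]]]:
    """Pad every polygon to the same number of points.

    Returns (padded, mask), where mask[i][j] is 1 for a real point of
    polygon i and 0 for a padding point.
    """
    max_points = max((len(poly) for poly in polygons), default=0)
    padded, mask = [], []
    for poly in polygons:
        n_pad = max_points - len(poly)
        real = [[float(x), float(y)] for x, y in poly]
        # Build a fresh list per padding point so rows never share storage.
        filler = [[pad_value, pad_value] for _ in range(n_pad)]
        padded.append(real + filler)
        mask.append([1] * len(poly) + [0] * n_pad)
    return padded, mask
```

The padded result is rectangular, with shape (number of boxes, max points, 2), so it can be passed to `numpy.asarray` or `torch.tensor` without producing an object array. Downstream code then has to use the mask to ignore the padding points.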

Baidu Cloud link

Hello, the Baidu Cloud link appears to have expired. Could you generate a new one?

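Some coordinates in coco-text and synthtext are negative

Hello, I found that some polygon annotations in these two datasets have negative coordinates. Is that normal? Should that part of the data simply be discarded? It seems that a great many annotations are affected.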
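The source does not say whether these negative values are expected. If one decides to keep such annotations rather than discard them, a common option is to clamp each point to the image bounds. The following is a minimal sketch under that assumption; `clip_polygon` is a hypothetical helper, and the image width and height must be supplied by the caller.

```python
from typing import List, Sequence, Tuple


def clip_polygon(
    points: Sequence[Tuple[float, float]], width: int, height: int
) -> List[Tuple[float, float]]:
    """Clamp every (x, y) into the ranges [0, width - 1] and [0, height - 1]."""
    return [
        (min(max(x, 0), width - 1), min(max(y, 0), height - 1))
        for x, y in points
    ]


if __name__ == "__main__":
    # A box that sticks out of a 100x80 image on three sides.
    print(clip_polygon([(-5, 10), (50, -3), (120, 40), (-1, 90)], 100, 80))
```

Clamping changes the shape of a box that lies largely outside the image, so it may still be sensible to drop boxes whose clipped area becomes very small.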

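Files inside individual datasets

Hello, and thanks for collecting so many useful datasets.
I downloaded several of them (cocotextv2, the Baidu Chinese recognition set, mtwi2018, and others) and found that some files are only 600-900 bytes in size. Some cannot be displayed, for example a few in the MTWI recognition set, and in others no text is visible in the image. They cause errors during training. Were these files like this originally, or is it a result of the cropping done in preprocessing?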
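Whatever the cause, files like these can be listed before training so that they do not crash the data loader. The sketch below flags image files that are very small or that do not start with a JPEG or PNG signature. The function name `find_suspicious_images` and the 1024-byte default threshold are assumptions, and a small file is not necessarily a broken one, so the output is best treated as a list to inspect.

```python
import os
from typing import List

JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def find_suspicious_images(root: str, min_bytes: int = 1024) -> List[str]:
    """Return paths of .jpg/.jpeg/.png files that are tiny or lack a valid header."""
    suspicious = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            path = os.path.join(dirpath, name)
            # Read only the first bytes; that is enough to check the signature.
            with open(path, "rb") as f:
                header = f.read(8)
            has_magic = header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)
            if os.path.getsize(path) < min_bytes or not has_magic:
                suspicious.append(path)
    return sorted(suspicious)
```

A signature check only catches files that are not images at all. Truncated images with a valid header would need to be decoded with an image library to be detected.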

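Vocabulary question

How can a vocabulary be built from the datasets? And once it is built, how can the imbalanced class distribution in it be handled?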
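The source gives no answer here. For the first half of the question, a character-level vocabulary can be collected by counting every character across the transcriptions. This is a minimal sketch, assuming the transcriptions are already loaded as Python strings; `build_vocab` and `min_count` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_vocab(transcriptions: Iterable[str], min_count: int = 1) -> Tuple[List[str], Counter]:
    """Return (vocab, counts) for the characters seen in the transcriptions.

    vocab holds the characters that occur at least min_count times, ordered
    by descending frequency and then by character so the result is stable.
    """
    counts = Counter(ch for text in transcriptions for ch in text)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    vocab = [ch for ch, n in ordered if n >= min_count]
    return vocab, counts
```

The returned counts make the frequency distribution visible, which is a starting point for judging how imbalanced the classes are. The sketch does not attempt to correct that imbalance.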