
facebookresearch / imgur5k-handwriting-dataset


IMGUR5K handwriting set. It is a handwritten in-the-wild dataset that contains challenging real-world handwritten samples from different writers. The dataset is shared as a set of image URLs with annotations. This code downloads the images and verifies the hash of each image to avoid data contamination.

License: Other

Python 100.00%

imgur5k-handwriting-dataset's Introduction


IMGUR5K Handwriting Dataset

To run the code that downloads the images and generates the corresponding annotations:

Usage: python download_imgur5k.py --dataset_info_dir <dir_with_annotation_and_hashes> --output_dir <path_to_store_images>

Requirements

IMGUR5K download code works with

  • Python3
  • Numpy
  • Requests
  • PIL

Downloading images of IMGUR5K

Run the command above and set <path_to_store_images> to the target image directory.

How IMGUR5K download works

The code checks the validity of each URL by comparing the MD5 hash of the downloaded image with the ground-truth MD5 hash. If the image is pristine, its annotations are added to the generated annotations file and to the file for its split.
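The hash comparison can be sketched as follows. This is a minimal illustration, assuming the reference value is the hexadecimal MD5 digest of the raw downloaded bytes; the function names `md5_of_bytes` and `is_pristine` are hypothetical and are not taken from download_imgur5k.py.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def is_pristine(data: bytes, reference_md5: str) -> bool:
    """Return True when the downloaded bytes match the reference digest.

    Assumption: the reference is a hex string; surrounding whitespace
    and letter case are normalised before comparing.
    """
    return md5_of_bytes(data) == reference_md5.strip().lower()
```

An image whose digest differs from the reference would be skipped, so its annotations never reach the generated files.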

Full documentation

IMGUR5K is shared as a set of image URLs with annotations. This code downloads the images and verifies the hash of each image to avoid data contamination.

REQUIRED FILES:

  • download_imgur5k.py : Code that downloads the images used to build the dataset.
  • <dataset_info_dir>/imgur5k_data.lst : File containing URLs with annotations and bounding boxes.
  • <dataset_info_dir>/imgur5k_hashes.lst : File containing URL indexes with ground-truth MD5 hashes.
  • <dataset_info_dir>/train_index_ids.lst : File containing URL indexes belonging to the train split.
  • <dataset_info_dir>/val_index_ids.lst : File containing URL indexes belonging to the val split.
  • <dataset_info_dir>/test_index_ids.lst : File containing URL indexes belonging to the test split.

Output:

  • <path_to_store_images>/.jpg :
    • Images downloaded to output_dir
  • imgur5k_annotations.json :
    • JSON file with image annotation mappings -> written to dataset_info_dir
      • Format: { "index_id" : {indexes}, "index_to_annotation_map" : { annotations ids for an index}, "annotation_id": { each annotation's info } }
      • Annotation ID: bounding_box in (xc, yc, w, h, a) format, in absolute floating-point coordinates.
        • (xc, yc) is the center of the rotated box
        • (w, h) is the width and height
        • (a) is the angle in degrees counterclockwise.
      • A bounding box given as '.' means the annotation was not done, for various reasons
  • imgur5k_annotations_train.json :
    • JSON file with image annotation mappings of the TRAIN split only -> written to dataset_info_dir
  • imgur5k_annotations_val.json :
    • JSON file with image annotation mappings of the VAL split only -> written to dataset_info_dir
  • imgur5k_annotations_test.json :
    • JSON file with image annotation mappings of the TEST split only -> written to dataset_info_dir

[All imgur5k_annotations_*.json files use the same format as imgur5k_annotations.json]
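The (xc, yc, w, h, a) box format can be turned into four corner points with a short helper. The sketch below assumes that bounding_box is stored as a string such as "[xc, yc, w, h, a]" and that the angle is counterclockwise as seen on screen, with the image y axis pointing down. Both are assumptions worth checking against a few images, and the function names `parse_bounding_box` and `box_corners` are hypothetical.

```python
import json
import math


def parse_bounding_box(raw: str):
    """Return (xc, yc, w, h, a) as floats, or None for an unannotated '.' box."""
    if raw.strip() == ".":
        return None
    values = json.loads(raw)
    if len(values) != 5:
        raise ValueError(f"expected 5 values, got {len(values)}")
    return tuple(float(v) for v in values)


def box_corners(xc, yc, w, h, a_deg):
    """Return the four corners of the rotated box as (x, y) tuples.

    Assumption: positive angles rotate counterclockwise on screen,
    where the image y axis points down.
    """
    theta = math.radians(a_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    # Offsets of the unrotated corners relative to the box center.
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        x = xc + dx * cos_t + dy * sin_t
        y = yc - dx * sin_t + dy * cos_t
        corners.append((x, y))
    return corners
```

If boxes drawn this way look mirrored against the handwriting, flipping the sign of the angle is the first thing to try.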

NOTE: Apart from the ~5K images used in the TextStyleBrush paper, ~4K more images are added to the dataset to foster research in handwriting recognition.

Statistics

Description                 | Count
# Page Images               | 8,177
# Word Images               | 230,573
# Lexicons (case-sensitive) | 49,317

The train/val/test split ratio is 80%:10%:10% at the level of page images; the details are provided in the respective JSON files created as part of the output.

Disclaimer: The dataset is provided using public links to each image, and the availability of these images is controlled by IMGUR or the original user (who uploaded it).

Contribution

See the CONTRIBUTING file for how to help out.

License

IMGUR5K is licensed under the Creative Commons Attribution-NonCommercial 4.0 International Public License, as found in the LICENSE file.

Citation

If you find this data useful, please consider citing our paper:

Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev and Tal Hassner, TextStyleBrush: Transfer of Text Aesthetics from a Single Example, arXiv: 2106.08385 2021.

@misc{krishnan2021textstylebrush,
      title={TextStyleBrush: Transfer of Text Aesthetics from a Single Example}, 
      author={Praveen Krishnan and Rama Kovvuri and Guan Pang and Boris Vassilev and Tal Hassner},
      year={2021},
      eprint={2106.08385},
      archivePrefix={arXiv},
}


imgur5k-handwriting-dataset's Issues

download_imgur5k.py not working

[jorj@jorj-systemproductname IMGUR5K-Handwriting-Dataset-main]$ python download_imgur5k.py
For IMG: lRgjZ, ref hash: c64945bd74c067f29e01f2f3b5eeff60 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: NeVsJy7, ref hash: 924eb5398cea242b01f43e73b1a12811 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 7IUPrpZ, ref hash: 3e4f912a1e9d91c35c68c0880826e680 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: eDAsVEc, ref hash: adfd2c3ec792447999c478d67b7d8f80 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: MDvEHIM, ref hash: 15fdfc190a67677d519979c03b18f37f != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hdZPXS2, ref hash: 40f094f7bf1e56ed56cc2fcb8adcff14 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 1fyjan3, ref hash: bdd3b29b6a4e1980ca9541a32fbdc22a != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: k7R0Mfv, ref hash: c66e25a217286e8ac03806deec3a7257 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Nokn65F, ref hash: 8ef5355846c5806d280a7ef563bc3f45 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: TAsDhPW, ref hash: 42396e05b53eee4592706814b485430e != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: HFIyBwb, ref hash: c0ce1d8d1656a649a01f7fa2997988c4 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hwq46gA, ref hash: 8c27046ac37905291bbd9cb2cec72241 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: xBG71ye, ref hash: 9a7ea2e2e5c1ee3f5da627af7d253f09 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: DwLfSKp, ref hash: c6d69eca85379be3f04dcb3d0aee30ae != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 5L3hO70, ref hash: e7357212cd2513c908e6904600de0c78 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Vubu0si, ref hash: bc98f84c094558e90318bf73ceae2e82 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 5BNF7aJ, ref hash: a2ac1472db26212731070354f8008990 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: KSSvYJH, ref hash: f7f71a1646fbdba638eca0365f09cff6 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: nUnLGVR, ref hash: 0e8e826cb85b53f5a459c0f0eed36d4f != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: PFTXe3d, ref hash: 1f372a9c9fc035ff0b57a4f21e070b9c != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: fa3NgTS, ref hash: 616f5801a0bdbdacf26e78536641a860 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: XaCS4cw, ref hash: 8a324606e7ba86562d6883131a191864 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hdWyqc6, ref hash: ca0ecf71433d37ab6a34ee7229586d07 != cur hash: d835884373f4d6c8f24742ceabe74946

Licensing for commercial use?

Hello,

We would like to use the dataset as additional training data for our OCR model. However, the current license does not allow the data to be used for commercial purposes. Is it possible for a company to license the dataset for such purposes? If so, whom can we contact about this? Or is this not possible because the data originates from Imgur?

Some images were removed from imgur

2021.11.24

For IMG: lRgjZ, ref hash: c64945bd74c067f29e01f2f3b5eeff60 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: NeVsJy7, ref hash: 924eb5398cea242b01f43e73b1a12811 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 7IUPrpZ, ref hash: 3e4f912a1e9d91c35c68c0880826e680 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hdZPXS2, ref hash: 40f094f7bf1e56ed56cc2fcb8adcff14 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Nokn65F, ref hash: 8ef5355846c5806d280a7ef563bc3f45 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hwq46gA, ref hash: 8c27046ac37905291bbd9cb2cec72241 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: xBG71ye, ref hash: 9a7ea2e2e5c1ee3f5da627af7d253f09 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: KSSvYJH, ref hash: f7f71a1646fbdba638eca0365f09cff6 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: nUnLGVR, ref hash: 0e8e826cb85b53f5a459c0f0eed36d4f != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: PFTXe3d, ref hash: 1f372a9c9fc035ff0b57a4f21e070b9c != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: fa3NgTS, ref hash: 616f5801a0bdbdacf26e78536641a860 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: pyXUSxO, ref hash: ea7b303e76d47ce8555286f67bccad5b != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: GjgtyBl, ref hash: b6980d39ce80b2a3085cd89c537327b7 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Cs0smsA, ref hash: ac942db4d0071e882db20dbca2de8d5d != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: kHgtG4H, ref hash: 532bf487cee2a3266f6985ce322626f2 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: kyRKrOy, ref hash: 79dab6bff97aa22fb8aac47676dd150f != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 91V1uHF, ref hash: 6fd3da585984de869c9e3f85ab96fd72 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 1IYlYlq, ref hash: 641076db3f95efea3fb35782777dabaf != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: sAOdjXq, ref hash: 730fd6033c8f255d4f1774b2d049922e != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: mlmRA89, ref hash: cb2b6705e71a3f8fb4ca29640b3de230 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 3TIryzT, ref hash: fef7718a45ee39d5ab324a0d792f8ee0 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: pldp0ke, ref hash: 7cbd0528faa5018e08d4e08834ebc8ab != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: s7WGXwr, ref hash: 6c25277ca43925cd93eac806fb646937 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: HIAwuPd, ref hash: 150c87bb0dc4d7819abf46807eafbf39 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: DGafbuR, ref hash: 212a52ab552f75a6d4655e07865188c0 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: PPqWkdx, ref hash: d8c4a27288f0c4db3a716dc3fd06dee2 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 03IJytp, ref hash: 9ff28f403eac64b136006a5c86a49c84 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: pjRXC0f, ref hash: 36016c1784a21f092f26e78c27c7d064 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 6De62VB, ref hash: 3ad6e31174112f63b633db85644238a0 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 00Wo8nQ, ref hash: 7cedb0a7914a5336d2de9a21a58eb788 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: bX1Ajfi, ref hash: d7cfd20cddfe6a9fee3b9bea5e1f6564 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Idip0tp, ref hash: 8785533373eb588fd1e49a7537894692 != cur hash: d835884373f4d6c8f24742ceabe74946

Can't understand box format

Hello, thank you for sharing the dataset! It looks like a lot of hard work has been done! 👋👋👋
I have one question about the annotations: I can't understand the format of the bboxes: bounding_box in xywha format?
What does the "a" at the end mean?
I know three popular formats: pascal_voc [x_min, y_min, x_max, y_max]; coco [x_min, y_min, width, height]; yolo [x_center, y_center, width, height], where x_center and y_center are normalized coordinates; https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/
But I can't find any "a" in them. Can you please help me understand this?

URL retrieval failed

Hi, I'm trying to download the images, but when I run the download script I keep getting a URL retrieval failure for every image (e.g. "URL retrieval for ZfOptdN failed!!"). The only things I have changed in the code are the dataset_info and images directories. Could you help me with that? Thank you in advance :)

[Bug] bad annotation format in imgur5k_annotations_train.json

There is a label with a badly formatted bounding box in imgur5k_annotations_train.json (line 422016) under the dataset_info folder.

Specifically, the closing bracket is missing:
"bounding_box": "[1272.67, 1122.67, 1099.33, 2169.33, -10.67", which can be fixed easily.
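One way to tolerate this kind of entry when loading the annotations is to restore the missing bracket before parsing. The helper below is only a sketch: `repair_bounding_box` is a hypothetical name, and it assumes the only defect is a missing trailing "]".

```python
import json


def repair_bounding_box(raw: str) -> str:
    """Append a closing bracket when a box string starts with '[' but lacks ']'."""
    text = raw.strip()
    if text.startswith("[") and not text.endswith("]"):
        text += "]"
    return text


# The malformed value quoted in this issue parses once repaired.
broken = "[1272.67, 1122.67, 1099.33, 2169.33, -10.67"
values = json.loads(repair_bounding_box(broken))
```

Well-formed strings and the '.' placeholder pass through unchanged, so the helper can be applied to every entry.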
