
oidv4_toolkit's Introduction

~ OIDv4 ToolKit ~

Do you want to build your personal object detector but you don't have enough images to train your model? Do you want to train your personal image classifier, but you are tired of the deadly slowness of ImageNet? Have you already discovered Open Images Dataset v4, which has 600 classes and more than 1,700,000 images with related bounding boxes ready to use? Do you want to exploit it for your projects, but you don't want to download gigabytes and gigabytes of data?

This repository helps you get the best out of this dataset with as little effort as possible. In particular, this practical ToolKit, written in Python3, gives you the following options for both object detection and image classification tasks:

(2.0) Object Detection

  • download any of the 600 classes of the dataset individually, taking care of creating the related bounding boxes for each downloaded image
  • download multiple classes at the same time, creating a separate folder and bounding boxes for each of them
  • download multiple classes into a common folder, with a single annotation file for each image
  • download a single class or multiple classes with the desired attributes
  • use the practical visualizer to inspect the downloaded classes

(3.0) Image Classification

  • download any of the 19,794 classes into a common labeled folder
  • exploit tens of possible commands to select only the desired images (e.g., only test images)

The code is well documented and designed to be easy to extend and improve. Angelo and I will be pleased if our little bit of code can help you with your project and research. Enjoy ;)

Snippet of the OIDv4 available classes

Open Images Dataset v4

All the information related to this huge dataset can be found here. The following lines simply summarize some statistics and important tips.

Object Detection

          Train       Validation  Test      #Classes
Images    1,743,042   41,620      125,436   -
Boxes     14,610,229  204,621     625,282   600

Image Classification

                          Train       Validation  Test       #Classes
Images                    9,011,219   41,620      125,436    -
Machine-Generated Labels  78,977,695  512,093     1,545,835  7,870
Human-Verified Labels     27,894,289  551,390     1,667,399  19,794

As can be observed from the previous table, images are available from three different groups: train, validation and test. The ToolKit provides a way to restrict the search to a specific group. Regarding object detection, it's important to underline that some annotations have been done as a group, meaning that a single bounding box can enclose more than one instance. As mentioned by the creator of the dataset:

  • IsGroupOf: Indicates that the box spans a group of objects (e.g., a bed of flowers or a crowd of people). We asked annotators to use this tag for cases with more than 5 instances which are heavily occluding each other and are physically touching. This is again an option of the ToolKit that can be used to grab only the desired images.

Finally, it's interesting to note that not all annotations have been produced by humans: the creators also exploited an enhanced version of the method described here [1].

1.0 Getting Started

1.1 Installation

Python3 is required.

  1. Clone this repository
    git clone https://github.com/EscVM/OIDv4_ToolKit.git
  2. Install the required packages
    pip3 install -r requirements.txt

Peek inside the requirements file to check whether you already have everything installed. Most of the dependencies are common libraries.

1.2 Launch the ToolKit to check the available options

First of all, if you want a quick reminder of all the options offered by the script, you can simply launch main.py from your console of choice. Remember to always run it from the main directory of the project:

python3 main.py

or in the following way to get more information

python3 main.py -h

2.0 Use the ToolKit to download images for Object Detection

The ToolKit allows you to download your dataset into the folder you want (Dataset by default). The folder can be set with the argument --Dataset, so you can build different datasets with different options.

As previously mentioned, there are different available options that can be exploited. Let's see some of them.

2.1 Download different classes in separated folders

Firstly, the ToolKit can be used to download classes into separate folders. The argument --classes accepts a list of classes or the path to a file.txt (--classes path/to/file.txt) that contains the list of all classes, one per line (classes.txt is uploaded as an example).

Note: for class names composed of multiple words, use the _ character instead of a space (only for the inline use of the --classes argument). Example: Polar_bear.
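For reference, a class file passed with --classes path/to/file.txt is just a plain text list with one class per line (multi-word names can be written naturally here, since the underscore rule above only applies to the inline form). A purely illustrative example:

 Apple
 Orange
 Polar bear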

Let's, for example, download Apples and Oranges from the validation set. In this case we have to use the following command:

 python3 main.py downloader --classes Apple Orange --type_csv validation

The algorithm will take care of downloading all the necessary files and building a directory structure like this:

main_folder
│   main.py
│
└───OID
    │   file011.txt
    │   file012.txt
    │
    └───csv_folder
    |    │   class-descriptions-boxable.csv
    |    │   validation-annotations-bbox.csv
    |
    └───Dataset
        |
        └─── test
        |
        └─── train
        |
        └─── validation
             |
             └───Apple
             |     |
             |     |0fdea8a716155a8e.jpg
             |     |2fe4f21e409f0a56.jpg
             |     |...
             |     └───Labels
             |            |
             |            |0fdea8a716155a8e.txt
             |            |2fe4f21e409f0a56.txt
             |            |...
             |
             └───Orange
                   |
                   |0b6f22bf3b586889.jpg
                   |0baea327f06f8afb.jpg
                   |...
                   └───Labels
                          |
                          |0b6f22bf3b586889.txt
                          |0baea327f06f8afb.txt
                          |...

If you have already downloaded the different csv files, you can simply put them in the csv_folder. The script automatically takes care of downloading these files, but if you want to download them manually for whatever reason, you can find them here.

If you interrupt the downloading script (ctrl+d) you can always restart it from the last image downloaded.

2.2 Download multiple classes in a common folder

This option allows you to download multiple classes into a common folder. The related annotations are also mixed together, using the already explained format (the first element is always the name of the single class). In this way, with a simple dictionary it's easy to parse the generated labels into the desired format (see the sketch after the example command below).

Again, if we want to download Apples and Oranges, but into a common folder:

 python3 main.py downloader --classes Apple Orange --type_csv validation --multiclasses 1
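Since the class name is always the first element of each line, a minimal Python sketch (not part of the ToolKit; the file path in the usage comment is purely hypothetical) that groups the boxes of a mixed annotation file by class could look like this:

from collections import defaultdict

def parse_multiclass_label(path):
    # Each line reads: name_of_the_class left top right bottom.
    # The class name can contain spaces, so the last four tokens are the
    # coordinates and everything before them is the class name.
    boxes_by_class = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            name = " ".join(parts[:-4])
            left, top, right, bottom = map(float, parts[-4:])
            boxes_by_class[name].append((left, top, right, bottom))
    return boxes_by_class

# Hypothetical usage (the folder name is only an example):
# boxes = parse_multiclass_label("OID/Dataset/validation/Apple_Orange/Labels/0fdea8a716155a8e.txt")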

Annotations

In the original dataset the coordinates of the bounding boxes are expressed in the following way:

XMin, XMax, YMin, YMax: coordinates of the box, in normalized image coordinates. XMin is in [0,1], where 0 is the leftmost pixel, and 1 is the rightmost pixel in the image. Y coordinates go from the top pixel (0) to the bottom pixel (1).

However, in order to accommodate a more intuitive representation and give maximum flexibility, every .txt annotation is written as:

name_of_the_class left top right bottom

where each coordinate is denormalized, i.e. the four values correspond to actual pixel positions in the related image.
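As an illustration only (this helper is not part of the ToolKit), converting such a denormalized box back to the original normalized XMin, XMax, YMin, YMax convention just requires dividing by the image width and height:

def to_normalized(left, top, right, bottom, img_width, img_height):
    # left/right are pixel X coordinates, top/bottom are pixel Y coordinates.
    x_min, x_max = left / img_width, right / img_width
    y_min, y_max = top / img_height, bottom / img_height
    return x_min, x_max, y_min, y_max

# Example call with purely hypothetical image dimensions:
# to_normalized(14.72, 7.68, 669.44, 767.36, 1024, 768)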

If you don't need the label creation, use --noLabels.

Optional Arguments

The annotations of the dataset have been marked with a bunch of boolean attributes, reported below:

  • IsOccluded: Indicates that the object is occluded by another object in the image.
  • IsTruncated: Indicates that the object extends beyond the boundary of the image.
  • IsGroupOf: Indicates that the box spans a group of objects (e.g., a bed of flowers or a crowd of people). We asked annotators to use this tag for cases with more than 5 instances which are heavily occluding each other and are physically touching.
  • IsDepiction: Indicates that the object is a depiction (e.g., a cartoon or drawing of the object, not a real physical instance).
  • IsInside: Indicates a picture taken from the inside of the object (e.g., a car interior or inside of a building).
  • n_threads: Select how many threads you want to use. The ToolKit will take care for you to download multiple images in parallel, considerably speeding up the downloading process.
  • limit: Limit the number of images being downloaded. Useful if you want to restrict the size of your dataset.
  • y: Answer yes automatically when asked to download missing csv files.

Naturally, the ToolKit provides the same attributes as parameters in order to filter the downloaded images. For example, with:

 python3 main.py downloader -y --classes Apple Orange --type_csv validation --image_IsGroupOf 0

only images without group annotations are downloaded.

3.0 Download images from Image-Level Labels Dataset for Image Classification

The ToolKit is now also able to access the huge dataset without bounding boxes. This dataset consists of 19,995 classes and is already divided into train, validation and test. The command used to download from this dataset is downloader_ill (Downloader of Image-Level Labels) and it requires the argument --sub. This argument selects the sub-dataset between human-verified labels h (5,655,108 images) and machine-generated labels m (8,853,429 images). An example command is:

python3 main.py downloader_ill --sub m --classes Orange --type_csv train --limit 30

The previously explained arguments Dataset, multiclasses, n_threads and limit are also available. The ToolKit will automatically put the dataset and the csv folder in dedicated folders whose names end with _nl.

Commands sum-up

                    downloader  visualizer  downloader_ill
Dataset             O           O           O               Dataset folder name
classes             R                       R               Considered classes
type_csv            R                       R               Train, test or validation dataset
y                   O                       O               Answer yes when downloading missing csv files
multiclasses        O                       O               Download classes together
noLabels            O                                       Don't create labels
Image_IsOccluded    O                                       Consider or not this filter
Image_IsTruncated   O                                       Consider or not this filter
Image_IsGroupOf     O                                       Consider or not this filter
Image_IsDepiction   O                                       Consider or not this filter
Image_IsInside      O                                       Consider or not this filter
n_threads           O                       O               Indicates the maximum number of threads
limit               O                       O               Max number of images to download
sub                                         R               Human-verified or machine-generated images (h/m)

R = required, O = optional

4.0 Use the ToolKit to visualize the labeled images

The ToolKit is also useful for visualizing the downloaded images with their respective labels.

   python3 main.py visualizer

In this way the default Dataset folder will be searched automatically for images and labels. To point to another folder, use the optional --Dataset argument.

   python3 main.py visualizer --Dataset desired_folder

Then the system will ask which folder to visualize (train, validation or test) and the desired class. With d (next), a (previous) and q (exit) you can then browse all the images. Follow the menu for all the other options.

5.0 Community Contributions

  • Denis Zuenko has added multithreading to the ToolKit and is currently working on generalizing and speeding up the label creation process
  • Skylion007 has improved label creation, reducing the runtime from O(nm) to O(n), which massively speeds up label generation
  • Alex March has added the limit option to the ToolKit in order to download only a maximum number of images of a certain class
  • Michael Baroody has fixed the toolkit's visualizer for multiword classes

Citation

Use this bibtex if you want to cite this repository:

@misc{OIDv4_ToolKit,
  title={Toolkit to download and visualize single or multiple classes from the huge Open Images v4 dataset},
  author={Vittorio, Angelo},
  year={2018},
  publisher={Github},
  journal={GitHub repository},
  howpublished={\url{https://github.com/EscVM/OIDv4_ToolKit}},
}

Reference

"We don't need no bounding-boxes: Training object class detectors using only human verification"Papadopolous et al., CVPR 2016.

oidv4_toolkit's People

Contributors

cdleong, escvm, hosaka, keldrom, mbaroody, skylion007


oidv4_toolkit's Issues

Crop out the bounding box

How can I get only the part of the image that is inside the bounding box? So just the bounding box, not its surroundings.

Thank you!

Multi-word class names

When label files are created, it would be nice if multi-word class names like "adhesive tape" and "brown bear" could be put in quotes or have the space replaced with an underscore. Otherwise it's a little problematic to process those files.

PS.
The README suggests using underscores in the classes file, but such classes (e.g. adhesive_tape) aren't found. It happily accepts natural names though, so I don't know what's going on.

Support for different annotation file format, such as Pascal VOC

Hello, your tool is great and very useful.
I would like to use the downloaded dataset with TensorFlow, so I have to build a TFRecord file from it. However, before starting training, I think I am going to modify/add some labels via the labelImg tool, so I first need to build xml files in Pascal VOC format. It looks like it is not that hard to obtain them from your txt files, but I am wondering whether you would consider including this file format as an output of your toolkit (I am an ML newbie so I hope I am not asking something obvious...).

Annotation bounding box coordinates with invalid values?

I have downloaded images and label files for an image class, and I'm wondering if maybe I've done something to monkey with the bounding box coordinates in the label *.txt files.

For example, I have a label file (0a7df07bbac03159.txt) with the following contents:

Sword 14.72 7.68 669.44 767.360256

My understanding from the README is that the annotation bounding box coordinates should be within the normalized range [0, 1], but that's obviously not what I'm getting in my label files, as seen above.

Can anyone comment as to how I should interpret the bounding box values in the *.txt files, and/or what I've done wrong to get values that appear to be outside the expected range? Are the float values present in these files computed from the original/normalized bounding box values against the height/width of the corresponding image, and if I want the actual integer pixel numbers for the bounding box can I just round these to the nearest integer? For example, would the bounding box for the above be (left_x=15, top_y=8, right_x=669, bottom_y=767)?

Thanks in advance for any comments or suggestions.

instance segmentation

Thank you for your wonderful work. The Open Images Dataset V5 was just released, which contains instance segmentation masks. Would you mind updating the package to support downloading instance masks?

Thank you again for your work.

Missing n_threads argument causing a TypeError from download() function call

I have attempted to use this software for downloading a certain group of image classes ("Weapon").

I have used the following command:

python main.py --Dataset ~/data/openimages --classes Weapon --type_csv 'all' downloader

Once this started working I was prompted to download the missing files, and then saw many messages indicating a missing aws command:

    [INFO] | Downloading Weapon.
   [ERROR] | Missing the class-descriptions-boxable.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...145%, 0 MB, 1653 KB/s, 0 seconds passed
[DOWNLOAD] | File class-descriptions-boxable.csv downloaded into OID/csv_folder/class-descriptions-boxable.csv.
   [ERROR] | Missing the train-annotations-bbox.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...100%, 1138 MB, 10685 KB/s, 109 seconds passed
[DOWNLOAD] | File train-annotations-bbox.csv downloaded into OID/csv_folder/train-annotations-bbox.csv.

-----------------------------------------------Weapon-----------------------------------------------
    [INFO] | Downloading all images.
    [INFO] | [INFO] Found 1646 online images for train.
    [INFO] | Download of 1646 images in train.
sh: 1: aws: not found
sh: 1: aws: not found
sh: 1: aws: not found
sh: 1: aws: not found
...

Finally I am seeing the following error:

    [INFO] | Done!
    [INFO] | Creating labels for Weapon of test.
    [INFO] | Labels creation completed.
Traceback (most recent call last):
  File "main.py", line 36, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "/home/james/git/OIDv4_ToolKit/modules/bounding_boxes.py", line 89, in bounding_boxes_images
    download(args, df_val, folder[i], dataset_dir, class_name, class_code, threads = int(args.n_threads))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Perhaps this is caused by the n_threads argument not having a reasonable default value? The help information says this value is 20 by default; maybe this should be verified?

In any event thanks for making this code available. Once I get it to work it seems that it will save me lots of time for collecting a sub-dataset from OpenImages.

Support for Filtering Image-Level Labels

Hi, thanks for the great toolkit. Is it possible to download images based on image-level labels (19,995 classes), rather than the 600 boxable classes only? I have downloaded the csv needed.

Label data not downloaded

I do not know whether this is intended behavior, but the Label folder comes out empty after running the command

python main.py downloader --classes Dolphin Whale --sub h --type_csv train


I am in need of these files so that I can convert to Pascal VOC using this other tool.

(Just to be sure, I installed awscli and which aws outputs ~/.local/bin/aws, and I think that's intended.)

What could be failing?

IndexError: index 0 is out of bounds for axis 0 with size 0

Traceback (most recent call last):
  File "main.py", line 36, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "/mountdir/OIDv4_ToolKit/modules/bounding_boxes.py", line 56, in bounding_boxes_images
    class_code = df_classes.loc[df_classes[1] == class_name].values[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0


download() method crashes program for casting NoneType

Download method has misplaced indent in master

Misplaced indent causes the following

  File "main.py", line 38, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "/home/dkendall/Documents/senior_design/tools/OIDv4_ToolKit/modules/bounding_boxes.py", line 89, in bounding_boxes_images
    download(args, df_val, folder[i], dataset_dir, class_name, class_code, threads = int(args.n_threads))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

downloading.py:88

						if not args.n_threads:
 							download(args, df_val, folder[i], dataset_dir, class_name, class_code)
					else:
						download(args, df_val, folder[i], dataset_dir, class_name, class_code, threads = int(args.n_threads))'''

Proposed change

						if not args.n_threads:
 							download(args, df_val, folder[i], dataset_dir, class_name, class_code)
						else:
						        download(args, df_val, folder[i], dataset_dir, class_name, class_code, threads = int(args.n_threads))

download multiclass images

I want to know how I can download images for many classes together, e.g. 1000 classes. How should I set the "classes" parameter?

ValueError: not enough values to unpack (expected 2, got 0)

While downloading the images on windows machine it is throwing the below traceback:

Traceback (most recent call last):
  File "main.py", line 36, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "E:\python\experiment\OIDv4_ToolKit\modules\bounding_boxes.py", line 87, in bounding_boxes_images
    download(args, df_val, folder[i], dataset_dir, class_name, class_code)
  File "E:\python\experiment\OIDv4_ToolKit\modules\downloader.py", line 21, in download
    rows, columns = os.popen('stty size', 'r').read().split()
ValueError: not enough values to unpack (expected 2, got 0)

How to stop it?

The tool is downloading, but I want to stop it. I pressed Ctrl+C several times but I can't interrupt it.
What do I need to do?

Only downloads 13 images

I tried to use the software, but it only downloads 13 images, even though I set the limit to 500

Usage with v3 classes

Is there any way to download and label image classes from the v3 version? I changed the csv_folder content to the files from v3 and the OID URL to https://storage.googleapis.com/openimages/2017_11/, but, for example, downloading Chimney images outputs:
[INFO] Downloading Chimney.
----------Chimney----------
[INFO] Downloading train images.
[INFO] Found 0 online images for train.
[INFO] All images already downloaded.
[INFO] Creating labels for Chimney of train.
[INFO] Labels creation completed.

How long does it take to download images for a single class?

I started downloading the 'Person' class train images about 14 hours ago and it's still at 75% complete. I just wonder how long it takes for others.
I'm not very familiar with data issues, so I'm not sure what I could do to speed this up.
Can I use a GPU to make the download faster? Or is this a normal speed for others too?
Btw thanks for the toolkit, it really helps me.

How to download a class with two words?

Dear all,

I need to download classes with two words, e.g. Human mouth, Human head, etc.
I tried using a space or _ (underscore) with no luck.
Thank you very much in advance.

Warmest Regards,
Suryadi

No action when running command

I tried to use the script, but when I run any command besides "python main.py -h", nothing happens: no error, nothing. A new prompt appears immediately after. Has anyone encountered something like this?
I installed all the requirements btw

IndexError: index 0 is out of bounds for axis 0 with size 0

Glenns-iMac:OIDv4_ToolKit glennjocher$ python3 main.py downloader --classes knife kitchen_knife --type_csv train

                   ___   _____  ______            _    _    
                 .'   `.|_   _||_   _ `.         | |  | |   
                /  .-.  \ | |    | | `. \ _   __ | |__| |_  
                | |   | | | |    | |  | |[ \ [  ]|____   _| 
                \  `-'  /_| |_  _| |_.' / \ \/ /     _| |_  
                 `.___.'|_____||______.'   \__/     |_____|
        

             _____                    _                 _             
            (____ \                  | |               | |            
             _   \ \ ___  _ _ _ ____ | | ___   ____  _ | | ____  ____ 
            | |   | / _ \| | | |  _ \| |/ _ \ / _  |/ || |/ _  )/ ___)
            | |__/ / |_| | | | | | | | | |_| ( ( | ( (_| ( (/ /| |    
            |_____/ \___/ \____|_| |_|_|\___/ \_||_|\____|\____)_|    
                                                          
        
    [INFO] | Downloading knife.
   [ERROR] | Missing the class-descriptions-boxable.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...145%, 0 MB, 9540 KB/s, 0 seconds passed
[DOWNLOAD] | File class-descriptions-boxable.csv downloaded into OID/csv_folder/class-descriptions-boxable.csv.
Traceback (most recent call last):
  File "main.py", line 37, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "/Users/glennjocher/PycharmProjects/OIDv4_ToolKit/modules/bounding_boxes.py", line 56, in bounding_boxes_images
    class_code = df_classes.loc[df_classes[1] == class_name].values[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

"chipper" option to cut out image chips using bounding boxes

Sometimes one does not want the entire image, only the part with the class of interest in it. For example, when training an image classifier or a GAN.

A crude implementation is shown below:

new file, modules/chip.py

import cv2
import os
import re

class_list = []


def chip(class_name, download_dir, label_dir, total_images, index):
    '''
    Crop every labeled box out of one downloaded image and save each crop
    as a separate "chip" image.
    '''

    global class_list

    # Skip the labels sub-folder entry if it shows up at this index.
    if not os.listdir(download_dir)[index].endswith('.jpg'):
        index += 2
    img_file = os.listdir(download_dir)[index]
    current_image_path = str(os.path.join(download_dir, img_file))
    img = cv2.imread(current_image_path)
    file_name = str(img_file.split('.')[0]) + '.txt'
    file_path = os.path.join(label_dir, file_name)

    chips_folder = "chips/"
    os.makedirs(chips_folder, exist_ok=True)

    with open(file_path, 'r') as f:
        for idx, line in enumerate(f):
            print(f"current img is {current_image_path}")
            print(f"line is {line}")
            # each row in a label file is: class_name left top right bottom
            # (denormalized pixel coordinates)
            match_class_name = re.compile(r'^[a-zA-Z]+(\s+[a-zA-Z]+)*').match(line)
            class_name = line[:match_class_name.span()[1]]
            ax = line[match_class_name.span()[1]:].strip().split()

            if class_name not in class_list:
                class_list.append(class_name)

            xmin = int(float(ax[-4]))
            ymin = int(float(ax[-3]))
            xmax = int(float(ax[-2]))
            ymax = int(float(ax[-1]))

            # OpenCV images are indexed as img[y, x], so crop rows then columns.
            roi = img[ymin:ymax, xmin:xmax]
            print(f"xmin, xmax, ymin, ymax = ({xmin}, {xmax}, {ymin}, {ymax})")
            chip_filename = os.path.splitext(os.path.basename(current_image_path))[0] + "_chip" + str(idx) + ".jpg"
            print(f"chip filename is {chip_filename}")
            chip_path = os.path.join(chips_folder, chip_filename)
            print(f"chip_path is {chip_path}")
            cv2.imwrite(chip_path, roi)

Added to bounding_boxes.py:

an import statement at the top...

from modules.chip import chip

...and this section:

    elif args.command == "chipper":
        for image_dir in ["train", "test", "validation"]:
                class_image_dir = os.path.join(dataset_dir, image_dir)
                for class_name in os.listdir(class_image_dir):

                    download_dir = os.path.join(dataset_dir, image_dir, class_name)
                    label_dir = os.path.join(dataset_dir, image_dir, class_name, 'Label')
                    if not os.path.isdir(download_dir):
                        print("[ERROR] Images folder not found")
                        exit(1)
                    if not os.path.isdir(label_dir):
                        print("[ERROR] Labels folder not found")
                        exit(1)

                    index = 0


                    chip(class_name, download_dir, label_dir,len(os.listdir(download_dir))-1, index)

                    while True:
                        if index < (len(os.listdir(download_dir)) - 2):
                           index += 1
                           chip(class_name, download_dir, label_dir,len(os.listdir(download_dir))-1, index)

Problem with classes containing a space

When the class name contains a space, the download directory also contains a space.
So the "aws s3 cp" command fails because it is unable to parse the path.

Sample: class_name=Vehicle registration plate
[INFO] Downloading Vehicle registration plate.
----------Vehicle registration plate----------
[INFO] Downloading all images.
[INFO] Found 3944 online images for train.
[INFO] Download of 3876 images in train.
0%| | 0/3876 [00:00<?, ?it/s]
Unknown options: registration,plate
0%| | 1/3876 [00:01<1:16:07, 1.18s/it]
Unknown options: registration,plate

I created a small merge request that fixes this problem:
#2

Error: could not connect to the endpoint URL

This toolkit looks very useful for extracting parts of Open Images!
I work on windows 7 (sorry), python 3.6.
When i try to launch the basic command line you provide:
python main.py downloader --classes Apple Orange --type_csv validation

First I get a bunch of lines saying "File association not found for extension .py", and in between,
for each image it tries to fetch, "fatal error: could not connect to the endpoint URL: https://open-images....".
At the end of the process, of course, I get empty folders where the pictures should be.

I don't know about the file association message, but the second error seems to be caused by my proxy. Is there an option to pass the proxy on the command line? That would be very useful, like what we can see in other dataset downloaders: --proxy "http:XXXXXXXX:port"

Or where can I hardcode the proxy information in the Python files of the toolkit?

thanks

.gitignore doesn't appear to be valid

Rather than being a valid .gitignore, the file in place appears to be a page that was copied as HTML. Replace it with a valid .gitignore, perhaps from the GitHub template provided for Python.

Getting Error While downloading multiple classes images

Thanks for the great tool to download Open Images.
I am facing the error below while downloading images for multiple classes.
Traceback (most recent call last):
  File "main.py", line 37, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "/Users/OpenImages/OID/OIDv4_ToolKit/modules/bounding_boxes.py", line 106, in bounding_boxes_images
    class_dict[class_name] = df_classes.loc[df_classes[1] == class_name].values[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

I tried two methods: listing all classes as a command-line argument, and passing a text file containing the classes as a command-line argument. But I face the same error either way.

I also searched previous issues from users who faced similar problems and tried the suggested methods, but no luck:
Issue 37
Issue 13

Commands I ran are:
1)python3 main.py downloader --classes Car Person Bicycle Taxi Truck Building Traffic_light Tree Traffic_sign Stop_sign Billboard Missle Motorcycle Van Tire Airplane Wheel Tank Stree_light Submarine --type_csv all --multiclasses 1 --limit 100

2)python3 main.py downloader --classes classes_custom.txt --type_csv all --multiclasses 1 --limit 100

The contents of classes_custom.txt are:
Car
Person
Bicycle
Taxi
Truck
Building
Traffic light
Tree
Traffic sign
Stop sign
Billboard
Missle
Motorcycle
Van
Tire
Airplane
Wheel
Tank
Street light
Submarine

Unable to download from OIDv5

Hi!
With the new version of OID it is impossible to download the images; could this memory error be due to that incompatibility?

WIN10
PYTHON 3.7.4

[INFO] | Downloading Handgun.
Traceback (most recent call last):
  File "main.py", line 37, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "C:\Users\aless\Desktop\YOLOv3_GUN\Tool\OIDv4_ToolKit-master\modules\bounding_boxes.py", line 60, in bounding_boxes_images
    df_val = TTV(csv_dir, name_file, args.yes)
  File "C:\Users\aless\Desktop\YOLOv3_GUN\Tool\OIDv4_ToolKit-master\modules\csv_downloader.py", line 21, in TTV
    df_val = pd.read_csv(CSV)
  File "C:\Users\aless\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\aless\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "C:\Users\aless\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "C:\Users\aless\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 2165, in pandas._libs.parsers._concatenate_chunks
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate array with shape (14610229,) and data type object

Can't download grape train set

I am able to download the grape test and validation sets using the command below.

python main.py downloader --classes Grape --type_csv validation
python main.py downloader --classes Grape --type_csv test

However, when I try to download the train set, the program hangs for a few seconds and then fails with a memory error in pandas.
python main.py downloader --classes Grape --type_csv train
Note: I'm running python 3 on windows

Any help is appreciated, thanks.

fatal error: An error occurred (404) when calling the HeadObject

Following the guide in the ReadMe, after:

python3 main.py downloader_ill --sub m --classes Orange --type_csv train --limit 30

I get the error:

fatal error: An error occurred (404) when calling the HeadObject operation: Key "train/0d72ff3e2601d71c.jpg" does not exist

Couldn't download images

Since OID has migrated from OIDv4 to OIDv5, the aws download is not working and the images are not getting downloaded. Please migrate this toolkit from OIDv4 to OIDv5 as soon as possible.

csv_downloader: urlretrieve returns 403 Forbidden when calling save().

I've added a print(FILE_URL) just to check the URL we're trying to retrieve. At first I suspected that the missing User-Agent header might be the problem, but adding it didn't make any difference:

    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent','Mozilla/5.0')]
    urllib.request.install_opener(opener)

    urllib.request.urlretrieve(url, filename, reporthook)

Here's the error output when trying to download the missing CSV.

[DOWNLOAD] Do you want to download the missing file? [Y/n] y
https://storage.googleapis.com/openimages/2018_04/test\test-annotations-bbox.csv
Traceback (most recent call last):
  File "main.py", line 97, in <module>
    df_val = TTV(csv_dir, name_file)
  File "C:\wa\OIDv4_ToolKit\modules\csv_downloader.py", line 18, in TTV
    error_csv(name_file, csv_dir)
  File "C:\wa\OIDv4_ToolKit\modules\csv_downloader.py", line 43, in error_csv
    save(FILE_URL, FILE_PATH)
  File "C:\wa\OIDv4_ToolKit\modules\csv_downloader.py", line 57, in save
    urllib.request.urlretrieve(url, filename, reporthook)
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "c:\users\march\appdata\local\programs\python\python37\Lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

OIDv5?

Hello!

Any plans to release a version for OIDv5 anytime soon?

Add option to default prompts to "Yes"

First of all thanks for making this application available, it's quite helpful in my work creating datasets for inputs when training object detection models.

I would like to call this application from within a script. This is a bit tricky at the moment because I get various prompts asking me if I want to download missing files. This doesn't always occur (I'm not sure why) but when it does it requires keyboard input (I always choose yes). The prompt I'm seeing may be coming from here.

I would like to run this module's main.py script in a non-interactive mode where it always downloads any missing files without confirmation from the user. I don't see an option to disable confirmation prompts.

If someone can advise as to where to modify this code to allow for this then I am happy to make the changes myself and submit a PR once I've verified that it's working. My suggestion is to add a command line option such as --yes to disable confirmation, such as what's available on the command line for conda (as an example).

Thanks in advance for any assistance with this issue.

wrong txt format for classes with underscore

python main.py downloader --classes Human_face --type_csv 'validation' --multiclasses 1

The txt label files look like the following:
Human face 272.903578 214.58227200000002 448.06849 435.37612800000005

According to the documentation there should be an underscore between "Human" and "face".

However, in order to accomodate a more intuitive representation and give the maximum flexibility, every .txt annotation is made like:

name_of_the_class left top right bottom

This also causes the OID_to_yolo_gist to fail

What format are the labels in?

What format are the labels in?
I'm seeing this:
Man 141.44 149.02449 636.16 684.36021
Man 414.08 117.68437 807.68 684.36021
Man 629.76 449.63126 1023.36 684.36021
Is it class x1 y1 x2 y2,
or what...?

Image level download for positive labels only

Hi guys,
Love this library, it's super helpful and I've been using it to download images for my research project. I was downloading image-level label images related to alcohol and noticed that an image will get downloaded into the directory corresponding to a label even if the label is negative for that image, i.e. an image where Beer=1 and Wine=0 in train-annotations-human-imagelabels.csv will get downloaded into both the Beer and Wine directories. Is there a way to download images only into directories that correspond to positive labels?

Typo in README file in command line

Hi Vittorio. Thanks for this cool tool, very helpful. Just wanted to point out that you have a typo in your README file where you give the command to download or visualize. Instead of the command downloader we should use download. The same goes for visualizer.

I made a YOLO annotations script

here
It creates .txt YOLO-format annotations from the downloaded images,
compatible with other visualizers, e.g. labelImg, Boobs.

*It now reads the .csv in small chunks so it won't crash my potato PC, and re-arranges the if conditions to improve speed.

IndexError: index 0 is out of bounds for axis 0 with size 0

Hi, thanks for the work!

I am facing the following issue:

python main.py downloader --classes backpack --type_csv all

[INFO] Downloading backpack.
Traceback (most recent call last):
File "main.py", line 77, in
class_code = df_classes[df_classes[1] == class_name].values[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

Any help?

Thanks

[ERROR] | Missing the class-descriptions-boxable.csv file.

Hi,

I want to download Man and Woman data using this line:
python main.py downloader --classes Man Woman --multiclasses 1 --type_csv all

I get the error Missing class-description-boxable.csv and train-annotations-bbox.csv.

I assume train-annotations-bbox.csv is an important file and thus would be required. Any solution to this?

Fatal error using downloader_ill

I will check this issue but it's not due to us; it seems that the image 0d72ff3e2601d71c.jpg is present in the csv file but it's not on the OIDv4 server.

Originally posted by @keldrom in #30 (comment)

Hello, I would like to know if this is solved?

OSError: [WinError 6] The handle is invalid

Command: python main.py downloader --classes Shirt --type_csv all

StackTrace:

[DOWNLOAD] | File train-annotations-bbox.csv downloaded into OID\csv_folder\train-annotations-bbox.csv.
Traceback (most recent call last):
  File "D:\Projects\Product Tagging\tensorflow1\addons\OIDv4_ToolKit\modules\downloader.py", line 25, in download
    columns, rows = os.get_terminal_size(0)
OSError: [WinError 6] The handle is invalid

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 36, in <module>
    bounding_boxes_images(args, DEFAULT_OID_DIR)
  File "D:\Projects\Product Tagging\tensorflow1\addons\OIDv4_ToolKit\modules\bounding_boxes.py", line 87, in bounding_boxes_images
    download(args, df_val, folder[i], dataset_dir, class_name, class_code)
  File "D:\Projects\Product Tagging\tensorflow1\addons\OIDv4_ToolKit\modules\downloader.py", line 27, in download
    columns, rows = os.get_terminal_size(1)
OSError: [WinError 6] The handle is invalid

Register and upload to PyPI

I'd like to get the package installed into my Python virtual environment from PyPI via something along the lines of pip install oid_toolkit.

I may be able to cook this up myself and submit a PR once it's done, but if the developers in charge can give me any guidance on making that happen then please advise.

BTW (in case this helps) when I've done this before for projects of my own I've followed the relevant Real Python tutorial:

  1. Get the setup.py file in shape
  2. $ rm -rf dist
  3. $ python setup.py sdist bdist_wheel
  4. $ twine check dist/*
  5. $ twine upload --repository-url https://test.pypi.org/legacy/ dist/*
  6. $ twine upload dist/*
