
idt's Introduction

Hi there 👋

I'm Deliton Junior, a full-stack web and app developer currently living in Portugal. Welcome to my GitHub profile! I like to build cool stuff and solve real-world problems, and I hope that someday I'll be able to do something meaningful and change the world for the better. I'm also super excited about the upcoming human missions to Mars 🚀


Quick facts

  • 📖 Currently learning DL, MLOps, Go, and data science algorithms
  • 🌟 Fields I like the most: web 🖥, mobile 📱, and machine learning 🤖
  • ⛳ I'm always excited to learn new things
  • 🎮 My first interaction with programming happened when I was 10 years old, developing mods for my private Tibia server (Otserver)
  • 🦖 I strongly support the open-source movement

Recent personal projects

Some tools I use

Git, React, React Native, C, JavaScript, TypeScript, HTML5, CSS3, GraphQL, Apollo, Docker, Heroku, Vercel, Netlify, Node.js, Express, MongoDB, PostgreSQL, Go, Java, Python, Django, PyTorch

Find me on

Email Instagram Facebook LinkedIn Medium

Deliton's GitHub stats


idt's Issues

ReadTimeout Exception is not handled when the main search engine request times out and downloading stops for duckgo engine

Describe the bug
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='duckduckgo.com', port=443): Read timed out. (read timeout=3.0) on line 26 in idt/duckgo.py.

def search(self):
    URL = 'https://duckduckgo.com/'
    PARAMS = {'q': self.data}
    HEADERS = {
        'authority': 'duckduckgo.com',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'sec-fetch-dest': 'empty',
        'x-requested-with': 'XMLHttpRequest',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'referer': 'https://duckduckgo.com/',
        'accept-language': 'en-US,en;q=0.9',
    }

    res = requests.post(URL, data=PARAMS, timeout=3.000)  # exception occurs here once the timeout is exhausted
    search_object = re.search(r'vqd=([\d-]+)\&', res.text, re.M | re.I)

    if not search_object:
        return -1

To Reproduce
Reproducing with the valid URL (URL = 'https://duckduckgo.com/') may take a while, but an invalid URL triggers it immediately.
Steps to reproduce the behavior:

  1. Change line 26 in idt/duckgo.py to URL = 'https://duckduckgozzzzzzz1238971873.com/'
  2. The exception is not handled and downloading stops, instead of retrying a set number of times

Expected behavior
The requests.exceptions.ReadTimeout should be handled, and the query should retry the request a set number of times before moving on to the next keyword or class.

Desktop (please complete the following information):

  • OS: Mac OSX 10.14.6

Currently, we catch the error like this, starting from line 26 in duckgo.py:

cur_req_num = 0
max_req_num = 500
while True:
    try:
        res = requests.post(URL, data=PARAMS, timeout=3.000)
        search_object = re.search(r'vqd=([\d-]+)\&', res.text, re.M | re.I)
        if not search_object:
            cur_req_num += 1
            print(f"Attempt {cur_req_num}\nRequest failed for {URL}. Retrying!")
            if cur_req_num >= max_req_num:
                print(f"Max requests ({max_req_num}) to {URL} reached. Moving to the next keyword, if any.")
                return -1
            continue
        break
    except Exception as e:
        cur_req_num += 1
        print(f"Attempt {cur_req_num}\nException {e} occurred for {URL}. Retrying!")
        if cur_req_num >= max_req_num:
            print(f"Max requests ({max_req_num}) to {URL} reached. Moving to the next keyword, if any.")
            return -1

There should be a better way to do this.
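One alternative to the manual while-loop is to let urllib3's Retry machinery handle transient failures. This is only a sketch, not idt's actual code: the retry counts and backoff values are illustrative, and `make_session` is a hypothetical helper.

```python
# Sketch: bounded retries via urllib3's Retry instead of a manual retry loop.
# `make_session` and its parameters are illustrative, not part of idt.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # `allowed_methods` needs urllib3 >= 1.26

def make_session(max_retries=3, backoff=0.5):
    retry = Retry(
        total=max_retries,
        backoff_factor=backoff,  # waits 0.5s, 1s, 2s between attempts
        status_forcelist=(500, 502, 503, 504),
        allowed_methods=frozenset(["GET", "POST"]),
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

def search(session, url, params, timeout=3.0):
    try:
        res = session.post(url, data=params, timeout=timeout)
    except requests.exceptions.RequestException:
        return -1  # retries exhausted; caller moves to the next keyword/class
    return res.text
```

With this shape, read timeouts are retried transparently by the adapter, and the surrounding code only has to decide what to do once all retries are exhausted.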

Current issues, bugs and things to refine

  • The code needs to be refactored to better fit good code patterns
  • The DeviantArt scraper only downloads thumbnails
  • The duckgo progress bar is off because of corrupted/unsupported image downloads. We need a way to display a real-time progress bar that accounts for unsupported images.
  • Except for duckgo, the scrapers don't download the exact amount specified in the yaml file, because corrupt/unsupported image files are counted during the initial phases. We need a way to keep searching and downloading until the number of downloaded files matches what was specified in dataset.yaml

pip install idt not working on Ubuntu 24.04 LTS

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to terminal
  2. sudo pip install idt --break-system-packages

Expected behavior
The package should be downloaded.

Screenshots
Screenshot from 2024-05-30 10-04-28
Screenshot from 2024-05-30 10-04-48
Screenshot from 2024-05-30 10-04-58

Desktop (please complete the following information):

  • OS: Ubuntu 24.04 LTS
  • Browser: Firefox 125.0.2

Additional context
Had to use --break-system-packages because without it, I kept getting 'error: externally-managed-environment'.
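A possible workaround (not a fix for the package itself): PEP 668 marks Ubuntu 24.04's system Python as externally managed, so installing into a virtual environment avoids both sudo and --break-system-packages. The venv path below is just an example.

```shell
# Create an isolated environment and install idt into it
python3 -m venv ~/.venvs/idt
. ~/.venvs/idt/bin/activate
pip install idt
```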

Discussion: Should Image augmentations be implemented?

data-augmentation

Definition

Data augmentations are techniques used to increase the amount of data by adding slightly modified copies of already existing data, or newly created synthetic data derived from existing data. Augmentation helps reduce overfitting when training a machine learning model, and it is closely related to oversampling in data analysis.

Feature

  • With that in mind, should IDT also include an image augmentation feature that adds random variations such as crops, flips, and noise?
  • Why do you think you would benefit from such a feature?
  • Would you prefer doing it via the CLI or via code?

If you have any suggestions about this topic, feel free to comment in this issue.
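For concreteness, a minimal sketch of the kind of augmentations under discussion (flip, crop, noise), using Pillow and NumPy. The probabilities and magnitudes are illustrative, not a proposed IDT API.

```python
# Illustrative augmentation pipeline: random flip, random crop, Gaussian noise.
import random
import numpy as np
from PIL import Image, ImageOps

def augment(img, crop_frac=0.9, noise_std=10.0):
    # random horizontal flip
    if random.random() < 0.5:
        img = ImageOps.mirror(img)
    # random crop to crop_frac of the original size
    w, h = img.size
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    img = img.crop((left, top, left + cw, top + ch))
    # additive Gaussian pixel noise, clipped back to the valid range
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```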

Thanks for your time,
IDT Team

How the tool works and use cases

IDB is a tool developed to make creating image datasets easy and fast. Using the CLI, the user can create a whole image dataset automatically.

CLI commands:

  • idb version: returns the tool's current version
  • idb authors: returns the names of the creators
  • idb run: performs a simple data collection. Parameters: --i corresponds to the input name, --s corresponds to the desired amount of images to be collected, --v is a flag that activates verbose output
  • idb init: initializes the CLI in order to set up dataset parameters
  • idb build: creates the dataset. idb init must be run first

Use cases:

  • User can see the current tool version
  • User can see who created and contributed to the project
  • User can do a quick image collection of a single class
  • User can set up multiple classes to create a complete dataset
  • User can see the progress of data collection

Implement Baidu Search Engine

baidu

Definition

Baidu runs the second-largest search engine in the world and held a 76.05% market share in China's search engine market. To better serve our Chinese users, a Baidu search engine would be a great addition.

Contributing

If you're interested, you can try to implement this feature and then submit a pull request. However, since we've been trying to keep the project as light as possible, we don't encourage the use of tools like Selenium, Scrapy and Beautiful Soup.

As much as I love IDT, I'm having a hard time pulling more than 500 images per keyword

Is your feature request related to a problem? Please describe.
I often need 5,000+ images for a good deep learning dataset. I feel frustrated that I can only pull around 500 images per keyword. For instance, "abstract painting" and "abstract art painting" only got me 500 results each with the duckduckgo engine.

Describe the solution you'd like
Perhaps it would be better to be more flexible about the size and then allow resizing to whatever size is necessary. For instance, when we specify the image size as 512 (or whatever), maybe there could be an "any" parameter that takes in all non-duplicate images and resizes them using the other parameters. That might add more results 👍
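The "any" idea could be sketched like this: accept every non-duplicate image regardless of its original dimensions, then normalize afterwards. `side=512` mirrors the example above; the helper name is hypothetical, not part of IDT.

```python
# Resize so the shorter edge equals `side`, then center-crop to side x side.
from PIL import Image

def normalize_size(img, side=512):
    w, h = img.size
    scale = side / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    return img.crop((left, top, left + side, top + side))
```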

I love the "deleting duplicates" feature by the way.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

TODO

  • Implement Google Images Search Scraper

  • Implement Flickr Scraper

  • Implement DeviantArt Search

  • Refactor code to better code standards (factories, constants, app states, etc) (PARTIALLY DONE)

  • Fix bugs related to the amount of downloaded files being less than what was provided in the yaml file.

  • Create a yaml parser outside main.py.

  • Make the search engine factory instantiate the yaml parser and then select the corresponding search engine

  • Implement a way to generate a csv file for every dataset, containing its labels, amount of images in each class, image sources, score, etc.

  • Implement a program that splits a dataset's classes between TRAIN and VALID, asking the user for the desired proportion

  • Implement a program that executes common data augmentations/albumentations in order to increase the size of the whole dataset.

  • Implement a functional logger that allows the user to debug and see what's happening during the scraping process.

  • Consider the implementation of Selenium tools, in order to allow advanced scraping scripts and spiders. Keep in mind that this slows the process down and requires browser tooling in order to run.
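The TRAIN/VALID split item above could start from something like the sketch below: shuffle each class's files and split by a user-supplied proportion. Only the path logic is shown; actually moving files into TRAIN/ and VALID/ folders (and the `split_class` name itself) is illustrative.

```python
# Deterministic train/valid split of one class's file list.
import random

def split_class(files, valid_ratio=0.2, seed=42):
    files = list(files)
    random.Random(seed).shuffle(files)  # reproducible for a fixed seed
    n_valid = int(len(files) * valid_ratio)
    return files[n_valid:], files[:n_valid]  # (train, valid)
```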
