
idt's Introduction

Hi there 👋

I'm Deliton Junior, a full-stack web and app developer currently living in Portugal. Welcome to my GitHub profile! I like to build cool stuff and solve real-world problems, and I hope that someday I'll be able to do something meaningful and change the world for the better. I'm also super excited about the upcoming human missions to Mars 🚀


Quick facts

  • 📖 Currently learning DL, MLOps, Go, and data science algorithms
  • 🌟 Fields I like the most: web 🖥, mobile 📱, and machine learning 🤖
  • ⛳ I'm always excited to learn new things
  • 🎮 My first interaction with programming happened when I was 10 years old, developing mods for my private Tibia server (Otserver)
  • 🦖 I strongly support the open-source movement

Recent personal projects

Some tools I use

Git, React, React Native, C, JavaScript, TypeScript, HTML5, CSS3, GraphQL, Apollo, Docker, Heroku, Vercel, Netlify, Node.js, Express, MongoDB, PostgreSQL, Go, Java, Python, Django, PyTorch

Find me on

Email Instagram Facebook LinkedIn Medium

Deliton's GitHub stats


idt's Issues

ReadTimeout Exception is not handled when the main search engine request times out and downloading stops for duckgo engine

Describe the bug
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='duckduckgo.com', port=443): Read timed out. (read timeout=3.0) on line 26 in idt/duckgo.py.

def search(self):
    URL = 'https://duckduckgo.com/'
    PARAMS = {'q': self.data}
    HEADERS = {
        'authority': 'duckduckgo.com',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'sec-fetch-dest': 'empty',
        'x-requested-with': 'XMLHttpRequest',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'referer': 'https://duckduckgo.com/',
        'accept-language': 'en-US,en;q=0.9',
    }

    res = requests.post(URL, data=PARAMS, timeout=3.000)  # exception occurs here once the timeout is exhausted
    search_object = re.search(r'vqd=([\d-]+)\&', res.text, re.M | re.I)

    if not search_object:
        return -1

To Reproduce
Reproducing with the valid URL (URL = 'https://duckduckgo.com/') may take a while, but an invalid URL triggers it immediately.
Steps to reproduce the behavior:

  1. Change line 26 in idt/duckgo.py to URL = 'https://duckduckgozzzzzzz1238971873.com/'
  2. The exception is not handled and downloading stops, instead of retrying a set number of times

Expected behavior
The requests.exceptions.ReadTimeout should be handled, and the query should retry the request a set number of times before moving on to the next keyword or class.

Desktop (please complete the following information):

  • OS: Mac OSX 10.14.6

Currently, we catch the error like this, starting from line 26 in duckgo.py:

cur_req_num = 0
max_req_num = 500
while True:
    try:
        res = requests.post(URL, data=PARAMS, timeout=3.000)
        search_object = re.search(r'vqd=([\d-]+)\&', res.text, re.M | re.I)
        if not search_object:
            cur_req_num += 1
            print(f"Attempt {cur_req_num}\nRequest failed for {URL}. Retrying!")
            if cur_req_num >= max_req_num:
                print(f"Max requests ({max_req_num}) to {URL} reached. Moving to the next keyword, if any.")
                return -1
            continue
        break
    except Exception as e:
        cur_req_num += 1
        print(f"Attempt {cur_req_num}\nException {e} occurred for {URL}. Retrying!")
        if cur_req_num >= max_req_num:
            print(f"Max requests ({max_req_num}) to {URL} reached. Moving to the next keyword, if any.")
            return -1

There should be a better way to do this.
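One alternative to the manual while-loop is to let urllib3's Retry machinery handle transient failures. This is only a sketch, not idt's actual code: the retry counts and backoff values are illustrative, and `make_session` is a hypothetical helper.

```python
# Sketch: bounded retries via urllib3's Retry instead of a manual retry loop.
# `make_session` and its parameters are illustrative, not part of idt.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # `allowed_methods` needs urllib3 >= 1.26

def make_session(max_retries=3, backoff=0.5):
    retry = Retry(
        total=max_retries,
        backoff_factor=backoff,  # waits 0.5s, 1s, 2s between attempts
        status_forcelist=(500, 502, 503, 504),
        allowed_methods=frozenset(["GET", "POST"]),
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

def search(session, url, params, timeout=3.0):
    try:
        res = session.post(url, data=params, timeout=timeout)
    except requests.exceptions.RequestException:
        return -1  # retries exhausted; caller moves to the next keyword/class
    return res.text
```

With this shape, read timeouts are retried transparently by the adapter, and the surrounding code only has to decide what to do once all retries are exhausted.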

Current issues, bugs and things to refine

  • The code needs to be refactored to better fit good code patterns
  • The DeviantArt scraper only downloads thumbnails
  • The duckgo progress bar is off because of corrupted/unsupported image downloads. We need a way to display a real-time progress bar that accounts for unsupported images.
  • Except for duckgo, the scrapers don't download the exact amount specified in the yaml file, because corrupt/unsupported image files are counted during the initial phases. We need a way to keep searching and downloading until the number of downloaded files matches what was specified in dataset.yaml

pip install idt not working on Ubuntu 24.04 LTS

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to terminal
  2. sudo pip install idt --break-system-packages

Expected behavior
The package should be downloaded.

Screenshots
Screenshot from 2024-05-30 10-04-28
Screenshot from 2024-05-30 10-04-48
Screenshot from 2024-05-30 10-04-58

Desktop (please complete the following information):

  • OS: Ubuntu 24.04 LTS
  • Browser: Firefox 125.0.2

Additional context
Had to use --break-system-packages because without it, I kept getting 'error: externally-managed-environment'.
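A possible workaround (not a fix for the package itself): PEP 668 marks Ubuntu 24.04's system Python as externally managed, so installing into a virtual environment avoids both sudo and --break-system-packages. The venv path below is just an example.

```shell
# Create an isolated environment and install idt into it
python3 -m venv ~/.venvs/idt
. ~/.venvs/idt/bin/activate
pip install idt
```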

Discussion: Should Image augmentations be implemented?

data-augmentation

Definition

Data augmentations are techniques used to increase the amount of data by adding slightly modified copies of already existing data, or newly created synthetic data derived from existing data. Augmentation helps reduce overfitting when training a machine learning model, and it is closely related to oversampling in data analysis.

Feature

  • With that in mind, should IDT also include an image augmentation feature that adds random variations such as crops, flips, and noise?
  • Why do you think you would benefit from such a feature?
  • Would you prefer doing it via the CLI or via code?

If you have any suggestions about this topic, feel free to comment in this issue.
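For concreteness, a minimal sketch of the kind of augmentations under discussion (flip, crop, noise), using Pillow and NumPy. The probabilities and magnitudes are illustrative, not a proposed IDT API.

```python
# Illustrative augmentation pipeline: random flip, random crop, Gaussian noise.
import random
import numpy as np
from PIL import Image, ImageOps

def augment(img, crop_frac=0.9, noise_std=10.0):
    # random horizontal flip
    if random.random() < 0.5:
        img = ImageOps.mirror(img)
    # random crop to crop_frac of the original size
    w, h = img.size
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    img = img.crop((left, top, left + cw, top + ch))
    # additive Gaussian pixel noise, clipped back to the valid range
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```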

Thanks for your time,
IDT Team

How the tool works and use cases

IDB is a tool developed to make creating image datasets easy and fast. Using the CLI, the user can create a whole image dataset automatically.

CLI commands:

  • idb version: returns the tool's current version
  • idb authors: returns the names of the creators
  • idb run: performs a simple data collection. Parameters: --i corresponds to the input name, --s corresponds to the desired amount of images to be collected, --v is a flag that activates verbose output
  • idb init: initializes the CLI in order to set up dataset parameters
  • idb build: creates the dataset. idb init must be run first

Use cases:

  • User can see the current tool version
  • User can see who created and contributed to the project
  • User can do a quick image collection of a single class
  • User can set up multiple classes to create a complete dataset
  • User can see the progress of data collection

Implement Baidu Search Engine

baidu

Definition

Baidu runs the second-largest search engine in the world and held a 76.05% market share in China's search engine market. To better serve our Chinese users, a Baidu search engine would be a great addition.

Contributing

If you're interested, you can try to implement this feature and then submit a pull request. However, since we've been trying to keep the project as light as possible, we don't encourage the use of tools like Selenium, Scrapy and Beautiful Soup.

As much as I love IDT, I'm having a hard time pulling more than 500 images per keyword

Is your feature request related to a problem? Please describe.
I often need 5,000+ images for a good deep learning dataset. I feel frustrated that I can only pull around 500 images per keyword. For instance, "abstract painting" and "abstract art painting" only got me 500 results each with the duckduckgo engine.

Describe the solution you'd like
Perhaps it would be better to be more flexible about the size and then allow resizing to whatever size is necessary. For instance, when we specify the image size as 512 (or whatever), maybe there could be an "any" parameter that takes in all non-duplicate images and resizes them using the other parameters. That might add more results 👍
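The "any" idea could be sketched like this: accept every non-duplicate image regardless of its original dimensions, then normalize afterwards. `side=512` mirrors the example above; the helper name is hypothetical, not part of IDT.

```python
# Resize so the shorter edge equals `side`, then center-crop to side x side.
from PIL import Image

def normalize_size(img, side=512):
    w, h = img.size
    scale = side / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    return img.crop((left, top, left + side, top + side))
```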

I love the "deleting duplicates" feature by the way.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

TODO

  • Implement Google Images Search Scraper

  • Implement Flickr Scraper

  • Implement DeviantArt Search

  • Refactor code to better code standards (factories, constants, app states, etc) (PARTIALLY DONE)

  • Fix bugs related to the amount of downloaded files being less than what was provided in the yaml file.

  • Create a yaml parser outside main.py.

  • Make the search engine factory instantiate the yaml parser and then select the corresponding search engine

  • Implement a way to generate a csv file for every dataset, containing its labels, amount of images in each class, image sources, score, etc.

  • Implement a program that splits a dataset's classes between TRAIN and VALID, asking the user for the desired proportion

  • Implement a program that executes common data augmentations/albumentations in order to increase the size of the whole dataset.

  • Implement a functional logger that allows the user to debug and see what's happening during the scraping process.

  • Consider the implementation of Selenium tools, in order to allow advanced scraping scripts and spiders. Keep in mind that this slows the process down and requires browser tooling in order to run.
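The TRAIN/VALID split item above could start from something like the sketch below: shuffle each class's files and split by a user-supplied proportion. Only the path logic is shown; actually moving files into TRAIN/ and VALID/ folders (and the `split_class` name itself) is illustrative.

```python
# Deterministic train/valid split of one class's file list.
import random

def split_class(files, valid_ratio=0.2, seed=42):
    files = list(files)
    random.Random(seed).shuffle(files)  # reproducible for a fixed seed
    n_valid = int(len(files) * valid_ratio)
    return files[n_valid:], files[:n_valid]  # (train, valid)
```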
