
imagenet-downloader's Introduction

ImageNet Downloader

Download images from ImageNet image URLs.

Preparation

Install Python Packages:

pip install -r requirements.txt

Download the image URL list from http://image-net.org/download-imageurls. For example:

wget http://image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz

NOTE: This URL is currently a dead link. See also #21.

Download the category list from http://image-net.org/archive/words.txt:

wget http://image-net.org/archive/words.txt

Select the categories you want and create a list file with one category per line. For example, for ILSVRC2012, build the list from http://image-net.org/challenges/LSVRC/2012/browse-synsets. To use the ILSVRC2012 list that we created, run:

wget https://git.io/vdUng -O urllist.txt
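
The category selection step can be sketched in Python as follows. This is a minimal illustration, not the code in gen_urls.py: it assumes words.txt is tab-separated with a WordNet ID in the first column and a description in the second, and that the category list holds one WordNet ID per line. The function names `load_words` and `select_categories` are hypothetical.

```python
import csv
import io


def load_words(lines):
    """Map WordNet ID -> description from tab-separated words.txt lines."""
    # QUOTE_NONE keeps any quote characters in the descriptions untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def select_categories(words, wanted_ids):
    """Split wanted_ids into (found, missing), keeping the requested order."""
    found = [(wnid, words[wnid]) for wnid in wanted_ids if wnid in words]
    missing = [wnid for wnid in wanted_ids if wnid not in words]
    return found, missing


if __name__ == "__main__":
    # Tiny in-memory stand-in for words.txt (assumed layout: ID<TAB>description).
    sample_words = io.StringIO(
        "n01440764\ttench, Tinca tinca\n"
        "n01443537\tgoldfish, Carassius auratus\n"
    )
    # The second ID is deliberately made up to show the "missing" path.
    found, missing = select_categories(
        load_words(sample_words), ["n01443537", "n99999999"]
    )
    print("found:", found)
    print("missing:", missing)
```

Reporting the missing IDs separately makes it easier to spot a category list that does not line up with words.txt before any URLs are generated.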

Usage

  1. Generate Download URL List

    python gen_urls.py

  2. Download Images from URL List

    cat <generated urllist> | xargs -n 2 ./download.sh
    • Downloading about 1.3 million images takes several hours, and they total about 100 GB.
    • We recommend running it in the background, e.g. nohup cat list/urllist.txt | xargs -n 2 ./download.sh > download.log 2> error.log &

  3. (Optional) Generate Image File List

    python gen_list.py


imagenet-downloader's Issues

error generating urls

ubuntu@ip-10-0-175-243:/imagenet-downloader$ python3 gen_urls.py
Traceback (most recent call last):
  File "gen_urls.py", line 73, in <module>
    main()
  File "gen_urls.py", line 50, in main
    df = get_categories(args.words, args.categories)
  File "gen_urls.py", line 12, in get_categories
    all_list = pd.read_csv(all_list_file, header=None, delimiter='\t')
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 705, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 445, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 814, in __init__
    self._make_engine(self.engine)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1684, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 391, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'list/words.txt' does not exist
ubuntu@ip-10-0-175-243:/imagenet-downloader$ ls
download.sh gen_list.py gen_urls.py README.md requirements.txt resize_images.py
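
The traceback shows gen_urls.py trying to read list/words.txt, while the directory listing shows no list directory, so one likely reading is that the files from the Preparation steps were not placed where the script looks for them. A small pre-flight check along these lines reports missing inputs before pandas raises. The helper name `missing_inputs` is hypothetical, and only list/words.txt is taken from the traceback; any other required paths would need to be added by hand.

```python
import sys
from pathlib import Path


def missing_inputs(paths):
    """Return the given paths (as strings) that are not existing regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


if __name__ == "__main__":
    # list/words.txt is the path named in the traceback; extend this list
    # with whatever other input files your run expects (assumption).
    required = ["list/words.txt"]
    missing = missing_inputs(required)
    if missing:
        print("Missing input files:", ", ".join(missing), file=sys.stderr)
        sys.exit(1)
    print("All input files found.")
```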

Rate limiting with the suggested download script

Hi, this is very helpful now that the ImageNet website seems to be down. It's been several months and they still haven't granted me direct download access, so URLs are the way to go.

While the suggested download script works well, I found that opening so many wget requests at once crashed my home internet connection. The receive rate jumps to the connection's cap, and I get through about 3 GB before the connection dies. Instead, I use GNU parallel to limit the number of concurrent downloads. The changes are: edit download.sh so that wget stays in the foreground, strip the double quotes from list/urllist.txt, and then run the downloads through parallel:

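download.sh:

#!/bin/sh

if [ $# -ne 2 ]; then
  exit 1
fi

# original line (-b sends every wget to the background)
# wget $2 -O $1 -T 1 -t 5 -nc -b -a wget.log

# new line (wget stays in the foreground, so parallel can cap concurrency)
wget "$2" -O "$1" -T 1 -t 5 -nc

Then strip the quotes and run the downloads with at most 12 jobs at a time:

sed 's/"//g' list/urllist.txt > list/urllist_noquote.txt
cat list/urllist_noquote.txt | parallel --jobs 12 --colsep ' ' ./download.sh {1} {2}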

It's slower, yes, but for people on a limited connection this way lets you keep working during the download :)
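
The same idea, a fixed cap on concurrent downloads, can be sketched in Python with a thread pool. This is an illustration under assumptions rather than a drop-in replacement for download.sh: it assumes each urllist line is "output_path url" (the argument order download.sh uses), and the names `parse_urllist`, `fetch_to_file`, and `download_all` are hypothetical. Unlike wget -t 5 -nc, it does not retry and does not skip files that already exist.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def parse_urllist(lines):
    """Yield (output_path, url) pairs from 'path url' lines, dropping double quotes."""
    for line in lines:
        parts = line.replace('"', "").split()
        if len(parts) == 2:  # silently skip blank or malformed lines
            yield parts[0], parts[1]


def fetch_to_file(path, url, timeout=1.0):
    """Download url to path; return True on success, False on any fetch error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except (OSError, ValueError, http.client.HTTPException):
        return False
    # Write only after a complete read, so failures leave no empty file behind.
    with open(path, "wb") as out:
        out.write(data)
    return True


def download_all(pairs, fetch=fetch_to_file, jobs=12):
    """Call fetch(path, url) for every pair with at most `jobs` running at once."""
    pairs = list(pairs)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda pair: fetch(*pair), pairs))
    return {path: ok for (path, _), ok in zip(pairs, results)}


if __name__ == "__main__":
    with open("list/urllist.txt") as f:
        outcome = download_all(parse_urllist(f), jobs=12)
    failed = [path for path, ok in outcome.items() if not ok]
    print(f"{len(outcome) - len(failed)} downloaded, {len(failed)} failed")
```

Passing `fetch` as a parameter keeps the concurrency logic separate from the network code, which makes it possible to try the pool on a stub before pointing it at real URLs.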

999 classes?

After running gen_urls, the clist.csv file contains class ids from 0 to 998. That is 999 classes, so one class seems to be missing from the expected 1000. Is this normal behaviour?
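
One way to narrow this down is to compare the categories that were requested with the ones that ended up in the output. The sketch below is a generic diagnostic, not part of the repository: the column layout of clist.csv is not described here, so the hypothetical helper `diagnose` takes plain lists of category identifiers and leaves reading the files to the caller. It also reports duplicates in the requested list, since a repeated entry is one possible (unconfirmed) way to end up one class short.

```python
from collections import Counter


def diagnose(requested, generated):
    """Return (missing, duplicated) category identifiers.

    missing:    requested identifiers that never appear in generated
    duplicated: identifiers listed more than once in requested
    """
    generated = set(generated)
    counts = Counter(requested)  # keeps first-seen order of identifiers
    missing = [c for c in counts if c not in generated]
    duplicated = [c for c, n in counts.items() if n > 1]
    return missing, duplicated


if __name__ == "__main__":
    # Made-up identifiers purely for illustration.
    missing, duplicated = diagnose(["a", "b", "b", "c"], ["a", "b"])
    print("missing:", missing)
    print("duplicated:", duplicated)
```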

Thanks

The dead links

Is it possible to share these files somewhere, such as Google Drive?

Without these files, this repo is useless.

URLs on the ImageNet page are not working anymore?

When I try:

wget http://image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz

I get

--2020-07-01 08:27:14--  http://image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz
Resolving image-net.org (image-net.org)... 171.64.68.16
Connecting to image-net.org (image-net.org)|171.64.68.16|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-07-01 08:27:14 ERROR 404: Not Found.

Also, when I go to the page http://image-net.org/download-imageurls and try clicking on the links, it says that the URL is not valid.

Not saying that there is an error in this script, but just that it looks like it cannot be used anymore?

Image NoneType

Several images are still visible but return None (NoneType) when read.
Example: 2711
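
The issue does not say which library was used to read the files; cv2.imread, for example, returns None instead of raising when it cannot decode a file. One possible cause, not confirmed here, is that a dead URL returned something other than image data, which was then saved under an image filename. A quick way to screen for such files is to check the leading bytes against well-known image signatures, as in this sketch (the name `sniff_image_type` is hypothetical). It only inspects the header, so a truncated or otherwise corrupt image can still pass.

```python
# Leading-byte signatures for a few common image formats.
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_type(path):
    """Return 'jpeg', 'png', 'gif', or None based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature above is 8 bytes
    for name, prefixes in SIGNATURES.items():
        if head.startswith(prefixes):
            return name
    return None


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py file1.jpg file2.jpg ...
    for path in sys.argv[1:]:
        kind = sniff_image_type(path)
        print(f"{path}: {kind if kind else 'not a recognized image'}")
```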
