isic-archive-downloader's Introduction

Note

Kaggle now offers a competition based on the ISIC archive, and it appears to have a larger dataset than the one provided on the ISIC archive website. In addition, the discussion threads hold useful information about the data itself (e.g. about existing duplicate images).

ISIC Archive Downloader

The ISIC Archive contains over 23k images of skin lesions, labeled as 'benign' or 'malignant'.
The archive can be found here: https://www.isic-archive.com/#!/onlyHeaderTop/gallery

As far as we know, the ISIC foundation currently provides the following ways to download the archive:

  1. Download the entire archive via the direct download button on their website.
  2. Download the partitions of the archive, called 'datasets', one by one.
  3. Download the images one by one via the Girder API provided on the site.

The first option (the easiest and most comfortable one) doesn't always finish successfully.
We suspect this is due to the large file size.

The second option is reasonable if you plan to download the archive only a few times,
and the third option seems unfeasible.

If you find the options above too laborious or unavailable, this script provides a comfortable alternative.
It can download the entire ISIC archive (or parts of it).
All you have to do is run python download_archive.py

Requirements

  1. Python 3.6 or later
  2. requests: pip install requests
  3. PIL: pip install Pillow
  4. tqdm: pip install tqdm

Or you could just pip install -r requirements.txt
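
Before running the script, it can help to confirm that the interpreter you are calling meets the Python requirement. The following is only a minimal sketch: the helper name meets_minimum is hypothetical and not part of the repository, and the file deliberately avoids newer syntax so that older interpreters can run it too.

```python
import sys

MINIMUM_VERSION = (3, 6)  # taken from the requirements list above


def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if (major, minor, ...) is at least the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Report which interpreter is actually running this file.
status = "OK" if meets_minimum() else "too old for download_archive.py"
print("Python %d.%d: %s" % (sys.version_info[0], sys.version_info[1], status))
```

Run it with the same command you would use for the downloader (for example python or python3) to see which version that command resolves to.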

Instructions

  1. Download or clone the repository.
  2. Run the script: python download_archive.py

Notes

  1. By default, if you call the script in the following way:
    python <root>/.../download_archive.py
    the images will be downloaded to <root>/Data/Images
    and their descriptions will be downloaded to <root>/Data/Descriptions

  2. If you choose to download segmentations of images, note that some images have multiple segmentations made at different expertise levels. This script currently downloads one at random, which is not necessarily the one with the highest expertise.

Warnings

  1. Make sure you have enough space in the download destination. Otherwise the download will run into errors.
  2. The download might take a few hours.

Optional download abilities

  1. You can download a subset of the archive by specifying how many images you would like.
    python download_archive.py --num-images 1000
    If this option isn't present, the program will download all the available images.

  2. You can start downloading images from an offset.
    python download_archive.py --offset 100
    This is useful, for example, if you would like to add to a prior download.

  3. You can choose to download either only benign or malignant images.
    python download_archive.py --filter benign
    Note: If you would like k benign images instead of all the benign images, you could do
    python download_archive.py --num-images k --filter benign

  4. You can choose to download the segmentations of the images
    python download_archive.py -s
    and the directory they will be downloaded to.
    python download_archive.py -s --seg-dir /Data/Segmentations
    Some images have multiple segmentations, made at different skill levels.
    You can choose a preferred skill level (e.g. expert or novice).
    python download_archive.py -s --seg-level novice
    When available, the script will then download a segmentation with the preferred skill level.
    If no preference is given, the first available segmentation will be downloaded.
    Note: It has been suggested that segmentations tagged as 'novice' are sometimes more accurate than their 'expert' alternatives, so assuming that the 'expert' segmentations are always better may be incorrect.

  5. You can choose not to download the lesion images.
    python download_archive.py --no-images
    This might be useful if you would like to download only the descriptions or the segmentations.

  6. You can change the default directories the images and the descriptions will be downloaded into.
    python download_archive.py --images-dir /Data/Images --descs-dir /Data/Descriptions

  7. You can also change the default number of processes that work in parallel to download the archive.
    python download_archive.py --p 16
    If you are unsure about this option, the default will be fine.
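
To make the interaction between --offset and --num-images concrete, here is a minimal sketch of one way the two could combine. It assumes the offset is applied first and the count second; the function name select_ids is hypothetical, and the real script may order things differently (for example when --filter is also given).

```python
def select_ids(all_ids, offset=0, num_images=None):
    """Skip the first `offset` ids, then keep at most `num_images` of the rest.

    Assumed semantics for illustration only; not taken from the script itself.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if num_images is None:
        # No limit requested: take everything after the offset.
        return list(all_ids[offset:])
    return list(all_ids[offset:offset + num_images])


ids = ["id%d" % n for n in range(10)]
print(select_ids(ids, offset=2, num_images=3))
```

Under these assumptions, a second run with --offset set to the size of the first run continues where the first one stopped.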

How does it work

By looking up a few images through the API provided by the website, we found that each image is stored
at a URL of the form <prefix> <image id> <suffix>
and that its description is stored at <prefix> <image id>,
where the prefix and suffix are the same for all the images.

The website API also provides a way to request all the ids of all the images.

So the basic portion of the script is:

  1. Request the ids of all the images
  2. Build the urls by the given template
  3. Download the images and descriptions from the built urls
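
The URL-building step can be sketched as follows. The prefix and suffix values here are taken from URLs that appear in logs reported by users of this script, so treat them as an assumption that may be out of date; the function names are hypothetical and the actual download step is left out.

```python
# Assumed template values, copied from user-reported logs; they may have changed.
API_PREFIX = "https://isic-archive.com/api/v1/image/"
IMAGE_SUFFIX = "/download?contentDisposition=inline"


def description_url(image_id):
    """<prefix><image id>: where the description of an image is stored."""
    return API_PREFIX + image_id


def image_url(image_id):
    """<prefix><image id><suffix>: where the image itself is stored."""
    return API_PREFIX + image_id + IMAGE_SUFFIX


def build_urls(image_ids):
    """Return one (image URL, description URL) pair per image id."""
    return [(image_url(i), description_url(i)) for i in image_ids]


for img, desc in build_urls(["5436e3abbae478396759f0cf"]):
    print(img)
    print(desc)
```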

Note

As mentioned above, we assume that the URLs of the images and descriptions follow a certain template.
If the template ever changes (and you start getting errors, for example),
just let us know and we will update it accordingly :)
Feel free to use the issues tab for that.

Finally

We hope this script will give researchers who had similar difficulties accessing ISIC's archive easier access, and enable them to contribute further work to this field, as the ISIC foundation wishes :)

If you run into any issues, let us know in the issues section!

In addition, any contributions or improvement ideas that make the code more comfortable for users will be dearly appreciated :)

Written By

Oren Talmor & Gal Avineri

isic-archive-downloader's People

Contributors

camlloyd, erolrecep, galavineri, javism, szkocot

isic-archive-downloader's Issues

Download k samples of each class

Right now if you wanted to download k samples of each class (malignant and benign) you would have to manually download the malignants first

python download_archive.py --num-images k --filter malignant

And then in a separate directory download the benigns

python download_archive.py --num-images k --filter benign

Otherwise you'd overwrite some of the images. And because some images will have the same filenames, you have to do some preprocessing to rename them all consistently before merging them together.

It would be nice if the script was able to do this in one go.
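
Until the script supports this directly, one possible workaround is to drive it twice from a small wrapper that sends each class to its own directory, so nothing gets overwritten. This is a minimal sketch that only builds the two command lines; it assumes --images-dir and --descs-dir behave as described in the README, and the function name per_class_commands and the directory layout are hypothetical.

```python
import os
import sys


def per_class_commands(k, root="Data", script="download_archive.py"):
    """Build one download command per class, each with its own output folders."""
    commands = []
    for label in ("benign", "malignant"):
        commands.append([
            sys.executable, script,
            "--num-images", str(k),
            "--filter", label,
            "--images-dir", os.path.join(root, label, "Images"),
            "--descs-dir", os.path.join(root, label, "Descriptions"),
        ])
    return commands


# Print the commands instead of running them; to execute them, pass each
# list to subprocess.run(cmd, check=True).
for cmd in per_class_commands(250):
    print(" ".join(cmd))
```

Keeping the classes in separate directories also sidesteps the renaming step, since the class is encoded in the path rather than in the file name.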

Filter not working!!

The following command to choose between benign or malignant from the datasets does not work:

python download_archive.py --filter benign

Getting the following error:
Traceback (most recent call last):
  File "download_archive.py", line 254, in <module>
    main(sys.argv[1:])
  File "download_archive.py", line 248, in main
    seg_skill=args.seg_skill, num_processes=args.p)
  File "download_archive.py", line 36, in download_archive
    descs_dir=descs_dir)
  File "download_archive.py", line 138, in download_descriptions_and_filter
    ImgDownloader.save_description(description, descs_dir)
AttributeError: type object 'LesionImageDownloader' has no attribute 'save_description'

Selective download

Is it possible to use the code you provided to perform a selective download like only benign or only malignant images?

README.md change "num_image" to "num-images"

Under "Optional download abilities":

  • you give the command "python download_archive.py --num_images 1000"
  • but the usage says this:
  • "usage: download_archive.py [-h] [--num-images NUM_IMAGES]"

Fix the readme.md

  • please change "python download_archive.py --num_images 1000"
  • to "python download_archive.py --num-images 1000"

Choosing Datasets

Hey, do you think it's possible for you to implement a feature where we can choose what dataset we want to download from? There are many datasets within the ISIC Archive, but I just want the HAM10000. Do you think that's possible?

Thanks for your help!

Syntax error in script

Hello, I installed all the requirements, and when I try to start the script (python download_archive.py / python3.6 download_archive.py) I get:
  File "download_archive.py", line 93
    def download_descriptions(ids: list, descs_dir: str, num_processes: int) -> list:
                                 ^
SyntaxError: invalid syntax

Could you please help to find the cause?

Syntax error

This is what I get when I try to run the script:

  File "download_archive.py", line 93
    def download_descriptions(ids: list, descs_dir: str, num_processes: int) -> list:
                                 ^
SyntaxError: invalid syntax

macOS Catalina
Python 3.7.3

Download the best segmentation available for each image instead of random

Some images have multiple segmentation masks available.
As far as I've researched, they differ in their skill level.

Currently the system just chooses one of the masks, without considering the skill level.
It would be preferable to have an option to choose the highest skill level available.
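
A minimal sketch of what such a selection could look like is given below. It assumes each segmentation is available as a dict with a 'skill' key holding 'novice' or 'expert'; that record layout, the ranking table and the function name pick_segmentation are all hypothetical and not taken from the script or the ISIC API.

```python
# Assumed ordering of skill levels, lowest to highest.
SKILL_RANK = {"novice": 0, "expert": 1}


def pick_segmentation(segmentations, preferred=None):
    """Choose one segmentation record from a list of dicts.

    With `preferred`, return the first record of that skill level and fall
    back to the first available record. Without it, return the record with
    the highest known skill level.
    """
    if not segmentations:
        return None
    if preferred is not None:
        for seg in segmentations:
            if seg.get("skill") == preferred:
                return seg
        return segmentations[0]
    # Unknown skill labels rank below every known level.
    return max(segmentations, key=lambda s: SKILL_RANK.get(s.get("skill"), -1))


masks = [{"id": "a", "skill": "novice"}, {"id": "b", "skill": "expert"}]
print(pick_segmentation(masks))
```

Keeping the preferred level as an optional argument leaves room for users who trust a lower skill level more for their data.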

Please fix malignant downloading issue

It gets stuck at 196 for me. When I set offset 200 and num-images 3000, it gets stuck at 100. I read your comment about a fix and tried to implement it, but it didn't work. Please fix this as soon as possible.

Problem with new parallel version

I am running the new code in Python 2.7.12 in Linux and I get the following error. In addition no image/description was downloaded:

$ python download_dataset.py
Collecting all images ids
Thread 0 started
Thread 1 started
downloading image (0/29)
downloading image (30/39)
url_image = https://isic-archive.com/api/v1/image/5436e3abbae478396759f0cf/download?contentDisposition=inline
...
downloading image (27/29)
url_image = https://isic-archive.com/api/v1/image/5436e3aebae478396759f105/download?contentDisposition=inline
downloading image (28/29)
url_image = https://isic-archive.com/api/v1/image/5436e3aebae478396759f107/download?contentDisposition=inline
Traceback (most recent call last):
  File "download_dataset.py", line 95, in <module>
    download_dataset()
  File "download_dataset.py", line 71, in download_dataset
    print('Thread {0} finished'.format(thread._Thread__kwargs['thread_id']))
AttributeError: 'Thread' object has no attribute '_Thread__kwargs'

Download Freeze !

Hi, I have tried multiple times to download a few images, but after the metadata download the images are not downloaded. It freezes.

No option to set offset from where images should be download

Let's say I already downloaded 1000 images and now I want to download 3000.
I then have to download again the 1000 images I already have. Instead, there should be one more parameter in python download_dataset 3000 that sets the offset from which it should start downloading images.

Filter by diagnosis

Rather than just "benign" and "malignant", I'd like to be able to filter by diagnosis,
i.e. only melanoma, nevus, or seborrheic keratosis.

Syntax Error

  File "download_archive.py", line 84
    def download_descriptions(ids: list, descs_dir: str, num_processes: int) -> list:
                                 ^
SyntaxError: invalid syntax

question regarding to the segmentation map

First of all, thanks for sharing this great repo. It makes downloading the dataset way much easier.

When I tried to download the segmentation maps using the code, it always gets stuck at 51%. The number of downloaded segmentation maps is always 13779. I'm wondering whether that's because the annotations are incomplete or I'm missing something.

SyntaxError

Hello, I'm getting this error while trying to run

python download_archive.py --num_images 1000

I'm a bit new to dataset downloading and loading, so... sorry if this is too basic... :/

File "download_archive.py", line 86
    def download_descriptions(ids: list, descs_dir: str, num_processes: int) -> list:
                                 ^
SyntaxError: invalid syntax

pip and pip3 updated, already checked Request, Pillow and tqdm.

Any idea? or I'm missing something...

Nice work tho! :) Pretty much what I was looking for!

where images are stored

Premise: I am a newbie and this is the first issue I am opening on GitHub.

I am trying to use your downloader because I need the ISIC archive to build and test GANs using the Keras and TensorFlow packages. I am using Colab to exploit the GPU provided by Google, and I keep all scripts and data in the Google Drive shared folder Assignment_1. Here is my code:

!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive

!pip install -r drive/Assignment_1/requirements.txt

!python3 drive/Assignment_1/download_dataset.py 13000

Question: Where are all the images stored? Once I execute the code I get a __pycache__ folder with some .pyc files in it, but I don't know what they are. I tried to go through download_dataset_subset.py and download_dataset.py, but I don't understand them. When I type:

!python3 drive/Assignment_1/download_dataset.py 13000

I just can see:

Collecting the images ids
Downloading images and descriptions
43% (5635 of 13000) |########

as the first message, followed by progress updates until it reaches 100%, so I am pretty sure the download is complete.

Syntax error

When I run "download_archive.py" it gives me this error:

  File "download_archive.py", line 1, in <module>
    from download_single_item import LesionImageDownloader as ImgDownloader, SegmentationDownloader as SegDownloader
  File "download_single_item.py", line 97
    img : Image.Image = Image.open(image_path)
        ^
SyntaxError: invalid syntax

Skip sample if it takes too long to download

Sometimes the script hangs while trying to download a specific description or image. When the user requests a specific number k of samples, it would be nice if the script skipped or retried samples that are taking too long to download.

Right now I've been waiting for a while to download 250 malignant samples because it has been stuck trying to download the 197th for a few hours.

For the record, after I gave up and killed the process, the output showed this exception:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='isic-archive.com', port=443): Max retries exceeded with url: /api/v1/image/54e7ddbbbae4780ec59cde5f (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x10f3b2518>: Failed to establish a new connection: [Errno 60] Operation timed out',))
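
One way the skip-or-retry behaviour could look is sketched below. It is not the script's current code: the helper name fetch_with_retries and its parameters are hypothetical, and the download itself is passed in as a callable so the retry logic stays independent of the HTTP library.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times; return None if all of them fail.

    Returning None lets the caller skip this sample and move on to the next.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # e.g. a connection error or a timeout
            print("attempt %d/%d failed for %s: %s" % (attempt, attempts, url, error))
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    return None
```

With the requests library, fetch could be something like lambda u: requests.get(u, timeout=10), so that a stalled connection raises an exception instead of hanging indefinitely; the 10-second value is only an example.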

ImportError: DLL load failed

Thank you for creating the script.

I tried running the script with all the initial requirements installed, but I am getting the error below:

C:\Users\Supriya Singh\Documents\3rdSEM\ISIC-Archive-Downloader-master>python download_archive.py --filter benign
Traceback (most recent call last):
  File "download_archive.py", line 1, in <module>
    from download_single_item import LesionImageDownloader as ImgDownloader, SegmentationDownloader as SegDownloader
  File "C:\Users\Supriya Singh\Documents\3rdSEM\ISIC-Archive-Downloader-master\download_single_item.py", line 8, in <module>
    from PIL import Image
  File "C:\Users\Supriya Singh\Anaconda3\lib\site-packages\PIL\Image.py", line 64, in <module>
    from . import _imaging as core
ImportError: DLL load failed: The specified module could not be found.

C:\Users\Supriya Singh\Documents\3rdSEM\ISIC-Archive-Downloader-master>

Could you please help to find the cause?
