medmnist / medmnist

[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification

Home Page: https://medmnist.com/

License: Apache License 2.0

Python 100.00%

dataset benchmark automl mnist medical medical-image-analysis medmnist multi-modal decathlon medical-imaging

medmnist's Issues

how to load few samples per class

Hi, say I want to load 100 images per class, or maybe 10-20% of the class with the minimum count. How can I do that?
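One possible way to do this, as a minimal sketch rather than anything the library provides: group sample indices by label and draw a fixed number from each group. The helper names below are hypothetical, and the sketch assumes the labels have already been flattened into a plain sequence of integers.

```python
import random
from collections import Counter, defaultdict


def sample_indices_per_class(labels, n_per_class, seed=0):
    """Return sorted indices holding at most n_per_class samples of each label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[int(label)].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(n_per_class, len(indices))  # never ask for more than the class has
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def fraction_of_smallest_class(labels, fraction):
    """Number of samples equal to `fraction` of the rarest class (at least 1)."""
    counts = Counter(int(label) for label in labels)
    return max(1, int(fraction * min(counts.values())))


# Toy labels standing in for a flattened label array.
labels = [0, 0, 0, 0, 1, 1, 2, 2, 2]
subset = sample_indices_per_class(labels, n_per_class=2)
print(subset)
print(fraction_of_smallest_class(labels, 0.5))
```

The resulting indices could then be handed to an index-based wrapper such as torch.utils.data.Subset.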

Project dependencies may have API risk issues

Hi, In MedMNIST, inappropriate dependency versioning constraints can cause risks.

Below are the dependencies and version constraints that the project is using

numpy
pandas
scikit-learn
scikit-image
tqdm
Pillow
fire
torch
torchvision

The version constraint == will introduce the risk of dependency conflicts because the scope of dependencies is too strict.
The version constraint No Upper Bound and * will introduce the risk of the missing API Error because the latest version of the dependencies may remove some APIs.

After further analysis, in this project,
The version constraint of dependency pandas can be changed to >=0.4.0,<=1.2.5.
The version constraint of dependency scikit-learn can be changed to >=0.14,<=0.21.3.
The version constraint of dependency tqdm can be changed to >=4.36.0,<=4.64.0.
The version constraint of dependency Pillow can be changed to ==9.2.0.
The version constraint of dependency Pillow can be changed to >=2.0.0,<=9.1.1.

The above modification suggestions can reduce the dependency conflicts as much as possible,
and introduce the latest version as much as possible without calling Error in the projects.
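To make the range notation above concrete, here is a small sketch of how a constraint string such as >=4.36.0,<=4.64.0 can be checked against an installed version. It is a simplified illustration with hypothetical helper names and only handles plain numeric versions; real installers use considerably richer rules.

```python
def parse_version(text, width=4):
    """Turn '4.36.0' into (4, 36, 0, 0); only plain numeric versions are handled."""
    parts = [int(part) for part in text.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, constraint):
    """Check a version against a comma-separated constraint like '>=4.36.0,<=4.64.0'."""
    comparisons = {
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        "==": lambda a, b: a == b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    current = parse_version(version)
    for clause in constraint.split(","):
        clause = clause.strip()
        # Two-character operators are tried first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not comparisons[symbol](current, bound):
                    return False
                break
        else:
            raise ValueError("unsupported clause: {!r}".format(clause))
    return True


print(satisfies("4.50.0", ">=4.36.0,<=4.64.0"))
print(satisfies("4.65.0", ">=4.36.0,<=4.64.0"))
```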

The invocation of the current project includes all the following methods.

The calling methods from the pandas

pandas.read_csv

The calling methods from the scikit-learn

sklearn.metrics.accuracy_score
sklearn.metrics.roc_auc_score

The calling methods from the tqdm

tqdm.trange

The calling methods from the Pillow

PIL.Image.fromarray

The calling methods from the all methods

RuntimeError
numpy.random.rand.sum
fire.Fire
next
format
numpy.stack
ys.append
save_fn
setuptools.setup
numpy.random.rand
list
filename.split
available
medmnist.Evaluator.get_dummy_prediction
f.read
os.path.join
zip
time.time
download
os.path.exists
self.download
medmnist.utils.montage3d
df.append.sort_index
medmnist.utils.montage2d
frames.append
filename.split.split
save
split_.startswith
join
cls.evaluate
self.labels.max
key.INFO.medmnist.getattr
shuffle_iterator
self.get_standard_evaluation_filename
map
warnings.DeprecationWarning
medmnist.info.INFO.keys
pandas.DataFrame
index.self.labels.astype
get_default_root
y_score.pd.DataFrame.to_csv
key.INFO.medmnist.getattr.montage
numpy.argmax
key.INFO.medmnist.getattr.save
flag.INFO.medmnist.getattr
key.endswith
y_true.squeeze.squeeze
os.path.split
glob.glob
Metrics
pandas.read_csv
medmnist.utils.montage2d.save
self.__len__
pprint.pprint
open.close
df.append.append
medmnist.utils.save2d
medmnist.Evaluator.parse_and_evaluate
self.transform.convert
medmnist.Evaluator
os.path.expanduser
getAUC
xs.append
readme
range
setuptools.find_packages
dataset._collate_fn
open
info
self.__len__.append
path.endswith
sklearn.metrics.accuracy_score
y_score.squeeze.squeeze
sklearn.metrics.roc_auc_score
medmnist.utils.save_frames_as_gif
data.append
open.write
montage2d
os.makedirs
cls
getACC
numpy.load
random.shuffle
tqdm.trange
torchvision.datasets.utils.download_url
load_fn.save
os.remove
print
getattr
load_fn
medmnist.utils.save3d
skimage.util.montage
montage_frames.append
self.transform
self.target_transform
medmnist.Evaluator.evaluate
df.append.to_csv
len
numpy.random.choice
frames.save
PIL.Image.fromarray
numpy.array
collections.namedtuple
i.y_true.astype
warnings.warn
idx.append

@developer
Could you please help me check this issue?
May I pull a request to fix it?
Thank you very much.

Query related to AUC and ACC score

Dear Sir,
I noticed that in your experimental results the AUC is greater than the accuracy score. Is it normal for AUC to be greater than accuracy? Could you please explain this? Thanks
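A toy illustration of why this can happen, not taken from the paper's experiments: AUC only measures how well positives are ranked above negatives, while accuracy depends on a fixed decision threshold. The two helper functions below are hand-written for the example and the scores are invented.

```python
def accuracy_at_threshold(y_true, y_score, threshold=0.5):
    """Fraction of samples classified correctly when scores >= threshold mean class 1."""
    predictions = [1 if score >= threshold else 0 for score in y_score]
    correct = sum(pred == true for pred, true in zip(predictions, y_true))
    return correct / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, t in zip(y_score, y_true) if t == 1]
    negatives = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Every positive is ranked above every negative, but all scores sit below 0.5.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]

print(roc_auc(y_true, y_score))                # 1.0
print(accuracy_at_threshold(y_true, y_score))  # 0.5
```

Here the ranking is perfect, so the AUC is 1.0, yet every positive falls under the 0.5 threshold and accuracy drops to 0.5.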

Links to download datasets are down

The links to download medmnist datasets seem to be down.
Download fails with the following error: meth_name = 'http_error_503', http_err = 1

How to visualise data without montage?

https://colab.research.google.com/drive/1Infsau44_tq-cdh3acQ1hCfW01Y-QkQy?usp=sharing

Given that .montage() makes PIL Images, why does using matplotlib to visualise the images make them extremely blurry?

Can someone please let me know how to visualise the individual images?

Normalization config

May I know if you have a normalization parameter setup for each sub-dataset? Thank you!

Visualization of MedMNIST Images

Dear Authors,
Thank you again for making the dataset public. I have a question regarding the dermamnist dataset, and I am having some issues while visualizing it. I am using dermamnist with pytorch for a classification task, and my data loader is the following -

import torch
from torch.utils.data import Dataset

class MedMNISTDatasetProxy(Dataset):
    def __init__(self, tensors, transform=None):
        assert tensors[0].shape[0] == tensors[1].shape[0]
        self.tensors = tensors
        self.transform = transform

    def __getitem__(self, index):
        x = self.tensors[0][index]

        if self.transform:
            x = self.transform(x)

        y = torch.tensor(self.tensors[1][index])
        
        return x, y

    def __len__(self):
        return self.tensors[0].shape[0]

The transform list which I am passing is the following -

data_transform_proxy = transforms.Compose([transforms.ToTensor()])

I am making a data loader from this dataset (because I need that in my application), and I save the data loader and then load it again for the purpose of visualization. I am trying to visualize the images by using transforms.ToPILImage() in PyTorch.
However, when I visualize the images, the dermamnist images come out with a green tint, and I'm not sure why this is happening. A few of the image visualizations are attached below -

The same issue happens with pathmnist also. The histopath images are usually pinkish color but the visualization, using the same procedure as above results in the following visualization -

If needed, my code for making the image grid is as follows -

def image_grid(imgs, rows, cols, original = False):
    print(rows, cols, len(imgs))
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        if original:
            img = img.convert("RGB")
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

Thanks and please let me know if I am missing something.
Best Regards,
Megh

Visualize 28x28x28 data

Dear repo,

This is not a bug report. I am trying to visualize the 28x28x28 MNIST data, as the montage is not very clear.
Is there any example available? Thanks,

AssertionError

Hello, when I run "python -m medmnist save --flag=organmnist3d --folder=tmp/",
the terminal shows
Saving organmnist3d train...
Using downloaded and verified file: /home/islab/.medmnist/organmnist3d.npz
Traceback (most recent call last):
File "/home/islab/anaconda3/envs/covid/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/islab/anaconda3/envs/covid/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/islab/MedMNIST-main/medmnist/main.py", line 123, in
fire.Fire()
File "/home/islab/anaconda3/envs/covid/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/islab/anaconda3/envs/covid/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
target=component.name)
File "/home/islab/anaconda3/envs/covid/lib/python3.6/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/islab/MedMNIST-main/medmnist/main.py", line 45, in save
dataset.save(folder, postfix)
File "/home/islab/MedMNIST-main/medmnist/dataset.py", line 169, in save
assert postfix == "gif"
AssertionError

I don't know how to solve it and hope you can help.

Thank you in advance!

Labelling vs ground truth

Hi, Just a quick question.

Where can I get confirmation of ground truth labels for your datasets?

Specifically BreastMNIST and AdrenalMNIST3D.

Could a copy of the data be hosted on Baidu?

As the title says: could a copy of the data be hosted on Baidu?

Easy way to combine datasets?

Is there any code snippet to combine multiple datasets?

Examples Organ(A/C/S)MNIST.

Ideally a 33-class problem instead of 3 separate 11-class ones?
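A rough sketch of the label side of such a merge, assuming each of the three datasets uses labels 0 to 10 and that the images are concatenated in the same order as the labels. The function name is hypothetical and the label values are invented for illustration.

```python
def offset_labels(label_lists, classes_per_dataset):
    """Shift the labels of the k-th dataset by k * classes_per_dataset and join them."""
    combined = []
    for dataset_index, labels in enumerate(label_lists):
        offset = dataset_index * classes_per_dataset
        combined.extend(int(label) + offset for label in labels)
    return combined


# Toy label lists standing in for the axial, coronal and sagittal datasets.
axial = [0, 10, 3]
coronal = [0, 10]
sagittal = [5]

combined = offset_labels([axial, coronal, sagittal], classes_per_dataset=11)
print(combined)  # [0, 10, 3, 11, 21, 27]
```

With 11 classes per dataset the combined labels fall in the range 0 to 32, which gives the 33-class problem described above.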

running getting_started_without_PyTorch notebook report error

I hit the error below when running the getting_started_without_PyTorch notebook. After searching around, it appears to be a Python version issue (I am using 3.11); see wireservice/agate#737.
After I downgraded Python to 3.9.0, the error was gone.
Please consider updating the code.

File ~/prj/medmnist/MedMNIST/examples/dataset_without_pytorch.py:4
      2 import random
      3 import numpy as np
----> 4 from collections import Sequence
      5 from PIL import Image
      6 from medmnist.info import INFO, HOMEPAGE, DEFAULT_ROOT

ImportError: cannot import name 'Sequence' from 'collections' (/home/xlz/miniconda3/envs/medical/lib/python3.11/collections/__init__.py)
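For context, the abstract base classes moved to collections.abc, and the old aliases in collections were removed in Python 3.10, which is why the import fails on 3.11 but not on 3.9. A minimal sketch of an import that works on both (the fallback branch only matters on very old interpreters):

```python
try:
    from collections.abc import Sequence  # available since Python 3.3
except ImportError:  # only reached on very old interpreters
    from collections import Sequence

print(isinstance([1, 2, 3], Sequence))  # True
print(isinstance({1, 2, 3}, Sequence))  # False
```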

Larger image options: 64*64 or 128*128?

I believe that the intention of this work is to provide medical datasets for quick prototyping of ML algorithms. But since medical imaging classification generally relies on micro features and textures, 28*28 might be too small to learn anything meaningful.

I am curious whether there is any way to access larger versions of these datasets directly from your repo, say of size 64*64 or 128*128.

TypeError: only length-1 arrays can be converted to Python scalars while plot showing

If you get this error

change
img, target = self.img[index], int(self.label[index])
to
img, target = self.img[index], self.label[index].astype(int)

It works for me.

License problem and use of this dataset?

Hey,

I see that your README seems to explicitly state that the dataset is licensed under Creative Commons Attribution 4.0 International ([CC BY 4.0]), which allows for commercial use.

However, if I've understood your paper correctly, at least the DermaMNIST part of the dataset is derived from the HAM10000 dataset, which as I understand is explicitly licensed CC-BY-NC (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T&version=4.0&selectTab=termsTab)

If the DermaMNIST part of this dataset is indeed derived from HAM10000, and if this dataset is hosted under CC BY 4.0, then does this not constitute a license problem?

Looking forward to hearing back

Cheers,
Jumperkables

Citation to PneumoniaMNIST original source

This seems to be an issue with the paper itself and with the website too. I wanted to access the original source of the PneumoniaMNIST dataset, but the references are copied from the OCTMNIST paper instead. Can you provide a link to original source and paper for the PneumoniaMNIST dataset?

Thanks in advance!

Can you provide the code for other models?

I am following your article, but you just provide the models of the baseline method in your GitHub.
So can you provide the code for other models?
Such as auto-sklearn , AutoKeras and Google AutoML Vision.

Custom Dataset Usage

Thank you for the repository and the code you provided. Is it possible to use my own dataset ?

The temporal dimension of the 3D dataset

The 3D datasets have dimensions (N, 28, 28, 28), where N corresponds to the number of samples. I would just like to confirm that axis=1 stands for the temporal dimension here (the number of image frames).

I have also noticed in the following function, the frames are taken from axis=1

MedMNIST/medmnist/utils.py

Line 39 in 9713611

def montage3d(imgs, n_channels, sel):

Any help would be greatly appreciated.
TIA!

How to understand the label array

Hi thank you for your work and repo!

I would like to know the semantic meaning of the label array. For example, the chestmnist has the test label array (22433, 14), and the first label is array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8). What is the label it is associated with and how to convert the array to the corresponding label(s)?

"label": {
    "0": "atelectasis",
    "1": "cardiomegaly",
    "2": "effusion",
    "3": "infiltration",
    "4": "mass",
    "5": "nodule",
    "6": "pneumonia",
    "7": "pneumothorax",
    "8": "consolidation",
    "9": "edema",
    "10": "emphysema",
    "11": "fibrosis",
    "12": "pleural",
    "13": "hernia"
},

I appreciate your help in advance. Thanks!
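One way to read such a row, sketched under the assumption that each of the 14 positions is an independent 0/1 flag indexed by the keys of the label dictionary quoted above. The function name is hypothetical.

```python
CHEST_LABELS = {
    "0": "atelectasis", "1": "cardiomegaly", "2": "effusion",
    "3": "infiltration", "4": "mass", "5": "nodule", "6": "pneumonia",
    "7": "pneumothorax", "8": "consolidation", "9": "edema",
    "10": "emphysema", "11": "fibrosis", "12": "pleural", "13": "hernia",
}


def decode_multi_hot(row, label_names=CHEST_LABELS):
    """Return the names whose position in the row is set to 1."""
    return [label_names[str(i)] for i, flag in enumerate(row) if int(flag) == 1]


print(decode_multi_hot([0] * 14))  # []
print(decode_multi_hot([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
# ['cardiomegaly', 'effusion']
```

Under that reading, an all-zero row would mean that none of the 14 findings is marked for the image.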

Generation of OrganMNIST {Axial,Coronal,Sagittal}

Hi, in the paper of MNIST_v1, you say that

" We use bounding-box annotations of 11 body organs from another study [17] to obtain the organ labels. Hounsfield-Unit (HU) of the 3D images are transformed into grey scale with a abdominal window; we then crop 2D images from the center slices of the 3D bounding boxes in axial / coronal / sagittal views (planes)."

Then I found that the size of OrganAMNIST is significantly larger than OrganMNIST3D. Does it mean that you crop multiple slices from a single 3D bbox for OrganAMNIST? I would appreciate it if you could provide further details regarding the generation of OrganAMNIST.

auc calculation issue

Dear authors:

great work!

auc calculation is significantly different from sklearn document
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score

for binary classification, why set a threshold 0.5?

About "getting_started.ipynb"

Hi there,

Why did we use batch_size * 2 when we initialize the dataloaders in the following part:

train_loader = data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
train_loader_at_eval = data.DataLoader(dataset=train_dataset, batch_size=2*BATCH_SIZE, shuffle=False)
test_loader = data.DataLoader(dataset=test_dataset, batch_size=2*BATCH_SIZE, shuffle=False)

Also, why did we use the train_dataset for evaluation during training as well? Wouldn't it be a better practice to use val split of the dataset? Is there a specific reason for your choices?

the model sizes of the searched models and the search time by AutoKeras and Google AutoML Vision

Sorry to be a bother.

I am now following your paper.
Some experimental results, i.e. the model sizes of the searched models and the search time by AutoKeras and Google AutoML Vision, may be useful to my paper.

Could you send me the records if it's possible?

Thank you very much!

my email: [email protected]

all images download as .npz

I can't find .csv or pngs, even if I use the command:
python -m medmnist save --flag=xxxmnist --folder=tmp/ --postfix=png

Question about chestmnist dataset

When I use the chestmnist dataset, I found:

class c = 0: 70472 real images
class c = 1: 7996 real images
class c = 2: 0 real images
class c = 3: 0 real images
class c = 4: 0 real images
class c = 5: 0 real images
class c = 6: 0 real images
class c = 7: 0 real images
class c = 8: 0 real images
class c = 9: 0 real images
class c = 10: 0 real images
class c = 11: 0 real images
class c = 12: 0 real images
class c = 13: 0 real images

However, it seems that the chestmnist dataset has multi-label:

Dataset ChestMNIST of size 28 (chestmnist)
    Number of datapoints: 78468
    Root location: /home/user3/.medmnist
    Split: train
    Task: multi-label, binary-class
    Number of channels: 1
    Meaning of labels: {'0': 'atelectasis', '1': 'cardiomegaly', '2': 'effusion', '3': 'infiltration', '4': 'mass', '5': 'nodule', '6': 'pneumonia', '7': 'pneumothorax', '8': 'consolidation', '9': 'edema', '10': 'emphysema', '11': 'fibrosis', '12': 'pleural', '13': 'hernia'}
    Number of samples: {'train': 78468, 'val': 11219, 'test': 22433}
    Description: The ChestMNIST is based on the NIH-ChestXray14 dataset, a dataset comprising 112,120 frontal-view X-Ray images of 30,805 unique patients with the text-mined 14 disease labels, which could be formulized as a multi-label binary-class classification task. We use the official data split, and resize the source images of 1×1024×1024 into 1×28×28.
    License: CC BY 4.0
===================
Dataset ChestMNIST of size 28 (chestmnist)
    Number of datapoints: 22433
    Root location: /home/user3/.medmnist
    Split: test
    Task: multi-label, binary-class
    Number of channels: 1
    Meaning of labels: {'0': 'atelectasis', '1': 'cardiomegaly', '2': 'effusion', '3': 'infiltration', '4': 'mass', '5': 'nodule', '6': 'pneumonia', '7': 'pneumothorax', '8': 'consolidation', '9': 'edema', '10': 'emphysema', '11': 'fibrosis', '12': 'pleural', '13': 'hernia'}
    Number of samples: {'train': 78468, 'val': 11219, 'test': 22433}
    Description: The ChestMNIST is based on the NIH-ChestXray14 dataset, a dataset comprising 112,120 frontal-view X-Ray images of 30,805 unique patients with the text-mined 14 disease labels, which could be formulized as a multi-label binary-class classification task. We use the official data split, and resize the source images of 1×1024×1024 into 1×28×28.
    License: CC BY 4.0

How can I use the multi-label instead of just binary-class?
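The two counts in the question add up to 78468, the number of training samples, which suggests that only a single 0/1 value per sample was counted. A small sketch of counting positives per label column instead, assuming each row is a 14-element 0/1 vector (a 3-label toy matrix is used here for brevity, and the function name is hypothetical):

```python
def positives_per_label(label_rows):
    """Column-wise count of 1s in a multi-hot label matrix given as a list of rows."""
    return [sum(int(value) for value in column) for column in zip(*label_rows)]


# Toy multi-hot labels: 4 samples, 3 possible findings.
rows = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],  # no finding marked
    [0, 1, 1],
]

print(positives_per_label(rows))  # [2, 2, 1]
```

Because a sample may carry several findings or none, these column counts do not need to sum to the number of samples.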

[feature request] the 3d dataset convert from npz to dicom

Hello,
Regarding converting the dataset from npz to another format: for 3d dataset, the current implementation only provides the gif format:

assert postfix == "gif"

I need the DICOM format (dcm series data). Do you plan to include that feature, or otherwise do you have suggestions for reference code that I could adapt myself?

How to concatenate the train and val datasets?

If I need to concatenate the two datasets (training set and validation set) into one whole training set, how do I do it? When I use ConcatDataset provided by PyTorch, the concatenated data doesn't expose "imgs" and "labels". For example:

train_data = data.ConcatDataset([train_dataset,val_dataset])

train_data doesn't give direct access to "imgs" and "labels", e.g. "train_data.imgs" and "train_data.labels".
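One workaround, sketched under the assumption that each split exposes its data as NumPy arrays through the imgs and labels attributes named in the question: concatenate the arrays directly instead of wrapping the dataset objects. The SimpleNamespace objects below are stand-ins for real splits, and merge_splits is a hypothetical helper.

```python
from types import SimpleNamespace

import numpy as np


def merge_splits(*splits):
    """Stack the imgs and labels arrays of several splits along the sample axis."""
    imgs = np.concatenate([split.imgs for split in splits], axis=0)
    labels = np.concatenate([split.labels for split in splits], axis=0)
    return imgs, labels


# Stand-ins for a training and a validation split.
train = SimpleNamespace(imgs=np.zeros((5, 28, 28), dtype=np.uint8),
                        labels=np.zeros((5, 1), dtype=np.uint8))
val = SimpleNamespace(imgs=np.ones((2, 28, 28), dtype=np.uint8),
                      labels=np.ones((2, 1), dtype=np.uint8))

imgs, labels = merge_splits(train, val)
print(imgs.shape, labels.shape)  # (7, 28, 28) (7, 1)
```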

Mean and Standard Deviation for the datasets while normalizing

Dear Authors,
Thank you for the dataset.
I am looking at the getting_started.ipynb, for pathmnist it is said that the normalization transform is the following - data_transform = transforms.Compose([ transforms.ToTensor(), transforms.Normalize(mean=[.5], std=[.5])])
The values 0.5, 0.5 are being used. I have the following questions.

Does this value work for all the datasets in medmnist?
Is 0.5, 0.5 the correct mean and standard deviations, or are they just approximate numbers?
Is there a place where I can find datasets and their corresponding mean and standard deviation values so I can use them in my method?

Thanks for your time and help,
Megh
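For reference, Normalize(mean=[.5], std=[.5]) computes (x - 0.5) / 0.5, which rescales ToTensor's [0, 1] output to [-1, 1] rather than standardising with dataset statistics. If dataset-specific values are wanted, they can be computed from the raw images. The sketch below assumes uint8 images of shape (N, H, W) or (N, H, W, C) and uses random data as a stand-in; channel_mean_std is a hypothetical helper.

```python
import numpy as np


def channel_mean_std(images):
    """Per-channel mean and std of uint8 images after scaling to [0, 1]."""
    scaled = images.astype(np.float64) / 255.0
    if scaled.ndim == 3:  # grayscale (N, H, W): add a channel axis
        scaled = scaled[..., np.newaxis]
    axes = (0, 1, 2)  # reduce over samples, height and width
    return scaled.mean(axis=axes), scaled.std(axis=axes)


# Random stand-in for an RGB image array.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 28, 28, 3), dtype=np.uint8)

mean, std = channel_mean_std(fake_images)
print(mean.shape, std.shape)  # (3,) (3,)
```

The statistics would normally be computed on the training split only and then reused for validation and test.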

Paired multi-modal data?

Hi there,

Thanks for the wonderful dataset!

I was wondering if there are any paired images in this dataset. What I mean by paired images (x_i, y_i) is that they belong to 2 different modalities (in this case Modality X and Modality Y) and they come from the same patient and hence mapped to the same class labels.

I see in the paper that OrganMNIST Axial, Coronal, and Sagittal come from the same source and have the same set of labels. I was wondering if these 3 modalities have paired images in them and if it includes the pairing data (which axial image is paired with which coronal and sagittal images).

Thank you.

Not able to download dataset

Dear Authors,
Thank you for making the dataset public.
When I go to this link https://zenodo.org/record/5208230#.YluEcy-B0UE , and go to one of the datasets and click on download, nothing happens and the webpage simply hangs.
I also tried using the command line to download - 'python -m medmnist download' - and the download fails.
Thanks and please let me know at the earliest.
Megh

install via conda

Hi, are you planning to make the package available for installation via conda?
That would be great, thanks!

Benchmark about Medmnist+

Could you release the benchmark for MedMNIST+, e.g. at size 224*224?
Best wishes

Request for preprocessing code

I would like to know if you could please share the code that you preprocess the datasets? MedMNIST is a good work, but some extra information contained in the original datasets is ignored, for example, for BreastMNIST, I wish to know the labels of normal and benign images, although they have been simplified into positive class. In addition, other information, like the gender/age information is important, but cannot be directly used from MedMNIST. Thanks a lot.

Where can I find sample IDs?

Hi there, I was wondering how can I extract the IDs of a dataset's scans. For example, if I'd like to go back to the original scan (in the original dataset). I skimmed through the medmnist.dataset class (e.g. for ChestMNIST or NoduleMNIST3D) but it doesn't look like there's any relevant mention. Is the sample ID traceable? Thanks!

Possible error in getting_started.ipynb?

Hello,

I was looking at the source code and attached notebooks in the folder examples. In the evaluation cell of the getting_started.ipynb notebook, we can find:

print('%s  acc: %.3f  auc:%.3f' % (split, *metrics))

This is shown as to have printed train acc: 0.983 auc:0.834 when running the statement test('train'). However, looking at the evaluator.py file in MedMNIST, it seems that the evaluator object outputs the AUC first and then the accuracy. Consequently, the print statements in your notebook(s) may be switching the two metrics.

Let me know if this is right.

Best regards,

qlero
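One way to avoid this kind of mix-up is to read the metrics by name rather than by position. The sketch below assumes the evaluator returns a namedtuple with fields AUC and ACC in that order, which follows the reading of evaluator.py given above and is not verified here; format_metrics is a hypothetical helper.

```python
from collections import namedtuple

# Assumed field order, following the issue's reading of evaluator.py: AUC, then ACC.
Metrics = namedtuple("Metrics", ["AUC", "ACC"])


def format_metrics(split, metrics):
    """Format the metrics using field names, so positional order cannot be confused."""
    return "%s  auc: %.3f  acc: %.3f" % (split, metrics.AUC, metrics.ACC)


metrics = Metrics(AUC=0.983, ACC=0.834)
print(format_metrics("train", metrics))  # train  auc: 0.983  acc: 0.834
```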

getting_started.ipynb notebook scoring issue

Your getting_started.ipynb is a great addition to the repo, but should it use train/val/test sets like your command line version does, where it picks the epoch with the highest val score as the best model, and then shows the test score for that model?

At the command line you do it like this:

==> Building and training model...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.39s/it]
epoch 0 is the best model
==> Testing model...
train AUC: 0.57632 ACC: 0.26923
val AUC: 0.49373 ACC: 0.26923
test AUC: 0.57728 ACC: 0.26923
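The selection rule described here can be sketched as follows: track the validation score per epoch, pick the epoch where it peaks, and only then report the test score of that epoch. The history values are invented for illustration and select_best_epoch is a hypothetical helper, not the repository's code.

```python
def select_best_epoch(history):
    """Return (epoch index, record) for the epoch with the highest validation AUC."""
    best_epoch = max(range(len(history)), key=lambda epoch: history[epoch]["val_auc"])
    return best_epoch, history[best_epoch]


# Invented per-epoch scores.
history = [
    {"val_auc": 0.49, "test_auc": 0.58},
    {"val_auc": 0.61, "test_auc": 0.60},
    {"val_auc": 0.55, "test_auc": 0.66},
]

best_epoch, record = select_best_epoch(history)
print("epoch %d is the best model, test AUC: %.2f" % (best_epoch, record["test_auc"]))
# epoch 1 is the best model, test AUC: 0.60
```

Note that epoch 2 has the higher test score in this toy history but is not chosen, because the test split plays no part in model selection.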

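How was the data imported into Google AutoML for training? The npz format cannot be imported.

Could a citation be added to the GitHub home page?

Thanks to the authors for creating this dataset. We plan to use it in our work; could you add a citation to the GitHub home page so that we can cite it? Thanks.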

[BUG] DataClass montage method not working with scikit-image==0.20.0

Bug description

When calling the method montage from DataClass the following error appears:

TypeError: montage() got an unexpected keyword argument 'multichannel'

Last week skimage was updated to version 0.20.0 and the method montage from DataClass is no longer working. In the tutorial notebook, this method is used to plot images (cell number 8), and skimage already displays this warning:

/usr/local/lib/python3.9/dist-packages/medmnist/utils.py:25: FutureWarning: multichannel is a deprecated argument name for montage. It will be removed in version 1.0. Please use channel_axis instead. montage_arr = skimage_montage(sel_img, multichannel=(n_channels == 3))

So, with the new skimage version, the multichannel argument has been removed.

How to reproduce this error?

Update skimage to the latest version (pip install scikit-image==0.20.0)
Run the following snippet

import medmnist
from medmnist import INFO
import torchvision.transforms as transforms
import skimage
print(f"Skimage v{skimage.__version__}")
print(f"MedMNIST v{medmnist.__version__} @ {medmnist.HOMEPAGE}")

data_flag = 'pathmnist'
info = INFO[data_flag]
download = True

DataClass = getattr(medmnist, info['python_class'])
data_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[.5], std=[.5])
])
# load the data
train_dataset = DataClass(split='train', transform=data_transform, download=download)
train_dataset.montage(length=1)

Additional context
A temporary workaround to bypass this error is to modify the requirements.txt file enforcing scikit-image==0.19.0
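As an alternative to pinning the version, the keyword can be chosen at call time by inspecting the function signature. This is only a sketch: montage_channel_kwargs is a hypothetical helper, the two montage functions are stand-ins so the example runs without scikit-image, and the channel_axis=-1 convention for RGB input is an assumption based on the warning quoted above.

```python
import inspect


def montage_channel_kwargs(montage_func, n_channels):
    """Pick 'channel_axis' or the older 'multichannel' keyword, whichever is accepted."""
    params = inspect.signature(montage_func).parameters
    is_color = (n_channels == 3)
    if "channel_axis" in params:
        return {"channel_axis": -1 if is_color else None}
    return {"multichannel": is_color}


# Stand-ins mimicking the newer and older keyword styles.
def new_style_montage(arr_in, *, channel_axis=None):
    return ("new", channel_axis)


def old_style_montage(arr_in, multichannel=False):
    return ("old", multichannel)


print(new_style_montage([], **montage_channel_kwargs(new_style_montage, 3)))
print(old_style_montage([], **montage_channel_kwargs(old_style_montage, 3)))
```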

How to use the latest 64, 128 and 224 version of dataset with data_flag without downloading externally?

Hi,

I appreciate it that you release the new version of this dataset and I want to make use of the larger ones like pathmnist_64.
My previous use of your data is as follows:
data_flag = 'pathmnist'
info = INFO[data_flag]
channel = info['n_channels']
im_size = (32, 32)
num_classes = len(info['label'])

So for these new dataset like pathmnist_64, is there any new data_flag for me to use?

separation of concern and publication on PyPI

I just found this project by chance. I think it is a wonderful idea to have this many different modalities of data formatted like the MNIST dataset. This may give rise to a lot of opportunities during teaching or during sandboxing of methods.

I suggest to split off the dataset.py part completely and put this on PyPI. This way, any user doesn't have to rely on the dependencies which are exposed at this point. In addition, people can easily adopt the datasets by including a relevant statement in their requirements.txt or environment.yml.

What do you think?

Encountering `BadZipFile` Bug When Loading `pathmnist.npz` Locally

My Problem

Traceback (most recent call last):
  File "/home/21009290012/Projects/DRLProjects/CNNLIME/train.py", line 303, in <module>
    main(data_flag, output_root, num_epochs, gpu_ids, batch_size, download, model_flag, resize, as_rgb, model_path, run)
  File "/home/21009290012/Projects/DRLProjects/CNNLIME/train.py", line 63, in main
    train_dataset = DataClass(split='train', transform=data_transform, download=False, as_rgb=as_rgb, root='dataset/')
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/medmnist/dataset.py", line 43, in __init__
    npz_file = np.load(os.path.join(self.root, "{}.npz".format(self.flag)))
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/numpy/lib/npyio.py", line 444, in load
    ret = NpzFile(fid, own_fid=own_fid, allow_pickle=allow_pickle,
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/numpy/lib/npyio.py", line 190, in __init__
    _zip = zipfile_factory(fid)
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
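Since an .npz file is a zip archive, this error usually means the file on disk is not a valid archive, for example an interrupted download or a saved error page. A quick sanity check before constructing the dataset might look like the sketch below. The path layout follows the np.load call in the traceback; check_npz is a hypothetical helper and the files are throwaway stand-ins.

```python
import os
import tempfile
import zipfile


def check_npz(root, flag):
    """Report whether <root>/<flag>.npz exists and is a readable zip archive."""
    path = os.path.join(root, "{}.npz".format(flag))
    if not os.path.exists(path):
        return "missing"
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    return "ok"


with tempfile.TemporaryDirectory() as root:
    # A file with the right name but the wrong content, e.g. a saved error page.
    with open(os.path.join(root, "broken.npz"), "wb") as handle:
        handle.write(b"<html>not an archive</html>")
    # A valid archive with a placeholder member.
    with zipfile.ZipFile(os.path.join(root, "good.npz"), "w") as archive:
        archive.writestr("placeholder.npy", b"placeholder")

    results = {flag: check_npz(root, flag) for flag in ("broken", "good", "absent")}

print(results)
```

If the check reports anything other than "ok", downloading the file again is the likely fix.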

My Configuration

My Environment

Platform: Linux
Torch 1.13.0+cu116
Python: 3.10.12

My Code

train_dataset = DataClass(split='train', transform=data_transform, download=False, as_rgb=as_rgb, root='dataset/')

My Project Structure

CNNLIME
|---checkpoints
|---dataset
|        |---pathmnist.npz
|---model.py
|---train.py

The path of the dataset

https://github.com/MedMNIST/MedMNIST/blob/main/examples/getting_started.ipynb
In the link above, we can see that the default download path is /home/<username>/.medmnist/pathmnist.npz.

I would like to ask how can I change the path of the downloaded data?
How can I configure the parameters below? Thx :)

train_dataset = DataClass(split='train', transform=data_transform, download=download)
test_dataset = DataClass(split='test', transform=data_transform, download=download)

download by Command Line Tools | Something went wrong when downloading

Hello, thank you so much for this amazing work!
I was trying to download one of the datasets with the command line tool, and I tried this command:
python -m medmnist save --flag=organsmnist --folder=tmp/ --postfix=png --download=True --size=224
but I got this output, which suggests it is trying to reach a wrong URL.

Traceback (most recent call last):
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1286, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1332, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1281, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1041, in _send_output
    self.send(msg)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 979, in send
    self.connect()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1451, in connect
    super().connect()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 945, in connect
    self.sock = self._create_connection(
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/socket.py", line 851, in create_connection
    raise exceptions[0]
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/socket.py", line 836, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/dataset.py", line 106, in download
    download_url(
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 134, in download_url
    url = _get_redirect_url(url, max_hops=max_redirect_hops)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 82, in _get_redirect_url
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/__main__.py", line 184, in <module>
    fire.Fire()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/__main__.py", line 71, in save
    dataset = getattr(medalist, INFO[flag]["python_class"])(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/dataset.py", line 56, in __init__
    self.download()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/dataset.py", line 113, in download
    raise RuntimeError(
RuntimeError: Something went wrong when downloading! Go to the homepage to download manually. https://github.com/MedMNIST/MedMNIST/

Am I using the wrong command, or did something else go wrong? I would really appreciate it if you could help me with this.

Is it too small to process medical image data into a size of 28*28*28?

Hi, thanks for sharing your work.
I have a question: is it too small to process 3D medical image data into a size of 28*28*28, especially when classifying based on detailed features in medical images?

How do you process medical images with an original size such as 256*256*64 into a size of 28*28*28? Have you considered the loss of detail caused by downsizing? I noticed that your work exhibits high performance metrics such as AUC.

Evaluation about AutoML Methods

Thanks very much for your nice work!

I have read your source code and paper. I found both your code and AutoKeras chose the best model based on the highest AUC score on the validation set.

However, how Google AutoML Vision and auto-sklearn are evaluated is not described. If they use the best test AUC/ACC during the search, is it an unfair comparison?

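The original images are 1024*1024; when resized to 28*28, won't medical images lose too much information?

Hi, for medical images where the original 1024*1024 image is resized to 28*28 and used for multi-class lesion classification, is there still any value? Won't a lot of detail be lost?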

What is the command line code to download one specific dataset?

python -m medmnist download pneumoniamnist --size=28 ends up throwing a root-not-found error, but if I remove the dataset name and just run python -m medmnist download --size=28, it seems to work?