The goal is to design an image-processing algorithm that can determine the type of skin cancer shown in an image. The code is based on PyTorch.
HAM10000 ("Human Against Machine with 10000 training images") dataset consists of dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).
More than 50% of lesions are confirmed through histopathology (histo), the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id-column within the HAM10000_metadata file.
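Since the seven diagnostic codes above are strings in the metadata file, they need to be encoded as integer class labels before training. A minimal sketch of one way to do this is below; the alphabetical ordering of the codes is an assumption, and the repository may use a different ordering.

```python
# Map the seven HAM10000 diagnostic codes to integer class indices.
# Alphabetical ordering here is an assumption, not the repo's guaranteed order.
DX_CODES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]
DX_TO_IDX = {code: idx for idx, code in enumerate(DX_CODES)}

def encode_labels(dx_column):
    """Convert an iterable of dx strings (e.g. the `dx` column of
    HAM10000_metadata) into integer class labels."""
    return [DX_TO_IDX[dx] for dx in dx_column]
```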
I have used resnext101_32x8d with pretrained weights for training and classification.
- 1. Setup Instructions and Dependencies
- 2. Dataset
- 3. Training the model
- 4. Observations
- 5. Repository overview
Clone the repository on your local machine.

```shell
git clone https://github.com/ishanrai05/skin-cancer-prediction
```
Start a virtual environment using python3.

```shell
virtualenv env
```
Install the dependencies.

```shell
pip install -r requirements.txt
```
You can also use a Google Colab notebook. In that case, just upload the notebook provided in the repository and you are good to go.
The dataset is available to download from the official site here. Download and extract the dataset into an `input` folder in the same directory.
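After extracting the archive, it can be useful to verify that the expected files landed in the `input` folder. A small sketch, assuming the metadata file is named `HAM10000_metadata.csv` as in the official release:

```python
import os

def check_dataset(root="input"):
    """Return the list of expected dataset files missing under `root`.

    The file name below is an assumption based on the official
    HAM10000 release; adjust it if your extraction differs.
    """
    expected = ["HAM10000_metadata.csv"]
    return [f for f in expected if not os.path.exists(os.path.join(root, f))]
```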
To train the model, run

```shell
python main.py --train=True
```
optional arguments:

| argument | default | description |
|---|---|---|
| `-h, --help` | | show help message and exit |
| `--use_cuda` | False | device to train on; default is CPU |
| `--samples` | False | see sample images |
| `--view_data_counts` | False | visualize the data distribution |
| `--num_epochs` | 10 | number of epochs to train for |
| `--train` | True | train the model |
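The flags above could be parsed with `argparse` roughly as sketched below. This is an illustration of the documented interface, not the repository's actual `main.py`; note that boolean flags passed as `--train=True` need explicit string parsing, since argparse's `type=bool` treats any non-empty string as true.

```python
import argparse

def str2bool(v):
    # Parse "True"/"False" style values explicitly; argparse's
    # type=bool would treat the string "False" as truthy.
    return str(v).lower() in ("true", "1", "yes")

def get_args(argv=None):
    parser = argparse.ArgumentParser(description="Skin cancer prediction (sketch)")
    parser.add_argument("--use_cuda", type=str2bool, default=False,
                        help="device to train on; default is CPU")
    parser.add_argument("--samples", type=str2bool, default=False,
                        help="see sample images")
    parser.add_argument("--view_data_counts", type=str2bool, default=False,
                        help="visualize the data distribution")
    parser.add_argument("--num_epochs", type=int, default=10,
                        help="number of epochs to train for")
    parser.add_argument("--train", type=str2bool, default=True,
                        help="train the model")
    return parser.parse_args(argv)
```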
I achieved a training accuracy of up to 88.149% and a validation accuracy of up to 87.565%.
The top graph shows the training loss (blue line) and training accuracy (orange line).
This repository contains the following files and folders:

- `notebook`: contains the Jupyter notebook for the code.
- `resources`: contains images.
- `dataset_loader.py`: PyTorch code to load the dataset.
- `models.py`: code for the models.
- `read_data.py`: code to read the images.
- `visualize.py`: code for visualizations.
- `utils.py`: contains helper functions.
- `train.py`: function to train models from scratch.
- `main.py`: main code to run the model.
- `requirements.txt`: lists dependencies for easy setup in virtual environments.
@data{DVN/DBW86T_2018,
author = {Tschandl, Philipp},
publisher = {Harvard Dataverse},
title = "{The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions}",
UNF = {UNF:6:IQTf5Cb+3EzwZ95U5r0hnQ==},
year = {2018},
version = {V1},
doi = {10.7910/DVN/DBW86T},
url = {https://doi.org/10.7910/DVN/DBW86T}
}