For the final project of Flatiron School's Data Immersive course, I wanted to focus again on image classification. From my personal blog presentation on using SSIM to measure pixel-level image similarity to Module 4's animal image classification, I found analyzing unstructured data to be very satisfying. In addition, I studied biology as an undergrad and interned at a breast cancer research lab for a year before joining this bootcamp, so I wanted to combine these two passions into one project. Unfortunately, labeled pictures of cancer cells are extremely difficult to obtain, as they are proprietary data of each medical institute. So I focused instead on building a model that can identify cell organelles based on their structures. Each organelle has been tagged with a specific fluorescent protein: a given protein binds only to a specific organelle, and once bound it gives off a color (most commonly green, though other stains exist, such as DAPI for the nucleus).
I worked with two different datasets. Neither dataset is in the GitHub repo, but I will link them below:
The first dataset is labeled cell pictures of yeast from Chong et al. Yeast was used instead of human cells because the two share roughly 90% similarity in cell organelle structure. The dataset labels 11 organelles, but I decided to work on only 6.
The dataset's source stated that the pictures were high resolution. So, using what I know about transfer learning with the VGG16 model, I built a few models and trained them on the images. Since there were many pictures for each label, this took a while. Afterward, I looked at basic evaluation metrics and saw that my model was not performing above baseline. When I actually examined the pictures, I found that the resolution was in fact terrible. I still went ahead and modified my model until I finally made one that could predict at double the baseline. That can be found in my Dataset 1 Final Notebook.
Here are the model's training and validation accuracy graphs:
Here are the classification metrics and confusion matrix for dataset 1:
Without higher image resolution, I was not able to make accurate predictions.
The second dataset is from the Human Protein Atlas, which hosts it on Kaggle; it can be found in HPA-Kaggle. These pictures were higher resolution, but they were unlabeled, with the labels stored in a separate CSV. Using os.path.join, that was relatively easy to fix.
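The label-linking step might look something like the sketch below. The column names (`Id`, `Target`) and the `_green.png` filename suffix are assumptions for illustration, not the actual schema of the Kaggle CSV:

```python
import os

import pandas as pd


def link_labels(csv_path, image_dir):
    """Attach a full image filepath to each labeled row of the CSV.

    Hypothetical sketch: the "Id"/"Target" column names and the
    "_green.png" suffix are assumed, not taken from the real dataset.
    """
    df = pd.read_csv(csv_path)
    # os.path.join builds a platform-correct path for each image id
    df["filepath"] = df["Id"].apply(
        lambda i: os.path.join(image_dir, f"{i}_green.png")
    )
    return df
```

With the filepath column in place, the DataFrame can feed a generator such as Keras's `flow_from_dataframe` directly.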
The final notebook for dataset 2 can be found in my Dataset 2 Final Notebook. After linking the labels, here are the categories:
After dropping labels with fewer than 200 samples, I ran the data through the model I built for dataset 1. After tweaking it some more, I finally achieved a high-accuracy result.
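The filtering step above could be written along these lines; the DataFrame layout and the `Target` column name are assumptions carried over from the label-linking sketch:

```python
import pandas as pd


def drop_rare_labels(df, label_col="Target", min_count=200):
    """Keep only rows whose label appears at least min_count times.

    Assumed schema: one label per row in `label_col`.
    """
    counts = df[label_col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[label_col].isin(keep)].reset_index(drop=True)
```

Dropping rare classes like this trades coverage for stability: classes with only a handful of examples tend to produce noisy validation metrics and are better excluded until more data is available.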
And here are the training and validation graphs:
Here are some examples of its predictions:
As you can see in the last picture, the model was able to identify some aggresomes, which form in the first stages of cancer occurrence. However, I will need many more pictures of different variations of aggresomes to be able to predict them accurately.
Although I am confident in my algorithm's ability to train on and classify cell organelles, I was not able to achieve the high accuracy I wanted due to the low-resolution pictures. My goal is to get high-resolution pictures of cancer cells and run them through this model. This could potentially help detect cancer cells at the same level as a trained professional.