thomashopkins32 / hubmap
Hacking the Human Vasculature (Kaggle Competition)
License: Apache License 2.0
I already have a test for how this would work in utils.py. I think this would help by allowing for a larger batch size during training.
We also receive annotations that the experts who annotated the data are unsure about.
We could try the following: for example, using a soft label of 0.5 for these masks. Each of these options should be tested and examined in isolation on various sizes of the training set (we probably need cross-validation).
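The soft-label idea above could be sketched as a loss that swaps in 0.5 for the uncertain pixels. This is a minimal sketch, assuming PyTorch and a per-pixel "unsure" mask; the function name and tensor shapes are hypothetical, not from the project code.

```python
import torch
import torch.nn.functional as F

def soft_label_bce(logits, target, unsure_mask, unsure_value=0.5):
    """Binary cross-entropy where pixels flagged as 'unsure' get a soft
    target of `unsure_value` instead of a hard 0/1 label.
    All tensors have shape (N, H, W); unsure_mask is 1 where the expert
    annotation is uncertain."""
    soft_target = torch.where(unsure_mask.bool(),
                              torch.full_like(target, unsure_value),
                              target)
    return F.binary_cross_entropy_with_logits(logits, soft_target)

# Tiny example: a 1x2x2 prediction with one unsure pixel
logits = torch.zeros(1, 2, 2)
target = torch.ones(1, 2, 2)
unsure = torch.zeros(1, 2, 2)
unsure[0, 0, 0] = 1.0
loss = soft_label_bce(logits, target, unsure)
```

With all-zero logits every pixel predicts 0.5, so the loss equals ln 2 regardless of the soft target here; the difference shows up once the model becomes confident.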
Kaggle uses NVIDIA Tesla P100 GPUs which have 16 (or 12?) GB of dedicated memory. Testing locally using my 3070 which has 8 GB, we can run a batch size of 4 using full precision and a batch size of 8 using mixed precision. We should test how many samples we can fit in a batch using 16 GB of memory.
My guess would be in the range of 16-20 samples for mixed precision but maybe more?
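A mixed-precision training step of the kind referred to above can be sketched with torch.cuda.amp. This is a sketch under assumptions: model, loader, and criterion are placeholders for the real objects, and the function name is hypothetical.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    """One epoch of mixed-precision training (falls back to full
    precision when device is not CUDA)."""
    use_amp = (device == "cuda")
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    for images, masks in loader:
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):
            preds = model(images)
            loss = criterion(preds, masks)
        scaler.scale(loss).backward()   # scale loss to avoid fp16 underflow
        scaler.step(optimizer)
        scaler.update()
```

Halving activation memory is roughly what allowed the jump from batch size 4 to 8 on the 8 GB card, so a similar doubling on a 16 GB P100 is plausible but should be measured.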
Right now it takes a couple of minutes to load all of the images and compute polygons for the various annotated masks. I should look into a vectorized version of the polygon function to speed this computation up.
I should also see if ChatGPT has some simple improvements to make that could speed up my code.
Starting with a pre-trained model rarely hurts performance and since there is a low amount of annotated training data, I think this would help quite a bit.
Follow Karpathy's recipe for training: http://karpathy.github.io/2019/04/25/recipe/
We want to settle on what is working and what isn't. I think it's clear that release 1 and release 2 have overfit quite a bit to the training data.
The competition has ended and I should dive into what worked for people and what didn't.
Now that we are following the advice in the UNet paper, we should try to train the model once again.
The number of samples I can fit in a batch on my GPU (8 GB memory) is only 4 at the moment.
Analysis of the GPU memory requirements will make it easier to determine the best method for training the model. A good way to go about this would be to call get_model_gpu_memory after each layer in the network. I need to know how the memory requirements change throughout a forward pass.
I should also look into other methods (or packages) that can do this work for me.
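One way to get per-layer memory numbers without modifying the network is forward hooks. This is a sketch, not the project's get_model_gpu_memory; the helper name is made up, and it simply reads torch.cuda.memory_allocated after each leaf module runs (reporting 0 on CPU).

```python
import torch

def log_memory_per_layer(model, sample_input):
    """Record allocated CUDA memory after each leaf module's forward pass."""
    records = []

    def hook(module, inputs, output):
        mem = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
        records.append((module.__class__.__name__, mem))

    # Hook only leaf modules so each layer is counted once
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if len(list(m.children())) == 0]
    with torch.no_grad():
        model(sample_input)
    for h in handles:
        h.remove()
    return records
```

Packages like torchinfo also report per-layer sizes, which may be one of the "other methods" worth checking.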
Train on all of the training images
This should allow the network to learn what structures exist in the image which will make training for image segmentation much easier.
This would involve creating and training other models first. I can list some potential models here and decide if this is worth doing.
Kaggle has their own way of evaluating models. I need to read the documentation for the competition and implement some of it so we can submit to the competition.
Experiment with what different image transformations would do if we included them during training.
Create a new notebook to visualize some of the transformations and make sure the annotations are still available and correct.
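The key check in that notebook is that every transform is applied to the image and its mask together, so the annotation stays aligned. A minimal sketch with a horizontal flip (a stand-in for whatever augmentation library we end up using):

```python
import numpy as np

def hflip_pair(image, mask):
    """Horizontally flip an image and its segmentation mask together so
    the annotation stays aligned with the pixels."""
    return image[:, ::-1].copy(), mask[:, ::-1].copy()

# Quick check that the annotation follows the image
image = np.arange(12).reshape(3, 4)
mask = np.zeros((3, 4), dtype=np.uint8)
mask[0, 0] = 1                       # annotate the top-left pixel
flipped_image, flipped_mask = hflip_pair(image, mask)
assert flipped_mask[0, 3] == 1       # annotation moved to the top-right
```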
There are ~6,000 images in the training set that have no labels. Examine a few and try to determine the best method for utilizing this data.
Look into the following and report back:
Also look into how we might use this information in other ways.
Look into different metrics for evaluating image segmentation problems. The most common one I can think of (which will also be used by Kaggle for scoring the competition) is IoU (Intersection over Union).
We may need custom implementations for this but we also might be able to use the COCO package to do this work.
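A custom IoU implementation for binary masks is only a few lines of NumPy; a minimal sketch (the COCO tooling would replace this if it fits the workflow):

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    """Intersection over Union for binary masks (0/1 or bool arrays)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / (union + eps)

a = np.array([[1, 1], [0, 0]])
b = np.array([[1, 0], [1, 0]])
# intersection = 1 pixel, union = 3 pixels -> IoU = 1/3
```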
We have 3 different types of annotated masks available: blood_vessel, glomerulus, and unsure.
I need to see how that annotation is being passed for scoring on the hidden test set. Does Kaggle discard predictions in the region internally?
What should I do with predictions from my model that fall in the glomerulus during training? I should look into this more but here are some ideas:
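One of those ideas can be sketched directly: zero out any predicted vessel pixels that fall inside the glomerulus region before scoring or computing the loss. This helper is hypothetical, just to illustrate the option:

```python
import numpy as np

def suppress_glomerulus(pred_mask, glomerulus_mask):
    """Discard predictions inside the glomerulus region by setting them
    to 0, mirroring what Kaggle may be doing internally."""
    return np.where(glomerulus_mask.astype(bool), 0, pred_mask)
```

Whether to apply this during training (in the loss) or only at inference is exactly the open question above.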
We should see how our network performs on the "All inputs are 0" baseline. Every training run should beat this baseline.
Kaggle only works with notebooks as far as I can tell. Submission to the competition also requires a notebook.
Also, since Kaggle has double the GPU memory available, this means we can run with a larger batch size.
So the competition ended and unfortunately I could not train a good model in time.
Here are the things I could have done to get a better leaderboard score:
Here are some things I wanted to try but am unsure would help:
Here are some mistakes that I made:
Here are some things I am confused about:
And finally, here are some things I learned so far:
In sum, this was a fun project to work on and I am eager to continue trying things with it. I would like to try all of the items from the first and second sections above. I will add them as Issues to this project and work on them over the next few weeks/months.
Maybe these are better/easier to use out of the box.
Now that I have implemented one myself, I should try these open source ones which should make my workflow much faster.
They have built in tooling for reading the polygons file and it seems to work well.
See this notebook for how this works: https://www.kaggle.com/code/mersico/medical-instance-segmentation-with-yolov8
Step through each step in the debugger and make sure the data looks appropriate every step of the way.
Make sure each step of the training script is reproducible since this is an important factor for submitting a notebook to Kaggle.
Initialize the model weights well. Look into the UNet paper for guidance on this. I think there was something mentioned about initialization in there.
Verify that the loss decreases to 0 (or close to it) when we train on a single image (with multiple annotations). If it doesn't, we need to investigate why.
Decrease and increase model capacity, how does this affect the training outcome? Increased capacity should result in lower loss but potentially more overfitting.
Inspect the gradients of each layer's weights. Make sure that they look fairly regular.
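The gradient-inspection step above can be sketched as a small helper that collects the L2 norm of each parameter's gradient after backward(); near-zero or huge norms flag vanishing/exploding layers. The function name is made up for illustration:

```python
import torch

def gradient_norms(model):
    """Return {parameter name: L2 norm of its gradient} after backward()."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}
```

Printing these per layer after a few training steps is usually enough to see whether the gradients "look fairly regular".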
The np.logical_and step seems to take a very long time. We should debug this to find out why.
To make the transforms easier to work with, we should pre-split the data into training, validation, and testing sets. The testing data is a single image and is already split off. The training data needs to be randomly split, and this split needs to be saved somewhere.
This is required so that we can use no image transformations during validation and also get accurate class frequencies during training. If we use a single dataset and then do a split we run into the following issues:
We should implement TrainHuBMAP, ValidHuBMAP, and TestHuBMAP datasets instead of the single HuBMAP dataset.
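Saving the random split once means every dataset class reads the same partition, which is what makes per-split transforms and class frequencies consistent. A sketch, assuming image ids are strings; the file name, fraction, and seed are placeholders:

```python
import json
import random

def make_split(image_ids, valid_fraction=0.2, seed=42, path="split.json"):
    """Randomly split image ids into train/valid once and persist the
    split so TrainHuBMAP/ValidHuBMAP read the same partition."""
    rng = random.Random(seed)       # fixed seed -> reproducible split
    ids = list(image_ids)
    rng.shuffle(ids)
    n_valid = int(len(ids) * valid_fraction)
    split = {"valid": ids[:n_valid], "train": ids[n_valid:]}
    with open(path, "w") as f:
        json.dump(split, f)
    return split
```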
For the first couple of releases, I trained the model using things from my personal experience and intuition in training deep neural networks. I should look to see what worked for the authors and try to emulate that for this dataset.
This means that I should:
Let's see if that will improve our performance.
UNet apparently isn't meant for instance segmentation but semantic segmentation (I have to double check that this is accurate).
I should try a different model that was built for instance segmentation once I have squeezed out performance on UNet to the best of my ability.
We need to be able to save our model and train it further. Kaggle limits the GPU hours of notebooks to 30 hours per week and 9 hours per run.
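Resuming across Kaggle sessions needs both the model and optimizer state saved together. A minimal sketch; the file path and returned epoch bookkeeping are placeholder choices:

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    """Persist everything needed to resume training in a later session."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    """Restore model/optimizer state in place; returns the saved epoch."""
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"]
```

Saving the optimizer state matters because momentum/Adam statistics are lost otherwise, which can cause a loss spike when a run resumes.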