detection_nappe_hydrocarbures_imt_cefrem: Issues
Resolution of the images
cf folder data_out/2021-06-08_15h41min22s_ and folder data_out/2021-06-08_15h47min38s_
Always the same global distribution:
Oil discharge statistics
Statistics on:
- Max length (px)
- Min length (px)
- Area (px)
Get the resolution of rasterio transform
When opened with

with rasterio.open(path) as file_object:
    xres = file_object.transform[0]
    yres = file_object.transform[4]

what is the unit of xres and yres?
--> CRS units per px: degrees per px for a geographic CRS, metres per px for a projected one (note that transform[4] is negative; rasterio also exposes the pair directly as file_object.res)
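The coefficients of a rasterio transform are expressed in the raster CRS's units per pixel. A minimal sketch, reusing the coefficient values from the example transform matrix further down in this document (geographic CRS assumed, so the unit is degrees):

```python
# Affine coefficients in rasterio order: (a, b, c, d, e, f)
# a = pixel width, e = pixel height (negative because rows grow downward),
# (c, f) = coordinates of the upper-left corner in CRS units.
a = 0.00017966305682390432   # x resolution, CRS units per pixel
e = -0.00017966305682390432  # y resolution (negative)

xres = a
yres = abs(e)

# For this geographic CRS: ~0.00018 degrees per pixel in both directions
print(xres, yres)
```

With rasterio itself, `file_object.res` returns this (xres, yres) pair directly as positive values.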
RGB_overlay too slow
Use numba
Use Cython? -> readability concerns
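Before reaching for numba or Cython, it may be worth checking that the overlay is fully vectorized, since the slowdown often comes from per-pixel Python loops. A pure-NumPy sketch of an alpha blend (the function name, array shapes and blending rule are assumptions, not the actual RGB_overlay code):

```python
import numpy as np

def rgb_overlay(image, mask_rgb, alpha=0.5):
    """Blend a colour mask over a grayscale image, fully vectorized.

    image:    (H, W) float array in [0, 1]
    mask_rgb: (H, W, 3) float array in [0, 1]
    """
    base = np.repeat(image[..., None], 3, axis=2)   # grayscale -> RGB
    blend = mask_rgb.any(axis=2, keepdims=True)     # blend only where the mask is set
    return np.where(blend, (1 - alpha) * base + alpha * mask_rgb, base)

# Example: red mask over the top-left quarter of a mid-gray image
img = np.full((4, 4), 0.5)
mask = np.zeros((4, 4, 3))
mask[:2, :2, 0] = 1.0
out = rgb_overlay(img, mask)
```

If a vectorized version is still too slow, numba's `@njit` can then be applied to an explicit-loop variant of the same function.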
Rasterio and gdal installation
To be able to read images.
For Windows users, follow these steps:
- download GDAL-3.2.3-cp37-cp37m-win_amd64.whl (the GDAL installation file) and rasterio-1.2.3-cp37-cp37m-win_amd64.whl (the rasterio installation file) from this website
- within a Python 3.7 conda environment:
pip install GDAL-3.2.3-cp37-cp37m-win_amd64.whl rasterio-1.2.3-cp37-cp37m-win_amd64.whl
Class values swapped in SegmentationDataset with cache
User code usage
- Generate patch with preprocessing
- test it on the model
Question:
- get image cache
1 image, annotation, size -> set of patches, segmentations, classifications
new database: generate the patches and retrain the U-Net
Cut, train, predict
Pass as a parameter:
supply custom dataset as object with methods
Adding earth support
Create a new hdf5 with augmented patches
Main ideas:
Problems:
Currently at commit f4290b7:
- generating patches is too slow (especially loading large images into RAM)
- the GPU is largely underused
- the CPU is overloaded by preprocessing of large arrays
- training for 100 epochs takes around 1 day 10h (with a risk of memory error if the RAM is also used by other programs)
Solution:
Build a balanced, augmented dataset of square patches 1000 px wide.
Problem:
We would like to have rotated source images on which we compute patches.
Too much memory required
Solution:
Inverse transformation
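The inverse-transformation idea can be sketched as follows: instead of rotating the whole source image in memory, map each pixel of the output patch back through the inverse rotation and sample the original image directly. A nearest-neighbour sketch in NumPy (function name and parameters are illustrative, not the project's actual code):

```python
import numpy as np

def extract_rotated_patch(image, center, size, angle_deg):
    """Sample a size x size patch rotated by angle_deg around `center`,
    without rotating the whole image (nearest-neighbour, zero padding)."""
    t = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(t), np.sin(t)
    half = size / 2.0
    # Patch pixel coordinates relative to the patch center
    ys, xs = np.mgrid[0:size, 0:size]
    u, v = xs - half, ys - half
    # Inverse rotation: patch coordinates -> source image coordinates
    src_x = np.rint(center[1] + cos_t * u - sin_t * v).astype(int)
    src_y = np.rint(center[0] + sin_t * u + cos_t * v).astype(int)
    patch = np.zeros((size, size), dtype=image.dtype)
    inside = (0 <= src_x) & (src_x < image.shape[1]) & \
             (0 <= src_y) & (src_y < image.shape[0])
    patch[inside] = image[src_y[inside], src_x[inside]]
    return patch

img = np.arange(100.0).reshape(10, 10)
p = extract_rotated_patch(img, center=(5, 5), size=4, angle_deg=0.0)
```

Only one patch-sized buffer is allocated per call, so memory stays bounded regardless of the rotation angle.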
Balance the dataset
We want to provide the model with roughly as many patches containing seep or spill as patches containing nothing.
Problem: to provide data to the model we iterate over a predefined ensemble of images.
One easy but impracticable solution:
Add a file that stores the classes of each patch.
Problem: this fixes the patches:
- cannot make global rotation augmentations of the original image
- fix the size of the patches
Composite stats of seep_spill_patches and other_patches
Test read speed of the raster images versus hdf5 file
What is the time required to open one raster image versus to open it if it is in a hdf5 file ?
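A minimal timing harness for this comparison; the two loader callables below are placeholders, and the real measurement would plug in a rasterio read and an h5py dataset read:

```python
import time

def time_read(loader, n_runs=5):
    """Best wall-clock duration of loader() over n_runs calls (seconds)."""
    best = float("inf")
    for _ in range(n_runs):
        start = time.perf_counter()
        loader()
        best = min(best, time.perf_counter() - start)
    return best

# Placeholders: replace by e.g. a rasterio.open(...).read() call and an
# h5py dataset slice to get the raster-vs-hdf5 numbers.
raster_load = lambda: sum(range(100_000))
hdf5_load = lambda: sum(range(100_000))

print(f"raster: {time_read(raster_load):.4f}s, hdf5: {time_read(hdf5_load):.4f}s")
```

Taking the best of several runs limits the influence of OS caching and background load on the comparison.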
Code architecture
Choose an order to process patches of the same image
Problem: 1 image = x patches. How to reduce memory and computation consumption?
Note: patches are generated on the fly at commit cadd357
Resize the image
Calibration error in the annotations?
Speedup the code and prefetch
To communicate between processes, pickle is used by pytorch. It transmits data to other python instances by pickling the data and operations. Concretely, it serializes objects into a byte stream and passes it to the other process.
💣 cannot be changed ; 🔍 could maybe be optimized with tradeoffs ;
Problems:
- 🔍 Today multiple objects transform the data depending on the options: this allows code reusability and splits functionalities, but it complicates multiprocessing by adding multiple stateful objects to pickle.
- 💣 The hdf5 file handles cannot be pickled (they are references to datasets of the hdf5 file)
- 💣 It is not a good idea to pickle the data from the hdf5 file (too much memory used)
- 🔍 We have today only one hdf5 file, which can limit the number of parallel reads
One hdf5 file allows better compression, even if splitting it into two or four parts might be possible memory-wise. However, it would increase the complexity of the code because we would have to indicate in which file to find each image.
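A common workaround for the unpicklable-handle problem with PyTorch DataLoader workers is to pickle only the file path and reopen the handle lazily in each process. A sketch with a plain file standing in for h5py (the `h5py.File` call in the comments is the assumed real usage):

```python
import os
import pickle
import tempfile

class LazyH5Dataset:
    """Pickle-friendly dataset wrapper: stores only the file path.

    The open handle is dropped at pickling time, so the object can be sent
    to worker processes; each worker reopens its own handle on first access.
    With h5py the open call would be h5py.File(self.path, "r") (assumed
    usage; a plain text file stands in here).
    """

    def __init__(self, path):
        self.path = path
        self._file = None

    @property
    def file(self):
        if self._file is None:
            self._file = open(self.path, "r")  # h5py.File(self.path, "r")
        return self._file

    def __getstate__(self):
        # Exclude the unpicklable handle; keep only the path.
        state = self.__dict__.copy()
        state["_file"] = None
        return state

# Demo: pickling succeeds even after the handle has been opened
fd, path = tempfile.mkstemp()
os.close(fd)
ds = LazyH5Dataset(path)
_ = ds.file
clone = pickle.loads(pickle.dumps(ds))
```

Each unpickled copy holds no handle until it is used, so every DataLoader worker ends up with its own independent reader.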
Affine transformations create class artifacts
Extract the captured image: transform matrix problem
Make the inverse transform:
Example:
Transform matrix:
M = [[0.00017966305682390432, -0.0, 23.70719634263147],
[-0.0, -0.00017966305682390432, 40.38637265914473],
[0.0, 0.0, 1.0 ]]
Image input shape (with margins): (10600, 18441)
Cut the last row (as there is no 3rd dimension in the input)
Point (0,0) is mapped to (0,0)
Point (10599, 18440) is mapped to ( 1.90424874e+00, -3.31298677e+00, 9.95997286e+05)
Point (10599, 0) is mapped to [1.90424874e+00, 0.00000000e+00, 2.51272574e+05]
Point (0, 18440) is mapped to [ 0.00000000e+00, -3.31298677e+00, 7.44724712e+05]
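The non-unit third components above suggest the points were not multiplied in homogeneous form; with the third coordinate set to 1, the full 3x3 matrix maps the pixel corners to plausible longitude/latitude values:

```python
import numpy as np

# Transform matrix from the example above
M = np.array([[0.00017966305682390432, -0.0, 23.70719634263147],
              [-0.0, -0.00017966305682390432, 40.38637265914473],
              [0.0, 0.0, 1.0]])

def pixel_to_geo(col, row):
    """Map a pixel (col, row) through the full 3x3 affine in homogeneous form."""
    x, y, w = M @ np.array([col, row, 1.0])  # third coordinate must be 1
    return x / w, y / w                      # w stays 1 for an affine matrix

ul = pixel_to_geo(0, 0)          # upper-left corner -> (23.707..., 40.386...)
lr = pixel_to_geo(10599, 18440)  # opposite corner -> roughly (25.61, 37.07)
```

Keeping the homogeneous coordinate at 1 is what lets the translation column (c, f) take effect.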
Transform raster images into numpy arrays
Cut the image into patches
Problem to avoid: too many oil discharges are cut by the patch borders.
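One common way to limit discharges being cut at patch borders is an overlapping grid (stride smaller than the patch size), so that every region appears uncut in at least one patch. A sketch (the patch size and stride values are assumptions):

```python
def patch_grid(height, width, patch, stride):
    """Top-left corners of an overlapping patch grid covering the image.

    stride < patch gives overlap; the last row/column of corners is
    clamped so patches never run past the image border.
    """
    rows = list(range(0, max(height - patch, 0) + 1, stride))
    cols = list(range(0, max(width - patch, 0) + 1, stride))
    # Cover the borders even when (size - patch) % stride != 0
    if rows[-1] != height - patch:
        rows.append(height - patch)
    if cols[-1] != width - patch:
        cols.append(width - patch)
    return [(r, c) for r in rows for c in cols]

# Example: 1000 px patches with 50% overlap on a 2500 x 2500 image
corners = patch_grid(height=2500, width=2500, patch=1000, stride=500)
```

The trade-off is more patches to process; deduplicating predictions in the overlap zones then happens at stitching time.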
Pass dataset as parameter
Functionality required:
ClassificationCache:
- hdf5 file access (image and annotation) -> problem of point annotations
- access to the json info file
- access to TwoWayDict
ClassificationPatch:
- access to TwoWayDict
- hdf5 access without numpy conversion
Show the parameters and the results
Potential solutions
Solution | 👍 | 👎 |
---|---|---|
Tensorboard | Already ready | Curves cannot be used directly in papers (axes, curves and values too small) |
Dash+plotly | Graphs easy to build | Interface to build. Potentially no regex filtering |
Vuejs+node | Entirely personalized | Time consuming |
Transform into real factories
Confusion matrix for valid batch
html code example
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
body {
font-family: Arial, Helvetica, sans-serif;
background-color: transparent;
text-align: center;
}
table {
overflow: hidden;
}
tr:hover {
background-color: rgba(0, 0, 0, 0.5);
}
td,
th {
position: relative;
padding: 2em;
text-align: center;
background-color: transparent;
}
td:hover::after{
content: "";
position: absolute;
background-color: rgba(0, 0, 0, 0.5);
left: 0;
top: -5000px;
height: 10000px;
width: 100%;
z-index: -1;
}
thead tr:last-child th:last-child, tbody th:last-child, tbody tr:last-child th {
background-color: #ff9a03;
}
thead tr:last-child th div,tbody tr:last-child th:nth-child(2) div,tbody tr:last-child th:last-child div {
font-weight: bold;
color:white
}
thead tr:last-child th:last-child {
border-top: 1px solid white;
border-left: 1px solid white;
}
tbody tr:last-child th:nth-child(2) {
border-left: 1px solid white;
}
tbody tr th:last-child {
border-left: 1px solid white;
}
tbody tr:last-child th {
border-top: 1px solid white;
border-left: none;
}
thead th, tr th:nth-child(1),tr th:nth-child(2), tbody tr:last-child th:first-child {
color:white;
background-color: #0366FF;
border: none
}
tbody tr:last-child th:last-child {
border-top: 1px solid white;
border-left: 1px solid white;
}
</style>
</head>
<body>
<table cellspacing="0" cellpadding="0">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="3"> <div>True classes</div> </th>
<th></th>
</tr>
<tr>
<th></th>
<th></th>
<th>Class1</th>
<th>Class2</th>
<th>Class3</th>
<th><div>Totals <br> predictions</div></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" style="padding-left: 0;padding-right: 0;"><div style="transform: rotate(-90deg);">Predicted classes</div></th>
<th>Class1</th>
<td>43<br>43%</td>
<td>28<br>28%</td>
<td>91<br>91%</td>
<th>102<br>10%</th>
</tr>
<tr>
<th>Class2</th>
<td>32<br>32%</td>
<td>44<br>44%</td>
<td>62<br>62%</td>
<th>40<br>10%</th>
</tr>
<tr>
<th>Class3</th>
<td>46<br>46%</td>
<td>24<br>24%</td>
<td>35<br>35%</td>
<th>40<br>10%</th>
</tr>
<tr>
<th></th>
<th><div>Totals <br> true</div></th>
<th>40<br>40%</th>
<th>40<br>40%</th>
<th>40<br>40%</th>
<th><div>Correct</div><div>40<br>10%</div></th>
</tr>
</tbody>
</table>
<script>
$(document).ready(function () {
$("table tr td").each(function () {
let [value, percent] = $(this).html().split("<br>");
let value_color = parseInt(percent)/100;
$(this).css("background-color", `rgba(255, 0, 0, ${value_color})`);
console.log('Value ' + value + " with " + percent);
});
});
</script>
</body>
</html>
What is the unit of the transform matrix?
Parameters to keep track for trainings
Data:
- range of images used
- images excluded (⚠️ more raw than preprocessed)
- resolution distribution
- number of source images used
- classes available and mappings
- distribution of max/min length of oil discharges
- type of predicted values
Preprocessing:
- grid / patch size
- number of patches
- Preprocessed images or not
- which preprocessing pipeline (parse it to json)
- data augmentations (with parameters)
AI:
- loss
- metrics
- optimizer
- model
- confusion matrix
Objects used:
- make an id to log the modules used for the augmentations, the model, etc. Write the id and the name in the logs. Change the id for each version of the code: use the commit hash
- log the commit hash
Stat on number of pixels classified
Classify the images
To classify the images we will use the efficientnetv4 model, transferring its knowledge to our classification task.
We will use this repository. It seems (no precise information on the repository) that the model has been pretrained on ImageNet with 1000 classes.
Data copy
Folders to copy:
- .[DATA]\Satellite\Sentinel1
- .\Stage_Chihab\Cartographie_Hydrocarbures\OilSlicks*WGS84
Class imbalance in filtered_cache_...
Problem
- Patches with no annotations
are not given to the model ➡️ understandable that the model does not perform well on these patches
- Spill class misclassified
According to Statistics of number of classes present on patches, there are fewer patches with spill, maybe because several spill polygons can lie on the same image
But according to Compared with original polygons statistics, there are 196 spill and 533 seep over all rasters
Which format is the best for input images
[THOUGHTS]
- Which is faster: raster images or an hdf5 file? Is an hdf5 file a problem for other usages?
- Do we write a preprocessing algorithm that converts all annotations to a segmentation map and stores it with the rasters in the hdf5?
Suggestion:
- one script takes the raster images as input and stores them as numpy arrays in images.hdf5
- one script takes the raster image headers as input and stores them in images_infos.json
- one script takes the annotations as input, creates the segmentation maps from the polygons and saves them into annotations_labels.hdf5
We will use the name of the raster image as the name of each dataset contained in each hdf5 file. It will allow us to access the data easily.
What is the structure of the names of raster images ?
For example, in the name S1A_IW_GRDH_1SDV_20190601T042305_20190601T042330_027481_0319CB_0EB7_NR_Cal_ML_EC_dB.data
027481_0319CB_0EB7 ensures a unique name (maybe a shorter part does too)
Would it be better to use the location of the upper left pixel (for instance) as a name for each dataset in the hdf5 file ?
This value can appear twice (two rasters may share the same upper-left pixel location)
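For reference, the Sentinel-1 product naming convention puts the absolute orbit number, the mission data-take ID and the product unique ID in exactly those three underscore-separated fields, so a unique dataset name can be extracted with a simple split:

```python
def raster_unique_id(name):
    """Extract '<absolute orbit>_<data-take id>_<product id>' from a
    Sentinel-1 product name (underscore-separated fields 7 to 9)."""
    parts = name.split("_")
    return "_".join(parts[6:9])

name = ("S1A_IW_GRDH_1SDV_20190601T042305_20190601T042330_"
        "027481_0319CB_0EB7_NR_Cal_ML_EC_dB")
print(raster_unique_id(name))  # 027481_0319CB_0EB7
```

Any processing suffixes appended after the product ID (such as _NR_Cal_ML_EC_dB here) do not affect the extracted triple.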
RGB overlay bug supposition
After printing the confusion matrix of a batch of 100 images we obtain the following result:
But with the overlay, the network labels all images as background (no annotation)
Supposition:
As we want all patches from the same image, we use a different dataset than the datasetcache object. Therefore, different preprocessing operations could be applied between these datasets.
Reduce training time by storing clusters of shapes only
Observation
- For 100 epochs with only patch augmentation 1 day 16h necessary
- We waste a lot of time opening patches not containing interesting classes (seep, spill)
Potential solutions:
- Extract seep and spill zones into a hdf5 cache and take that as new images
- Resize input image so that patches are already at the correct size for the model
Rotate the image to cut the borders
Original idea: rotate the image so that the grid cut follows the axes of the real image, without the padding already included in the image
- problem: how to get the rotation angle?
--> maybe with the transform matrix of the raster
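If the rotation were encoded in the raster's transform, it could be read from the off-diagonal terms; note, however, that the example matrix in this project has b = d = 0, so no rotation is encoded there and the angle would have to come from the image content instead. A sketch under one common sign convention (the convention itself is an assumption):

```python
import math

def rotation_angle_deg(a, b):
    """Rotation angle encoded in the top row (a, b) of an affine transform,
    assuming a = xres * cos(theta) and b = -xres * sin(theta)."""
    return math.degrees(math.atan2(-b, a))

# The example matrix of this project has b = 0: no rotation is encoded,
# so rotation_angle_deg(...) == 0.0 here.
angle = rotation_angle_deg(0.00017966305682390432, -0.0)
```

A zero result confirms the grid axes already follow the transform, and the padding angle must be estimated from the image borders themselves.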
Check pytorch ok on both computers
Note: install in a Python 3.7 environment (on Windows this allows installing rasterio)
- pc : ok
- working station : todo
Get the annotations from the polygons
In order to extract the polygons from the .shp shapefiles, we will use the package pyshp, imported as shapefile in Python. The QGIS python package does not work (on my computer) on Windows.
The Reader object opens the shapefile and returns an object; we can then access its metadata through the record properties of that object.
We can extract the points of each polygon shape by looping over the values of the object and reading the shape.points attribute, which gives us the list of the points of the polygon.
The points seem to use geographic coordinates rather than pixel coordinates.
Example of coordinate : (25.078123755232415, 38.92436547557481)
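Since the points are geographic, converting them to pixel indices means inverting the raster's affine transform; without rotation terms this is a scale-and-offset (the coefficients below reuse the example transform from this document):

```python
def geo_to_pixel(lon, lat, a, c, e, f):
    """Invert a rotation-free affine transform:
    lon = a * col + c and lat = e * row + f  =>  (col, row)."""
    col = int(round((lon - c) / a))
    row = int(round((lat - f) / e))
    return col, row

# Example transform and shapefile coordinate from this document
a, c = 0.00017966305682390432, 23.70719634263147
e, f = -0.00017966305682390432, 40.38637265914473
col, row = geo_to_pixel(25.078123755232415, 38.92436547557481, a, c, e, f)
```

The rounding keeps the round-trip error within half a pixel; rasterio's own `~transform * (lon, lat)` would do the same inversion for the general (rotated) case.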
Allow to add multiple layers
For example:
- land
- sea
- boats
....
Problem: memory consumption
- we could maybe store the list of points in px or GPS coordinates
filtered_cache_images.hdf5 training not working
Context
Dataset with patches
- not in margins
- always with seep and/or spill annotation
- augmentation factor of 100
Problem
➡️ The model converges
but the result is unsatisfactory
Prediction
Compared to the reference
➡️ 2 problems:
- Background (= class other = no annotation): always confused with the seep class
- Spill: not recognized
Diagnosis
Statistics of number of classes present on patches
- spill_only: 50 patches
- seep_only: 13248 patches
- seep_spill: 9828 patches
Compared with original polygons statistics
To get them we have used:
- the DB Manager to execute SQL queries giving
  a. the number of rasters with each possible amount of seep
  b. the number of rasters with each possible amount of spill
- a python script to merge the two tables (obtained by copy-paste) and count the number of rasters for each possible amount of seep and spill
Number of seep | Number of spill | Number of rasters | Number of seep | Number of spill | Number of rasters
---|---|---|---|---|---
0 | 0 | 357 | 2 | 0 | 4
0 | 1 | 10 | 2 | 1 | 1
0 | 2 | 11 | 2 | 5 | 1
0 | 3 | 10 | 2 | 8 | 1
0 | 4 | 5 | 20 | 0 | 2
0 | 5 | 2 | 3 | 0 | 9
0 | 6 | 3 | 3 | 1 | 1
0 | 7 | 1 | 3 | 3 | 1
0 | 9 | 1 | 4 | 0 | 9
1 | 0 | 11 | 4 | 3 | 1
1 | 3 | 2 | 5 | 0 | 6
1 | 6 | 1 | 5 | 1 | 2
1 | 7 | 1 | 5 | 2 | 1
10 | 0 | 2 | 6 | 0 | 3
11 | 0 | 3 | 6 | 2 | 2
11 | 4 | 1 | 7 | 0 | 5
12 | 0 | 1 | 8 | 0 | 3
13 | 0 | 1 | 8 | 1 | 1
14 | 7 | 1 | 8 | 2 | 1
15 | 0 | 2 | 8 | 7 | 1
15 | 1 | 1 | 9 | 0 | 2
(with 0,0 point excluded)
Interactive visualization
Statistics of the shapes of the seeps, spills
We will use the minAreaRect function of OpenCV to build the rotated rectangle of minimum area that encloses the whole shape.