worldstrat / worldstrat Goto Github PK

View Code? Open in Web Editor NEW

217.0 6.0 24.0 30.72 MB

The WorldStrat Dataset

Home Page: https://worldstrat.github.io

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 96.06% Python 3.89% Shell 0.04%

dataset deep-learning earth-observation earth-observation-imagery remote-sensing satellite-imagery

worldstrat's Introduction

The WorldStrat Software Package

This is the companion code repository for the WorldStrat dataset and its article, used to generate the dataset and train several super-resolution benchmarks on it. The associated article and datasheet for dataset is available on arXiv.

Quick Start

Download and install Mambaforge (Windows/Linux/Mac OS X/Mac OS X ARM/Other)
Open a Miniforge prompt or initialise Mambaforge in your terminal/shell (conda init).
Clone the repository: git clone https://github.com/worldstrat/worldstrat.
Install the environment: mamba env create -n worldstrat --file environment.yml.
Follow the instructions in the Dataset Exploration notebook using the worldstrat environment.
Alternatively (manual):
Download the dataset from Zenodo, or from Kaggle
Create an empty dataset folder in the repository root (worldstrat/dataset) and unpack the dataset there.
Run the Dataset Exploration notebook, or any of the other notebooks, using the worldstrat environment.

What is WorldStrat?

Nearly 10,000 km² of free high-resolution satellite imagery of unique locations which ensure stratified representation of all types of land-use across the world: from agriculture to ice caps, from forests to multiple urbanization densities.

Those locations are also enriched with typically under-represented locations in ML datasets: sites of humanitarian interest, illegal mining sites, and settlements of persons at risk.

Each high-resolution image (1.5 m/pixel) comes with multiple temporally-matched low-resolution images from the freely accessible lower-resolution Sentinel-2 satellites (10 m/pixel).

We accompany this dataset with a paper, datasheet for datasets and an open-source Python package to: rebuild or extend the WorldStrat dataset, train and infer baseline algorithms, and learn with abundant tutorials, all compatible with the popular EO-learn toolbox.

Why make this?

We hope to foster broad-spectrum applications of ML to satellite imagery, and possibly develop the same power of analysis allowed by costly private high-resolution imagery from free public low-resolution Sentinel2 imagery. We illustrate this specific point by training and releasing several highly compute-efficient baselines on the task of Multi-Frame Super-Resolution.

Data versions and structure:

The main repository for this dataset is Zenodo. It contains:

12-bit radiometry high-resolution images in their raw format, downloaded directly from Airbus.
12-bit radiometry high-resolution images downloaded through and processed by SentinelHub.
16 temporally-matched low-resolution Sentinel-2 Level-2A revisits for each high-resolution image.
16 temporally-matched low-resolution Sentinel-2 Level-1C revisits for each high-resolution image.
The metadata about the dataset (imaged locations coordinates, several classifications).
The train/val/test split used to train the super-resolution benchmarks.
The scientific paper about the dataset and toolbox published and presented at NeurIPS 2022.

Due to Kaggle's size limitation of ~107 GB, we've uploaded what we call the "core dataset" there, which consists of:

12-bit radiometry high-resolution images, downloaded through SentinelHub's API.
8 temporally-matched low-resolution Sentinel-2 Level-2A revisits for each high-resolution image.

We used this core dataset to train the models we used as benchmarks in our paper, and which we distribute as pre-trained models.

How can I use this?

We recommend starting with the downloading and unpacking the dataset, and using the Dataset Exploration notebook to explore the data.
After that, you can also check out our source code which contains notebooks that demonstrate:

Generating the dataset by randomly sampling the entire planet and stratifying the points using several datasets.
Training a super-resolution model that generates high-resolution imagery using low-resolution Sentinel-2 imagery as input.
Running inference/generating free super-resolved high-resolution imagery using the aforementioned model.

Licences

The high-resolution Airbus imagery is distributed, with authorization from Airbus, under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
The labels, Sentinel2 imagery, and trained weights are released under Creative Commons with Attribution 4.0 International (CC BY 4.0).
This source code repository under 3-Clause BSD license.

How to cite

If you use this package or the associated dataset, please kindly cite these following BibTeX entries:

@misc{cornebise_open_2022,
  title = {Open {{High-Resolution Satellite Imagery}}: {{The WorldStrat Dataset}} -- {{With Application}} to {{Super-Resolution}}},
  author = {Cornebise, Julien and Or{\v s}oli{\'c}, Ivan and Kalaitzis, Freddie},
  year = {2022},
  month = jul,
  number = {arXiv:2207.06418},
  eprint = {2207.06418},
  eprinttype = {arxiv},
  publisher = {{arXiv}},
  doi = {10.48550/arXiv.2207.06418},
  archiveprefix = {arXiv}
}

@article{cornebise_worldstrat_zenodo_2022,
  title = {The {{WorldStrat Dataset}}},
  author = {Cornebise, Julien and Orsolic, Ivan and Kalaitzis, Freddie},
  year = {2022},
  month = jul,
  journal = {Dataset on Zendodo},
  doi = {10.5281/zenodo.6810792}
}

worldstrat's People

Contributors

Stargazers

Watchers

worldstrat's Issues

Kaggle link not found

https://www.kaggle.com/datasets/jucor1/worldstrat returns Not found

Shifted geometries / black bars around some sides of the high resolution imagery

We have identified an issue that occurs when fetching the high-resolution data from the provider.
The issue causes a black bar (1 px around RGBN, 4 px around PAN) around some sides of some of the high resolution imagery.

We are working on a fix with the imagery provider, and an updated version of the dataset will be published as soon as possible.

Duplicated tiles in `stratified_train_val_test_split.csv`

Hello,

I've found that the stratified_train_val_test_split.csv file in zenodo has dupplicated rows. The column tile which I assumed to be the id of each record in the dataset has duplicated values (38 duplicated entries).

I also found that there are 38 folders in the hr_dataset of zenodo that are not included in stratified_train_val_test_split.csv.

The notebook in this gist reproduces the problem.

plotnine issue

Hi,

just wanted to run your exploration notebook but plotnine cannot be imported since it loads matplotlib._contour which does not exist (in version 3.6.0).

ModuleNotFoundError: No module named 'matplotlib._contour'

Request all model checkpoins

Of the 3 models:

HighResNet
SRCNN Multi-Frame
SRCNN Single-Image

It appears only HighResNet checkpoint is uploaded in https://github.com/worldstrat/worldstrat/tree/main/pretrained_model
Could I request the remaining 2?
Thanks

RuntimeError: The size of tensor a (4) must match the size of tensor b (125) at non-singleton dimension 3 - "./worldstrat/src/lightning_modules.py"

I encounter some errors in using your code on the corresponding database uploaded to Zenodo.

A very minor remark: it is not specified in the doc, but the organization of data on Zenodo is not the same as that expected in the code.
For HR, add a "12bit" sub-directory between "hr_dataset" and the sub-directories containing each image.
For the LR, just remove the last characters ("_l2a" or "_l1c") from the name of the "lr_dataset" folder. I assume that the contents of these two downloaded folders "lr_dataset_l2a" and "lr_dataset_l1c" are thought to be stored in this single "lr_dataset" directory, merged by location identifier.

More annoyingly, the use of the dataloader does not provide the batches of HR images in the format expected by the code documentation. Which generates errors.

File "./worldstrat/src/lightning_modules.py", line 310, in bias_adjust
    bias = (y - y_hat).mean(dim=(-1, -2), keepdim=True) # bias / zero-order
RuntimeError: The size of tensor a (4) must match the size of tensor b (125) at non-singleton dimension 3

Looking at the shapes of the batches loaded in the loss (src.lightning_modules.py, line 198), I get the following values.
x: [1, 8, 12, 50, 50]
y: [1, 1054, 1054, 4]
This second value, in the current writing of the code, does not correspond to the dimensions (batch_size, channels, height, width) expected for the use of the function self.bias_adjust(y, y_hat) (line 226).
y: [1, 1054, 4]
y_hat: [1, 1054, 125, 125]

Any clue to help me?

Inference on cpu

In the Inference.ipynb notebook cuda is assumed, preventing running on a CPU machine by use of to("cuda")

This could be addressed by using a variable device = "cpu" # or "cuda" then using to(device)

However I still get error AssertionError: Torch not compiled with CUDA enabled which means cudatoolkit is required in the requirements I believe

Problems with pre-trained model and inference

Overall

I am trying to use worldstrat's pretrained model for inference on images not in the dataset. It would be great to have an API that takes an area of interest and
returns a super-resolution satellite image of that area. Judging from the
package's website, I assumed this would be very easy but I ran into several problems.

It would be great to have a Huggingface space that would allow someone to input
Sentinel Hub API keys and coordinates, and spit out a super resolution image.
I'd be happy to create this assuming you can help me with some of the more
serious problems below:

Specifics

Why shrink the low resolution images to 160x160? This seems like it throws away information for no reason.
Why does the pretrained model return 156x156 images instead of 500x500 which is the output size?
The inference notebook shows shrunken (50x50) images rather than the higher quality ones coming from Sentinel-2
- As far as I can tell, there is actually little improvement over the original Sentinel-2 pictures. The model seems to essentially throw away the data when converting to 50x50, and then recovers most of the information when converting back to 156x156.
Since the pre-trained model uses chips, it would be nice if there was an API for stitching the chips back together.
The code does not seem to permit inference without the high resolution images
- src/datasets.py is hard-coded to look for the HR images
The "Dataset Generation" notebook does not work due to several problems
- Commented out lines in SentinelDownloader.py prevent the downloader from actually downloading at creation, for example
- The query attribute used in SentinelCatalogue.py has been removed.
- Visualizer.py's use of fps has been removed and needs to be replaced with duration=125
The "Inference" notebook has randomly_rotate_and_flip_images set.

RuntimeError: The size of tensor a (32) must match the size of tensor b (3) at non-singleton dimension 1

I get this issue when I try to reproduce the code. It happened when I change the batch_size from 1 to any other number (>1). Does anyone can help me with that problem? I want to increase the batch_size to use utilize my GPU memory bandwidth.

Details of the error:
RuntimeError Traceback (most recent call last)
c:\Users\MRamzy\Desktop\datasets_ms_new_work\D4_Worldstat_DS\worldstrat-main\Training.ipynb Cell 6 line 1
----> 1 run_training_command(default_train_command, running_on_windows=True)

c:\Users\MRamzy\Desktop\datasets_ms_new_work\D4_Worldstat_DS\worldstrat-main\Training.ipynb Cell 6 line 3
36 if running_on_windows:
37 sys.argv += ["--num_workers", "0"]
---> 38 cli_main()

File c:\Users\MRamzy\Desktop\datasets_ms_new_work\D4_Worldstat_DS\worldstrat-main\src\train.py:42, in cli_main()
39 model = generate_model(args)
41 add_callbacks(args, dataloaders)
---> 42 generate_and_run_trainer(args, dataloaders, model)

File c:\Users\MRamzy\Desktop\datasets_ms_new_work\D4_Worldstat_DS\worldstrat-main\src\train.py:111, in generate_and_run_trainer(args, dataloaders, model)
101 trainer = pl.Trainer.from_argparse_args(args)
102 # print(len(dataloaders["train"]))
103
104 # print (dataloaders["train"].dataset[0]["lr"].shape)
105 # print(dataloaders["train"].dataset[0]["hr"].shape)
106 # print (dataloaders["val"].dataset[100]["lr"].shape)
107 # print(dataloaders["val"].dataset[100]["hr"].shape)
--> 111 trainer.fit(model, dataloaders["train"], dataloaders["val"])
112 # trainer.fit(model, dataloaders["train"])
114 if not args.fast_dev_run:

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:771, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
752 r"""
753 Runs the full optimization routine.
754
(...)
768 datamodule: An instance of :class:~pytorch_lightning.core.datamodule.LightningDataModule.
769 """
770 self.strategy.model = model
--> 771 self._call_and_handle_interrupt(
772 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
773 )

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:724, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
722 return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
723 else:
--> 724 return trainer_fn(*args, **kwargs)
725 # TODO: treat KeyboardInterrupt as BaseException (delete the code below) in v1.7
726 except KeyboardInterrupt as exception:

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:812, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
808 ckpt_path = ckpt_path or self.resume_from_checkpoint
809 self._ckpt_path = self.__set_ckpt_path(
810 ckpt_path, model_provided=True, model_connected=self.lightning_module is not None
811 )
--> 812 results = self._run(model, ckpt_path=self.ckpt_path)
814 assert self.state.stopped
815 self.training = False

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:1237, in Trainer._run(self, model, ckpt_path)
1233 self._checkpoint_connector.restore_training_state()
1235 self._checkpoint_connector.resume_end()
-> 1237 results = self._run_stage()
1239 log.detail(f"{self.class.name}: trainer tearing down")
1240 self._teardown()

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:1324, in Trainer._run_stage(self)
1322 if self.predicting:
1323 return self._run_predict()
-> 1324 return self._run_train()

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:1346, in Trainer._run_train(self)
1343 self._pre_training_routine()
1345 with isolate_rng():
-> 1346 self._run_sanity_check()
1348 # enable train mode
1349 self.model.train()

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:1414, in Trainer._run_sanity_check(self)
1412 # run eval step
1413 with torch.no_grad():
-> 1414 val_loop.run()
1416 self._call_callback_hooks("on_sanity_check_end")
1418 # reset logger connector

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\loops\base.py:204, in Loop.run(self, *args, **kwargs)
202 try:
203 self.on_advance_start(*args, **kwargs)
--> 204 self.advance(*args, **kwargs)
205 self.on_advance_end()
206 self._restarting = False

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py:153, in EvaluationLoop.advance(self, *args, **kwargs)
151 if self.num_dataloaders > 1:
152 kwargs["dataloader_idx"] = dataloader_idx
--> 153 dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
155 # store batch level output per dataloader
156 self._outputs.append(dl_outputs)

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py:127, in EvaluationEpochLoop.advance(self, data_fetcher, dl_max_batches, kwargs)
124 self.batch_progress.increment_started()
126 # lightning module methods
--> 127 output = self._evaluation_step(**kwargs)
128 output = self._evaluation_step_end(output)
130 self.batch_progress.increment_processed()

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py:222, in EvaluationEpochLoop._evaluation_step(self, **kwargs)
220 output = self.trainer._call_strategy_hook("test_step", *kwargs.values())
221 else:
--> 222 output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
224 return output

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\trainer\trainer.py:1766, in Trainer._call_strategy_hook(self, hook_name, *args, **kwargs)
1763 return
1765 with self.profiler.profile(f"[Strategy]{self.strategy.class.name}.{hook_name}"):
-> 1766 output = fn(*args, **kwargs)
1768 # restore current_fx when nested context
1769 pl_module._current_fx_name = prev_fx_name

File c:\Users\MRamzy\anaconda3\envs\new_torch\lib\site-packages\pytorch_lightning\strategies\strategy.py:344, in Strategy.validation_step(self, *args, **kwargs)
339 """The actual validation step.
340
341 See :meth:~pytorch_lightning.core.lightning.LightningModule.validation_step for more details
342 """
343 with self.precision_plugin.val_step_context():
--> 344 return self.model.validation_step(*args, **kwargs)

File c:\Users\MRamzy\Desktop\datasets_ms_new_work\D4_Worldstat_DS\worldstrat-main\src\lightning_modules.py:441, in LitModel.validation_step(self, batch, batch_idx)
426 def validation_step(self, batch, batch_idx):
427 """Validation step.
428 Calls the forward pass, computes the loss and metrics for the validation batch.
429 Logs the reduced metrics.
(...)
439 Validation loss.
440 """
--> 441 loss_output = self.loss(
442 batch, self.val_metrics, self.baseline_val_metrics, prefix="val"
443 )
444 _, metrics_reduced = self.unpack_and_reduce_metrics(loss_output, prefix="val")
445 self.log_dict(metrics_reduced)

File c:\Users\MRamzy\Desktop\datasets_ms_new_work\D4_Worldstat_DS\worldstrat-main\src\lightning_modules.py:239, in LitModel.loss(self, batch, metrics, baseline_metrics, prefix)
232 y_hat_base = self.bias_adjust(y, y_hat_base)
233 baseline_m = self.compute_baseline_metrics(
234 y, y_hat_base, m, baseline_metrics, prefix
235 )
238 loss = (
--> 239 (self.hparams.w_mse * mse)
240 + (self.hparams.w_mae * mae)
241 + (self.hparams.w_tv * tv)
242 + (self.hparams.w_ssim * ssim)
243 )
245 if self.hparams.benchmark:
246 self.validation_log(y, detach_dict(m), baseline_m, prefix)

RuntimeError: The size of tensor a (32) must match the size of tensor b (3) at non-singleton dimension 1

I have JP2 but code want tiff file?

Missing data - subfolders

Hello,
after downloading the data from Zenodo and trying to retrain the Benchmarks, I get file not found errors.
rasterio.errors.RasterioIOError: --path--/lr_dataset/ASMSpotter-1-1-1/L2A/: No such file or directory
As it turns out, the folder 'ASMSpotter-1-1-1' is not in the dataset even though the download was successful and the checksums were correct. Any ideas how that can happen?

Edit: The above mentioned folder is indeed the only one missing, removing this line form the stratified_train_val_test_split.csv file allows the training to continue.

Evaluation on cloud-free samples

It would be great to provide code for evaluating on a distinct subset of the dataset exclusively containing cloud-free samples, in order to allow for focusing on the super-resolution task in isolation. Controlling for sample numbers while involving cloudy observations in a separate condition would make for an interesting comparison. This may provide insights about the difficulty added by the cloud-covered observations.

It would be great indeed!
It should also be fairly straightforward to generate a list of cloud-free AOIs, which can be used as an alternative training split.

The code accepts a CSV file with an explicit list of AOIs to be used, and cloud masks, percentages and scene classification data is provided for each low-resolution revisit, which can be used to generate that list.

When the time allows, we will implement this. If anyone else catches the time to implement and make a PR, they're very welcome!