krisrs1128 / clouds_dist
Simulation of low clouds from weather measurements.
Let's see the effect of the NaNs on training.
Right now, we are only looking at the training losses. This is not so bad for the GAN term, but it is risky with the matching loss.
Assignment:
We now generate images in the range [-1, 1]. This gets clipped when plotted with numpy: everything below 0 becomes black. This means the images we see in Comet don't look close to what they really are (they still look like Gaussian noise).
The fix should be easy... just rescale to [0, 1] with x -> 0.5 * (x + 1).
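The rescale is a one-liner; a minimal sketch (function name is mine, not project code):

```python
import numpy as np

def to_unit_range(x):
    """Map images from [-1, 1] to [0, 1] so plotting no longer
    clips negative values to black."""
    return 0.5 * (x + 1)

print(to_unit_range(np.array([-1.0, 0.0, 1.0])))  # [0.  0.5 1. ]
```

This would be applied just before logging images to Comet, leaving the training tensors in [-1, 1].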
Check that torch.manual_seed makes the dataloader and the initialization reproducible.
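A quick check of the kind being asked for, assuming `torch.manual_seed` is the call in question (a minimal sketch, not the project's actual test):

```python
import torch

def set_seed(seed):
    # Seeds the CPU and CUDA RNGs; DataLoader shuffling and weight init
    # follow this seed unless workers reseed themselves.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(123)
a = torch.randn(3)
set_seed(123)
b = torch.randn(3)
print(torch.equal(a, b))  # True: same seed, same draws
```

For multi-worker dataloaders a `worker_init_fn` (or a seeded `generator=`) would also be needed; the sketch above only covers the main process.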
We need to set up a procedure to continue training.
This goes with the issue about standardizing output dirs, so that in the end we can just say something like --continue=run-i and the code goes to the right place, loads latest.pt, and boom.
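A minimal sketch of what that resume step could look like, assuming a run directory containing a latest.pt with "model", "optimizer", and "epoch" keys (the layout and key names are my assumptions, not existing project code):

```python
import os
import torch

def resume_from(run_dir, model, optimizer):
    """Load the latest checkpoint from a run directory and restore
    model/optimizer state; returns the epoch to resume from.
    The latest.pt layout here is hypothetical."""
    ckpt = torch.load(os.path.join(run_dir, "latest.pt"), map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("epoch", 0)
```

With standardized output dirs, `--continue=run-i` would just resolve to `resume_from(f"{exp_dir}/run-i", ...)`.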
For a given batch size of 32, computing the stats (means, mins, maxes) tensors with the GAN model gives a CUDA out-of-memory error; it works fine when I reduce the batch size to 26...
clouds_dist/src/preprocessing.py
Line 16 in f937ecb
I don't quite follow @mustafaghali, because in train.py we have:

```python
transfs = []
if self.opts.data.preprocessed_data_path is None and self.opts.data.with_stats:
    transfs += [
        Rescale(
            data_path=self.opts.data.path,
            batch_size=self.opts.train.batch_size,
            num_workers=self.opts.data.num_workers,
            verbose=1,
        )
    ]
self.trainset = EarthData(
    self.opts.data.path,
    preprocessed_data_path=self.opts.data.preprocessed_data_path,
    load_limit=self.opts.data.load_limit or -1,
    transform=transforms.Compose(transfs),
)
```

So why does Rescale have a data_loader attribute?
Also note that I deleted batchsize=n_in_mem and switched to opts.train.batch_size.
Train per-pixel linear regressors (42 metos => 3 RGB), then run inference on the data, and at train time subtract these predictions from the input, to remove the least meaningful variation.
@krisrs1128 is that it?
Who's doing it?
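If I understand the proposal correctly, it could be sketched with per-pixel least squares like this (numpy sketch; the function names and shapes are my assumptions, not existing project code):

```python
import numpy as np

def fit_pixel_regressors(metos, rgb):
    """Fit an independent linear regressor at every pixel, mapping meto
    channels to RGB via least squares.
    metos: (N, C, H, W), rgb: (N, 3, H, W) -> weights (H, W, C+1, 3)."""
    N, C, H, W = metos.shape
    X = np.concatenate([metos, np.ones((N, 1, H, W))], axis=1)  # bias term
    weights = np.empty((H, W, C + 1, 3))
    for i in range(H):
        for j in range(W):
            # solve (N, C+1) @ (C+1, 3) ~= (N, 3) at pixel (i, j)
            weights[i, j], *_ = np.linalg.lstsq(
                X[:, :, i, j], rgb[:, :, i, j], rcond=None
            )
    return weights

def predict(metos, weights):
    """Per-pixel linear predictions, to be subtracted from targets at train time."""
    N, C, H, W = metos.shape
    X = np.concatenate([metos, np.ones((N, 1, H, W))], axis=1)
    return np.einsum("nchw,hwck->nkhw", X, weights)
```

The residual `rgb - predict(metos, weights)` is then what the generator would be trained against.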
[0:9]   => U (wind components)
[10:19] => T
[20:29] => V
[30:39] => H
[40]    => scattering level
[41]    => TS (surface temperature)
[42:43] => Lon, Lat

2->11

to

0 => av(U)
1 => av(T)
2 => av(h)
3 => SL
4 => TS
5 => Lon
6 => Lat
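The reduction above amounts to collapsing channel ranges into averaged summary channels; a small helper could look like this (a sketch under my assumptions; the exact grouping should follow the mapping listed above):

```python
import numpy as np

def average_groups(x, groups):
    """Collapse channel ranges into single averaged channels.
    x: (C, H, W); groups: list of (start, stop) pairs, stop exclusive.
    e.g. [(0, 10), (10, 20), ...] for the U and T blocks above."""
    return np.stack([x[a:b].mean(axis=0) for a, b in groups])
```

Singleton channels like SL or TS would just be passed as `(40, 41)`-style ranges.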
config/explore_gan_hyps.json

```json
{
  "experiment": {
    "name": "explore-gan",
    "exp_dir": "$tmpv/clouds_runs/",
    "repeat": 20
  },
  "runs": [
    {
      "sbatch": {
        "runtime": "24:00:00",
        "message": "gan exploration",
        "conf_name": "gan_exp"
      },
      "config": {
        "model": {
          "disc_size": 64,
          "dropout": {"sample": "range", "from": [0, 0.45, 0.05]}
        },
        "train": {
          "datapath": "/network/tmp1/schmidtv/clouds500",
          "batch_size": 8,
          "num_D_accumulations": 1,
          "n_epochs": 500,
          "with_stats": false,
          "lr_d": {"sample": "list", "from": [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01]},
          "lr_g": {"sample": "list", "from": [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01]},
          "lambda_gan": {"sample": "list", "from": [0.1, 1, 5, 10]},
          "lambda_L": {"sample": "list", "from": [0, 0.1, 1, 5, 10]},
          "matching_loss": {"sample": "list", "from": ["l1", "l2", "weighted"]}
        }
      }
    }
  ]
}
```
The range of values in the input should be narrow.
The range of target values should be [-1, 1].
Count model parameters.
Feature activations within the UNet.
I am not sure if this happens only on my side?
E.g. in explore.yaml, runtime: 24:00:00, while in parallel_run.py, spb["runtime"] = 86400.
Change the name: RemoveNans -> ReplaceNans.
Implement the transformation: replace NaNs with mean - 3 * std.
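A sketch of what the proposed ReplaceNans transform could do, assuming the mean and std are taken per channel over the valid (non-NaN) values:

```python
import numpy as np

def replace_nans(x):
    """Replace NaNs per channel with mean - 3*std of the valid values.
    x: (C, H, W) array; returns a modified copy."""
    out = x.copy()
    for c in range(out.shape[0]):
        channel = out[c]
        mask = np.isnan(channel)
        if mask.any():
            valid = channel[~mask]
            channel[mask] = valid.mean() - 3 * valid.std()
    return out
```

Using mean - 3*std pushes the filled pixels to the low end of the channel's distribution instead of its center, so they read as "missing" rather than "average".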
Investigating, I got this: `batch["real_imgs"].sum()`
We need to create an output directory, unique per run, with conf, comet, checkpoints, saved images (if need be some day) and so on.
Like:
$SCRATCH/clouds
    data/
        imgs/
        metos/
    logs/
    outputs/
        run-i/
            comet.zip
            network.pt
            final_images/
            conf.json
Thoughts?
Hey, as I've seen in other projects, a good software-engineering practice is to put [WIP] at the beginning of a PR's title when it's a work in progress. That prevents unwanted merges.
For instance, @mustafaghali created a Quantization PR. But from our discussion, it wasn't what we had in mind. So he changed it. And now I don't know whether it should be merged or not.
So if a PR's not ready to be merged, add [WIP] to its title :)
Initialize the early layers from a pretrained, overfitted model.
Instead of linearly rescaling, quantize the values across each image, per variable: percentiles.
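A sketch of that per-variable percentile quantization (the function and binning scheme are my assumptions):

```python
import numpy as np

def quantize_percentiles(x, n_bins=10):
    """Replace each value by its (binned) percentile rank within the image,
    per variable (channel), instead of linearly rescaling.
    x: (C, H, W) -> values in [0, 1]. Ties get distinct ranks here,
    which is fine for a sketch."""
    out = np.empty_like(x, dtype=float)
    for c in range(x.shape[0]):
        flat = x[c].ravel()
        # double argsort gives each pixel its rank; normalize to [0, 1]
        ranks = flat.argsort().argsort() / max(len(flat) - 1, 1)
        out[c] = (np.floor(ranks * n_bins) / n_bins).reshape(x[c].shape)
    return out
```

Unlike a linear rescale, this is insensitive to the heavy-tailed outliers in the meto channels: each channel ends up uniformly spread over [0, 1].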
There should be an option in the config file to increase the size of the generator.
Add them to process sample.
Remove the zeros at the borders of the images.
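Trimming the zero borders could look like this (a numpy sketch; assumes the border rows/columns are exactly zero across all channels):

```python
import numpy as np

def crop_zero_borders(img):
    """Trim all-zero rows and columns from the borders of a (C, H, W) image."""
    nonzero = img.any(axis=0)                     # (H, W) mask of non-zero pixels
    rows = np.flatnonzero(nonzero.any(axis=1))    # rows with any signal
    cols = np.flatnonzero(nonzero.any(axis=0))    # cols with any signal
    return img[:, rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```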
I ran an experiment trying to overfit 100 samples over 100 epochs.
Here's the link to the comet exp.
Problem: the discriminator loss is constantly 0.5.
There must be a bug in the code; something's not right. I'm trying to investigate issues that may be related to backward() or detach().
Can we do something about this? I've only got 500 images and it's taking minutes...
Add EG optimizer
Write a script to generate the point cloud of generated value vs. target value, per channel.
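That script could be sketched like this, assuming (N, C, H, W) arrays of generated and target images (the function names and the matplotlib layout are my assumptions):

```python
import numpy as np

def channel_point_clouds(generated, target):
    """Flatten (N, C, H, W) arrays into per-channel (generated, target)
    value pairs, ready for scatter plotting."""
    n_channels = generated.shape[1]
    return [
        (generated[:, c].ravel(), target[:, c].ravel())
        for c in range(n_channels)
    ]

def plot_point_clouds(generated, target, path="gen_vs_target.png"):
    import matplotlib
    matplotlib.use("Agg")  # headless rendering for cluster jobs
    import matplotlib.pyplot as plt

    clouds = channel_point_clouds(generated, target)
    fig, axes = plt.subplots(1, len(clouds), squeeze=False,
                             figsize=(4 * len(clouds), 4))
    for c, (g, t) in enumerate(clouds):
        ax = axes[0][c]
        ax.scatter(t, g, s=1, alpha=0.3)
        ax.set_xlabel("target value")
        ax.set_ylabel("generated value")
        ax.set_title(f"channel {c}")
    fig.savefig(path)
```

A well-trained model would show the point clouds hugging the diagonal; a collapsed generator shows horizontal bands.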
The concise UNet implementation throws an error when trying to concatenate the upsampled output with the down features, for input cropped from 256 to 216.
It works, though, if I change the number of blocks to 3: 216 is divisible by 2 only three times (216 -> 108 -> 54 -> 27), so a deeper UNet produces odd intermediate sizes that the skip connections can't match.
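A tiny helper makes the constraint explicit (a sketch; each down block is assumed to halve the spatial size exactly):

```python
def max_unet_depth(size):
    """Number of times a spatial size can be halved exactly. A UNet with
    more down blocks than this hits shape mismatches at the skip concats."""
    depth = 0
    while size % 2 == 0:
        size //= 2
        depth += 1
    return depth

print(max_unet_depth(256))  # 8
print(max_unet_depth(216))  # 3 -> only 3 blocks work for 216x216 inputs
```

So either pad/crop inputs to a multiple of 2^depth, or cap the number of blocks at `max_unet_depth(input_size)`.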
```
/home/vsch/cloudenv/lib/python3.6/site-packages/torch/nn/functional.py:1386: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vsch/clouds/src/train.py", line 210, in <module>
    result = trainer.run_trail()
  File "/home/vsch/clouds/src/train.py", line 87, in run_trail
    lambda_L1=1,
  File "/home/vsch/clouds/src/train.py", line 111, in train
    for i, (coords, real_img, metos_data) in enumerate(self.trainloader):
  File "/home/vsch/cloudenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 568, in __next__
    return self._process_next_batch(batch)
  File "/home/vsch/cloudenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
  File "/home/vsch/cloudenv/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/vsch/cloudenv/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/vsch/clouds/src/data.py", line 58, in __getitem__
    path = [s for s in self.paths[key] if self.ids[j] in s][0]
  File "/home/vsch/clouds/src/data.py", line 58, in <listcomp>
    path = [s for s in self.paths[key] if self.ids[j] in s][0]
IndexError: list index out of range
```
Continue the work started in session 10/17
see branch run-hyperparams-exploration