banctilrobitaille / kerosene

Deep Learning framework for fast and clean research with Pytorch

Home Page: https://kerosene.readthedocs.io/en/latest/

License: MIT License

Python 99.88% Batchfile 0.03% Shell 0.09%

ai artificial-intelligence deeplearning framework python pytorch visdom

kerosene's Issues

[FEATURE] Implement debug options in the trainer (quick run, gradient norm inspection)

Is your feature request related to a problem? Please describe.
Right now it is hard to debug the validation loop because there is no way of bypassing the training loop.

Describe the solution you'd like
Ideally, debug options such as quick_run or gradient norm inspection would be available.

[FEATURE] Support multiple optimizers for a single model

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Be able to support multiple optimizers for a single model. Different model layers should be optimized by different instances of optimizers.

For instance :

optimizer_mu = optim.SGD([model.gc1_1.mu, model.gc2_1.mu, model.gc3_1.mu, model.gc4_1.mu], lr=0.000000001)
optimizer_sig = optim.SGD([model.gc1_1.sig, model.gc2_1.sig, model.gc3_1.sig, model.gc4_1.sig], lr=1000)
optimizer = optim.Adam([model.gc1_1.weight, model.gc2_1.weight, model.gc3_1.weight, model.gc4_1.weight,
                        model.gc1_1.bias, model.gc2_1.bias, model.gc3_1.bias, model.gc4_1.bias],
                       lr=LR, betas=(B1, B2))

Describe alternatives you've considered
None to date.

Will this change the current api? How?
Yes. The configuration needs to support a list of optimizers, with a section specifying the parameters of each optimizer.

Additional context
N/A.

[FEATURE] Make metrics optional

Is your feature request related to a problem? Please describe.
It is not possible to have a model without metrics.

Describe the solution you'd like
Be able to create a model without metrics

Describe alternatives you've considered
None

Will this change the current api? How?
No

[BUG] Key error with custom variables when event is set at ON_EPOCH_END

Describe the bug
The call self.fire(Event.ON_EPOCH_END) in the Trainer object cannot complete: the key is not found in self._custom_variable. This is expected with the current ordering, since the event is fired before the user-defined _on_epoch_end(), where a value is assigned to the dictionary key. The same would happen for handlers registered on ON_EPOCH_BEGIN in main.py.

To Reproduce
Steps to reproduce the behavior:

Have defined a custom variable in main.py :

.with_event_handler(PlotCustomVariables(visdom_logger, "D(G(X) | X", PlotType.LINE_PLOT,
                                                    params={"opts": {"title": "Loss D(G(X) | X"}},
                                                    every=1), Event.ON_EPOCH_END) \

Execute whole validation process.
Have this variable computed in method _on_epoch_end.
Key Error : "D(G(X) | X" not found.

Expected behavior
The user-defined _on_epoch_end should be called to assign the variable in the custom_variables dictionary before the Trainer object calls self.fire(Event.ON_EPOCH_END).

Solution
Call the following instructions in this order :

    def _on_epoch_begin(self):
        self._reset_model_trainers()
        self.on_epoch_begin()
        self.fire(Event.ON_EPOCH_BEGIN)

    def _on_epoch_end(self):
        self.scheduler_step()
        self.on_epoch_end()
        self.fire(Event.ON_EPOCH_END)

This way, the user-defined _on_epoch_end() method can assign a value to the dictionary key in custom_variable before the event is fired.
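
The ordering problem can be reproduced outside Kerosene with a small stand-in. The sketch below is not Kerosene's Trainer: MiniTrainer, its with_event_handler signature, and the plain-string event name are hypothetical simplifications. It only illustrates that running the user hook before firing the event lets a handler read the custom variable.

```python
class MiniTrainer:
    """Hypothetical stand-in for a trainer with custom variables."""

    def __init__(self):
        self.custom_variables = {}
        self._handlers = {}

    def with_event_handler(self, handler, event):
        self._handlers.setdefault(event, []).append(handler)
        return self

    def fire(self, event):
        for handler in self._handlers.get(event, []):
            handler(self)

    def on_epoch_end(self):
        # User-defined hook: assigns the value the handler will read.
        self.custom_variables["D(G(X) | X"] = 0.25

    def _on_epoch_end(self):
        # User hook first, then the event, as proposed in the issue.
        self.on_epoch_end()
        self.fire("ON_EPOCH_END")


plotted = []
trainer = MiniTrainer().with_event_handler(
    lambda t: plotted.append(t.custom_variables["D(G(X) | X"]), "ON_EPOCH_END")
trainer._on_epoch_end()
```

Swapping the two calls in _on_epoch_end makes the lambda raise a KeyError, which is the behaviour reported above.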

Desktop (please complete the following information):

Linux Ubuntu 18.04.2 LTS
Python 3.7.4
CUDA 10.1 / cuDNN 7.3.1
NVIDIA GeForce RTX 2080 8 GB

[FEATURE] Support manual instantiation of model trainer

Is your feature request related to a problem? Please describe.
Currently it is not possible to manually instantiate a model trainer, as it takes only factories and not the objects directly (e.g. optimizer, scheduler, etc.).

Describe the solution you'd like
Support both strings and objects in the instantiation of a model trainer.

[FEATURE] Add better error handling and logging

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Add error handling and logging, more validation

Will this change the current api? How?
No

[BUG] Program does not exit on Windows only (process hangs forever)

Describe the bug
When training a network on a Windows computer, the process hangs and does not exit. The hypothesis is that PyTorch dataloaders are hanging. Some issues have been reported on PyTorch's GitHub page.

To Reproduce
Train a network for an epoch. At the end of the validation the program will hang.

Expected behavior
The program should exit with 0, as it does on Linux systems.

Desktop (please complete the following information):

OS platform and distribution: Windows 7
Python version: Python 3.6
CUDA/cuDNN version: None
GPU model and memory: ON CPU

[BUG] Training crashes when a network error occurs while using visdom

Describe the bug
If a connection timeout or any other network problem occurs while visdom is used to plot graphs, the training crashes with a visdom error.

To Reproduce
Steps to reproduce the behavior:
Train a network with visdom, unplug the network cable

Expected behavior
The plot could fall back either to a dummy visdom or to a file.
The error should be handled.

[FEATURE] Implement early stopping

Describe the solution you'd like
Add an EventHandler that changes a model trainer's status to Finalize when a given monitored value stops improving in the chosen min/max mode. It should support standard values and custom variables.
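
A minimal sketch of the decision logic such a handler could wrap. The class name EarlyStopping, the patience and min_delta parameters, and the should_stop method are assumptions made for illustration; the issue itself only specifies a min/max mode over a monitored value, and changing the trainer status is left out.

```python
class EarlyStopping:
    """Tracks a monitored value and reports when training should stop."""

    def __init__(self, mode="min", patience=3, min_delta=0.0):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        self.patience = patience    # assumed: epochs tolerated without improvement
        self.min_delta = min_delta  # assumed: smallest change counted as improvement
        self.best = None
        self.bad_epochs = 0

    def _improved(self, value):
        if self.best is None:
            return True
        if self.mode == "min":
            return value < self.best - self.min_delta
        return value > self.best + self.min_delta

    def should_stop(self, value):
        """Call once per epoch with the monitored value."""
        if self._improved(value):
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(mode="min", patience=2)
losses = [1.0, 0.8, 0.9, 0.85, 0.7]
stop_epoch = next(
    (epoch for epoch, loss in enumerate(losses) if stopper.should_stop(loss)),
    None)
```

Because the monitored value is just a number passed to should_stop, the same logic would apply to a standard metric or to an entry of the custom variables dictionary.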

[FEATURE] Pytorch tensor support in yaml config

Is your feature request related to a problem? Please describe.
Currently it is impossible to give weights in the params of a weighted loss, as it needs a tensor.

Describe the solution you'd like
A YAML parser that converts a YAML list to a tensor given a specific YAML type, !!pytorch/tensor let's say.

Describe alternatives you've considered
Using a custom factory that converts the object.

Will this change the current api? How?
no

[FEATURE] At exit cleanup function for handlers

Is your feature request related to a problem? Please describe.
At the moment, if the process terminates there is no call to clean up handlers, save visdom, etc.

Describe the solution you'd like
Register a cleanup function using the atexit module for every handler and let the user implement it.
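
One way this could look, as a sketch only: CleanableHandler, RecordingHandler, and the on_cleanup hook are hypothetical names, not part of Kerosene. The base class registers its own cleanup with atexit, and the user overrides on_cleanup.

```python
import atexit


class CleanableHandler:
    """Hypothetical handler base class that registers its own cleanup."""

    def __init__(self):
        self.cleaned_up = False
        # atexit runs registered callables at normal interpreter exit.
        atexit.register(self.cleanup)

    def cleanup(self):
        # Idempotent, so an explicit call followed by the exit hook is safe.
        if self.cleaned_up:
            return
        self.on_cleanup()
        self.cleaned_up = True

    def on_cleanup(self):
        """Override in subclasses (e.g. save a plot, close a file)."""


class RecordingHandler(CleanableHandler):
    def __init__(self, log):
        super().__init__()
        self.log = log

    def on_cleanup(self):
        self.log.append("saved")


cleanup_log = []
recording_handler = RecordingHandler(cleanup_log)
recording_handler.cleanup()
recording_handler.cleanup()  # second call is a no-op
```

Note that atexit callbacks do not run when the process is killed by an unhandled signal or exits through os._exit(), so this covers normal termination only.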

Describe alternatives you've considered
None

Will this change the current api? How?
No

[FEATURE] Make kerosene compatible with HyperOpt

Is your feature request related to a problem? Please describe.
The train method does not return anything, so it is currently not friendly to use with a hyperparameter tuning library such as HyperOpt: https://github.com/hyperopt/hyperopt

Describe the solution you'd like
The train method of Trainer would return the test monitors so that it is easy for the user to create an objective function compatible with hyperopt.

Describe alternatives you've considered
Using the attribute on the trainer at the end of the training.

Will this change the current api? How?
no

[BUG] Validation step resetting while testing

Describe the bug
The validation step resets in the console when testing.

Expected behavior
Should not reset

[FEATURE] Handle loss scaler when performing operation on apex loss

Is your feature request related to a problem? Please describe.
When we apply a mathematical operation on an apex loss, the result takes one of the 2 scalers, but this might have an impact on the gradient.

Describe the solution you'd like
Take the biggest scaler if the loss is increased, the smallest if it is reduced, etc.

Describe alternatives you've considered
Let the user handle it

Will this change the current api? How?
No

[FEATURE] Implement a TQDM handler for training progress/status

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Have a TQDM handler that prints the training status.

Describe alternatives you've considered
No

Will this change the current api? How?
No

[FEATURE] Load/save model

Is your feature request related to a problem? Please describe.
No. Mainly performance related.

Describe the solution you'd like
A way to save models when training is complete and a way to reload these models to continue training.

Describe alternatives you've considered
None.

Will this change the current api? How?
Yes, it will add a call during training to periodically save the model according to some strategy (mainly based on validation error). This feature would be implemented as a kerosene.handlers.base_handler.EventHandler object. The handler would take a path as an argument, either read from the configuration or passed to the handler at initialization.

Additional context
None.

[FEATURE] Single metric computation function missing

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Have multiple functions to update each metric individually, as for the criterions.

Describe alternatives you've considered
None

Will this change the current api? How?
No

[FEATURE] Support multiple metrics per model

Is your feature request related to a problem? Please describe.
Multiple metrics per model should be supported, e.g. Dice, Hausdorff.

Describe the solution you'd like
Define multiple metrics in the config and support them automatically. Maybe add a dict in the trainer that would hold the metrics.

Describe alternatives you've considered
Let the user define their metric in their user-defined trainer.

[FEATURE] Implement gradient clipping

Is your feature request related to a problem? Please describe.
To avoid exploding gradients, it is sometimes necessary to clip gradients whose norm (L1/L2) exceeds a given threshold.
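
For reference, clipping by global L2 norm can be sketched in plain Python. The function name clip_by_global_norm and the flat list of floats are illustrative assumptions; in PyTorch the equivalent operation on parameter gradients is provided by torch.nn.utils.clip_grad_norm_.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the threshold: return the gradients unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

The direction of the gradient is preserved; only its magnitude is reduced to the threshold.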

[FEATURE] Remove case sensitivity in factory and config

[QUESTION] A place to put various variables

Describe the issue
Is there a place to put variables (like coefficients) and parse them into a VariableConfiguration?

Expectations
Answer. Could lead to implementation of a class to handle a VariableConfiguration.

Additional context
For example, if I have a hyperparameter alpha to multiply with another factor, is there a place I can put it in the YAML configuration file so that the variable is handled?

[FEATURE] Distributed computing support

Is your feature request related to a problem? Please describe.
Future users might want to use apex.parallel.DistributedDataParallel module to execute code in parallel on multiple GPU.

Two use cases are possible:

Multiple GPUs running a single model;
Multiple GPUs running a synchronized copy of one or more models.

The first use case is left to the user, who decides which ops go on which GPU while defining the model. We have to ensure that Kerosene doesn't override the choice of GPU later in the execution.

In the second use case, the script is launched with the upstream torch.distributed.launch and must have --nproc_per_node <= number of gpus per node as argument. When used with this launcher, apex.parallel.DistributedDataParallel assumes 1:1 mapping of processes to GPUs. It also assumes that the script calls torch.cuda.set_device(args.rank) before creating the model.

Another required component is the torch.utils.data.distributed.DistributedSampler object, so that each GPU gets a unique subset of samples during training, e.g.:

            for train_dataset, valid_dataset in zip(train_datasets, valid_datasets):
                # Each process gets its own shard of the dataset.
                train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
                valid_sampler = torch.utils.data.distributed.DistributedSampler(valid_dataset)

                # shuffle is omitted: DataLoader rejects shuffle=True together with a sampler,
                # and DistributedSampler already shuffles by default.
                dataloaders.append(torch.utils.data.DataLoader(dataset=train_dataset,
                                                               batch_size=batch_size,
                                                               num_workers=num_workers,
                                                               sampler=train_sampler,
                                                               collate_fn=patch_collate))
                dataloaders.append(torch.utils.data.DataLoader(dataset=valid_dataset,
                                                               batch_size=batch_size,
                                                               num_workers=num_workers,
                                                               sampler=valid_sampler,
                                                               collate_fn=patch_collate))
            return dataloaders

Adding DataLoader objects to a list might not be the most efficient way though, see pytorch/pytorch#11201.

Finally, additional arguments in the RunningConfig might be necessary, such as local_rank, which main.py must accept as a script argument because torch.distributed.launch passes it to each process. local_rank is an int that matches the GPU index in the nvidia-smi listing. local_rank must be used throughout the execution to send data to the right GPU, e.g.:

#!/bin/bash
python3 -m torch.distributed.launch --nproc_per_node=3 main.py --config=config/config.yaml

Process 0 will run main.py with `local_rank` set to 0
Process 1 will run main.py with `local_rank` set to 1
Process 2 will run main.py with `local_rank` set to 2

During this execution flow, local_rank must not change and must always be accessible to the Trainer object, which prepares the batches and sends them to the GPU whose id equals local_rank.
That's why SAMITorch had a prepare_batch method, which was easy to test to verify that data is sent to the right device.

My way to initialize this was in the RunningConfig.__init__() method:

self._device = torch.device("cuda:" + str(self._local_rank)) if torch.cuda.is_available() else torch.device("cpu")

Other environment variables like:

WORLD_SIZE=1;
RANK=0;
MASTER_ADDR=127.0.0.1;
MASTER_PORT=65000;

must be set before using apex.parallel.DistributedDataParallel() constructor which is used once the model is created, i.e:

 self._config.model = DDP(self._config.model, delay_allreduce=True)

Describe the solution you'd like
Support for apex.parallel.DistributedDataParallel()

Describe alternatives you've considered
None.

Will this change the current api? How?
Should not change the API in the best-case scenario. The DataParallel use case should ideally be transparent to the user.

Additional context
None.

[BUG] Incoherent validation step

Describe the bug
The current validation step is always shifted by a factor of len(dataloader) when training the network, because current_valid_batch is not reset to 0 at the end of each epoch.

To Reproduce
Steps to reproduce the behavior:
Train a network and print the training status.

Expected behavior
The validation steps should increase linearly in steps of 1.

[BUG] YAML config does not support a missing params key

Describe the bug
Even if you don't want any params, you have to write the params key.

To Reproduce
Steps to reproduce the behavior:

Create a config
Add a metric with no params key

Expected behavior
It should handle the fact that you don't want any params

Screenshots
Criterion works but not the metric (no params key)
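
A tolerant lookup could be as small as the sketch below. The helper name read_params and the example type names are hypothetical, and the config sections are assumed to be plain dicts as produced by a YAML loader.

```python
def read_params(section):
    """Return the 'params' mapping of a config section, or {} if absent."""
    # `or {}` also covers a key that is present but empty (parsed as None).
    return section.get("params") or {}


metric_config = {"type": "Dice"}                       # no params key at all
criterion_config = {"type": "DiceLoss", "params": {"reduction": "mean"}}
empty_config = {"type": "Accuracy", "params": None}   # `params:` with no value
```

With this, a factory can always unpack the result with ** without checking whether the key was written.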

[FEATURE] Make the test epoch optional

Is your feature request related to a problem? Please describe.
At the moment a test dataset is mandatory. It should be optional.

Describe the solution you'd like
Run the test_epoch only if the test dataloader is present

Will this change the current api? How?
No

[FEATURE] Implement Apex Loss missing operation

Is your feature request related to a problem? Please describe.
Cannot subtract a number from an Apex Loss due to a missing subtract operation.

Describe the solution you'd like
Implement the standard sub operation on the apex loss.
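
In Python, supporting both loss - number and number - loss means defining __sub__ and __rsub__. The wrapper below is a hypothetical stand-in, not Kerosene's apex loss class; how the loss scaler should be combined is outside this sketch, so it is simply carried over.

```python
class WrappedLoss:
    """Hypothetical wrapper around a loss value and its loss scaler."""

    def __init__(self, value, scaler=None):
        self.value = value
        self.scaler = scaler

    def __sub__(self, other):
        # Handles `loss - number` and `loss - loss`.
        other_value = other.value if isinstance(other, WrappedLoss) else other
        return WrappedLoss(self.value - other_value, self.scaler)

    def __rsub__(self, other):
        # Handles `number - loss`, where the left operand is a plain number.
        return WrappedLoss(other - self.value, self.scaler)


loss = WrappedLoss(2.5)
left = loss - 1.0
right = 4.0 - loss
```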

Will this change the current api? How?
No

[BUG] No event validation for training and validation epoch handling in visdom plot

Describe the bug
The same function is used to validate both training and validation events in an if / else-if manner. If the shouldHandleStep function returns false for training, it can still return true for validation even if the fired event is a training event. This leads to unwanted plotting.

[BUG] LR Scheduler not supported

Describe the bug
Currently only the ReduceLROnPlateau scheduler is supported, as the loss is given as the argument.

To Reproduce
Use any other scheduler.

Expected behavior
The epoch should be given as the parameter to other schedulers.

[BUG] Configuration items do not print in Visdom

Describe the bug
Sub-items in YAML configuration do not print in Visdom environment.

To Reproduce
In a configuration of the model:

optimizer:
      type: "FusedSGD"
      params:
        lr: 0.00001
        momentum: 0.9
        weight_decay: 0.1

Will only print in Visdom:
_optimizer_config:

Expected behavior
Sub-items in configuration should print:

_optimizer_config:
      type: FusedSGD
      params: {lr: 0.00001, momentum: 0.9, weight_decay: 0.1}

[FEATURE] Implement a standard api for custom variable plotting

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Instead of having a different EventPreprocessor for every visdom plot type, we should have a single event preprocessor for custom variables that accepts the plot type, frequency and opts as args.

Describe alternatives you've considered
none

Will this change the current api? How?
Yes, instead of multiple Preprocessors for custom variables there will be one.

Additional context
No.

[FEATURE] Add more tests

Is your feature request related to a problem? Please describe.
There is a lack of testing

Describe the solution you'd like
Add more unit tests

[FEATURE] Support multiple GPUs per model

Is your feature request related to a problem? Please describe.
Yes.
Kerosene relies on environment variable CUDA_VISIBLE_DEVICES=.... If the machine has more than 1 GPU, Kerosene waits for a process with rank > 0 and hangs.

Describe the solution you'd like
Add an argument to start the script with Distributed Data Parallel or not. If --use-ddp is active, it must be used jointly with torch.distributed.launch, which will start more than one process, one for each GPU.
If not passed, DDP won’t be initialized and the user could use more than one GPU for their model.

Describe alternatives you've considered
None

Will this change the current api? How?
Yes. Need to add an argument in the RunningConfig.

Additional context
Must be tested on a machine with >1 GPU.

[FEATURE] Implement error handling in the main loop that saves the current state

Is your feature request related to a problem? Please describe.
At the moment, if the program crashes nothing is saved.

Describe the solution you'd like
There should be an exception handler that fires an Event.ExceptionOccured that is handled by default. This could save the model, print an error message, save the visdom environment, save visdom to a file, etc.
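
The idea can be sketched with a plain try/except around the training call. GuardedLoop, on_exception, and failing_train are hypothetical names for illustration; in Kerosene the notification would presumably go through the existing event mechanism instead.

```python
class GuardedLoop:
    """Runs a training callable and notifies handlers if it raises."""

    def __init__(self):
        self._exception_handlers = []

    def on_exception(self, handler):
        self._exception_handlers.append(handler)
        return self

    def run(self, train):
        try:
            return train()
        except Exception as error:
            # Give every handler a chance to save state, then re-raise so
            # the failure is still visible to the caller.
            for handler in self._exception_handlers:
                handler(error)
            raise


def failing_train():
    raise RuntimeError("simulated crash")


saved = []
loop = GuardedLoop().on_exception(lambda e: saved.append(f"checkpoint after: {e}"))
try:
    loop.run(failing_train)
except RuntimeError:
    crashed = True
else:
    crashed = False
```

Re-raising keeps the process exit code and traceback intact, so the handlers add behaviour without hiding the crash.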

Describe alternatives you've considered
Could be done user side.

Will this change the current api? How?
No

[FEATURE] Missing optimizers in factory

Is your feature request related to a problem? Please describe.
Not all optimizers are implemented in the factory. e.g. RMSprop

Describe the solution you'd like
Add the optimizers in the factory.

[FEATURE] Implement Dataparallel support

Is your feature request related to a problem? Please describe.
Yes, the application does not work when the user has multiple GPUs and launches it without the torch launcher.

[BUG] "RuntimeError: A tensor was not cuda." error after the optimizer is reloaded from a checkpoint

Describe the bug
The error RuntimeError: A tensor was not cuda. (multi_tensor_apply at csrc/multi_tensor_apply.cuh:60) is raised after the optimizer is reloaded from a checkpoint.

To Reproduce
Steps to reproduce the behavior:

Run a training with Checkpoint handler.
Stop training.
Relaunch training with path attribute in config to reload from saved checkpoint.
Error RuntimeError: A tensor was not cuda. (multi_tensor_apply at csrc/multi_tensor_apply.cuh:60) is displayed.

Expected behavior
Training should restart without error.

Desktop (please complete the following information):

OS platform and distribution: [e.g. Linux Ubuntu 18.04.3 LTS]
Python version: 3.7.3
CUDA/cuDNN version: 10.1
GPU model and memory: NVIDIA GeForce RTX 2080 8 GB

[FEATURE] Implement consumer producer pattern

Is your feature request related to a problem? Please describe.
At the moment all the handlers are synchronous. If a handler takes a long time to respond for some reason (e.g. Visdom because of network latency), it slows down the training process.

Describe the solution you'd like
Some handlers like visdom could run in a thread and act as consumers. They would consume the data from a queue and process it without slowing down the training process.
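
A minimal producer/consumer sketch using only the standard library. AsyncHandler and its close method are hypothetical names; the consume callable stands in for whatever slow work (for example a visdom call) the real handler would perform.

```python
import queue
import threading


class AsyncHandler:
    """Runs a slow callable on a background thread fed by a queue."""

    _STOP = object()  # sentinel telling the consumer thread to exit

    def __init__(self, consume):
        self._consume = consume
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def __call__(self, data):
        # Producer side: enqueue and return immediately.
        self._queue.put(data)

    def _run(self):
        # Consumer side: process items in arrival order until told to stop.
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._consume(item)

    def close(self):
        # Items queued before close() are still processed (FIFO order).
        self._queue.put(self._STOP)
        self._thread.join()


received = []
async_handler = AsyncHandler(received.append)
for step in range(5):
    async_handler({"step": step})
async_handler.close()
```

The queue here is unbounded; if the consumer is much slower than training, a bounded queue or a drop policy would be needed to keep memory in check.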

Describe alternatives you've considered
None

Will this change the current api? How?
No

[FEATURE] Accumulate gradients

Is your feature request related to a problem? Please describe.
Gradient accumulation is missing.

Describe the solution you'd like
A way to accumulate gradients for k steps.

Describe alternatives you've considered
None

Will this change the current api? How?
Yes, will add something in config to specify number of steps to accumulate gradients.

Additional context
Gradient accumulation runs k small batches of size N, accumulating their gradients, before doing a single optimizer step.
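
The arithmetic behind this can be checked on a toy one-parameter model in plain Python. The model y = w * x, the data, and the names below are illustrative assumptions; the point is that k equally sized micro-batches, each scaled by 1/k, give the same gradient as one batch of size k * N.

```python
import math


def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w, lr, k = 0.5, 0.01, 2

# One large batch.
full_grad = grad_mse(w, data)

# k micro-batches of equal size: accumulate, then take a single step.
size = len(data) // k
accumulated = 0.0
for i in range(k):
    micro_batch = data[i * size:(i + 1) * size]
    accumulated += grad_mse(w, micro_batch) / k  # scale so the sum is a mean

w_after_step = w - lr * accumulated
```

In a PyTorch loop the same effect is usually obtained by dividing each micro-batch loss by k, calling backward on every micro-batch, and calling the optimizer's step and zero_grad only every k iterations.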

[FEATURE] Aggregator for visdom data

Is your feature request related to a problem? Please describe.
At the moment, a REST call is made to visdom for each piece of visdom data, which takes unnecessary CPU time.

Describe the solution you'd like
Visdom data could be stacked and pushed only once every N iterations.
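
A sketch of such an aggregator, independent of visdom. BatchingSender and its send callable are hypothetical; send stands in for the single network call that would push a whole stack of points.

```python
class BatchingSender:
    """Buffers data points and sends them in one call every `every` items."""

    def __init__(self, send, every):
        self._send = send
        self._every = every
        self._buffer = []

    def add(self, point):
        self._buffer.append(point)
        if len(self._buffer) >= self._every:
            self.flush()

    def flush(self):
        # Also call this at the end of training to send any remainder.
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


calls = []
sender = BatchingSender(calls.append, every=3)
for value in range(7):
    sender.add(value)
sender.flush()
```

Seven points result in three calls instead of seven, at the cost of plots lagging behind training by up to N iterations.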

Describe alternatives you've considered
None

Will this change the current api? How?
No

Additional context
No
