banctilrobitaille / kerosene
Deep Learning framework for fast and clean research with PyTorch
Home Page: https://kerosene.readthedocs.io/en/latest/
License: MIT License
Is your feature request related to a problem? Please describe.
Right now it is hard to debug the validation loop because there is no way of bypassing the training loop.
Describe the solution you'd like
Ideally, options such as quick_run or gradient norm inspection would make debugging easier.
Is your feature request related to a problem? Please describe.
No.
Describe the solution you'd like
Be able to support multiple optimizers for a single model. Different model layers should be optimized by different instances of optimizers.
For instance :
optimizer_mu = optim.SGD([model.gc1_1.mu, model.gc2_1.mu, model.gc3_1.mu, model.gc4_1.mu], lr=0.000000001)
optimizer_sig = optim.SGD([model.gc1_1.sig, model.gc2_1.sig, model.gc3_1.sig, model.gc4_1.sig], lr=1000)
optimizer = optim.Adam([model.gc1_1.weight, model.gc2_1.weight, model.gc3_1.weight, model.gc4_1.weight, model.gc1_1.bias, model.gc2_1.bias, model.gc3_1.bias, model.gc4_1.bias], lr=LR, betas=(B1, B2))
Describe alternatives you've considered
None to date.
Will this change the current api? How?
Yes. Need to support a list of optimizers in configuration, with section for specifying parameters for each optimizer.
Additional context
N/A.
Is your feature request related to a problem? Please describe.
It is not possible to have a model without metrics.
Describe the solution you'd like
Be able to create a model without metrics
Describe alternatives you've considered
None
Will this change the current api? How?
No
Describe the bug
Impossible to complete the event self.fire(Event.ON_EPOCH_END) in the Trainer object. The key is not found in self._custom_variable. This is "normal" behaviour, since this line is called before the user-defined _on_epoch_end() where a value is assigned to the dictionary key. The same would happen for events set on ON_EPOCH_BEGIN in main.py.
To Reproduce
Steps to reproduce the behavior:
.with_event_handler(PlotCustomVariables(visdom_logger, "D(G(X) | X", PlotType.LINE_PLOT,
                                        params={"opts": {"title": "Loss D(G(X) | X"}},
                                        every=1), Event.ON_EPOCH_END) \
_on_epoch_end
Expected behavior
The user-defined _on_epoch_end should be called to assign the variable in the custom_variables dictionary before calling self.fire(Event.ON_EPOCH_END) of the Trainer object.
Solution
Call the following instructions in this order:
def _on_epoch_begin(self):
    self._reset_model_trainers()
    self.on_epoch_begin()
    self.fire(Event.ON_EPOCH_BEGIN)

def _on_epoch_end(self):
    self.scheduler_step()
    self.on_epoch_end()
    self.fire(Event.ON_EPOCH_END)
This way, a value can be assigned to the dictionary key in custom_variable inside the user-defined _on_epoch_end() method, and only then is the event fired.
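The proposed ordering can be illustrated with a minimal, framework-free sketch. MiniTrainer, its custom_variables dictionary and the string event name are stand-ins invented for this illustration, not Kerosene's actual classes; the only point shown is that the user hook runs before the handlers are notified.

```python
from typing import Callable, Dict, List


class MiniTrainer:
    """Hypothetical stand-in for a trainer; not Kerosene's real Trainer."""

    def __init__(self) -> None:
        self.custom_variables: Dict[str, float] = {}
        self._handlers: Dict[str, List[Callable[["MiniTrainer"], None]]] = {}

    def with_event_handler(self, handler, event: str) -> "MiniTrainer":
        self._handlers.setdefault(event, []).append(handler)
        return self

    def fire(self, event: str) -> None:
        for handler in self._handlers.get(event, []):
            handler(self)

    def on_epoch_end(self) -> None:
        # User-defined hook: assigns the value that the handler will read.
        self.custom_variables["D(G(X) | X"] = 0.42

    def _on_epoch_end(self) -> None:
        self.on_epoch_end()         # 1. the user hook fills custom_variables
        self.fire("ON_EPOCH_END")   # 2. handlers can now read the key


plotted = []
trainer = MiniTrainer().with_event_handler(
    lambda t: plotted.append(t.custom_variables["D(G(X) | X"]), "ON_EPOCH_END"
)
trainer._on_epoch_end()
```

If the two calls in _on_epoch_end were swapped, the handler would raise a KeyError, which matches the behaviour reported above.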
Is your feature request related to a problem? Please describe.
Currently it is not possible to manually instantiate a model trainer as it takes only factories and not directly the object (e.g. optimizer, scheduler, etc)
Describe the solution you'd like
Support string and object in the instantiation of a model trainer
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
Add error handling and logging, more validation
Will this change the current api? How?
No
Describe the bug
When training a network on a Windows computer, the process hangs and does not exit. The hypothesis is that the PyTorch dataloaders are hanging. Some similar issues have been reported on PyTorch's GitHub page.
To Reproduce
Train a network for an epoch. At the end of the validation the program will hang.
Expected behavior
The program should exit with code 0, as it does on Linux systems.
Describe the bug
If a connection timeout occurs because of a network problem while Visdom is used to plot graphs, the training crashes with a Visdom error.
To Reproduce
Steps to reproduce the behavior:
Train a network with visdom, unplug the network cable
Expected behavior
The plot could either fall back to a dummy Visdom or to a file.
The error should be handled
Describe the solution you'd like
Add an EventHandler that changes a ModelTrainer's status to Finalize if a given monitored value stops improving according to the min/max mode. It should support standard values and custom variables.
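A minimal sketch of the decision logic such a handler could use follows. The class name, the patience and min_delta parameters and the should_finalize method are assumptions made for illustration; the request itself only mentions a min/max mode over a monitored value.

```python
class EarlyStopping:
    """Hypothetical helper: signals finalization when a monitor stops improving."""

    def __init__(self, mode: str = "min", patience: int = 3, min_delta: float = 0.0):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        self.patience = patience      # assumed parameter, not in the request
        self.min_delta = min_delta    # assumed parameter, not in the request
        self.best = None
        self.bad_epochs = 0

    def _improved(self, value: float) -> bool:
        if self.best is None:
            return True
        if self.mode == "min":
            return value < self.best - self.min_delta
        return value > self.best + self.min_delta

    def should_finalize(self, value: float) -> bool:
        """Return True once `patience` consecutive epochs showed no improvement."""
        if self._improved(value):
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(mode="min", patience=2)
decisions = [stopper.should_finalize(v) for v in [1.0, 0.8, 0.9, 0.85, 0.7]]
```

In a real handler, a True result would be the point at which the model trainer's status is switched to Finalize; the monitored value could equally come from a custom variable.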
Is your feature request related to a problem? Please describe.
Currently it is impossible to pass weights in the params of a weighted loss, because the loss needs a tensor.
Describe the solution you'd like
A YAML parser that converts a YAML list to a tensor given a specific YAML type, let's say !!pytorch/tensor.
Describe alternatives you've considered
Using custom factory that convert the object
Will this change the current api? How?
no
Is your feature request related to a problem? Please describe.
At the moment, if the process terminates, there is no call to clean up handlers, save Visdom, etc.
Describe the solution you'd like
Register a cleanup function using the atexit module for every handler and let the user implement it.
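As a rough sketch of this idea, a handler base class could register its own cleanup method with the standard atexit module. CleanupHandler and its attributes are hypothetical names; only atexit.register is a real standard-library call.

```python
import atexit


class CleanupHandler:
    """Hypothetical base class; users would override cleanup()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cleaned = False
        # The callback runs at normal interpreter exit.
        atexit.register(self.cleanup)

    def cleanup(self) -> None:
        # Idempotent, so an explicit call followed by the atexit call is harmless.
        if not self.cleaned:
            self.cleaned = True


handler = CleanupHandler("visdom")
handler.cleanup()   # explicit call; atexit would otherwise invoke it at exit
```

Note that atexit callbacks are not run when the process is killed by an unhandled signal or exits through os._exit(), so this covers normal termination only.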
Describe alternatives you've considered
None
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
The train method does not return anything, so it is currently not convenient to use a hyperparameter tuning library such as HyperOpt: https://github.com/hyperopt/hyperopt
Describe the solution you'd like
The train method of Trainer would return the test monitors, so that it is easy for the user to create an objective function compatible with hyperopt.
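The shape of such an objective function can be sketched without any training at all. Here train is a toy stand-in that returns a dictionary of monitors computed from a made-up formula, and the key test_loss is an assumption; nothing below is Kerosene's API.

```python
from typing import Dict


def train(learning_rate: float) -> Dict[str, float]:
    """Stand-in for a train() that returns test monitors (toy formula, no real training)."""
    test_loss = (learning_rate - 0.01) ** 2
    return {"test_loss": test_loss}


def objective(params: Dict[str, float]) -> float:
    """Parameters in, scalar loss out: the shape a tuning library expects."""
    monitors = train(learning_rate=params["lr"])
    return monitors["test_loss"]


# A tiny grid search plays the role of the tuning library here.
candidates = [0.001, 0.01, 0.1]
best_lr = min(candidates, key=lambda lr: objective({"lr": lr}))
```

A function shaped like objective could then be handed to a tuning library, for example as the fn argument of hyperopt.fmin, instead of the grid search used here.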
Describe alternatives you've considered
Using the attribute on the trainer at the end of the training.
Will this change the current api? How?
no
Describe the bug
The validation step counter resets in the console when testing.
Expected behavior
Should not reset
Is your feature request related to a problem? Please describe.
When we apply a mathematical operation on two Apex losses, the result takes the scaler of one of the two, but this might have an impact on the gradient.
Describe the solution you'd like
Take the biggest scaler if the value is increased, or the smallest if it is reduced, etc.
Describe alternatives you've considered
Let the user handle it
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
Have a TQDM handler that prints the training status.
Describe alternatives you've considered
No
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
No. Mainly performance related.
Describe the solution you'd like
A way to save models when training is complete and a way to reload these models to continue training.
Describe alternatives you've considered
None.
Will this change the current api? How?
Yes, will add a call during training to periodically save the model using some kind of strategy (mainly based on validation error). Implement this feature using a kerosene.handlers.base_handler.EventHandler object. This handler would take a path as argument, which will either be found in a configuration or passed to the handler during initialization.
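A minimal sketch of a "save when validation error improves" strategy is given below. BestCheckpoint is a hypothetical class, not a subclass of the real EventHandler, and the state is written as JSON only to keep the example dependency-free; with PyTorch, torch.save on the model's state_dict would take the place of json.dump.

```python
import json
import os
import tempfile


class BestCheckpoint:
    """Hypothetical handler: writes the state to `path` whenever the loss improves."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.best = float("inf")

    def __call__(self, state: dict, validation_loss: float) -> bool:
        if validation_loss < self.best:
            self.best = validation_loss
            with open(self.path, "w") as f:
                json.dump(state, f)     # stand-in for saving real model weights
            return True
        return False


checkpoint_path = os.path.join(tempfile.mkdtemp(), "model.json")
checkpoint = BestCheckpoint(checkpoint_path)
saved_flags = [
    checkpoint({"epoch": epoch, "weight": 0.1 * epoch}, loss)
    for epoch, loss in enumerate([0.9, 0.7, 0.8, 0.6])
]

# Reloading to continue training starts from the last saved (best) state.
with open(checkpoint_path) as f:
    restored = json.load(f)
```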
Additional context
None.
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
Have multiple functions to update each metric individually, as is done for the criterions.
Describe alternatives you've considered
None
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
Multiple metric per model should be supported. eg. Dice, Hausdorff
Describe the solution you'd like
Define multiple metrics in the config and support them automatically. Maybe add a dict in the trainer that would hold the metrics.
Describe alternatives you've considered
Let the user define their metric in their user-defined trainer.
Is your feature request related to a problem? Please describe.
To avoid exploding gradients, it is sometimes necessary to clip gradients whose norm (L1 or L2) exceeds a given threshold.
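The arithmetic behind clipping by L2 norm can be sketched in plain Python. The function name and the list-of-lists gradient layout are chosen for illustration only; in PyTorch, torch.nn.utils.clip_grad_norm_ performs this kind of rescaling on real parameters.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> float:
    """Scale all gradients in place so their joint L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]   # joint L2 norm is 5.0
norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its magnitude is limited.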
Describe the issue
Is there a place to put variables (like coefficients) and parse them into a VariableConfiguration?
Expectations
Answer. Could lead to implementation of a class to handle a VariableConfiguration.
Additional context
For example, if I have a hyper-parameter alpha to multiply with another factor, is there a place I can put it in the YAML configuration file so that the variable will be handled?
Is your feature request related to a problem? Please describe.
Future users might want to use the apex.parallel.DistributedDataParallel module to execute code in parallel on multiple GPUs.
Two use cases are possible:
Multiple GPUs running a single model;
Multiple GPUs running a synchronized copy of one or more models.
In the first use case, it is left to the user to define which ops go on which GPU while defining the model. We have to ensure that Kerosene doesn't override the choice of GPU later in the execution.
In the second use case, the script is launched with the upstream torch.distributed.launch and must have --nproc_per_node <= number of GPUs per node as argument. When used with this launcher, apex.parallel.DistributedDataParallel assumes a 1:1 mapping of processes to GPUs. It also assumes that the script calls torch.cuda.set_device(args.local_rank) before creating the model.
Another required component is the torch.utils.data.distributed.DistributedSampler object, so that each GPU gets a unique subset of samples during training, e.g.:
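import torch

def build_dataloaders(train_datasets, valid_datasets, batch_size, num_workers, patch_collate):
    # Hypothetical wrapper so that the fragment (and its return) is valid Python.
    dataloaders = []
    for train_dataset, valid_dataset in zip(train_datasets, valid_datasets):
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
        valid_sampler = torch.utils.data.distributed.DistributedSampler(valid_dataset)
        # shuffle is not passed: DataLoader rejects shuffle=True together with a sampler,
        # and DistributedSampler already shuffles by default.
        dataloaders.append(torch.utils.data.DataLoader(dataset=train_dataset,
                                                       batch_size=batch_size,
                                                       num_workers=num_workers,
                                                       sampler=train_sampler,
                                                       collate_fn=patch_collate))
        dataloaders.append(torch.utils.data.DataLoader(dataset=valid_dataset,
                                                       batch_size=batch_size,
                                                       num_workers=num_workers,
                                                       sampler=valid_sampler,
                                                       collate_fn=patch_collate))
    return dataloaders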
Adding DataLoader objects to a list might not be the most efficient way though, see pytorch/pytorch#11201.
Finally, additional arguments in the RunningConfig might be necessary, such as local_rank, which torch.distributed.launch must see in the script arguments of main.py. local_rank is an int which is in line with the nvidia-smi GPU listing. local_rank must be used throughout the execution to send data to the right GPU, e.g.:
#!/bin/bash
python3 -m torch.distributed.launch --nproc_per_node=3 main.py --config=config/config.yaml
# Process 0 will run main.py with `local_rank` set to 0
# Process 1 will run main.py with `local_rank` set to 1
# Process 2 will run main.py with `local_rank` set to 2
During this execution flow, local_rank must not change and must always be accessible by the Trainer object, which prepares the batches and sends them to the GPU whose id == local_rank.
That's why SAMITorch had a prepare_batch method, which was easy to test to verify that data is sent to the right device.
My way to initialize this was in the RunningConfig.init() method:
self._device = torch.device("cuda:" + str(self._local_rank)) if torch.cuda.is_available() else torch.device("cpu")
Other environment variables like:
must be set before using the apex.parallel.DistributedDataParallel() constructor, which is called once the model is created, e.g.:
self._config.model = DDP(self._config.model, delay_allreduce=True)
Describe the solution you'd like
Support for apex.parallel.DistributedDataParallel()
Describe alternatives you've considered
None.
Will this change the current api? How?
Should not change the API in the best-case scenario. Ideally, the DataParallel use case should be transparent to the user.
Additional context
None.
Describe the bug
The current validation step is always shifted by a multiple of len(dataloader) when training the network, because current_valid_batch is not reset to 0 at the end of the epoch.
To Reproduce
Steps to reproduce the behavior:
Train a network and print the training status.
Expected behavior
The validation steps should increase linearly in steps of 1.
Describe the bug
Even if you don't want any params, you have to write the params key.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It should handle the fact that you don't want any params
Screenshots
Criterion works but not the metric (no params key)
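One way to tolerate a missing params key is to default it to an empty dictionary while parsing. The function below is a sketch under that assumption; parse_component and the type names used in the calls are illustrative and not taken from Kerosene's parser.

```python
from typing import Any, Dict


def parse_component(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a normalized {'type', 'params'} dict; a missing or null 'params' becomes {}."""
    return {"type": config["type"], "params": config.get("params") or {}}


criterion = parse_component({"type": "MSELoss", "params": {"reduction": "mean"}})
metric = parse_component({"type": "Dice"})                        # no 'params' key at all
metric_null = parse_component({"type": "Dice", "params": None})   # 'params:' left empty in YAML
```

The `or {}` also covers the case where the key is present but empty, which a YAML loader typically reads as None.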
Is your feature request related to a problem? Please describe.
At the moment a test dataset is mandatory. Should be optional
Describe the solution you'd like
Run the test_epoch only if the test dataloader is present
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
Cannot subtract a number from an Apex loss because the subtraction operation is missing.
Describe the solution you'd like
Implement the standard sub operation for the Apex loss.
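In Python, subtraction support comes from the __sub__ and __rsub__ special methods. The wrapper below is a hypothetical stand-in that holds a plain float; it is not Kerosene's Apex loss class and ignores loss scaling entirely.

```python
class ScalarLoss:
    """Hypothetical loss wrapper holding a plain float; not Kerosene's Apex loss."""

    def __init__(self, value: float) -> None:
        self.value = value

    def __sub__(self, other):
        other_value = other.value if isinstance(other, ScalarLoss) else other
        return ScalarLoss(self.value - other_value)

    def __rsub__(self, other):
        # Called for `number - loss`, where the left operand is not a ScalarLoss.
        return ScalarLoss(other - self.value)


loss = ScalarLoss(2.5)
left = (loss - 1.0).value               # loss on the left
right = (4.0 - loss).value              # loss on the right, handled by __rsub__
both = (loss - ScalarLoss(0.5)).value   # loss minus loss
```

Implementing __rsub__ as well as __sub__ matters because the operand order changes the result, unlike addition.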
Will this change the current api? How?
No
Describe the bug
The same function is used to check both training and validation steps in an if / else-if manner. If the shouldHandleStep function returns false for training, it can still return true for validation even though the fired event is a training event. This leads to unwanted plotting.
Describe the bug
Currently only the ReduceLROnPlateau scheduler is supported, because the loss is always passed as the argument of the scheduler step.
To Reproduce
Use any other scheduler.
Expected behavior
Other schedulers should be given the epoch as parameter.
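A possible dispatch is sketched below with two stand-in scheduler classes, so the example runs without PyTorch. The requires_metric attribute is an invented marker; with real PyTorch schedulers, an isinstance check against torch.optim.lr_scheduler.ReduceLROnPlateau could play the same role. Recent PyTorch versions also expect other schedulers to be stepped without arguments, which is what the sketch does instead of passing the epoch.

```python
class PlateauLike:
    """Stand-in for a scheduler whose step() needs the monitored metric."""

    requires_metric = True

    def __init__(self) -> None:
        self.seen = []

    def step(self, metric: float) -> None:
        self.seen.append(metric)


class StepLike:
    """Stand-in for a scheduler that is stepped once per epoch without arguments."""

    requires_metric = False

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


def scheduler_step(scheduler, validation_loss: float) -> None:
    """Pass the loss only to schedulers that need it."""
    if getattr(scheduler, "requires_metric", False):
        scheduler.step(validation_loss)
    else:
        scheduler.step()


plateau, fixed = PlateauLike(), StepLike()
for loss in [0.9, 0.8]:
    scheduler_step(plateau, loss)
    scheduler_step(fixed, loss)
```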
Describe the bug
Sub-items in YAML configuration do not print in Visdom environment.
To Reproduce
In a configuration of the model:
optimizer:
  type: "FusedSGD"
  params:
    lr: 0.00001
    momentum: 0.9
    weight_decay: 0.1
Will only print in Visdom:
_optimizer_config:
Expected behavior
Sub-items in configuration should print:
_optimizer_config:
  type: FusedSGD
  params: {lr: 0.00001, momentum: 0.9, weight_decay: 0.1}
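Printing sub-items amounts to recursing into nested dictionaries. The helper below is a sketch of that idea only; format_config is a hypothetical name and the output layout is one of several reasonable choices.

```python
from typing import Any, Dict, List


def format_config(config: Dict[str, Any], indent: int = 0) -> List[str]:
    """Render a nested dict as indented 'key: value' lines, recursing into sub-dicts."""
    lines = []
    prefix = " " * indent
    for key, value in config.items():
        if isinstance(value, dict):
            lines.append(f"{prefix}{key}:")
            lines.extend(format_config(value, indent + 2))
        else:
            lines.append(f"{prefix}{key}: {value}")
    return lines


config = {
    "_optimizer_config": {
        "type": "FusedSGD",
        "params": {"lr": 0.00001, "momentum": 0.9, "weight_decay": 0.1},
    }
}
lines = format_config(config)
text = "\n".join(lines)
```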
Is your feature request related to a problem? Please describe.
No.
Describe the solution you'd like
Instead of having a different EventPreprocessor for every Visdom plot type, we should have a single event preprocessor for custom variables that accepts the plot type, frequency and opts as args.
Describe alternatives you've considered
none
Will this change the current api? How?
Yes, instead of having multiple preprocessors for custom variables there will be one.
Additional context
No.
Is your feature request related to a problem? Please describe.
There is a lack of testing
Describe the solution you'd like
Add more unit tests
Is your feature request related to a problem? Please describe.
Yes.
Kerosene relies on the environment variable CUDA_VISIBLE_DEVICES=... If the machine has more than 1 GPU, Kerosene waits for a process with rank > 0 and hangs.
Describe the solution you'd like
Add an argument to start the script with Distributed Data Parallel or not. If --use-ddp is active, it must be used jointly with torch.distributed.launch, which will start more than one process, one per GPU.
If it is not passed, DDP won't be initialized and the user could still use more than one GPU for the model.
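A sketch of how such a flag could be parsed with argparse follows. The --use-ddp and --config names come from the text above; the defaults and the build_parser helper are assumptions, and --local_rank is included because torch.distributed.launch passes it to every process it spawns.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Training entry point (illustrative)")
    parser.add_argument("--config", type=str, default="config/config.yaml")
    # Opt-in flag: distributed training is only initialized when it is passed.
    parser.add_argument("--use-ddp", action="store_true")
    # Filled in by the launcher for each spawned process.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


single = build_parser().parse_args([])
distributed = build_parser().parse_args(["--use-ddp", "--local_rank=2"])
```

With store_true, a plain multi-GPU run without the launcher keeps use_ddp at False, so no process group would be created and nothing waits for other ranks.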
Describe alternatives you've considered
None
Will this change the current api? How?
Yes. Need to add an argument in the RunningConfig.
Additional context
Must be tested on a machine with >1 GPU.
Is your feature request related to a problem? Please describe.
At the moment, if the program crashes, nothing is saved.
Describe the solution you'd like
There should be an exception handler that fires an Event.ExceptionOccured, which is handled by default. This could save the model, print an error message, save the Visdom environment, save Visdom to a file, etc.
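The control flow can be sketched with a try/except around the training call. The handler list, run_training and the simulated failure are all invented for illustration; in Kerosene the loop over handlers would correspond to firing the proposed exception event.

```python
from typing import Callable, List

saved: List[str] = []
handlers: List[Callable[[BaseException], None]] = [
    lambda exc: saved.append(f"checkpoint saved after {type(exc).__name__}")
]


def run_training(train_fn: Callable[[], None]) -> None:
    try:
        train_fn()
    except Exception as exc:
        for handler in handlers:   # plays the role of firing an exception event
            handler(exc)
        raise                      # re-raise so the failure stays visible


def failing_train() -> None:
    raise RuntimeError("simulated crash")


crashed = False
try:
    run_training(failing_train)
except RuntimeError:
    crashed = True
```

Re-raising after the handlers have run keeps the original traceback and exit status, so saving state does not hide the crash.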
Describe alternatives you've considered
Could be done user side.
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
Not all optimizers are implemented in the factory. e.g. RMSprop
Describe the solution you'd like
Add the optimizers in the factory.
Is your feature request related to a problem? Please describe.
Yes, the application does not work when the user has multiple GPUs and launches without the torch launcher.
Describe the bug
The error RuntimeError: A tensor was not cuda. (multi_tensor_apply at csrc/multi_tensor_apply.cuh:60) is raised after the optimizer is reloaded from a checkpoint.
To Reproduce
Steps to reproduce the behavior:
Set the path attribute in the config to reload from a saved checkpoint.
RuntimeError: A tensor was not cuda. (multi_tensor_apply at csrc/multi_tensor_apply.cuh:60) is displayed.
Expected behavior
Training should restart without error.
Is your feature request related to a problem? Please describe.
At the moment all the handlers are synchronous. If a handler takes a long time to respond for some reason (e.g. Visdom because of network latency), it slows down the training process.
Describe the solution you'd like
Some handlers, like Visdom, could run in a thread and act as consumers. They would consume the data from a queue and process it without slowing down the training process.
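The producer/consumer arrangement described here can be sketched with the standard queue and threading modules. AsyncHandler and slow_plot are hypothetical names, and the sleep only simulates network latency.

```python
import queue
import threading
import time


class AsyncHandler:
    """Hypothetical handler: events are queued and consumed by a worker thread."""

    _STOP = object()

    def __init__(self, slow_sink) -> None:
        self._queue: "queue.Queue" = queue.Queue()
        self._sink = slow_sink
        self._worker = threading.Thread(target=self._consume, daemon=True)
        self._worker.start()

    def __call__(self, data) -> None:
        self._queue.put(data)         # returns immediately; training is not blocked

    def _consume(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._sink(item)

    def close(self) -> None:
        self._queue.put(self._STOP)   # the sentinel is queued after all pending items
        self._worker.join()


received = []


def slow_plot(item) -> None:
    time.sleep(0.01)                  # simulates network latency
    received.append(item)


handler = AsyncHandler(slow_plot)
for step in range(5):
    handler(step)
handler.close()
```

A single consumer thread keeps the items in order; close() has to be called at the end of training so that queued data is not lost when the process exits.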
Describe alternatives you've considered
None
Will this change the current api? How?
No
Is your feature request related to a problem? Please describe.
Gradient accumulation is missing.
Describe the solution you'd like
A way to accumulate gradients for k steps.
Describe alternatives you've considered
None
Will this change the current api? How?
Yes, a setting will be added to the config to specify the number of steps over which gradients are accumulated.
Additional context
Gradient accumulation runs k small batches of size N, accumulating their gradients, before doing a single optimizer step.
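The equivalence behind gradient accumulation can be checked on a toy one-parameter model in plain Python. The data, learning rate and mse_grad helper are made up for the illustration; the point is that averaging the gradients of k micro-batches of size N gives the same update as one batch of size k*N.

```python
from typing import List, Tuple


def mse_grad(w: float, batch: List[Tuple[float, float]]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w, lr, k = 0.0, 0.01, 2
micro_batches = [data[0:2], data[2:4]]   # k micro-batches of size N = 2

accumulated = 0.0
w_accumulated = w
for step, batch in enumerate(micro_batches, start=1):
    # Dividing by k turns the running sum into an average over all k*N samples.
    accumulated += mse_grad(w, batch) / k
    if step % k == 0:
        w_accumulated = w - lr * accumulated   # one optimizer step per k batches
        accumulated = 0.0

w_full_batch = w - lr * mse_grad(w, data)      # reference: one batch of size k*N
```

In a PyTorch training loop the same pattern usually appears as calling backward on loss / k for every batch and calling the optimizer's step and zero_grad only every k batches.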
Is your feature request related to a problem? Please describe.
At the moment, one REST call is made to Visdom per piece of Visdom data, which takes unnecessary CPU time.
Describe the solution you'd like
Visdom data could be stacked and pushed only once every N iterations.
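A small buffer that flushes every N items is enough to sketch this. BufferedSender and its send_batch callback are hypothetical; the callback stands in for whatever single call would push several data points to Visdom at once.

```python
class BufferedSender:
    """Hypothetical buffer: collects items and sends them in one call every `every` items."""

    def __init__(self, send_batch, every: int) -> None:
        self._send_batch = send_batch
        self._every = every
        self._buffer = []

    def add(self, item) -> None:
        self._buffer.append(item)
        if len(self._buffer) >= self._every:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer.clear()


calls = []
sender = BufferedSender(calls.append, every=4)
for i in range(10):
    sender.add(i)
sender.flush()   # push the remainder at the end of training
```

Ten data points result in three calls instead of ten; the final flush is needed so that a partially filled buffer is not dropped.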
Describe alternatives you've considered
None
Will this change the current api? How?
No
Additional context
No