
deepobs's Introduction

Hi there, I'm Frank Schneider 👋

I am a Postdoctoral Researcher in the Methods of Machine Learning group at the University of Tübingen.

I work to make deep learning more user-friendly by focusing on the training algorithms.

  • 🚀 I’m currently working on faster training methods for deep neural networks.
  • 🥇 My past projects have focused on creating benchmarks for deep learning optimizers (see DeepOBS and Descending through a Crowded Valley) and novel debugging tools for training neural networks (see Cockpit).
  • 🧑‍🤝‍🧑 I’m part of the MLCommons™ working group for Algorithmic Efficiency, building a competition and benchmark of faster neural network training algorithms.


deepobs's People

Contributors

anonymousiclr2019submitter, fsschneider, p16i, pitmonticone, pnorridge


deepobs's Issues

Incompatible baselines?

Hi, thanks for the project, it's really handy! I tried to use the released 1.2.0-beta0 version as well as master with the baselines from this repository: https://github.com/fsschneider/DeepOBS_Baselines, but without success; I always get the same error:

(...)
File "/deepobs/analyzer/shared_utils.py", line 118, in aggregate_runs
    aggregate["optimizer_hyperparams"] = json_data["optimizer_hyperparams"]
KeyError: 'optimizer_hyperparams'

Are there any baselines that are also supported by the version with PyTorch support?


To Do

We will do the following steps for version 1.2.0:

  • Provide extensive baselines for version 1.2.0.

Evaluation set is a subset of the training set

Hello Frank,
just came over the following pattern that is used in all dataset classes:

def _make_train_eval_dataset(self):
    """Creates the CIFAR-10 train eval dataset.

    Returns:
      A tf.data.Dataset instance with batches of training eval data.
    """
    return self._train_dataset.take(
        self._train_eval_size // self._batch_size)

The problem is that the take method does not remove the taken elements from the dataset they are taken from. As a result, the evaluation set and the training set are not distinct. This should not be the case, or at least it is not the standard way.

Here is a short dummy example showing that the data is really not removed from the train dataset:

import tensorflow as tf
import numpy as np

x = np.array([1, 2, 3, 4, 5])

dataset1 = tf.data.Dataset.from_tensor_slices(x)
dataset2 = dataset1.take(3)
it1 = dataset1.make_one_shot_iterator()
it2 = dataset2.make_one_shot_iterator()
sess = tf.Session()
it1next = it1.get_next()
it2next = it2.get_next()
# The original dataset still yields all five elements ...
for i in range(5):
    print(sess.run([it1next]))
# ... while the taken subset yields the first three of them again.
for i in range(3):
    print(sess.run([it2next]))

result:
[1]
[2]
[3]
[4]
[5]

[1]
[2]
[3]
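
A possible way to make the two splits disjoint (a minimal sketch, not the actual DeepOBS code) is to carve out the train-eval portion with take and build the remaining training set with skip:

import tensorflow as tf
import numpy as np

x = np.array([1, 2, 3, 4, 5])
full_dataset = tf.data.Dataset.from_tensor_slices(x)

# Reserve the first three elements for train-eval and use the rest for
# training, so the two datasets no longer overlap.
train_eval_dataset = full_dataset.take(3)
train_dataset = full_dataset.skip(3)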


To Do

We will do the following steps for version 1.2.0:

  • Include a validation set for PyTorch (needs to be merged from Aaron's branch)
  • Include a validation set for TensorFlow (almost ready)
  • Add a graphic with the split/setup for all four data sets to the docs.

Sequential version of quadratic problem in TensorFlow

The quadratic_deep problem for PyTorch has been updated and slightly changed. It is now re-written as a sequential "neural network", which allows compatibility, for example, with BackPACK.

The TensorFlow version should be updated accordingly. The update most likely introduces only constant or scaling changes, but to be precise, the TensorFlow version should be as equivalent as possible.
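
For illustration only (this is a hedged sketch, not the actual DeepOBS implementation; the module name ShiftByParameter and the factorization Q = S^T S are assumptions), a quadratic loss 0.5 * (theta - x)^T Q (theta - x) can be expressed through a sequential module, which is what makes extensions like BackPACK applicable:

import torch
from torch import nn

class ShiftByParameter(nn.Module):
    # Hypothetical helper: returns theta - x for a trainable parameter theta.
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.theta - x

dim = 100
shift = ShiftByParameter(dim)
scale = nn.Linear(dim, dim, bias=False)  # fixed factor S with Q = S^T S
scale.weight.requires_grad_(False)

model = nn.Sequential(shift, scale)

x = torch.randn(8, dim)  # a batch of "data" samples
loss = 0.5 * model(x).pow(2).sum(dim=1).mean()
loss.backward()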


To Do

We will do the following steps for version 1.2.0:

  • Update the TensorFlow version of quadratic_deep to match our PyTorch version.

Make the device an argument of the runner

The device on which the runner performs its training should be settable as an argument of the run() method, for both frameworks. This allows for more flexible hardware usage.
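
A minimal sketch of how this could look (the argument name device and the fallback behavior are assumptions, not the final API):

import torch

def run(self, testproblem, device=None, **training_params):
    # Hypothetical: let the caller choose the device instead of hard-coding it.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    self._device = torch.device(device)
    # ... build the test problem and move model and batches to self._device ...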

Error in Plotting

I get the following error when trying to plot the results for the simple example as described in the documentation. I am running in a Colab notebook with Python 3.6.8.

matplotlib 3.0.3
matplotlib-venn 0.11.5
matplotlib2tikz 0.7.5

Error message is:

/usr/local/lib/python3.6/dist-packages/matplotlib2tikz/save.py in _recurse(data, obj)
    340     """
    341     content = _ContentManager()
--> 342     for child in obj.get_children():
    343         # Some patches are Spines, too; skip those entirely.
    344         # See nschloe/tikzplotlib#277.

AttributeError: 'str' object has no attribute 'get_children'


To Do

We will do the following steps for version 1.2.0:

  • Update matplotlib2tikz to tikzplotlib
  • Add tested version(s) to the documentation

Implement a sanity check for existing tuner output paths

Let us assume I run a tuning A and the results are written to './results'. If I then change my mind, want to run a different tuning B, and do not specify a different output folder, the new outputs of B are also written to './results'. This can easily happen when I use a script for A, adapt it for B, and forget to change the output directory (or when I use the default of DeepOBS). The problem this implies is:

The outputs of A and B are both in the './results' path, and further analyses (e.g. getting the best hyperparameter setting with the analyzer) are performed on all the runs. If I am only interested in B, but results of A are included, I may end up with the best setting of A.

Proposed solution:
I know that we could simply expect the user to be smart enough to delete the results of A first (or to change the output directory of B), but if the user forgets to do so it can really be a mess. Therefore, I suggest implementing a sanity check in the tuner that prompts or warns the user when tuning is run on an already existing output directory. As far as I know, rerunning the best setting is based on the runner class and not the tuner class, so rerunning the best setting would not prompt the user (which is fine).

However, just see this as an idea for improvement. The exact design may differ from my suggestion.
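
A minimal sketch of such a check (the function name and exact behavior are assumptions, not part of DeepOBS):

import os
import warnings

def _check_output_path(output_dir):
    # Warn if the tuner is about to write into a directory that already
    # contains results, since later analyses would mix the runs.
    if os.path.isdir(output_dir) and os.listdir(output_dir):
        warnings.warn(
            "The output directory '%s' already contains results; "
            "new tuning runs will be mixed with the existing ones." % output_dir
        )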

Add prefetching for batches / Parallelization for preprocessing

During training on V100s, I noticed a low volatile GPU utilization, which is usually the case when training is rather fast but the CPU cannot create the batches fast enough.

I've taken a look at some of the test problems, and it doesn't seem like batches are being prefetched or the preprocessing is being parallelized. I would recommend making use of TensorFlow's prefetch method as well as map(preprocessing, num_parallel_calls=64) for parallelized preprocessing. This will most likely cause a tremendous speed-up on higher-end GPUs.
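
A minimal sketch of such an input pipeline (the dataset and preprocess_fn are placeholders, not DeepOBS code):

import tensorflow as tf

def preprocess_fn(x):
    # Placeholder for the per-example preprocessing (augmentation, scaling, ...).
    return tf.cast(x, tf.float32) / 255.0

dataset = tf.data.Dataset.range(10000)
dataset = (
    dataset.map(preprocess_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(128)
    .prefetch(tf.data.experimental.AUTOTUNE)
)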

Log Used Software, Hardware, and Wall-Clock Time

  • Runners should also log the wall-clock time alongside the logged metric values. The time needed to reach a target error can then be extracted in a post-processing step.
  • For completeness, and as DeepOBS is still being developed, it would be good to track version and hardware information in the same .json file (e.g. torch, torchvision, deepobs, tensorflow versions plus hardware info); a possible layout is sketched below.
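
A hedged sketch of what the extra fields in the output .json could look like (the field names are assumptions, not the DeepOBS format):

import json
import platform
import time

import torch

run_info = {
    "wall_clock_times": [],  # one entry per logged evaluation point
    "software": {
        "python": platform.python_version(),
        "torch": torch.__version__,
    },
    "hardware": {
        "cpu": platform.processor(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    },
}

start = time.time()
# ... training loop; at every logging step:
run_info["wall_clock_times"].append(time.time() - start)

with open("run_output.json", "w") as f:
    json.dump(run_info, f)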

To Do

We will do the following steps for version 1.2.0:

  • Add logging of wall-clock time.
  • Add logging of used software.
  • Add logging of used hardware.

Feature Request: Add support for pytorch

(This is a test issue. As mentioned in the responses to #3 and #4, there is a development branch that supports PyTorch. Find it here: https://github.com/abahde/DeepOBS)

Expected behavior

I would appreciate it if DeepOBS had built-in support for my PyTorch optimizers.

This is relevant because a lot of optimizer research happens in PyTorch.

Proposed approach:

Maybe @abahde could send a pull request when he's finished implementing it. Then @fsschneider can accept the pull request, handle the merging, and we have success!


To Do

We will do the following steps for version 1.2.0:

  • Add full support for PyTorch
    • Implement all Test Problems:
      • 2-D (data loading):
        • Beale
        • Branin
        • Rosenbrock
      • Quadratic (data loading):
        • Deep
      • MNIST (data loading):
        • Log. Regr.
        • MLP
        • 2c2d
        • VAE
      • Fashion-MNIST (data loading):
        • Log. Regr.
        • MLP
        • 2c2d
        • VAE
      • CIFAR-10 (data loading):
        • 3c3d
        • VGG16
        • VGG19
      • CIFAR-100 (data loading):
        • 3c3d
        • VGG16
        • VGG19
        • All-CNN-C
        • Wide ResNet-16-4
        • Wide ResNet-40-4
      • SVHN (data loading):
        • 3c3d
        • Wide ResNet-16-4
      • ImageNet (data loading):
        • VGG16
        • VGG19
        • Inception-v3
      • Tolstoi (data loading):
        • CharRNN

Tuner that supports more than a single seed for all runs, not just the best

Hi there,

I'm trying to use DeepOBS with GridSearch. I want to do a grid search averaged over a number of seeds. As far as I know, the grid search uses just a single seed for the comparison, which might make the reported best-performing hyperparameters misleading because of stochasticity. Does this feature already exist? If so, please let me know how to do it with DeepOBS.

Thank you so much.
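
A possible manual workaround is sketched below; the StandardRunner constructor and the run() argument names are assumptions about the API, not verified against the current version:

from torch.optim import SGD
from deepobs.pytorch.runners import StandardRunner

runner = StandardRunner(SGD, hyperparameter_names={"lr": {"type": float}})

# Repeat every grid point for several seeds and aggregate the runs afterwards.
for seed in range(1, 6):
    for lr in [0.1, 0.01, 0.001]:
        runner.run(
            testproblem="mnist_mlp",
            hyperparams={"lr": lr},
            random_seed=seed,
            output_dir="./results_multi_seed",
        )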

Make the optimizer name an argument of the runner

The user should be able to set the optimizer name by passing a string to the runner instance. This makes it possible to separate SGD/Momentum/Nesterov in PyTorch and gives more flexibility to the optimizer developer.


To Do

We will do the following steps for version 1.2.0:

  • Add optimizer name as an optional argument to the runner. If none is given, it will use the internal name for the optimizer.
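
A minimal sketch of the idea (this is an illustration, not the DeepOBS runner class):

import torch

class Runner:
    def __init__(self, optimizer_class, optimizer_name=None):
        self._optimizer_class = optimizer_class
        # Fall back to the optimizer's class name if no explicit name is given.
        self._optimizer_name = optimizer_name or optimizer_class.__name__

runner = Runner(torch.optim.SGD, optimizer_name="NesterovSGD")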

get_performance_dictionary doesn't provide the desired metric

The interface of the function is

def get_performance_dictionary(
    optimizer_path, mode="most", metric="valid_accuracies", conv_perf_file=None
):

But even when providing e.g. "valid_accuracies", the function sometimes returns the "test_accuracies".
The explanation can be found in the following line of code (permalink to dev branch):

metric = "test_accuracies" if "test_accuracies" in sett.aggregate else "test_losses"

This line overrides the metric provided by the user in all cases, making the parameter redundant.
A proposed fix is to delete this line, or to remove the metric parameter from the function. I personally think the former is more meaningful, since it provides more flexibility to the end user.
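
As a variation on the proposed fix (this is only a sketch, not the author's exact proposal), the override could also be turned into a fallback that is applied only when the requested metric is missing:

if metric not in sett.aggregate:
    metric = "test_accuracies" if "test_accuracies" in sett.aggregate else "test_losses"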
