
keras-adamw's Introduction

Keras AdamW


Keras/TF implementation of AdamW, SGDW, NadamW, and Warm Restarts, based on the paper Decoupled Weight Decay Regularization - plus Learning Rate Multipliers

Features

  • Weight decay fix: decoupling the L2 penalty from the gradient update. Why use?
    • Weight decay via L2 penalty yields worse generalization, because the penalty is scaled by the adaptive learning rate and thus does not decay weights as intended
    • Weight decay via L2 penalty couples the decay hyperparameter with lr, complicating hyperparameter search
  • Warm restarts (WR): cosine annealing learning rate schedule. Why use?
    • The authors showed better generalization and faster convergence for various datasets and model sizes
  • LR multipliers: per-layer learning rate multipliers. Why use?
    • Pretraining; when adding new layers to pretrained ones, a single global lr is prone to overfitting

Installation

pip install keras-adamw. Or, for the latest (most likely stable) version:

pip install git+https://github.com/OverLordGoldDragon/keras-adamw

Usage

If using tensorflow.keras imports, set os.environ["TF_KERAS"] = '1' before importing keras_adamw.
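A minimal sketch of the import order used throughout the issues below - the environment variable is set before keras_adamw is imported:

import os
os.environ["TF_KERAS"] = '1'  # tell keras_adamw to use its tf.keras optimizers

from tensorflow.keras.models import Model
from keras_adamw import AdamW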

Weight decay

AdamW(model=model)
Three methods to set weight_decays = {<weight matrix name>:<weight decay value>,}:

# 1. Automatically
# Just pass in `model` (`AdamW(model=model)`); decays will be extracted automatically.
# Loss-based penalties (l1, l2, l1_l2) are zeroed by default, but can be kept via
# `zero_penalties=False` (NOT recommended, see Use guidelines).

# 2. Use keras_adamw.utils
from keras_adamw.utils import get_weight_decays, fill_dict_in_order
Dense(.., kernel_regularizer=l2(0)) # set weight decays in layers as usual, but to ZERO
wd_dict = get_weight_decays(model)
# print(wd_dict) to see the returned weight matrix names; note their order
# specify values as (l1, l2) tuples, both for l1_l2 decay
ordered_values = [(0, 1e-3), (1e-4, 2e-4), ..]
weight_decays = fill_dict_in_order(wd_dict, ordered_values)

# 3. Fill manually (starting from an existing or empty dict)
model.layers[1].kernel.name # get name of kernel weight matrix of layer indexed 1
weight_decays.update({'conv1d_0/kernel:0': (1e-4, 0)}) # example
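However the dict is built, it is then handed to the optimizer - a minimal sketch, assuming weight_decays is accepted as a keyword argument alongside model:

optimizer = AdamW(lr=1e-3, model=model, weight_decays=weight_decays)
model.compile(optimizer, loss='binary_crossentropy')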

Warm restarts

AdamW(.., use_cosine_annealing=True, total_iterations=200) - refer to Use guidelines below

LR multipliers

AdamW(.., lr_multipliers=lr_multipliers) - to build the {<layer name>:<multiplier value>,} dict (see the sketch after the steps below):

  1. (a) Name every layer to be modified (recommended), e.g. Dense(.., name='dense_1') - OR
    (b) Get every layer name, note which to modify: [print(idx,layer.name) for idx,layer in enumerate(model.layers)]
  2. (a) lr_multipliers = {'conv1d_0':0.1} # target layer by full name - OR
    (b) lr_multipliers = {'conv1d':0.1} # target all layers w/ name substring 'conv1d'
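Putting the two steps together - a minimal sketch (layer names are illustrative):

# inspect layer names, then pick which layers to slow down
for idx, layer in enumerate(model.layers):
    print(idx, layer.name)

lr_multipliers = {'lstm_1': 0.5,  # full name: only the layer named 'lstm_1'
                  'conv1d': 0.1}  # substring: all layers whose name contains 'conv1d'
optimizer = AdamW(lr=1e-3, model=model, lr_multipliers=lr_multipliers)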

Example

import numpy as np
from keras.layers import Input, Dense, LSTM
from keras.models import Model
from keras.regularizers import l1, l2, l1_l2
from keras_adamw import AdamW

ipt   = Input(shape=(120, 4))
x     = LSTM(60, activation='relu', name='lstm_1',
             kernel_regularizer=l1(1e-4), recurrent_regularizer=l2(2e-4))(ipt)
out   = Dense(1, activation='sigmoid', kernel_regularizer=l1_l2(1e-4, 2e-4))(x)
model = Model(ipt, out)
lr_multipliers = {'lstm_1': 0.5}

optimizer = AdamW(lr=1e-4, model=model, lr_multipliers=lr_multipliers,
                  use_cosine_annealing=True, total_iterations=24)
model.compile(optimizer, loss='binary_crossentropy')
for epoch in range(3):
    for iteration in range(24):
        x = np.random.rand(10, 120, 4) # dummy data
        y = np.random.randint(0, 2, (10, 1)) # dummy labels
        loss = model.train_on_batch(x, y)
        print("Iter {} loss: {}".format(iteration + 1, "%.3f" % loss))
    print("EPOCH {} COMPLETED\n".format(epoch + 1))

(Full example + plot code, and explanation of lr_t vs. lr: example.py)
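To inspect the schedule while training, the optimizer's variables can be evaluated each iteration - a minimal sketch using keras_adamw.utils.K_eval, as in the "Actual lr seems fixed during training" issue further below:

from keras import backend as K           # with TF_KERAS set, use tensorflow.keras instead
from keras_adamw.utils import K_eval

eta_t = K_eval(model.optimizer.eta_t, K)  # cosine-annealing multiplier
t_cur = K_eval(model.optimizer.t_cur, K)  # iteration counter within the current restart
lr    = K_eval(model.optimizer.lr, K)     # base lr; the effective rate lr_t scales with lr * eta_t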

Use guidelines

Weight decay

  • Set the L2 penalty to ZERO if regularizing a weight via weight_decays - else the purpose of the 'fix' is largely defeated, and weights will be over-decayed --My recommendation
  • lambda = lambda_norm * sqrt(1/total_iterations) --> can be changed; the intent is to scale λ so it is decoupled from other hyperparameters, including (but not limited to) the number of epochs and batch size (see the sketch after this list) --Authors (Appendix, pg.1) (A-1)
  • total_iterations_wd --> set to normalize over all epochs (or another interval != total_iterations) instead of per-WR when using WR; may sometimes yield better results --My note
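A minimal sketch of the scaling described in the second point above (plain Python, not the library's internals):

import numpy as np

lambda_norm = 1e-2       # "normalized" decay, chosen independently of schedule length
total_iterations = 24    # expected weight updates for the given restart
lam = lambda_norm * np.sqrt(1 / total_iterations)  # effective per-update decay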

Warm restarts

  • Done automatically with autorestart=True, which is the default if use_cosine_annealing=True; internally sets t_cur=0 after total_iterations iterations.
  • Manually: set t_cur = -1 to restart the schedule multiplier (see Example); this can be done at compilation or during training. Values other than -1 are also valid, and will start eta_t at a different point on the cosine curve. Details in A-2,3. A callback sketch follows the code block below.
  • t_cur should be set at iter == total_iterations - 2; explanation here
  • Set total_iterations to the # of expected weight updates for the given restart --Authors (A-1,2)
  • eta_min=0, eta_max=1 are tunable hyperparameters; e.g., an exponential schedule can be used for eta_max. If unsure, the defaults were shown to work well in the paper. --Authors
  • Save/load optimizer state; WR relies on using the optimizer's update history for effective transitions --Authors (A-2)
# 'total_iterations' general purpose example
def get_total_iterations(restart_idx, num_epochs, iterations_per_epoch):
    return num_epochs[restart_idx] * iterations_per_epoch[restart_idx]
get_total_iterations(0, num_epochs=[1,3,5,8], iterations_per_epoch=[240,120,60,30])
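If not relying on autorestart=True, the manual reset described above can be wrapped in a callback - a minimal sketch (the counter logic is illustrative; only model.optimizer.t_cur and the reset point come from the guidelines):

from keras import backend as K          # with TF_KERAS set, use tensorflow.keras instead
from keras.callbacks import Callback

class WarmRestartCallback(Callback):
    """Sets t_cur = -1 shortly before `total_iterations` is reached (see guideline above)."""
    def __init__(self, total_iterations):
        super(WarmRestartCallback, self).__init__()
        self.total_iterations = total_iterations
        self.iteration = 0  # 0-based count of completed updates within the current restart

    def on_batch_end(self, batch, logs=None):
        if self.iteration == self.total_iterations - 2:
            K.set_value(self.model.optimizer.t_cur, -1)  # WARM RESTART
        self.iteration = (self.iteration + 1) % self.total_iterations

# model.fit(x, y, epochs=3, callbacks=[WarmRestartCallback(total_iterations=24)])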

Learning rate multipliers

  • Best used for pretrained layers - e.g. greedy layer-wise pretraining, or pretraining a feature extractor that feeds a classifier network. Can be a better alternative to freezing layer weights. --My recommendation
  • It's often best not to pretrain layers fully (until convergence, or even the best obtainable validation score), as this may inhibit their ability to adapt to newly-added layers. --My recommendation
  • The more a layer has been pretrained, the smaller its fraction of the new layers' lr should be (see the sketch below). --My recommendation
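For instance, a minimal sketch reflecting the last point (layer names here are hypothetical); layers left out of the dict train at the full base lr:

lr_multipliers = {'encoder_block': 0.05,  # heavily pretrained: small fraction of base lr
                  'adapter_dense': 0.5}   # lightly pretrained: larger fraction
optimizer = AdamW(lr=1e-3, model=model, lr_multipliers=lr_multipliers)
# newly added layers are not listed, so they train at the full base lr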

How to cite

Short form:

John Muradeli, keras-adamw, 2019. GitHub repository, https://github.com/OverLordGoldDragon/keras-adamw/. DOI: 10.5281/zenodo.5080529

BibTeX:

@article{OverLordGoldDragon2019keras-adamw,
  title={Keras AdamW},
  author={John Muradeli},
  journal={GitHub. Note: https://github.com/OverLordGoldDragon/keras-adamw/},
  year={2019},
  doi={10.5281/zenodo.5080529},
}


keras-adamw's Issues

AttributeError: 'L1' object has no attribute 'l2'

I have the following code:

lr_multipliers = {'lstm_1': 0.5}
optimizer = AdamW(lr=1e-4, model=model_AdamW, lr_multipliers=lr_multipliers,
                  use_cosine_annealing=True, total_iterations=24)
model_AdamW.compile(optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

and got error: AttributeError: 'L1' object has no attribute 'l2'

AdaBelief

Thank you very much for your work on this project! It really is an excellent contribution to provide an up-to-date version of AdamW that allows layer-dependent learning rates. I'm wondering what your thoughts are about AdaBelief and if you'd want to add it as an option to this package?

Import issue while using tensorflow.keras

Hi,

Very good work for implementing this :)
However, the TF_KERAS environment variable doesn't seem to select the right AdamW from the right optimizers file ... I set the variable at the beginning of my code and debug mode shows it is present.
Therefore, I'm doing from keras_adamw.optimizers_v2 import AdamW and it works :)

It seems that with the direct import from keras_adamw import AdamW, it goes into the optimizers225.py

Thanks for your work

Warm Restart

Thank you for developing AdamW!

I have a question about warm restarts. Is it necessary to force-set "t_cur = 0" at the end of each training epoch? (var 1)
Or is 't_cur' automatically set to 0 after reaching 'total_iterations'? (var 2)

(var 1)
def on_epoch_end(...):
    ...
    K.set_value(self.model.optimizer.t_cur, 0)  # WARM RESTART
    ...

(var 2)
trainset_size = 1000
batch_size = 64
optimizer = AdamW(..., total_iterations=15, batch_size=batch_size)

And correct me if I'm wrong. Best practice for setting 'total_iterations' is:
total_iterations = int(trainset_size / batch_size)

Error when using SGDW in a complex project

I was trying to use the SGDW with a project but it seems to be causing an error
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation resample_p6/conv2d/kernel/Initializer/random_uniform/sub: Could not satisfy explicit device specification '' because the node {{colocation_node resample_p6/conv2d/kernel/Initializer/random_uniform/sub}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0].

The error seems to be caused only when using the SGDW optimizer and not the AdamW one (I haven't tried NadamW).

The project I tried to apply SGDW to is EfficientDet, which is quite a complex project. Nevertheless, this shouldn't happen, and I am not sure what the cause of the problem is. Also, when used in a small network like the one provided in example.py, there doesn't seem to be any problem.

Note to users

Currently the implementation isn't fully compatible with tf.keras, tf.python.keras, or TensorFlow 2.0

I'm working on addressing all of these. TF 2 brought sweeping changes that complicate compatibility, along with some bugs. You can "watch" the repo to be notified of the next release (update).

SGDW doesn't work

Hi!
Thanks for all your effort. This code really helps when implementing a custom optimizer.

There seems to be an issue with SGDW. The sample code from the README works fine with AdamW, but crashes when using SGDW:

import os; os.environ["TF_KERAS"]='1'
import numpy as np
from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2
import keras_adamw

ipt   = Input(shape=(120, 4))
x     = LSTM(60, activation='relu', name='lstm_1',
             kernel_regularizer=l1(1e-4), recurrent_regularizer=l2(2e-4))(ipt)
out   = Dense(1, activation='sigmoid', kernel_regularizer=l1_l2(1e-4, 2e-4))(x)
model = Model(ipt, out)

lr_multipliers = {'lstm_1': 0.5}

optimizer = keras_adamw.SGDW(lr=1e-4, model=model)
model.compile(optimizer, loss='binary_crossentropy')

for epoch in range(3):
    for iteration in range(24):
        x = np.random.rand(10, 120, 4) # dummy data
        y = np.random.randint(0, 2, (10, 1)) # dummy labels
        loss = model.train_on_batch(x, y)
        print("Iter {} loss: {}".format(iteration + 1, "%.3f" % loss))
    print("EPOCH {} COMPLETED\n".format(epoch + 1))

returns

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-d2bca98bfb4f> in <module>()
     21         x = np.random.rand(10, 120, 4) # dummy data
     22         y = np.random.randint(0, 2, (10, 1)) # dummy labels
---> 23         loss = model.train_on_batch(x, y)
     24         print("Iter {} loss: {}".format(iteration + 1, "%.3f" % loss))
     25     print("EPOCH {} COMPLETED\n".format(epoch + 1))

8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    966           except Exception as e:  # pylint:disable=broad-except
    967             if hasattr(e, "ag_error_metadata"):
--> 968               raise e.ag_error_metadata.to_exception(e)
    969             else:
    970               raise

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:541 train_step  **
        self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1814 _minimize
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:508 apply_gradients
        "name": name,
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2420 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2427 _merge_call
        return merge_fn(self._strategy, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:592 _distributed_apply  **
        var, apply_grad_to_update_var, args=(grad,), group=False))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2013 update
        return self._update(var, fn, args, kwargs, group)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2659 _update
        return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2665 _update_non_slot
        result = fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:567 apply_grad_to_update_var  **
        update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
    /usr/local/lib/python3.6/dist-packages/keras_adamw/optimizers_v2.py:672 _resource_apply_dense
        m = K.zeros(K.int_shape(var))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:1333 zeros
        return variable(v, dtype=dtype, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:845 variable
        constraint=constraint)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:261 __call__
        return cls._variable_v2_call(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:255 _variable_v2_call
        shape=shape)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py:511 invalid_creator_scope
        "tf.function-decorated function tried to create "

    ValueError: tf.function-decorated function tried to create variables on non-first call.

I'm using tensorflow-gpu version 2.2 and tf.keras.

Usage & concept questions

It works perfectly for me. Thank you for sharing and developing this repo. I think this idea really works (at least for my problem).

Thanks,
Chong

Invalid argument: Input to reshape is a tensor with 4300800 values, but the requested shape has 19268370432

Hi,

I am using keras-adamw with bert-for-tf2 under the AMD rocm environment, and sometimes I get an error like the following one:

File "bert-decept.py", line 543, in
history = fit_model(model, data, BATCH_SIZE, EPOCHS, tensorboard_callback, model_checkpoint_callback,
File "bert-decept.py", line 438, in fit_model
history = model.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py", line 766, in fit
return func.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 649, in fit
return fit_loop(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 386, in model_iteration
batch_outs = f(ins_batch)
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3631, in call
fetched = self._callable_fn(*array_vals,
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1470, in call
ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Input to reshape is a tensor with 4300800 values, but the requested shape has 19268370432
[[{{node bert_1/encoder/layer_7/attention/self/query/Tensordot}}]]
[[Func/training_2/AdamW/gradients/gradients/bert_1/encoder/layer_7/output/dropout_62/cond_grad/StatelessIf/then/_11515/input/_23174/_9837]]
(1) Invalid argument: Input to reshape is a tensor with 4300800 values, but the requested shape has 19268370432
[[{{node bert_1/encoder/layer_7/attention/self/query/Tensordot}}]]
0 successful operations.
0 derived errors ignored.

or

File "bert-decept.py", line 543, in
history = fit_model(model, data, BATCH_SIZE, EPOCHS, tensorboard_callback, model_checkpoint_callback,
File "bert-decept.py", line 438, in fit_model
history = model.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py", line 766, in fit
return func.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 649, in fit
return fit_loop(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 386, in model_iteration
batch_outs = f(ins_batch)
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3631, in call
fetched = self._callable_fn(*array_vals,
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1470, in call
ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Size 0 must be non-negative, not -1737945760
[[{{node bert/encoder/layer_5/attention/self/query/Tensordot/Reshape}}]]
[[Func/training/AdamW/gradients/gradients/bert/encoder/layer_9/output/dropout_30/cond_grad/StatelessIf/then/_696/input/_2295/_3389]]
(1) Invalid argument: Size 0 must be non-negative, not -1737945760
[[{{node bert/encoder/layer_5/attention/self/query/Tensordot/Reshape}}]]

At least to my inexperienced eyes it looks like an invalid pointer reference, so probably not a problem related to adamw but rather to rocm. Does anyone have any idea/insight about what might be the problem?

Best regards
Panagiotis

ValueError: Could not interpret optimizer identifier: <keras_adamw.optimizers_v2.AdamW object at 0x0000021E2F81D220>

Hello:
When I run the example, I get the error "ValueError: Could not interpret optimizer identifier: <keras_adamw.optimizers_v2.AdamW object at 0x0000021E2F81D220>". It may be due to the difference between keras and tf.keras. But when I change the imports to "
import numpy as np
import os
os.environ['TF_KERAS'] = '1'
from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2

from keras_adamw import AdamW"
and run it under the tensorflow environment.

keras.legacy no longer present

Looks like keras.legacy is no longer part of Keras 2.4:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-c6757b174f2e> in <module>()
----> 1 from keras_adamw import AdamW
      2 import tensorflow_hub as hub
      3 import os
      4 
      5 # Load compressed models from tensorflow_hub

1 frames
/usr/local/lib/python3.6/dist-packages/keras_adamw/optimizers.py in <module>()
      1 import numpy as np
      2 from keras import backend as K
----> 3 from keras.legacy import interfaces
      4 from keras.optimizers import Optimizer
      5 from .utils import _init_weight_decays, _apply_weight_decays, _check_args

ModuleNotFoundError: No module named 'keras.legacy'

Comparison Against Adam

Is it possible for you to benchmark your implementation of AdamW against Tensorflow's implementation of Adam on multiple datasets? It would be useful information for users to decide whether AdamW is the right choice. I would be interested in the differences in the time it takes for every epoch step.

WeightDecay is incorrectly normalized

Hey,

first of all thank you for this library, it's great and works great in general

I just wanted to point out that I think the weight decay is wrongly normalized with respect to the batch size. From the original paper, the normalized weight decay formula is as follows:

λ = λ_norm * sqrt(b / (B*T)), where b is the batch size, B is the number of epochs, and, most importantly, T is the total number of training samples.
The code assumes that total_iterations is set equal to B*T; however, the iterations are counted in batch updates, which equals step_size * epochs.

This is missing the number of samples in each batch, b, needed to get back to the originally used B*T: sqrt(b / (b * total_iterations)) = sqrt(1 / total_iterations).
Therefore, batch_size should be set to 1 if setting total_iterations or total_iterations_wd as described in the examples here.

AttributeError: can't set attribute

Hi,

I'm using keras - 2.3.1
and TF - '1.15.2-dlenv_tfe'

my code:

        model = Model(inputs=inp, outputs=out)
        
        optimizer = AdamW(model, lr=1e-4)

        model.compile(loss='mse', optimizer=optimizer)

And I'm getting:

     68         with K.name_scope(self.__class__.__name__):
     69             self.iterations = K.variable(0, dtype='int64', name='iterations')
---> 70             self.lr = K.variable(lr, name='lr')
     71             self.beta_1 = K.variable(beta_1, name='beta_1')
     72             self.beta_2 = K.variable(beta_2, name='beta_2')

AttributeError: can't set attribute

Actual lr seems fixed during training

I am a bit confused about the actual optimizers lr at each batch.

I have noticed that there is a (now closed) issue regarding Usage & concept questions, where you refer to the actual lr (learning rate) being lr*eta_t.

But if I use your example as a basis and plot the lr at each batch, there does not appear to be any fluctuation of the actual lr, regardless of the values eta_t is assigned.

from tensorflow.keras import backend as K
import os
os.environ["TF_KERAS"] = '1'
os.environ["TF_EAGER"] = '0'

from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2

import numpy as np
import matplotlib.pyplot as plt

from keras_adamw import AdamW
from keras_adamw.utils import K_eval

USE_CPU = True

if USE_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

ipt = Input(shape=(120, 4))
x = LSTM(60, activation='relu', name='lstm_1',
         kernel_regularizer=l1(1e-4), recurrent_regularizer=l2(2e-4))(ipt)
out = Dense(1, activation='sigmoid', kernel_regularizer=l1_l2(1e-4, 2e-4))(x)
model = Model(ipt, out)

lr_multipliers = {'lstm_1': 0.5}

optimizer = AdamW(lr=1e-4, model=model, lr_multipliers=lr_multipliers,
                  use_cosine_annealing=True, total_iterations=24)
model.compile(optimizer, loss='binary_crossentropy')

eta_history = []
lr_history = []
for epoch in range(3):
    for iteration in range(24):
        x = np.random.rand(10, 120, 4)  # dummy data
        y = np.random.randint(0, 2, (10, 1))  # dummy labels
        loss = model.train_on_batch(x, y)
        eta_t = K_eval(model.optimizer.eta_t, K)
        eta_history.append(eta_t)
        t_cur = K_eval(model.optimizer.t_cur, K)
        lr = K_eval(model.optimizer.lr, K)  # K.eval(model.optimizer.lr)
        lr_history.append(lr)
        eta_max = K_eval(model.optimizer.eta_max, K)
        eta_min = K_eval(model.optimizer.eta_min, K)

        print('Iter {} t_cur: {} - lr: {} - eta_max: {} - eta_min: {}'.format(iteration + 1, t_cur, lr, eta_max, eta_min))
        print("Iter {} loss: {} - eta_t: {}".format(iteration + 1, "%.3f" % loss, eta_t))
        if iteration == (24 - 2):
            K.set_value(model.optimizer.t_cur, -1)  # WARM RESTART
    print("EPOCH {} COMPLETED\n".format(epoch + 1))

plt.plot(eta_history, linewidth=2)
plt.xlim(0, len(eta_history))
plt.ylim(0, 1.05)
plt.ylabel('eta_t', weight='bold', fontsize=15)
plt.xlabel('Train iterations', weight='bold', fontsize=15)
plt.gcf().set_size_inches(10, 5)
plt.show()
plt.close()

plt.plot(lr_history, linewidth=2)
plt.xlim(0, len(lr_history))
plt.ylim(0.9*np.min(lr_history), 1.1*np.max(lr_history))
plt.ylabel('lr', weight='bold', fontsize=15)
plt.xlabel('Train iterations', weight='bold', fontsize=15)
plt.gcf().set_size_inches(10, 5)
plt.show()

Iter 1 t_cur: 1 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 1 loss: 0.691 - eta_t: 0.9953429698944092
Iter 2 t_cur: 2 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 2 loss: 0.694 - eta_t: 0.9814586639404297
Iter 3 t_cur: 3 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 3 loss: 0.704 - eta_t: 0.9586056470870972
Iter 4 t_cur: 4 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 4 loss: 0.689 - eta_t: 0.927209734916687
Iter 5 t_cur: 5 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 5 loss: 0.682 - eta_t: 0.8878556489944458
Iter 6 t_cur: 6 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 6 loss: 0.708 - eta_t: 0.8412765264511108
Iter 7 t_cur: 7 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 7 loss: 0.684 - eta_t: 0.788340151309967
Iter 8 t_cur: 8 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 8 loss: 0.691 - eta_t: 0.7300325036048889
Iter 9 t_cur: 9 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 9 loss: 0.701 - eta_t: 0.6674398183822632
Iter 10 t_cur: 10 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 10 loss: 0.690 - eta_t: 0.6017280220985413
Iter 11 t_cur: 11 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 11 loss: 0.699 - eta_t: 0.5341211557388306
Iter 12 t_cur: 12 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 12 loss: 0.699 - eta_t: 0.46587878465652466
Iter 13 t_cur: 13 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 13 loss: 0.687 - eta_t: 0.39827197790145874
Iter 14 t_cur: 14 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 14 loss: 0.713 - eta_t: 0.3325602114200592
Iter 15 t_cur: 15 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 15 loss: 0.709 - eta_t: 0.2699674367904663
Iter 16 t_cur: 16 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 16 loss: 0.688 - eta_t: 0.21165981888771057
Iter 17 t_cur: 17 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 17 loss: 0.692 - eta_t: 0.15872341394424438
Iter 18 t_cur: 18 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 18 loss: 0.687 - eta_t: 0.1121443510055542
Iter 19 t_cur: 19 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 19 loss: 0.684 - eta_t: 0.07279029488563538
Iter 20 t_cur: 20 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 20 loss: 0.693 - eta_t: 0.04139435291290283
Iter 21 t_cur: 21 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 21 loss: 0.699 - eta_t: 0.018541336059570312
Iter 22 t_cur: 22 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 22 loss: 0.699 - eta_t: 0.00465703010559082
Iter 23 t_cur: 23 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 23 loss: 0.678 - eta_t: 0.0
Iter 24 t_cur: 0 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 24 loss: 0.696 - eta_t: 1.0
EPOCH 1 COMPLETED

how to log learning_rate

Hi, I have used your AdamW for my project, but when I log the learning_rate to my log file or tensorboard with self.optimizer.lr, it is always the initial learning rate. Can you tell me how to log the changing learning rate at each step? Thanks

IMPORTANT: upgrade to 1.23

1.2 and 1.21 use an erroneous decay formula, decaying l1 as l2 and vice versa; this is fixed in 1.23. Pardon the mishap.
