
gRDA-Optimizer

"Generalized Regularized Dual Averaging" is an optimizer that can learn a small sub-network during training, if one starts from an overparameterized dense network.

Please cite the following publication when referring to gRDA:

Chao, S.-K., Wang, Z., Xing, Y. and Cheng, G. (2020). Directional pruning of deep neural networks. NeurIPS 2020. Available at: https://arxiv.org/abs/2006.09358


Figure: DP is the asymptotic directional pruning solution computed with gRDA; it lies in the same minimum valley as the SGD solution on the training loss landscape. DP reaches 90.3% sparsity with 76.81% test accuracy, while the SGD solution has no zero elements and 76.6% test accuracy. This example uses wide ResNet28x10 on CIFAR-100; see the paper for details.


Figure: training curves and sparsity for the simple 6-layer CNN from the Keras tutorial https://keras.io/examples/cifar10_cnn/. The experiments use lr = 0.005 for SGD, SGD with momentum, and the gRDAs (with c = 0.005 for gRDA), and lr = 0.005 and 0.001 for Adagrad and Adam, respectively.

Requirements

Keras version >= 2.2.5
TensorFlow version >= 1.14.0

How to use

The gRDA optimizer has three hyperparameters: the learning rate (lr), the sparsity-control parameter (mu), and the initial sparsity-control constant (c).

  • lr: as a rule of thumb, use the learning rate that works for SGD without momentum, and scale it with the batch size.
  • mu: 0.5 < mu < 1. A greater mu makes the parameters sparser; values in {0.501, 0.51, 0.55} are generally recommended.
  • c: a small number, e.g. 0 < c < 0.005. A greater c also makes the model sparser, especially at the early stage of training, but it usually has little effect late in training. We recommend first fixing mu, then searching for the largest c that preserves test accuracy after 1-5 epochs (see the sketch after this list).
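
A minimal sketch of that search, assuming a hypothetical helper short_run_accuracy() that trains a fresh model for a few epochs with the given gRDA settings and returns test accuracy (with c = 0, the soft thresholding is turned off and gRDA should behave like plain SGD, so it can serve as the baseline):

def pick_c(candidates=(0.005, 0.002, 0.001, 0.0005), mu=0.51, tol=0.005):
    # baseline: c = 0 disables the soft thresholding, giving SGD-like behavior
    baseline = short_run_accuracy(c=0.0, mu=mu)
    for c in candidates:  # try the largest candidates first
        if short_run_accuracy(c=c, mu=mu) >= baseline - tol:
            return c  # largest c that preserves test accuracy
    return candidates[-1]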

Keras

Suppose the loss function is categorical crossentropy:

from grda import GRDA

opt = GRDA(lr=0.005, c=0.005, mu=0.7)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
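
For context, a minimal end-to-end sketch on toy data (assuming grda.py from this repository is importable; the model and data here are illustrative only, not from the paper):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from grda import GRDA

# toy data standing in for a real dataset
x = np.random.rand(256, 20)
y = to_categorical(np.random.randint(10, size=256), 10)

model = Sequential([Dense(64, activation='relu', input_shape=(20,)),
                    Dense(10, activation='softmax')])
model.compile(optimizer=GRDA(lr=0.005, c=0.005, mu=0.7),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x, y, batch_size=32, epochs=2)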

TensorFlow

import tensorflow as tf

from grda_tensorflow import GRDA

n_epochs = 20
batch_size = 10
batches = 50000 // batch_size  # minibatches per epoch (CIFAR-10 has 50,000 training images)

opt = GRDA(learning_rate=0.005, c=0.005, mu=0.51)
opt_r = opt.minimize(R_loss, var_list=r_vars)  # R_loss / r_vars: your loss tensor and trainable variables

with tf.Session(config=session_conf) as sess:
    sess.run(tf.global_variables_initializer())
    for e in range(n_epochs):
        for b in range(batches):
            # feed the b-th minibatch; train_x / train_y hold the training set
            bx = train_x[b * batch_size:(b + 1) * batch_size]
            by = train_y[b * batch_size:(b + 1) * batch_size]
            sess.run([R_loss, opt_r], feed_dict={data: bx, y: by})
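
Since the point of gRDA is sparsity, one way to check it after training is to count zero entries across the trainable variables. A sketch, run inside the same session (TF1-style API):

import numpy as np

values = sess.run(tf.trainable_variables())  # fetch all trained weights
zeros = sum(int((v == 0).sum()) for v in values)
total = sum(v.size for v in values)
print("sparsity: %.2f%%" % (100.0 * zeros / total))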

PyTorch

See the test file mnist_test_pytorch.py; the essential part is below.

from grda_pytorch import gRDA

optimizer = gRDA(model.parameters(), lr=0.005, c=0.1, mu=0.5)
# in the training loop:
# optimizer.zero_grad()
# loss.backward()
# optimizer.step()

See mnist_test_pytorch.py for an illustration of a customized learning rate schedule; a self-contained sketch is below.
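
A toy sketch of one such schedule, using random data (it assumes gRDA stores lr in param_groups the way standard torch.optim optimizers do; check the repo source if in doubt):

import torch
import torch.nn as nn
from grda_pytorch import gRDA

model = nn.Linear(20, 10)            # toy model for illustration
criterion = nn.CrossEntropyLoss()
optimizer = gRDA(model.parameters(), lr=0.005, c=0.1, mu=0.5)

for epoch in range(30):
    # hand-rolled schedule: decay the learning rate 10x every 10 epochs
    for group in optimizer.param_groups:
        group['lr'] = 0.005 * (0.1 ** (epoch // 10))
    x = torch.randn(32, 20)          # stand-in for a real minibatch
    target = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = criterion(model(x), target)
    loss.backward()
    optimizer.step()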

PlaidML

Be aware that PlaidML can be unstable on macOS when running on the GPU; see plaidml/plaidml#168.

To run, define the softthreshold function in the PlaidML Keras backend file (plaidml/keras):

def softthreshold(x, t):
    x = clip(x, -t, t) * (builtins.abs(x) - t) / t
    return x

In your main file, add the following before importing any other libraries:

import plaidml.keras
plaidml.keras.install_backend()

from grda_plaidml import GRDA

The optimizer can then be used in the same way as in Keras.

Experiments

ResNet-50 on ImageNet

These PyTorch models are based on the official implementation: https://github.com/pytorch/examples/blob/master/imagenet/main.py

| lr schedule | c | mu | epoch | sparsity (%) | top-1 accuracy (%) | file size | link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fixed lr=0.1 (SGD, no momentum or weight decay) | / | / | 89 | / | 68.71 | 98MB | link |
| fixed lr=0.1 | 0.005 | 0.55 | 145 | 91.54 | 69.76 | 195MB | link |
| fixed lr=0.1 | 0.005 | 0.51 | 136 | 87.38 | 70.35 | 195MB | link |
| fixed lr=0.1 | 0.005 | 0.501 | 145 | 86.03 | 70.60 | 195MB | link |
| lr=0.1 (ep1-140), lr=0.01 (after ep140) | 0.005 | 0.55 | 150 | 91.59 | 73.24 | 195MB | link |
| lr=0.1 (ep1-140), lr=0.01 (after ep140) | 0.005 | 0.51 | 146 | 87.28 | 73.14 | 195MB | link |
| lr=0.1 (ep1-140), lr=0.01 (after ep140) | 0.005 | 0.501 | 148 | 86.09 | 73.13 | 195MB | link |
