
optim's People

Contributors

clementfarabet, jucor, koraykv

optim's Issues

luarocks install fails

Trying to install with "luarocks --local install" fails.

It seems to be caused by a missing file: if I create a file "dokmedia/optim/optim" and then run "luarocks --local make optim-1.0.1-0.rockspec", the install succeeds.

ASGD has weight decay built in?

The averaged SGD function implements

x := (1 - lambda * eta_t) * x - eta_t * df/dx(z, x)

which includes L2 weight decay with decay constant lambda. The same lambda also appears in the learning-rate decay function

eta_t = eta0 / (1 + lambda * eta0 * t)^0.75
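
For concreteness, here is a condensed sketch of how those two formulas would look as a torch update step; the variable names are illustrative, not the verbatim asgd.lua source:

    -- condensed sketch: x is the flat parameter tensor, dfdx the gradient,
    -- state.t the iteration counter; hyper-parameters live in state
    x:mul(1 - state.lambda * state.eta_t)   -- weight decay: x := (1 - lambda*eta_t) * x
    x:add(-state.eta_t, dfdx)               -- gradient step: x := x - eta_t * df/dx(z,x)
    -- the same lambda reappears in the learning-rate schedule:
    state.eta_t = state.eta0 / math.pow(1 + state.lambda * state.eta0 * state.t, 0.75)

Written this way, the coupling is visible: the same state.lambda drives both the decay of x and the eta_t schedule.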

The ASGD papers I've read don't seem to require L2 weight decay. Moreover, Xu (2010), "Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent", seems to imply that the lambda term in the learning-rate decay function should be a multiple of the smallest eigenvalue of the Hessian. Looking at Bottou's SGD code, weight decay does not appear in the CRF example, although it does appear in the readme file, from which the torch implementation seems to be derived.

Is the L2 weight decay an essential part of the ASGD implementation? Why is the weight decay constant tied to the learning rate decay function?

Thanks,
Jason

fista.lua: line search condition: possible error?

Hi,

I am currently working my way through the FISTA paper and your implementation, and I noticed a difference in the line-search condition.
By the FISTA paper, I mean the one you cite in your implementation: http://goo.gl/bSuKQ

On page 12 (printed page number 194), in the box describing the FISTA algorithm, the condition is stated as:

F(p_L(y)) <= Q_L(p_L(y),y)

Note the upper-case F, which is defined on page 6 as F(x) = f(x) + g(x).

If I am not mistaken, you only use the lower-case f(x) on line 109 of fista.lua:

if fply <= Q then

There is a comment on that line that I don't quite understand; maybe it explains why g(x) is omitted.
Is this an error, or is there a reason for omitting g(x)?
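
Writing out the definitions, I suspect the two checks might actually be equivalent, since the g term appears on both sides; maybe that is what the comment means:

    % Definitions from the paper (p. 6 and p. 12):
    %   F(x)     = f(x) + g(x)
    %   Q_L(x,y) = f(y) + <x - y, grad f(y)> + (L/2)||x - y||^2 + g(x)
    \begin{align*}
      F(p_L(y)) \le Q_L(p_L(y), y)
        &\iff f(p_L(y)) + g(p_L(y)) \le Q_L(p_L(y), y) \\
        &\iff f(p_L(y)) \le f(y) + \langle p_L(y) - y,\, \nabla f(y) \rangle
              + \tfrac{L}{2} \lVert p_L(y) - y \rVert^2
    \end{align*}

because the g(p_L(y)) term inside Q_L cancels against the one in F. If the Q on line 109 is the quadratic model of f alone (without g), the check would match this cancelled form.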

Thanks in advance!

Best,

Hubert

gradParameters *must* be shared when using optim, but must *not* be shared when using updateParameters()

The methodology required to share parameters between modules is not the same when using the standard Module:updateParameters() as it is when using the update functions of the optim package.

Without optim, only the parameters should be shared; the gradients should be independent.
If you share the gradients as well, the shared gradient buffer for each copy of the parameters accumulates the sum of the gradients over all of the copies. The standard updateParameters() then applies that full update once per copy, so the update of the shared parameters is effectively scaled by the number of copies. The accumulation across shared parameters must happen exactly once, in the parameters themselves, rather than in both the gradients and then again in the parameters.

If you use the optim package, the parameters and gradParameters are first flattened, and in the process shared parameters are coalesced into a single copy. The gradients must be shared both so that the sizes of the flattened parameters and gradParameters match, and because each coalesced set of shared parameters is updated only once; the accumulation across shared copies must therefore happen in the gradients, not in the parameters.

This difference is not obvious, and it does not surface during Jacobian unit-testing, which does not use the optim framework.
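
A minimal sketch of the two conventions, using two tied nn.Linear layers (the module names are illustrative):

    require 'nn'

    local lin  = nn.Linear(10, 10)
    local tied = lin:clone()
    local mlp  = nn.Sequential():add(lin):add(nn.Tanh()):add(tied)

    -- (a) for plain Module:updateParameters(): share only the parameters,
    --     so each copy keeps its own independent gradient buffers
    tied:share(lin, 'weight', 'bias')

    -- (b) for optim: share the gradients as well, so that getParameters()
    --     can coalesce each set of tied weights into one flattened copy
    -- tied:share(lin, 'weight', 'bias', 'gradWeight', 'gradBias')
    -- local x, dfdx = mlp:getParameters()  -- flat views passed to e.g. optim.sgd

Mixing the two conventions produces exactly the double-counting described above.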
