lessw2020 / ranger-deep-learning-optimizer
Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase
License: Apache License 2.0
Hello,
Thank you for your work on these optimizers, by the way. I was testing a couple of them, and RangerVA was performing quite well at first. When your gradient centralization was added I got further improvements, but it also seemed to overfit the training set more easily despite using the same parameters. I therefore tried combining gradient centralization with the RangerVA algorithm, and so far it seems to perform quite well, and faster, since it seems I can use larger batch sizes. Could you quickly check, whenever you have some free time, whether I implemented it correctly in the code below, since you know this optimizer so well?
Best
```python
import math

import torch
from torch.optim.optimizer import Optimizer


class RangerVA(Optimizer):
    def __init__(self, params, lr=1e-3,
                 alpha=0.5, k=6, n_sma_threshhold=5, betas=(.95,0.999),
                 eps=1e-5, weight_decay=0, amsgrad=True, transformer='softplus', smooth=50,
                 grad_transformer='square',use_gc=True, gc_conv_only=False):
        #parameter checks
        if not 0.0 <= alpha <= 1.0:
            raise ValueError(f'Invalid slow update rate: {alpha}')
        if not 1 <= k:
            raise ValueError(f'Invalid lookahead steps: {k}')
        if not lr > 0:
            raise ValueError(f'Invalid Learning Rate: {lr}')
        if not eps > 0:
            raise ValueError(f'Invalid eps: {eps}')
        #prep defaults and init torch.optim base
        defaults = dict(lr=lr, alpha=alpha, k=k, step_counter=0, betas=betas,
                        n_sma_threshhold=n_sma_threshhold, eps=eps, weight_decay=weight_decay,
                        smooth=smooth, transformer=transformer, grad_transformer=grad_transformer,
                        amsgrad=amsgrad,use_gc=use_gc, gc_conv_only=gc_conv_only )
        super().__init__(params,defaults)
        #adjustable threshold
        self.n_sma_threshhold = n_sma_threshhold
        #look ahead params
        self.alpha = alpha
        self.k = k
        #radam buffer for state
        self.radam_buffer = [[None,None,None] for ind in range(10)]
        #gc on or off
        self.use_gc=use_gc
        #level of gradient centralization
        self.gc_gradient_threshold = 3 if gc_conv_only else 1
        print(f"Ranger optimizer loaded. \nGradient Centralization usage = {self.use_gc}")
        if (self.use_gc and self.gc_gradient_threshold==1):
            print(f"GC applied to both conv and fc layers")
        elif (self.use_gc and self.gc_gradient_threshold==3):
            print(f"GC applied to conv layers only")

    def __setstate__(self, state):
        print("set state called")
        super(RangerVA, self).__setstate__(state)

    def step(self, closure=None):
        loss = None
        #Evaluate averages and grad, update param tensors
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data.double()
                if grad.is_sparse:
                    raise RuntimeError('Ranger optimizer does not support sparse gradients')
                amsgrad = group['amsgrad']
                smooth = group['smooth']
                grad_transformer = group['grad_transformer']
                p_data_fp32 = p.data.double()
                state = self.state[p] #get state dict for this param
                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p_data_fp32)
                    state['exp_avg_sq'] = torch.zeros_like(p_data_fp32)
                    if amsgrad:
                        # Maintains max of all exp. moving avg. of sq. grad. values
                        state['max_exp_avg_sq'] = torch.zeros_like(p.data)
                    #look ahead weight storage now in state dict
                    state['slow_buffer'] = torch.empty_like(p.data)
                    state['slow_buffer'].copy_(p.data)
                else:
                    state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
                    state['exp_avg_sq'] = state['exp_avg_sq'].type_as(p_data_fp32)
                #begin computations
                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']
                if amsgrad:
                    max_exp_avg_sq = state['max_exp_avg_sq']
                    # Maintains the maximum of all 2nd moment running avg. till now
                    torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
                    # Use the max. for normalizing running avg. of gradient
                    denomc = max_exp_avg_sq.clone()
                else:
                    denomc = exp_avg_sq.clone()
                #GC operation for Conv layers and FC layers
                if grad.dim() > self.gc_gradient_threshold:
                    grad.add_(-grad.mean(dim = tuple(range(1,grad.dim())), keepdim = True))
                state['step'] += 1
                #compute variance mov avg
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
                #compute mean moving avg
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                buffered = self.radam_buffer[int(state['step'] % 10)]
                if state['step'] == buffered[0]:
                    N_sma, step_size = buffered[1], buffered[2]
                else:
                    buffered[0] = state['step']
                    beta2_t = beta2 ** state['step']
                    N_sma_max = 2 / (1 - beta2) - 1
                    N_sma = N_sma_max - 2 * state['step'] * beta2_t / (1 - beta2_t)
                    buffered[1] = N_sma
                    if N_sma > self.n_sma_threshhold:
                        step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
                    else:
                        step_size = 1.0 / (1 - beta1 ** state['step'])
                    buffered[2] = step_size
                ##transformer
                if grad_transformer == 'square':
                    grad_tmp = grad**2
                    denomc.sqrt_()
                elif grad_transformer == 'abs':
                    grad_tmp = grad.abs()
                exp_avg_sq.mul_(beta2).add_((1 - beta2)*grad_tmp)
                if group['weight_decay'] != 0:
                    p_data_fp32.add_(-group['weight_decay'] * group['lr'], p_data_fp32)
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
                # ...let's use calibrated alr
                if N_sma > self.n_sma_threshhold:
                    if group['transformer'] =='softplus':
                        sp = torch.nn.Softplus( smooth)
                        denomf = sp( denomc)
                        p_data_fp32.addcdiv_(-step_size, exp_avg, denomf )
                    else:
                        denom = exp_avg_sq.sqrt().add_(group['eps'])
                        p_data_fp32.addcdiv_(-step_size * group['lr'], exp_avg, denom)
                else:
                    p_data_fp32.add_(-step_size * group['lr'], exp_avg)
                p.data.copy_(p_data_fp32)
                #integrated look ahead...
                #we do it at the param level instead of group level
                if state['step'] % group['k'] == 0:
                    slow_p = state['slow_buffer'] #get access to slow param tensor
                    slow_p.add_(self.alpha, p.data - slow_p) #(fast weights - slow weights) * alpha
                    p.data.copy_(slow_p) #copy interpolated weights to RAdam param tensor
        return loss
```
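For readers who want to look at the gradient centralization step from the code above in isolation, here is a minimal sketch. The function name `centralize_gradient` and the `min_dims` parameter are illustrative assumptions, not part of Ranger; the body mirrors the `grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)` expression used in the post.

```python
import torch


def centralize_gradient(grad: torch.Tensor, min_dims: int = 1) -> torch.Tensor:
    """Subtract the per-output-unit mean from a gradient tensor.

    Hypothetical helper: tensors with at most `min_dims` dimensions
    (for example biases) are returned unchanged.
    """
    if grad.dim() > min_dims:
        # Mean over every dimension except the first (output) dimension.
        mean = grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
        return grad - mean
    return grad


# A fake fully connected weight gradient with 2 output units and 3 inputs.
weight_grad = torch.tensor([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]])
centered = centralize_gradient(weight_grad)

# A 1-D bias gradient is left untouched with min_dims=1.
bias_grad = torch.tensor([0.5, -0.5])
bias_out = centralize_gradient(bias_grad)
```

After centralization each row of `centered` sums to zero, which is the property the technique relies on. Setting `min_dims=3` would correspond to the `gc_conv_only=True` option in the post, under the assumption that only 4-D convolution weights should be centralized.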
I just released this code on PyPI. It's called asranger (ranger was taken), so it can be installed with pip install asranger and can be made a hard dependency by other projects on PyPI.
The corresponding code is on my fork.
Hi,
Today I had a relatively long debugging session after upgrading my PyTorch Lightning installation, because training_step was no longer being called.
It finally turned out that the "closure" argument is not used in the step function (the relevant lines are commented out, as also noted in the source code).
However, since the closure is apparently required by some libraries and is also recommended by the official PyTorch guidelines, it would be great if the documentation made clear that people might need to enable these lines.
Thanks in advance.
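To illustrate the closure contract described in this report, here is a minimal sketch of an optimizer whose `step` evaluates the closure when one is passed. `ClosureAwareSGD` is a hypothetical toy class written for this example, not Ranger's code; only the `closure` handling is the point.

```python
import torch
from torch.optim.optimizer import Optimizer


class ClosureAwareSGD(Optimizer):
    """Toy plain-SGD optimizer (hypothetical) that honors `closure`."""

    def __init__(self, params, lr=0.1):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure runs forward and backward, so gradients
            # must be enabled while it executes.
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                p.add_(p.grad, alpha=-group['lr'])
        return loss


w = torch.nn.Parameter(torch.tensor([1.0]))
opt = ClosureAwareSGD([w], lr=0.1)


def closure():
    opt.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    return loss


loss = opt.step(closure)
```

A framework that relies on the closure to run the training step will only execute it if `step` actually calls it, which would explain the symptom reported above when the call is commented out.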
Hi There!
Thanks for putting together this code for the Rectified Adam with Lookahead optimizer. I used this optimizer with fastai and trained my model successfully.
I exported the model using
feature = 'silhouette'
learn.export(f'{feature}_efficientnet-b3.pkl')
and later during inference I am trying to load the learner using
from ranger import Ranger
feature = 'silhouette'
learn = load_learner(path = model_path, file = f'{feature}_efficientnet-b3.pkl')
I have defined the model path properly in the previous cells, but for some reason I cannot load the learner: the module ranger.ranger cannot be located. Can someone please help me fix this issue?
Here's a screenshot of the error for your reference.
Thanks & Regards,
Vinayak.
Hi,
Thanks for your work.
I just plugged it into my model and found that step_counter was not set for all param_groups.
I fixed it with this hack:
    # look ahead tracking and updating if latest batch = k
    for group, slow_weights in zip(self.param_groups, self.slow_weights):
        if 'step_counter' not in group:
            group["step_counter"] = 0
but I suspect it's not optimal.
This would mean that self.param_groups changed between the constructor and step(), but I have no idea why. Have you seen something similar before?
Thanks
Recent transformer architectures are very popular in NLP: BERT, GPT-2, RoBERTa, XLNET. Did you try to fine-tune them on some NLP task? If so, what were the best Ranger hyper-parameters and learning rate scheduler?
Getting the following error from the most recent version:
AttributeError: 'Ranger' object has no attribute 'radam_buffer'
I want to test your Ranger optimizer. I only know cosine annealing training; can you tell me what "flat" means? Thanks.
Your optimizer looks like a big achievement!
I used "optimizer=Ranger(lr=0.001)" in Keras,
but I get an error: "TypeError: __init__() missing 1 required positional argument: 'params'".
I don't know how to debug it. Can you help me?
Actually, there is a pip package but it is based out of a fork of this repo. I think it would make sense to collate this effort to the main repo.
Originally posted by @sarthakpati in #33 (comment)
Hello. First of all, thank you for sharing code and experiment results.
Reading the code, I found that the model uses the fast weights for inference. According to LookAhead, fast weights (before synchronization) may perform worse than slow weights. With probability (1-1/k) (80% when k=5), we will validate/test with unsynchronized fast weights. Therefore, it should be better to synchronize manually before evaluation.
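One way to act on this suggestion is sketched below. `sync_lookahead` is a hypothetical helper, not part of Ranger; it assumes the slow weights are stored per parameter under the state key `slow_buffer`, as in the RangerVA code posted earlier on this page, and it applies the same interpolation as the in-optimizer LookAhead update. A stock `torch.optim.SGD` with a hand-filled state stands in for Ranger in the demo.

```python
import torch


@torch.no_grad()
def sync_lookahead(optimizer, alpha=0.5):
    """Hypothetical helper: force a LookAhead synchronization now.

    Assumes each parameter's slow weights live in
    optimizer.state[p]['slow_buffer']; parameters without that key
    are skipped.
    """
    for group in optimizer.param_groups:
        for p in group['params']:
            slow = optimizer.state.get(p, {}).get('slow_buffer')
            if slow is None:
                continue
            # slow <- slow + alpha * (fast - slow), then fast <- slow
            slow.add_(p.data - slow, alpha=alpha)
            p.data.copy_(slow)


# Demo with a stand-in optimizer and a manually created slow buffer.
w = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.SGD([w], lr=0.1)
opt.state[w]['slow_buffer'] = torch.tensor([0.0])

sync_lookahead(opt, alpha=0.5)
```

Note that calling this in the middle of training changes the weights that training resumes from, so it may be preferable to back up the fast weights first and restore them after evaluation.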
In my recent paper I used Ranger. I wish to give the author(s) all the credit they deserve, but I'm not sure how to cite it properly. Currently I cite the Medium article. Should I cite this GitHub repo instead? Thanks.
Hi all,
My colleague and I tried a combination of a (relatively) large Ranger learning rate (say, 0.001) and a large weight decay (say, 0.1). It seems the large decay leads to better performance. We tried two different models and observed a 0.5-1.5% increase in ImageNet classification accuracy, but both were customized models, not standard ones like Resnet.
Not sure whether anyone else has found similar results.
I tried Ranger vs AdamW on single-GPU and 8-GPU setups. While Ranger is better on a single GPU, it performs worse in the DDP setup. Any advice?
Hi,
I have a dream and I'll try to share it with you.
But before explaining further, I'd like you to analyze the following and tell me what you think about it!
Small rant on the inertia of AI research
First of all, thank you for advancing progress in deep learning.
I'm just a random guy who wants to implement an AGI (lol) and, like many NLP engineers, I need HIGHLY accurate neural networks for fundamental NLP tasks (e.g. POS tagging, NER, dependency parsing, coreference resolution, WSD, etc.).
They are all not very accurate (often below 95% F1 score) and their errors add up.
Such limitations make NLP not yet suitable for many things.
This is why improving the state of the art (which can be observed on paperswithcode.com) is a crucial priority for academics.
Effectively, many researchers have smart ideas to improve the state of the art and often slightly improve it by:
Having a "standard neural network" for the task and mix with it their new fancy idea.
I talk from knowledge, I've read most papers from state of the art leaderboards from most fundamental NLP tasks.
Almost always they have this common baseline + one idea, theirs.
The common baseline sometimes slowly evolves (e.g. now it's often a pre-trained model (say BERT) + fine-tuning + their idea).
Sorry to say, but "this" seems misguided to me,
where "this" means the fact that, by far, most researchers work in isolation, not integrating others' ideas (or doing so only very slowly).
I would have wished that the state of the art in one NLP task would be a combination of e.g. 50 innovative and complementary ideas from researchers.
You are researchers; do you have an idea why that is the case? If someone actually tried to merge all good, complementary and compatible ideas, would they have the best, unmatchable state of the art?
Why don't facebookresearch, Microsoft and Google pick the low-hanging fruit: in addition to producing X new shiny ideas per month, actually try to merge them in a coherent, synergetic manner?
I would like you to tell me what you think of this major issue that slows AI progress.
As an example of such inertia let's talk about Swish, Mish or RAdam :
Those things are incredibly easy to try and see "hey does it give to my neural network free accuracy gains?"
Yet no paper on the state-of-the-art leaderboards has tried Swish, Mish or RAdam, despite them being so simple to try (you don't need to change the neural network).
Not even the pre-trained models that so many papers depend on have tried them (I opened issues for each of them).
Once I know what you think about this research inertia, I'll explain my vision of what needs to be done to fix it.
It would be very helpful if you could provide an implementation in Keras.
Thank you for your excellent work.
I noticed that the best results with the Ranger optimizer use a flat learning rate for 75% of training. Does this mean the Ranger optimizer is not sensitive to lr?
Looking forward to your reply.
Just a warning to the curious, I tried to train DCCRN (from https://github.com/mpariente/asteroid) with Ranger2020 (default params) and it was stuck at a large loss after less than 1 epoch, and loss did not improve for another 30 epochs. I did not debug further. Adam with default params works very well.
I found that step_size is too high in the initial 5 steps.
The problem is in the code:
    if N_sma >= self.N_sma_threshhold:
        step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
    else:
        step_size = 1.0 / (1 - beta1 ** state['step'])
If betas are set to (0.9, 0.999), the internal variables change as follows:
state['step'] | step_size
--------------|-----------
 1            | 10
 2            | 5.26315789
 3            | 3.6900369
 4            | 2.90782204
 5            | 2.44194281
 6            | 0.00426327
 7            | 0.00524248
 8            | 0.00607304
 9            | 0.00681674
10            | 0.00750596
Note that step_size doesn't depend on the gradient value and that it scales the learning rate.
Thus RAdam aggressively moves the weights away from their initial values, even if they have a good initialization.
Would it be better to set step_size to 0 if N_sma < self.N_sma_threshhold?
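The table in this report can be reproduced with a few lines of plain Python. The sketch below copies the formula quoted above into a standalone function; the name `radam_step_size` and its default arguments (betas of (0.9, 0.999), threshold 5 with a `>=` comparison) are assumptions taken from the report, not from Ranger's source.

```python
import math


def radam_step_size(step, beta1=0.9, beta2=0.999, n_sma_threshold=5):
    """Learning-rate multiplier used by the quoted RAdam snippet."""
    beta2_t = beta2 ** step
    n_sma_max = 2 / (1 - beta2) - 1
    n_sma = n_sma_max - 2 * step * beta2_t / (1 - beta2_t)
    if n_sma >= n_sma_threshold:
        # Rectified branch: variance of the adaptive term is tractable.
        rect = math.sqrt(
            (1 - beta2_t)
            * (n_sma - 4) / (n_sma_max - 4)
            * (n_sma - 2) / n_sma
            * n_sma_max / (n_sma_max - 2)
        )
        return rect / (1 - beta1 ** step)
    # Warm-up branch: only the first-moment bias correction is applied.
    return 1.0 / (1 - beta1 ** step)


for step in range(1, 11):
    print(f"{step:2d} | {radam_step_size(step):.8f}")
```

The jump between steps 5 and 6 comes from switching branches. In the rectified branch the value is later divided by the square root of the second-moment estimate, while in the warm-up branch it multiplies the raw momentum directly, so the two columns are not on the same scale.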
https://paperswithcode.com/paper/adas-adaptive-scheduling-of-stochastic
Could it beat rangerLars?
For AdamW, people usually add some sort of learning rate decay: linear, cosine, triangle, etc. Warm-up steps are also popular.
Do we need all of these with Ranger, or can we just use a fixed learning rate?
I'm using NVIDIA Apex and torch grad norm.
This is the grad norm plot with Ranger (red) and AdamW (blue).
https://i.imgur.com/Ui4Sioo.png
Is it OK to have such huge grad norm values? Should I turn off grad norming?
Hi and thanks for the code!
I am using your script for my code and while adapting it to PEP8 specs I found a few details that you may want to change. These are style changes that add clarity, but of course it is up to you whether to adhere to PEP8 recommendations or not. I could prepare a pull request as well if you like this style.
required (from torch.optim.optimizer) is not used
itertools is not used
N_sma_threshhold <-- the variable name should not begin with a capital letter; also, "threshhold" is a typo for "threshold"
k <-- is an important variable with an obscure name; perhaps something like "lookahead_steps" would be clearer?
Multi-line comments should use """comment"""
Normal comments need a space after #
betas=(.95,0.999) (and others) need a space after the comma
You have commented code that is not used, perhaps it would be best to remove it altogether.
Spaces are not consistent and don't agree with PEP8.
The self.slow_weights variables are always on the CPU.
You can easily fix this by adding a .to() method to the Ranger class, like so:
    def to(self, device):
        # compare strings with ==, not "is"
        if device == "cuda":
            for i in range(len(self.slow_weights)):
                for j, w in enumerate(self.slow_weights[i]):
                    self.slow_weights[i][j] = w.cuda()
        elif device == "cpu":
            for i in range(len(self.slow_weights)):
                for j, w in enumerate(self.slow_weights[i]):
                    self.slow_weights[i][j] = w.cpu()
TypeError: mean() received an invalid combination of arguments - got (keepdim=bool, dim=tuple, ), but expected one of:
@lessw2020 Thanks for this awesome optimizer. I'm very excited about it!
There is one particular workload that trains using a batch of 1 item.
Theoretically, does it make sense to use RAdam (Rectified Adam), LookAhead, and GC in this context?
I've been thinking about it and have read the papers, but I still could not reach a conclusion. As you (or any other person here) are much more experienced than me, do you have an opinion on this?
To save: 'optimizer' : optimizer.state_dict()
To load: optimizer.load_state_dict(checkpoint['optimizer'])
However, I have the impression that restarting the training always brings the accuracy down before it recovers.
Best,
Thomas Chaton
I am working on a transformer now.
I saw issue #13, but no one there has reported a better result than AdamW yet.
Has anyone already made Ranger work well when fine-tuning a transformer?
Also, I do not understand the Readme: 'Best training results - use a 75% flat lr, then step down and run lower lr for 25%, or cosine descend last 25%.'
I use a 1e-4 lr now; what is the '75% flat lr'?
What is 'lower lr for 25%'?
Could you show me some demo code for adjusting the lr, apart from the code that initializes Ranger?
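One reading of the README sentence quoted above is: keep the learning rate constant for the first 75% of training steps, then anneal it with a cosine curve over the remaining 25%. The sketch below implements that reading as a plain multiplier function; `flat_then_cosine`, `flat_fraction` and the choice to anneal all the way to zero are assumptions made for illustration, not the repository's own scheduler.

```python
import math


def flat_then_cosine(step, total_steps, flat_fraction=0.75):
    """Return an LR multiplier in [0, 1] for the given step.

    Hypothetical schedule: 1.0 during the flat phase, then a half
    cosine from 1.0 down to 0.0 over the remaining steps.
    """
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return 1.0
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


base_lr = 1e-4
total_steps = 100
lrs = [base_lr * flat_then_cosine(s, total_steps) for s in range(total_steps)]
```

With PyTorch, a function like this could be passed to `torch.optim.lr_scheduler.LambdaLR` as `lr_lambda=lambda s: flat_then_cosine(s, total_steps)`, with `scheduler.step()` called once per training step.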
I want to cite Ranger in a Medium article and I would like to know if there is an arXiv publication of Ranger or a peer-reviewed paper published at a conference or in a journal.
I saw you linked a paper in the README.md, but it does not seem to be about Ranger, as the word itself does not appear anywhere in it. I know the RAdam and Lookahead papers, but the Ranger one is missing from my library. Thanks
Hi, I was trying to save the model checkpoints after each epoch using the code below, but only the state dictionary of the zeroth epoch got stored and none of the others. Does the Ranger optimizer object support state_dict? If yes, how can I save it after each epoch?
out_model = os.path.join(args.model_dir, 'model.th')
with open(out_model, 'wb') as f:
    torch.save(model.state_dict(), f)
print("Model is dumped")
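As a sketch of per-epoch checkpointing that also stores the optimizer state: any `torch.optim.Optimizer` subclass inherits `state_dict()` and `load_state_dict()`, so the pattern below should apply to Ranger as well, although here a stock `torch.optim.SGD` stands in for it. The helper name `save_checkpoint`, the file naming scheme and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

import torch


def save_checkpoint(model, optimizer, epoch, model_dir):
    """Hypothetical helper: write one checkpoint file per epoch."""
    path = os.path.join(model_dir, f"model_epoch{epoch}.th")
    torch.save(
        {
            'epoch': epoch,
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        },
        path,
    )
    return path


model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # stand-in for Ranger
model_dir = tempfile.mkdtemp()

# Including the epoch in the filename keeps earlier checkpoints around.
paths = [save_checkpoint(model, optimizer, epoch, model_dir) for epoch in range(3)]

# Restoring both model and optimizer state from the latest checkpoint.
checkpoint = torch.load(paths[-1])
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```

Writing to a fixed name such as `model.th` would instead leave a single file that is overwritten on every save.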
Thank you for the great implementation.
I think I found a small part to modify at ranger.py line 116.
original code:
if N_sma > N_sma_threshhold:
suggested change:
if N_sma > self.N_sma_threshhold:
I get the following warning when using ranger with pytorch 1.6.0
/path/Ranger-Deep-Learning-Optimizer/ranger/ranger.py:138: UserWarning: This overload of addcmul_ is deprecated:
addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
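The warning points at the positional-scalar overload. A minimal sketch of the same two moving-average updates written with the keyword signatures that the message recommends is shown below; the tensor values are made up for the demonstration.

```python
import torch

beta1, beta2 = 0.95, 0.999
grad = torch.tensor([1.0, -2.0])
exp_avg = torch.zeros(2)
exp_avg_sq = torch.zeros(2)

# Deprecated form: exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

# Deprecated form: exp_avg.mul_(beta1).add_(1 - beta1, grad)
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
```

Both calls compute the same values as the older overloads; only the argument order and the keyword names (`value` for `addcmul_`, `alpha` for `add_`) differ.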
Does Ranger support Stochastic Weight Averaging?
Why is Ranger only available on GitHub and not as a pip package? Wouldn't that make it easier for the community to actually use it?
You first have
if N_sma > self.N_sma_threshhold:
and then you have
if N_sma > 4:
Is it right that the second one is constant or should that also be N_sma_threshhold parameter?
Would you like to make this a python package that could be installed with pip? It would be more practical.
I'd like to include it in my repo asteroid and give you proper credit for it.
One way is to install a python package (I can make a PR for that), the other one would be to copy-paste some of the code and point to the license file. Which way would you prefer?