lessw2020 / ranger-deep-learning-optimizer

Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase

License: Apache License 2.0

Python 100.00%

ranger-deep-learning-optimizer's Issues

RangerVA with GC

Hello,

Thank you for your work on these optimizers. I was testing a couple of them and originally got quite good results with RangerVA. When your gradient centralization was added I saw further improvements, but the model also seemed to overfit the training set more easily despite using the same parameters. I therefore tried combining gradient centralization with the RangerVA algorithm, and so far it performs quite well and trains faster, since it seems I can use larger batch sizes. Whenever you have some free time, could you quickly check whether I implemented it correctly in the code below, since you know this optimizer so well?

Best

import math

import torch
from torch.optim.optimizer import Optimizer


class RangerVA(Optimizer):

def __init__(self, params, lr=1e-3, 
             alpha=0.5, k=6, n_sma_threshhold=5, betas=(.95,0.999), 
             eps=1e-5, weight_decay=0, amsgrad=True, transformer='softplus', smooth=50,
             grad_transformer='square',use_gc=True, gc_conv_only=False):
    #parameter checks
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f'Invalid slow update rate: {alpha}')
    if not 1 <= k:
        raise ValueError(f'Invalid lookahead steps: {k}')
    if not lr > 0:
        raise ValueError(f'Invalid Learning Rate: {lr}')
    if not eps > 0:
        raise ValueError(f'Invalid eps: {eps}')

    #prep defaults and init torch.optim base
    defaults = dict(lr=lr, alpha=alpha, k=k, step_counter=0, betas=betas, 
                    n_sma_threshhold=n_sma_threshhold, eps=eps, weight_decay=weight_decay,
                    smooth=smooth, transformer=transformer, grad_transformer=grad_transformer,
                   amsgrad=amsgrad,use_gc=use_gc, gc_conv_only=gc_conv_only )
    super().__init__(params,defaults)

    #adjustable threshold
    self.n_sma_threshhold = n_sma_threshhold   

    #look ahead params
    self.alpha = alpha
    self.k = k 

    #radam buffer for state
    self.radam_buffer = [[None,None,None] for ind in range(10)]
    
    #gc on or off
    self.use_gc=use_gc
    #level of gradient centralization
    self.gc_gradient_threshold = 3 if gc_conv_only else 1
    print(f"Ranger optimizer loaded. \nGradient Centralization usage = {self.use_gc}")
    if (self.use_gc and self.gc_gradient_threshold==1):
        print(f"GC applied to both conv and fc layers")
    elif (self.use_gc and self.gc_gradient_threshold==3):
        print(f"GC applied to conv layers only")


def __setstate__(self, state):
    print("set state called")
    super(RangerVA, self).__setstate__(state)


def step(self, closure=None):
    loss = None
    #Evaluate averages and grad, update param tensors
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data.double()
            if grad.is_sparse:
                raise RuntimeError('Ranger optimizer does not support sparse gradients')
            
            amsgrad = group['amsgrad']
            smooth = group['smooth']
            grad_transformer = group['grad_transformer']

            p_data_fp32 = p.data.double()

            state = self.state[p]  #get state dict for this param

            if len(state) == 0:   
                state['step'] = 0
                state['exp_avg'] = torch.zeros_like(p_data_fp32)
                state['exp_avg_sq'] = torch.zeros_like(p_data_fp32)
                if amsgrad:
                    # Maintains max of all exp. moving avg. of sq. grad. values
                    state['max_exp_avg_sq'] = torch.zeros_like(p.data)                    

                #look ahead weight storage now in state dict 
                state['slow_buffer'] = torch.empty_like(p.data)
                state['slow_buffer'].copy_(p.data)

            else:
                state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
                state['exp_avg_sq'] = state['exp_avg_sq'].type_as(p_data_fp32)
                                  

            #begin computations 
            exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
            beta1, beta2 = group['betas']
            if amsgrad:
                max_exp_avg_sq = state['max_exp_avg_sq']  
                # Maintains the maximum of all 2nd moment running avg. till now
                torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
                # Use the max. for normalizing running avg. of gradient
                denomc = max_exp_avg_sq.clone()
            else:
                denomc = exp_avg_sq.clone()
            #GC operation for Conv layers and FC layers       
            if grad.dim() > self.gc_gradient_threshold:                    
                grad.add_(-grad.mean(dim = tuple(range(1,grad.dim())), keepdim = True))

            state['step'] += 1              

            #compute variance mov avg
            exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
            #compute mean moving avg
            exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
            buffered = self.radam_buffer[int(state['step'] % 10)]
            if state['step'] == buffered[0]:
                N_sma, step_size = buffered[1], buffered[2]
            else:
                buffered[0] = state['step']
                beta2_t = beta2 ** state['step']
                N_sma_max = 2 / (1 - beta2) - 1
                N_sma = N_sma_max - 2 * state['step'] * beta2_t / (1 - beta2_t)
                buffered[1] = N_sma
                if N_sma > self.n_sma_threshhold:
                    step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
                else:
                    step_size = 1.0 / (1 - beta1 ** state['step'])
                buffered[2] = step_size

            
            ##transformer
            if grad_transformer == 'square':
                grad_tmp = grad**2
                denomc.sqrt_() 
            elif grad_transformer == 'abs':
                grad_tmp = grad.abs()


            exp_avg_sq.mul_(beta2).add_((1 - beta2)*grad_tmp)

            if group['weight_decay'] != 0:
                p_data_fp32.add_(p_data_fp32, alpha=-group['weight_decay'] * group['lr'])
            bias_correction1 = 1 - beta1 ** state['step']
            bias_correction2 = 1 - beta2 ** state['step']
            step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1                

            
            # ...let's use calibrated alr 
            if N_sma > self.n_sma_threshhold:
                if  group['transformer'] =='softplus':
                    sp = torch.nn.Softplus( smooth)
                    denomf = sp( denomc)
                    p_data_fp32.addcdiv_(exp_avg, denomf, value=-step_size)
                else:
                    denom = exp_avg_sq.sqrt().add_(group['eps'])
                    p_data_fp32.addcdiv_(exp_avg, denom, value=-step_size * group['lr'])
            else:
                p_data_fp32.add_(exp_avg, alpha=-step_size * group['lr'])
            p.data.copy_(p_data_fp32)

            #integrated look ahead...
            #we do it at the param level instead of group level
            if state['step'] % group['k'] == 0:
                slow_p = state['slow_buffer'] #get access to slow param tensor
                slow_p.add_(p.data - slow_p, alpha=self.alpha)  #(fast weights - slow weights) * alpha
                p.data.copy_(slow_p)  #copy interpolated weights to RAdam param tensor

    return loss

ranger and cosine annealing LR leads to different schedule than SGD optimizer? o_O

Released on PyPI

I just released this code on PyPI. It's called asranger (ranger was taken).
So it can be installed with pip install asranger and can be made a hard dependency by other projects on PyPI.
The corresponding code is on my fork.

Please note in the documentation (or in the constructor) that closures must be enabled

Hi,

Today I had a relatively long debugging session: after upgrading my PyTorch Lightning installation, training_step was no longer being called.

It finally turned out that the problem was that the "closure" argument is not used in the step function (it is commented out, as also noted in the source code).

However, since it is apparently required by some libraries and is also recommended by the official PyTorch guidelines, it would be great if the documentation stated more clearly that people might need to enable these lines.

Thanks in advance.
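
For context, PyTorch's `Optimizer.step` convention accepts an optional `closure` that re-evaluates the model and returns the loss, and a library that passes one relies on the optimizer calling it. A minimal, framework-free sketch of that pattern follows; the `ToyOptimizer` class and its plain-float parameters are hypothetical stand-ins, not Ranger's code.

```python
class ToyOptimizer:
    """Hypothetical stand-in for an optimizer; parameters are plain floats."""

    def __init__(self, params, grads, lr=0.1):
        self.params = params  # list of floats, updated in place
        self.grads = grads    # list of floats, filled in by the closure
        self.lr = lr

    def step(self, closure=None):
        loss = None
        # Honor the closure: it recomputes the loss (and gradients) when given.
        if closure is not None:
            loss = closure()
        for i, g in enumerate(self.grads):
            self.params[i] -= self.lr * g
        return loss


params, grads = [2.0], [0.0]
calls = []


def closure():
    # Loss is params[0] ** 2, so the gradient is 2 * params[0].
    calls.append(1)
    grads[0] = 2.0 * params[0]
    return params[0] ** 2


opt = ToyOptimizer(params, grads)
loss = opt.step(closure)
print(loss, params)
```

If `step` ignored the closure here, the gradient would never be computed and the parameter would not move, which is the same kind of silent failure described above.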

cannot load trained model using Ranger

Hi There!

Thanks for putting together this code for Rectified Adam with Lookahead optimizer. I used this optimization function to train my model with fastai and successfully trained the model.

I exported the model using

feature = 'silhouette'
learn.export(f'{feature}_efficientnet-b3.pkl')

and later during inference I am trying to load the learner using

from ranger import Ranger
feature = 'silhouette'
learn = load_learner(path = model_path, file = f'{feature}_efficientnet-b3.pkl')

I have defined the model path properly in the previous cells. But for some reason, I cannot load the learner. The file cannot locate the module ranger.ranger. Can someone please help me fix this issue?

Here's a screenshot of the error for your reference.

Thanks & Regards,
Vinayak.
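
One possible cause, offered as an assumption rather than a diagnosis: `learn.export` pickles the learner, and a pickle stores only the import path of each class, not its code. Loading therefore needs the same module path (here `ranger.ranger`) to be importable. A small standard-library sketch of that behaviour, using a hypothetical stand-in class:

```python
import pickle


class Ranger:
    """Hypothetical stand-in; the real optimizer lives in the ranger package."""


payload = pickle.dumps(Ranger())

# The payload records where to import the class from, not the class body.
module_path = Ranger.__module__
stored_by_reference = module_path.encode() in payload and b"Ranger" in payload
print(module_path, stored_by_reference)
```

If that is the cause, making `ranger` importable under the same path that was used at training time (for example by installing the package or adding the repository to `sys.path`) should let `load_learner` resolve it.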

step_counter not set

Hi,
thanks for your work.

I just plugged it into my model and found that step_counter was not set for all param_groups.

I fixed it with this hack:

        #look ahead tracking and updating if latest batch = k
        for group,slow_weights in zip(self.param_groups,self.slow_weights):
            if 'step_counter' not in group:
                group["step_counter"] = 0

but I suspect it's not optimal...
this would mean that self.param_groups changed between the constructor and step(), but I have no idea why. Have you seen something similar before?

Thanks
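
A slightly tidier form of the same guard, sketched with plain dictionaries standing in for `self.param_groups` (the group contents are made up for illustration): `dict.setdefault` initialises the counter only for groups that lack it.

```python
# Plain dicts standing in for optimizer param groups (hypothetical contents).
param_groups = [
    {"lr": 1e-3, "step_counter": 7},  # group that already tracks steps
    {"lr": 1e-4},                     # group missing the key
]

for group in param_groups:
    # Leaves an existing counter untouched, adds 0 where it is absent.
    group.setdefault("step_counter", 0)
    group["step_counter"] += 1

counters = [g["step_counter"] for g in param_groups]
print(counters)  # [8, 1]
```

This does not explain why the key went missing; it only keeps the workaround to one line per group.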

Has anyone tested this composed optimizer?

Did you try to fine-tune transformers LM with Ranger?

Recent transformer architectures are very popular in NLP: BERT, GPT-2, RoBERTa, XLNET. Did you try to fine-tune them on some NLP task? If so, what were the best Ranger hyper-parameters and learning rate scheduler?

AttributeError: 'Ranger' object has no attribute 'radam_buffer'

Getting the following error from the most recent version:

AttributeError: 'Ranger' object has no attribute 'radam_buffer'

flat+ cosine anneal training curve

I want to test your Ranger optimizer, but I only know cosine annealing training. Can you tell me what "flat" means? Thanks.

How to use ranger in keras? Please help me.

Your optimizer looks like a big achievement!
I used "optimizer=Ranger(lr=0.001)" in Keras,
but I get the error "TypeError: __init__() missing 1 required positional argument: 'params'".
I don't know how to debug it. Can you help me?

The results I tested on the cifar10 dataset are as follows. Ranger's results look strange

Collate pip package so that it picks up from main repo.

Actually, there is a pip package but it is based out of a fork of this repo. I think it would make sense to collate this effort to the main repo.

Originally posted by @sarthakpati in #33 (comment)

Add manual synchronization function

Hello. First of all, thank you for sharing code and experiment results.
Reading the code, I found that the model uses the fast weights for inference. According to LookAhead, fast weights (before synchronization) may perform worse than slow weights. With probability (1-1/k) (80% when k=5), we will validate/test with unsynchronized fast weights. Therefore, it would be better to synchronize manually before evaluation.
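
A rough sketch of both points, using plain floats in place of tensors. The function name `sync_lookahead` is hypothetical and not part of this repository; it is one possible interpretation of "manual synchronization" that reuses the interpolation `slow += alpha * (fast - slow)` from the optimizer.

```python
def unsynced_fraction(k, total_steps):
    """Fraction of steps after which fast weights have not just been synced."""
    unsynced = sum(1 for step in range(1, total_steps + 1) if step % k != 0)
    return unsynced / total_steps


def sync_lookahead(fast, slow, alpha=0.5):
    """Hypothetical manual sync: pull slow weights toward fast, then copy back."""
    for i in range(len(fast)):
        slow[i] += alpha * (fast[i] - slow[i])
        fast[i] = slow[i]
    return fast


print(unsynced_fraction(k=5, total_steps=1000))  # 0.8

fast, slow = [1.0, 3.0], [0.0, 1.0]
sync_lookahead(fast, slow)
print(fast)  # [0.5, 2.0]
```

An alternative design would copy the slow weights into the model only for evaluation and restore the fast weights afterwards, so that training continues unchanged.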

How to cite Ranger in a paper?

In my recent paper I used Ranger. I wish to give all the credit the author(s) deserves, but I'm not sure how to properly cite it? Currently I cited the medium article. Should I cite this github repo instead? Thanks.

larger learning rate + large weight decay performs better?

Hi all,
My colleague and I tried a combination of (relatively) large Ranger learning rate (say, 0.001) + large weight decay (say, 0.1). Seems the large decay leads to better performance? We tried two different models, and observed 0.5-1.5% increase of ImageNet classification accuracy, but both models were customized models, and not standard ones like Resnet.
Not sure whether anyone else finds similar results.

Ranger and pytorch DDP

I tried Ranger vs AdamW on single-GPU and 8-GPU setups. While Ranger is better on a single GPU, it performs worse on the DDP setup. Any advice?

Let's revolutionize the AI research field

Hi,
I have a dream and I'll try to share it to you.

But before explaining further, I'll need your brain to analyze this input and output me what you think about it!

Small rant on the inertia of AI research

First of all, thank you for advancing progress in deep learning.

I'm just a random guy who wants to implement an AGI (lol), and like many NLP engineers, I need HIGHLY accurate neural networks for fundamental NLP tasks (e.g. POS tagging, NER, dependency parsing, coreference resolution, WSD, etc.).
None of them are very accurate (often below 95% F1 score), and their errors add up.

Such limitations make NLP not yet suitable for many things.
This is why improving the state of the art (which can be observed on paperswithcode.com) is a crucial priority for academics.

Many researchers have smart ideas to improve the state of the art, and often improve it slightly by
taking a "standard neural network" for the task and mixing their new fancy idea into it.

I speak from experience: I've read most papers from the state-of-the-art leaderboards of most fundamental NLP tasks.
Almost always they have this common baseline + one idea, theirs.
The common baseline sometimes slowly evolves (e.g. now it's often a pre-trained model (say BERT) + fine-tuning + their idea).

Sorry to say, but "this" seems absurd to me,
where "this" means the fact that, by far, most researchers work in isolation, not integrating others' ideas (or only with great inertia).
I would have wished that the state of the art in one NLP task would be a combination of e.g. 50 innovative and complementary ideas from researchers.
You are researchers; do you have an idea why that is the case? If someone actually tried to merge all good, complementary and compatible ideas, would they have the best, unmatchable state of the art?
Why don't facebookresearch, Microsoft and Google try the low-hanging fruit and, in addition to producing X new shiny ideas per month, actually try to merge them in a coherent, synergetic manner?
I would like you to tell me what you think of this major issue that slows AI progress.

As an example of such inertia, let's talk about Swish, Mish or RAdam:
Those things are incredibly easy to try to see "hey, does it give my neural network free accuracy gains?"
Yet no paper on the state-of-the-art leaderboards has tried Swish, Mish or RAdam, despite them being so simple to try (you don't need to change the neural network).
Not even the pre-trained models that so many papers depend on (I opened issues for each of them).

Once I know what you think about this research inertia, I'll explain my vision of what needs to be done to fix it.

Could someone tell me how to use it?

Keras implementation

It would be very helpful if you could provide implementation in keras.

Is adabelief the best optimizer?

https://paperswithcode.com/paper/adabelief-optimizer-adapting-stepsizes-by-the

Gradient centralization was updated

Yonghongwei/Gradient-Centralization@d46e4c5

best result: flat learning rate for 75%. Does it mean the Ranger optimizer is not sensitive to lr?

Thank you for your excellent work~
I notice that the best model trained with the Ranger optimizer has a flat learning rate for 75% of training. Does this mean the Ranger optimizer is not sensitive to lr?

Looking forward to your early reply~

Loss stuck after 1 epoch

Just a warning to the curious, I tried to train DCCRN (from https://github.com/mpariente/asteroid) with Ranger2020 (default params) and it was stuck at a large loss after less than 1 epoch, and loss did not improve for another 30 epochs. I did not debug further. Adam with default params works very well.

step_size too large at initialization stage

I found that step_size is too high in the initial 5 steps.
The problem is in the code:

if N_sma >= self.N_sma_threshhold:
    step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
else:
    step_size = 1.0 / (1 - beta1 ** state['step'])

If betas are set to (0.9, 0.999), the internal variables change as follows:

state['step'] | step_size
--------------+-----------
            1 | 10
            2 | 5.26315789
            3 | 3.6900369
            4 | 2.90782204
            5 | 2.44194281
            6 | 0.00426327
            7 | 0.00524248
            8 | 0.00607304
            9 | 0.00681674
           10 | 0.00750596

Note that step_size doesn't depend on the gradient value and that it scales learning_rate.
Thus RAdam aggressively moves weights away from their initial values, even if they have a good initialization.

Is it better to set step_size equal to 0 if N_sma < self.N_sma_threshhold?
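
The table can be reproduced with a few lines of standard-library Python. This is a sketch that copies the branch quoted above (including the `>=` comparison) and assumes a threshold of 5; the function name and the `threshold` argument are illustrative only.

```python
import math


def radam_step_size(step, beta1=0.9, beta2=0.999, threshold=5):
    """Step-size multiplier from the snippet above, before scaling by lr."""
    beta2_t = beta2 ** step
    n_sma_max = 2 / (1 - beta2) - 1
    n_sma = n_sma_max - 2 * step * beta2_t / (1 - beta2_t)
    if n_sma >= threshold:
        rect = math.sqrt(
            (1 - beta2_t)
            * (n_sma - 4) / (n_sma_max - 4)
            * (n_sma - 2) / n_sma
            * n_sma_max / (n_sma_max - 2)
        )
        return rect / (1 - beta1 ** step)
    return 1.0 / (1 - beta1 ** step)


table = [(step, radam_step_size(step)) for step in range(1, 11)]
for step, size in table:
    print(f"{step:2d} | {size:.8f}")
```

The drop between steps 5 and 6 is where the rectified branch takes over. In the RangerVA listing earlier on this page, that branch also divides the update by the second-moment estimate, so the two regimes may not be directly comparable in terms of effective update size.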

Benchmark Adaptive Scheduling of Stochastic Gradients

https://paperswithcode.com/paper/adas-adaptive-scheduling-of-stochastic

Could it beat rangerLars?

Do we need some kind of Learning rate decay with Ranger?

For AdamW, people usually add some sort of learning rate decay: linear, cosine, triangular, etc. Warm-up steps are also popular.

Do we need all of these with Ranger, or can we just use a fixed learning rate?

Grad norm and ranger

I'm using NVIDIA Apex and torch grad norm.
This is the grad norm plot with Ranger (red) and AdamW (blue).
https://i.imgur.com/Ui4Sioo.png
Is it OK to have such huge grad norm values? Should I turn off grad norming?

Spelling, variables and PEP8

Hi and thanks for the code!

I am using your script for my code and while adapting it to PEP8 specs I found a few details that you may want to change. These are style changes that add clarity, but of course it is up to you whether to adhere to PEP8 recommendations or not. I could prepare a pull request as well if you like this style.

required (from torch.optim.optimizer) is not used
itertools is not used

N_sma_threshhold <-- the variable name should not begin with a capital letter, also "threshold" has a typo

k <-- is an important variable with an obscure name; perhaps something like "lookahead_steps" would be clearer?

Multi-line comments should use """comment"""
Normal comments need a space after #

betas=(.95,0.999) (and others) need a space after the comma

You have commented code that is not used, perhaps it would be best to remove it altogether.

Spaces are not consistent and don't agree with PEP8.

Not working using cuda

Variables self.slow_weights are always on cpu.
You can easily fix this by adding a .to() method in Ranger class like so:

def to(self, device):
    # Compare device names with ==; "is" tests object identity and is unreliable for strings
    if device == "cuda":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cuda()
    elif device == "cpu":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cpu()

TypeError in GC operation for Conv layers and FC layers

TypeError: mean() received an invalid combination of arguments - got (keepdim=bool, dim=tuple, ), but expected one of:

()
(torch.dtype dtype)
(int dim, torch.dtype dtype)
didn't match because some of the keywords were incorrect: keepdim
(int dim, bool keepdim, torch.dtype dtype)
(int dim, bool keepdim)
didn't match because some of the arguments have invalid types: (dim=tuple, keepdim=bool, )

Does it make sense to use it on a batch of 1?

@lessw2020 Thanks for this awesome optimizer. I'm very excited about it!

There is one particular workload that trains using a batch of 1 item.
Theoretically, does it make sense to use RAdam (Rectified Adam), LookAhead, and GC in this context?

I have been thinking about it and have read the papers, but I still could not reach a conclusion. As you (or anyone else here) are much more experienced than me, do you have an opinion on this?

Loading state doesn't seem to be fully working

To save: 'optimizer' : optimizer.state_dict()

To load: optimizer.load_state_dict(checkpoint['optimizer'])

However, I have the impression that restarting the training always brings the accuracy down before it recovers.

Best,
Thomas Chaton

Does it work well for transformers?

I am working on transformers now.
I have seen issue #13, but no one there said they got a better result than AdamW yet.
Has anyone already made Ranger work well for fine-tuning a transformer?

Also, I do not understand the Readme: 'Best training results - use a 75% flat lr, then step down and run lower lr for 25%, or cosine descend last 25%.'
I use 1e-4 lr now, what is the '75% flat lr'?
What is 'lower lr for 25%'?
Could you show me some demo code for adjusting the lr, apart from the code that initializes Ranger?
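
One reading of the README sentence, sketched in plain Python: keep the learning rate constant ("flat") for the first 75% of training steps, then cosine-anneal it over the last 25%. The function name, the default of 1e-4, and annealing all the way to zero are assumptions made for this sketch, not the repository's code.

```python
import math


def flat_then_cosine_lr(step, total_steps, base_lr=1e-4, flat_fraction=0.75):
    """Hold base_lr for the first part of training, then cosine-anneal to 0."""
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return base_lr
    # Progress through the annealing phase, from 0.0 to 1.0.
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
schedule = [flat_then_cosine_lr(s, total) for s in range(total + 1)]
print(schedule[0], schedule[749], schedule[875], schedule[1000])
```

Each value could be written to `group['lr']` for every group in `optimizer.param_groups` before the corresponding step. The README's other option, "step down and run lower lr for 25%", would replace the cosine branch with a single smaller constant.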

Is there a publication of Ranger?

I want to cite Ranger in a Medium article and I would like to know if there is an arXiv publication of Ranger or a peer-reviewed paper published at some conference or in a journal.

I saw you linked a paper in the README.md, but it does not seem to be about Ranger, as the word itself does not appear anywhere in it. I know the RAdam and Lookahead papers, but the Ranger one is missing from my library. Thanks

Not able to save the model_state_dict.

Hi, I was trying to save the model checkpoints after each epoch using the code below, but only the state dictionary of the zeroth epoch got stored and none of the others. Does the Ranger optimizer object support state_dict? If yes, how can I save it after each epoch?

out_model = os.path.join(args.model_dir, 'model.th')
with open(out_model, 'wb') as f:
    torch.save(model.state_dict(), f)
print("Model is dumped")

N_sma_threshhold should be instance variable

Thank you for the great implementation.
I think I found a small part to modify at ranger.py line 116.

original code:
if N_sma > N_sma_threshhold:

suggested change:
if N_sma > self.N_sma_threshhold:

What do the "GC operations" mean?
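
In this codebase, GC stands for Gradient Centralization: before the update, the mean of each gradient slice is subtracted so that every slice has zero mean. The RangerVA listing earlier on this page does this with `grad.add_(-grad.mean(dim = tuple(range(1,grad.dim())), keepdim = True))`. A small sketch of the same arithmetic on a 2-D gradient, using nested lists instead of tensors so it runs without PyTorch (the helper name is made up):

```python
def centralize(grad):
    """Subtract each row's mean, i.e. the mean over all dims except dim 0."""
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


# Toy gradient for a layer with 2 output units and 3 inputs.
grad = [[1.0, 2.0, 3.0],
        [4.0, 4.0, 10.0]]
centered = centralize(grad)
print(centered)  # [[-1.0, 0.0, 1.0], [-2.0, -2.0, 4.0]]
```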

This overload of addcmul_ is deprecated: addcmul_(Number value, Tensor tensor1, Tensor tensor2)

I get the following warning when using ranger with pytorch 1.6.0

/path/Ranger-Deep-Learning-Optimizer/ranger/ranger.py:138: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

Stochastic Weight Averaging support

Does Ranger support Stochastic Weight Averaging?

[question] Why is Ranger not available as a pip package

Why is Ranger only available on GitHub and not as a pip package? Wouldn't that make it easier for the community to use?

N_sma_threshhold

You first have
if N_sma > self.N_sma_threshhold:

and then you have
if N_sma > 4:

Is it right that the second one is constant or should that also be N_sma_threshhold parameter?
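
One way to see why a fixed 4 could be intentional (an observation about the formula, not a statement of the author's intent): the rectification term contains the factor (N_sma - 4), so the quantity under the square root is positive only when N_sma exceeds 4. The sketch below evaluates that quantity for a few values; the helper name is made up.

```python
def rectification_radicand(n_sma, n_sma_max):
    """Quantity under the square root in the RAdam rectification term."""
    return (
        (n_sma - 4) / (n_sma_max - 4)
        * (n_sma - 2) / n_sma
        * n_sma_max / (n_sma_max - 2)
    )


n_sma_max = 2 / (1 - 0.999) - 1  # about 1999 for beta2 = 0.999
values = {n: rectification_radicand(n, n_sma_max) for n in (3.0, 4.0, 5.0)}
for n, radicand in values.items():
    print(n, radicand)
```

This suggests 4 is a property of the formula rather than a tunable value, though whether the second check should also follow the parameter is the maintainer's call.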

Making it a python package

Would you like to make this a python package that could be installed with pip? It would be more practical.

I'd like to include it in my repo asteroid and give you proper credit for it.

One way is to install a python package (I can make a PR for that), the other one would be to copy-paste some of the code and point to the license file. Which way would you prefer?
