
pytorch-soft-actor-critic's People

Contributors

fgolemo, jendelel, llucid-97, pranz24, shmuma, shnippi, toshikwa


pytorch-soft-actor-critic's Issues

Target value calculation mistake

Hi.
I guess there is a mistake in the target value, which has been written as:
vf_target = min_qf_pi - (self.alpha * log_pi)

that is:
pi, log_pi, mean, log_std = self.policy.sample(state_batch)
qf1_pi, qf2_pi = self.critic(state_batch, pi)
min_qf_pi = torch.min(qf1_pi, qf2_pi)

But I believe min_qf_pi should be:
qf1_pi, qf2_pi = self.critic(state_batch, action_batch)
min_qf_pi = torch.min(qf1_pi, qf2_pi)
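For reference (just the definition from the SAC paper, not the author's wording), the soft value target is an expectation over actions freshly sampled from the current policy, not over actions stored in the replay buffer:

V(s_t) = E_{a_t ~ π} [ min(Q_1(s_t, a_t), Q_2(s_t, a_t)) - α log π(a_t | s_t) ]

so, if I read the paper correctly, resampling actions via policy.sample(state_batch) in the code above appears to be intentional.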

Resume training

Hello, I am trying to use the SAC agent and resume training. To do that, I do:

def load_model(self, actor_path, critic_path, optimizer_actor_path,
               optimizer_critic_path, optimizer_alpha_path):
    # Restore the actor checkpoint, which also stores alpha and log_alpha.
    policy = torch.load(actor_path)
    self.alpha = policy['alpha'].detach().item()
    # Rebuild log_alpha as a fresh leaf tensor so it can be optimized again.
    self.log_alpha = torch.tensor([policy['log_alpha'].detach().item()],
                                  requires_grad=True, device=self.device)
    self.alpha_optim = Adam([self.log_alpha], lr=self.lr)  # I had to recreate alpha_optim with the newly loaded log_alpha

    self.policy.load_state_dict(policy['model_state_dict'])
    self.policy.train()
    self.critic.load_state_dict(torch.load(critic_path))
    self.critic.train()

    # Restore optimizer states so Adam's moment estimates carry over.
    self.policy_optim.load_state_dict(torch.load(optimizer_actor_path))
    self.critic_optim.load_state_dict(torch.load(optimizer_critic_path))
    self.alpha_optim.load_state_dict(torch.load(optimizer_alpha_path))

Is this correct? The loss explodes after resuming, which is very strange.
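For comparison, a save routine matching this loader might look like the sketch below (my assumption about where alpha and log_alpha are stored; the names are illustrative, not the repository's API):

def save_checkpoint(self, actor_path, critic_path, optimizer_actor_path,
                    optimizer_critic_path, optimizer_alpha_path):
    # Keep alpha and log_alpha next to the policy weights so the entropy
    # temperature can be restored exactly when training resumes.
    torch.save({'model_state_dict': self.policy.state_dict(),
                'alpha': self.log_alpha.exp(),
                'log_alpha': self.log_alpha}, actor_path)
    torch.save(self.critic.state_dict(), critic_path)
    # Optimizer states preserve Adam's moment estimates across the restart.
    torch.save(self.policy_optim.state_dict(), optimizer_actor_path)
    torch.save(self.critic_optim.state_dict(), optimizer_critic_path)
    torch.save(self.alpha_optim.state_dict(), optimizer_alpha_path)
    # Note: if the target critic is not saved or re-synced on load,
    # resuming may behave differently from the original run.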

Why do you need to use NormalizedActions()?

Excuse me, I don't understand why you need to use NormalizedActions().
Can you explain it ? Thank you!

Environment

env = NormalizedActions(gym.make(args.env_name))

class NormalizedActions(gym.ActionWrapper):

    def action(self, action):
        action = (action + 1) / 2  # [-1, 1] => [0, 1]
        action *= (self.action_space.high - self.action_space.low)
        action += self.action_space.low
        return action

    def _reverse_action(self, action):
        action -= self.action_space.low
        action /= (self.action_space.high - self.action_space.low)
        action = action * 2 - 1
        return action

what is the derivation behind the log_prob equation?

First of all, thanks for this amazing repo!

I am trying to clarify why the log_prob of the action taken by the policy is calculated as in this line:

log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon)
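For context, this line appears to implement the tanh change-of-variables correction from Appendix C of the SAC paper, extended with the affine rescaling a = scale · tanh(u) + bias:

log π(a | s) = log μ(u | s) − Σ_i log( scale_i · (1 − tanh²(u_i)) )

where u is the pre-tanh Gaussian sample, μ its density, and scale_i the action scale; the epsilon in the code is only for numerical stability.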

When this resulting log_prob is used in the loss to update alpha, it seems that there is some imbalance between it and how the target_entropy is calculated. The target only takes into account the dimensionality of the action vector, while the log_prob is affected by the action_scale.

In the end, aren't we just comparing a target entropy with the entropy of the policy? And since the latter is basically given by the standard deviation in the Gaussian case, could we just return that entropy in place of the log_pi returned by policy.sample(state), or simply the sum of the elements of normal.log_prob(x_t), as if the line indicated above were removed?

Thanks in advance. Sorry if I said something stupid, but I am confused and would really appreciate some help understanding what's going on.

Training policy for more complex tasks, converges to sub-optimal solutions

I recently implemented a gym environment where a robot should learn to push different boxes conditioned on different skills, getting only sparse rewards. I wanted to train the agent using the SAC implementation from this repository. There I observed that for more complex problems the agent seems to converge quickly to some non-optimal policy, where it gets no reward, or only a little reward when I used reward shaping.
Thus, I used the exact same gym environment and trained the agent with the SAC implementation from Stable Baselines3. I made sure that all the hyperparameters were the same as when I trained it with this implementation.
From the following plot, it is visible that the latter training has much better performance.
In the middle task, where the agent has to learn to push only one box from different initial positions, the agent trained with this implementation performs quite well. However, in the right task, where the agent has to push four boxes from different initial positions, it does not show any improvement.

For other less complex tasks, one of which is shown in the left plots, I also managed to successfully train an agent with this implementation. I observed that, with this implementation, the agent seems to explore much less than with the implementation from Stable Baselines3, which might be why the result is so bad for more complex tasks, where more exploratory behavior is necessary to find good states.
However, unfortunately, I was not able to find any specific part of the code that might have an error in it.

[Comparison plot: left, middle, and right tasks — this implementation vs. Stable Baselines3]

I am no longer looking for a solution, although I would be very interested if someone finds the reason why this implementation performs poorly compared to the one from Stable Baselines3.

This is more of a disclaimer that the implementation might not work for all tasks than a problem I need help with.

Could you please explain the "# Enforcing Action Bound" comment?

In model.py, lines 102 and 103:

# Enforcing Action Bound
log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon)

Referring to the original paper, I cannot understand the aim of this line of code.

I tried some code like GaussianPolicy.sample() and sometimes got a positive log_prob in the end, and I'm still confused about this line.
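As a side note on the positive values (my own illustration, not from the repository): log_prob is a log-density, and a continuous density can exceed 1, so positive values are expected when the standard deviation is small. A minimal sketch:

import torch
from torch.distributions import Normal

# A continuous density can exceed 1, so its log can be positive.
dist = Normal(loc=torch.tensor(0.0), scale=torch.tensor(0.01))
print(dist.log_prob(torch.tensor(0.0)))  # roughly 3.69, i.e. positive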

Could you please explain it? Thank you very much!

Question about policy_loss

Hi, thank you for your great work!!

I have a question related to #10.
Can you explain the meaning of the code below in the GaussianPolicy?

# Enforcing Action Bound
log_prob -= torch.log(1 - action.pow(2) + epsilon)

Also, can you provide the references you used when writing this loss?

Anyway, thank you for sharing your great code.

[Question] Mask Batch

Hi,

For this line: why do you need the mask batch here? In the original SAC paper, the target Q-value is written as r + \gamma Q(s_{t+1}, a_{t+1}). Does removing the mask batch here affect the performance?
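For context, the masked soft Bellman target typically looks like the sketch below (variable names are illustrative, not necessarily the exact code in this repository); mask_batch zeroes out the bootstrap term on terminal transitions:

with torch.no_grad():
    next_action, next_log_pi, _ = policy.sample(next_state_batch)
    q1_next, q2_next = critic_target(next_state_batch, next_action)
    # Soft state value of the next state under the current policy.
    min_q_next = torch.min(q1_next, q2_next) - alpha * next_log_pi
    # mask is 0 for terminal transitions and 1 otherwise, so episodes
    # stop bootstrapping at their last step.
    next_q_value = reward_batch + mask_batch * gamma * min_q_next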

Thanks

Inconsistent seeding

Hi,

I just wanted to point out that the code produces inconsistent results across runs due to seeding issues. I found two reasons for this; after fixing them, I am able to get consistent results for a fixed seed value.

  1. The Python random package is used in ReplayMemory, but its seed is not set in main.py.
  2. You would need to set the seed for the environment's action_space explicitly with env.action_space.seed(args.seed), as env.seed(seed) does not do that. Otherwise you get different action samples in the initial exploration phase of the algorithm. I am using Gym version 0.17.2. (See the sketch after this list.)
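A minimal sketch of the seeding I ended up with (assuming the usual args from main.py and Gym's pre-0.21 seeding API):

import random
import numpy as np
import torch

random.seed(args.seed)        # covers random.sample() inside ReplayMemory
np.random.seed(args.seed)
torch.manual_seed(args.seed)

env.seed(args.seed)               # Gym <= 0.21 environment seeding
env.action_space.seed(args.seed)  # needed for the initial random exploration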

Hope this helps!

Running SAC: Operation failed to compute its gradient

Environment:

  • torch==1.5.0
  • mujoco-py==2.0.2.10

Usage:
python main.py --env-name Humanoid-v2 --alpha 0.05

Error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Hello there,
I am trying to run SAC and am getting the error shown above. I was told that rolling back to Torch 1.0.0 would fix the issue (installing now), but I do not understand why. Is this something that you are aware of?
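For what it's worth, my guess (an assumption, not a verified diagnosis of this repository) is the PyTorch 1.5 change where optimizer.step() mutates parameters in-place in a way autograd now tracks, so calling backward() on a loss that was built before an optimizer step on those same parameters fails. A sketch of an ordering that avoids it:

# Finish backward() before stepping the optimizer whose parameters the
# loss depends on, and build each subsequent loss after that step.
critic_optim.zero_grad()
qf_loss.backward()
critic_optim.step()

pi, log_pi, _ = policy.sample(state_batch)   # fresh forward pass
qf1_pi, qf2_pi = critic(state_batch, pi)     # uses the updated critic
policy_loss = ((alpha * log_pi) - torch.min(qf1_pi, qf2_pi)).mean()

policy_optim.zero_grad()
policy_loss.backward()
policy_optim.step()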

puzzles about action scaling

Hi, thanks for your PyTorch implementation of SAC; it's really readable.
I have a question about action scaling from reading your code.

action = y_t * self.action_scale + self.action_bias

What is its purpose if action_scale equals 1.0 and action_bias equals 0? And why can that make a big improvement in performance?
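For reference, the scale and bias are typically derived from the environment's action space roughly like this (a sketch; the exact attribute names in the repository may differ):

# Map tanh's [-1, 1] output onto [low, high] of the real action space.
action_scale = torch.FloatTensor((action_space.high - action_space.low) / 2.0)
action_bias = torch.FloatTensor((action_space.high + action_space.low) / 2.0)

# For a symmetric [-1, 1] space these become 1 and 0, so the transform is
# the identity; for asymmetric or wider spaces it rescales the action.
action = torch.tanh(x_t) * action_scale + action_bias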

A little question about calculating log likelihood

Hi!

Thank you for sharing the code!

I am reading the code and got somewhat confused about a detail.

In model.py, in the GaussianPolicy class's sample function, why is the returned log_prob
normal.log_prob(x_t) - torch.log(1 - action.pow(2) + epsilon)?
What is the meaning of 1 - action.pow(2)?

I really appreciate your help!

Doubts about Regularization in policy loss

Thank you for your contribution. However, I'm confused about
reg_loss = 0.001 * (mean.pow(2).mean() + log_std.pow(2).mean())  # Regularization Loss
in the code. Can you explain it? Any help would be appreciated.

Question: Why optimize loss_alpha?

Hi,
I'm fairly new to RL; I recently studied DDPG and am now trying to understand SAC.
I see in the code that you optimize log_alpha instead of alpha directly; is there a reason for that?
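For reference, the automatic temperature update usually looks like the sketch below (not necessarily verbatim from this repository); optimizing log_alpha keeps alpha = exp(log_alpha) strictly positive without a constrained optimizer:

# log_alpha is the unconstrained variable; alpha is recovered by exp().
alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()

alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()

alpha = log_alpha.exp()  # always > 0, no projection or clipping needed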

Thanks so much for your explanation.
Toby.

About model.py line 105

In model.py, line 105, shouldn't the third return value be:

torch.tanh(mean) * self.action_scale + self.action_bias

?

Exploding entropy temperature

Hi,

When I set automatic_entropy_tuning to true in an environment with an action space of shape 1, my entropy temperature explodes and increases exponentially to a magnitude of 10^8 before PyTorch fails and crashes the run. Any ideas as to why?

Is this code SAC-V rather than SAC?

I thought this implementation was SAC rather than SAC-V.
However, it seems the agent does not contain a ValueNetwork, while model.py includes the ValueNetwork class.
I am wondering whether this code is SAC-V, or whether the value network is a leftover from a previous implementation.

reproducibility for HalfCheetah-v2

Hi,

I ran your code, just setting the number of timesteps to 3 million as in the official paper (the other parameters were left at their defaults, as in your code). I couldn't reproduce the 15,000 result of the paper (plot over 4 seeds attached).
Are there any specific parameters to set?

Thank you!
[Attached plot: halfcheetah_sac, results over 4 seeds]

Derivative in reparametrization trick?

Hi, I ran into a problem understanding the log-likelihood again, and I hope you can help me!

In the .sample() method of the GaussianPolicy class,

x_t = normal.rsample() # for reparameterization trick (mean + std * N(0,1))

generates a new x_t with gradients back to the network, and the line
log_prob = normal.log_prob(x_t)

calculates the log-likelihood of x_t, but I think this log-likelihood has no gradient with respect to the mean, only with respect to the std. Am I correct?

Here is what I think:
x_t = mean + noise * std, where mean and std are outputs from .forward() and noise is a sample from N(0, 1) with no gradient.
Then when you are calculating log_prob = normal.log_prob(x_t), the function returns something like

-((value - self.loc) ** 2) / (2 * var) - log_scale - math.log(math.sqrt(2 * math.pi))

which reduces to -noise**2 / 2 - log_scale - math.log(math.sqrt(2 * math.pi)), so it only has a gradient with respect to the std part and not the mean part. Do you agree with me?

I am also trying to implement this algorithm, but it seems that when I include the entropy term in the policy loss, everything goes wrong, while without it everything is fine. I am looking into the problem, so I am comparing all the details between my implementation and yours. Please help me if you can. Thank you very much!

Action scale and action bias

Hi guys,
You did a great job here!
I'm trying to modify the algorithm to my needs, and I can't quite work out two variables in the neural network classes. What are the action_scale and action_bias variables, and why do you use them? Could you please point to where they appear in the paper?

Thanks

question about q_loss and alpha_loss

In sac.update_parameters(), the qf_loss and its optimizer are associated via F.mse_loss(qf1, next_q_value) and F.mse_loss(qf2, next_q_value), but for policy_loss and alpha_loss there seems to be no association between the loss function and the optimizer. Did I miss something?

Value network

Hi,

I skimmed the authors' implementation, and it seems that they don't use the value network; instead, they only use the Q-networks. It looks like they removed it in this commit.

Thanks,

Lukas

Unable to reproduce results on Humanoid-v2 in new SAC

I am unable to obtain the result reported in the paper 'Soft Actor-Critic Algorithms and Applications' on the OpenAI Gym environment Humanoid-v2. My result is 6000 while the original paper reports 8000, after 10 million steps.

Do you know what might be causing this issue? Thank you!

Normalized Actions has bugs

One should be careful when uncommenting the normalized actions wrapper: one has to make sure to call _reverse_action(), and _max_episode_steps has a typo and should not be a function; otherwise the following line in main.py would not work:

mask = 1 if episode_steps == env._max_episode_steps else float(not done)

This small bug caused a lot of headaches, but the repo is super nice otherwise!

No normalization of state space

I realized that the state is not normalized. This might not be a big issue, because even if the state is never normalized, the networks should still be able to learn to make correct predictions from it. However, I think that for fixed hyperparameters, an unnormalized state could have a different influence on, for example, the magnitude of the losses and on the predictions right after the network weights are initialized.
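For anyone who wants to experiment with this, a minimal running-statistics observation wrapper could look like the sketch below (my own illustration, not part of this repository):

import gym
import numpy as np

class NormalizeObservation(gym.ObservationWrapper):
    """Welford-style running mean/std normalization of observations."""
    def __init__(self, env, eps=1e-8):
        super().__init__(env)
        self.eps = eps
        self.count = 0
        self.mean = np.zeros(env.observation_space.shape, dtype=np.float64)
        self.m2 = np.zeros(env.observation_space.shape, dtype=np.float64)

    def observation(self, obs):
        # Update the running statistics, then standardize the observation.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)
        var = self.m2 / max(self.count - 1, 1)
        return (obs - self.mean) / np.sqrt(var + self.eps)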

I would very much appreciate someone else's insight on this and how much this may really change the resulting policy.

Cheers,
Rosa

the bound enforcement for log_prob in line 103 of model.py

I do not mathematically agree with the bound-enforcement offset for log_prob in your Gaussian policy. For the pdfs of x and y in the multivariate case, the offset should be the logarithm of the determinant of the Jacobian matrix of y = tanh(x). The Jacobian happens to be a diagonal matrix, so the offset should be the logarithm of the product of its diagonal elements. Please let me know whether my understanding of pdf transformation under an element-wise change of vector variables is correct.

Looking forward to hearing from you.

Cheers,

Old Yang
