
pytorch-soft-actor-critic's People

Contributors

fgolemo, jendelel, llucid-97, pranz24, shmuma, shnippi, toshikwa


pytorch-soft-actor-critic's Issues

Target value calculation mistake

Hi.
I guess there is a mistake in the target value, which has been written as:
vf_target = min_qf_pi - (self.alpha * log_pi)

that is:
pi, log_pi, mean, log_std = self.policy.sample(state_batch)
qf1_pi, qf2_pi = self.critic(state_batch, pi)
min_qf_pi = torch.min(qf1_pi, qf2_pi)

But I believe min_qf_pi should be:
qf1_pi, qf2_pi = self.critic(state_batch, action_batch)
min_qf_pi = torch.min(qf1_pi, qf2_pi)
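For reference (just the definition from the SAC paper, not the author's wording), the soft value target is an expectation over actions freshly sampled from the current policy, not over actions stored in the replay buffer:

V(s_t) = E_{a_t ~ π} [ min(Q_1(s_t, a_t), Q_2(s_t, a_t)) - α log π(a_t | s_t) ]

so, if I read the paper correctly, resampling actions via policy.sample(state_batch) in the code above appears to be intentional.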

Resume training

Hello, I am trying to use the SAC agent and resume training. To do that, I do:

def load_model(self, actor_path, critic_path, optimizer_actor_path,
               optimizer_critic_path, optimizer_alpha_path):
    # Restore the actor checkpoint, which also stores alpha and log_alpha.
    policy = torch.load(actor_path)
    self.alpha = policy['alpha'].detach().item()
    # Rebuild log_alpha as a fresh leaf tensor so it can be optimized again.
    self.log_alpha = torch.tensor([policy['log_alpha'].detach().item()],
                                  requires_grad=True, device=self.device)
    self.alpha_optim = Adam([self.log_alpha], lr=self.lr)  # I had to recreate alpha_optim with the newly loaded log_alpha

    self.policy.load_state_dict(policy['model_state_dict'])
    self.policy.train()
    self.critic.load_state_dict(torch.load(critic_path))
    self.critic.train()

    # Restore optimizer states so Adam's moment estimates carry over.
    self.policy_optim.load_state_dict(torch.load(optimizer_actor_path))
    self.critic_optim.load_state_dict(torch.load(optimizer_critic_path))
    self.alpha_optim.load_state_dict(torch.load(optimizer_alpha_path))

Is this correct? The loss explodes after resuming, which is very strange.
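For comparison, a save routine matching this loader might look like the sketch below (my assumption about where alpha and log_alpha are stored; the names are illustrative, not the repository's API):

def save_checkpoint(self, actor_path, critic_path, optimizer_actor_path,
                    optimizer_critic_path, optimizer_alpha_path):
    # Keep alpha and log_alpha next to the policy weights so the entropy
    # temperature can be restored exactly when training resumes.
    torch.save({'model_state_dict': self.policy.state_dict(),
                'alpha': self.log_alpha.exp(),
                'log_alpha': self.log_alpha}, actor_path)
    torch.save(self.critic.state_dict(), critic_path)
    # Optimizer states preserve Adam's moment estimates across the restart.
    torch.save(self.policy_optim.state_dict(), optimizer_actor_path)
    torch.save(self.critic_optim.state_dict(), optimizer_critic_path)
    torch.save(self.alpha_optim.state_dict(), optimizer_alpha_path)
    # Note: if the target critic is not saved or re-synced on load,
    # resuming may behave differently from the original run.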

Why do you need to use NormalizedActions()?

Excuse me, I don't understand why you need to use NormalizedActions().
Can you explain it ? Thank you!

Environment

env = NormalizedActions(gym.make(args.env_name))

class NormalizedActions(gym.ActionWrapper):

    def action(self, action):
        action = (action + 1) / 2  # [-1, 1] => [0, 1]
        action *= (self.action_space.high - self.action_space.low)
        action += self.action_space.low
        return action

    def _reverse_action(self, action):
        action -= self.action_space.low
        action /= (self.action_space.high - self.action_space.low)
        action = action * 2 - 1
        return action

what is the derivation behind the log_prob equation?

First of all, thanks for this amazing repo!

I am trying to clarify why the log_prob of the action taken by the policy is calculated as in this line:

log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon)
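For context, this line appears to implement the tanh change-of-variables correction from Appendix C of the SAC paper, extended with the affine rescaling a = scale · tanh(u) + bias:

log π(a | s) = log μ(u | s) − Σ_i log( scale_i · (1 − tanh²(u_i)) )

where u is the pre-tanh Gaussian sample, μ its density, and scale_i the action scale; the epsilon in the code is only for numerical stability.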

When this resulting log_prob is used in the loss to update alpha, it seems that there is some imbalance between it and how the target_entropy is calculated. The target only takes into account the dimensionality of the action vector, while the log_prob is affected by the action_scale.

In the end, aren't we just comparing a target entropy with the entropy of the policy? And since the latter is basically given by the standard deviation in the Gaussian case, could we just return that entropy in place of the log_pi returned by policy.sample(state), or simply the sum of the elements of normal.log_prob(x_t), as if the line indicated above were removed?

Thanks in advance. Sorry if I said something stupid, but I am confused and would really appreciate some help understanding what's going on.

Training policy for more complex tasks, converges to sub-optimal solutions

I recently implemented a gym environment where a robot should learn to push different boxes conditioned on different skills, getting only sparse rewards. I wanted to train the agent using the SAC implementation from this repository. There I observed that for more complex problems the agent seems to converge quickly to some non-optimal policy, where it gets no reward, or only a little reward when I used reward shaping.
Thus, I used the exact same gym environment and trained the agent with the SAC implementation from Stable Baselines3. I made sure that all the hyperparameters were the same as when I trained it with this implementation.
From the following plot, it is visible that the latter training has much better performance.
In the middle task, where the agent has to learn to push only one box from different initial positions, the agent trained with this implementation performs quite well. However, in the right task, where the agent has to push four boxes from different initial positions, it does not show any improvement.

For other less complex tasks, one of which is shown in the left plots, I also managed to successfully train an agent with this implementation. I observed that, with this implementation, the agent seems to explore much less than with the implementation from Stable Baselines3, which might be why the result is so bad for more complex tasks, where more exploratory behavior is necessary to find good states.
However, unfortunately, I was not able to find any specific part of the code that might have an error in it.

[Comparison plot: left, middle, and right tasks — this implementation vs. Stable Baselines3]

I am no longer looking for a solution, although I would be very interested if someone finds the reason why this implementation performs poorly compared to the one from Stable Baselines3.

This is more of a disclaimer that the implementation might not work for all tasks than a problem I need help with.

Could you please explain the "# Enforcing Action Bound" comment?

In model.py, lines 102 and 103:

# Enforcing Action Bound
log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon)

Referring to the original paper, I cannot understand the aim of this line of code.

I tried some code like GaussianPolicy.sample() and sometimes got a positive log_prob in the end, and I'm still confused about this line.
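As a side note on the positive values (my own illustration, not from the repository): log_prob is a log-density, and a continuous density can exceed 1, so positive values are expected when the standard deviation is small. A minimal sketch:

import torch
from torch.distributions import Normal

# A continuous density can exceed 1, so its log can be positive.
dist = Normal(loc=torch.tensor(0.0), scale=torch.tensor(0.01))
print(dist.log_prob(torch.tensor(0.0)))  # roughly 3.69, i.e. positive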

Could you please explain it? Thank you very much!

Question about policy_loss

Hi, thank you for your great work!!

I have a question related to #10.
Can you explain the meaning of the code below in the GaussianPolicy?

# Enforcing Action Bound
log_prob -= torch.log(1 - action.pow(2) + epsilon)

Also, can you provide the references you used when writing this loss?

Anyway, thank you for sharing your great code.

[Question] Mask Batch

Hi,

For this line: why do you need the mask batch here? In the original SAC paper, the target Q-value is written as r + \gamma Q(s_{t+1}, a_{t+1}). Does removing the mask batch here affect the performance?
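For context, the masked soft Bellman target typically looks like the sketch below (variable names are illustrative, not necessarily the exact code in this repository); mask_batch zeroes out the bootstrap term on terminal transitions:

with torch.no_grad():
    next_action, next_log_pi, _ = policy.sample(next_state_batch)
    q1_next, q2_next = critic_target(next_state_batch, next_action)
    # Soft state value of the next state under the current policy.
    min_q_next = torch.min(q1_next, q2_next) - alpha * next_log_pi
    # mask is 0 for terminal transitions and 1 otherwise, so episodes
    # stop bootstrapping at their last step.
    next_q_value = reward_batch + mask_batch * gamma * min_q_next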

Thanks

Inconsistent seeding

Hi,

I just wanted to point out that the code produces inconsistent results across runs due to seeding issues. I found two reasons for this; after fixing them, I am able to get consistent results for a fixed seed value.

  1. The Python random package is used in ReplayMemory, but its seed is not set in main.py.
  2. You would need to set the seed for the environment's action_space explicitly with env.action_space.seed(args.seed), as env.seed(seed) does not do that. Otherwise you get different action samples in the initial exploration phase of the algorithm. I am using Gym version 0.17.2. (See the sketch after this list.)
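A minimal sketch of the seeding I ended up with (assuming the usual args from main.py and Gym's pre-0.21 seeding API):

import random
import numpy as np
import torch

random.seed(args.seed)        # covers random.sample() inside ReplayMemory
np.random.seed(args.seed)
torch.manual_seed(args.seed)

env.seed(args.seed)               # Gym <= 0.21 environment seeding
env.action_space.seed(args.seed)  # needed for the initial random exploration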

Hope this helps!

Running SAC: Operation failed to compute its gradient

Environment:

  • torch==1.5.0
  • mujoco-py==2.0.2.10

Usage:
python main.py --env-name Humanoid-v2 --alpha 0.05

Error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Hello there,
I am trying to run SAC and am getting the error shown above. I was told that rolling back to Torch 1.0.0 would fix the issue (installing now), but I do not understand why. Is this something that you are aware of?
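For what it's worth, my guess (an assumption, not a verified diagnosis of this repository) is the PyTorch 1.5 change where optimizer.step() mutates parameters in-place in a way autograd now tracks, so calling backward() on a loss that was built before an optimizer step on those same parameters fails. A sketch of an ordering that avoids it:

# Finish backward() before stepping the optimizer whose parameters the
# loss depends on, and build each subsequent loss after that step.
critic_optim.zero_grad()
qf_loss.backward()
critic_optim.step()

pi, log_pi, _ = policy.sample(state_batch)   # fresh forward pass
qf1_pi, qf2_pi = critic(state_batch, pi)     # uses the updated critic
policy_loss = ((alpha * log_pi) - torch.min(qf1_pi, qf2_pi)).mean()

policy_optim.zero_grad()
policy_loss.backward()
policy_optim.step()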

puzzles about action scaling

Hi, thanks for your PyTorch implementation of SAC; it's really readable.
I have a question about action scaling from reading your code.

action = y_t * self.action_scale + self.action_bias

What is its purpose if action_scale equals 1.0 and action_bias equals 0? And why can that make a big improvement in performance?
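For reference, the scale and bias are typically derived from the environment's action space roughly like this (a sketch; the exact attribute names in the repository may differ):

# Map tanh's [-1, 1] output onto [low, high] of the real action space.
action_scale = torch.FloatTensor((action_space.high - action_space.low) / 2.0)
action_bias = torch.FloatTensor((action_space.high + action_space.low) / 2.0)

# For a symmetric [-1, 1] space these become 1 and 0, so the transform is
# the identity; for asymmetric or wider spaces it rescales the action.
action = torch.tanh(x_t) * action_scale + action_bias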

A little question about calculating log likelihood

Hi!

Thank you for sharing the code!

I am reading the code and got somewhat confused about a detail.

In model.py, in the GaussianPolicy class's sample function, why is the returned log_prob
normal.log_prob(x_t) - torch.log(1 - action.pow(2) + epsilon)?
What is the meaning of 1 - action.pow(2)?

I really appreciate your help!

Doubts about Regularization in policy loss

Thank you for your contribution. However, I'm confused about
reg_loss = 0.001 * (mean.pow(2).mean() + log_std.pow(2).mean())  # Regularization Loss
in the code. Can you explain it? Any help would be appreciated.

Question: Why optimize loss_alpha?

Hi,
I'm fairly new to RL; I recently studied DDPG and am now trying to understand SAC.
I see in the code that you optimize log_alpha instead of alpha directly; is there a reason for that?
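For reference, the automatic temperature update usually looks like the sketch below (not necessarily verbatim from this repository); optimizing log_alpha keeps alpha = exp(log_alpha) strictly positive without a constrained optimizer:

# log_alpha is the unconstrained variable; alpha is recovered by exp().
alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()

alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()

alpha = log_alpha.exp()  # always > 0, no projection or clipping needed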

Thanks so much for your explanation.
Toby.

About model.py line 105

In model.py, line 105, shouldn't the third return value be:

torch.tanh(mean) * self.action_scale + self.action_bias

?

Exploding entropy temperature

Hi,

When I set automatic_entropy_tuning to true in an environment with an action space of shape 1, my entropy temperature explodes and increases exponentially to a magnitude of 10^8 before PyTorch fails and crashes the run. Any ideas as to why?

Is this code SAC-V rather than SAC?

I thought this implementation was SAC rather than SAC-V.
However, it seems the agent does not contain a ValueNetwork, while model.py includes the ValueNetwork class.
I am wondering whether this code is SAC-V, or whether the value network is a leftover from a previous implementation.

reproducibility for HalfCheetah-v2

Hi,

I ran your code, just setting the number of timesteps to 3 million as in the official paper (the other parameters were left at their defaults, as in your code). I couldn't reproduce the 15,000 result of the paper (plot over 4 seeds attached).
Are there any specific parameters to set?

Thank you!
[Attached plot: halfcheetah_sac, results over 4 seeds]

Derivative in reparametrization trick?

Hi, I ran into a problem understanding the log-likelihood again, and I hope you can help me!

In the .sample() method of the GaussianPolicy class,

x_t = normal.rsample() # for reparameterization trick (mean + std * N(0,1))

generates a new x_t with gradients back to the network, and the line
log_prob = normal.log_prob(x_t)

calculates the log-likelihood of x_t, but I think this log-likelihood has no gradient with respect to the mean, only with respect to the std. Am I correct?

Here is what I think:
x_t = mean + noise * std, where mean and std are outputs from .forward() and noise is a sample from N(0, 1) with no gradient.
Then when you are calculating log_prob = normal.log_prob(x_t), the function returns something like

-((value - self.loc) ** 2) / (2 * var) - log_scale - math.log(math.sqrt(2 * math.pi))

which reduces to -noise**2 / 2 - log_scale - math.log(math.sqrt(2 * math.pi)), so it only has a gradient with respect to the std part and not the mean part. Do you agree with me?

I am also trying to implement this algorithm, but it seems that when I include the entropy term in the policy loss, everything goes wrong, while without it everything is fine. I am looking into the problem, so I am comparing all the details between my implementation and yours. Please help me if you can. Thank you very much!

Action scale and action bias

Hi guys,
You did a great job here!
I'm trying to modify the algorithm to my needs, and I can't quite work out two variables in the neural network classes. What are the action_scale and action_bias variables, and why do you use them? Could you please point to where they appear in the paper?

Thanks

question about q_loss and alpha_loss

In sac.update_parameters(), the qf_loss and its optimizer are associated via F.mse_loss(qf1, next_q_value) and F.mse_loss(qf2, next_q_value), but for policy_loss and alpha_loss there seems to be no association between the loss function and the optimizer. Did I miss something?

Value network

Hi,

I skimmed the authors' implementation, and it seems that they don't use the value network; instead, they only use the Q-networks. It looks like they removed it in this commit.

Thanks,

Lukas

Unable to reproduce results on Humanoid-v2 in new SAC

I am unable to obtain the result reported in the paper 'Soft Actor-Critic Algorithms and Applications' on the OpenAI Gym environment Humanoid-v2. My result is 6000 while the original paper reports 8000, after 10 million steps.

Do you know what might be causing this issue? Thank you!

Normalized Actions has bugs

One should be careful when uncommenting the normalized actions wrapper: one has to make sure to call _reverse_action(), and _max_episode_steps has a typo and should not be a function; otherwise the following line in main.py would not work:

mask = 1 if episode_steps == env._max_episode_steps else float(not done)

This small bug caused a lot of headaches, but the repo is super nice otherwise!

No normalization of state space

I realized that the state is not normalized. This might not be a big issue, because even if the state is never normalized, the networks should still be able to learn to make correct predictions from it. However, I think that for fixed hyperparameters, an unnormalized state could have a different influence on, for example, the magnitude of the losses and on the predictions right after the network weights are initialized.
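For anyone who wants to experiment with this, a minimal running-statistics observation wrapper could look like the sketch below (my own illustration, not part of this repository):

import gym
import numpy as np

class NormalizeObservation(gym.ObservationWrapper):
    """Welford-style running mean/std normalization of observations."""
    def __init__(self, env, eps=1e-8):
        super().__init__(env)
        self.eps = eps
        self.count = 0
        self.mean = np.zeros(env.observation_space.shape, dtype=np.float64)
        self.m2 = np.zeros(env.observation_space.shape, dtype=np.float64)

    def observation(self, obs):
        # Update the running statistics, then standardize the observation.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)
        var = self.m2 / max(self.count - 1, 1)
        return (obs - self.mean) / np.sqrt(var + self.eps)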

I would very much appreciate someone else's insight on this and how much this may really change the resulting policy.

Cheers,
Rosa

the bound enforcement for log_prob in line 103 of model.py

I do not mathematically agree with the bound-enforcement offset for log_prob in your Gaussian policy. For the pdfs of x and y in the multivariate case, the offset should be the logarithm of the determinant of the Jacobian matrix of y = tanh(x). The Jacobian happens to be a diagonal matrix, so the offset should be the logarithm of the product of its diagonal elements. Please let me know whether my understanding of pdf transformation under an element-wise change of vector variables is correct.

Looking forward to hearing from you.

Cheers,

Old Yang
