
hybrid-sac's People

Contributors

nisheeth-golakiya


hybrid-sac's Issues

Question on formula of the continuous action

First, thank you for the code related to the paper "Discrete and Continuous Action Representation for Practical RL in Video Games".

Second, according to your code, all of the action spaces of the environments used in this project follow the 5th action-space architecture stated in the paper: a single discrete action dimension plus a continuous action space for EACH discrete action (if I am wrong, please tell me).

My question is about the loss formulas related to the continuous part:

# from calculating critic loss
min_qf_next_target = next_state_prob_d * (torch.min(qf1_next_target, qf2_next_target) - alpha * next_state_prob_d * next_state_log_pi_c - alpha_d * next_state_log_pi_d)

# from calculating policy loss
policy_loss_c = (prob_d * (alpha * prob_d * log_pi_c - min_qf_pi)).sum(1).mean()

# from calculating temperature loss
alpha_loss = (-log_alpha * p_d * (p_d * lpi_c + target_entropy)).sum(1).mean()

In each of these formulas, you multiply the distribution object (next_state_prob_d, prob_d and p_d) with the corresponding continuous term (next_state_log_pi_c, log_pi_c and lpi_c) inside the parentheses, even though the same distribution object already appears outside the parentheses.

Intuitively, I think there is no need to multiply the distribution object with the continuous term again inside the parentheses.

I don't know whether I am wrong mathematically, so I am asking this question.
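To make the question concrete, here is a minimal runnable sketch of the two readings of the policy-loss formula; the tensor shapes and random values below are assumptions for illustration only, not taken from the repo:

import torch

# Assumed shapes: batch of 4 states, 3 discrete actions.
B, D = 4, 3
prob_d = torch.softmax(torch.randn(B, D), dim=1)   # pi_d(a_d | s)
log_pi_c = torch.randn(B, D)                       # stand-in for log pi_c(a_c | s, a_d), one entry per discrete action
min_qf_pi = torch.randn(B, D)                      # stand-in for min(Q1, Q2) per discrete action
alpha = 0.2

# Formulation as written in the repo: log_pi_c is weighted by prob_d a second time inside the parentheses.
policy_loss_repo = (prob_d * (alpha * prob_d * log_pi_c - min_qf_pi)).sum(1).mean()

# Formulation the question expects: the expectation over the discrete action is taken only once, by the outer prob_d.
policy_loss_alt = (prob_d * (alpha * log_pi_c - min_qf_pi)).sum(1).mean()

print(policy_loss_repo.item(), policy_loss_alt.item())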

Dependency of actions

Hi, thank you so much for offering this code.

I read the code: hybrid_sac_platform.py

And, the paper said:
If the continuous action a_c must depend on the discrete action chosen by the agent, then a_d can be used as input when computing µ_c and σ_c.

I think the continuous action should depend on the discrete action, but I could not find where a_d is used to compute µ_c and σ_c.

Could you point me to where this dependency of actions is implemented?

Thank you very much in advance!
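For reference, here is a minimal sketch of one way such a dependency could be implemented, i.e. feeding a one-hot encoding of a_d into the layers that produce µ_c and σ_c. This is not code from hybrid_sac_platform.py; the module and all names are hypothetical and only illustrate the sentence quoted from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DependentContinuousHead(nn.Module):
    # Hypothetical head: conditions mu_c / sigma_c on the chosen discrete action
    # by concatenating its one-hot encoding with the observation features.
    def __init__(self, obs_dim, n_discrete, cont_dim, hidden=64):
        super().__init__()
        self.n_discrete = n_discrete
        self.fc = nn.Linear(obs_dim + n_discrete, hidden)
        self.mu = nn.Linear(hidden, cont_dim)
        self.log_std = nn.Linear(hidden, cont_dim)

    def forward(self, obs, a_d):
        one_hot = F.one_hot(a_d, self.n_discrete).float()   # a_d: LongTensor of shape (batch,)
        h = F.relu(self.fc(torch.cat([obs, one_hot], dim=-1)))
        return self.mu(h), self.log_std(h)

# e.g. mu_c, log_std_c = DependentContinuousHead(9, 3, 1)(torch.randn(4, 9), torch.tensor([0, 2, 1, 0]))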

about policy loss

Hi, thank you for this wonderful work.

I am new to RL, and I am trying to use your code in my env;

however, this policy loss seems a little strange.

[attached image of the policy loss omitted]

I wonder whether this is expected, or whether I have made a mistake somewhere?

Thank you again, and all the best.

numpy.core._exceptions.MemoryError: Unable to allocate 8.00 GiB for an array with shape (2147483647,) and data type int32

python hybrid_sac_platform.py --seed 7 --gym-id Platform-v0 --total-timesteps 100000 --learning-starts 1000
pygame 2.1.0 (SDL 2.0.16, Python 3.8.16)
Hello from the pygame community. https://www.pygame.org/contribute.html
D:\Anaconda3\envs\pytorch2023\lib\site-packages\gym\spaces\box.py:73: UserWarning: WARN: Box bound precision lowered by casting to float32
logger.warn(
<class 'gym.spaces.tuple.Tuple'>
Tuple(Box([0. 0. 0. 0. 0. 0. 0. 0. 0.], [1. 1. 1. 1. 1. 1. 1. 1. 1.], (9,), float32), Discrete(200))
Traceback (most recent call last):
File "hybrid_sac_platform.py", line 155, in
env.action_space.seed(args.seed)
File "D:\Anaconda3\envs\pytorch2023\lib\site-packages\gym\spaces\tuple.py", line 30, in seed
subseeds = self.np_random.choice(
File "mtrand.pyx", line 1010, in numpy.random.mtrand.RandomState.choice
File "mtrand.pyx", line 4699, in numpy.random.mtrand.RandomState.permutation
numpy.core._exceptions.MemoryError: Unable to allocate 8.00 GiB for an array with shape (2147483647,) and data type int32

I would be very grateful if you could answer this question.
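Not the author, but the traceback shows the failure happens inside gym's Tuple.seed(): on this gym/numpy combination it draws sub-seeds with np_random.choice(..., replace=False) over the full int32 range, which triggers a permutation of ~2^31 integers and the 8 GiB allocation. A minimal workaround sketch (an assumption on my part, not an official fix) is to seed each sub-space directly instead of the Tuple:

# Replace env.action_space.seed(args.seed) with:
for i, space in enumerate(env.action_space.spaces):
    space.seed(args.seed + i)

Upgrading gym to a version where Tuple.seed() no longer samples sub-seeds this way may also avoid the error.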

Don't try and do the continuous action scaling in the Policy network.

This isn't so much an issue with the code as it is a user error that I'd like to help others avoid:

For anyone who is going to use this code, make sure that you DO NOT try to do the action scaling/bias in the policy model itself. I'm using a custom environment, so I figured it would be easier to do the scaling in Policy.get_action(), as the cleanrl implementation does (lines 133 and 136). With the scaling in the policy, my code refused to converge even on very simple cases. Unless I'm very bad at merging the two codebases, I think there is an issue with either (i) erroneous values getting attached to the backpropagation graph or (ii) values being stored in the replay buffer with the wrong scaling.

Solution: Leave the code completely alone and do all of your scaling inside the environment. For anyone else using a custom environment, here's the easy approach:

[attached screenshots: extract_actions and scale_bias snippets]

where as_high, as_low are the high and low parameters (of type np.array) that you pass to the environment's Box() space in __init__().
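A minimal sketch of what the scaling inside the environment could look like, following the suggestion above (this is not the code from the attached screenshots; the function name and the assumption that the policy emits actions in [-1, 1] are mine):

import numpy as np

def scale_action(a_c, as_low, as_high):
    # Map a policy output assumed to lie in [-1, 1] onto the Box bounds [as_low, as_high].
    scale = (as_high - as_low) / 2.0
    bias = (as_high + as_low) / 2.0
    return a_c * scale + bias

# e.g. inside the environment's step():
# a_env = scale_action(np.asarray(a_c), self.as_low, self.as_high)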

Also, thanks for making this code. It has helped me out a lot!
