
sbx's Introduction


Stable Baselines Jax (SB3 + Jax = SBX)

Proof of concept version of Stable-Baselines3 in Jax.

Implemented algorithms:

  • SAC
  • TQC
  • DroQ (as a special configuration of SAC, see the note below)
  • CrossQ
  • DDPG
  • TD3
  • PPO
  • DQN

Install using pip

For the latest master version:

pip install git+https://github.com/araffin/sbx

or:

pip install sbx-rl

Example

import gymnasium as gym

from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, CrossQ

env = gym.make("Pendulum-v1", render_mode="human")

model = TQC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000, progress_bar=True)

vec_env = model.get_env()
obs = vec_env.reset()
for _ in range(1000):
    vec_env.render()
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)

vec_env.close()
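Since SBX follows the SB3 API, saving and reloading a model should work the same way as in SB3. A minimal sketch continuing the example above (the file name is arbitrary):

model.save("tqc_pendulum")  # writes tqc_pendulum.zip
loaded_model = TQC.load("tqc_pendulum", env=env)
loaded_model.learn(total_timesteps=1_000, progress_bar=True)  # continue training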

Using SBX with the RL Zoo

Since SBX shares the SB3 API, it is compatible with the RL Zoo; you just need to override the algorithm mapping:

import rl_zoo3
import rl_zoo3.train
from rl_zoo3.train import train
from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, CrossQ

rl_zoo3.ALGOS["ddpg"] = DDPG
rl_zoo3.ALGOS["dqn"] = DQN
# See note below to use DroQ configuration
# rl_zoo3.ALGOS["droq"] = DroQ
rl_zoo3.ALGOS["sac"] = SAC
rl_zoo3.ALGOS["ppo"] = PPO
rl_zoo3.ALGOS["td3"] = TD3
rl_zoo3.ALGOS["tqc"] = TQC
rl_zoo3.ALGOS["crossq"] = CrossQ
rl_zoo3.train.ALGOS = rl_zoo3.ALGOS
rl_zoo3.exp_manager.ALGOS = rl_zoo3.ALGOS

if __name__ == "__main__":
    train()

Then you can run this script as you would with the RL Zoo:

python train.py --algo sac --env HalfCheetah-v4 -params train_freq:4 gradient_steps:4 -P

The same goes for the enjoy script:

import rl_zoo3
import rl_zoo3.enjoy
from rl_zoo3.enjoy import enjoy
from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, CrossQ

rl_zoo3.ALGOS["ddpg"] = DDPG
rl_zoo3.ALGOS["dqn"] = DQN
# See note below to use DroQ configuration
# rl_zoo3.ALGOS["droq"] = DroQ
rl_zoo3.ALGOS["sac"] = SAC
rl_zoo3.ALGOS["ppo"] = PPO
rl_zoo3.ALGOS["td3"] = TD3
rl_zoo3.ALGOS["tqc"] = TQC
rl_zoo3.ALGOS["crossq"] = CrossQ
rl_zoo3.enjoy.ALGOS = rl_zoo3.ALGOS
rl_zoo3.exp_manager.ALGOS = rl_zoo3.ALGOS

if __name__ == "__main__":
    enjoy()

Note about DroQ

DroQ is a special configuration of SAC.

To get the algorithm with the hyperparameters from the paper, use the following RL Zoo config:

HalfCheetah-v4:
  n_timesteps: !!float 1e6
  policy: 'MlpPolicy'
  learning_starts: 10000
  gradient_steps: 20
  policy_delay: 20
  policy_kwargs: "dict(dropout_rate=0.01, layer_norm=True)"

and then launch training with the RL Zoo script defined above: `python train.py --algo sac --env HalfCheetah-v4 -c droq.yml -P`.

We recommend tuning the policy_delay and gradient_steps parameters to trade off speed and sample efficiency. Using a higher learning rate for the Q-value function also helps: qf_learning_rate: !!float 1e-3.

Note: when using the DroQ configuration with CrossQ, you should set layer_norm=False, as CrossQ already uses batch normalization.
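For reference, here is a rough sketch of the same DroQ-style setup written directly in Python with SBX, using the values from the config above (treat it as an illustration rather than the canonical settings):

import gymnasium as gym

from sbx import SAC

env = gym.make("HalfCheetah-v4")

# DroQ = SAC with dropout + layer norm on the Q-network and many gradient steps.
model = SAC(
    "MlpPolicy",
    env,
    learning_starts=10_000,
    gradient_steps=20,
    policy_delay=20,
    qf_learning_rate=1e-3,  # higher learning rate for the Q-value function
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),
    verbose=1,
)
model.learn(total_timesteps=1_000_000, progress_bar=True)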

Benchmark

A partial benchmark can be found on OpenRL Benchmark, where you can also find several reports.

Citing the Project

To cite this repository in publications:

@article{stable-baselines3,
  author  = {Antonin Raffin and Ashley Hill and Adam Gleave and Anssi Kanervisto and Maximilian Ernestus and Noah Dormann},
  title   = {Stable-Baselines3: Reliable Reinforcement Learning Implementations},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {268},
  pages   = {1-8},
  url     = {http://jmlr.org/papers/v22/20-1364.html}
}

Maintainers

Stable-Baselines3 is currently maintained by Ashley Hill (aka @hill-a), Antonin Raffin (aka @araffin), Maximilian Ernestus (aka @ernestum), Adam Gleave (@AdamGleave), Anssi Kanervisto (@Miffyli) and Quentin Gallouédec (@qgallouedec).

Important Note: We do not provide technical support or consulting, and we do not answer personal questions by email. Please post your question on the RL Discord, Reddit, or Stack Overflow instead.

How To Contribute

For anyone interested in making the baselines better: there is still some documentation that needs to be written. If you want to contribute, please read the CONTRIBUTING.md guide first.

Contributors

We would like to thank our contributors: @jan1854.

sbx's People

Contributors

araffin, jan1854, paolodelia99, theovincent


sbx's Issues

Mujoco XLA - MJX Integration

Since the environment is the biggest bottleneck for SB3 training performance, I am considering integrating SB3 with MuJoCo XLA (MJX), which is MuJoCo written in JAX. Would this integration increase performance? MJX was released with huge performance improvements together with Brax, which includes RL algorithms in JAX. Is SBX fully written in JAX?

[Bug] TQC Hyperparameter optimization: Results do not match the reference. This is likely a bug/unexpected loss of precision.

🐛 Bug

Hi,

When I try to run TQC hyperparameter optimization with multiple jobs (n-jobs>1) with a GPU (this also happens with multiple CPU cores and n-jobs=1), it gives me this error:

2024-04-07 14:35:59.992779: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 0: -inf, expected -0.000287323
2024-04-07 14:35:59.992804: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 1: -inf, expected -0.000267224
2024-04-07 14:35:59.992808: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 2: -inf, expected -0.000226477
2024-04-07 14:35:59.992811: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 3: -inf, expected -0.000281823
2024-04-07 14:35:59.992813: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 4: -inf, expected -0.000262532
2024-04-07 14:35:59.992815: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 5: -inf, expected -0.000252724
2024-04-07 14:35:59.992818: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 6: -inf, expected -0.000250007
2024-04-07 14:35:59.992820: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 7: -inf, expected -0.000265674
2024-04-07 14:35:59.992823: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 8: -inf, expected -0.00021464
2024-04-07 14:35:59.992825: E external/xla/xla/service/gpu/buffer_comparator.cc:143] Difference at 9: -inf, expected -0.000204733
E0407 14:35:59.992828  798907 triton_autotuner.cc:766] Results do not match the reference. This is likely a bug/unexpected loss of precision.

To Reproduce

python rl-baselines3-zoo/train_sbx.py --algo tqc --env Pendulum-v1 -n 5000 --n-trials 50 --num-threads 1 --n-jobs 4 --log-interval 4900 --eval-episodes 16 --n-eval-envs 8 --seed 8 --vec-env "dummy" -optimize --sampler tpe --pruner median --n-startup-trials 10
[W 2024-04-07 14:36:00,208] Trial 16 failed with parameters: {'gamma': 0.995, 'learning_rate': 0.23149128592335125, 'batch_size': 1024, 'buffer_size': 10000, 'learning_starts': 1000, 'train_freq': 16, 'tau': 0.08, 'log_std_init': -0.3684256821552643, 'net_arch': 'medium', 'n_quantiles': 32, 'top_quantiles_to_drop_per_net': 30} because of the following error: XlaRuntimeError('INTERNAL: All algorithms tried for %dot.384 = f32[1024,512]{1,0} dot(f32[1024,32]{1,0} %broadcast.2180, f32[512,32]{1,0} %parameter_0.14), lhs_contracting_dims={1}, rhs_contracting_dims={1}, metadata={op_name="jit(_train)/jit(main)/while/body/cond/branch_1_fun/jit(update_actor)/transpose(jvp(Critic))/Dense_2/dot_general[dimension_numbers=(((1,), (1,)), ((), ())) precision=None preferred_element_type=None]" source_file="/scratch/network/.../rl-baselines3-zoo/rl_zoo3/exp_manager.py" source_line=793} failed. Falling back to default algorithm.  Per-algorithm errors:\n  Results do not match the reference. This is likely a bug/unexpected loss of precision.

Traceback (most recent call last):
  File "/home/.conda/envs/.../lib/python3.10/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "/scratch/network/.../rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 793, in objective
    model.learn(self.n_timesteps, callback=callbacks, **learn_kwargs)  # type: ignore[arg-type]
  File "/home/.conda/envs/.../lib/python3.10/site-packages/sbx/tqc/tqc.py", line 183, in learn
    return super().learn(
  File "/home/.conda/envs/.../lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/.conda/envs/.../lib/python3.10/site-packages/sbx/tqc/tqc.py", line 220, in train
    ) = self._train(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: All algorithms tried for %dot.384 = f32[1024,512]{1,0} dot(f32[1024,32]{1,0} %broadcast.2180, f32[512,32]{1,0} %parameter_0.14), lhs_contracting_dims={1}, rhs_contracting_dims={1}, metadata={op_name="jit(_train)/jit(main)/while/body/cond/branch_1_fun/jit(update_actor)/transpose(jvp(Critic))/Dense_2/dot_general[dimension_numbers=(((1,), (1,)), ((), ())) precision=None preferred_element_type=None]" source_file="/scratch/network/.../rl-baselines3-zoo/rl_zoo3/exp_manager.py" source_line=793} failed. Falling back to default algorithm.


Traceback (most recent call last):
  File "/scratch/network/.../.../rl-baselines3-zoo/train_sbx.py", line 19, in <module>
    train()
  File "/scratch/network/.../.../rl-baselines3-zoo/rl_zoo3/train.py", line 275, in train
    exp_manager.hyperparameters_optimization()
  File "/scratch/network/.../.../rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 874, in hyperparameters_optimization
    study.optimize(self.objective, n_jobs=self.n_jobs, n_trials=self.n_trials)
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/optuna/study/study.py", line 451, in optimize
    _optimize(
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/optuna/study/_optimize.py", line 99, in _optimize
    f.result()
  File "/home/.../.conda/envs/.../lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/.../.conda/envs/.../lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/.../.conda/envs/.../lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/optuna/study/_optimize.py", line 159, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/optuna/study/_optimize.py", line 247, in _run_trial
    raise func_err
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "/scratch/network/.../.../rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 793, in objective
    model.learn(self.n_timesteps, callback=callbacks, **learn_kwargs)  # type: ignore[arg-type]
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/sbx/tqc/tqc.py", line 183, in learn
    return super().learn(
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/.../.conda/envs/.../lib/python3.10/site-packages/sbx/tqc/tqc.py", line 220, in train
    ) = self._train(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: All algorithms tried for %dot.384 = f32[1024,512]{1,0} dot(f32[1024,32]{1,0} %broadcast.2180, f32[512,32]{1,0} %parameter_0.14), lhs_contracting_dims={1}, rhs_contracting_dims={1}, metadata={op_name="jit(_train)/jit(main)/while/body/cond/branch_1_fun/jit(update_actor)/transpose(jvp(Critic))/Dense_2/dot_general[dimension_numbers=(((1,), (1,)), ((), ())) precision=None preferred_element_type=None]" source_file="/scratch/network/.../.../rl-baselines3-zoo/rl_zoo3/exp_manager.py" source_line=793} failed. Falling back to default algorithm.

### System Info

Describe the characteristic of your environment:

  • Library installed through pip

  • GPU models and configuration
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
    |-----------------------------------------+----------------------+----------------------+
    | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
    | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
    | | | MIG M. |
    |=========================================+======================+======================|
    | 0 NVIDIA A100 80GB PCIe On | 00000000:0D:00.0 Off | 0 |
    | N/A 40C P0 67W / 300W | 3508MiB / 81920MiB | 0% Default |
    | | | Disabled |
    +-----------------------------------------+----------------------+----------------------+
    | 1 NVIDIA A100 80GB PCIe On | 00000000:B5:00.0 Off | 0 |
    | N/A 38C P0 49W / 300W | 5MiB / 81920MiB | 0% Default |
    | | | Disabled |
    +-----------------------------------------+----------------------+----------------------+

  • Python 3.10.14

  • pytorch 2.2.2 py3.10_cuda12.1_cudnn8.9.2_0
    pytorch-cuda 12.1 ha16c6d3_5 pytorch
    pytorch-mutex 1.0 cuda pytorch
    torchtriton 2.2.0 py310 pytorch

  • Gym version
    gymnasium 0.29.1

  • Versions of any other relevant libraries
    jax 0.4.25 pyhd8ed1ab_0 conda-forge
    jax-jumpy 1.0.0 pyhd8ed1ab_0 conda-forge
    jaxlib 0.4.23 cuda118py310h8c47008_200 conda-forge

Additional context

I've noticed there's no bug when n-jobs=1, only when running multiple jobs. Maybe something with the way Optuna runs multiple jobs?

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

[Question] TypeError when exporting a model to PyTorch in SBX

🐛 Bug

When using PyTorch JIT to trace and save a trained model with SBX, an exception occurs.

To Reproduce

The following code works fine for a model trained with TD3 in SB3. However, a TypeError occurs when trying to trace a model trained with SBX.

import torch as th
from stable_baselines3.common.policies import BasePolicy
from sbx import TD3
from typing import Tuple

class OnnxableSB3Policy(th.nn.Module):
    def __init__(self, policy: BasePolicy):
        super().__init__()
        self.policy = policy

    def forward(self, observation: th.Tensor) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        return self.policy(observation, deterministic=True)
    
jit_path = "model.pt"

cuda_id = th.cuda.current_device()
model = TD3.load("model", device=cuda_id)
onnxable_model = OnnxableSB3Policy(model.policy)

observation_size = model.observation_space.shape
dummy_input = th.randn(1, *observation_size).to(device=cuda_id)

# Trace and optimize the module
traced_module = th.jit.trace(onnxable_model.eval(), dummy_input)
frozen_module = th.jit.freeze(traced_module)
frozen_module = th.jit.optimize_for_inference(frozen_module)
th.jit.save(frozen_module, jit_path)

Traceback (most recent call last):
  File "/home/.venv/lib/python3.12/site-packages/jax/_src/api_util.py", line 584, in shaped_abstractify
    return _shaped_abstractify_handlers[type(x)](x)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: <class 'torch.Tensor'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/export_to_pt.py", line 33, in <module>
    traced_module = th.jit.trace(onnxable_model.eval(), dummy_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/jit/_trace.py", line 806, in trace
    return trace_module(
           ^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/jit/_trace.py", line 1074, in trace_module
    module._c._create_method_from_trace(
  File "/home/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/export_to_pt.py", line 21, in forward
    return self.policy(observation, deterministic=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/sbx/td3/policies.py", line 178, in forward
    return self._predict(obs, deterministic=deterministic)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.venv/lib/python3.12/site-packages/sbx/td3/policies.py", line 187, in _predict
    return TD3Policy.select_action(self.actor_state, observation)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot interpret 'torch.float32' as a data type
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
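For reference, a minimal inference-only sketch that avoids the error above: SBX policies are Flax modules, so they expect NumPy/JAX arrays rather than torch tensors. The model file name mirrors the snippet above; this does not produce a TorchScript export, it only shows plain prediction:

import torch as th

from sbx import TD3

model = TD3.load("model")
observation_size = model.observation_space.shape
dummy_input = th.randn(1, *observation_size)

# Convert the torch tensor to a NumPy array before calling the Flax-based policy.
action, _ = model.predict(dummy_input.detach().cpu().numpy(), deterministic=True)
print(action)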

### System Info

- OS: Linux-4.18.0-513.11.1.el8_9.0.1.x86_64-x86_64-with-glibc2.28 # 1 SMP Sun Feb 11 10:42:18 UTC 2024
- Python: 3.12.1
- Stable-Baselines3: 2.3.0a2
- PyTorch: 2.2.1+cu121
- GPU Enabled: True
- Numpy: 1.26.4
- Cloudpickle: 3.0.0
- Gymnasium: 0.29.1
- OpenAI Gym: 0.26.2

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

self.key is never updated

Thank you for your work on this cool repo! It is really useful for my research :)

🐛 Bug

Why is self.key always the same after each self._train call? More precisely, why is this part of the code written like this:

sbx/sbx/sac/sac.py

Lines 446 to 449 in fcd647e

update_carry["actor_state"],
update_carry["ent_coef_state"],
key,
(update_carry["info"]["actor_loss"], update_carry["info"]["qf_loss"], update_carry["info"]["ent_coef_loss"]),

and not like this

update_carry["actor_state"], 
update_carry["ent_coef_state"], 
update_carry["key"],  # Return the new updated key
(update_carry["info"]["actor_loss"], update_carry["info"]["qf_loss"], update_carry["info"]["ent_coef_loss"]), 

?
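For context, a minimal standalone JAX example (not SBX code) of why the updated key has to be returned and stored: if the jitted function returns the original key instead of the split one, the caller keeps reusing the same randomness.

import jax

@jax.jit
def update(key):
    key, subkey = jax.random.split(key)
    noise = jax.random.normal(subkey, ())
    return key, noise  # returning the *new* key makes the next call differ

key = jax.random.PRNGKey(0)
for _ in range(3):
    key, noise = update(key)  # store the returned key, as suggested above
    print(noise)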

To Reproduce

Edit the method train in the file sbx/sac/sac.py to add the following line of code after the function _train has been called:

print("self.key", self.key)

Example: https://github.com/theovincent/sbx/blob/8327b98463c89b68f17ec0431d0cf3069cb7d7a7/sbx/sac/sac.py#L236

Create the following file, called train.py at the top level of the project:

import gymnasium as gym

from sbx import SAC

env = gym.make("Pendulum-v1")

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=110, progress_bar=True)

Running python train.py in the terminal yields

>>> python train.py
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]
self.key [2110677572 2465855137]

Expected behavior

Changing the code, as suggested earlier, fixes the problem. Here are the logs when the change is implemented:

>>> python train.py
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
self.key [3440514203 2996688322]
self.key [ 507603733 1743734701]
self.key [1106737823 3095002064]
self.key [ 372788615 2111558586]
self.key [1808065049 3808616220]
self.key [1837019053 2754803453]
self.key [1740029140 3438719296]
self.key [1088489055 1273990256]
self.key [3718340890 2050508589]
self.key [1872112782 1422931421]

### System Info

  • Describe how the library was installed (pip, docker, source, ...)
    Fork the repo, clone it, create a python virtual env, install the dependencies
python3 -m venv env
source env/bin/activate
pip install -e .
pip install gymnasium[classic-control]
  • GPU models and configuration
    The GPU is not used
  • pip version
    23.2.1
>>> import stable_baselines3 as sb3
>>> sb3.get_system_info()
- OS: Linux-6.5.0-27-generic-x86_64-with-glibc2.35 # 28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2
- Python: 3.11.5
- Stable-Baselines3: 2.3.0
- PyTorch: 2.2.2+cu121
- GPU Enabled: True
- Numpy: 1.26.4
- Cloudpickle: 3.0.0
- Gymnasium: 0.29.1

({'OS': 'Linux-6.5.0-27-generic-x86_64-with-glibc2.35 # 28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2', 'Python': '3.11.5', 'Stable-Baselines3': '2.3.0', 'PyTorch': '2.2.2+cu121', 'GPU Enabled': 'True', 'Numpy': '1.26.4', 'Cloudpickle': '3.0.0', 'Gymnasium': '0.29.1'}, '- OS: Linux-6.5.0-27-generic-x86_64-with-glibc2.35 # 28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2\n- Python: 3.11.5\n- Stable-Baselines3: 2.3.0\n- PyTorch: 2.2.2+cu121\n- GPU Enabled: True\n- Numpy: 1.26.4\n- Cloudpickle: 3.0.0\n- Gymnasium: 0.29.1\n')

Additional context

Before commit e564074, the key was updated each time the function self._train was called as you can see here:

sbx/sbx/sac/sac.py

Lines 389 to 392 in 0f9163d

actor_state,
ent_coef_state,
key,
(actor_loss_value, qf_loss_value, ent_coef_value),

This bug seems to be present for:

  • CrossQ
  • SAC
  • TD3
  • TQC

Checklist

  • [ X] I have checked that there is no similar issue in the repo (required)
  • [ X] I have read the documentation (required)
  • [ X] I have provided a minimal working example to reproduce the bug (required)

crash when using a custom network architecture


🤖 Custom Gym Environment

Please check your environment first using:

from stable_baselines3.common.env_checker import check_env

env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)

The environment passes check_env.

### Describe the bug

When using a custom network architecture, dict(net_arch=[1000, 500]), SBX fails.

### Code example

net = dict(net_arch=[1000, 500])
PPOmodel = PPO('MlpPolicy', env, policy_kwargs=net)


Traceback (most recent call last):
  File "/Users/eric/Documents/development/deepLearning/deepMind/sparky/train.py", line 12, in <module>
    t.train()
  File "/Users/eric/Documents/development/deepLearning/deepMind/sparky/trainEnviornment.py", line 280, in train
    PPOmodel = PPO('MlpPolicy', env,
  File "/Users/eric/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/ppo/ppo.py", line 165, in __init__
    self._setup_model()
  File "/Users/eric/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/ppo/ppo.py", line 171, in _setup_model
    self.policy = self.policy_class(  # pytype:disable=not-instantiable
  File "/Users/eric/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/ppo/policies.py", line 102, in __init__
    self.n_units = net_arch[0]["pi"][0]
TypeError: 'int' object is not subscriptable
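A possible workaround, assuming (from the traceback above, where the policy reads net_arch[0]["pi"][0]) that SBX's PPO policy expects the older list-of-dicts net_arch format; this is only a sketch, not a confirmed fix, and it uses Pendulum-v1 as a stand-in for the custom environment:

import gymnasium as gym

from sbx import PPO

env = gym.make("Pendulum-v1")  # stand-in for the custom env from the report

# Pass net_arch as a list containing a dict so that net_arch[0]["pi"][0] is valid.
policy_kwargs = dict(net_arch=[dict(pi=[1000, 500], vf=[1000, 500])])
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=1_000)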




### Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
  • I have checked my env using the env checker (required)
  • I have provided a minimal working example to reproduce the bug (required)

[Enhancement] Support for large gradient_steps in SAC

Description:
Using the Jax implementation of SAC with larger values of gradient_steps, e.g. 1000, is very slow to compile. Consider

sbx/sbx/sac/sac.py

Lines 333 to 352 in b8dbac1

@classmethod
@partial(jax.jit, static_argnames=["cls", "gradient_steps"])
def _train(
    cls,
    gamma: float,
    tau: float,
    target_entropy: np.ndarray,
    gradient_steps: int,
    data: ReplayBufferSamplesNp,
    policy_delay_indices: flax.core.FrozenDict,
    qf_state: RLTrainState,
    actor_state: TrainState,
    ent_coef_state: TrainState,
    key,
):
    actor_loss_value = jnp.array(0)
    for i in range(gradient_steps):
        def slice(x, step=i):

I think the problem lies in unrolling the loop over too many gradient steps. Removing line 334 (i.e. not jitting the function) avoids the problem.

To Reproduce

from sbx import SAC
import gymnasium as gym

env = gym.make('Pendulum-v1')
model = SAC('MlpPolicy', env, verbose=1, gradient_steps=1000)

model.learn(100000)

Expected behavior

It should compile fast.

Potential Fix

I adjusted the implementation by moving all computations in the loop body of SAC._train to a new jitted function gradient_step. Using this function in a JAX fori_loop solves the issue and compiles almost instantly. If you agree with this, I would propose a PR with my solution.
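For illustration, a self-contained sketch of that idea with a toy loss and optimizer (assumed names, not the actual SBX code): the per-step computation lives in a single function that jax.lax.fori_loop calls, so the compiled program no longer grows with gradient_steps.

from functools import partial

import jax
import jax.numpy as jnp
import optax

optimizer = optax.adam(1e-3)

def loss_fn(params, x):
    return jnp.sum((params * x) ** 2)

@partial(jax.jit, static_argnames=["gradient_steps"])
def train(params, opt_state, data, gradient_steps):
    def gradient_step(i, carry):
        params, opt_state = carry
        grads = jax.grad(loss_fn)(params, data[i])
        updates, opt_state = optimizer.update(grads, opt_state, params)
        return optax.apply_updates(params, updates), opt_state

    # The loop body is compiled once and iterated, instead of being unrolled.
    return jax.lax.fori_loop(0, gradient_steps, gradient_step, (params, opt_state))

params = jnp.ones(3)
opt_state = optimizer.init(params)
data = jnp.ones((1000, 3))
params, opt_state = train(params, opt_state, data, gradient_steps=1000)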

### System Info

  • Describe how the library was installed (pip, docker, source, ...): pip
  • sbx-rl version: 0.7.0
  • Python version: 3.11
  • Jax version: 0.4.14
  • Gymnasium version: 0.29

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

SBX becomes super slow when number of cpus are limited

🐛 Bug

SBX becomes much slower than SB3 when the number of CPUs is limited.

To Reproduce

Steps to reproduce the behavior.

'''
For installation please do -
pip install gym
pip install sbx
pip install mujoco
pip install shimmy
'''
import gym
import psutil
import random
import os, subprocess as sp


def train():
    pid = os.getpid()
    num_of_cpus = 4
    process = psutil.Process(pid)
    print("Process = ", pid)
    affinity = process.cpu_affinity()
    cpus_selected = random.sample(affinity, num_of_cpus)
    print("cpus_selected = ", cpus_selected)
    # print("iteration = ", iteration)
    process.cpu_affinity(cpus_selected)
    env = gym.make("Humanoid-v4")

    model = SAC("MlpPolicy", env, verbose=1)

    model.learn(total_timesteps=7e3, progress_bar=True)

# from stable_baselines3 import SAC

from sbx import SAC


train()

Expected behavior

If you want to compare SB3 vs SBX, you can uncomment from stable_baselines3 import SAC and comment out from sbx import SAC. I am noticing that SB3 is much faster than SBX in such situations.

### System Info

Describe the characteristic of your environment:


You can use sb3.get_system_info() to print relevant packages info:

import stable_baselines3 as sb3
sb3.get_system_info()
{'OS': 'Linux-6.5.0-15-generic-x86_64-with-glibc2.17 # 15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2', 'Python': '3.8.18', 'Stable-Baselines3': '2.3.0a1', 'PyTorch': '2.1.2+cu121', 'GPU Enabled': 'True', 'Numpy': '1.24.3', 'Cloudpickle': '3.0.0', 'Gymnasium': '0.29.1', 'OpenAI Gym': '0.26.2'}, '- OS: Linux-6.5.0-15-generic-x86_64-with-glibc2.17 # 15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2\n- Python: 3.8.18\n- Stable-Baselines3: 2.3.0a1\n- PyTorch: 2.1.2+cu121\n- GPU Enabled: True\n- Numpy: 1.24.3\n- Cloudpickle: 3.0.0\n- Gymnasium: 0.29.1\n- OpenAI Gym: 0.26.2\n')

{'Cloudpickle': '3.0.0',
 'GPU Enabled': 'True',
 'Gymnasium': '0.29.1',
 'Numpy': '1.24.3',
 'OS': 'Linux-6.5.0-15-generic-x86_64-with-glibc2.17 # 15~22.04.1-Ubuntu SMP '
       'PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2',
 'OpenAI Gym': '0.26.2',
 'PyTorch': '2.1.2+cu121',
 'Python': '3.8.18',
 'Stable-Baselines3': '2.3.0a1'}

Checklist

  • [ X] I have checked that there is no similar issue in the repo (required)
  • [ X] I have read the documentation (required)
  • [ X] I have provided a minimal working example to reproduce the bug (required)

[Feature Request] Passing custom activation functon in policy_kwargs

🚀 Feature

Possibility to pass a Flax activation function (from the flax.linen.activation module) when creating an SBX model, through the policy_kwargs argument.

Motivation

In the current implementation of sbx, users are unable to pass custom activation functions when creating a model. This limitation restricts flexibility and may not suit all users' needs.

Pitch

Example:

policy_kwargs = dict(activation_fn=my_custom_activation_fn, net_arch=dict(pi=[64, 64], qf=[64, 64]))

model = TD3("MlpPolicy",
                       env,
                      policy_kwargs=policy_kwargs,
                      verbose=1)

Idea on how to implement it

Add an activation_fn attribute to the underlying classes that compose the policy (like Critic and Actor in td3/policies.py).
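A rough sketch of that idea (assumed names and shapes, not the actual sbx classes): give the policy sub-modules an activation_fn attribute with a sensible default, which policy_kwargs could then forward.

from typing import Callable, Sequence

import flax.linen as nn
import jax
import jax.numpy as jnp

class Critic(nn.Module):
    net_arch: Sequence[int] = (256, 256)
    activation_fn: Callable[[jnp.ndarray], jnp.ndarray] = nn.relu  # configurable

    @nn.compact
    def __call__(self, obs: jnp.ndarray, action: jnp.ndarray) -> jnp.ndarray:
        x = jnp.concatenate([obs, action], axis=-1)
        for n_units in self.net_arch:
            x = self.activation_fn(nn.Dense(n_units)(x))
        return nn.Dense(1)(x)

# Example: build the critic with tanh instead of the default relu.
critic = Critic(activation_fn=nn.tanh)
params = critic.init(jax.random.PRNGKey(0), jnp.ones((1, 3)), jnp.ones((1, 1)))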

[Bug] TQC Entropy Coefficient


🐛 Bug

When running train with hyperparameter optimization with TQC (python train.py --algo tqc --optimize), it gives TypeError: TQC.__init__() got an unexpected keyword argument 'target_entropy'.

To Reproduce


import rl_zoo3
import rl_zoo3.train
from rl_zoo3.train import train
from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, CrossQ

rl_zoo3.ALGOS["ddpg"] = DDPG
rl_zoo3.ALGOS["dqn"] = DQN
# See note below to use DroQ configuration
# rl_zoo3.ALGOS["droq"] = DroQ
rl_zoo3.ALGOS["sac"] = SAC
rl_zoo3.ALGOS["ppo"] = PPO
rl_zoo3.ALGOS["td3"] = TD3
rl_zoo3.ALGOS["tqc"] = TQC
rl_zoo3.ALGOS["crossq"] = CrossQ
rl_zoo3.train.ALGOS = rl_zoo3.ALGOS
rl_zoo3.exp_manager.ALGOS = rl_zoo3.ALGOS

if __name__ == "__main__":
    train()
[W 2024-04-04 15:33:50,250] Trial 2 failed with parameters: {'gamma': 0.99, 'learning_rate': 0.0003250530792956964, 'batch_size': 256, 'buffer_size': 100000, 'learning_starts': 0, 'train_freq': 32, 'tau': 0.005, 'log_std_init': -1.275286479641816, 'net_arch': 'medium', 'n_quantiles': 32, 'top_quantiles_to_drop_per_net': 24} because of the following error: TypeError("TQC.__init__() got an unexpected keyword argument 'target_entropy'").
Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File ".../rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 753, in objective
    model = ALGOS[self.algo](
TypeError: TQC.__init__() got an unexpected keyword argument 'target_entropy'

Expected behavior

It should run hyperparameter optimization without the error.

### System Info

Describe the characteristic of your environment:

  • installed from source


Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

[Feature Request] Update type annotations

Many type annotations are deprecated

/home/antonin/miniconda3/lib/python3.11/site-packages/chex/_src/pytypes.py:54: DeprecationWarning: jax.random.KeyArray is deprecated. Use jax.Array for annotations, and jax.dtypes.issubdtype(arr.dtype, jax.dtypes.prng_key) for runtime detection of typed prng keys (i.e. keys created with jax.random.key).
/home/antonin/miniconda3/lib/python3.11/site-packages/flax/linen/activation.py:37: DeprecationWarning: jax.nn.normalize is deprecated. Use jax.nn.standardize instead.
  from jax.nn import normalize
/home/antonin/miniconda3/lib/python3.11/site-packages/chex/_src/pytypes.py:53: DeprecationWarning: jax.core.Shape is deprecated. Use Shape = Sequence[int | Any].
/home/antonin/miniconda3/lib/python3.11/site-packages/flax/configurations.py:42: DeprecationWarning: jax.config.define_bool_state is deprecated. Please use other libraries for configuration instead.
  return jax_config.define_bool_state('flax_' + name, default, help)

[Bug] example supplied in readme crashing


🐛 Bug

If I run the example in your readme, it crashes.

To Reproduce

Steps to reproduce the behavior.

import gym

from sbx import TQC, DroQ, SAC, PPO, DQN

env = gym.make("Pendulum-v1")

model = TQC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000, progress_bar=True)

vec_env = model.get_env()
obs = vec_env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)
    vec_env.render()

vec_env.close()


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [34], in <cell line: 7>()
      3 from sbx import TQC, DroQ, SAC, PPO, DQN
      5 env = gym.make("Pendulum-v1")
----> 7 model = TQC("MlpPolicy", env, verbose=1)
      8 model.learn(total_timesteps=10_000, progress_bar=True)
     10 vec_env = model.get_env()

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/tqc/tqc.py:102, in TQC.__init__(self, policy, env, learning_rate, qf_learning_rate, buffer_size, learning_starts, batch_size, tau, gamma, train_freq, gradient_steps, policy_delay, top_quantiles_to_drop_per_net, action_noise, ent_coef, use_sde, sde_sample_freq, use_sde_at_warmup, tensorboard_log, policy_kwargs, verbose, seed, device, _init_setup_model)
     99 self.policy_kwargs["top_quantiles_to_drop_per_net"] = top_quantiles_to_drop_per_net
    101 if _init_setup_model:
--> 102     self._setup_model()

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/tqc/tqc.py:115, in TQC._setup_model(self)
    107 if self.policy is None:
    108     self.policy = self.policy_class(  # pytype:disable=not-instantiable
    109         self.observation_space,
    110         self.action_space,
    111         self.lr_schedule,
    112         **self.policy_kwargs,  # pytype:disable=not-instantiable
    113     )
--> 115     self.key = self.policy.build(self.key, self.lr_schedule, self.qf_learning_rate)
    117     self.key, ent_key = jax.random.split(self.key, 2)
    119     self.actor = self.policy.actor

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/tqc/policies.py:142, in TQCPolicy.build(self, key, lr_schedule, qf_learning_rate)
    137 # Hack to make gSDE work without modifying internal SB3 code
    138 self.actor.reset_noise = self.reset_noise
    140 self.actor_state = TrainState.create(
    141     apply_fn=self.actor.apply,
--> 142     params=self.actor.init(actor_key, obs),
    143     tx=self.optimizer_class(learning_rate=lr_schedule(1), **self.optimizer_kwargs),
    144 )
    146 self.qf = Critic(
    147     dropout_rate=self.dropout_rate,
    148     use_layer_norm=self.layer_norm,
    149     n_units=self.n_units,
    150     n_quantiles=self.n_quantiles,
    151 )
    153 self.qf1_state = RLTrainState.create(
    154     apply_fn=self.qf.apply,
    155     params=self.qf.init(
   (...)
    165     tx=optax.adam(learning_rate=qf_learning_rate),
    166 )

    [... skipping hidden 9 frame]

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/sbx/tqc/policies.py:66, in Actor.__call__(self, x)
     63 log_std = nn.Dense(self.action_dim)(x)
     64 log_std = jnp.clip(log_std, self.log_std_min, self.log_std_max)
     65 dist = TanhTransformedDistribution(
---> 66     tfd.MultivariateNormalDiag(loc=mean, scale_diag=jnp.exp(log_std)),
     67 )
     68 return dist

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/decorator.py:232, in decorate.<locals>.fun(*args, **kw)
    230 if not kwsyntax:
    231     args, kw = fix(args, kw, sig)
--> 232 return caller(func, *(extras + args), **kw)

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/tensorflow_probability/substrates/jax/distributions/distribution.py:342, in _DistributionMeta.__new__.<locals>.wrapped_init(***failed resolving arguments***)
    339 # Note: if we ever want to have things set in `self` before `__init__` is
    340 # called, here is the place to do it.
    341 self_._parameters = None
--> 342 default_init(self_, *args, **kwargs)
    343 # Note: if we ever want to override things set in `self` by subclass
    344 # `__init__`, here is the place to do it.
    345 if self_._parameters is None:
    346   # We prefer subclasses will set `parameters = dict(locals())` because
    347   # this has nearly zero overhead. However, failing to do this, we will
    348   # resolve the input arguments dynamically and only when needed.

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/tensorflow_probability/substrates/jax/distributions/mvn_diag.py:235, in MultivariateNormalDiag.__init__(self, loc, scale_diag, scale_identity_multiplier, validate_args, allow_nan_stats, experimental_use_kahan_sum, name)
    232 if scale_diag is not None:
    233   diag_cls = (KahanLogDetLinOpDiag if experimental_use_kahan_sum else
    234               tf.linalg.LinearOperatorDiag)
--> 235   scale = diag_cls(
    236       diag=scale_diag,
    237       is_non_singular=True,
    238       is_self_adjoint=True,
    239       is_positive_definite=False)
    240 else:
    241   # Deprecated behavior; breaks variable-safety rules by calling
    242   # `tf.shape(loc)`.
    243   num_rows = tf.compat.dimension_value(loc.shape[-1])

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/tensorflow_probability/python/internal/backend/jax/gen/linear_operator_diag.py:191, in LinearOperatorDiag.__init__(self, diag, is_non_singular, is_self_adjoint, is_positive_definite, is_square, name)
    182 super(LinearOperatorDiag, self).__init__(
    183     dtype=self._diag.dtype,
    184     is_non_singular=is_non_singular,
   (...)
    188     parameters=parameters,
    189     name=name)
    190 # TODO(b/143910018) Remove graph_parents in V3.
--> 191 self._set_graph_parents([self._diag])

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/tensorflow_probability/python/internal/backend/jax/gen/linear_operator.py:1177, in LinearOperator._set_graph_parents(self, graph_parents)
   1174 for i, t in enumerate(graph_parents):
   1175   if t is None or not (linear_operator_util.is_ref(t) or
   1176                        ops.is_tensor(t)):
-> 1177     raise ValueError("Graph parent item %d is not a Tensor; %s." % (i, t))
   1178 self._graph_parents = graph_parents

ValueError: Graph parent item 0 is not a Tensor; [[0.48654944]].
### Expected behavior

If I import Stable-Baselines3's PPO or other algorithms, they train the example perfectly. I expect SBX to do the same.


### System Info

Describe the characteristic of your environment:

Operating system: MacOS Monterey

  • Installed via: pip
  • Python version: 3.9.15
  • PyTorch version: torch 1.13.1
  • Gym version: gym 0.21.0



Checklist

  • [x ] I have checked that there is no similar issue in the repo (required)
  • [x ] I have read the documentation (required)
  • [x ] I have provided a minimal working example to reproduce the bug (required)

[Bug] AttributeError: module 'tensorflow.python.util.tf_inspect' has no attribute 'Parameter'

🐛 Bug

To Reproduce

import gymnasium as gym

from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, DroQ

env = gym.make("Pendulum-v1")

model = TQC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000, progress_bar=True)

vec_env = model.get_env()
obs = vec_env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)
    vec_env.render()

vec_env.close()

error:

Traceback (most recent call last):
  File "check_sbx.py", line 3, in <module>
    from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, DroQ
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/__init__.py", line 5, in <module>
    from sbx.droq import DroQ
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/droq/__init__.py", line 1, in <module>
    from sbx.droq.droq import DroQ
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/droq/droq.py", line 7, in <module>
    from sbx.tqc.policies import TQCPolicy
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/tqc/__init__.py", line 1, in <module>
    from sbx.tqc.tqc import TQC
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/tqc/tqc.py", line 19, in <module>
    from sbx.tqc.policies import TQCPolicy
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/tqc/policies.py", line 13, in <module>
    from sbx.common.distributions import TanhTransformedDistribution
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/sbx/common/distributions.py", line 7, in <module>
    tfd = tfp.distributions
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 53, in __getattr__
    module = self._load()
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 40, in _load
    module = importlib.import_module(self.__name__)
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/substrates/jax/__init__.py", line 41, in <module>
    from tensorflow_probability.substrates.jax import bijectors
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/substrates/jax/bijectors/__init__.py", line 19, in <module>
    from tensorflow_probability.substrates.jax.bijectors.absolute_value import AbsoluteValue
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/substrates/jax/bijectors/absolute_value.py", line 17, in <module>
    from tensorflow_probability.python.internal.backend.jax.compat import v2 as tf
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/__init__.py", line 19, in <module>
    from tensorflow_probability.python.internal.backend.jax import compat
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/compat.py", line 18, in <module>
    from tensorflow_probability.python.internal.backend.jax import v2
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/v2.py", line 27, in <module>
    from tensorflow_probability.python.internal.backend.jax import linalg
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/linalg.py", line 28, in <module>
    from tensorflow_probability.python.internal.backend.jax.gen import adjoint_registrations as _adjoint_registrations
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/gen/adjoint_registrations.py", line 37, in <module>
    from tensorflow_probability.python.internal.backend.jax.gen import linear_operator
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/gen/linear_operator.py", line 58, in <module>
    from tensorflow_probability.python.internal.backend.jax.gen import linear_operator_algebra
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/gen/linear_operator_algebra.py", line 40, in <module>
    from tensorflow_probability.python.internal.backend.jax import tf_inspect
  File "/home/user/miniconda3/envs/decision-transformer-gym/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/jax/tf_inspect.py", line 26, in <module>
    Parameter = inspect.Parameter
AttributeError: module 'tensorflow.python.util.tf_inspect' has no attribute 'Parameter'

Expected behavior

Should have started training.

### System Info

- OS: Linux-6.5.0-17-generic-x86_64-with-glibc2.10 # 17~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan 16 14:32:32 UTC 2
- Python: 3.8.5
- Stable-Baselines3: 2.3.0a2
- PyTorch: 2.2.0+cu121
- GPU Enabled: True
- Numpy: 1.23.0
- Cloudpickle: 3.0.0
- Gymnasium: 0.29.1
- OpenAI Gym: 0.18.3


Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

[Question] MaskablePPO support

Hello,

I have a question about SBX. I am using MaskablePPO from SB3-Contrib because action masking is really important in my RL problem. I found out about the SBX library, which seems very promising for speeding up computation. My question is: does SBX support MaskablePPO from SB3-Contrib?

Thank you very much in advance for your help,
Best,
G.

[Feature Request] Dict Obs Spaces Support

🚀 Feature

As far as I understood from some preliminary tests, SBX currently does not support Dict observation spaces. Do you have this feature on your roadmap? If yes, when is it expected to be added?

Motivation

It would extend the list of supported use cases to those featuring more complex/structured observation spaces.

### Checklist

  • I have checked that there is no similar issue in the repo (required)

[Question] Why is fps much lower than CPU if using GPU

Question

I encountered a problem where RL training runs at about 5000 fps with SBX on CPU, but after jaxlib-cuda11 was installed it only runs at about 2000 fps.

Platform: Ubuntu 20.04, x86_64
Python version: 3.9.12
GPU: NVIDIA RTX 4090
CPU: i9-13900KS

nvidia-smi:

Thu Mar 21 14:41:29 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  Off |
|  0%   48C    P2    68W / 500W |  20952MiB / 24564MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1281      G   /usr/lib/xorg/Xorg                941MiB |
|    0   N/A  N/A      2137      G   /usr/bin/gnome-shell              183MiB |
|    0   N/A  N/A      4972      G   ...RendererForSitePerProcess        7MiB |
|    0   N/A  N/A      5435      G   ...2gtk-4.0/WebKitWebProcess        6MiB |
|    0   N/A  N/A      6950      G   ...b2020b/bin/glnxa64/MATLAB        6MiB |
|    0   N/A  N/A      8317      G   ...17D222A6D1FB8847155E9F895       19MiB |
|    0   N/A  N/A     89295      G   ...on=20240315-130113.878000      249MiB |
|    0   N/A  N/A     96608      C   ...3/envs/sb3-jax/bin/python    19530MiB |
+-----------------------------------------------------------------------------+

pip list | grep nvidia

nvidia-cublas-cu11            11.11.3.6
nvidia-cublas-cu12            12.4.2.65
nvidia-cuda-cupti-cu11        11.8.87
nvidia-cuda-cupti-cu12        12.4.99
nvidia-cuda-nvcc-cu11         11.8.89
nvidia-cuda-nvcc-cu12         12.4.99
nvidia-cuda-nvrtc-cu11        11.8.89
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu11      11.8.89
nvidia-cuda-runtime-cu12      12.4.99
nvidia-cudnn-cu11             8.9.6.50
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu11             10.9.0.58
nvidia-cufft-cu12             11.2.0.44
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu11          11.4.1.48
nvidia-cusolver-cu12          11.6.0.99
nvidia-cusparse-cu11          11.7.5.86
nvidia-cusparse-cu12          12.3.0.142
nvidia-nccl-cu11              2.20.5
nvidia-nccl-cu12              2.19.3
nvidia-nvjitlink-cu12         12.4.99
nvidia-nvtx-cu12              12.1.105

and training script is:

from stable_baselines3.common.envs import MyCustomEnv
from sbx import PPO

env = MyCustomEnv()
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=int(1e8), progress_bar=True)

then I got:

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.

...

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 11.7     |
|    ep_rew_mean        | -34.5    |
| time/                 |          |
|    fps                | 2027     |
|    iterations         | 927      |
|    time_elapsed       | 936      |
|    total_timesteps    | 1898496  |
| train/                |          |
|    clip_range         | 0.2      |
|    explained_variance | 0.34     |
|    n_updates          | 9260     |
|    pg_loss            | -0.0721  |
|    value_loss         | 5.89     |
------------------------------------

However, if using cpu,

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.

...


------------------------------------
| rollout/              |          |
|    ep_len_mean        | 11.6     |
|    ep_rew_mean        | -35.2    |
| time/                 |          |
|    fps                | 4773     |
|    iterations         | 139      |
|    time_elapsed       | 59       |
|    total_timesteps    | 284672   |
| train/                |          |
|    clip_range         | 0.2      |
|    explained_variance | 0.281    |
|    n_updates          | 1380     |
|    pg_loss            | -0.0598  |
|    value_loss         | 4.52     |

JAX (GPU) was installed by following the official JAX installation instructions; the CPU version was installed with pip install jax.
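For what it's worth, one way to compare both backends without uninstalling the CUDA jaxlib is to force JAX onto the CPU for a run. A sketch (the environment variable must be set before JAX is imported; older JAX versions use JAX_PLATFORM_NAME instead of JAX_PLATFORMS):

import os

os.environ["JAX_PLATFORMS"] = "cpu"  # remove this line to use the GPU again

from sbx import PPO  # importing sbx pulls in jax, so set the variable first

# ... then build the env and model exactly as in the training script above.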

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)

[Question] Extending sbx algorithms (e.g via a callback)

Hi there,

I'm trying to experiment with "RL while learning Minmax penalty" (paper, code), and I thought I'd try adding it to an SBX DroQ setup. From the paper, the implementation looks quite straightforward, essentially:

for each step:
    penalty = minmaxpenalty.update(reward, Q[state])
    if info["unsafe"]:
        reward = penalty

hence I need to obtain the Q-value. I've been looking into the Droq code and I believe the Q-value is computed at (?)

next_target_quantiles = next_quantiles[:, :n_target_quantiles]
I've also been looking into implementing this via a Stable-Baselines3 callback, but can't seem to get it to work (I'm not sure whether this is a suitable use case?).
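
For reference, here is a rough sketch of the callback direction I've been trying. The `penalty_tracker` object stands in for the MinMaxPenalty class from the paper's code, the Q-value lookup is stubbed out, and whether rewards can be modified in place like this before they reach the replay buffer in SBX is exactly what I haven't been able to verify:

from stable_baselines3.common.callbacks import BaseCallback


class MinMaxPenaltyCallback(BaseCallback):
    """Replace the reward with a learned penalty on steps flagged as unsafe."""

    def __init__(self, penalty_tracker, verbose: int = 0):
        super().__init__(verbose)
        # Placeholder for the MinMaxPenalty object from the paper's code
        self.penalty_tracker = penalty_tracker

    def _on_step(self) -> bool:
        # collect_rollouts exposes the latest transition through self.locals;
        # the hope is that modifying `rewards` in place here takes effect
        # before the transition is stored (unverified for SBX).
        rewards = self.locals["rewards"]
        infos = self.locals["infos"]
        for i, info in enumerate(infos):
            q_value = 0.0  # TODO: query the DroQ/SAC Q-network for the current state
            penalty = self.penalty_tracker.update(rewards[i], q_value)
            if info.get("unsafe", False):
                rewards[i] = penalty
        return True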

Many thanks for any help, and for this fantastic lib! :)

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)

[Feature Request] Support Optax Optimizer Schedules

🚀 Feature

Support optax optimizer schedules (as argument for learning_rate).

Motivation

Learning rate scheduling can be essential to achieving good results when training agents.

Pitch

Stable-Baselines3 already supports learning rate scheduling, and it appears this could be achieved in sbx by allowing users to pass an optax optimizer schedule when specifying a model.

Alternatives

Unsure (None?)

Additional context

An example of what happens when trying to pass an optax optimizer schedule during construction of TQC.

Running

import optax
import gymnasium as gym

from sbx import TQC

env = gym.make("Pendulum-v1")

lr_schedule = optax.piecewise_constant_schedule(1e-3, boundaries_and_scales={5000: 0.1})

model = TQC("MlpPolicy", env, learning_rate=lr_schedule, verbose=1)
model.learn(total_timesteps=10_000, progress_bar=True)

Yields the following assertion error:

Traceback (most recent call last):
  File "/mnt/sb3/sbx_reprex.py", line 10, in <module>
    model = TQC("MlpPolicy", env, learning_rate=lr_schedule, verbose=1)
  File "/opt/conda/lib/python3.10/site-packages/sbx/tqc/tqc.py", line 111, in __init__
    self._setup_model()
  File "/opt/conda/lib/python3.10/site-packages/sbx/tqc/tqc.py", line 125, in _setup_model
    assert isinstance(self.qf_learning_rate, float)
AssertionError

As far as I can tell, this same behaviour is present in TD3, SAC, DroQ, DDPG.

Can the assertion that `qf_learning_rate` is a float (here for TQC) be removed/relaxed to support the use of optax schedules?
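
For what it's worth, a minimal sketch of why this looks feasible on the optax side: a schedule is just a callable mapping the optimizer step to a learning rate, and optax optimizers accept one wherever they accept a scalar. How sbx would thread it through `_setup_model` is of course up to the maintainers.

import optax

# An optax schedule is a callable from the step count to a learning rate.
lr_schedule = optax.piecewise_constant_schedule(1e-3, boundaries_and_scales={5000: 0.1})

print(lr_schedule(0))       # ~1e-3
print(lr_schedule(10_000))  # ~1e-4 (scaled by 0.1 after step 5000)

# Optax optimizers take a schedule wherever a scalar learning rate is accepted.
optimizer = optax.adam(learning_rate=lr_schedule)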

Checklist

  • I have checked that there is no similar issue in the repo (required)

Custom env with FrameStack wrapper causes invalid actions to be passed to `env.step`

🤖 Custom Gym Environment

Describe the bug

When using `gymnasium.wrappers.frame_stack.FrameStack` with a simple custom env, I get an exception when the chosen action is used to index the action list in `step`.

Code example

import itertools
from typing import Any, List, Tuple

import gymnasium as gym
import numpy as np
from gymnasium.spaces import Box, Discrete
from gymnasium.wrappers.frame_stack import FrameStack
from sbx import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.vec_env import DummyVecEnv


class MyEnv(gym.Env):
    def __init__(self) -> None:
        self.actions, self.action_space = self.actionSpace()
        self.observation_space = Box(0, 1, shape=(1,))

        super().__init__()

    def step(self, action: Any) -> Tuple[Any, float, bool, bool, dict]:
        chosenAction = self.actions[action]

        return self.obs(), 0.0, False, False, {}

    def reset(
        self, *, seed: int | None = None, options: dict | None = None
    ) -> Tuple[Any, dict]:
        super().reset(seed=seed, options=options)
        return self.obs(), {}

    def obs(self):
        return np.array([0.5], dtype=np.float32)

    def render(self) -> Any | List[Any] | None:
        pass

    def actionSpace(self):
        baseActions = [0, 1, 2, 3, 4]

        totalActionsWithRepeats = list(itertools.permutations(baseActions, 2))
        withoutRepeats = []

        for combination in totalActionsWithRepeats:
            reversedCombination = combination[::-1]
            if reversedCombination not in withoutRepeats:
                withoutRepeats.append(combination)

        filteredActions = [[action] for action in baseActions] + withoutRepeats

        return filteredActions, Discrete(len(filteredActions))


if __name__ == "__main__":
    env = MyEnv()
    check_env(env)

    env = FrameStack(env, 4)
    env = DummyVecEnv([lambda: env])

    algo = PPO("MlpPolicy", env)
    algo.learn(total_timesteps=1000)

Traceback (most recent call last):
  File "/home/user/sbx_ppo_repro.py", line 61, in <module>
    algo.learn(total_timesteps=1000)
  File "/home/user/jax-venv/lib/python3.10/site-packages/sbx/ppo/ppo.py", line 315, in learn
    return super().learn(
  File "/home/user/jax-venv/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 259, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/user/jax-venv/lib/python3.10/site-packages/sbx/common/on_policy_algorithm.py", line 152, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "/home/user/jax-venv/lib/python3.10/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 197, in step
    return self.step_wait()
  File "/home/user/jax-venv/lib/python3.10/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 58, in step_wait
    obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step(
  File "/home/user/jax-venv/lib/python3.10/site-packages/gymnasium/wrappers/frame_stack.py", line 179, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/user/sbx_ppo_repro.py", line 21, in step
    chosenAction = self.actions[action]
TypeError: only integer scalar arrays can be converted to a scalar index
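
For what it's worth, the indexing fails because the wrapped pipeline delivers the action as a NumPy array rather than a plain Python int. A hypothetical workaround (not necessarily the intended fix) is to coerce the action inside `step` before indexing:

    # Drop-in replacement for MyEnv.step above (np is already imported in the repro).
    def step(self, action: Any) -> Tuple[Any, float, bool, bool, dict]:
        # The wrappers may deliver the action as a 0-d or shape-(1,) NumPy array,
        # which cannot index a Python list directly; convert it to an int first.
        chosenAction = self.actions[int(np.asarray(action).item())]

        return self.obs(), 0.0, False, False, {}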

System Info

  • OS: Linux-6.5.6-76060506-generic-x86_64-with-glibc2.35 # 202310061235169739694522.04~9283e32 SMP PREEMPT_DYNAMIC Sun O
  • Python: 3.10.12
  • Stable-Baselines3: 2.1.0
  • PyTorch: 2.1.0+cu121
  • GPU Enabled: True
  • GPU Model: Nvidia RTX 3080ti
  • Numpy: 1.26.1
  • Cloudpickle: 3.0.0
  • Gymnasium: 0.29.1

sbx at the latest commit was installed using pip: pip install git+https://github.com/araffin/sbx

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
  • I have checked my env using the env checker (required)
  • I have provided a minimal working example to reproduce the bug (required)

[Feature Request] Multi-Discrete action spaces for PPO

🚀 Feature

Currently, PPO only supports Box and Discrete (`gymnasium.spaces.box.Box`, `gymnasium.spaces.discrete.Discrete`) action spaces. It would be awesome if it also supported MultiDiscrete action spaces.

Motivation

For many applications (Atari), one has to choose multiple discrete actions at each time step. Stable-Baselines3 already supports MultiDiscrete action spaces, and it would be great if sbx supported them as well.
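
For illustration, the space in question is just a fixed-length vector of independent discrete sub-actions; this snippet only shows the gymnasium side, nothing sbx-specific:

from gymnasium import spaces

# Two sub-actions are chosen per step: the first with 3 options, the second with 2.
action_space = spaces.MultiDiscrete([3, 2])

print(action_space.sample())  # e.g. array([1, 0])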

Checklist

  • I have checked that there is no similar issue in the repo (required)

[Question] Speedup compared to SB3

Question

One of the main features of JAX compared to PyTorch or TensorFlow is speed. Would it be possible to showcase the speedup obtained with SBX compared to SB3 on environments that are fully jitted and not fully jitted? It would give insight into the speedup and whether it is worth switching from one library to the other.
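
To be concrete, a rough sketch of the kind of comparison I mean (just wall-clock model.learn time with identical defaults on a single non-jitted env; nothing as rigorous as the OpenRL Benchmark reports):

import time

import gymnasium as gym

from sbx import PPO as SBXPPO
from stable_baselines3 import PPO as SB3PPO


def time_learn(algo_cls, steps: int = 50_000) -> float:
    env = gym.make("Pendulum-v1")
    model = algo_cls("MlpPolicy", env, verbose=0)
    start = time.perf_counter()
    model.learn(total_timesteps=steps)
    return time.perf_counter() - start


print(f"SBX PPO: {time_learn(SBXPPO):.1f}s")
print(f"SB3 PPO: {time_learn(SB3PPO):.1f}s")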

Thanks!

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
