- Deep Q-Learning (DQL)
- Double Deep Q-Learning (DDQN)
- Stochastic Actor-Critic (AC)
- Soft Actor-Critic (SAC)
- Advantage Actor-Critic (A2C)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Proximal Policy Optimization (PPO)
This repository contains implementations of the RL algorithms listed above for both continuous and discrete environments. The code follows the original papers as closely as possible, with supplementary enhancements adapted from other repositories. The table below summarizes key attributes of the implemented algorithms: 'AV' stands for action-value, 'SV' for state-value, 'Dt' for deterministic, and 'St' for stochastic.
 | DDPG | TD3 | A2C | SAC | PPO |
---|---|---|---|---|---|
Topology | AV | AV | SV | SV+AV | SV |
Action | Dt | Dt | St | St | St |
Replay Buffer | ✓ | ✓ | ☒ | ✓ | ☒ |
Policy | Off | Off | On | Off | On |
Advantage Func. | ☒ | ☒ | ✓ | entropy-based | ✓ |
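The off-policy algorithms in the table (DDPG, TD3, SAC) rely on a replay buffer that stores past transitions and samples them uniformly for updates. A minimal sketch of such a buffer is shown below; the class and method names are illustrative, not taken from this repository.

```python
import random
from collections import deque


class ReplayBuffer:
    """Uniform-sampling replay buffer (illustrative sketch)."""

    def __init__(self, capacity):
        # deque evicts the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # store one environment transition
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # draw a uniform random minibatch and transpose it into
        # per-component tuples (states, actions, rewards, ...)
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

On-policy methods such as A2C and PPO instead discard their rollouts after each update, which is why the "Replay Buffer" row is unchecked for them.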
Algorithm | Component | Equation |
---|---|---|
SAC | Objective Function | $J(\pi)=\sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[r(s_t,a_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big]$ |
 | Critic Update | $J_Q(\theta)=\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s,a)-\big(r+\gamma\,\mathbb{E}_{a'\sim\pi}\big[\min_{i=1,2} Q_{\bar\theta_i}(s',a')-\alpha\log\pi(a'\mid s')\big]\big)\big)^2\Big]$ |
 | Actor Update | $J_\pi(\phi)=\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\phi}\big[\alpha\log\pi_\phi(a\mid s)-Q_\theta(s,a)\big]$ |
 | Temperature Update | $J(\alpha)=\mathbb{E}_{a\sim\pi}\big[-\alpha\log\pi(a\mid s)-\alpha\,\bar{\mathcal{H}}\big]$; adjusts $\alpha$ so the policy entropy tracks the target $\bar{\mathcal{H}}$ |
A2C | Objective Function | $J(\theta)=\mathbb{E}\big[\log\pi_\theta(a_t\mid s_t)\,A(s_t,a_t)\big]$ |
 | Policy (Actor) Update | $\nabla_\theta J=\mathbb{E}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A(s_t,a_t)\big]$, with $A(s_t,a_t)=r_t+\gamma V_w(s_{t+1})-V_w(s_t)$ |
 | Value Function (Critic) Update | $L(w)=\mathbb{E}\big[(R_t-V_w(s_t))^2\big]$ |
PPO | Objective Function | $L^{CLIP}(\theta)=\mathbb{E}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\big)\big]$, with $r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}$ |
 | Policy Update | Gradient ascent on $L^{CLIP}(\theta)$, performed over several epochs on each batch of rollouts |
 | Value Function Update | The value function is updated by minimizing the mean squared error between the predicted value and the computed returns. |
TD3 | Objective Function | Uses the Q-function's objective, minimized over two Q-functions to mitigate overestimation. |
 | Critic Update | For each of the two Q-functions: $L(\theta_i)=\mathbb{E}\big[(Q_{\theta_i}(s,a)-y)^2\big]$, with $y=r+\gamma\min_{i=1,2}Q_{\theta'_i}(s',\tilde{a})$ and $\tilde{a}=\mu_{\phi'}(s')+\mathrm{clip}(\epsilon,-c,c)$ |
 | Actor Update | $\nabla_\phi J=\mathbb{E}\big[\nabla_a Q_{\theta_1}(s,a)\big|_{a=\mu_\phi(s)}\nabla_\phi\mu_\phi(s)\big]$, applied with delay every $d$ critic updates |
DDPG | Objective Function | $J(\phi)=\mathbb{E}_{s\sim\mathcal{D}}\big[Q_\theta(s,\mu_\phi(s))\big]$ |
 | Critic Update | $L(\theta)=\mathbb{E}\big[\big(Q_\theta(s,a)-(r+\gamma\,Q_{\theta'}(s',\mu_{\phi'}(s')))\big)^2\big]$ |
 | Actor Update | $\nabla_\phi J=\mathbb{E}\big[\nabla_a Q_\theta(s,a)\big|_{a=\mu_\phi(s)}\nabla_\phi\mu_\phi(s)\big]$ |
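As an illustration of the equations above, PPO's clipped surrogate objective $L^{CLIP}$ can be computed in a few lines. The sketch below uses NumPy and illustrative names (it is not code from this repository); it takes log-probabilities under the new and old policies plus advantage estimates, and returns the pessimistic clipped objective averaged over the batch.

```python
import numpy as np


def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP (illustrative sketch)."""
    # probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s)
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    # clip the ratio to [1 - eps, 1 + eps] before weighting the advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # take the pessimistic (element-wise minimum) bound, averaged over the batch
    return np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies coincide, the ratio is 1 everywhere and the objective reduces to the mean advantage; once the ratio drifts outside $[1-\epsilon,\,1+\epsilon]$, clipping caps the incentive to move further, which is what keeps PPO's updates proximal.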