Here you can find several projects dedicated to Deep Reinforcement Learning methods.
These projects were developed as part of the Udacity Deep Reinforcement Learning Nanodegree program.
The projects cover the main Deep Reinforcement Learning architectures:
Value-Based Methods and the Bellman Equation,
Policy-Based Methods,
Policy-Gradient Methods, and
Actor-Critic Methods.
- Monte-Carlo Methods
In Monte Carlo (MC), we play episodes of the game until we reach the end, collect the rewards along the way,
and move backward from the end of the episode to its start. We repeat this a sufficient number of times and
average the value of each state (see the first-visit MC sketch after this list).
- Temporal Difference Methods and Q-learning
- Reinforcement Learning in Continuous Space (Deep Q-Network)
- Function Approximation and Neural Networks
The Universal Approximation Theorem (UAT) states that a feed-forward neural network with a
single hidden layer containing a finite number of nodes can approximate any continuous function,
provided rather mild assumptions about the form of the activation function are satisfied
(see the curve-fitting sketch after this list).
- Policy-Based Methods, Hill Climbing, Simulated Annealing
Random-restart hill climbing is a surprisingly effective algorithm in many cases. Simulated annealing is a useful
probabilistic technique because, by occasionally accepting worse candidates, it avoids mistaking a local extremum
for a global one (see the annealing sketch after this list).
- Policy-Gradient Methods, REINFORCE, PPO
We define a performance measure J(\theta) to maximize and learn the policy parameters \theta through
approximate gradient ascent (see the REINFORCE sketch after this list).
- Actor-Critic Methods, A3C, A2C, DDPG, SAC
The key difference between A3C and A2C is the asynchronous part. A3C consists of multiple independent
agents (networks) with their own weights, which interact with different copies of the environment in
parallel. Thus, they can explore a bigger part of the state-action space in much less time
(the loss both variants share is sketched after this list).
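A minimal sketch of the first-visit Monte-Carlo idea above, assuming a classic Gym-style API (`env.reset()` returns a state, `env.step()` returns `(state, reward, done, info)`) and a given `policy` function; all names are illustrative:

```python
from collections import defaultdict

def mc_state_values(env, policy, num_episodes=10_000, gamma=1.0):
    """First-visit Monte-Carlo estimate of V(s) under a fixed policy.
    States must be hashable (e.g., Gridworld cells)."""
    returns = defaultdict(list)                  # state -> observed returns
    for _ in range(num_episodes):
        # Play one episode to the end, recording (state, reward) pairs.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Move backward through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if s not in (e[0] for e in episode[:t]):   # first visit only
                returns[s].append(G)
    # Average the collected returns to estimate each state's value.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```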
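To make the UAT point concrete, a sketch of a single-hidden-layer network fitting a continuous function (sin here) in PyTorch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)   # inputs
y = torch.sin(x)                                   # continuous target function

# One hidden layer with a finite number of nodes, as in the UAT statement.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.6f}")   # approaches zero as the fit improves
```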
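A generic simulated-annealing sketch for maximizing a black-box score function; in the policy-based setting, `x` would be a policy's weight vector and `f` the episode return (names are illustrative, and `f` is assumed deterministic):

```python
import math
import random

def simulated_annealing(f, x0, neighbor, t0=1.0, cooling=0.995, steps=5_000):
    """Maximize f by occasionally accepting downhill moves; the acceptance
    probability exp(delta / t) is what keeps the search from locking onto
    a local extremum early."""
    x, best, t = x0, x0, t0
    for _ in range(steps):
        cand = neighbor(x)                 # random perturbation of x
        delta = f(cand) - f(x)
        if delta > 0 or random.random() < math.exp(delta / t):
            x = cand
            if f(x) > f(best):
                best = x
        t *= cooling                       # cool the temperature schedule
    return best
```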
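The gradient-ascent step in REINFORCE amounts to maximizing \sum_t G_t \log \pi_\theta(a_t|s_t); below is a one-episode PyTorch sketch, assuming a Gym-style environment and a `policy` network that outputs a probability vector over discrete actions (names are illustrative):

```python
import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    """One episode of REINFORCE: collect a trajectory, then ascend J(theta)."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # Discounted return G_t for every time step, computed backward.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)
    # Gradient *ascent* on J(theta) == gradient descent on its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```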
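A2C and A3C optimize the same per-step objective; the asynchrony in A3C only changes where the gradients are computed. A sketch of the shared loss, assuming one tensor entry per transition (names are illustrative):

```python
import torch

def actor_critic_loss(log_probs, values, returns,
                      entropies=None, value_coef=0.5, entropy_coef=0.01):
    """Shared A2C/A3C loss for a batch of transitions."""
    # Advantage A = G - V(s); detached so the policy term does not
    # backpropagate into the critic.
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()    # actor: ascend J(theta)
    value_loss = (returns - values).pow(2).mean()     # critic: regress to returns
    loss = policy_loss + value_coef * value_loss
    if entropies is not None:
        loss = loss - entropy_coef * entropies.mean() # entropy bonus for exploration
    return loss
```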
- CartPole, Policy-Based Methods, Hill Climbing
- CartPole, Policy-Gradient Methods, REINFORCE
- Markov Decision Process, Monte-Carlo, Gridworld 6x6
- Pong, Policy-Gradient Methods, PPO
- Pong, Policy-Gradient Methods, REINFORCE
- Project 1: Navigation, Deep Q-Network, ReplayBuffer
- Project 2: Continuous Control-Reacher, DDPG, environment Reacher (Double-Jointed Arm)
- Project 2: Continuous Control-Crawler, PPO, environment Crawler
- Project 3: Collaboration_Competition-Tennis, Multi-agent DDPG, environment Tennis
- BipedalWalker, Twin Delayed DDPG (TD3)
- BipedalWalker, PPO, Vectorized Environment
- BipedalWalker, Soft Actor-Critic (SAC)
- BipedalWalker, A2C, Vectorized Environment
- CarRacing with PPO, Learning from Raw Pixels
- Pong, 8 parallel agents
- CarRacing, Single agent, Learning from pixels
- Crawler, 12 parallel agents
- BipedalWalker, 16 parallel agents
- on Policy-Gradient Methods, see 1, 2, 3.
- on REINFORCE, see 1, 2, 3.
- on PPO, see 1, 2, 3, 4, 5.
- on DDPG, see 1, 2.
- on Actor-Critic Methods and A3C, see 1, 2, 3, 4.
- on TD3, see 1, 2, 3.
- on SAC, see 1, 2, 3, 4, 5.
- on A2C, see 1, 2, 3, 4, 5.
How does the Bellman equation work in Deep Reinforcement Learning?
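For reference, the Bellman optimality equation that value-based methods such as DQN approximate (here for the action-value function):

```latex
Q^*(s, a) \;=\; \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]
```

DQN turns this fixed-point condition into a regression target, r + \gamma \max_{a'} Q(s', a'; \theta^-), for the network's prediction Q(s, a; \theta).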