

Welcome to RL playground!

About RL Playground

This playground records some results and tips from my experiments.

The name "Reinforcement Learning" is quite confusing. I see "learning" as the infrastructure, which does not necessarily contain a neural network.

  • Deep learning is the simple case.
  • Reinforcement learning is the complex case, which combines learning with an HMM.
  • Self-supervised learning uses a multi-task trick to enhance the model's ability.

Based on this view, my implementation consists of several modules:

  1. Main: "Break_out_TD_A2C". The most important part. It defines how the model is trained; the training procedure is exactly the "algorithm" of reinforcement learning.

  2. Model: "RL_model". The neural-network implementation. It can be as simple as a few CNN layers, or contain a more complex curiosity-driven unit (a minimal sketch follows this list).

  3. Module: utilities such as "octave_module". Handy helper implementations live here.
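
To make the split concrete, here is a minimal sketch of what the Model module can look like: a small CNN torso shared by a policy head and a value head. It is written with TensorFlow/Keras for illustration only; the class name and layer sizes are assumptions, not the actual RL_model code.

```python
import tensorflow as tf

class SimpleA2CModel(tf.keras.Model):
    """Illustrative actor-critic network (not the real RL_model):
    a CNN torso shared by a policy head (action logits) and a value head."""

    def __init__(self, num_actions):
        super().__init__()
        self.torso = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
            tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
            tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
        ])
        self.policy_head = tf.keras.layers.Dense(num_actions)  # action logits
        self.value_head = tf.keras.layers.Dense(1)              # V(s)

    def call(self, observations):
        features = self.torso(observations)
        return self.policy_head(features), self.value_head(features)
```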

Implementation Detail

The hardest part is computing the reward correctly. The design must handle the reward, the timing of training updates, and the end of an episode properly. The second hardest part is the policy-gradient loss, which is very different from the losses we use every day.
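
As an illustration of the first point, the sketch below (plain NumPy with hypothetical names, not the repository's actual code) shows one common way to compute bootstrapped discounted returns while masking out episode ends, so that no reward leaks across the game-over boundary:

```python
import numpy as np

def compute_returns(rewards, dones, last_value, gamma=0.99):
    """Discounted returns bootstrapped from the critic's last value.
    rewards, dones: arrays of length T collected from one rollout.
    last_value: V(s_T) predicted by the critic for the state after the rollout."""
    returns = np.zeros_like(rewards, dtype=np.float32)
    running = last_value
    for t in reversed(range(len(rewards))):
        # When the episode ended at step t, drop the bootstrap term so the
        # return does not carry reward across the game-over boundary.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns
```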

Here is an explanation of my implementation, as an image:

(figure: implementation overview diagram)

Currently, I am trying a curiosity model to accelerate the training process:

(figure: curiosity-driven model diagram)
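
A common curiosity formulation (ICM-style intrinsic motivation) rewards the agent for reaching states that a learned forward model predicts poorly. The sketch below illustrates that idea with TensorFlow/Keras; the class name and layer sizes are assumptions, not the exact curiosity unit in RL_model. The intrinsic reward is typically scaled by a coefficient and added to the environment reward.

```python
import tensorflow as tf

class ForwardModelCuriosity(tf.keras.Model):
    """Illustrative intrinsic reward: the prediction error of a forward
    model (phi(s_t), a_t) -> phi(s_{t+1}) serves as a curiosity bonus."""

    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(feature_dim, activation="relu"),
        ])
        self.forward_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(feature_dim),
        ])
        self.num_actions = num_actions

    def intrinsic_reward(self, obs, action, next_obs):
        phi = self.encoder(obs)
        phi_next = self.encoder(next_obs)
        action_one_hot = tf.one_hot(action, self.num_actions)
        predicted_next = self.forward_model(
            tf.concat([phi, action_one_hot], axis=-1))
        # Larger prediction error -> more "surprising" state -> larger bonus.
        # The target features are stop-gradiented so the bonus does not
        # collapse the encoder to a trivial representation.
        return tf.reduce_mean(
            tf.square(predicted_next - tf.stop_gradient(phi_next)), axis=-1)
```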

However, even if all of the above is implemented correctly, I still cannot promise it will work perfectly in every case. Here are several tips for dealing with a broken model:

  1. If the action probabilities eventually get stuck at the same values for every frame: decrease the learning rate.

  2. How to know whether all hyperparameters are set properly: observe how the action probabilities change frame by frame.

  3. How to know whether it is converging: the episode reward should go up, but the process is extremely slow. It took me a month to train, so if it does not seem to converge for days, be patient.

  4. The actor and critic losses by themselves tell you little.

  5. The advantage should be sometimes positive and sometimes negative.

  6. Every term in the loss function should be given a proper coefficient. Watch carefully how each loss changes, and adjust the coefficients. In my experience, trial and error is the only way (see the sketch after this list).
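
For reference, here is a minimal sketch of how those weighted terms and the advantage typically combine in an A2C loss. The coefficient values and function signature are placeholders, not the ones used in this repository:

```python
import tensorflow as tf

def a2c_loss(logits, values, actions, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Weighted A2C loss: policy gradient + value regression - entropy bonus.
    logits: (T, num_actions), values: (T,), actions: (T,), returns: (T,)."""
    # Advantage: should swing between positive and negative (tip 5).
    advantages = returns - values
    log_probs = tf.nn.log_softmax(logits)
    chosen_log_probs = tf.gather(log_probs, actions, batch_dims=1)

    # Policy-gradient term: stop the gradient through the advantage.
    policy_loss = -tf.reduce_mean(
        chosen_log_probs * tf.stop_gradient(advantages))
    # Critic term: regress V(s) toward the computed return.
    value_loss = tf.reduce_mean(tf.square(advantages))
    # Entropy bonus keeps the action probabilities from collapsing (tip 1).
    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_mean(tf.reduce_sum(probs * log_probs, axis=-1))

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```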

With all the tips and a correct implementation, it took me a month to train. The result is not particularly good, but the agent did indeed learn how to catch the ball.

Experiment Result

Here is the result of the A2C model:

(figure: A2C training result)
