
MetaRL-Literature-Review

  • Review of papers related to the topic: Meta Reinforcement Learning for sequential decision problems.
  • Each paper gets a 1-5 score for how confident I am in understanding it (5: I think I got all the key concepts; 1: I just skimmed through the paper).
  • Meta-RL: leverage varied experiences from previous tasks to adapt quickly to the new task at hand.

Problem statement and notation:

  • We assume a distribution of tasks p(𝒯), where each task is a Markov decision process (MDP).
  • 𝒯 = {p(s_0), p(s_{t+1} | s_t, a_t), r(s_t, a_t)}
  • p(s_0): initial state distribution
  • p(s_{t+1} | s_t, a_t): transition distribution
  • r(s_t, a_t): reward function
  • p(𝒯) encompasses tasks with varying transition functions (e.g., robots with different dynamics) and varying reward functions (e.g., navigating to different locations).
  • The transition and reward functions are unknown, but can be sampled by taking actions in the environment.
  • => Objective: maximize cumulative reward (or, equivalently, minimize regret); a minimal bandit instance of this setup is sketched below.
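For the bandit setting I focus on, the setup above can be made concrete with a small sketch (my own illustrative code, not taken from any of the reviewed papers): each sampled task 𝒯 is a Bernoulli bandit defined by its arm means, and regret is measured against always pulling the best arm of that task.

```python
import numpy as np

def sample_task(num_arms, rng):
    """Sample a task from p(T): here, a Bernoulli bandit defined by its arm means."""
    return rng.uniform(0.0, 1.0, size=num_arms)

def rollout(policy, arm_means, horizon, rng):
    """Run one policy on one task; return cumulative reward and (pseudo-)regret."""
    total_reward = 0.0
    for t in range(horizon):
        action = policy(t)
        total_reward += rng.binomial(1, arm_means[action])
    # Regret: reward of always pulling the best arm of this task minus the reward obtained.
    return total_reward, horizon * arm_means.max() - total_reward

rng = np.random.default_rng(0)
arm_means = sample_task(num_arms=2, rng=rng)
random_policy = lambda t: rng.integers(len(arm_means))
reward, regret = rollout(random_policy, arm_means, horizon=300, rng=rng)
print(f"random policy: reward={reward:.1f}, regret={regret:.1f}")
```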

Note: At the moment, I focus mostly on Meta-RL for the bandit problem.

Note2: Brief summary slides: https://docs.google.com/presentation/d/1W_qSnl3KwAoLb0yGoETh4grZNTcUocEr4Vlm8CNQh2g/edit?usp=sharing

Open questions:

  • Augmented DQN essentially teaches the model to do meta-exploration => How do we extend this to other problems and methods? => The MAESN paper mentions that following multiple (non-optimal) policies can inform the agent about the task structure (meta-exploration).
  • Why does eps = 0.1 improve performance? (Training with a nonzero eps is understandable, but it helps even at inference time.)
  • Understanding PEARL: investigate the prior that it estimates. Does it overlap with the (bandit) environment's actual prior? Can we improve upon this method? => PEARL extracts a latent context vector, which probably contains the arms' prior in the bandit problem (but I do not plan to investigate this in depth).
  • Gradient-based methods use gradient update(s) at test time to adapt quickly, while context-based methods only extract context information through an approximate inference network. Can we combine them to increase efficiency? (See the sketch after this list for the context-based side.)
  • When facing out-of-distribution tasks, gradient-based methods (MAESN, MAML) revert to plain policy gradient, while context-based methods will most likely fail. Can we exploit this characteristic to increase the robustness of PEARL?
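To make the context-based side of the last two questions concrete, here is a heavily simplified sketch of PEARL-style probabilistic context inference: an encoder maps the transitions observed so far to a Gaussian posterior over a latent task variable z, and a task-conditioned Q-network takes a sample of z as extra input. All layer sizes, names, and the mean-aggregation step are my own placeholders, not the paper's actual architecture (PEARL aggregates with a product of Gaussian factors and trains with SAC).

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Map a set of transitions (s, a, r, s') to a Gaussian posterior over a latent task variable z."""

    def __init__(self, transition_dim, latent_dim, hidden=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def forward(self, transitions):                 # (num_transitions, transition_dim)
        stats = self.net(transitions)               # per-transition Gaussian parameters
        mu, log_std = stats.split(self.latent_dim, dim=-1)
        # Aggregate over the context set (simple mean here; PEARL uses a product of Gaussian factors).
        return torch.distributions.Normal(mu.mean(0), log_std.mean(0).exp())

transition_dim, latent_dim, state_dim, num_actions = 6, 4, 3, 2
encoder = ContextEncoder(transition_dim, latent_dim)
q_net = nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                      nn.Linear(64, num_actions))

context = torch.randn(10, transition_dim)           # transitions observed so far in the current task
z = encoder(context).rsample()                       # sample the latent task belief
state = torch.randn(state_dim)
q_values = q_net(torch.cat([state, z]))              # task-conditioned action values
action = q_values.argmax().item()
print(action)
```

A gradient-based method would instead spend the same context data on one or a few policy-gradient updates, which is why it degrades toward plain policy gradient rather than failing outright when the task is out of distribution.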

Paper list:

Not classified yet:

Context-based:

Probabilistic latent context:

Deterministic latent context:


Gradient-based: learn from aggregated experience using:

Policy gradient:

Meta-learned loss functions:

Hyperparameters:


Others:


Less relevant:

Result comparison:

| Setup (N, K) | Gittins (optimal as N → ∞) | Random | RL2 | MAML | SNAIL | TS | OTS | Tuned-UCB | Eps-Greedy | Greedy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10, 5 | 6.6 | 5.0 | 6.7 | 6.5 | 6.6 | 5.7 | 6.5 | 6.7 | 6.6 | 6.6 |
| 10, 10 | 6.6 | 5.0 | 6.7 | 6.6 | 6.7 | 5.5 | 6.2 | 6.7 | 6.6 | 6.6 |
| 10, 50 | 6.5 | 5.1 | 6.8 | 6.6 | 6.7 | 5.2 | 5.5 | 6.6 | 6.5 | 6.5 |
| 100, 5 | 78.3 | 49.9 | 78.7 | 67.1 | 79.1 | 74.7 | 77.9 | 78.0 | 75.4 | 74.8 |
| 100, 10 | 82.8 | 49.9 | 83.5 | 70.1 | 83.5 | 76.7 | 81.4 | 82.4 | 77.4 | 77.1 |
| 100, 50 | 85.2 | 49.8 | 84.9 | 70.3 | 85.1 | 64.5 | 67.7 | 84.3 | 78.3 | 78.0 |
| 500, 5 | 405.8 | 249.8 | 401.5 | - | 408.1 | 402.0 | 406.7 | 405.8 | 388.2 | 380.6 |
| 500, 10 | 437.8 | 249.0 | 432.5 | - | 432.4 | 429.5 | 438.9 | 437.1 | 408.0 | 395.0 |
| 500, 50 | 463.7 | 249.6 | 438.9 | - | 442.6 | 427.2 | 437.6 | 457.6 | 413.6 | 402.8 |
| 1000, 50 | 944.1 | 499.8 | 847.43 | - | 889.8 | - | - | - | - | - |

Table 1: Results on multi-armed bandit problems. N = horizon, K = number of arms. OTS: Optimistic Thompson Sampling. Greedy: always picks the arm with the best empirical mean.
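For reference, the classical baselines in Table 1 fit in a few lines each. A minimal sketch for Bernoulli arms, assuming a Beta(1, 1) prior for Thompson Sampling and the standard UCB1 bonus (the tuned variants reported in the table may use different constants):

```python
import numpy as np

def thompson_sampling(arm_means, horizon, rng):
    """Beta-Bernoulli Thompson Sampling: sample a mean for each arm, pull the argmax."""
    successes = np.ones(len(arm_means))   # Beta(1, 1) prior
    failures = np.ones(len(arm_means))
    total = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(successes, failures)))
        reward = rng.binomial(1, arm_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward
    return total

def ucb1(arm_means, horizon, rng):
    """UCB1: pull the arm with the highest empirical mean plus an exploration bonus."""
    counts = np.zeros(len(arm_means))
    sums = np.zeros(len(arm_means))
    total = 0.0
    for t in range(horizon):
        if t < len(arm_means):             # pull each arm once first
            arm = t
        else:
            bonus = np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        reward = rng.binomial(1, arm_means[arm])
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

rng = np.random.default_rng(0)
arm_means = rng.uniform(size=5)
print(thompson_sampling(arm_means, horizon=100, rng=rng),
      ucb1(arm_means, horizon=100, rng=rng))
```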



| N | Random | Eps-Greedy | PSRL | OPSRL | UCRL2 | RL2 | MAML | SNAIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 0.482 | 0.640 | 0.665 | 0.694 | 0.706 | 0.752 | 0.563 | 0.766 |
| 25 | 0.482 | 0.727 | 0.788 | 0.819 | 0.817 | 0.859 | 0.591 | 0.862 |
| 50 | 0.481 | 0.793 | 0.871 | 0.897 | 0.885 | 0.902 | - | 0.908 |
| 75 | 0.482 | 0.831 | 0.910 | 0.931 | 0.917 | 0.918 | - | 0.930 |
| 100 | 0.481 | 0.857 | 0.934 | 0.951 | 0.936 | 0.922 | - | 0.941 |

Table 2: Results on tabular MDPs. See the SNAIL paper for the original source.
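The PSRL row in Table 2 is posterior sampling over the MDP itself: sample a model from the posterior, plan in it, act greedily for one episode, then update the posterior. A rough sketch under simplifying assumptions (Dirichlet posteriors over transitions, Beta posteriors over Bernoulli rewards, finite-horizon value iteration); the implementations behind the table may differ in priors and planning details:

```python
import numpy as np

def psrl_episode(counts, reward_a, reward_b, horizon, true_P, true_R, rng):
    """One PSRL episode: sample an MDP from the posterior, plan in it, act greedily in the real MDP."""
    S, A = reward_a.shape
    # Sample a model: Dirichlet posterior over transitions, Beta posterior over Bernoulli rewards.
    P = np.array([[rng.dirichlet(counts[s, a] + 1) for a in range(A)] for s in range(S)])
    R = rng.beta(reward_a, reward_b)
    # Finite-horizon value iteration on the sampled model.
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P @ V                        # (S, A) action values at step h
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    # Act in the true MDP and update the posterior.
    s = 0
    for h in range(horizon):
        a = policy[h, s]
        r = rng.binomial(1, true_R[s, a])
        s_next = rng.choice(S, p=true_P[s, a])
        counts[s, a, s_next] += 1
        reward_a[s, a] += r
        reward_b[s, a] += 1 - r
        s = s_next

rng = np.random.default_rng(0)
S, A = 4, 2
true_P = rng.dirichlet(np.ones(S), size=(S, A))      # random ground-truth MDP for illustration
true_R = rng.uniform(size=(S, A))
counts = np.zeros((S, A, S))
reward_a, reward_b = np.ones((S, A)), np.ones((S, A))
for _ in range(20):
    psrl_episode(counts, reward_a, reward_b, horizon=10, true_P=true_P, true_R=true_R, rng=rng)
```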

My experiment results:

| Method | Regret |
| --- | --- |
| Random | 50.1715 +/- 36.0777 |
| Thompson Sampling | 3.5319 +/- 8.0465 |
| Un-tuned UCB | 10.2620 +/- 8.2752 |
| Finite Difference | ~ random |
| A2C | ~ random |
| DQN (eps=0) | ~ random |
| DQN (eps=0.1), replay memory > 100 trajectories | ~14-16 |
| DQN (eps=0.1), replay memory ~12 trajectories | 7.6187 +/- 9.9622 |
| Augmented DQN (eps=0.1) | 8.5549 +/- 11.5184 |
| Augmented DQN (eps=0) | 9.8793 +/- 28.7358 |

Table 3: Results on multi-armed bandit problems. Horizon = 300, number of arms = 2, gamma = 0.9.

NOTE:

  • There is some bias in these numbers: the DQN (~12-trajectory replay memory) method received more trials than the others.
  • Augmented DQN: training samples are augmented with known good latent features (average reward, number_of_chosen**-0.5, current timestep); see the sketch after this list.
  • Augmented DQN required on average ~7 times less data to converge than vanilla DQN.
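A minimal sketch of the feature augmentation described above (my own reconstruction; the exact feature set and normalization used in the experiments may differ slightly):

```python
import numpy as np

def augmented_state(rewards_per_arm, pulls_per_arm, t, horizon):
    """Hand-crafted features appended to the DQN input for a bandit task.

    Per arm: empirical mean reward and a UCB-style 1/sqrt(count) term,
    plus the normalized current timestep.
    """
    pulls = np.maximum(pulls_per_arm, 1)     # avoid division by zero before the first pull
    mean_reward = rewards_per_arm / pulls
    count_bonus = pulls ** -0.5               # the number_of_chosen**-0.5 term from the note above
    return np.concatenate([mean_reward, count_bonus, [t / horizon]])

# Example: 2-arm bandit at timestep 42 of a 300-step episode.
features = augmented_state(np.array([3.0, 10.0]), np.array([5, 20]), t=42, horizon=300)
print(features)   # [0.6, 0.5, 0.447..., 0.223..., 0.14]
```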
