yrlu / irl-imitation

Implementation of Inverse Reinforcement Learning (IRL) algorithms in Python/Tensorflow. Deep MaxEnt, MaxEnt, LPIRL

irl inverse-reinforcement-learning imitation imitation-learning ml machine-learning rl reinforcement-learning lfd learning-from-demonstration tensorflow

irl-imitation's Introduction

irl-imitation

Implementation of selected Inverse Reinforcement Learning (IRL) algorithms in Python/Tensorflow.

$ python demo.py

Implemented Algorithms

Linear inverse reinforcement learning (Ng & Russell, 2000)
Maximum entropy inverse reinforcement learning (Ziebart et al., 2008)
Maximum entropy deep inverse reinforcement learning (Wulfmeier et al., 2015)

Implemented MDPs & Solver

2D gridworld
1D gridworld
Value iteration

If you use this software in your publications, please cite it using the following BibTeX entry:

@misc{lu2017irl-imitation,
  author = {Lu, Yiren},
  doi = {10.5281/zenodo.6796157},
  month = {7},
  title = {{Implementations of inverse reinforcement learning algorithms in Python/Tensorflow}},
  url = {https://github.com/yrlu/irl-imitation},
  year = {2017}
}

Dependencies

Python 2.7
cvxopt
TensorFlow 0.12.1
matplotlib

Linear Inverse Reinforcement Learning

Follows Algorithm 1 of Ng & Russell (2000), Algorithms for Inverse Reinforcement Learning.

$ python linear_irl_gridworld.py --act_random=0.3 --gamma=0.5 --l1=10 --r_max=10

Maximum Entropy Inverse Reinforcement Learning

(This implementation is largely influenced by Matthew Alger's maxent implementation)

Follows Ziebart et al. (2008), Maximum Entropy Inverse Reinforcement Learning.

For option descriptions, run:

$ python maxent_irl_gridworld.py --help

$ python maxent_irl_gridworld.py --height=10 --width=10 --gamma=0.8 --n_trajs=100 --l_traj=50 --no-rand_start --learning_rate=0.01 --n_iters=20

$ python maxent_irl_gridworld.py --gamma=0.8 --n_trajs=400 --l_traj=50 --rand_start --learning_rate=0.01 --n_iters=20

Maximum Entropy Deep Inverse Reinforcement Learning

Follows Wulfmeier et al. (2015), Maximum Entropy Deep Inverse Reinforcement Learning. The fully connected (FC) version is implemented. The implementation does not follow the model proposed in the paper exactly: some tweaks are applied, including ELU activations, gradient clipping, and L2 regularization.

For option descriptions, run:

$ python deep_maxent_irl_gridworld.py --help

$ python deep_maxent_irl_gridworld.py --learning_rate=0.02 --n_trajs=200 --n_iters=20

MIT License

irl-imitation's Issues

Possible bug: state visitation frequency

Hey there,

I am not 100% sure, but I feel like there is something wrong with how the state visitation frequency is calculated (https://github.com/stormmax/irl-imitation/blob/master/deep_maxent_irl.py#L93).

You iterate over all the states in the outer loop and, for each state, calculate the frequency for every timestep:

for s in range(N_STATES):
    for t in range(T-1):
        if deterministic:
            mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
        else:
            mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])

In my opinion the loops should be switched:

for t in range(T-1):
    for s in range(N_STATES):
        if deterministic:
            mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
        else:
            mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])

This is because the visitation frequency at timestep t+1 depends on the frequencies of all states at timestep t, which also matches the formula in the original MaxEnt paper (Ziebart et al., 2008).
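Written out in the notation of the code above (with mu[s, t] as \mu_t(s), policy[s, a] as \pi(a \mid s), and P_a[pre_s, s, a] as P(s \mid s', a)), the stochastic-policy branch computes the recurrence below. This is a transcription of the code, not a quotation of the paper's equation:

```latex
\mu_{t+1}(s) \;=\; \sum_{s'} \sum_{a} \mu_t(s')\, \pi(a \mid s')\, P(s \mid s', a)
```

Every term on the right-hand side is indexed by t, so the whole column \mu_t has to be known before any entry of \mu_{t+1} is computed.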

Unfortunately, if I swap the loop headers, the reward is no longer recovered correctly. Do you have any hints on this?
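To make the effect of the loop order concrete, here is a minimal, self-contained sketch in plain Python. It is not the repository's code: the function name state_visitation, the time_outer flag, and the two-state example are assumptions made for illustration, and nested lists stand in for the arrays used in the snippets above.

```python
def state_visitation(P_a, policy, mu0, T, time_outer=True):
    """Forward pass for expected state-visitation frequencies.

    P_a[pre_s][s][a] -- probability of moving pre_s -> s under action a
    policy[s][a]     -- probability of taking action a in state s
    mu0[s]           -- initial state distribution
    Returns mu with mu[s][t] = probability of being in state s at step t.
    """
    n_states, n_actions = len(P_a), len(policy[0])
    mu = [[0.0] * T for _ in range(n_states)]
    for s in range(n_states):
        mu[s][0] = mu0[s]

    def update(s, t):
        mu[s][t + 1] = sum(
            mu[pre_s][t] * P_a[pre_s][s][a] * policy[pre_s][a]
            for pre_s in range(n_states)
            for a in range(n_actions)
        )

    if time_outer:
        # Column t is complete before any entry of column t+1 is computed.
        for t in range(T - 1):
            for s in range(n_states):
                update(s, t)
    else:
        # State-outer order: for t >= 1, column t still holds zeros for
        # states that have not been processed yet.
        for s in range(n_states):
            for t in range(T - 1):
                update(s, t)
    return mu


# Hypothetical example: two states that swap deterministically under one action.
P_a = [[[0.0], [1.0]],   # from state 0: always go to state 1
       [[1.0], [0.0]]]   # from state 1: always go to state 0
policy = [[1.0], [1.0]]
mu_time = state_visitation(P_a, policy, [1.0, 0.0], T=3, time_outer=True)
mu_state = state_visitation(P_a, policy, [1.0, 0.0], T=3, time_outer=False)
```

In this example the time-outer result keeps every column summing to 1, while the state-outer result loses all probability mass in the last column. That illustrates the ordering issue only; it does not explain why reward recovery degrades after the swap.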

Possible bug: value iteration

Hey there,

I found another issue, this time in value iteration. For the definition, see: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
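For reference, the standard value iteration update in Sutton and Barto's notation is shown first. The second line is an assumption added here for comparison with the code: it is what the update reduces to when the reward depends only on the successor state, r = R(s').

```latex
v_{k+1}(s) \;=\; \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr]

v_{k+1}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s') + \gamma\, v_k(s') \,\bigr]
```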

Your code:

for s in range(N_STATES):
    v_s = []
    values[s] = max([sum([P_a[s, s1, a]*(rewards[s] + gamma*values_tmp[s1]) for s1 in range(N_STATES)]) for a in range(N_ACTIONS)])

https://github.com/stormmax/irl-imitation/blob/master/mdp/value_iteration.py#L42

So you are adding the reward of the current state s to the discounted value of the next state s1. As I understand the formula, you should be doing:

for s in range(N_STATES):
    v_s = []
    values[s] = max([sum([P_a[s, s1, a]*(rewards[s1] + gamma*values_tmp[s1]) for s1 in range(N_STATES)]) for a in range(N_ACTIONS)])
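The two variants can be compared side by side with a small sketch. This is plain Python rather than the repository's implementation: the function value_iteration, its reward_on_next_state flag, and the two-state MDP are hypothetical, and the sketch assumes gamma < 1 so the iteration converges.

```python
def value_iteration(P_a, rewards, gamma, reward_on_next_state,
                    tol=1e-8, max_iters=10000):
    """Synchronous value iteration with a state-based reward.

    P_a[s][s1][a] -- probability of moving s -> s1 under action a
    rewards[s]    -- reward attached to state s
    reward_on_next_state=True  uses rewards[s1] (the variant proposed above)
    reward_on_next_state=False uses rewards[s]  (the variant quoted as "Your code")
    """
    n_states, n_actions = len(P_a), len(P_a[0][0])
    values = [0.0] * n_states
    for _ in range(max_iters):
        prev = list(values)  # values from the previous sweep
        for s in range(n_states):
            values[s] = max(
                sum(
                    P_a[s][s1][a]
                    * ((rewards[s1] if reward_on_next_state else rewards[s])
                       + gamma * prev[s1])
                    for s1 in range(n_states)
                )
                for a in range(n_actions)
            )
        # Stop once the largest change in a sweep falls below tol.
        if max(abs(v - p) for v, p in zip(values, prev)) < tol:
            break
    return values


# Hypothetical MDP: in state 0, action 0 stays and action 1 moves to state 1;
# state 1 is absorbing and carries reward 1.
P_a = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.0, 0.0], [1.0, 1.0]]]
rewards = [0.0, 1.0]
v_next = value_iteration(P_a, rewards, 0.5, reward_on_next_state=True)
v_cur = value_iteration(P_a, rewards, 0.5, reward_on_next_state=False)
```

With gamma = 0.5, the next-state variant converges to roughly [2, 2] and the current-state variant to roughly [1, 2]. Both reward conventions appear in the literature, so the sketch shows only that they yield different values, not which one the rest of the code base expects.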

LPIRL: Redundant Constraints

Hi! Thank you for this great reference implementation - it is very helpful.

I was going over the LPIRL implementation and I think you have some redundant constraints in your LP matrices - see line 59 in lp_irl.py - this loop does the same thing as the previous loop on line 55, resulting in a redundant set of constraints.

Thanks again,

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

\irl-imitation\mdp\gridworld.py", line 151, in get_transition_states_and_probs
nei_s[1] < 0 or nei_s[1] >= self.width or self.grid[nei_s[0]][nei_s[1]] == 'x':
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
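NumPy raises this message when an array is indexed with something that is not an integer, such as a float. One way this can happen, assuming the code is run under Python 3 rather than the Python 2.7 listed in the dependencies, is an index-to-position conversion that uses `/`, which is true division in Python 3. The sketch below only illustrates that mechanism; the function names are hypothetical and the traceback above does not confirm that this is the actual cause.

```python
def idx_to_pos_true_division(idx, height):
    # `/` is true division in Python 3, so the second component is a float.
    return (idx % height, idx / height)


def idx_to_pos(idx, height):
    # `//` is floor division in both Python 2 and 3 and keeps integers.
    return (idx % height, idx // height)


bad = idx_to_pos_true_division(7, 5)   # second component is a float in Python 3
good = idx_to_pos(7, 5)                # both components are ints
```

If this is the cause, switching to `//` or wrapping the index in int() before it reaches self.grid would keep the indices integral.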

Possible bug: action determined from the previous (not current) state

Hi,

I feel like something is wrong with the gw.step() calls at
(https://github.com/stormmax/irl-imitation/blob/master/maxent_irl_gridworld.py#L95)
and
(https://github.com/stormmax/irl-imitation/blob/master/deep_maxent_irl_gridworld.py#L72).

I think
cur_state, action, next_state, reward, is_done = gw.step(int(policy[gw.pos2idx(cur_state)]))
should be
cur_state, action, next_state, reward, is_done = gw.step(int(policy[gw.pos2idx(next_state)]))

Calling step() advances the current state stored inside the gridworld object. After the call, the local variable next_state (confusingly, not cur_state) therefore holds the environment's current state, and that is the state that should be passed to the policy.

Do I misunderstand something?
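The point can be illustrated with a rollout loop that tracks the environment's state explicitly. This is a sketch under assumptions: ChainWorld is a hypothetical stand-in for the gridworld, with integer states instead of positions, and only the shape of the step() return tuple is taken from the calls quoted above.

```python
class ChainWorld(object):
    """Hypothetical 1D chain: states 0..n-1, action 0 = left, 1 = right."""

    def __init__(self, n):
        self.n = n
        self.state = 0

    def reset(self, start):
        self.state = start

    def step(self, action):
        cur = self.state
        nxt = min(self.n - 1, cur + 1) if action == 1 else max(0, cur - 1)
        self.state = nxt  # the environment's internal state advances here
        is_done = nxt == self.n - 1
        reward = 1.0 if is_done else 0.0
        return cur, action, nxt, reward, is_done


def rollout(env, policy, start, max_len):
    """Collect one trajectory, always querying the policy at the state
    the environment is currently in."""
    env.reset(start)
    state = start
    traj = []
    for _ in range(max_len):
        cur, action, nxt, reward, is_done = env.step(policy[state])
        assert cur == state  # the policy was indexed by the state we acted from
        traj.append((cur, action, nxt, reward))
        if is_done:
            break
        state = nxt  # carry the new state forward for the next action
    return traj


env = ChainWorld(4)
traj = rollout(env, [1, 1, 1, 1], start=0, max_len=10)
```

Keeping a single state variable that is updated from the returned next state avoids the ambiguity between cur_state and next_state described in the issue.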