
batch-ppo's Introduction

Batch PPO

This project provides optimized infrastructure for reinforcement learning. It extends the OpenAI Gym interface to multiple parallel environments and allows agents to be implemented in TensorFlow and perform batched computation. As a starting point, we provide BatchPPO, an optimized implementation of Proximal Policy Optimization.

Please cite the TensorFlow Agents paper if you use code from this project in your research:

@article{hafner2017agents,
  title={TensorFlow Agents: Efficient Batched Reinforcement Learning in TensorFlow},
  author={Hafner, Danijar and Davidson, James and Vanhoucke, Vincent},
  journal={arXiv preprint arXiv:1709.02878},
  year={2017}
}

Dependencies: Python 2/3, TensorFlow 1.3+, Gym, ruamel.yaml

Instructions

Clone the repository and run the PPO algorithm by typing:

python3 -m agents.scripts.train --logdir=/path/to/logdir --config=pendulum

The algorithm to use is defined in the configuration; the pendulum configuration started here uses the included PPO implementation. Check out more pre-defined configurations in agents/scripts/configs.py.
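
Configurations are plain Python functions that build on a shared default() and return their local variables as a dictionary. Below is a minimal sketch of a custom configuration, assuming the same helpers the bundled configs use (default() and the models in scripts/networks.py); the values are purely illustrative, not recommended settings:

def my_pendulum():
  """Hypothetical configuration for the Pendulum task; values are examples only."""
  locals().update(default())  # start from the shared defaults in configs.py
  # Environment
  env = 'Pendulum-v0'
  max_length = 200
  steps = 1e6
  # Network; feed_forward_gaussian is assumed to exist in scripts/networks.py
  network = networks.feed_forward_gaussian
  return locals()

Run it by passing --config=my_pendulum to the training command above.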

If you want to resume a previously started run, add the --timestamp=<time> flag to the last command and provide the timestamp in the directory name of your run.

To visualize metrics, start TensorBoard from another terminal, then point your browser to http://localhost:2222:

tensorboard --logdir=/path/to/logdir --port=2222

To render videos and gather OpenAI Gym statistics to upload to the scoreboard, type:

python3 -m agents.scripts.visualize --logdir=/path/to/logdir/<time>-<config> --outdir=/path/to/outdir/

Modifications

We release this project as a starting point that makes it easy to implement new reinforcement learning ideas. These files are good places to start when modifying the code:

File                     Content
scripts/configs.py       Experiment configurations specifying the tasks and algorithms.
scripts/networks.py      Neural network models.
scripts/train.py         The executable file containing the training setup.
algorithms/ppo/ppo.py    The TensorFlow graph for the PPO algorithm.

To run unit tests and linting, type:

python2 -m unittest discover -p "*_test.py"
python3 -m unittest discover -p "*_test.py"
python3 -m pylint agents

For further questions, please open an issue on GitHub.

Implementation

We include a batched interface for OpenAI Gym environments that fully integrates with TensorFlow for efficient algorithm implementations. This is achieved through these core components:

  • agents.tools.wrappers.ExternalProcess is an environment wrapper that constructs an OpenAI Gym environment inside an external process. Calls to step() and reset(), as well as attribute access, are forwarded to the process, and the caller waits for the result. This allows running multiple environments in parallel without being restricted by Python's global interpreter lock.
  • agents.tools.BatchEnv extends the OpenAI Gym interface to batches of environments. It combines multiple OpenAI Gym environments, with step() accepting a batch of actions and returning a batch of observations, rewards, done flags, and info objects. If the individual environments live in external processes, they will be stepped in parallel.
  • agents.tools.InGraphBatchEnv integrates a batch environment into the TensorFlow graph and makes its step() and reset() functions accessible as operations. The current batch of observations, last actions, rewards, and done flags is stored in variables and made available as tensors.
  • agents.tools.simulate() fuses the step of an in-graph batch environment and a reinforcement learning algorithm together into a single operation to be called inside the training loop. This reduces the number of session calls and provides a simple way to train future algorithms.
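
These components can also be composed by hand. The sketch below shows the rough idea, under the assumption that the constructors take the arguments suggested by the descriptions above; check the actual signatures in agents/tools before relying on it:

import gym
import tensorflow as tf
from agents import tools

# Run each environment in its own process to sidestep the GIL (constructor usage is assumed).
constructors = [lambda: gym.make('Pendulum-v0') for _ in range(4)]
envs = [tools.wrappers.ExternalProcess(constructor) for constructor in constructors]

# Combine the environments into one batched environment and lift it into the graph.
batch_env = tools.BatchEnv(envs, blocking=False)
in_graph_env = tools.InGraphBatchEnv(batch_env)

In practice, agents/scripts/train.py performs this wiring through utility.define_batch_env(), so modifying the training script is usually easier than building the batch environment manually.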

To understand all the code, please make yourself familiar with TensorFlow's control flow operations, especially tf.cond(), tf.scan(), and tf.control_dependencies().
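
For example, tf.scan() is what allows returns to be accumulated inside the graph. A toy illustration (not code from this repository) of computing discounted returns for a single episode:

import tensorflow as tf

rewards = tf.constant([1.0, 0.0, 2.0, 1.0])
discount = 0.9
# Scan backwards over the episode: return_t = reward_t + discount * return_{t+1}.
returns = tf.reverse(
    tf.scan(lambda agg, reward: reward + discount * agg,
            tf.reverse(rewards, [0]),
            initializer=tf.zeros([])),
    [0])
with tf.Session() as sess:
  print(sess.run(returns))  # approximately [3.35, 2.61, 2.9, 1.0]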

Disclaimer

This is not an official Google product.

batch-ppo's People

Contributors

adamstelmaszczyk, alexpashevich, blazej0, brettkoonce, chychen, danijar, darrellenns, dizcology, gjtucker, jlewi, kashif, vincentvanhoucke

batch-ppo's Issues

Performance issues in the program

Hello, I found a performance issue in google-research/batch-ppo/blob/master/agents/parts/iterate_sequences.py: dataset = dataset.map was called without num_parallel_calls. I think it would increase the efficiency of your program if you added this.

Here is the TensorFlow documentation to support this.

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.
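
For reference, the suggested change is a single extra argument; a sketch (the degree of parallelism is only an example, and it requires a TensorFlow version where tf.data supports num_parallel_calls):

# Before: elements are processed one at a time.
dataset = dataset.map(remove_padding)
# After: process several elements in parallel (value is illustrative).
dataset = dataset.map(remove_padding, num_parallel_calls=4)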

Errors about the regularizer term

Dear Danijar,

Thanks for your great contribution; this repo helped me learn a lot.
I found some lines of code that might be wrong.

In the file ppo.py, lines 486-488 read as below:

entropy = policy.entropy()
if self._config.entropy_regularization:
  policy_loss -= self._config.entropy_regularization * entropy

The shape of entropy, (batch_size, episode_length), and the shape of policy_loss, (batch_size,), are mismatched.
If I enable entropy_regularization in the config files, it causes an error:
ValueError: Inconsistent shapes: saw (?,) but expected () (and infer_shape=True)

I think the rank of entropy should be reduced before the subtraction:
entropy = tf.reduce_mean(policy.entropy(), axis=1)
if self._config.entropy_regularization:
  policy_loss -= self._config.entropy_regularization * entropy

Thanks

Different frames error when adding another RNNCell

I have two different neural networks that both inherit from tf.contrib.rnn.RNNCell, say a = net_1(s) and scal = net_2(s, a). net_1(s) is exactly the policy network defined in networks, and net_2(s, a) is a similar network that takes both a and s as input. I want to do a tf.scan operation to update net_2, which includes a loss conditioned on both networks' outputs, say loss = a - scal. Doing this throws an error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: The node 'end_episode/cond/cond/training/update_policy/scan_1/while/policy_loss/ExpandDims_2/Switch' has inputs from different frames. The input 'simulate/cond_2/pred_id' is in frame ''. The input 'end_episode/cond/cond/training/update_policy/scan_1/while/policy_loss/cfnetwork/rnn/while/steincf/gradients/end_episode/cond/cond/training/update_policy/scan_1/while/policy_loss/cfnetwork/rnn/while/steincf/Flatten_1/Reshape_grad/Reshape' is in frame 'end_episode/cond/cond/training/update_policy/scan_1/while/policy_loss/cfnetwork/rnn/while/end_episode/cond/cond/training/update_policy/scan_1/while/policy_loss/cfnetwork/rnn/while/'.

It states that in loss = a - scal, a is in frame '' while scal is inside its while-loop frame.

I searched Google but did not find any problems like this, so I am not sure whether it is a bug or something else. Could anyone help me figure out this problem?

Distributed training with Kubernetes

Opening this issue to start a discussion about whether it would be worth investing in making it easy to run TensorFlow Agents on Kubernetes (K8s).

For some inspiration you can look at TfJob CRD.

Some questions:

  1. Is there a need to be able to distribute the environments across multiple machines?
  2. What is the communication pattern between the simulations and TensorFlow job?
    * Is data fetched from all simulations simultaneously?
    * Does each simulation need to be individually addressable?

OpenAI Retro environment support/example

This is a feature request to add support/examples for OpenAI Retro environments.

The Retro API is the same as the standard Gym environments; however, the observations consist of screen images. This probably means we need some sort of preprocessing/CNN/downsampling stages.

For some environments, it may be necessary to build an input state using multiple past observed images in order to capture complete state (as things like velocity/acceleration of sprites are not captured in a single frame image).
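
As a rough sketch of what such preprocessing could look like, here is a hypothetical Gym wrapper (not part of this repository) that gray-scales, downsamples, and stacks the last few frames:

import numpy as np
import gym
from gym import spaces

class FrameStackWrapper(gym.Wrapper):
  """Hypothetical wrapper: gray-scale, downsample, and stack recent frames."""

  def __init__(self, env, size=(84, 84), history=4):
    super(FrameStackWrapper, self).__init__(env)
    self._size = size
    self._history = history
    self._frames = None
    self.observation_space = spaces.Box(low=0.0, high=1.0, shape=size + (history,))

  def _process(self, frame):
    gray = frame.mean(axis=-1) / 255.0  # gray-scale and normalize to [0, 1]
    height, width = self._size
    rows = np.linspace(0, gray.shape[0] - 1, height).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, width).astype(int)
    return gray[rows][:, cols]  # nearest-neighbor downsample

  def reset(self, **kwargs):
    frame = self._process(self.env.reset(**kwargs))
    self._frames = [frame] * self._history
    return np.stack(self._frames, axis=-1)

  def step(self, action):
    observ, reward, done, info = self.env.step(action)
    self._frames = self._frames[1:] + [self._process(observ)]
    return np.stack(self._frames, axis=-1), reward, done, info

A convolutional model in scripts/networks.py, frame skipping, and reward clipping would be natural additions on top of this.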

Modification of network: How to handle multiple inputs

Hi, I am modifying this code to write a new function that conditions on both action and state to reduce variance (I know this may introduce bias, but it can be corrected later, as in the Q-prop paper).

I have two batches of data, action and observ, with shapes [batch_size, act_dim] and [batch_size, obs_dim], respectively, and I want to feed them into tf.nn.dynamic_rnn.
Since tf.nn.dynamic_rnn expects input with shape [batch_size, max_time, input_size], we can pass action[:, None] and observ[:, None] instead to match the shape.

What I want is to inherit tf.contrib.rnn.RNNCell and process action and observ inside __call__(self, input, state), so I really need to pass both observ and action instead of merging them first and then passing the result.

However, I do not know how to handle two inputs for tf.nn.dynamic_rnn.
The documentation says that it accepts a tuple of tensors, so I pass tuple_input = [action, observ] and hope to get action and observ inside __call__ through tuple_input[0] and tuple_input[1]. However, an error occurs:

  File "/opt/anaconda/envs/rl/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 547, in dynamic_rnn
    flat_input = tuple(_transpose_batch_time(input_) for input_ in flat_input)
  File "/opt/anaconda/envs/rl/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 547, in <genexpr>
    flat_input = tuple(_transpose_batch_time(input_) for input_ in flat_input)
  File "/opt/anaconda/envs/rl/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 67, in _transpose_batch_time
    (x, x_static_shape))
ValueError: Expected input tensor Tensor("main/CheckNumerics_1:0", shape=(2,), dtype=float32, device=/device:CPU:0) to have rank at least 2, but saw shape: (2,)

It seems that I cannot pass a tuple. Could you please suggest how to handle multiple inputs for tf.nn.dynamic_rnn?

The actual code using tf.nn.dynamic_rnn() is as follows:

tuple_input = [action[:, None], observ[:, None]]
cell = self._config.network(self._batch_env.action.shape[1].value)                                                                                                                                          
(mean, logstd, value), state = tf.nn.dynamic_rnn(                                                                                                                                                           
    cell, tuple_input, length, state, tf.float32, swap_memory=True)  

And this is how we inherit tf.contrib.rnn.RNNCell:

class NewNetwork(tf.contrib.rnn.RNNCell):
  """ Inherited RNN Network
  """

  def __init__(
      self, layers, action_size,
      mean_weights_initializer=_MEAN_WEIGHTS_INITIALIZER,
      logstd_initializer=_LOGSTD_INITIALIZER):
    self._layers = layers
    self._action_size = action_size
    self._mean_weights_initializer = mean_weights_initializer
    self._logstd_initializer = logstd_initializer

  @property
  def state_size(self):
    unused_state_size = 1
    return unused_state_size

  @property
  def output_size(self):
    return tf.TensorShape([])

  def __call__(self, obsact, state):
    with tf.variable_scope('network'):
      observation = obsact[0]
      action = obsact[1]
      x = tf.contrib.layers.flatten(observation)
      y = tf.contrib.layers.flatten(action)
      for size in self._layers:
        x = tf.contrib.layers.fully_connected(x, size, tf.nn.relu)
        y = tf.contrib.layers.fully_connected(y, size, tf.nn.relu)
      xy = tf.concat([x, y], axis=1)
      value = tf.contrib.layers.fully_connected(xy, 1, None)[:, 0]
    return (value), state
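
Not an authoritative answer, but here are two possible workarounds, sketched under the assumption that tf.nn.dynamic_rnn forwards nested input structures to the cell as long as every leaf tensor has a time axis, i.e. shape [batch, time, ...]:

# Option 1: give every leaf of the tuple a time axis of length one.
tuple_input = (action[:, None, :], observ[:, None, :])
outputs, state = tf.nn.dynamic_rnn(
    cell, tuple_input, length, state, tf.float32, swap_memory=True)
# Inside __call__, `inputs` then arrives as the same tuple of per-step tensors:
#   action_t, observ_t = inputs

# Option 2: concatenate along the feature axis and split inside the cell.
joint_input = tf.concat([action, observ], axis=1)[:, None, :]  # [batch, 1, act_dim + obs_dim]
# ...and in __call__:
#   action_t, observ_t = tf.split(inputs, [act_dim, obs_dim], axis=1)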

Training threads don't start on Windows

Hi

I started the learning a few minutes ago and this is what I got in command prompt:

E:\agents>python -m agents.scripts.train --logdir=E:\model --config=pendulum
INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\
20170918T084053-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.

It's been like this for about 10 minutes and TensorBoard doesn't show anything.
In the log directory, there is only one file called 'config.yaml'.
Is it OK? It would be nice to see whether the agent is progressing or whether it is hung.

Thanks
Amin

Generalized advantage estimation

Hi Danijar,

Recently, I was diving into the details of the code and found some lines that might be wrong.

  1. In the file ppo.py, lines 356-358, the function that computes the advantage is given five inputs:
    advantage = utility.lambda_advantage( reward, value, length, self._config.discount, self._config.gae_lambda),
    but the definition of this function in the file utility.py only has four arguments.

  2. In the file utility.py, lines 89-90, the code that computes the return:
    return_ += discount ** window * tf.concat( [value[:, window:], tf.zeros_like(value[:, -window:]), 1]),
    should it be:
    return_ += discount ** window * tf.concat( [value[:, window:], tf.zeros_like(value[:, -window:])], 1) ?
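
For intuition, the corrected expression shifts the value sequence forward by window steps and zero-pads past the end of the episode. A small NumPy illustration (not repository code):

import numpy as np

value = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape [batch=1, time=4]
window = 2
# Equivalent of tf.concat([value[:, window:], tf.zeros_like(value[:, -window:])], 1):
shifted = np.concatenate([value[:, window:], np.zeros_like(value[:, -window:])], axis=1)
print(shifted)  # [[3. 4. 0. 0.]]: the value window steps ahead, zero beyond the episode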

No module named 'mujoco_py.mjlib'

Hi.
I'm trying to run the code on Linux Mint.
With the pendulum environment, it works perfectly. But when I try to run with MuJoCo environments, I get the following error (I installed MuJoCo and mujoco-py):

Traceback (most recent call last):
  File "/home/amin/Projects/agents/agents/tools/wrappers.py", line 436, in _worker
    env = constructor()
  File "/home/amin/Projects/agents/agents/scripts/train.py", line 107, in <lambda>
    batch_env = utility.define_batch_env(lambda: _create_environment(config), config.num_agents, env_processes)
  File "/home/amin/Projects/agents/agents/scripts/train.py", line 47, in _create_environment
    env = gym.make(config.env)
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/registration.py", line 161, in make
    return registry.make(id)
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/registration.py", line 119, in make
    env = spec.make()
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/registration.py", line 85, in make
    cls = load(self._entry_point)
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/registration.py", line 17, in load
    result = entry_point.load(False)
  File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 2405, in load
    return self.resolve()
  File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 2411, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/mujoco/__init__.py", line 1, in <module>
    from gym.envs.mujoco.mujoco_env import MujocoEnv
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/mujoco/mujoco_env.py", line 14, in <module>
    raise error.DependencyNotInstalled("{}. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)".format(e))
gym.error.DependencyNotInstalled: No module named 'mujoco_py.mjlib'. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)

Maximum number of clients reachedSegmentation fault (core dumped)

Hi,

I'm trying to train my custom env (Xmoto game) with PPO. I get the following error after training for about 2 hours, during phase eval (phase step 11100, global step 33900).

Maximum number of clients reachedSegmentation fault (core dumped)

My config:

def xmoto():
  """Configuration for Xmoto task."""
  locals().update(default())
  env = 'Xmoto-v0'
  use_gpu = False
  max_length = 10
  steps = 3e4
  update_every = 60
  network = networks.feed_forward_categorical
  return locals()

Running TensorFlow 1.12.0 (not GPU).
PC config: i7 6700HQ, 16 GB RAM, Ubuntu 18.04.

Ran with:

python3 -m batch-ppo-master.agents.scripts.train --config=xmoto --noenv_processes --logdir=models

Thanks for any help

Typo in Paper

Hi,
I didn't know the best way to communicate this, but I think this will do, as the .tex is not in the repo.

In the linked paper, there is a typo. Just look for:

A TensorFlow Agents algorithm defines the inference and learning computation
for of a batch of agents.

Between the "for" and the "of" there's something up.

Also

The agents are represented as indices into the batch dimension.

I think that should be "in" instead of "into".

and returns an actions as a tensor with batch dimension

I think it should be "returns actions as a tensor with batch dimension"?

Sorry if you don't care; I thought since it's a preprint, correcting some typos would be welcome.

Thanks for the work by the way! I'll employ it in the next months in my thesis.

P.S. What is the difference between your approach (using the TF graph) and the ELF approach of using a separate C++ environment for multi-threading? Did the other team not want to use TF, or do they get better multithreading performance than your system? I posted the same question on Stack Exchange.

Enabling GPU compromises learning performance

I've run several trials with TF Agents and found that enabling the GPU through the use_gpu configuration flag stalls or inhibits task convergence. Any help troubleshooting this would be appreciated.

The problem seems to exist in all environments but is most prominent in the pendulum task. (Plots below.)

With GPU (4 runs):

[screenshot: pendulum reward curves for 4 runs with GPU enabled]

Without GPU (4 runs):

[screenshot: pendulum reward curves for 4 runs without GPU]

These runs were generated with a fresh clone of the TF Agents repo as of this morning but previous versions showed similar results. The only difference between the two graphs is the use of the GPU.

It's also about 3x slower to use the GPU on the pendulum task but I suspect that's due to the relatively small size of the network vs the cost of data transfer to the GPU.

Also tested with:

  • CUDNN 6 and 5
  • TensorFlow 1.3.0 and 1.2.1
  • Both tensorflow and tensorflow-gpu packages (no apparent difference b/w these two for CPU)

(This issue may be related to #8?)

cc @danijar

EDIT:
GPU run logs: https://gist.github.com/jimfleming/0a163522f02ef9411a5b478099321497
CPU-only run logs: https://gist.github.com/jimfleming/e1eaafb720ee1ee969ea2f4a879ab17b

How to draw the score curve of PPO?

Hello,

I don't know how to draw the score curve of PPO that appears in the PPO paper. How do you deal with the situation when the game is not over but the sample pool is full? In this case, if we end the game, it means we cannot compute the score, since the agent needs to perform more than Horizon (T) interactions with the environment to get it, for example in Walker2d-v1. But if we don't end the game when the sample pool is full, the number of samples may be larger than Horizon (T).

I don't know how to deal with this problem. How do you deal with this problem in the PPO experiment? I really care about this. Thanks for your help.

Computation time needed for HalfCheetah

Hi
I downloaded the repository and tried to replicate the paper results in the HalfCheetah environment.
But the code seems to be super slow. As you can see in the picture, reward improvement starts after 4 hours and it reaches the early peak after 8 hours (which, according to the paper, should happen in less than an hour). Am I doing something wrong?

[screenshot: HalfCheetah reward curve over training time]

I have Linux Mint and I installed the TensorFlow GPU version (however, the use_gpu flag is false in the code by default). This is my system config:
Hexa core Intel Core i7-4930K (-HT-MCP-) cache: 12288 KB
GeForce GTX 1060 6GB/PCIe/SSE2

Potential integration of MonitoredTrainingSession and revision of tools/loop.py

Using MonitoredTrainingSession has various benefits:

  • Easy integration with SyncReplicasOptimizer for synchronous training including initialization of variables in distributed setting i.e.
  opt = tf.SyncReplicasOptimizer(...)
  train_op = opt.minimize(total_loss, global_step=global_step)
  sync_rep_hook = opt.make_session_run_hook(is_chief)
  with training.MonitoredTrainingSession(master=master, is_chief=is_chief, hooks=[sync_rep_hook]) as mon_sess:
    while not mon_sess.should_stop():
      mon_sess.run(training_op)
  • Increases fault tolerance by automating recovery of failed sessions as well as graceful crashes?
  • Hooks for writing checkpoints and summaries every number of steps or seconds

We might benefit from migrating some of the functionality of the Loop object to this paradigm. One question is how to run a hook every specified number of steps. This could toggle a phase variable that conditionally executes one or another phase of the graph.
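
A minimal sketch of such a hook, using the standard tf.train.SessionRunHook interface (illustrative only, not part of this repository; the toggle op is a placeholder for whatever switches the phase):

import tensorflow as tf

class EveryNStepsHook(tf.train.SessionRunHook):
  """Hypothetical hook that runs a toggle op every N global steps."""

  def __init__(self, every_steps, toggle_op):
    self._every_steps = every_steps
    self._toggle_op = toggle_op  # e.g. an assignment flipping a phase variable

  def begin(self):
    self._global_step = tf.train.get_or_create_global_step()

  def before_run(self, run_context):
    # Ask the session to also fetch the global step on every run call.
    return tf.train.SessionRunArgs(self._global_step)

  def after_run(self, run_context, run_values):
    step = run_values.results
    if step and step % self._every_steps == 0:
      run_context.session.run(self._toggle_op)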

This is motivated by the difficulty of debugging the distributed PPO implementation; I am wondering whether working outwards-in from a best practice for using SyncReplicasOptimizer together with MonitoredTrainingSession might be one way to go.

Minimal Example For TPU?

Hey all,

Are there plans to support training on TPU? I use Cloud TPUs and would really appreciate an example of the correct way to do RL on TPUs.

I imagine we would avoid TPUEstimator and use tf.Sessions along with tpu.rewrite directly, but I haven't seen any of this written down besides tensorflow/minigo which is a bit of a behemoth and not exactly a "minimal example".

Let me know your thoughts and feel free to email me directly.
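
For what it's worth, the session-based pattern would presumably look roughly like the sketch below. This is a guess based on the general tf.contrib.tpu API of the TF 1.x era and has not been tried with this repository; names such as the TPU address are placeholders:

import numpy as np
import tensorflow as tf
from tensorflow.contrib import tpu
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

def step_fn(observ):
  # Placeholder for the per-step computation (policy forward pass, update, ...).
  return tf.reduce_mean(observ)

resolver = TPUClusterResolver(tpu='my-tpu')  # placeholder TPU name
observ = tf.placeholder(tf.float32, [16, 8])
tpu_step = tpu.rewrite(step_fn, [observ])    # compile step_fn for the TPU

with tf.Session(resolver.get_master()) as sess:
  sess.run(tpu.initialize_system())
  result = sess.run(tpu_step, {observ: np.zeros((16, 8), np.float32)})
  sess.run(tpu.shutdown_system())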

Performance issue in agents/parts/iterate_sequences.py

Hello, I've found a performance issue in "agents/parts/iterate_sequences.py": dataset = dataset.batch(batch_size or num_sequences) (here) should be called before dataset = dataset.map(remove_padding).flat_map (here), which would make your program more efficient. Moreover, length = sequence.pop('length') (here) indicates that the function remove_padding (here) isn't vectorized. To call dataset = dataset.batch(batch_size or num_sequences) before dataset = dataset.map(remove_padding).flat_map, remove_padding would have to be vectorized. Can remove_padding be vectorized? The performance of your code could be further improved if it can.

Here is the TensorFlow documentation to support this.

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.
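
For illustration, the general batch-before-map pattern in tf.data looks like this (a sketch with a trivial, already-vectorized map function; the repository's remove_padding would have to be vectorized for the same reordering to apply):

import tensorflow as tf

dataset = tf.data.Dataset.range(1000)

# Per-element map, then batch: the map function runs once per element.
slow = dataset.map(lambda x: x * 2).batch(32)

# Batch first, then a vectorized map: the map function runs once per batch.
fast = dataset.batch(32).map(lambda x: x * 2)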

With GPU enabled, tensorflow freezes unless I force "Discounted Monte-Carlo returns." to CPU

With GPU enabled, TensorFlow freezes unless I force "Discounted Monte-Carlo returns." to the CPU. Adding with tf.device("/cpu") inside discounted_return(reward, length, discount) seems to address the issue.

def discounted_return(reward, length, discount):
  """Discounted Monte-Carlo returns."""
  timestep = tf.range(reward.shape[1].value)
  mask = tf.cast(timestep[None, :] < length[:, None], tf.float32)
  with tf.device("/cpu"):
    return_ = tf.reverse(tf.transpose(tf.scan(
        lambda agg, cur: cur + discount * agg,
        tf.transpose(tf.reverse(mask * reward, [1]), [1, 0]),
        tf.zeros_like(reward[:, -1]), 1, False), [1, 0]), [1])
  return tf.check_numerics(tf.stop_gradient(return_), 'return')

I've seen this with TF 1.7, 1.11, 1.12, CUDA 8, 9, 10 and CUDA compute capability from 5.2 to 7.5.

Not sure how to debug TensorFlow when it quietly freezes (or crashes). I tried the TensorFlow Debugger; it doesn't really show where it happens and also has gRPC issues. GDB shows that the process is in the following place, but with so many threads it is hard to tell if this has any relevance:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f4902dde6db in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007f4902dddcf9 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007f4902ddb2bb in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007f4902ddb793 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007f490280594c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f490280599b in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00007f490280c5fb in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
   from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x00007f4902815598 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x00007f48ff8afc8c in tensorflow::SessionRef::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f48ffaa49b4 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) ()
   from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f48ffaa57e6 in TF_SessionRun () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f48ff8acffd in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f48ff8ad032 in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f48ff867d84 in _wrap_TF_SessionRun_wrapper () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

Support for discrete action spaces

I'm trying to train an agent in an environment with discrete spaces and I'm getting the following exception.
Is that by design? If yes, any pointers on the changes required to make it work with a discrete space?

ERROR:tensorflow:Error in environment process: Traceback (most recent call last):
  File "agents/tools/wrappers.py", line 445, in _worker
    env = constructor()
  File "/home/dchichkov/w/agents/agents/scripts/train.py", line 106, in <lambda>
    lambda: _create_environment(config),
  File "/home/dchichkov/w/agents/agents/scripts/train.py", line 52, in _create_environment
    env = tools.wrappers.RangeNormalize(env)
  File "agents/tools/wrappers.py", line 195, in __init__
    observ is not False and self._is_finite(self._env.observation_space))
  File "agents/tools/wrappers.py", line 251, in _is_finite
    return np.isfinite(space.low).all() and np.isfinite(space.high).all()
AttributeError: 'Discrete' object has no attribute 'low'

def copy():
  """Configuration for the copy task."""
  locals().update(default())
  # Environment
  env = 'Copy-v0'
  max_length = 200
  steps = 2e6  # 2M
  return locals()
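
Not an official answer, but one possible direction: the failure comes from RangeNormalize inspecting low/high bounds that Discrete spaces do not have, so you could either skip the normalization wrapper for non-Box spaces in _create_environment or convert Discrete observations to one-hot vectors first. A hypothetical wrapper sketch (not part of this repository):

import numpy as np
import gym
from gym import spaces

class OneHotObservation(gym.ObservationWrapper):
  """Hypothetical wrapper turning Discrete observations into one-hot Box vectors."""

  def __init__(self, env):
    super(OneHotObservation, self).__init__(env)
    assert isinstance(env.observation_space, spaces.Discrete)
    self._n = env.observation_space.n
    self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(self._n,))

  def observation(self, observ):
    one_hot = np.zeros(self._n, dtype=np.float32)
    one_hot[observ] = 1.0
    return one_hot

A categorical policy, such as the feed_forward_categorical network mentioned in another issue above, would likely also be needed for discrete actions.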

GPU doesn't seem to work

I've set use_gpu = True, but the GPU usage is almost zero when running the code. When I look into TensorBoard, it shows that all operations are assigned to the CPU. Then I disable sess_config = tf.ConfigProto(allow_soft_placement=True) and force it to run on the GPU, and the system console throws this error:
INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\Code\PythonScripts\DeepRL\BatchPPO\20180308T091941-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.
2018-03-08 09:19:41.315004: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-08 09:19:41.595863: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.64GiB
2018-03-08 09:19:41.596493: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
INFO:tensorflow:Graph contains 42003 trainable variables.
2018-03-08 09:19:57.811479: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
Traceback (most recent call last):
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1293, in _run_fn
self._extend_graph()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in
tf.app.run()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 145, in main
for score in train(config, FLAGS.env_processes):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 127, in train
utility.initialize_variables(sess, saver, config.logdir)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\scripts\utility.py", line 116, in initialize_variables
tf.global_variables_initializer()))
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
run_metadata_ptr)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]

Caused by op 'ppo_temporary/episodes/Variable', defined at:
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in
tf.app.run()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 145, in main
for score in train(config, FLAGS.env_processes):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 113, in train
batch_env, config.algorithm, config)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\scripts\utility.py", line 48, in define_simulation_graph
algo = algo_cls(batch_env, step, is_training, should_log, config)
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\ppo\algorithm.py", line 78, in init
template, len(batch_env), config.max_length, 'episodes')
File "E:\Code\PythonScripts\DeepRL\BatchPPO\agents\ppo\memory.py", line 44, in init
self._length = tf.Variable(tf.zeros(capacity, tf.int32), False)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\variables.py", line 213, in init
constraint=constraint)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\variables.py", line 331, in _init_from_args
name=name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\state_ops.py", line 133, in variable_op_v2
shared_name=shared_name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 926, in _variable_v2
shared_name=shared_name, name=name)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]

It seems that TensorFlow does not allow assigning an int32 variable to the GPU.
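
For context, the sess_config = tf.ConfigProto(allow_soft_placement=True) mentioned above is what lets such ops silently fall back to the CPU while the rest of the graph stays on the GPU; a minimal sketch:

import tensorflow as tf

# Soft placement: ops without a GPU kernel (e.g. int32 variables) fall back to the CPU.
sess_config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=sess_config) as sess:
  sess.run(tf.global_variables_initializer())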

No initial states in update_steps

I'm not sure I understood correctly, but it seems there is no initial state set in _perform_update_steps and _update_step in algorithm.py.

'''in def _perform_update_steps()'''
value = self._network(observ, length).value
...

'''in def _update_step()'''
network = self._network(observ, length)

While it is set in the perform function:

'''in def perform()'''
output = self._network(observ[:, None], tf.ones(observ.shape[0]), state)

In the above two functions, are the initial states always set to zero? Is there any reason?

Running out of GPU memory

Hey all:

When I try to run train.py, it takes all the GPU memory. I tried to add per_process_gpu_memory_fraction, but it didn't work.

So how can I limit the GPU memory usage?
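
A sketch of the usual TF 1.x options; since the training script builds its own tf.ConfigProto (see the GPU placement report above), the setting has to be applied where that session config is constructed rather than from the command line:

import tensorflow as tf

sess_config = tf.ConfigProto(allow_soft_placement=True)
# Either grow GPU memory on demand...
sess_config.gpu_options.allow_growth = True
# ...or cap the fraction of GPU memory TensorFlow may claim (value is illustrative).
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=sess_config)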
