
Playing Atari with Deep Reinforcement Learning

Abstract

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

Introduction

Most successful RL applications have relied on hand-crafted features combined with linear value functions or policy representations. Recent advances in deep learning have made it possible to extract high-level features from raw sensory data.

Several challenges arise from a deep learning perspective:

  1. Most successful deep learning applications require large amounts of hand-labelled training data, whereas RL must learn from a reward signal that is sparse, noisy, and delayed.
  2. The delay between actions and resulting rewards can be thousands of time-steps long.
  3. RL encounters sequences of highly correlated states, while most deep learning algorithms assume independent data samples.

They use an experience replay mechanism which randomly samples previous transitions, thereby smoothing the training distribution over many past behaviours and alleviating the problems of correlated data and non-stationary distributions.

Background

They define the environment E as the Atari emulator.

At each time-step, the agent selects an action a_t from the set of legal game actions, A = {1, . . . , K}. The action is passed to the emulator and modifies its internal state and the game score.

The agent does not observe the emulator's internal state; instead it observes an image x_t from the emulator, a vector of raw pixel values representing the current screen, and receives a reward r_t representing the change in game score.

Since it is impossible to fully understand the current situation from the current screen alone, they consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t−1}, x_t, and learn game strategies that depend upon these sequences. As a result, they can apply standard reinforcement learning methods for MDPs (Markov Decision Processes), treating each complete sequence s_t as a distinct state.

The goal of the agent is to maximise future rewards.
They assume that future rewards are discounted by a factor γ per time-step.
The future discounted return at time t is defined as R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where T is the time-step at which the game terminates.
The optimal action-value function is defined as Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], where π is a policy mapping sequences to actions (or to distributions over actions).
The basic idea behind many RL algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, Q_{i+1}(s, a) = E[r + γ max_{a'} Q_i(s', a') | s, a], which is called the value iteration algorithm and converges to the optimal Q* as i → ∞.
However, this approach is impractical, because the action-value function is estimated separately for each sequence, without any generalisation (estimating the function for every individual high-dimensional sequence is infeasible in time and memory). Therefore, it is common to use a function approximator, Q(s, a; θ) ≈ Q*(s, a), which captures the structure of sequences through a set of parameters.
They refer to a neural network function approximator with weights θ as a Q-network. A Q-network can be trained by minimising a sequence of loss functions L_i(θ_i) that changes at each iteration i, L_i(θ_i) = E_{s,a∼ρ(·)}[(y_i − Q(s, a; θ_i))²], where y_i = E_{s'∼E}[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a] is the target for iteration i and ρ(s, a) is the behaviour distribution over sequences s and actions a.
The parameters θ_{i−1} from the previous iteration are held fixed when optimising the loss function L_i(θ_i). (Because the target depends directly on the network weights, freezing the previous parameters keeps the target stable during learning.) Differentiating the loss function with respect to the weights, they arrive at the following gradient: ∇_{θ_i} L_i(θ_i) = E_{s,a∼ρ(·); s'∼E}[(r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)].
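To make the loss and its gradient concrete, here is a minimal NumPy sketch assuming a linear function approximator Q(s, a; θ) = θ·φ(s, a); the feature dimension, learning rate, and transition values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def q_value(theta, phi_sa):
    """Linear action-value estimate Q(s, a; theta) = theta . phi(s, a)."""
    return theta @ phi_sa

def td_gradient_step(theta, theta_prev, phi_sa, r, gamma, phi_next_all, alpha):
    """One stochastic gradient step on the squared TD error.

    theta        -- current weights theta_i, being optimised
    theta_prev   -- frozen weights theta_{i-1} from the previous iteration
    phi_sa       -- features of the sampled (s, a)
    r            -- observed reward
    gamma        -- discount factor
    phi_next_all -- features of (s', a') for every legal action a'
    alpha        -- learning rate (assumed value)
    """
    # Target y_i uses the frozen parameters theta_{i-1}.
    y = r + gamma * max(q_value(theta_prev, phi) for phi in phi_next_all)
    # TD error and gradient of the squared loss (1/2)(y - Q)^2 w.r.t. theta_i.
    td_error = y - q_value(theta, phi_sa)
    grad = -td_error * phi_sa
    return theta - alpha * grad          # gradient descent step

# Illustrative usage with made-up numbers.
d = 4                                    # assumed feature dimension
theta = np.zeros(d)
theta_prev = theta.copy()
phi_sa = np.array([1.0, 0.0, 0.5, 0.0])
phi_next_all = [np.array([0.0, 1.0, 0.0, 0.5]), np.array([0.5, 0.5, 0.0, 0.0])]
theta = td_gradient_step(theta, theta_prev, phi_sa, r=1.0, gamma=0.99,
                         phi_next_all=phi_next_all, alpha=0.01)
```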

Deep Reinforcement Learning

  • Experience replay
    • Store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory.
    • In practice, the algorithm in the paper only stores the last N experiences and samples uniformly at random from D.
  • After performing the experience replay updates, the agent selects and executes an action according to an ε-greedy policy.
  • The Q-function works on a fixed-length representation of histories produced by a function φ, rather than on variable-length sequences.

The full algorithm, which the authors call deep Q-learning, is presented as Algorithm 1 (deep Q-learning with experience replay) in the paper.
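As a rough illustration of Algorithm 1, the sketch below shows the deep Q-learning loop with experience replay in Python. The environment interface (reset/step), the q_values and update callables, and the hyperparameter values are hypothetical stand-ins; the paper's agent uses a convolutional Q-network, an annealed ε, and RMSProp, which are omitted here.

```python
import random
from collections import deque

def deep_q_learning(env, q_values, update, preprocess,
                    num_episodes=100, memory_size=10_000,
                    batch_size=32, gamma=0.99, epsilon=0.1):
    """Sketch of deep Q-learning with experience replay (Algorithm 1).

    env        -- hypothetical environment: reset() -> frame,
                  step(action) -> (frame, reward, done)
    q_values   -- hypothetical function: state -> list of Q(s, a) per action
    update     -- hypothetical function performing one gradient step on a
                  minibatch of (state, action, reward, next_state, done)
    preprocess -- the function phi mapping a frame history to a fixed-length state
    """
    memory = deque(maxlen=memory_size)       # replay memory D, last N experiences

    for _ in range(num_episodes):
        frames = [env.reset()]
        state = preprocess(frames)
        done = False
        while not done:
            # epsilon-greedy action selection.
            qs = q_values(state)
            if random.random() < epsilon:
                action = random.randrange(len(qs))
            else:
                action = max(range(len(qs)), key=lambda a: qs[a])

            frame, reward, done = env.step(action)
            frames.append(frame)
            next_state = preprocess(frames)

            # Store the transition e_t = (s_t, a_t, r_t, s_{t+1}) in D.
            memory.append((state, action, reward, next_state, done))
            state = next_state

            # Sample a random minibatch from D and perform one learning step.
            if len(memory) >= batch_size:
                batch = random.sample(list(memory), batch_size)
                update(batch, gamma)
```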

  • Advantages:
  1. Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
  2. Randomising the samples breaks the correlations between consecutive samples and therefore reduces the variance of the updates.
  3. When learning on-policy, the current parameters determine the next data samples the parameters are trained on, which can lead to feedback loops, oscillations, or divergence; with experience replay the behaviour distribution is averaged over many previous states, smoothing out learning.
  • The algorithm is model-free and off-policy.
    • Model-free: the agent learns by trial and error directly from samples of the emulator, without building a model of the environment (as planning-based, model-based methods do).
    • Off-policy: the policy used to generate behaviour (the ε-greedy policy) differs from the policy being learned about (the greedy policy); on-policy methods use the same policy for both.

Pre-processing and Model Architecture

  • Raw Atari frames are 210 x 160 pixel images with a 128-colour palette.
  • Frames are converted to gray-scale and down-sampled to a 110 x 84 image.
  • An 84 x 84 region roughly covering the playing area is then cropped out.
  • The function φ applies this pre-processing to the last 4 frames of a history and stacks them to produce the input to the Q-function.
  • They use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network, so a single forward pass yields Q-values for all actions.
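A minimal sketch of this pre-processing using OpenCV and NumPy; the exact crop offsets and the padding of short histories are assumptions for illustration.

```python
import cv2
import numpy as np

def preprocess_frame(frame):
    """Convert one raw 210 x 160 RGB Atari frame to an 84 x 84 gray-scale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # gray-scale
    small = cv2.resize(gray, (84, 110))              # down-sample to 110 x 84 (width, height)
    return small[18:102, :]                          # assumed crop of the 84 x 84 playing area

def phi(history):
    """phi: stack the last 4 pre-processed frames into an 84 x 84 x 4 input."""
    last4 = list(history[-4:])
    # If the episode has just started, pad by repeating the first frame (assumption).
    while len(last4) < 4:
        last4 = [last4[0]] + last4
    return np.stack([preprocess_frame(f) for f in last4], axis=-1)
```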

Experiments

  • Since the scale of scores varies greatly from game to game, all positive rewards were fixed to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged.
  • They used the RMSProp algorithm with minibatches of size 32.
  • They use a simple frame-skipping technique: the agent sees and selects actions on every k-th frame instead of every frame, and its last action is repeated on the skipped frames.
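A short sketch of reward clipping and frame skipping; the env.step interface, k = 4, and the choice to clip each per-frame reward before summing over skipped frames are assumptions for illustration.

```python
import numpy as np

def clip_reward(reward):
    """Clip rewards to {-1, 0, +1} so the scale is comparable across games."""
    return float(np.sign(reward))

def skip_frames(env, action, k=4):
    """Repeat the chosen action for k frames and accumulate the clipped reward.

    Assumes a hypothetical env.step(action) -> (frame, reward, done) interface.
    """
    total_reward = 0.0
    frame, done = None, False
    for _ in range(k):
        frame, reward, done = env.step(action)
        total_reward += clip_reward(reward)
        if done:
            break
    return frame, total_reward, done
```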

References

  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602, 2013.
