quantaji / amd-ssd

Adaptive Mechanism Design (AMD) in Sequential Social Dilemmas (SSD). Team project for Foundations of Reinforcement Learning, Spring 2023.

Python 9.65% Jupyter Notebook 89.87% Shell 0.48%
mechanism-design multi-agent-reinforcement-learning reinforcement-learning reinforcement-learning-algorithms reinforcement-learning-environments

amd-ssd's People

Contributors

leon-lxa, minxuanqin, quantaji

amd-ssd's Issues

Current progress and problems so far

Current progress

  • Wolfpack env finished. (test code here)
  • Ran a few Adaptive Mechanism Design (AMD) experiments:
    • currently no evidence of cooperation under AMD, but some cooperation emerges when AMD is not applied.
      (screenshot of training results, 2023-05-02)

Existing Games as Sequential Social Dilemmas

Notes on Existing Games

The original SSD paper has two social dilemmas:

  • Gathering
    • existing code
  • Wolfpack
    • existing code

In vermashresth's SSD, there are three other social dilemmas:

  • Watershed
    • Watershed management is a resource-allocation problem among several self-interested agents. The agents can withdraw water from a finite but common supply for individual purposes. The problem involves several constraints and multiple objectives, and the optimisation is over continuous variables. Watershed management is modelled here as a multi-agent system.
  • Cleanup
    • A public goods dilemma in which agents get a reward for consuming apples, but must use a cleaning beam to clean a river in order for apples to grow. While an agent is cleaning the river, other agents can exploit it by consuming the apples that appear.
  • Harvest (This might be similar to Gathering)
    • A tragedy-of-the-commons dilemma in which apples regrow at a rate that depends on the amount of nearby apples. If individual agents employ an exploitative strategy by greedily consuming too many apples, the collective reward of all agents is reduced.
    • however, it seems unlikely that this implementation is actually multi-agent.

In social-dilemma/multiagent, there are three SSDs:

  • PrisonEnvironment
    • The PrisonEnvironment instantiates a gridworld variant of the classic Prisoner's Dilemma. At each step of the game, both agents independently choose to move left, move right, or stay still. The left side of the board represents full defection and the right side represents full cooperation; intermediate positions are a linear combination of the two extremes. Rewards are distributed every 10 timesteps. A figure in that repository shows the corresponding rewards for the four primary states of the game.
  • proto_harvest by Simon Mendelsohn
    • The player goes around the board, collecting apples. New apples spawn with a probability that depends on how many nearby apples there are (more nearby apples means a higher chance of new apples); a toy version of this respawn rule is sketched below.
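For intuition, here is a toy version of that respawn rule. The neighbourhood radius and the probability table are made-up numbers for illustration, not the constants used in proto_harvest or Harvest.

```python
import numpy as np

# Illustrative spawn probabilities: more nearby apples -> higher respawn chance.
# The exact mapping used in proto_harvest / Harvest may differ.
SPAWN_PROB = {0: 0.0, 1: 0.005, 2: 0.02}  # 3 or more neighbours -> 0.05


def respawn_step(apple_map: np.ndarray, radius: int = 2, rng=None) -> np.ndarray:
    """One respawn step on a binary HxW apple grid."""
    rng = rng or np.random.default_rng()
    h, w = apple_map.shape
    new_map = apple_map.copy()
    for i in range(h):
        for j in range(w):
            if apple_map[i, j]:
                continue  # only empty cells can grow a new apple
            window = apple_map[max(0, i - radius):i + radius + 1,
                               max(0, j - radius):j + radius + 1]
            n_near = int(window.sum())
            p = SPAWN_PROB.get(n_near, 0.05)
            if rng.random() < p:
                new_map[i, j] = 1
    return new_map
```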

eugenevinitsky's SSD repo also includes Watershed and Cleanup.

Adaptive Mechanism Design (AMD) algorithm pipeline introduction


theoretical derivation

From the paper, the pseudo policy gradient of each agent, as assumed by the central planner, is
$$\nabla_{i, \text{pseudo}} J_i (\theta_i) = \sum_{\tau} \nabla_i \log \pi_{i}(\tau) \Psi_i(\tau),$$
where $\Psi_i(\tau)$ can be a GAE estimator or a TD error; the value function may also come from a critic. The actual parameter update is
$$\nabla_{i, \text{actual}} J_i (\theta_i) = \sum_{\tau} \nabla_i \log \pi_{i}(\tau) \left[ \Psi_i(\tau) + r_{\text{planner}}(\tau, i)\right].$$
The reward given by the planner is a function of the current states,
$$r_{\text{planner}}(\tau, i) \sim \pi_{\text{planner}}(s_1(\tau), \ldots, s_n(\tau); \theta_{p}) $$
The planner updates its parameters by a similar policy gradient:
$$\Delta \theta_p = \eta_p \sum_i \eta_i\, \nabla_{i, \text{pseudo}} J_i (\theta_i) \cdot \left(\sum_\tau \nabla_i \log \pi_i(\tau)\, \nabla_{p} \log\pi_{p}(\tau)\, r_{\text{planner}}(\tau, i) \right)$$
$$= \eta_p \sum_i \sum_\tau \eta_i \left( \nabla_{i, \text{pseudo}} J_i (\theta_i)\cdot \nabla_i \log \pi_i(\tau) \right) r_{\text{planner}}(\tau, i)\, \nabla_p \log\pi_{p}(\tau)$$
$$= \eta_p \sum_i \sum_\tau \eta_i\, a_i(\tau)\, r_{\text{planner}}(\tau, i)\, \nabla_p \log\pi_{p}(\tau).$$
Here I call $a_i(\tau) := \nabla_{i, \text{pseudo}} J_i (\theta_i) \cdot \nabla_i \log \pi_i(\tau)$ the awareness of each individual agent; it is computed after sampling, before the batch is passed to the learning algorithm. One may also account for the cost of the central planner's rewards with an additional gradient term
$$-\alpha\, \nabla_p \sum_i \sum_{\tau} \left| r_{\text{planner}}(\tau, i)\right|.$$
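A minimal PyTorch sketch of these two quantities, the per-trajectory awareness $a_i(\tau)$ and a surrogate loss whose gradient reproduces the planner update (including the reward-cost term), is given below. The function names, the dict-based batching, the detaching of $a_i$ and $r_{\text{planner}}$ in the policy-gradient term, and the assumption that $\eta_p$ is absorbed into the planner optimizer's learning rate are illustrative choices, not the repository's actual interface.

```python
import torch


def awareness(log_probs_i: torch.Tensor, psi_i: torch.Tensor,
              policy_i: torch.nn.Module) -> torch.Tensor:
    """a_i(tau) = <grad_i J_i_pseudo, grad_i log pi_i(tau)> for every trajectory tau.

    log_probs_i: (T,) log pi_i(tau), differentiable w.r.t. the agent's parameters
    psi_i:       (T,) advantage estimates Psi_i(tau) (e.g. GAE), treated as constants
    """
    params = [p for p in policy_i.parameters() if p.requires_grad]
    # pseudo policy gradient: sum_tau grad_i log pi_i(tau) * Psi_i(tau)
    pseudo_grad = torch.autograd.grad((log_probs_i * psi_i.detach()).sum(),
                                      params, retain_graph=True)
    a = []
    for t in range(log_probs_i.shape[0]):
        grad_t = torch.autograd.grad(log_probs_i[t], params, retain_graph=True)
        a.append(sum((pg * g).sum() for pg, g in zip(pseudo_grad, grad_t)))
    return torch.stack(a)  # (T,), one scalar awareness per trajectory


def planner_loss(a, r_planner, log_probs_p, eta, alpha=0.01):
    """Surrogate loss whose gradient reproduces Delta theta_p plus the cost term.

    a, r_planner: dicts agent_id -> (T,) tensors of awareness and planner rewards
    log_probs_p:  (T,) log pi_p(tau), differentiable w.r.t. the planner's parameters
    eta:          dict agent_id -> learning rate eta_i
    alpha:        illustrative cost coefficient
    """
    loss = torch.zeros(())
    for i in a:
        # policy-gradient term: - eta_i * a_i(tau) * r_planner(tau, i) * log pi_p(tau)
        loss = loss - (eta[i] * a[i].detach() * r_planner[i].detach() * log_probs_p).sum()
        # cost of handing out rewards: + alpha * |r_planner(tau, i)|
        # (assumes r_planner still carries a gradient w.r.t. theta_p, e.g. a deterministic head)
        loss = loss + alpha * r_planner[i].abs().sum()
    return loss
```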

core implementation

  • on algorithm init
    • Given a pettingzoo.utils.env.ParallelEnv, convert it to ray.rllib.env.multi_agent_env.MultiAgentEnv. [source]
    • Generate a new environment with the central planner as one of the agents (a minimal wrapper sketch is given at the end of this section). [source1] [source2]
    • Initialize some constants (a gym.spaces.Space for unflattening rewards), and register the central planner with the algorithm's workers. This is achieved by an RLlib callback. [source]
  • on sampling
    • after sampling, each individual agent post-processes its trajectories and computes GAE targets using its critic (see the GAE sketch right after this list). [source]
  • on training step
    • each individual agent computes the awareness $a_i(\tau)$ of the current batch. [source]
    • then the individual agents and the central planner communicate, exchanging the planner's rewards and the agents' awareness. [source]
    • in the training step
      • each agent updates its policy loss and its critic's value loss. [source]
      • the central planner computes its loss according to the AMD update derived in the previous section. [source]
The whole training and configuration script is shown here.
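To make the init steps above concrete, here is a minimal sketch of wrapping a pettingzoo.utils.env.ParallelEnv into an RLlib MultiAgentEnv with the central planner added as an extra agent. It assumes the newer PettingZoo/Gymnasium API (reset returning (obs, infos), step returning five dicts) and omits the observation/action-space bookkeeping that the repository's actual wrappers ([source1], [source2] above) handle; the class and agent names are illustrative.

```python
from ray.rllib.env.multi_agent_env import MultiAgentEnv

PLANNER = "central_planner"


class EnvWithCentralPlanner(MultiAgentEnv):
    """Adds a central-planner agent that observes every agent's observation.

    The planner's action is not applied to the underlying environment here; the
    training code interprets it as the extra rewards r_planner(tau, i).
    """

    def __init__(self, parallel_env):
        super().__init__()
        self.env = parallel_env                        # a pettingzoo ParallelEnv
        self._agent_ids = set(self.env.possible_agents) | {PLANNER}

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed, options=options)
        obs[PLANNER] = dict(obs)                       # planner sees the joint observation
        return obs, infos

    def step(self, action_dict):
        action_dict = dict(action_dict)
        action_dict.pop(PLANNER, None)                 # planner actions are handled by the algorithm
        obs, rewards, terminateds, truncateds, infos = self.env.step(action_dict)
        obs[PLANNER] = dict(obs)
        rewards[PLANNER] = 0.0                         # the planner has no environment reward
        terminateds["__all__"] = all(terminateds.values())
        truncateds["__all__"] = all(truncateds.values())
        return obs, rewards, terminateds, truncateds, infos
```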

Existing Repositories

Multi-agent Reinforcement Learning in Sequential Social Dilemmas (SSD)

Learning with Opponent-Learning Awareness (LOLA)

Adaptive Mechanism Design: Learning to Promote Cooperation (AMD)

COLA: Consistent Learning with Opponent-Learning Awareness

Proximal Learning With Opponent-Learning Awareness (POLA)
