quantaji / amd-ssd

Adaptive Mechanism Design (AMD) in Sequential Social Dilemmas (SSD). Team project for Foundations of Reinforcement Learning, Spring 2023.

Python 9.65% Jupyter Notebook 89.87% Shell 0.48%
mechanism-design multi-agent-reinforcement-learning reinforcement-learning reinforcement-learning-algorithms reinforcement-learning-environments

amd-ssd's People

Contributors

leon-lxa, minxuanqin, quantaji

amd-ssd's Issues

Current progress and problems so far

Current progress

  • Wolfpack env finished. (test code here)
  • Ran a few Adaptive Mechanism Design (AMD) experiments:
    • currently no evidence of cooperation under AMD, but some cooperation emerges when AMD is not applied.
      (screenshot of training results, 2023-05-02)

Existing Games as Sequential Social Dilemmas

Notes on Existing Games

The original SSD paper has two social dilemmas:

  • Gathering
    • existing code
  • Wolfpack
    • existing code

In vermashresth's SSD, there are three other social dilemmas:

  • Watershed
    • Watershed management is a resource-allocation problem among several self-interested agents. The agents can withdraw water from a finite but common supply for individual purposes. The problem involves several constraints and multiple objectives, and the optimisation is over continuous variables. Watershed management is modelled here as a multi-agent system.
  • Cleanup
    • A public goods dilemma in which agents get a reward for consuming apples, but must use a cleaning beam to clean a river in order for apples to grow. While an agent is cleaning the river, other agents can exploit it by consuming the apples that appear.
  • Harvest (This might be similar to Gathering)
    • A tragedy-of-the-commons dilemma in which apples regrow at a rate that depends on the amount of nearby apples. If individual agents employ an exploitative strategy by greedily consuming too many apples, the collective reward of all agents is reduced.
    • however, it seems unlikely that this implementation is actually multi-agent.

In social-dilemma/multiagent, there are three SSDs:

  • PrisonEnvironment
    • The PrisonEnvironment instantiates a gridworld variant of the classic Prisoner's Dilemma. At each step of the game, both agents independently choose to move left, move right, or stay still. The left side of the board represents full defection and the right side represents full cooperation; intermediate positions are a linear combination of the two extremes. Rewards are distributed every 10 timesteps. A figure in that repository shows the corresponding rewards for the four primary states of the game.
  • proto_harvest by Simon Mendelsohn
    • The player goes around the board, collecting apples. New apples spawn with a probability that depends on how many nearby apples there are (more nearby apples means a higher chance of new apples); a toy version of this respawn rule is sketched below.
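For intuition, here is a toy version of that respawn rule. The neighbourhood radius and the probability table are made-up numbers for illustration, not the constants used in proto_harvest or Harvest.

```python
import numpy as np

# Illustrative spawn probabilities: more nearby apples -> higher respawn chance.
# The exact mapping used in proto_harvest / Harvest may differ.
SPAWN_PROB = {0: 0.0, 1: 0.005, 2: 0.02}  # 3 or more neighbours -> 0.05


def respawn_step(apple_map: np.ndarray, radius: int = 2, rng=None) -> np.ndarray:
    """One respawn step on a binary HxW apple grid."""
    rng = rng or np.random.default_rng()
    h, w = apple_map.shape
    new_map = apple_map.copy()
    for i in range(h):
        for j in range(w):
            if apple_map[i, j]:
                continue  # only empty cells can grow a new apple
            window = apple_map[max(0, i - radius):i + radius + 1,
                               max(0, j - radius):j + radius + 1]
            n_near = int(window.sum())
            p = SPAWN_PROB.get(n_near, 0.05)
            if rng.random() < p:
                new_map[i, j] = 1
    return new_map
```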

eugenevinitsky's SSD repo also includes Watershed and Cleanup.

Adaptive Mechanism Design (AMD) algorithm pipeline introduction


theoretical derivation

From the paper, the pseudo policy gradient of each agent, as assumed by the central planner, is
$$\nabla_{i, \text{pseudo}} J_i (\theta_i) = \sum_{\tau} \nabla_i \log \pi_{i}(\tau) \Psi_i(\tau),$$
where $\Psi_i(\tau)$ can be a GAE estimator or a TD error; the value function may also come from a critic. The actual parameter update is
$$\nabla_{i, \text{actual}} J_i (\theta_i) = \sum_{\tau} \nabla_i \log \pi_{i}(\tau) \left[ \Psi_i(\tau) + r_{\text{planner}}(\tau, i)\right].$$
The reward given by the planner is a function of the current states,
$$r_{\text{planner}}(\tau, i) \sim \pi_{\text{planner}}(s_1(\tau), \ldots, s_n(\tau); \theta_{p}) $$
The planner updates its parameters by a similar policy gradient:
$$\Delta \theta_p = \eta_p \sum_i \eta_i\, \nabla_{i, \text{pseudo}} J_i (\theta_i) \cdot \left(\sum_\tau \nabla_i \log \pi_i(\tau)\, \nabla_{p} \log\pi_{p}(\tau)\, r_{\text{planner}}(\tau, i) \right)$$
$$= \eta_p \sum_i \sum_\tau \eta_i \left( \nabla_{i, \text{pseudo}} J_i (\theta_i)\cdot \nabla_i \log \pi_i(\tau) \right) r_{\text{planner}}(\tau, i)\, \nabla_p \log\pi_{p}(\tau)$$
$$= \eta_p \sum_i \sum_\tau \eta_i\, a_i(\tau)\, r_{\text{planner}}(\tau, i)\, \nabla_p \log\pi_{p}(\tau).$$
Here I call $a_i(\tau) := \nabla_{i, \text{pseudo}} J_i (\theta_i) \cdot \nabla_i \log \pi_i(\tau)$ the awareness of each individual agent; it is computed after sampling, before the batch is passed to the learning algorithm. One may also account for the cost of the central planner's rewards with an additional gradient term
$$-\alpha\, \nabla_p \sum_i \sum_{\tau} \left| r_{\text{planner}}(\tau, i)\right|.$$
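A minimal PyTorch sketch of these two quantities, the per-trajectory awareness $a_i(\tau)$ and a surrogate loss whose gradient reproduces the planner update (including the reward-cost term), is given below. The function names, the dict-based batching, the detaching of $a_i$ and $r_{\text{planner}}$ in the policy-gradient term, and the assumption that $\eta_p$ is absorbed into the planner optimizer's learning rate are illustrative choices, not the repository's actual interface.

```python
import torch


def awareness(log_probs_i: torch.Tensor, psi_i: torch.Tensor,
              policy_i: torch.nn.Module) -> torch.Tensor:
    """a_i(tau) = <grad_i J_i_pseudo, grad_i log pi_i(tau)> for every trajectory tau.

    log_probs_i: (T,) log pi_i(tau), differentiable w.r.t. the agent's parameters
    psi_i:       (T,) advantage estimates Psi_i(tau) (e.g. GAE), treated as constants
    """
    params = [p for p in policy_i.parameters() if p.requires_grad]
    # pseudo policy gradient: sum_tau grad_i log pi_i(tau) * Psi_i(tau)
    pseudo_grad = torch.autograd.grad((log_probs_i * psi_i.detach()).sum(),
                                      params, retain_graph=True)
    a = []
    for t in range(log_probs_i.shape[0]):
        grad_t = torch.autograd.grad(log_probs_i[t], params, retain_graph=True)
        a.append(sum((pg * g).sum() for pg, g in zip(pseudo_grad, grad_t)))
    return torch.stack(a)  # (T,), one scalar awareness per trajectory


def planner_loss(a, r_planner, log_probs_p, eta, alpha=0.01):
    """Surrogate loss whose gradient reproduces Delta theta_p plus the cost term.

    a, r_planner: dicts agent_id -> (T,) tensors of awareness and planner rewards
    log_probs_p:  (T,) log pi_p(tau), differentiable w.r.t. the planner's parameters
    eta:          dict agent_id -> learning rate eta_i
    alpha:        illustrative cost coefficient
    """
    loss = torch.zeros(())
    for i in a:
        # policy-gradient term: - eta_i * a_i(tau) * r_planner(tau, i) * log pi_p(tau)
        loss = loss - (eta[i] * a[i].detach() * r_planner[i].detach() * log_probs_p).sum()
        # cost of handing out rewards: + alpha * |r_planner(tau, i)|
        # (assumes r_planner still carries a gradient w.r.t. theta_p, e.g. a deterministic head)
        loss = loss + alpha * r_planner[i].abs().sum()
    return loss
```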

core implementation

  • on algorithm init
    • Given a pettingzoo.utils.env.ParallelEnv, convert it to ray.rllib.env.multi_agent_env.MultiAgentEnv. [source]
    • Generate a new environment with the central planner as one of the agents (a minimal wrapper sketch is given at the end of this section). [source1] [source2]
    • Initialize some constants (a gym.spaces.Space for unflattening rewards), and register the central planner with the algorithm's workers. This is achieved by an RLlib callback. [source]
  • on sampling
    • after sampling, each individual agent post-processes its trajectories and computes GAE targets using its critic (see the GAE sketch right after this list). [source]
  • on training step
    • each individual agent computes the awareness $a_i(\tau)$ of the current batch. [source]
    • then the individual agents and the central planner communicate, exchanging the planner's rewards and the agents' awareness. [source]
    • in the training step
      • each agent updates its policy loss and its critic's value loss. [source]
      • the central planner computes its loss according to the AMD update derived in the previous section. [source]
The whole training and configuration script is shown here.
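To make the init steps above concrete, here is a minimal sketch of wrapping a pettingzoo.utils.env.ParallelEnv into an RLlib MultiAgentEnv with the central planner added as an extra agent. It assumes the newer PettingZoo/Gymnasium API (reset returning (obs, infos), step returning five dicts) and omits the observation/action-space bookkeeping that the repository's actual wrappers ([source1], [source2] above) handle; the class and agent names are illustrative.

```python
from ray.rllib.env.multi_agent_env import MultiAgentEnv

PLANNER = "central_planner"


class EnvWithCentralPlanner(MultiAgentEnv):
    """Adds a central-planner agent that observes every agent's observation.

    The planner's action is not applied to the underlying environment here; the
    training code interprets it as the extra rewards r_planner(tau, i).
    """

    def __init__(self, parallel_env):
        super().__init__()
        self.env = parallel_env                        # a pettingzoo ParallelEnv
        self._agent_ids = set(self.env.possible_agents) | {PLANNER}

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed, options=options)
        obs[PLANNER] = dict(obs)                       # planner sees the joint observation
        return obs, infos

    def step(self, action_dict):
        action_dict = dict(action_dict)
        action_dict.pop(PLANNER, None)                 # planner actions are handled by the algorithm
        obs, rewards, terminateds, truncateds, infos = self.env.step(action_dict)
        obs[PLANNER] = dict(obs)
        rewards[PLANNER] = 0.0                         # the planner has no environment reward
        terminateds["__all__"] = all(terminateds.values())
        truncateds["__all__"] = all(truncateds.values())
        return obs, rewards, terminateds, truncateds, infos
```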

Existing Repositories

Multi-agent Reinforcement Learning in Sequential Social Dilemmas (SSD)

Learning with Opponent-Learning Awareness (LOLA)

Adaptive Mechanism Design: Learning to Promote Cooperation (AMD)

COLA: Consistent Learning with Opponent-Learning Awareness

Proximal Learning With Opponent-Learning Awareness (POLA)
