a0-jax's Introduction

a0-jax

AlphaZero in JAX using DeepMind's mctx library.

pip install -r requirements.txt

Train agent

Connect-Two game

python train_agent.py --weight-decay=1e-2 --num-iterations=3

Connect-Four game

TF_CPP_MIN_LOG_LEVEL=2 \
python train_agent.py \
    --game_class="games.connect_four_game.Connect4Game" \
    --agent_class="policies.resnet_policy.ResnetPolicyValueNet" \
    --batch-size=4096 \
    --num_simulations_per_move=32 \
    --num_self_plays_per_iteration=102400 \
    --learning-rate=1e-2 \
    --num_iterations=500 \
    --lr-decay-steps=200000

A live Connect-4 agent is running at https://huggingface.co/spaces/ntt123/Connect-4-Game. We use tensorflow.js to run the policy in the browser.

Caro (Gomoku) game

TF_CPP_MIN_LOG_LEVEL=2 \
python3 train_agent.py \
    --game-class="games.caro_game.CaroGame" \
    --agent-class="policies.resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

A live Caro agent is running at https://caro.ntt123.repl.co.

Go game

TF_CPP_MIN_LOG_LEVEL=2 \
python3 train_agent.py \
    --game-class="games.go_game.GoBoard9x9" \
    --agent-class="policies.resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./go_agent_9x9_128.ckpt" \
    --num-iterations=200 \
    --lr-decay-steps=1000000

A live Go agent is running at https://go.ntt123.repl.co. You can run the agent on your local machine with the go_web_app.py script.

We also have an interactive Colab notebook that runs the agent on GPU to reduce inference time.

Plot the search tree

python plot_search_tree.py 
# ./search_tree.png

Play

python play.py

TPU sponsor

Agents in the above demos are trained on Google TPUs sponsored by Google under the TPU Research Cloud program.

a0-jax's People

Contributors

ntt123

a0-jax's Issues

Consider using qtransform_completed_by_mix_value.

Thanks for the nice project.
Have you tried using the default qtransform_completed_by_mix_value for the gumbel_muzero_policy?

The qtransform_by_min_max gives zero values to unvisited actions. That does not have a good theoretical justification.
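To make the difference concrete, here is a minimal pure-Python sketch of the "completed by mix value" idea as described in the Gumbel MuZero paper: unvisited actions receive a mixed value estimate instead of zero. The function names and the list-based inputs are assumptions made for illustration; this is not the mctx implementation, and it covers only the completion step, omitting any rescaling of the completed Q-values.

```python
def mixed_value(raw_value, prior_probs, q_values, visit_counts):
    """Mix the network value with the prior-weighted Q-values of visited actions."""
    total_visits = sum(visit_counts)
    if total_visits == 0:
        # Nothing has been visited yet, so only the network value is available.
        return raw_value
    # Total prior probability mass of the visited actions.
    visited_prob = sum(p for p, n in zip(prior_probs, visit_counts) if n > 0)
    # Prior-weighted average of the Q-values over visited actions only.
    weighted_q = sum(
        p * q for p, q, n in zip(prior_probs, q_values, visit_counts) if n > 0
    ) / visited_prob
    return (raw_value + total_visits * weighted_q) / (1 + total_visits)


def complete_q_values(raw_value, prior_probs, q_values, visit_counts):
    """Keep the Q-value of visited actions; fill unvisited ones with the mixed value."""
    v_mix = mixed_value(raw_value, prior_probs, q_values, visit_counts)
    return [q if n > 0 else v_mix for q, n in zip(q_values, visit_counts)]


if __name__ == "__main__":
    # One visited action (Q = 1.0) and two unvisited actions.
    completed = complete_q_values(
        raw_value=0.4,
        prior_probs=[0.5, 0.3, 0.2],
        q_values=[1.0, 0.0, 0.0],
        visit_counts=[2, 0, 0],
    )
    print(completed)  # unvisited actions get roughly 0.8 instead of 0
```

With zero-filling, the two unvisited actions in this example would look strictly worse than the visited one by a full unit; with the mixed value they sit close to the current estimate of the state value.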

Training on external environments

I've run into a concretization issue while implementing a new environment that calls an external application for the game logic. I need to call it in step to get a new state, but at that point action is a batched tracer, so I can't extract its value with call because batched inputs don't support it.

class CheckersGame(Environment):
    ...

    def _step(self, action: chex.Array) -> Tuple["CheckersGame", chex.Array]:
        action = self._prepare_action(action) # get a concrete value of action
        new_state, reward = call_external_env(action)
        return self, jnp.array(reward, dtype=jnp.int32)

    @pax.pure
    def step(self, action: chex.Array) -> Tuple["CheckersGame", chex.Array]:
        # batched action comes in, but concrete value is required
        env, reward = jax.vmap(lambda a: self._step(a))(action.reshape(-1, 1))
        return self, reward

    ...

I can tap into action with id_print, id_tap here, but can't block _step that way.

What's the correct way to do that?

Killed unexpectedly in Colab with TPU

On a budget, I'm running train_agent.py for Caro on Colab with a TPU.
However, it always gets killed during iteration #1 at around 64%, without much of a stack trace.

Any experiences or theories on why this may happen?

!TF_CPP_MIN_LOG_LEVEL=0
!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

2022-11-25 08:59:37.077139: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Cores: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
Loading weights at ./caro_agent_9x9_128.ckpt
Iteration 1
self play [######################--------------] 63% 00:09:41
/bin/bash: line 1: 2377 Killed python3 train_agent.py --game-class="caro_game.CaroGame" --agent-class="resnet_policy.ResnetPolicyValueNet128" --selfplay-batch-size=1024 --training-batch-size=1024 --num-simulations-per-move=32 --num-self-plays-per-iteration=102400 --learning-rate=1e-2 --random-seed=42 --ckpt-filename="./caro_agent_9x9_128.ckpt" --num-iterations=100 --lr-decay-steps=500000

real 17m19.797s
user 10m5.645s
sys 5m3.467s

Tic Tac Toe - Missing winning condition

Hi,

We have the winning conditions identified:

[image of the winning conditions]

I think we might have missed winning by 3 in a row in the middle (vertically and horizontally), i.e., with spaces 1, 4, 7 (vertical) and 3, 4, 5 (horizontal).

I have no idea how to fix this in your code :( sorry!
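One way to avoid missing a line is to generate all of them instead of listing them by hand. The sketch below is an illustration only, not the repository's code: it assumes a 0-indexed, row-major 3x3 board stored as a flat list with 1 for the first player, -1 for the second and 0 for an empty cell. The names winning_lines and winner are hypothetical.

```python
def winning_lines(size=3):
    """Return every row, column and diagonal as a list of flat cell indices."""
    rows = [[r * size + c for c in range(size)] for r in range(size)]
    cols = [[r * size + c for r in range(size)] for c in range(size)]
    diagonal = [i * size + i for i in range(size)]
    anti_diagonal = [i * size + (size - 1 - i) for i in range(size)]
    return rows + cols + [diagonal, anti_diagonal]


def winner(board):
    """Return 1 or -1 if that player owns a full line, otherwise 0."""
    for line in winning_lines():
        total = sum(board[i] for i in line)
        if total == 3:
            return 1
        if total == -3:
            return -1
    return 0


if __name__ == "__main__":
    for line in winning_lines():
        print(line)
    # The middle column (cells 1, 4, 7) held by the first player.
    print(winner([0, 1, 0, 0, 1, 0, 0, 1, 0]))
```

For a 3x3 board this yields eight lines, including the middle column and the middle row mentioned above.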

2 player games with non-alternating turns.

I've implemented a game which doesn't have a strictly alternating turn order (some actions change player, others don't). How could this be used in your framework? I think it's the discount, but wanted to check. Should the discount returned be 1 for any action that doesn't change player and -1 otherwise?
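The arithmetic behind the proposal can be sketched as follows. This assumes a search backup of the form value = reward + discount * child_value applied from the leaf towards the root, with each node's value expressed from the point of view of the player to move at that node. The function name and the list-based path are hypothetical, and the sketch does not confirm whether the framework supports such games; it only shows what a per-step discount of 1 or -1 would do to the sign of the backed-up value.

```python
def backup_values(leaf_value, rewards, discounts):
    """Propagate a leaf value back to the root along one search path.

    rewards[i] and discounts[i] belong to the i-th edge from the root.
    Returns the backed-up value at each node from the root down to the
    parent of the leaf.
    """
    values = []
    value = leaf_value
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        # A discount of -1 flips the perspective (the player changed);
        # a discount of 1 keeps it (the same player moves again).
        value = reward + discount * value
        values.append(value)
    values.reverse()
    return values


if __name__ == "__main__":
    # Three moves: the player changes, stays the same, then changes again.
    print(backup_values(1.0, rewards=[0.0, 0.0, 0.0], discounts=[-1.0, 1.0, -1.0]))
```

In this example the leaf is worth 1.0 to the player to move there. The two nodes above it belong to the opponent and get -1.0, while the root belongs to the same player as the leaf and gets 1.0.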

Support MuZero

Great job! I learned a lot from your repo. Where can I find an implementation of MuZero using mctx? Thanks a lot.
