ntt123 / a0-jax

AlphaZero in JAX

Home Page: https://go.ntt123.repl.co

License: MIT License

Python 88.22% HTML 11.11% Shell 0.67%

a0-jax

AlphaZero in JAX, using DeepMind's mctx library.

pip install -r requirements.txt

Train agent

Connect-Two game

python train_agent.py --weight-decay=1e-2 --num-iterations=3

Connect-Four game

TF_CPP_MIN_LOG_LEVEL=2 \
python train_agent.py \
    --game_class="games.connect_four_game.Connect4Game" \
    --agent_class="policies.resnet_policy.ResnetPolicyValueNet" \
    --batch-size=4096 \
    --num_simulations_per_move=32 \
    --num_self_plays_per_iteration=102400 \
    --learning-rate=1e-2 \
    --num_iterations=500 \
    --lr-decay-steps=200000

A live Connect-4 agent is running at https://huggingface.co/spaces/ntt123/Connect-4-Game. We use tensorflow.js to run the policy in the browser.

Caro (Gomoku) game

TF_CPP_MIN_LOG_LEVEL=2 \
python3 train_agent.py \
    --game-class="games.caro_game.CaroGame" \
    --agent-class="policies.resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

A live Caro agent is running at https://caro.ntt123.repl.co.

Go game

TF_CPP_MIN_LOG_LEVEL=2 \
python3 train_agent.py \
    --game-class="games.go_game.GoBoard9x9" \
    --agent-class="policies.resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./go_agent_9x9_128.ckpt" \
    --num-iterations=200 \
    --lr-decay-steps=1000000

A live Go agent is running at https://go.ntt123.repl.co. You can run the agent on your local machine with the go_web_app.py script.

We also have an interactive Colab notebook that runs the agent on a GPU to reduce inference time.

Plot the search tree

python plot_search_tree.py
# output: ./search_tree.png

Play

python play.py

TPU sponsor

Agents in the above demos are trained on Google TPUs sponsored by Google under the TPU Research Cloud program.

a0-jax's Issues

Consider using qtransform_completed_by_mix_value.

Thanks for the nice project.
Have you tried using the default qtransform_completed_by_mix_value for the gumbel_muzero_policy?

The qtransform_by_min_max gives zero values to unvisited actions. That does not have a good theoretical justification.
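
To make the difference concrete, here is a minimal pure-Python sketch of the "completed Q-values" idea from the Gumbel MuZero paper: unvisited actions receive a mixed value estimate instead of zero. The function names and the example numbers are hypothetical, and the sketch omits the rescaling step that mctx applies afterwards, so treat it as an illustration rather than the library's code.

```python
def mixed_value(raw_value, q_values, visit_counts, prior_probs):
    """Blend the network value with the prior-weighted Q-values of visited actions."""
    total_visits = sum(visit_counts)
    visited = [i for i, n in enumerate(visit_counts) if n > 0]
    if not visited:
        # Nothing has been searched yet: fall back to the network value.
        return raw_value
    visited_prob = sum(prior_probs[i] for i in visited)
    weighted_q = sum(prior_probs[i] * q_values[i] for i in visited) / visited_prob
    return (raw_value + total_visits * weighted_q) / (1 + total_visits)


def completed_q_values(raw_value, q_values, visit_counts, prior_probs):
    """Keep Q-values of visited actions; fill unvisited ones with the mixed value."""
    v_mix = mixed_value(raw_value, q_values, visit_counts, prior_probs)
    return [q if n > 0 else v_mix for q, n in zip(q_values, visit_counts)]


# Hypothetical node with three actions; the middle action was never visited.
completed = completed_q_values(
    raw_value=0.2,
    q_values=[0.6, 0.0, -0.4],
    visit_counts=[3, 0, 1],
    prior_probs=[0.5, 0.3, 0.2],
)
print(completed)
```

With these numbers the unvisited action gets a value of about 0.29, between the two visited actions, rather than 0.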

Can I contact you directly?

Hello Mr NTT,
Can I contact you directly? It's about AlphaZero.
My email is [email protected]

Training on external environments

I've run into a concretization issue while implementing a new environment that calls an external application for its game logic. I need to call out to it in step to get the new state, but at that point action is a batched tracer, so I can't extract its value with call because batching isn't implemented for it.

class CheckersGame(Environment):
    ...

    def _step(self, action: chex.Array) -> Tuple["CheckersGame", chex.Array]:
        action = self._prepare_action(action) # get a concrete value of action
        new_state, reward = call_external_env(action)
        return self, jnp.array(reward, dtype=jnp.int32)

    @pax.pure
    def step(self, action: chex.Array) -> Tuple["CheckersGame", chex.Array]:
        # batched action comes in, but concrete value is required
        env, reward = jax.vmap(lambda a: self._step(a))(action.reshape(-1, 1))
        return self, reward

    ...

I can tap into action with id_print, id_tap here, but can't block _step that way.

What's the correct way to do that?
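
One possible direction, sketched under assumptions rather than taken from this repository, is `jax.pure_callback`: it hands concrete NumPy values to a host function from inside traced code. The sketch passes the whole batch to the host in one call instead of vmapping a per-item call. `host_step_batch` and its reward rule are hypothetical stand-ins for the external game engine.

```python
import jax
import jax.numpy as jnp
import numpy as np


def host_step_batch(actions):
    """Hypothetical stand-in for the external engine; runs on the host."""
    actions = np.asarray(actions)  # concrete values, one action per batch entry
    # Placeholder rule: reward 1 for even actions, 0 for odd ones.
    return (actions % 2 == 0).astype(np.int32)


@jax.jit
def step(actions):
    # Describe the callback's output so JAX can trace without running it.
    result_shape = jax.ShapeDtypeStruct(actions.shape, jnp.int32)
    # Send the full batch to the host in a single call.
    return jax.pure_callback(host_step_batch, result_shape, actions)


rewards = step(jnp.array([0, 1, 2, 3], dtype=jnp.int32))
print(rewards)
```

`jax.pure_callback` assumes the host function has no side effects, so JAX may skip or repeat calls. An engine that keeps its own mutable state would need the state passed in and returned explicitly, or a side-effecting callback such as `jax.experimental.io_callback`.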

Killed unexpectedly in Colab with TPU

On a budget, I'm running train_agent.py for Caro on Colab with a TPU.
However, it always gets killed during iteration 1, at around 64%, with almost no stack trace.

Any experiences or theories on why this may happen?

!TF_CPP_MIN_LOG_LEVEL=0
!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

2022-11-25 08:59:37.077139: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Cores: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
Loading weights at ./caro_agent_9x9_128.ckpt
Iteration 1
self play [######################--------------] 63% 00:09:41 /bin/bash: line 1: 2377 Killed python3 train_agent.py --game-class="caro_game.CaroGame" --agent-class="resnet_policy.ResnetPolicyValueNet128" --selfplay-batch-size=1024 --training-batch-size=1024 --num-simulations-per-move=32 --num-self-plays-per-iteration=102400 --learning-rate=1e-2 --random-seed=42 --ckpt-filename="./caro_agent_9x9_128.ckpt" --num-iterations=100 --lr-decay-steps=500000

real 17m19.797s
user 10m5.645s
sys 5m3.467s
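
A bare `Killed` with no Python traceback means the process received SIGKILL. On Linux this often comes from the kernel's out-of-memory killer, although the log above does not confirm that. Assuming host RAM is the suspect, a minimal standard-library sketch for logging peak memory between self-play batches could look like this (`peak_rss_mb` is a hypothetical helper, not part of the repository):

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident memory of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"peak host memory: {peak_rss_mb():.1f} MiB")
```

If the number climbs steadily towards the machine's RAM limit during self-play, lowering `--num-self-plays-per-iteration` would be a natural first experiment.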

Tic Tac Toe - Missing winning condition

Hi,

We have the winning conditions identified:

I think we might have missed winning with 3 in a row in the middle (vertically and horizontally), i.e., with spaces 1, 4, 7 (vertical) and 3, 5, 6 (horizontal).

I have no idea how to fix this in your code :( sorry!
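
For reference, a 3x3 board has eight winning lines: three rows, three columns, and two diagonals. The sketch below assumes cells are numbered 0 to 8 row by row (the issue does not state its numbering) and that stones are stored as 1, -1, and 0 for empty. Under that numbering the middle column is 1, 4, 7 and the middle row is 3, 4, 5. It is an independent illustration, not the repository's implementation.

```python
def winning_lines(size=3):
    """Enumerate every row, column, and diagonal as a list of cell indices."""
    rows = [[r * size + c for c in range(size)] for r in range(size)]
    cols = [[r * size + c for r in range(size)] for c in range(size)]
    diag = [i * size + i for i in range(size)]
    anti = [i * size + (size - 1 - i) for i in range(size)]
    return rows + cols + [diag, anti]


def winner(board):
    """Return 1 or -1 if that player owns a full line, else 0."""
    for line in winning_lines():
        total = sum(board[i] for i in line)
        if abs(total) == 3:
            return 1 if total > 0 else -1
    return 0


# Middle column (cells 1, 4, 7) filled by player 1.
print(winner([0, 1, 0,
              0, 1, 0,
              0, 1, 0]))
```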

question on 9x9 go agent training

the 9x9 go agent is pretty strong! how many iterations was it trained on? how long does it take to train (i saw it's on TPUs)?

2 player games with non-alternating turns.

I've implemented a game which doesn't have a strictly alternating turn order (some actions change player, others don't). How could this be used in your framework? I think it's the discount, but wanted to check. Should the discount returned be 1 for any action that doesn't change player and -1 otherwise?
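
One way to check the sign convention is to write out the backup rule by hand. The sketch below assumes that every value and reward is expressed from the perspective of the player to move at that node. `backup` and its arguments are hypothetical names, not the repository's API.

```python
def backup(leaf_value, steps):
    """Propagate a leaf value back to the root.

    steps: list of (reward, same_player_moves_next) pairs from root to leaf.
    """
    value = leaf_value
    for reward, same_player in reversed(steps):
        # Keep the sign when the same player moves again; flip it otherwise.
        discount = 1.0 if same_player else -1.0
        value = reward + discount * value
    return value


# Strictly alternating turns: two sign flips cancel out.
print(backup(1.0, [(0.0, False), (0.0, False)]))
# The second move keeps the turn, so only one flip is applied.
print(backup(1.0, [(0.0, False), (0.0, True)]))
```

Under these assumptions, a discount of 1 for moves that keep the turn and -1 for moves that pass it gives consistent signs during search. Whether the rest of the training pipeline, such as the assignment of self-play value targets, also assumes strict alternation would need to be checked in the repository.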

Support MuZero

Great job! I learned a lot from your repo. Where can I find an implementation of MuZero using mctx? Thanks a lot.