
alpha-zero-general's Introduction

Alpha Zero General (any game, any framework!)

A simplified, highly flexible, commented and (hopefully) easy-to-understand implementation of self-play reinforcement learning based on the AlphaGo Zero paper (Silver et al.). It is designed to be easy to adapt to any two-player, turn-based adversarial game and any deep learning framework of your choice. Sample implementations are provided for the game of Othello in PyTorch and Keras. An accompanying tutorial can be found here. There are also implementations for many other games, such as GoBang and TicTacToe.

To use a game of your choice, subclass the classes in Game.py and NeuralNet.py and implement their functions. Example implementations for Othello can be found in othello/OthelloGame.py and othello/{pytorch,keras}/NNet.py.
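For orientation, here is a minimal sketch of what a custom game class might look like (the method names follow the Game.py interface referenced throughout this repo; the exact signatures in your checkout may differ, so treat this as an assumption rather than the definitive interface):

    import numpy as np
    from Game import Game

    class MyGame(Game):
        """Skeleton of a Game subclass; fill in each method for your game."""
        def getInitBoard(self):
            return np.zeros((3, 3))        # starting position
        def getBoardSize(self):
            return (3, 3)
        def getActionSize(self):
            return 9 + 1                   # all moves plus a "pass" action
        def getNextState(self, board, player, action):
            ...                            # return (next_board, next_player)
        def getValidMoves(self, board, player):
            ...                            # binary vector of length getActionSize()
        def getGameEnded(self, board, player):
            ...                            # 0 if ongoing, 1 if `player` won, -1 if lost
        def getCanonicalForm(self, board, player):
            return player * board          # board from `player`'s point of view
        def getSymmetries(self, board, pi):
            return [(board, pi)]           # list of equivalent (board, pi) pairs
        def stringRepresentation(self, board):
            return board.tostring()        # hashable key used by MCTS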

Coach.py contains the core training loop and MCTS.py performs the Monte Carlo Tree Search. The parameters for self-play can be specified in main.py. Additional neural network parameters are in othello/{pytorch,keras}/NNet.py (CUDA flag, batch size, epochs, learning rate, etc.).

To start training a model for Othello:

python main.py

Choose your framework and game in main.py.
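The choice is made through the imports at the top of main.py; a sketch (following the Othello example above, with the import paths assumed from the repo layout):

    from Coach import Coach
    from othello.OthelloGame import OthelloGame

    # Pick exactly one NNetWrapper import for your framework:
    from othello.pytorch.NNet import NNetWrapper as nn   # PyTorch
    # from othello.keras.NNet import NNetWrapper as nn   # Keras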

Docker Installation

For easy environment setup, you can use nvidia-docker. Once nvidia-docker is set up, simply run:

./setup_env.sh

to set up a Jupyter Docker container (default: PyTorch). You can then open a new terminal and enter:

docker exec -ti pytorch_notebook python main.py

Experiments

We trained a PyTorch model for 6x6 Othello (~80 iterations, 100 episodes per iteration and 25 MCTS simulations per turn). This took about 3 days on an NVIDIA Tesla K80. The pretrained PyTorch model can be found in pretrained_models/othello/pytorch/. You can play a game against it using pit.py. Below is the performance of the model against a random and a greedy baseline as a function of the number of iterations.

A concise description of our algorithm can be found here.

Citation

If you found this work useful, feel free to cite it as

@misc{thakoor2016learning,
  title={Learning to play othello without human knowledge},
  author={Thakoor, Shantanu and Nair, Surag and Jhunjhunwala, Megha},
  year={2016},
  publisher={Stanford University, Final Project Report}
}

Contributing

While the current code is fairly functional, we could benefit from the following contributions:

  • Game logic files for more games that follow the specifications in Game.py, along with their neural networks
  • Neural networks in other frameworks
  • Pre-trained models for different game configurations
  • An asynchronous version of the code: parallel processes for self-play, neural net training and model comparison
  • Asynchronous MCTS as described in the paper

Some extensions have been implemented here.

Contributors and Credits

Note: Chainer and TensorFlow v1 versions have been removed but can be found prior to commit 2ad461c.



alpha-zero-general's Issues

Returning -v from MCTS.search()

return -v

I've read your note about returning the negated value -v from search(). Could you please share some considerations about handling v/-v for a game where a player's move can be a chain of atomic moves? In such a game, search() for an atomic move can't simply return -v, because the same player keeps moving; the next player only takes a turn after the full chain of atomic moves is completed. An example of such a game is Checkers (Draughts).
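One way to frame the question (a sketch of the idea, not the repo's code): negate v only when the turn actually passes to the opponent.

    # Sketch: the child value is flipped only when the player to move changed,
    # which covers chains of atomic moves where the same player moves again.
    def backup_value(child_v, current_player, next_player):
        """Return the value from current_player's perspective."""
        return -child_v if next_player != current_player else child_v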

GPU utilization

Hi, first I want to thank you for this awesome repo.

I have a question about GPU utilization: on my computer, when building the model using OthelloGame and pytorch/OthelloNNet, GPU utilization doesn't surpass the 2% mark, and I have a GTX 1070 graphics card.


Is there something going on that I don't know about, or is this NN model just not deep enough to use the full GPU?

I've tried increasing the batch size and changing the 'cuda' flag to True, but that doesn't seem to help.

Also, one more question: is the GPU also used with the TensorFlow and Keras models? With these NN models, I am only getting CPU load and no GPU usage.

Dirichlet noise

From p. 24 of https://deepmind.com/documents/119/agz_unformatted_nature.pdf:

Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s0, specifically P(s, a) = (1 − ε)p_a + ε·η_a, where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves.

Searching this repo brings no matches for "dirichlet". I think you want numpy.random.dirichlet.
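A minimal sketch of how that could look at the root of the search, using numpy.random.dirichlet (the function and parameter names here are illustrative, not the repo's):

    import numpy as np

    def add_dirichlet_noise(priors, valids, alpha=0.03, eps=0.25):
        """Mix Dirichlet noise into the root prior over valid actions only."""
        priors = np.array(priors, dtype=np.float64)
        valid_idx = np.nonzero(valids)[0]
        noise = np.random.dirichlet([alpha] * len(valid_idx))
        priors[valid_idx] = (1 - eps) * priors[valid_idx] + eps * noise
        return priors / priors.sum()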

Thanks for sharing this repo!

Go game implementation?

Since the AlphaZero paper uses Go as its example, it seems natural to provide a Go implementation as well. Or perhaps there are some pointers you can give for implementing it?

Reward calculation in Self-Play is probably Othello specific

return [(x[0],x[2],r*((-1)**(x[1]!=self.curPlayer))) for x in trainExamples]

This is probably Othello-specific; it works correctly if black makes the first move.

What should I do if I implement a game where white makes the first move?

Assume white won and we calculate the reward for the first move:
reward = r * ((-1) ** (x[1] != curPlayer))

r = 1, since white won
curPlayer = -1, since black is to move (white made the last move and won the game)
x[1] = 1, assuming this is the first move

reward = 1 * ((-1) ** 1) = -1, but it should be +1

Othello pytorch working well

After commenting out the progress bar (NNet.py, the '# plot progress' section around line 95), the PyTorch version is running at about 3-4 iterations per hour on a GTX 1070 and only uses around 1.4GB of memory, although GPU utilization is at 95% for the 10 epochs per iteration. I'm up to 65 iterations so far with no visible issues. Just wanted to share a positive post. Thanks.

Channel Size hardcoded to 512?

Love your project, and I'm working on adding code to accommodate playing Mini-Shogi.

I've determined that the number of channels will need to be around 250 (to keep an 8-step history of every kind of piece, etc.). However, I notice that the current default args are:

args = dotdict({
    'lr': 0.001,
    'dropout': 0.3,
    'epochs': 10,
    'batch_size': 64,
    'num_channels': 512,
})

This is used in OthelloNNet.py as follows:

h_conv1 = Relu(BatchNormalization(self.conv2d(x_image, args.num_channels, 'same'), axis=3, training=self.isTraining))     # batch_size  x board_x x board_y x num_channels

But it seems like this should be 1, seeing as the Othello board is represented entirely by a single matrix of -1s, 0s, and 1s. Can you explain how 512 channels was chosen and why? I'm sure it's something straightforward, but I haven't been able to figure it out. Any help would be very much appreciated!

supervised learning on policy

The neural network trains P(s, a) and Q(s, a) against the policy labels π obtained from MCTS and the game results z. However, the policy π from MCTS (the label) was itself guided by P(s, a), which comes from the neural network. So the question is: who is supervising whom?

Keras out of GPU memory

Othello Keras ran out of GPU memory after 34 iterations.
I have an 8GB GTX 1070, but I limited TensorFlow to per_process_gpu_memory_fraction = 0.4 (about 3.2GB).

Of course, I can run with more, but perhaps there should be some GPU memory size guidelines in the README, assuming this is not an error.

Caused by op 'batch_normalization_199/FusedBatchNorm', defined at:
File "main.py", line 29, in
c.learn()
File "/home/brian/hitme/bin/alpha-zero-general/Coach.py", line 90, in learn
pnet = self.nnet.class(self.game)
File "/home/brian/hitme/bin/alpha-zero-general/othello/keras/NNet.py", line 27, in init
self.nnet = onnet(game, args)
File "/home/brian/hitme/bin/alpha-zero-general/othello/keras/OthelloNNet.py", line 37, in init
h_conv1 = Activation('relu')(BatchNormalization(axis=3)(Conv2D(args.num_channels, 3, padding='same')(x_image))) # batch_size x board_x x board_y x num_channels
File "/home/brian/hitme/lib/python3.6/site-packages/keras/engine/topology.py", line 617, in call
output = self.call(inputs, **kwargs)
File "/home/brian/hitme/lib/python3.6/site-packages/keras/layers/normalization.py", line 181, in call
epsilon=self.epsilon)
File "/home/brian/hitme/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1824, in normalize_batch_in_training
epsilon=epsilon)
File "/home/brian/hitme/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1799, in _fused_normalize_batch_in_training
data_format=tf_data_format)
File "/home/brian/hitme/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 831, in fused_batch_norm
name=name)
File "/home/brian/hitme/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 2034, in _fused_batch_norm
is_training=is_training, name=name)
File "/home/brian/hitme/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/brian/hitme/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/brian/hitme/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,512,6,6]
[[Node: batch_normalization_199/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NHWC", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv2d_133/BiasAdd, batch_normalization_199/gamma/read, batch_normalization_199/beta/read, batch_normalization_199/Const_4, batch_normalization_199/Const_4)]]
[[Node: loss_33/add/_17985 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3237_loss_33/add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
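For reference, a sketch of how the memory fraction mentioned above can be capped with the TF1-era API and the Keras TensorFlow backend (treat the exact calls as an assumption about this particular setup):

    import tensorflow as tf
    from keras import backend as K

    # Limit the TF1 session used by Keras to ~40% of GPU memory.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    K.set_session(tf.Session(config=config))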

bugs related to usage of tensorflow BatchNormalization

Hi Surag,
I found a bug related to the usage of TensorFlow BatchNormalization in othello/tensorflow/OthelloNNet.py.

Check the API of tf.layers.batch_normalization (https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization).

See the parameter "training": we need to create a bool placeholder indicating whether we are in the training phase.

That is, after line 25 in tensorflow/OthelloNNet.py, update:
self.isTraining = tf.placeholder(tf.bool, name="is_training")
h_conv1 = Relu(BatchNormalization(self.conv2d(x_image, args.num_channels, 'same'), axis=3, training=self.isTraining)) # batch_size x board_x x board_y x num_channels
h_conv2 = Relu(BatchNormalization(self.conv2d(h_conv1, args.num_channels, 'same'), axis=3, training=self.isTraining)) # batch_size x board_x x board_y x num_channels
h_conv3 = Relu(BatchNormalization(self.conv2d(h_conv2, args.num_channels, 'valid'), axis=3, training=self.isTraining)) # batch_size x (board_x-2) x (board_y-2) x num_channels
h_conv4 = Relu(BatchNormalization(self.conv2d(h_conv3, args.num_channels, 'valid'), axis=3, training=self.isTraining)) # batch_size x (board_x-4) x (board_y-4) x num_channels

s_fc1 = Dropout(Relu(BatchNormalization(Dense(h_conv4_flat, 1024), axis=1, training=self.isTraining)), rate=self.dropout)
s_fc2 = Dropout(Relu(BatchNormalization(Dense(s_fc1, 512), axis=1, training=self.isTraining)), rate=self.dropout)

After line 50, add:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_step = tf.train.AdamOptimizer(self.args.lr).minimize(self.total_loss)

Also, directly generate the action probabilities after line 37:
self.prob = tf.nn.softmax(self.pi)

In tensorflow/NNet.py, in the train member function, update:
input_dict = {self.nnet.input_boards: boards, self.nnet.target_pis: pis, self.nnet.target_vs: vs, self.nnet.dropout: args.dropout, self.nnet.isTraining: True}

In the predict function, update:
prob, v = self.sess.run([self.nnet.prob, self.nnet.v], feed_dict={self.nnet.input_boards: board, self.nnet.dropout: 0, self.nnet.isTraining: False})
comment out the line "pi = np.exp(pi) / np.sum(np.exp(pi))", and
return prob[0], v[0]

If you like, you can give me permission and I can create a bug-fix branch with the above changes (and some others) so that you can review them.

Thanks

Jianxiong

connect4 symmetries

Hi,
I've been playing with this code for some days; very instructive, thank you!
I'm wondering: in getSymmetries for Connect4, shouldn't the vector pi also be reversed along with the board?
Thanks
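For illustration, a sketch of what a mirrored symmetry pair could look like (the method name follows the Game interface; the board/pi layout is an assumption, not the repo's exact Connect4 code):

    import numpy as np

    def getSymmetries(self, board, pi):
        """Return the board plus its left-right mirror, with pi reversed to match."""
        # Connect4: one pi entry per column, so a mirrored board needs mirrored pi.
        return [(board, list(pi)),
                (np.fliplr(board), list(reversed(pi)))]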

parallelization of MCTS ?!

Thank you very much for providing this useful repository. I have a question regarding the parallelization of MCTS simulations. In the paper, they mention that they used several threads to run MCTS simulations in parallel, but I can't see any source of randomness in the Monte Carlo simulations at each step. Do you have any idea? Sorry if my question is stupid; I am new to this field.

I would also be grateful if you could let me know how one can parallelize collecting data and training the networks, since it seems that this is the method they implemented in their work.

Thank you very much!
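One simple way to parallelize data collection (not something this repo currently does; a sketch that assumes each self-play episode can be generated independently):

    from multiprocessing import Pool

    def run_episode(seed):
        """Play one self-play episode in its own process and return its examples."""
        import numpy as np
        np.random.seed(seed)   # give each worker its own RNG stream
        # build the game, network and MCTS here and play out one episode...
        return []              # list of (board, pi, result) examples

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            episodes = pool.map(run_episode, range(100))
        train_examples = [ex for ep in episodes for ex in ep]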

Bug in getGameEnded() / GoBang

There are two five-in-a-row results: Black from (6,0) to (2,4) and White from (1,0) to (5,4). Black had it first, but White won.

Game over: Turn 65 Result 1
0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |

0 |W W W W b W b W b W |
1 |W W W W b W b W b W |
2 |b W b W b W b W b - |
3 |W b W b W - - - - - |
4 |b W b W - - - - - - |
5 |W b W b W - - - - - |
6 |b W b - - - b - - - |
7 |W b W - - - - - - - |
8 |b W b W b - - - - - |
9 |b b b W b W b b b b |

Max recursion depth exceeded

Othello TF fails per the traceback below on iteration 1, epoch 10.
BTW: with TF, less verbosity would be helpful (400 lines per epoch).

Training Net |############################### | (400/403) Data: 0.000s | Batch: 0.031s | Total: 0:00:12 | ETA: 0:00:01 | Loss_pi: 3.4790 |Training Net |############################### | (401/403) Data: 0.000s | Batch: 0.031s | Total: 0:00:12 | ETA: 0:00:01 | Loss_pi: 3.4789 |Training Net |############################### | (402/403) Data: 0.000s | Batch: 0.031s | Total: 0:00:12 | ETA: 0:00:01 | Loss_pi: 3.4789 |Training Net |################################| (403/403) Data: 0.000s | Batch: 0.031s | Total: 0:00:12 | ETA: 0:00:01 | Loss_pi: 3.4789 | Loss_v: 0.382
PITTING AGAINST PREVIOUS VERSION
/home/brian/hitme/bin/alpha-zero-general/MCTS.py:80: RuntimeWarning: invalid value encountered in true_divide
self.Ps[s] /= np.sum(self.Ps[s]) # renormalize
Traceback (most recent call last):
File "main.py", line 29, in
c.learn()
File "/home/brian/hitme/bin/alpha-zero-general/Coach.py", line 99, in learn
pwins, nwins, draws = arena.playGames(self.args.arenaCompare)
File "/home/brian/hitme/bin/alpha-zero-general/Arena.py", line 81, in playGames
gameResult = self.playGame(verbose=verbose)
File "/home/brian/hitme/bin/alpha-zero-general/Arena.py", line 46, in playGame
action = players[curPlayer+1](self.game.getCanonicalForm(board, curPlayer))
File "/home/brian/hitme/bin/alpha-zero-general/Coach.py", line 98, in
lambda x: np.argmax(nmcts.getActionProb(x, temp=0)), self.game)
File "/home/brian/hitme/bin/alpha-zero-general/MCTS.py", line 31, in getActionProb
self.search(canonicalBoard)
File "/home/brian/hitme/bin/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
File "/home/brian/hitme/bin/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
File "/home/brian/hitme/bin/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
[Previous line repeated 983 more times]
File "/home/brian/hitme/bin/alpha-zero-general/MCTS.py", line 103, in search
next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
File "/home/brian/hitme/bin/alpha-zero-general/othello/OthelloGame.py", line 31, in getNextState
b = Board(self.n)
File "/home/brian/hitme/bin/alpha-zero-general/othello/OthelloLogic.py", line 24, in init
for i in range(self.n):
RecursionError: maximum recursion depth exceeded in comparison

How To Install and use?

Could you please write up the steps for installing and using this to play Othello or Tic-Tac-Toe?

Thanks in advance

getCanonicalForm rotate board

We are trying to implement a version for Hex/Nash https://en.wikipedia.org/wiki/Hex_(board_game)

There, the two players are trying to reach slightly different goals: one player is playing from left to right, the other from top to bottom. Therefore, the idea was to rotate and flip the board in getCanonicalForm, but then the moves aren't legal anymore since the board has changed; at least that's what we believe is happening. :D

Any ideas on how to overcome this issue?
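One possible direction (a sketch with an assumed board encoding, not a tested fix): make the canonical form for player -1 the transposed, negated board, so the canonical position is always "player 1 connects top to bottom", and map action indices through the same transposition.

    import numpy as np

    def get_canonical_form(board, player):
        """Player 1 connects top-bottom; for player -1, transpose and negate."""
        return board if player == 1 else -board.T

    def map_action(action, n, player):
        """Map an action chosen on the canonical board back to the real board."""
        if player == 1:
            return action
        row, col = divmod(action, n)
        return col * n + row   # transpose the coordinates as well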

This is continuation of #51

Yes, I read your article, the best one for me. However, the neural network's output (P(s,a), Q(s,a)) is trained against the MCTS results (π) and the game results (z). For Q(s,a) against z there is no issue, but for P(s,a) against π there is, because π also depends on P(s,a), as indicated in the formula:

U(s,a) = Q(s,a) + c_{puct} \cdot P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}

Package as a reusable library

Thanks for sharing this project, I find it easier to start with working code than a research paper.

I'm planning to use this code in my own project, and I'd rather import the coach and arena classes from this project than add my game classes to a fork of this project.

I'm going to create a branch as a proof of concept to start the discussion.

batch_size

In the paper, they said that they optimized neural networks using 64 GPU workers and the batch size was "32 per worker, for a total mini-batch size of 2048".

I'm a bit confused about how their GPU workers correlate to my single GPU, which internally processes batches in a multi-threaded way. So, if I use a single GPU, should I set batch_size=32 or batch_size=2048 to stay closer to the original algorithm and methods?

stringRepresentation is not readable when printing

Hi Surag,
In https://github.com/suragnair/alpha-zero-general/blob/master/othello/OthelloGame.py, line 85, under the function stringRepresentation:

def stringRepresentation(self, board):
    # 8x8 numpy array (canonical board)
    return board.tostring()

When you try to print the returned string, it shows "@@@", which is not good for debugging.

We can do it in the following way:

import json

def stringRepresentation(self, board):
    return json.dumps(board.tolist())

e.g.
import numpy as np
import json
a = np.array([[2.0, 3.0], [4.0, 5.0]])
print(a.tostring())
@@@
print(json.dumps(a.tolist()))
[[2.0, 3.0], [4.0, 5.0]]

tostring is probably more efficient, but it loses readability.

Thanks

Jianxiong

flips assertion failure

Hi Surag,
Thanks for sharing this good software. When I ran it under the TensorFlow framework, I got the flips assertion failure shown below.

------ITER 19------
Self Play |###################### | (71/100) Eps Time: 1.669s | Total: 0:01:58 | ETA: 0:00:51
Traceback (most recent call last):
File "main.py", line 30, in
c.learn()
File "/home//tools/alpha-zero-general/Coach.py", line 78, in learn
trainExamples += self.executeEpisode()
File "/home//tools/alpha-zero-general/Coach.py", line 46, in executeEpisode
pi = self.mcts.getActionProb(canonicalBoard, temp=temp)
File "/home//tools/alpha-zero-general/MCTS.py", line 31, in getActionProb
self.search(canonicalBoard)
File "/home//tools/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
File "/home//tools/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
File "/home//tools/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
File "/home//tools/alpha-zero-general/MCTS.py", line 106, in search
v = self.search(next_s)
File "/home//tools/alpha-zero-general/MCTS.py", line 103, in search
next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
File "/home//tools/alpha-zero-general/othello/OthelloGame.py", line 34, in getNextState
b.execute_move(move, player)
File "/home//tools/alpha-zero-general/othello/OthelloLogic.py", line 111, in execute_move
assert len(list(flips))>0
AssertionError

System info:
tensorflow-gpu 1.1.0
master branch (head of commit:
"
commit 263eccb
Author: suragnair [email protected]
Date: Wed Jan 3 12:05:11 2018 +0530

added dim to pytorch log_softmax (UserWarning)" )

I wonder whether it is possible for flips to be empty.

Thanks

Jianxiong

possible bug in MCTS.py ?

Hi, I've read your code, and I found that in MCTS.py, line 106:
u = self.args.cpuct*self.Ps[s][a]*math.sqrt(self.Ns[s])
If we're at a node with no child node yet, then self.Ns[s] is 0. Hence u is 0 for every (s, a), and the first valid action is chosen as the best one; self.Ns[s] is not updated to 1 until the v of its first child node is returned.
Shouldn't it be
u = self.args.cpuct*self.Ps[s][a]*math.sqrt(self.Ns[s]+1) ?

Is best_act = -1 normal ?

I got this error :

Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pudb/init.py", line 77, in runscript
dbg.runscript(mainpyfile)
File "/usr/lib/python2.7/dist-packages/pudb/debugger.py", line 419, in runscript
self.run(statement, globals=globals
, locals=locals
)
File "/usr/lib/python2.7/bdb.py", line 400, in run
exec cmd in globals, locals
File "", line 1, in
File "main.py", line 29, in
c.learn()
File "Coach.py", line 99, in learn
pwins, nwins, draws = arena.playGames(self.args.arenaCompare)
File "Arena.py", line 81, in playGames
gameResult = self.playGame(verbose=verbose)
File "Arena.py", line 46, in playGame
action = players[curPlayer+1](self.game.getCanonicalForm(board, curPlayer))
File "Coach.py", line 98, in
lambda x: np.argmax(nmcts.getActionProb(x, temp=0)), self.game)
File "MCTS.py", line 31, in getActionProb
self.search(canonicalBoard)
File "MCTS.py", line 106, in search
v = self.search(next_s)
File "MCTS.py", line 106, in search
v = self.search(next_s)
File "MCTS.py", line 106, in search
v = self.search(next_s)
File "MCTS.py", line 106, in search
v = self.search(next_s)
File "MCTS.py", line 106, in search
v = self.search(next_s)
File "MCTS.py", line 103, in search
next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
File "othello/OthelloGame.py", line 34, in getNextState
b.execute_move(move, player)
File "othello/OthelloLogic.py", line 111, in execute_move
assert len(list(flips))>0
AssertionError

When I execute the execute_move() function, my action and move are -1 and (-1, 5).

Othello conceptual bug in managing games ended in a draw

Hi, first of all thank you for this repo, I really appreciate the clarity of the code. :D

I think there is a problem in your choice of how to handle draws in the game of Othello: you assign a value of 1 if player 1 won and -1 if player 1 did not win, penalizing draws and defeats in the same way.

def getGameEnded(self, board, player): 

    # return 0 if not ended, 1 if player 1 won, -1 if player 1 lost
    # player = 1

    b = Board(self.n)
    b.pieces = np.copy(board)
    if b.has_legal_moves(player):
        return 0
    if b.has_legal_moves(-player):
        return 0
    if b.countDiff(player) > 0:
        return 1
    return -1

It is a legitimate choice, and it makes some sense if you want to force the agent to play for the win, but in a drawn game you should give the same -1 value to the other player, whereas you are assigning +1. So you are creating a dataset where actions that led to draws can have either value, +1 or -1; this leads to an ambiguous dataset and poor training performance when there are many draws.

   while True:

        episodeStep += 1
        canonicalBoard = self.game.getCanonicalForm(board,self.curPlayer)
        temp = int(episodeStep < self.args.tempThreshold)

        pi = self.mcts.getActionProb(canonicalBoard, temp=temp)
        sym = self.game.getSymmetries(canonicalBoard, pi)
        for b,p in sym:
            trainExamples.append([b, self.curPlayer, p, None]) 

        action = np.random.choice(len(pi), p=pi)
        board, self.curPlayer = self.game.getNextState(board, self.curPlayer, action)

        r = self.game.getGameEnded(board, self.curPlayer)

        if r!=0:
            return [(x[0],x[2],r*((-1)**(x[1]!=self.curPlayer))) for x in trainExamples]
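One possible workaround (a sketch reusing the Board helper from the snippet above, not the repo's current behaviour): have getGameEnded return a small non-zero value for a draw, so the reward formula gives both players a near-zero target instead of +1 and -1.

    import numpy as np

    def getGameEnded(self, board, player):
        """0 if ongoing, 1 if `player` won, -1 if lost, small value for a draw."""
        b = Board(self.n)
        b.pieces = np.copy(board)
        if b.has_legal_moves(player) or b.has_legal_moves(-player):
            return 0
        diff = b.countDiff(player)
        if diff > 0:
            return 1
        if diff < 0:
            return -1
        return 1e-4   # draw: non-zero so it is distinguishable from "not ended"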

Picking the best child in the MCTS in event of a tie

A slight improvement to the MCTS.search code would cope with multiple actions having the same confidence.

line 99 becomes:
best_act = []

lines 109-111 become:
if u > cur_best:
    cur_best = u
    best_act = [a]
elif u == cur_best:
    best_act.append(a)

line 113 becomes:
a = random.choice(best_act)

"best player from previous iterations" should not exist in AlphaZero implementation?

Thanks for the really cool project! I was reading over the AlphaZero paper and it seems like they dropped the self-play evaluation and best player selection in favor of continuous updates to the latest model when they moved from AlphaGo Zero to AlphaZero.

In AlphaGo Zero, self-play games were generated by the best player from all previous iterations. After each iteration of training, the performance of the new player was measured against the best player; if it won by a margin of 55% then it replaced the best player and self-play games were subsequently generated by this new player. In contrast, AlphaZero simply maintains a single neural network that is updated continually, rather than waiting for an iteration to complete. Self-play games are generated by using the latest parameters for this neural network, omitting the evaluation step and the selection of best player.

It seems like they did not offer a lot of detail about the way they continuously update the model. I'm guessing they used some distributed model with Tensorflow that takes care of coordinating the model updates.

Are there any plans to update this repo with a continuously updated model like in AlphaZero?

temp in MCTS.getActionProb()

Hi, I noticed that in MCTS.py, line 43:
counts = [x**(1./temp) for x in counts]
However, the temp passed in here only alternates between 0 and 1. The only calculation of temp is in Coach.py, line 49:
temp = int(episodeStep < self.args.tempThreshold)
which makes x**(1./temp) a no-op, since whenever that line is reached temp is always 1. I guess your original intention was to flatten the distribution a little bit to allow more exploration?
Please correct me if I missed something. Thanks!
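For context, a sketch of how visit counts become action probabilities under different temperatures (shaped after MCTS.getActionProb; the standalone function here is illustrative):

    import numpy as np

    def action_probs(counts, temp):
        """temp=0 is deterministic argmax, temp=1 samples proportionally to
        visit counts, and 0 < temp < 1 would sharpen the distribution."""
        counts = np.asarray(counts, dtype=np.float64)
        if temp == 0:
            probs = np.zeros_like(counts)
            probs[np.argmax(counts)] = 1.0
            return probs
        counts = counts ** (1.0 / temp)
        return counts / counts.sum()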

othello tensorflow version bugs?

@suragnair

    def __init__(self, game, args):
        ......
        self.pi = Dense(s_fc2, self.action_size)
        self.prob = tf.nn.softmax(self.pi)
    def calculate_loss(self):
        ......
        self.loss_pi =  tf.losses.softmax_cross_entropy(self.target_pis, self.pi)

In this line:
self.loss_pi = tf.losses.softmax_cross_entropy(self.target_pis, self.pi)
should self.pi be self.prob?

How to choose framework

from othello.keras.NNet import NNetWrapper as nn <----------------keras instead of pytorch

Is this not choosing the framework in main.py per the README, or am I missing something?
Thanks,
Brian

Masking Ps[s]*valids may give an array of zeros

self.Ps[s] /= np.sum(self.Ps[s]) # renormalize

In this line we divide by the sum of the initial policy, which was previously masked by the valid moves (valids).

I observe cases where the product Ps[s]*valids is an array of zeros. Then sum(Ps[s]) is also zero, and on the given line we get a numpy warning about division by zero, after which all Ps[s][a] become NaN. Numpy doesn't raise an error, so we carry on, and on the next visits to the state s we get best_act = -1 and then a = -1. We then invoke
next_s, next_player = self.game.getNextState(canonicalBoard, 1, a=-1)
which is conceptually an illegal operation.

Such cases occur in the PIT phase of the 1st iteration and then continue. When I debug a case, I see that nn.predict() returns a few nonzero values inside Ps[s], none of which match any value from valids, so the product gives all zeros. Over the next iterations the number of cases gradually decreases.

I think the occurrence of these cases depends on the random number generator, because they are platform dependent: if I run the same program on different physical computers I may or may not observe them.

The question is: should we detect such cases and avoid NaNs in Ps[s]? If such a case is detected we could, for example, make all valid moves equally probable, i.e.
self.Ps[s] = self.Ps[s] + valids
self.Ps[s] /= np.sum(self.Ps[s])
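A self-contained sketch of that fallback (illustrative names, not the repo's exact code); it mirrors the suggestion above of making all valid moves equally probable when the masked policy sums to zero:

    import numpy as np

    def masked_policy(ps, valids):
        """Mask a raw policy by the valid-move vector and renormalize; if every
        valid move was masked to zero, fall back to a uniform distribution over
        the valid moves instead of dividing by zero."""
        ps = ps * valids
        total = np.sum(ps)
        if total > 0:
            return ps / total
        return valids / np.sum(valids)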

Separate Policy and Value Networks

The AlphaGo Zero paper mentions that a difference from the prior AlphaGo approach is that the policy and value networks are combined. Could you please explain a bit about this and about the method chosen for alpha-zero-general?
Thanks.
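For illustration, here is a sketch of what a combined network looks like: a single trunk with two heads, one producing the policy and one the value (Keras functional API; the layer sizes are placeholders, not the repo's exact architecture):

    from tensorflow.keras import layers, models

    def build_combined_net(board_x=6, board_y=6, action_size=37, channels=64):
        inp = layers.Input(shape=(board_x, board_y, 1))
        x = layers.Conv2D(channels, 3, padding='same', activation='relu')(inp)
        x = layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
        x = layers.Flatten()(x)
        x = layers.Dense(256, activation='relu')(x)
        pi = layers.Dense(action_size, activation='softmax', name='pi')(x)  # policy head
        v = layers.Dense(1, activation='tanh', name='v')(x)                 # value head
        model = models.Model(inputs=inp, outputs=[pi, v])
        model.compile(loss=['categorical_crossentropy', 'mean_squared_error'],
                      optimizer='adam')
        return model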

Chess game implementation

Can this framework be used to implement games like chess or shogi?
If so, would the neural network then have to receive multi-dimensional input, because chess, for example, has different types of pieces with different actions? This would mean the action space looks 3-dimensional, where the third dimension covers the actions of the pieces.

Here, for games like Othello, GoBang, TicTacToe and so on, the neural network receives a 2D input, which is the grid, and the actions for that grid correspond to the places the next piece is placed on. But this cannot be applied directly to games like chess.

I'm trying to implement an RTS game, which consists of a grid (for example 8x8), where each tile can host up to 5 figures, and the figures have different numbers of actions like "walk up", "attack", etc.
My action size would be

grid_width * grid_height * 5 * num_actions

where num_actions is the number of all actions for the figures.

Can this type of game even be implemented with this algorithm?

TF change getting error (softmax_cross_entropy_with_logits)

See #21

(alpha-zero-general) C:\alpha-zero-general\alpha-zero-general\Scripts\alpha-zero-general>python main.py
C:\alpha-zero-general\alpha-zero-general\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Traceback (most recent call last):
File "main.py", line 23, in
nnet = nn(g)
File "C:\alpha-zero-general\alpha-zero-general\Scripts\alpha-zero-general\othello\tensorflow\NNet.py", line 29, in __init__
self.nnet = onnet(game, args)
File "C:\alpha-zero-general\alpha-zero-general\Scripts\alpha-zero-general\othello\tensorflow\OthelloNNet.py", line 40, in __init__
self.calculate_loss()
File "C:\alpha-zero-general\alpha-zero-general\Scripts\alpha-zero-general\othello\tensorflow\OthelloNNet.py", line 48, in calculate_loss
self.loss_pi = tf.losses.softmax_cross_entropy_with_logits(self.target_pis, self.pi)
AttributeError: module 'tensorflow.python.ops.losses.losses' has no attribute 'softmax_cross_entropy_with_logits'

Implement Connect4 player

I've been working on adding a Connect4 player on my fork and thought I should open an issue here to track progress. I assume you'd like to merge this in when it's done?

https://github.com/rodneyodonnell/alpha-zero-general/commits/connect4

Currently I've been training it for ~20 iterations (numEps=100, numMCTSSims=200) and it doesn't seem to play significantly better than it did after a single iteration. I'm still trying to understand why.

The model I've trained easily defeats random even with minimal MCTS steps in the arena, and can beat humans (me) with a moderate MCTS.

possible bug in OthelloNNet.py?

Hi, it's me again. I've noticed that in OthelloNNet.py line 37, you wrote:
self.prob = tf.nn.softmax(self.pi)
and in line 48:
self.loss_pi = tf.losses.softmax_cross_entropy(self.target_pis, self.prob)
I just checked the tf docs for softmax_cross_entropy and they say:

tf.losses.softmax_cross_entropy(
onehot_labels,
logits,
weights=1.0,
label_smoothing=0,
scope=None,
loss_collection=tf.GraphKeys.LOSSES,
reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)

which means that softmax_cross_entropy already applies the softmax for you, so you should provide logits as input to softmax_cross_entropy, not probabilities.
I tried changing line 48 to
self.loss_pi = tf.losses.softmax_cross_entropy(self.target_pis, self.pi)
and it runs fine as well.
Please correct me if I got something wrong. Thanks!

Why to collect all examples from all iterations?

trainExamples = deque([], maxlen=self.args.maxlenOfQueue)

What is the main goal of collecting all examples from all iterations? On one hand, we get smoother loss dynamics and more examples for the same computation cost; on the other hand, once we are far enough from the first iterations, our model has gained some experience, and the old examples reflect weak predictions and thus pull the model back.
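One alternative (a sketch, not necessarily what the repo does): keep a bounded history of the most recent iterations and rebuild the training set from just those, so examples from early, weak models age out.

    from collections import deque

    # Keep only the examples of the last `window` iterations.
    window = 20
    iteration_history = deque(maxlen=window)

    def training_set():
        """Flatten the retained iterations into one list of training examples."""
        return [ex for iteration in iteration_history for ex in iteration]

    # After each iteration of self-play:
    # iteration_history.append(examples_from_this_iteration)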

Better model/weights

I tried to upload the weights, but they are too large.
This is after checkpoint 153.
Results pitted against the pre-trained best (1580):

Arena.playGames |################################| (4097/2048) Eps Time: 3.473s | Total: 3:57:03 | ETA: 0:00:03
(1580, 2516, 0)

Keras framework

Apologies for the formatting.
I cannot run the Keras framework from scratch. I made 2 minor changes to main.py as follows:
from Coach import Coach
from othello.OthelloGame import OthelloGame
from othello.keras.NNet import NNetWrapper as nn   # <---------------- keras instead of pytorch
from utils import *

args = dotdict({
    'numIters': 1000,
    'numEps': 100,
    'tempThreshold': 15,
    'updateThreshold': 0.6,
    'maxlenOfQueue': 200000,
    'numMCTSSims': 25,
    'arenaCompare': 40,
    'cpuct': 1,
    'checkpoint': './tempkeras/',   # <--------------- just to keep frameworks separate
    'load_model': False,
    'load_folder_file': ('/dev/models/8x100x50','best.pth.tar'),
})

if __name__ == "__main__":
    g = OthelloGame(6)
    nnet = nn(g)
    if args.load_model:
        nnet.load_checkpoint(args.load_folder_file[0], args.load_folder_file[1])
    c = Coach(g, nnet, args)
    c.learn()

And I get this:

(alpha-zero-general) brian@Tinker2Ubuntu:~/alpha-zero-general/bin/alpha-zero-general$ python mainkeras.py
Using TensorFlow backend.
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
------ITER 1------
2018-01-05 15:26:14.110757: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
2018-01-05 15:26:14.230623: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-05 15:26:14.231014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7845 pciBusID: 0000:03:00.0 totalMemory: 7.92GiB freeMemory: 3.83GiB
2018-01-05 15:26:14.231031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
Self Play |################################| (100/100) Eps Time: 3.966s | Total: 0:06:36 | ETA: 0:00:04
Checkpoint Directory does not exist! Making directory ./tempkeras/
Traceback (most recent call last):
File "mainkeras.py", line 29, in <module>
c.learn()
File "/home/brian/alpha-zero-general/bin/alpha-zero-general/Coach.py", line 91, in learn
pnet.load_checkpoint(folder=self.args.checkpoint, filename='temp.pth.tar')
File "/home/brian/alpha-zero-general/bin/alpha-zero-general/othello/keras/NNet.py", line 70, in load_checkpoint
raise("No model in path {}".format(filepath))
TypeError: exceptions must derive from BaseException

invalid value in division

When I ran main.py under TensorFlow, I also got this runtime warning:

.../othello/tensorflow/NNet.py:103: RuntimeWarning: invalid value encountered in divide

pi = np.exp(pi) / np.sum(np.exp(pi))

Here we calculate a softmax. When one value in pi is large, the denominator may overflow.

We can try the following method to avoid it:

x = np.exp(pi - np.max(pi))
pi = x/x.sum()

But I'm not sure whether the above warning was caused by overflow.

Jianxiong

Is pytorch necessary?

I have changed pytorch to keras in main.py .
But...

\pytorch_classification\utils\misc.py", line 12, in
import torch.nn as nn
ModuleNotFoundError: No module named 'torch'

Bug in othello/pytorch/NNet.py

Dear authors of this repository,
There is a bug on line 150 of othello/pytorch/NNet.py. The line
raise("No model in path {}".format(checkpoint))
should be
raise("No model in path {}".format(filepath))

Also, in pit.py the filepath passed into load_checkpoint should be

./pretrained_models/othello/pytorch/

Thank you for looking into this!
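As a sketch of the corrected guard (illustrative helper, not the repo's exact code): the message should reference filepath, and the statement must raise an actual exception class, since raising a bare string is itself a TypeError in Python 3 ("exceptions must derive from BaseException", as seen in the Keras issue above).

    import os

    def check_checkpoint_exists(folder, filename):
        """Return the checkpoint path, raising a proper exception if it is missing."""
        filepath = os.path.join(folder, filename)
        if not os.path.exists(filepath):
            raise FileNotFoundError("No model in path {}".format(filepath))
        return filepath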
