
douzero's Introduction

[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning



中文文档 (Chinese documentation)

DouZero is a reinforcement learning framework for DouDizhu (斗地主), the most popular card game in China. It is a shedding-type game in which the objective is to be the first player to empty one's hand of all cards. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, a large state space, and particularly a massive set of possible actions, where the legal actions vary significantly from turn to turn. DouZero is developed by AI Platform, Kwai Inc. (快手).

Community:

  • Slack: Discuss in DouZero channel.

  • QQ Group: Join our QQ group to discuss. Password: douzeroqqgroup

    • Group 1: 819204202
    • Group 2: 954183174
    • Group 3: 834954839
    • Group 4: 211434658
    • Group 5: 189203636

News:

  • Thanks to @Vincentzyx for contributing CPU training support. Windows users can now train with CPUs.

Demo

Cite this Work

If you find this project helpful in your research, please cite our paper:

Zha, Daochen et al. “DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning.” ICML (2021).

@InProceedings{pmlr-v139-zha21a,
  title     = {DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning},
  author    = {Zha, Daochen and Xie, Jingru and Ma, Wenye and Zhang, Sheng and Lian, Xiangru and Hu, Xia and Liu, Ji},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12333--12344},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zha21a/zha21a.pdf},
  url       = {http://proceedings.mlr.press/v139/zha21a.html},
  abstract  = {Games are abstractions of the real world, where artificial agents learn to compete and cooperate with other agents. While significant achievements have been made in various perfect- and imperfect-information games, DouDizhu (a.k.a. Fighting the Landlord), a three-player card game, is still unsolved. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, large state space, and particularly a massive set of possible actions where the legal actions vary significantly from turn to turn. Unfortunately, modern reinforcement learning algorithms mainly focus on simple and small action spaces, and not surprisingly, are shown not to make satisfactory progress in DouDizhu. In this work, we propose a conceptually simple yet effective DouDizhu AI system, namely DouZero, which enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. Starting from scratch in a single server with four GPUs, DouZero outperformed all the existing DouDizhu AI programs in days of training and was ranked the first in the Botzone leaderboard among 344 AI agents. Through building DouZero, we show that classic Monte-Carlo methods can be made to deliver strong results in a hard domain with a complex action space. The code and an online demo are released at https://github.com/kwai/DouZero with the hope that this insight could motivate future work.}
}

What Makes DouDizhu Challenging?

In addition to the challenge of imperfect information, DouDizhu has huge state and action spaces. In particular, the action space of DouDizhu is on the order of 10^4 (see this table). Unfortunately, most reinforcement learning algorithms can only handle very small action spaces. Moreover, the players in DouDizhu need to both compete and cooperate with others in a partially observable environment with limited communication, i.e., the two Peasant players form a team to fight against the Landlord player. Modeling both competition and cooperation is an open research challenge.

In this work, we propose a Deep Monte-Carlo (DMC) algorithm with action encoding and parallel actors. This leads to a very simple yet surprisingly effective solution for DouDizhu. Please read our paper for more details.
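
To make the idea concrete, here is a minimal, illustrative sketch of Deep Monte-Carlo with action encoding. It is not the DouZero implementation, just a toy Q-network that scores concatenated (state, action) encodings and regresses every visited pair toward the final episode return:

import torch
import torch.nn as nn

class QNet(nn.Module):
    # Scores a (state, action) pair; legal actions are encoded as feature vectors.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.mlp(torch.cat([state, action], dim=-1)).squeeze(-1)

def dmc_update(qnet, optimizer, states, actions, episode_return):
    # Monte-Carlo target: every (state, action) pair in the episode is
    # regressed toward the same final return (no bootstrapping).
    target = torch.full((states.shape[0],), float(episode_return))
    loss = nn.functional.mse_loss(qnet(states, actions), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At play time, the agent encodes every legal action, scores each one with the Q-network, and picks the highest-scoring action (mixed with epsilon-greedy exploration during training); DouZero additionally runs many such actors in parallel to generate episodes.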

Installation

The training code is designed for GPUs, so you need to install CUDA first if you want to train models. You may refer to this guide. For evaluation, CUDA is optional and you can evaluate with the CPU.

First, clone the repo (if you are in China and GitHub is slow, you can use the mirror on Gitee):

git clone https://github.com/kwai/DouZero.git

Make sure you have Python 3.6+ installed, then install the dependencies:

cd DouZero
pip3 install -r requirements.txt

We recommend installing the stable version of DouZero with

pip3 install douzero

If you are in China and the above command is too slow, you can use the mirror provided by Tsinghua University:

pip3 install douzero -i https://pypi.tuna.tsinghua.edu.cn/simple

or install the latest (possibly unstable) version with

pip3 install -e .

Note that Windows users can only use CPU as actors. See Issues in Windows about why GPUs are not supported. Nonetheless, Windows users can still run the demo locally.
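
After installation, a quick way to check that PyTorch (installed via requirements.txt) can see your GPU before starting a training run:

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # should be True for GPU training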

Training

To use GPU for training, run

python3 train.py

This will train DouZero on one GPU. To train DouZero on multiple GPUs, use the following arguments:

  • --gpu_devices: which GPU devices are visible
  • --num_actor_devices: how many of the GPU devices will be used for simulation, i.e., self-play
  • --num_actors: how many actor processes will be used for each device
  • --training_device: which device will be used for training DouZero

For example, if we have 4 GPUs and want the first 3 GPUs to run 15 actors each for simulation and the 4th GPU for training, we can run the following command:

python3 train.py --gpu_devices 0,1,2,3 --num_actor_devices 3 --num_actors 15 --training_device 3

To use CPU training or simulation (Windows can only use CPU for actors), use the following arguments:

  • --training_device cpu: Use CPU to train the model
  • --actor_device_cpu: Use CPU as actors

For example, use the following command to run everything on CPU:

python3 train.py --actor_device_cpu --training_device cpu

The following command only runs actors on CPU:

python3 train.py --actor_device_cpu

For more customized configuration of training, see the following optional arguments:

--xpid XPID           Experiment id (default: douzero)
--save_interval SAVE_INTERVAL
                      Time interval (in minutes) at which to save the model
--objective {adp,wp}  Use ADP or WP as reward (default: ADP)
--actor_device_cpu    Use CPU as actor device
--gpu_devices GPU_DEVICES
                      Which GPUs to use for training
--num_actor_devices NUM_ACTOR_DEVICES
                      The number of devices used for simulation
--num_actors NUM_ACTORS
                      The number of actors for each simulation device
--training_device TRAINING_DEVICE
                      The index of the GPU used for training models. `cpu`
                      means using the CPU
--load_model          Load an existing model
--disable_checkpoint  Disable saving checkpoint
--savedir SAVEDIR     Root dir where experiment data will be saved
--total_frames TOTAL_FRAMES
                      Total environment frames to train for
--exp_epsilon EXP_EPSILON
                      The probability for exploration
--batch_size BATCH_SIZE
                      Learner batch size
--unroll_length UNROLL_LENGTH
                      The unroll length (time dimension)
--num_buffers NUM_BUFFERS
                      Number of shared-memory buffers
--num_threads NUM_THREADS
                      Number of learner threads
--max_grad_norm MAX_GRAD_NORM
                      Max norm of gradients
--learning_rate LEARNING_RATE
                      Learning rate
--alpha ALPHA         RMSProp smoothing constant
--momentum MOMENTUM   RMSProp momentum
--epsilon EPSILON     RMSProp epsilon
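
These flags are all consumed by train.py, which is essentially a thin launcher. As a rough sketch (assuming the douzero.dmc package exposes parser and train as in the repository layout), it just parses the flags and hands them to the DMC trainer:

# train.py (sketch)
from douzero.dmc import parser, train

if __name__ == '__main__':
    flags = parser.parse_args()
    train(flags)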

Evaluation

The evaluation can be performed with GPU or CPU (GPU is much faster). Pre-trained models are available on Google Drive or Baidu Netdisk (百度网盘), extraction code: 4624. Put the pre-trained weights in baselines/ (a quick way to inspect them is sketched after the list below). The performance is evaluated through self-play. We provide pre-trained models and some heuristics as baselines:

  • random: agents that play randomly (uniformly)
  • rlcard: the rule-based agent in RLCard
  • SL (baselines/sl/): deep agents pre-trained on human data
  • DouZero-ADP (baselines/douzero_ADP/): DouZero agents pre-trained with Average Difference Points (ADP) as the objective
  • DouZero-WP (baselines/douzero_WP/): DouZero agents pre-trained with Winning Percentage (WP) as the objective
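
As a quick sanity check after downloading, the sketch below prints the layer names and shapes of a checkpoint; it assumes each .ckpt file is a plain PyTorch state dict of the corresponding position's network, which may not hold for every release:

import torch

# Hypothetical inspection helper; adjust the path to the checkpoint you downloaded.
state_dict = torch.load('baselines/douzero_ADP/landlord.ckpt', map_location='cpu')
for name, weight in state_dict.items():
    print(name, tuple(weight.shape))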

Step 1: Generate evaluation data

python3 generate_eval_data.py

Some important hyperparameters are as follows.

  • --output: where the pickled data will be saved
  • --num_games: how many random games will be generated, default 10000
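
If you want to peek at the generated data, here is a small hedged sketch; we assume the file simply pickles the pre-dealt games, one entry per game, and the exact structure may differ between versions:

import pickle

# Replace 'eval_data' with whatever path you passed to --output.
with open('eval_data', 'rb') as f:
    games = pickle.load(f)
print(len(games), 'pre-dealt games')
print(games[0])  # the initial deal of the first evaluation game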

Step 2: Self-Play

python3 evaluate.py

Some important hyperparameters are as follows.

  • --landlord: which agent plays as the Landlord; can be random, rlcard, or the path of a pre-trained model
  • --landlord_up: which agent plays as LandlordUp (the one who plays before the Landlord); can be random, rlcard, or the path of a pre-trained model
  • --landlord_down: which agent plays as LandlordDown (the one who plays after the Landlord); can be random, rlcard, or the path of a pre-trained model
  • --eval_data: the pickle file that contains evaluation data
  • --num_workers: how many subprocesses will be used
  • --gpu_device: which GPU to use. It will use CPU by default

For example, the following command evaluates DouZero-ADP in the Landlord position against random agents:

python3 evaluate.py --landlord baselines/douzero_ADP/landlord.ckpt --landlord_up random --landlord_down random

The following command evaluates DouZero-ADP in the Peasant positions against RLCard agents:

python3 evaluate.py --landlord rlcard --landlord_up baselines/douzero_ADP/landlord_up.ckpt --landlord_down baselines/douzero_ADP/landlord_down.ckpt
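
If you want to sweep one checkpoint against several baselines, a simple convenience wrapper around the CLI above (pure subprocess calls, no internal APIs) could look like this:

import subprocess

# Hypothetical sweep: evaluate a trained Landlord checkpoint against both baselines.
for opponent in ('random', 'rlcard'):
    subprocess.run([
        'python3', 'evaluate.py',
        '--landlord', 'baselines/douzero_ADP/landlord.ckpt',
        '--landlord_up', opponent,
        '--landlord_down', opponent,
    ], check=True)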

By default, our model will be saved in douzero_checkpoints/douzero every half hour. We provide a script to help you identify the most recent checkpoint. Run

sh get_most_recent.sh douzero_checkpoints/douzero/

The most recent model will be in most_recent_model.
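
If you prefer Python over the shell script, the following sketch does roughly the same thing: it copies the most recently modified checkpoint for each position into most_recent_model/, assuming that checkpoint filenames start with the position name (landlord, landlord_up, landlord_down), which may not match your version:

import glob
import os
import shutil

src = 'douzero_checkpoints/douzero'
dst = 'most_recent_model'
os.makedirs(dst, exist_ok=True)

latest = {}  # position -> (mtime, path)
for path in glob.glob(os.path.join(src, '*.ckpt')):
    name = os.path.basename(path)
    # Check the longer names first so 'landlord_up' files are not counted as 'landlord'.
    for position in ('landlord_up', 'landlord_down', 'landlord'):
        if name.startswith(position):
            mtime = os.path.getmtime(path)
            if position not in latest or mtime > latest[position][0]:
                latest[position] = (mtime, path)
            break

for position, (_, path) in latest.items():
    shutil.copy(path, os.path.join(dst, position + '.ckpt'))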

Issues in Windows

You may encounter an "operation not supported" error if you use a Windows system to train with GPUs as actors. This is because multiprocessing on CUDA tensors is not supported on Windows. However, our code operates extensively on CUDA tensors, since it is optimized for GPUs. Please contact us if you find any solutions!

Core Team

Acknowledgements

douzero's People

Contributors

daochenzha, hsywhu, karoka, nobles5e, vincentzyx, yffbit


douzero's Issues

The AI attaches the two jokers as the kicker of a four-with-two, which clearly lowers its win rate

In the following game I played the Landlord. The Peasants could have won for sure, but the AI played 7777XD (four 7s with both jokers as the kicker) and I came back and won. Even if the rules allow it, this should not be done; I suggest forcing the AI to never attach the jokers (XD) to a four-with-two.

new game start
you are landlord

your cards: 3344448889999JJJJQ22
888q
landlord: 888Q left:16 winRate:
landlord_down: 5555 left:13 winRate:100.0%
landlord_up: left:17 winRate:100.0%
your cards: 3344449999JJJJ22
9999
landlord: 9999 left:12 winRate:
landlord_down: TTTT left:9 winRate:100.0%
landlord_up: KKKK left:13 winRate:100.0%
your cards: 334444JJJJ22

landlord: left:12 winRate:
landlord_down: left:9 winRate:50.0%
landlord_up: 7777XD left:7 winRate:100.0%
your cards: 334444JJJJ22
4444
landlord: 4444 left:8 winRate:
landlord_down: left:9 winRate:100.0%
landlord_up: AAAA left:3 winRate:100.0%
your cards: 33JJJJ22

landlord: left:8 winRate:
landlord_down: left:9 winRate:50.0%
landlord_up: 3 left:2 winRate:100.0%
your cards: 33JJJJ22
2
landlord: 2 left:7 winRate:
landlord_down: left:9 winRate:0.0%
landlord_up: left:2 winRate:50.0%
your cards: 33JJJJ2
2
landlord: 2 left:6 winRate:
landlord_down: left:9 winRate:51.94%
landlord_up: left:2 winRate:50.0%
your cards: 33JJJJ
jjjj33
landlord: 33JJJJ left:0 winRate:

winner: landlord

Why is the first GPU over-consumed?

I ran the code with the following command and noticed that there are 9 processes occupying the first GPU. Why would that be the case?

python3 train.py --gpu_devices 0,1,2,3 --num_actor_devices 3 --num_actors 3 --training_device 3

The initialization logs look fine to me: [screenshot]

Here's a snapshot of the result of nvidia-smi: [screenshot]

Error when fighting with RLCard in landlord_up position

I ran it with
python3 evaluate.py --landlord rlcard --landlord_up baselines/douzero_ADP/landlord_up.ckpt --landlord_down baselines/douzero_ADP/landlord_down.ckpt
and got this error:
Traceback (most recent call last):
File "/DouZero-main/douzero/evaluation/rlcard_agent.py", line 62, in act
the_type = CARD_TYPE[0][last_move][0][0]
KeyError: '666777BR'

Still unable to train on Windows with the new version?

The log is as follows:

C:\Users\A\Downloads\DouZero-main>python train.py
Found log directory: douzero_checkpoints\douzero
Saving arguments to douzero_checkpoints\douzero/meta.json
Path to meta file already exists. Not overriding meta.
Saving messages to douzero_checkpoints\douzero/out.log
Path to message file already exists. New data will be appended.
Saving logs data to douzero_checkpoints\douzero/logs.csv
Saving logs' fields to douzero_checkpoints\douzero/fields.csv
THCudaCheck FAIL file=..\torch/csrc/generic/StorageSharing.cpp line=258 error=801 : operation not supported
Traceback (most recent call last):
File "train.py", line 8, in
train(flags)
File "C:\Users\A\Downloads\DouZero-main\douzero\dmc\dmc.py", line 138, in train
actor.start()
File "E:\Software\miniconda3\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "E:\Software\miniconda3\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "E:\Software\miniconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in init
reduction.dump(process_obj, to_child)
File "E:\Software\miniconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "E:\Software\miniconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 247, in reduce_tensor
event_sync_required) = storage.share_cuda()
RuntimeError: cuda runtime error (801) : operation not supported at ..\torch/csrc/generic/StorageSharing.cpp:258

C:\Users\A\Downloads\DouZero-main>Traceback (most recent call last):
File "", line 1, in
File "E:\Software\miniconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "E:\Software\miniconda3\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Why is my training FPS so low (around 2000) even though I also have 4 GPUs?

[INFO:1052 dmc:233 2022-07-20 17:39:38,765] After 1632000 (L:556800 U:528000 D:547200) frames: @ 1918.7 fps (avg@ 2318.1 fps) (L:0.0 U:0.0 D:1918.7) Stats:
{'loss_landlord': 1.9155352115631104,
'loss_landlord_down': 2.5349276065826416,
'loss_landlord_up': 2.1095376014709473,
'mean_episode_return_landlord': 0.08421196788549423,
'mean_episode_return_landlord_down': -0.08074238896369934,
'mean_episode_return_landlord_up': -0.06534682214260101}
[INFO:1052 dmc:233 2022-07-20 17:39:43,769] After 1648000 (L:563200 U:537600 D:547200) frames: @ 3197.8 fps (avg@ 2398.1 fps) (L:1279.1 U:1918.7 D:0.0) Stats:
{'loss_landlord': 2.3213179111480713,
'loss_landlord_down': 2.5349276065826416,
'loss_landlord_up': 2.6052844524383545,
'mean_episode_return_landlord': 0.09171878546476364,
'mean_episode_return_landlord_down': -0.08074238896369934,
'mean_episode_return_landlord_up': -0.08009536564350128}
[INFO:1052 dmc:233 2022-07-20 17:39:48,773] After 1654400 (L:569600 U:537600 D:547200) frames: @ 1279.1 fps (avg@ 2398.1 fps) (L:1279.1 U:0.0 D:0.0) Stats:
{'loss_landlord': 2.185067892074585,
'loss_landlord_down': 2.5349276065826416,
'loss_landlord_up': 2.6052844524383545,
'mean_episode_return_landlord': 0.09759927541017532,
'mean_episode_return_landlord_down': -0.08074238896369934,
'mean_episode_return_landlord_up': -0.08009536564350128}
[INFO:1052 dmc:233 2022-07-20 17:39:53,779] After 1673600 (L:576000 U:540800 D:556800) frames: @ 3836.1 fps (avg@ 2344.8 fps) (L:1278.7 U:639.4 D:1918.1) Stats:
{'loss_landlord': 1.77787184715271,
'loss_landlord_down': 2.7444241046905518,
'loss_landlord_up': 2.508575677871704,
'mean_episode_return_landlord': 0.10005713254213333,
'mean_episode_return_landlord_down': -0.09260766953229904,
'mean_episode_return_landlord_up': -0.08521360903978348}
[INFO:1052 dmc:233 2022-07-20 17:39:58,781] After 1680000 (L:576000 U:547200 D:556800) frames: @ 1279.5 fps (avg@ 2398.1 fps) (L:0.0 U:1279.5 D:0.0) Stats:
{'loss_landlord': 1.77787184715271,
'loss_landlord_down': 2.7444241046905518,
'loss_landlord_up': 2.264894723892212,
'mean_episode_return_landlord': 0.10005713254213333,
'mean_episode_return_landlord_down': -0.09260766953229904,
'mean_episode_return_landlord_up': -0.08965221047401428}

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:0A:00.0 Off | 0 |
| N/A 31C P0 96W / 400W | 66690MiB / 81251MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:45:00.0 Off | 0 |
| N/A 32C P0 95W / 400W | 66704MiB / 81251MiB | 98% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:4B:00.0 Off | 0 |
| N/A 34C P0 95W / 400W | 66700MiB / 81251MiB | 98% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:84:00.0 Off | 0 |
| N/A 39C P0 66W / 400W | 2653MiB / 81251MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

The FPS stays low, occasionally drops to 0, and occasionally jumps to 5000. Is this a normal training speed?

A question about adjusting the LSTM strategy

Currently the move history z_batch is used as the input to nn.LSTM. I don't quite understand the purpose of using the play history as the input. Could it be replaced with x_batch, i.e., using the visible-card data as the input? That would reduce the amount of data a lot and presumably speed up training.

The final version would look like this:

import torch
import torch.nn as nn

class LandlordLstmModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(270, 270, batch_first=True)
        self.dense1 = nn.Linear(270, 300)
        self.dense2 = nn.Linear(300, 300)
        self.dense3 = nn.Linear(300, 300)
        self.dense4 = nn.Linear(300, 1)

    def forward(self, x, return_value=False, flags=None):
        x = x.view(len(x), 1, -1)
        lstm_out, (h_n, _) = self.lstm(x)
        x = self.dense1(lstm_out)
        x = torch.relu(x)
        x = self.dense2(x)
        x = torch.relu(x)
        x = self.dense3(x)
        x = torch.relu(x)
        x = self.dense4(x)
        return x

Is this feasible?

Played cards are not checked against the previous player's move

There is no problem during training, but the issue appears when I manually modify the played cards. For example, the previous player played a 5 and I forced the Landlord to play a 3; the game did not report an error, and the next player even followed my 3 with a 4.

About the evaluation stage

I see that eval_data.pickle currently contains the initial hands of all three players. Does evaluation require the hand information of the other two players, or can inference be done only from one's own hand and the other players' played cards?
Thanks!

A few questions about models.py

self.lstm = nn.LSTM(162, 128, batch_first=True)

self.dense1 = nn.Linear(373 + 128, 512)  # input 501, output 512
self.dense2 = nn.Linear(512, 512)
self.dense3 = nn.Linear(512, 512)
self.dense4 = nn.Linear(512, 512)
self.dense5 = nn.Linear(512, 512)
self.dense6 = nn.Linear(512, 1)

1. The input dimension is 162. My understanding is that z_batch has num_legal_actions rows, 5 columns, and 162 elements per entry, which corresponds to the 162 input size. Is that the right interpretation?
2. I don't understand what the hidden size of 128 means. Can it be set arbitrarily?
3. Six fully-connected layers are designed here. I don't understand the rationale. Can this also be set arbitrarily?

Out-of-memory error after a few hours at loss.backward()

It looks like an OOM occurred. The error is as follows:
Exception in thread batch-and-learn-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zzy/robot/DouZeroV2/douzero/dmc/dmc.py", line 238, in batch_and_learn
    _stats = learn(position, models, learner_model.get_model(position), batch, optimizers[position], flags, position_lock)
  File "/home/zzy/robot/DouZeroV2/douzero/dmc/dmc.py", line 101, in learn
    loss.backward()
  File "/home/zzy/.local/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/zzy/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

The GPU has 2 GB of memory and the CUDA version is 11.7. The launch parameters are already set quite small:
python3 train.py --gpu_devices 0 --load_model --batch_size 16 --num_actor_devices 1 --num_actors 1 --num_threads 1 --training_device 0

Question about the _get_obs_landlord function in env/env.py

Hi~ While reading the code I noticed that in env/env.py, the _get_obs_landlord, _get_obs_landlord_up, and _get_obs_landlord_down functions each contain two identical code segments defining last_action. Is there any difference between them?

Why does evaluate fail?

I have already generated the test data, but at this step
python evaluate.py --landlord random --landlord_up random --landlord_down random
I find that num_total_wins = 0 in simulation.py.
What is going on?

GPU memory fills up very quickly

The machine has 4 GPUs with 12 GB of memory each. The parameters below are the maximum it can run; anything larger raises an out-of-memory error.
python -u train.py --gpu_devices 0,1,2,3 --num_actor_devices 3 --num_actors 2 --training_device 3

Help: unable to train with the CPU!!!

After changing --training_device to cpu:
parser.add_argument('--training_device', default='cpu', type=str,
help='The index of the GPU used for training models. cpu means using cpu')

I still get:
AssertionError: CUDA not available. If you have GPUs, please specify the ID after --gpu_devices. Otherwise, please train with CPU with python3 train.py --actor_device_cpu --training_device cpu

How is the training data for the SL model implemented and loaded?

Hi, I am working on an SL model for DouDizhu. The amount of data is very large, over a million games. May I ask how you load the data for supervised learning?
I have two main ideas:

  1. Convert all the data into numpy format, but that would create over a million numpy files and run into an I/O bottleneck.
  2. Read in all the raw data and encode it on the fly while training.
    Which of these is better, or is there an even better way? Thanks!

About the algorithm

It feels like the current algorithm was developed assuming all three hands are visible, so it cannot act like a black box that assists a human or plays DouDizhu on its own.

I cannot train with CUDA 11.3 on Windows 10 with a GTX 1060. What can I do?

THCudaCheck FAIL file=..\torch/csrc/generic/StorageSharing.cpp line=249 error=801 : operation not supported
Traceback (most recent call last):
File "E:\深度学习相关\DouZero相关群资料\模型训练\分布式训练\分布式训练\train.py", line 8, in
train(flags)
File "E:\深度学习相关\DouZero相关群资料\模型训练\分布式训练\分布式训练\douzero\dmc\dmc.py", line 202, in train
actor.start()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\popen_spawn_win32.py", line 93, in init
reduction.dump(process_obj, to_child)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\multiprocessing\reductions.py", line 247, in reduce_tensor
event_sync_required) = storage.share_cuda()
RuntimeError: cuda runtime error (801) : operation not supported at ..\torch/csrc/generic/StorageSharing.cpp:249

Two questions

  1. In the first game I chose the LandlordUp role. My teammate, LandlordDown, played only one card from start to finish, and then the Landlord won. In reality a game would hardly ever play out like that.
  2. In the second game I chose the Landlord role, and surprisingly I won.

I trained it myself; why is the result not as intelligent as your trained model?

I trained with WP using your source code, and the checkpoint after training behaves strangely compared with the one before training.
First, whether as Landlord or Peasant, the predicted win rate never goes above 50% when choosing a move.
Second, it always plays big cards to suppress the opponents; even as a Peasant it suppresses its own teammate and ends up with nothing playable.
Third, it splits bombs: for example, with only the biggest bomb of 2s and a pair of 3s left in hand, it plays them directly as a four-with-two.
This is the behavior after one hour of training, with a loss of 0.9.
So how long did you train? What loss counts as success? Does it require some trick?

Running train.py keeps printing: After 0 (L:0 U:0 D:0) frames: @ 0.0 fps (avg@ 0.0 fps) (L:0.0 U:0.0 D:0.0)

Why are the results always 0 when training the model?
Found log directory: douzero_checkpoints/douzero
Saving arguments to douzero_checkpoints/douzero/meta.json
Path to meta file already exists. Not overriding meta.
Saving messages to douzero_checkpoints/douzero/out.log
Path to message file already exists. New data will be appended.
Saving logs data to douzero_checkpoints/douzero/logs.csv
Saving logs' fields to douzero_checkpoints/douzero/fields.csv
[INFO:13486 utils:118 2021-08-10 03:03:32,181] Device 0 Actor 0 started.
[INFO:13498 utils:118 2021-08-10 03:03:37,035] Device 0 Actor 1 started.
[INFO:13506 utils:118 2021-08-10 03:03:41,833] Device 0 Actor 2 started.
[INFO:13514 utils:118 2021-08-10 03:03:54,821] Device 0 Actor 3 started.
[INFO:13526 utils:118 2021-08-10 03:04:19,880] Device 0 Actor 4 started.
[INFO:13468 dmc:194 2021-08-10 03:04:24,883] Saving checkpoint to douzero_checkpoints/douzero/model.tar
[INFO:13468 dmc:243 2021-08-10 03:04:25,064] After 0 (L:0 U:0 D:0) frames: @ 0.0 fps (avg@ 0.0 fps) (L:0.0 U:0.0 D:0.0) Stats:
{'loss_landlord': 0,
'loss_landlord_down': 0,
'loss_landlord_up': 0,
'mean_episode_return_landlord': 0,
'mean_episode_return_landlord_down': 0,
'mean_episode_return_landlord_up': 0}
[INFO:13468 dmc:243 2021-08-10 03:04:30,069] After 0 (L:0 U:0 D:0) frames: @ 0.0 fps (avg@ 0.0 fps) (L:0.0 U:0.0 D:0.0) Stats:
{'loss_landlord': 0,
'loss_landlord_down': 0,
'loss_landlord_up': 0,
'mean_episode_return_landlord': 0,
'mean_episode_return_landlord_down': 0,
'mean_episode_return_landlord_up': 0}
[INFO:13468 dmc:243 2021-08-10 03:04:35,074] After 0 (L:0 U:0 D:0) frames: @ 0.0 fps (avg@ 0.0 fps) (L:0.0 U:0.0 D:0.0) Stats:
{'loss_landlord': 0,
'loss_landlord_down': 0,
'loss_landlord_up': 0,
'mean_episode_return_landlord': 0,
'mean_episode_return_landlord_down': 0,
'mean_episode_return_landlord_up': 0}
[INFO:13468 dmc:243 2021-08-10 03:04:40,080] After 0 (L:0 U:0 D:0) frames: @ 0.0 fps (avg@ 0.0 fps) (L:0.0 U:0.0 D:0.0) Stats:
{'loss_landlord': 0,
'loss_landlord_down': 0,
'loss_landlord_up': 0,
'mean_episode_return_landlord': 0,
'mean_episode_return_landlord_down': 0,
'mean_episode_return_landlord_up': 0}

A possible strategy

If training with all cards visible (open hands, to see how intelligent the trained agent can get), would Monte-Carlo Tree Search (MCTS) be a better algorithm?

Feasibility of adjusting the strategy

Currently the model is trained on the current player's hand, the move history, each player's play history, the numbers of cards remaining, and the bomb combinations, which is a very large amount of data, so training takes a very long time.

If it were changed to train on the current player's hand, the next player's hand, the previous player's hand, and the most recent move, would training be faster? In other words, everyone would play DouDizhu with open hands.

The LSTM would then no longer be initialized with the move history; it would take the open-hand features above directly, while the Linear part keeps 6 layers of size 512.

In the end the reward is still given according to the outcome of the game anyway.

I would like to know whether this kind of strategy is still effective for the cooperation and competition between the Landlord and the Peasants.
