The 3m-asr from tencent-ailab

Any plan to release a pretrained model?

CTC loss does not converge when training conformer_base using 2 GPUs

Hi, thanks for open-sourcing 3m-asr. I am quite interested in this architecture and want to test it on Aishell1 first.
When I follow run.sh, the CTC loss does not converge in the conformer_base training stage. As far as I can tell, conformer_base is just a larger version of conformer_embedding, yet the CTC loss converges smoothly for the latter.

One difference between the two cases: I trained conformer_embedding on 1 GPU with the default batch_size=32. That setting ran out of memory (OOM) for conformer_base, so I trained it on 2 GPUs in DDP mode with batch_size=16 per GPU in order to keep the same effective batch size.
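To make the batch-size comparison above concrete, here is a minimal sketch of the arithmetic. The function name and the grad_accum_steps parameter are illustrative assumptions and are not taken from the 3m-asr code.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step under data-parallel training.

    Hypothetical helper for illustration only; not part of 3m-asr.
    """
    return per_gpu_batch * num_gpus * grad_accum_steps


# conformer_embedding run: 1 GPU, batch_size=32
single_gpu = effective_batch_size(per_gpu_batch=32, num_gpus=1)

# conformer_base run: 2 GPUs in DDP mode, batch_size=16 per GPU
two_gpu_ddp = effective_batch_size(per_gpu_batch=16, num_gpus=2)

print(single_gpu, two_gpu_ddp)
```

Both setups give the same effective batch size, so batch size alone may not explain the difference. Note that any per-process statistics (for example batch normalization, if the model uses it without synchronization) would still be computed over 16 samples rather than 32.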

The loss curves are attached (left: conformer_base, right: conformer_embedding). Any ideas? Another question: is it feasible to train conformer_moe from scratch, without first pre-training conformer_embedding and conformer_base? Thanks in advance.

The performance is not as good as what you posted

Hi, I have just completed training and decoding; it took a very long time.
The performance is not as good as what you posted. Is anything wrong with my training?

Error loading the model at test time

2022-07-22 02:07:55 UTC -- INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group

2022-07-22 02:07:55 UTC -- INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic__uqeubqu/none_uv4sn3l_/attempt_0/0/error.json

2022-07-22 02:08:37 UTC -- Traceback (most recent call last):

2022-07-22 02:08:37 UTC -- File "bin/recognize.py", line 149, in <module>

2022-07-22 02:08:37 UTC -- main(args)

2022-07-22 02:08:37 UTC -- File "bin/recognize.py", line 69, in main

2022-07-22 02:08:37 UTC -- model.load_state_dict_comm(param_dict)

2022-07-22 02:08:37 UTC -- File "/code/trainer/model/conformer_aed_moe_catEmbed.py", line 73, in load_state_dict_comm

2022-07-22 02:08:37 UTC -- return ConformerMoeEncoder.load_state_dict_comm(self, state_dict)

2022-07-22 02:08:37 UTC -- File "/code/trainer/model/conformer_moe_catEmbed.py", line 277, in load_state_dict_comm

2022-07-22 02:08:37 UTC -- return self.load_state_dict(whole_model_state)

2022-07-22 02:08:37 UTC -- File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict

2022-07-22 02:08:37 UTC -- raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(

2022-07-22 02:08:37 UTC -- RuntimeError: Error(s) in loading state_dict for Net:

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.0.feed_forward.router_weights: copying a param with shape torch.Size([1024, 32]) from checkpoint, the shape in current model is torch.Size([1024, 4]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.0.feed_forward.experts.w_1.weight: copying a param with shape torch.Size([32, 2048, 512]) from checkpoint, the shape in current model is torch.Size([4, 2048, 512]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.0.feed_forward.experts.w_1.bias: copying a param with shape torch.Size([32, 2048]) from checkpoint, the shape in current model is torch.Size([4, 2048]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.0.feed_forward.experts.w_2.weight: copying a param with shape torch.Size([32, 512, 2048]) from checkpoint, the shape in current model is torch.Size([4, 512, 2048]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.0.feed_forward.experts.w_2.bias: copying a param with shape torch.Size([32, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.1.feed_forward.router_weights: copying a param with shape torch.Size([1024, 32]) from checkpoint, the shape in current model is torch.Size([1024, 4]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.1.feed_forward.experts.w_1.weight: copying a param with shape torch.Size([32, 2048, 512]) from checkpoint, the shape in current model is torch.Size([4, 2048, 512]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.1.feed_forward.experts.w_1.bias: copying a param with shape torch.Size([32, 2048]) from checkpoint, the shape in current model is torch.Size([4, 2048]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.1.feed_forward.experts.w_2.weight: copying a param with shape torch.Size([32, 512, 2048]) from checkpoint, the shape in current model is torch.Size([4, 512, 2048]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.1.feed_forward.experts.w_2.bias: copying a param with shape torch.Size([32, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).

2022-07-22 02:08:37 UTC -- size mismatch for encoder.blocks.2.feed_forward.router_weights: copying a param with shape torch.Size([1024, 32]) from checkpoint, the shape in current model is torch.Size([1024, 4]).
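In the log above, every mismatched dimension is 32 in the checkpoint and 4 in the current model, which suggests (as an assumption, not confirmed by the authors) that the model was built at test time with a different number of experts than it was trained with. A minimal sketch for listing such mismatches before calling load_state_dict follows. The helper name is hypothetical, and the shapes are copied from the log.

```python
from typing import List, Mapping, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Mapping[str, Shape],
    model_shapes: Mapping[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for parameters whose shapes differ.

    Hypothetical helper for illustration only; not part of 3m-asr.
    With PyTorch, the inputs could be built as
    {k: tuple(v.shape) for k, v in state_dict.items()}.
    """
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Only compare parameters that exist on both sides.
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes taken from the first two "size mismatch" lines of the log.
checkpoint_shapes = {
    "encoder.blocks.0.feed_forward.router_weights": (1024, 32),
    "encoder.blocks.0.feed_forward.experts.w_1.weight": (32, 2048, 512),
}
model_shapes = {
    "encoder.blocks.0.feed_forward.router_weights": (1024, 4),
    "encoder.blocks.0.feed_forward.experts.w_1.weight": (4, 2048, 512),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, ckpt, model in mismatches:
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

If the report shows a consistent 32 versus 4 difference as it does here, comparing the expert-count setting in the training and decoding configurations would be a reasonable first check.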

not a streaming model

I suspect that this model is not a streaming model. Is that right?
