
fastbert's People

Contributors

aestheticisma, autoliuweijie, bsll

fastbert's Issues

On the efficiency of inference

# inference

I don't think the slow inference is caused by nonzero.
Looking at the implementation, what actually happens is that after each transformer encoder layer, the samples that are already easy enough are removed from the current batch, i.e. the batch size keeps shrinking. However, as long as even a single sample has to reach the last layer, the total time is still longer than vanilla BERT.

Is there a way to schedule the samples that still need computation more flexibly? For example, build a pool: all samples that make it past layer 10 go into the pool and are scheduled together, so that each layer always computes with a fixed batch size. Fully utilizing the GPU this way should make inference much faster.
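For readers unfamiliar with the mechanism being discussed, here is a minimal sketch of the per-layer early exit described above (my own illustration, not the repo's code; the threshold is exposed as speed by analogy with the paper's terminology, and layers/classifiers are assumed to be lists of per-layer callables):

    import math
    import torch

    def adaptive_forward(layers, classifiers, hidden, speed):
        """Per-layer early exit: hidden is (batch, seq_len, dim); returns a list of
        per-sample probability vectors, produced at the layer where each sample exited."""
        batch = hidden.size(0)
        results = [None] * batch
        alive = torch.arange(batch)                        # original indices still in flight
        for depth, (layer, clf) in enumerate(zip(layers, classifiers)):
            hidden = layer(hidden)
            probs = torch.softmax(clf(hidden[:, 0]), dim=-1)      # classify on the [CLS] token
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
            uncertainty = entropy / math.log(probs.size(-1))      # normalized entropy in [0, 1]
            exit_mask = uncertainty < speed
            if depth == len(layers) - 1:                   # last layer: everything exits
                exit_mask = torch.ones_like(exit_mask)
            for idx, p in zip(alive[exit_mask].tolist(), probs[exit_mask]):
                results[idx] = p                           # this sample leaves the batch here
            keep = ~exit_mask
            hidden, alive = hidden[keep], alive[keep]      # the batch shrinks, as the issue notes
            if alive.numel() == 0:
                break
        return results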

I'm curious why you set segment_embedding's first dimension to 3.

First of all, thanks for kindly sharing this work.

Why did you set segment_embedding's first dimension to 3?

This is in FastBERT/uer/layers/embeddings.py (line 18).

Is this part flexible depending on the model architecture?
The paper does not cover this, which is why I'm asking.

Calculation of the FLOPs of a fully-connected layer

First, thanks for your work; it is very useful for speeding up inference with BERT-like models. I hope your paper gets published soon.
Something I'm confused about is the FLOPs of the dense layer in Section 4.1.
As far as I know, the FLOPs of a fully-connected layer with bias = 2 * I * O,

where I = number of input neurons and O = number of output neurons.

For the fully-connected layer 128 -> 128, FLOPs = 2 * 128 * 128 = 32,768.
In Table 1, however, the value is 4.2M, which is much higher than what I got.
Could you share how this number was calculated?
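As a side note, a quick sanity check of the 2 * I * O figure with thop (my own sketch; thop reports multiply-accumulates, i.e. I * O per output vector, so FLOPs are roughly twice the reported MACs):

    import torch
    import torch.nn as nn
    from thop import profile

    # MACs for one 128-dim token, and for a full 128-token sequence.
    per_token_macs, _ = profile(nn.Linear(128, 128), inputs=(torch.randn(1, 128),))
    per_seq_macs, _ = profile(nn.Linear(128, 128), inputs=(torch.randn(1, 128, 128),))

    print(2 * per_token_macs)   # 32768 FLOPs, matching 2 * I * O for a single token
    print(2 * per_seq_macs)     # ~4.19M FLOPs if the layer is applied to all 128 tokens

If the 4.2M in Table 1 is counted per sequence rather than per token, a sequence length of 128 would account for the gap, but that is only my speculation.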

On the effect of batch size on training and testing

Hello, I ran two experiments. In the first, the batch size for both training and testing was 1 (which makes training slow); in the second, the batch size for both was 32. The classification accuracy of the second experiment was about 2 percentage points lower than that of the first.
I am trying to understand why the batch size has such a large effect. Generally speaking, a larger batch size can improve a model's generalization ability, but in FastBERT, could a larger batch size have a strong impact on accuracy during the inference stage?
Have you run any experiments on how batch size affects training and testing, or do you have any suggestions?

Performance on multi-class tasks

Have the authors tried this model on more complex classification datasets? On a 40-class dataset I tried, the uncertainty of every sample stays above 0.95.

How to decide early stopping for the distillation stage

The distillation stage currently uses a fixed speed and a fixed number of epochs, with no early stopping. For a new dataset, how should these hyperparameters be determined, and how should the final model be selected?

What is the Weibo dataset?

Would you clarify what the Weibo dataset (one of the tasks benchmarked in the paper) is, or provide a copy in this repo?

Source code

Could you provide the source code soon? We would like to try it and follow your work. Thanks!

Accuracy falls far short of the paper when loading a BERT model fine-tuned on other data

Hello, and first of all thank you for open-sourcing this excellent work; I have one question. With speed=0.5 and batch_size=64, I loaded a BERT model that I had already fine-tuned on another dataset, but after running FastBERT the final accuracy dropped by 20%, as shown in the screenshots below.
(screenshots: 2023-04-22 182241, 2023-04-22 181936)
I am confused, since this should not happen. Does the code only work with the checkpoints mentioned in the README, and can a BERT model fine-tuned on other datasets not be used with this program? Thanks a lot for any clarification.

How is the model's speed-up ratio obtained?

Hello, and thank you for this excellent and genuinely refreshing work. After reading the paper I have one question: how is the speed-up ratio of FastBERT over BERT computed? I only found the FLOPs-counting code, but FLOPs and actual inference speed are not linearly related. Could you clarify this point? Many thanks.

How to understand the uncertainty formula

Thanks for sharing the code for the paper and for your contribution. While reading the code and the paper I had one question: how should the uncertainty formula in the Adaptive inference section be understood? Why can the uncertainty of a prediction be determined this way?
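For context, my reading is that the uncertainty used for adaptive inference is the entropy of the classifier's output distribution, normalized by log N so that it lies in [0, 1]; a minimal sketch under that assumption:

    import torch

    def uncertainty(probs: torch.Tensor) -> torch.Tensor:
        """Normalized entropy over N labels: 0 = completely confident (one-hot output),
        1 = maximally uncertain (uniform output). probs: (..., N) softmax outputs."""
        n = probs.size(-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        return entropy / torch.log(torch.tensor(float(n)))

Intuitively, the closer the predicted distribution is to uniform, the less the prediction at that layer can be trusted, so high-uncertainty samples are passed on to deeper layers.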

Missing attention FLOPs?

Hi,

I found that in MultiHeadedAttention, thop only counts the FLOPs of the linear layers and misses the attention operation itself.
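For a sense of scale, a rough estimate of what that omission amounts to (my own back-of-the-envelope numbers, with assumed BERT-base sizes, not figures from the repo):

    # Attention matmuls a default thop profile does not count.
    seq_len, hidden = 128, 768
    qk_macs = seq_len * seq_len * hidden    # scores = Q @ K^T
    av_macs = seq_len * seq_len * hidden    # context = attention_probs @ V
    flops_per_layer = 2 * (qk_macs + av_macs)
    print(f"~{flops_per_layer / 1e6:.1f} MFLOPs per layer not counted")   # ~50.3 MFLOPs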

Batch prediction

After installing fastbert, how should batch prediction be performed?

Inference time

What is the inference time for a single sentence on CPU?

A problem when reproducing the results

Hello, when reproducing your experiment (without any modification), the accuracy increases steadily while the backbone is being trained, but during the distillation stage the dev and test accuracy of every epoch is identical to that of the backbone's last epoch. Did I make a mistake somewhere?

Is multi-label classification feasible?

If my task is multi-label, with an independent binary classification on each dimension:
1. Can KL divergence still be used directly as the distillation loss?
2. Can the entropy over the class dimension be used to represent uncertainty?
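For what it's worth, one possible formulation of both pieces; this is only my own sketch and has not been confirmed by the authors:

    import torch

    def multilabel_kl(student_logits, teacher_logits):
        """Sum of per-label KL(teacher || student) between Bernoulli distributions."""
        p_t = torch.sigmoid(teacher_logits).clamp(1e-6, 1 - 1e-6)
        p_s = torch.sigmoid(student_logits).clamp(1e-6, 1 - 1e-6)
        kl = p_t * (p_t / p_s).log() + (1 - p_t) * ((1 - p_t) / (1 - p_s)).log()
        return kl.sum(dim=-1).mean()

    def multilabel_uncertainty(logits):
        """Mean normalized binary entropy across labels: 0 = certain, 1 = uncertain."""
        p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
        ent = -(p * p.log() + (1 - p) * (1 - p).log())
        return (ent / torch.log(torch.tensor(2.0))).mean(dim=-1)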

Dataset error

Running multi-class classification with your thucnews dataset works fine, but with my own dataset I keep hitting the error below. Does the dataset need any special preprocessing?
Traceback (most recent call last):
  File "run_fastbert.py", line 652, in <module>
    main()
  File "run_fastbert.py", line 589, in main
    result = evaluate(args, False, False)
  File "run_fastbert.py", line 445, in evaluate
    p = confusion[i,i].item()/confusion[i,:].sum().item()
ZeroDivisionError: division by zero
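For what it's worth, the crash means confusion[i,:].sum() is zero for some label index i, i.e. that row of the confusion matrix is empty (for example because the evaluated data does not contain every configured label). A defensive guard on that line could look like this (my own sketch, not a fix from the repo):

    def per_class_precision(confusion, i):
        """Hypothetical guard: return 0.0 when the row for class i is empty
        instead of raising ZeroDivisionError."""
        row_sum = confusion[i, :].sum().item()
        return confusion[i, i].item() / row_sum if row_sum > 0 else 0.0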

DistilBERT (3-layer)

Thanks for releasing the great repo.

Could you share the DistilBERT (3-layer) and DistilBERT (1-layer) models? They would be very helpful for me.

Thanks!

Best,
Deming Ye

On the FLOPs calculation in the paper

Why is the FLOPs of the BERT baseline reported as 21785M in the paper?
According to Table 1, shouldn't BERT's FLOPs be 1809.9 * 12 + 46.1 = 21765M?

Unable to access datasets

Hi,
I am unable to access the datasets linked in this repository. Could you help me access them?
Thanks

On the FLOPs calculation in the article

你好,

Do the BERT/DistilBERT FLOPs listed in Table 2 of the paper include the final classifier, i.e. the MLP attached after the CLS token? The FLOPs of that part should depend on the number of labels N, right? Also, which tool was used to compute the FLOPs in the paper?

Thanks

Low GPU inference speed-up when reproducing the results

Hello, I ran into two problems while reproducing the paper's results and would like to ask about them.

  1. When I train the student classifiers, the result is not as good as training them directly on the true labels;
  2. At inference time I get an 11x speed-up on CPU, but only about 2x on GPU.

Below are the details of my reproduction; not all of them are directly related to the questions above:

  • I use a Chinese binary-classification dataset, with 400k samples for training and 30k for testing; all numbers below are measured on the test set;
  • Both the teacher classifier and the student classifiers follow the settings in the paper, including the reduced dimension of 128;
  • The loss I use is the classic formula from Hinton's distillation paper, with the temperature set to 1;
    ps: I tried using a larger temperature at shallower layers to keep the Uncertainty from the paper monotonically decreasing, but the training results were poor, so I gave up;
  • Following the paper, the whole training process has 2 steps:
    1. Train the backbone and the teacher classifier with a cross-entropy loss;
    2. Freeze the backbone and teacher classifier, and train the student classifiers;
  • After training, the backbone matches the original model (acc 96%); the first student classifier loses 4 points of accuracy (92%); from shallow to deep layers, the accuracy shows an overall upward trend;
  • For inference I split the 12 layers into 12 small models, feeding each one's output into the next as input, so that the total amount of computation does not increase; Table 1 below gives the exact split.
  • At inference time, with speed = 0.2 the accuracy barely drops (about 0.1 percentage points); with speed = 0.5 the drop is noticeable (about 4 percentage points).
  • If speed is set to 0, i.e. every layer is evaluated, GPU inference takes 5x as long as the original BERT.
Blocks              Belongs to model
Embeddings          M0
Transformer-0       M0
Stu-Classifier-0    M0
Transformer-1       M1
Stu-Classifier-1    M1
...                 ...
Transformer-11      M11
Tea-Classifier      M11

Table 1: the model is split into 12 segments; M0-M11 correspond to the 12 classifiers.

Based on the last point above, my guess is that the GPU speed-up is not significant because the extra input/output operations between the split models take up too much time.
Thanks again for your research. I hope you can share more of your experience with inference; is there an adaptive-inference approach that does not require splitting the model?
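For reference, a minimal sketch of step 2 of the training procedure the report describes (freeze the backbone and teacher classifier, then distill each student classifier against the teacher's soft labels). This is my own illustration, not code from the repo, and it assumes a backbone callable that returns the per-layer hidden states and an optimizer built over the student-classifier parameters:

    import torch
    import torch.nn.functional as F

    def distill_students(backbone, teacher_clf, student_clfs, loader, optimizer, device):
        for p in backbone.parameters():
            p.requires_grad_(False)          # backbone is frozen
        for p in teacher_clf.parameters():
            p.requires_grad_(False)          # teacher classifier is frozen
        for input_ids, mask in loader:
            input_ids, mask = input_ids.to(device), mask.to(device)
            with torch.no_grad():
                hidden_states = backbone(input_ids, mask)       # one tensor per layer
                teacher_logp = F.log_softmax(teacher_clf(hidden_states[-1][:, 0]), dim=-1)
            loss = 0.0
            for clf, hidden in zip(student_clfs, hidden_states[:-1]):
                student_logp = F.log_softmax(clf(hidden[:, 0]), dim=-1)
                # KL(teacher || student): each student matches the teacher's soft labels
                loss = loss + F.kl_div(student_logp, teacher_logp,
                                       log_target=True, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()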

A parameter name typo in pypi/fastbert/fastbert.py

line 234:

        self._self_distillation(
            sentences_train, batch_size, learning_rate, epochs_num,
            warmup, report_steps, model_saving_pathm, sentences_dev,
            labels_dev, dev_speed, verbose
        )

model_saving_pathm should be model_saving_path.

How to calculate Bert FLOPs

Hi,

I have a very rookie question: how can I calculate the FLOPs of a BERT model?
I tried to use thop,

macs, params = profile(model, inputs=(input, ), 
                        custom_ops={YourModule: count_your_model})

but I don't know what to pass as input or as custom_ops={YourModule: count_your_model}.

For example, I want to run the models given by Huggingface. https://github.com/huggingface/transformers/tree/master/examples/text-classification

CUDA_VISIBLE_DEVICES=1 python run_glue.py \
  --model_type bert \
  --model_name_or_path /tmp/fintune_CoLA_output-bert/ \

I tried to put the macs, params = profile(model, inputs.....) call into run_glue.py, but I'm not sure where to put it.
I get errors like:
[WARN] Cannot find rule for <class 'torch.nn.modules.sparse.Embedding'>. Treat it as zero Macs and zero Params.
[WARN] Cannot find rule for <class 'torch.nn.modules.normalization.LayerNorm'>. Treat it as zero Macs and zero Params.

File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/transformers/trainer.py", line 677, in _training_step model, inputs=inputs, custom_ops={ File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/thop/profile.py", line 188, in profile model(*inputs) File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__ result = self.forward(*input, **kwargs) File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1144, in forward inputs_embeds=inputs_embeds, File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__ result = self.forward(*input, **kwargs) File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/transformers/modeling_bert.py", line 691, in forward input_shape = input_ids.size() AttributeError: 'str' object has no attribute 'size'

Do you have general code like this where I can measure the FLOPs of models such as BERT, RoBERTa, and DistilBERT by just changing --model_type?
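In case it helps, here is a minimal, standalone way thop can be pointed at a Hugging Face BERT classifier (my own sketch; the model name and the decision to profile only input_ids are assumptions, and thop will still warn about modules such as Embedding and LayerNorm that it has no counting rule for):

    import torch
    from thop import profile
    from transformers import BertForSequenceClassification, BertTokenizer

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # thop calls model(*inputs); unpacking the tokenizer's dict would pass its
    # keys (strings) to the model, which is exactly what triggers the
    # "'str' object has no attribute 'size'" error shown above. Pass the
    # input_ids tensor positionally instead.
    encoded = tokenizer("This is a test sentence.", return_tensors="pt")
    macs, params = profile(model, inputs=(encoded["input_ids"],))
    print(f"MACs: {macs / 1e9:.2f} G, params: {params / 1e6:.2f} M")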

Thanks!

Tony

Questions about experimental results

Hello, and thanks for your excellent work.
I experimented on the Ant Financial semantic-similarity corpus (from GLUE), with finetune_epochs = 20, distill_epochs = 10, learning_rate = 2e-5, and dev_speed = 0.5; after distillation, the dev_acc keeps hovering around 0.725.
If I want the post-distillation dev_acc to reach 0.9, do I need to increase the number of training epochs, or are there other factors at play?
Thanks in advance!
