
fastbert's People

Contributors

aestheticisma, autoliuweijie, bsll

fastbert's Issues

On the efficiency of inference

# inference

I don't think the slow inference is caused by nonzero.
Looking at the implementation, what actually happens is that after each transformer encoder layer, the samples that are already easy enough are removed from the current batch, i.e. the batch size keeps shrinking. However, as long as even a single sample has to reach the last layer, the total time is still longer than vanilla BERT.

Is there a way to schedule the samples that still need computation more flexibly? For example, build a pool: all samples that make it past layer 10 go into the pool and are scheduled together, so that each layer always computes with a fixed batch size. Fully utilizing the GPU this way should make inference much faster.
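For readers unfamiliar with the mechanism being discussed, here is a minimal sketch of the per-layer early exit described above (my own illustration, not the repo's code; the threshold is exposed as speed by analogy with the paper's terminology, and layers/classifiers are assumed to be lists of per-layer callables):

    import math
    import torch

    def adaptive_forward(layers, classifiers, hidden, speed):
        """Per-layer early exit: hidden is (batch, seq_len, dim); returns a list of
        per-sample probability vectors, produced at the layer where each sample exited."""
        batch = hidden.size(0)
        results = [None] * batch
        alive = torch.arange(batch)                        # original indices still in flight
        for depth, (layer, clf) in enumerate(zip(layers, classifiers)):
            hidden = layer(hidden)
            probs = torch.softmax(clf(hidden[:, 0]), dim=-1)      # classify on the [CLS] token
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
            uncertainty = entropy / math.log(probs.size(-1))      # normalized entropy in [0, 1]
            exit_mask = uncertainty < speed
            if depth == len(layers) - 1:                   # last layer: everything exits
                exit_mask = torch.ones_like(exit_mask)
            for idx, p in zip(alive[exit_mask].tolist(), probs[exit_mask]):
                results[idx] = p                           # this sample leaves the batch here
            keep = ~exit_mask
            hidden, alive = hidden[keep], alive[keep]      # the batch shrinks, as the issue notes
            if alive.numel() == 0:
                break
        return results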

I'm curious why you set segment_embedding's first dimension to 3.

First of all, thanks for kindly sharing this work.

Why did you set segment_embedding's first dimension to 3?

This is in FastBERT/uer/layers/embeddings.py (line 18).

Is this part flexible depending on the model architecture?
The paper does not cover this, which is why I'm asking.

Calculation of the FLOPs of a fully-connected layer

First, thanks for your work; it is very useful for speeding up inference with BERT-like models. I hope your paper gets published soon.
Something I'm confused about is the FLOPs of the dense layer in Section 4.1.
As far as I know, the FLOPs of a fully-connected layer with bias = 2 * I * O,

where I = number of input neurons and O = number of output neurons.

For the fully-connected layer 128 -> 128, FLOPs = 2 * 128 * 128 = 32,768.
In Table 1, however, the value is 4.2M, which is much higher than what I got.
Could you share how this number was calculated?
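As a side note, a quick sanity check of the 2 * I * O figure with thop (my own sketch; thop reports multiply-accumulates, i.e. I * O per output vector, so FLOPs are roughly twice the reported MACs):

    import torch
    import torch.nn as nn
    from thop import profile

    # MACs for one 128-dim token, and for a full 128-token sequence.
    per_token_macs, _ = profile(nn.Linear(128, 128), inputs=(torch.randn(1, 128),))
    per_seq_macs, _ = profile(nn.Linear(128, 128), inputs=(torch.randn(1, 128, 128),))

    print(2 * per_token_macs)   # 32768 FLOPs, matching 2 * I * O for a single token
    print(2 * per_seq_macs)     # ~4.19M FLOPs if the layer is applied to all 128 tokens

If the 4.2M in Table 1 is counted per sequence rather than per token, a sequence length of 128 would account for the gap, but that is only my speculation.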

On the effect of batch size on training and testing

Hello, I ran two experiments. In the first, the batch size for both training and testing was 1 (which makes training slow); in the second, the batch size for both was 32. The classification accuracy of the second experiment was about 2 percentage points lower than that of the first.
I am trying to understand why the batch size has such a large effect. Generally speaking, a larger batch size can improve a model's generalization ability, but in FastBERT, could a larger batch size have a strong impact on accuracy during the inference stage?
Have you run any experiments on how batch size affects training and testing, or do you have any suggestions?

Performance on multi-class tasks

Have the authors tried this model on more complex classification datasets? On a 40-class dataset I tried, the uncertainty of every sample stays above 0.95.

How to decide early stopping for the distillation stage

The distillation stage currently uses a fixed speed and a fixed number of epochs, with no early stopping. For a new dataset, how should these hyperparameters be determined, and how should the final model be selected?

What is the Weibo dataset?

Would you clarify what the Weibo dataset (one of the tasks benchmarked in the paper) is, or provide a copy in this repo?

Source code

Could you provide the source code soon? We would like to try it and follow your work. Thanks!

Accuracy falls far short of the paper when loading a BERT model fine-tuned on other data

Hello, and first of all thank you for open-sourcing this excellent work; I have one question. With speed=0.5 and batch_size=64, I loaded a BERT model that I had already fine-tuned on another dataset, but after running FastBERT the final accuracy dropped by 20%, as shown in the screenshots below.
(screenshots: 2023-04-22 182241, 2023-04-22 181936)
I am confused, since this should not happen. Does the code only work with the checkpoints mentioned in the README, and can a BERT model fine-tuned on other datasets not be used with this program? Thanks a lot for any clarification.

How is the model's speed-up ratio obtained?

Hello, and thank you for this excellent and genuinely refreshing work. After reading the paper I have one question: how is the speed-up ratio of FastBERT over BERT computed? I only found the FLOPs-counting code, but FLOPs and actual inference speed are not linearly related. Could you clarify this point? Many thanks.

How to understand the uncertainty formula

Thanks for sharing the code for the paper and for your contribution. While reading the code and the paper I had one question: how should the uncertainty formula in the Adaptive inference section be understood? Why can the uncertainty of a prediction be determined this way?
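For context, my reading is that the uncertainty used for adaptive inference is the entropy of the classifier's output distribution, normalized by log N so that it lies in [0, 1]; a minimal sketch under that assumption:

    import torch

    def uncertainty(probs: torch.Tensor) -> torch.Tensor:
        """Normalized entropy over N labels: 0 = completely confident (one-hot output),
        1 = maximally uncertain (uniform output). probs: (..., N) softmax outputs."""
        n = probs.size(-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        return entropy / torch.log(torch.tensor(float(n)))

Intuitively, the closer the predicted distribution is to uniform, the less the prediction at that layer can be trusted, so high-uncertainty samples are passed on to deeper layers.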

Missing attention FLOPs?

Hi,

I found that in MultiHeadedAttention, thop only counts the FLOPs of the linear layers and misses the attention operation itself.
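For a sense of scale, a rough estimate of what that omission amounts to (my own back-of-the-envelope numbers, with assumed BERT-base sizes, not figures from the repo):

    # Attention matmuls a default thop profile does not count.
    seq_len, hidden = 128, 768
    qk_macs = seq_len * seq_len * hidden    # scores = Q @ K^T
    av_macs = seq_len * seq_len * hidden    # context = attention_probs @ V
    flops_per_layer = 2 * (qk_macs + av_macs)
    print(f"~{flops_per_layer / 1e6:.1f} MFLOPs per layer not counted")   # ~50.3 MFLOPs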

Batch prediction

After installing fastbert, how should batch prediction be performed?

Inference time

What is the inference time for a single sentence on CPU?

A problem when reproducing the results

Hello, when reproducing your experiment (without any modification), the accuracy increases steadily while the backbone is being trained, but during the distillation stage the dev and test accuracy of every epoch is identical to that of the backbone's last epoch. Did I make a mistake somewhere?

Is multi-label classification feasible?

If my task is multi-label, with an independent binary classification on each dimension:
1. Can KL divergence still be used directly as the distillation loss?
2. Can the entropy over the class dimension be used to represent uncertainty?
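For what it's worth, one possible formulation of both pieces; this is only my own sketch and has not been confirmed by the authors:

    import torch

    def multilabel_kl(student_logits, teacher_logits):
        """Sum of per-label KL(teacher || student) between Bernoulli distributions."""
        p_t = torch.sigmoid(teacher_logits).clamp(1e-6, 1 - 1e-6)
        p_s = torch.sigmoid(student_logits).clamp(1e-6, 1 - 1e-6)
        kl = p_t * (p_t / p_s).log() + (1 - p_t) * ((1 - p_t) / (1 - p_s)).log()
        return kl.sum(dim=-1).mean()

    def multilabel_uncertainty(logits):
        """Mean normalized binary entropy across labels: 0 = certain, 1 = uncertain."""
        p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
        ent = -(p * p.log() + (1 - p) * (1 - p).log())
        return (ent / torch.log(torch.tensor(2.0))).mean(dim=-1)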

Dataset error

Running multi-class classification with your thucnews dataset works fine, but with my own dataset I keep hitting the error below. Does the dataset need any special preprocessing?
Traceback (most recent call last):
  File "run_fastbert.py", line 652, in <module>
    main()
  File "run_fastbert.py", line 589, in main
    result = evaluate(args, False, False)
  File "run_fastbert.py", line 445, in evaluate
    p = confusion[i,i].item()/confusion[i,:].sum().item()
ZeroDivisionError: division by zero
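For what it's worth, the crash means confusion[i,:].sum() is zero for some label index i, i.e. that row of the confusion matrix is empty (for example because the evaluated data does not contain every configured label). A defensive guard on that line could look like this (my own sketch, not a fix from the repo):

    def per_class_precision(confusion, i):
        """Hypothetical guard: return 0.0 when the row for class i is empty
        instead of raising ZeroDivisionError."""
        row_sum = confusion[i, :].sum().item()
        return confusion[i, i].item() / row_sum if row_sum > 0 else 0.0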

DistilBERT (3-layer)

Thanks for releasing the great repo.

Could you share the DistilBERT (3-layer) and DistilBERT (1-layer) models? They would be very helpful for me.

Thanks!

Best,
Deming Ye

On the FLOPs calculation in the paper

Why is the FLOPs of the BERT baseline reported as 21785M in the paper?
According to Table 1, shouldn't BERT's FLOPs be 1809.9 * 12 + 46.1 = 21765M?

Unable to access datasets

Hi,
I am unable to access the datasets linked in this repository. Could you help me access them?
Thanks

On the FLOPs calculation in the article

你好,

Do the BERT/DistilBERT FLOPs listed in Table 2 of the paper include the final classifier, i.e. the MLP attached after the CLS token? The FLOPs of that part should depend on the number of labels N, right? Also, which tool was used to compute the FLOPs in the paper?

Thanks

Low GPU inference speed-up when reproducing the results

Hello, I ran into two problems while reproducing the paper's results and would like to ask about them.

  1. When I train the student classifiers, the result is not as good as training them directly on the true labels;
  2. At inference time I get an 11x speed-up on CPU, but only about 2x on GPU.

Below are the details of my reproduction; not all of them are directly related to the questions above:

  • I use a Chinese binary-classification dataset, with 400k samples for training and 30k for testing; all numbers below are measured on the test set;
  • Both the teacher classifier and the student classifiers follow the settings in the paper, including the reduced dimension of 128;
  • The loss I use is the classic formula from Hinton's distillation paper, with the temperature set to 1;
    ps: I tried using a larger temperature at shallower layers to keep the Uncertainty from the paper monotonically decreasing, but the training results were poor, so I gave up;
  • Following the paper, the whole training process has 2 steps:
    1. Train the backbone and the teacher classifier with a cross-entropy loss;
    2. Freeze the backbone and teacher classifier, and train the student classifiers;
  • After training, the backbone matches the original model (acc 96%); the first student classifier loses 4 points of accuracy (92%); from shallow to deep layers, the accuracy shows an overall upward trend;
  • For inference I split the 12 layers into 12 small models, feeding each one's output into the next as input, so that the total amount of computation does not increase; Table 1 below gives the exact split.
  • At inference time, with speed = 0.2 the accuracy barely drops (about 0.1 percentage points); with speed = 0.5 the drop is noticeable (about 4 percentage points).
  • If speed is set to 0, i.e. every layer is evaluated, GPU inference takes 5x as long as the original BERT.
Blocks              Belongs to model
Embeddings          M0
Transformer-0       M0
Stu-Classifier-0    M0
Transformer-1       M1
Stu-Classifier-1    M1
...                 ...
Transformer-11      M11
Tea-Classifier      M11

Table 1: the model is split into 12 segments; M0-M11 correspond to the 12 classifiers.

Based on the last point above, my guess is that the GPU speed-up is not significant because the extra input/output operations between the split models take up too much time.
Thanks again for your research. I hope you can share more of your experience with inference; is there an adaptive-inference approach that does not require splitting the model?
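For reference, a minimal sketch of step 2 of the training procedure the report describes (freeze the backbone and teacher classifier, then distill each student classifier against the teacher's soft labels). This is my own illustration, not code from the repo, and it assumes a backbone callable that returns the per-layer hidden states and an optimizer built over the student-classifier parameters:

    import torch
    import torch.nn.functional as F

    def distill_students(backbone, teacher_clf, student_clfs, loader, optimizer, device):
        for p in backbone.parameters():
            p.requires_grad_(False)          # backbone is frozen
        for p in teacher_clf.parameters():
            p.requires_grad_(False)          # teacher classifier is frozen
        for input_ids, mask in loader:
            input_ids, mask = input_ids.to(device), mask.to(device)
            with torch.no_grad():
                hidden_states = backbone(input_ids, mask)       # one tensor per layer
                teacher_logp = F.log_softmax(teacher_clf(hidden_states[-1][:, 0]), dim=-1)
            loss = 0.0
            for clf, hidden in zip(student_clfs, hidden_states[:-1]):
                student_logp = F.log_softmax(clf(hidden[:, 0]), dim=-1)
                # KL(teacher || student): each student matches the teacher's soft labels
                loss = loss + F.kl_div(student_logp, teacher_logp,
                                       log_target=True, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()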

A parameter name typo in pypi/fastbert/fastbert.py

line 234:

        self._self_distillation(
            sentences_train, batch_size, learning_rate, epochs_num,
            warmup, report_steps, model_saving_pathm, sentences_dev,
            labels_dev, dev_speed, verbose
        )

model_saving_pathm should be model_saving_path.

How to calculate Bert FLOPs

Hi,

I have a very rookie question: how can I calculate the FLOPs of a BERT model?
I tried to use thop,

macs, params = profile(model, inputs=(input, ), 
                        custom_ops={YourModule: count_your_model})

but I don't know what to pass as input or as custom_ops={YourModule: count_your_model}.

For example, I want to run the models given by Huggingface. https://github.com/huggingface/transformers/tree/master/examples/text-classification

CUDA_VISIBLE_DEVICES=1 python run_glue.py \
  --model_type bert \
  --model_name_or_path /tmp/fintune_CoLA_output-bert/ \

I tried to put the macs, params = profile(model, inputs.....) call into run_glue.py, but I'm not sure where to put it.
I get errors like:
[WARN] Cannot find rule for <class 'torch.nn.modules.sparse.Embedding'>. Treat it as zero Macs and zero Params.
[WARN] Cannot find rule for <class 'torch.nn.modules.normalization.LayerNorm'>. Treat it as zero Macs and zero Params.

File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/transformers/trainer.py", line 677, in _training_step model, inputs=inputs, custom_ops={ File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/thop/profile.py", line 188, in profile model(*inputs) File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__ result = self.forward(*input, **kwargs) File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1144, in forward inputs_embeds=inputs_embeds, File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__ result = self.forward(*input, **kwargs) File "/home/zhk20002/anaconda2/envs/Py3.6/lib/python3.6/site-packages/transformers/modeling_bert.py", line 691, in forward input_shape = input_ids.size() AttributeError: 'str' object has no attribute 'size'

Do you have general code like this where I can measure the FLOPs of models such as BERT, RoBERTa, and DistilBERT by just changing --model_type?
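In case it helps, here is a minimal, standalone way thop can be pointed at a Hugging Face BERT classifier (my own sketch; the model name and the decision to profile only input_ids are assumptions, and thop will still warn about modules such as Embedding and LayerNorm that it has no counting rule for):

    import torch
    from thop import profile
    from transformers import BertForSequenceClassification, BertTokenizer

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # thop calls model(*inputs); unpacking the tokenizer's dict would pass its
    # keys (strings) to the model, which is exactly what triggers the
    # "'str' object has no attribute 'size'" error shown above. Pass the
    # input_ids tensor positionally instead.
    encoded = tokenizer("This is a test sentence.", return_tensors="pt")
    macs, params = profile(model, inputs=(encoded["input_ids"],))
    print(f"MACs: {macs / 1e9:.2f} G, params: {params / 1e6:.2f} M")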

Thanks!

Tony

Questions about experimental results

Hello, and thanks for your excellent work.
I experimented on the Ant Financial semantic-similarity corpus (from GLUE), with finetune_epochs = 20, distill_epochs = 10, learning_rate = 2e-5, and dev_speed = 0.5; after distillation, the dev_acc keeps hovering around 0.725.
If I want the post-distillation dev_acc to reach 0.9, do I need to increase the number of training epochs, or are there other factors at play?
Thanks in advance!
