
albert_zh's Introduction

albert_zh

An implementation of ALBERT (A Lite BERT for Self-Supervised Learning of Language Representations) in TensorFlow

ALBERT is based on BERT, with several improvements. It achieves state-of-the-art performance on major benchmarks with 30% fewer parameters.

albert_base_zh has only about ten percent of the parameters of the original BERT model, while most of the accuracy is retained.

Different versions of the ALBERT pre-trained models for Chinese, including TensorFlow, PyTorch, and Keras, are available now.

ALBERT models pre-trained on massive Chinese corpora: fewer parameters, better results. A small pre-trained model can handle 13 NLP tasks, and ALBERT's three major changes top the GLUE benchmark.

clueai toolkit: build a custom NLP API (zero-shot learning) in three lines of code and three minutes.

One-click runs of 10 datasets and 9 baseline models, with detailed comparisons of model performance across tasks: see the CLUE benchmark.

One-click run of the CLUE Chinese tasks: 6 Chinese classification or sentence-pair tasks (new).

Usage:
1. Clone the project
   git clone https://github.com/brightmart/albert_zh.git
2. Run the one-click script (GPU): it automatically downloads the model and all task data and starts running.
   bash run_classifier_clue.sh
   The script downloads all task data, searches for the best model for each task, and then runs the tests to produce submission results.

Download Pre-trained Chinese Models

1. albert_tiny_zh, albert_tiny_zh (trained longer, on a cumulative 2 billion samples): file size 16M, 4M parameters

Training and inference are roughly 10x faster while accuracy is largely retained; the model is 1/25 the size of BERT. On the test set of the LCQMC semantic-similarity dataset it reaches 85.4%, only 1.5 points below bert_base.

LCQMC training uses the following parameters: --max_seq_length=128 --train_batch_size=64   --learning_rate=1e-4   --num_train_epochs=5

albert_tiny is trained on the same large-scale Chinese corpus, but with only 4 layers and much smaller hidden size and other vector dimensions; try learning rates from {2e-5, 6e-5, 1e-4} for better results.

[Use cases] Relatively simple or latency-sensitive tasks, such as sentence-pair tasks like semantic similarity, or classification; for harder tasks such as reading comprehension, use one of the larger models.

 For example, the model can be deployed on mobile devices with [Tensorflow Lite](https://www.tensorflow.org/lite); this is covered [later](#use_tflite) in this document, including how to convert the model to the TensorFlow Lite format and how to benchmark it.

 One-click run of albert_tiny_zh (Linux, LCQMC task):
 1) git clone https://github.com/brightmart/albert_zh
 2) cd albert_zh
 3) bash run_classifier_lcqmc.sh

1.1. albert_tiny_google_zh (trained on a cumulative 1 billion samples, Google version): model size 16M, performance on par with albert_tiny_zh

1.2. albert_small_google_zh (trained on a cumulative 1 billion samples, Google version)

 4x faster than bert_base; only 0.9 points below BERT on the LCQMC test set; 18.5M after removing the Adam states; for usage, see Fine-tuning on Downstream Tasks below.

2. albert_large_zh: 24 layers, file size 64M

Parameter count and model size are one sixth of bert_base; 0.2 points above bert_base on the test set of LCQMC, a colloquial-text similarity dataset.

3. albert_base_zh (trained on an additional 150 million instances, i.e. 36k steps * batch_size 4096); albert_base_zh (small trial version): 12M parameters, 12 layers, size 40M

One tenth the parameters and one tenth the size of bert_base; roughly 0.6 to 1 point below bert_base on the LCQMC test set;
compared with no pre-training, albert_base gains 14 points.

4. albert_xlarge_zh_177k; albert_xlarge_zh_183k (try this one first): 24 layers, file size 230M

Half the parameters and half the size of bert_base; requires a large GPU; a full benchmark comparison will be added later; batch_size should not be too small, otherwise accuracy may suffer.

Quick Load

With Huggingface-Transformers 2.2.2, the models above can be loaded easily.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")

The corresponding MODEL_NAME values are:

| Model | MODEL_NAME |
| --- | --- |
| albert_tiny_google_zh | voidful/albert_chinese_tiny |
| albert_small_google_zh | voidful/albert_chinese_small |
| albert_base_zh (from google) | voidful/albert_chinese_base |
| albert_large_zh (from google) | voidful/albert_chinese_large |
| albert_xlarge_zh (from google) | voidful/albert_chinese_xlarge |
| albert_xxlarge_zh (from google) | voidful/albert_chinese_xxlarge |

More examples of using ALBERT via transformers
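
For reference, a minimal end-to-end sketch (not part of the original README): it assumes the voidful/albert_chinese_tiny checkpoint listed above and a transformers 2.x install with PyTorch. These checkpoints ship a BERT-style vocab.txt rather than a sentencepiece model, so BertTokenizer is commonly used for tokenization.

import torch
from transformers import BertTokenizer, AutoModel

# BertTokenizer because albert_chinese_* provides vocab.txt (an assumption; verify against the model card)
tokenizer = BertTokenizer.from_pretrained("voidful/albert_chinese_tiny")
model = AutoModel.from_pretrained("voidful/albert_chinese_tiny")
model.eval()

input_ids = tokenizer.encode("我喜欢妈妈做的汤", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)
sequence_output = outputs[0]   # shape: [batch_size, seq_len, hidden_size]
print(sequence_output.shape)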

Pre-training

Generate tfrecords Files

Run the following command. The project ships with a sample text file (data/news_zh_1.txt).

   bash create_pretrain_data.sh

If you have many text files, you can pass arguments to generate multiple tfrecords files; a sketch follows the note below.

Support for English and other non-Chinese languages:
If you are pre-training on English or another non-Chinese language,
set the non_chinese hyperparameter to True in create_pretraining_data.py;
otherwise, by default, Chinese pre-training with Chinese whole word masking is performed.
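
A minimal sketch of generating one tfrecord per input text file. It assumes create_pretraining_data.py keeps the BERT-style flags (input_file, output_file, vocab_file, max_seq_length, max_predictions_per_seq, masked_lm_prob, dupe_factor) in addition to the non_chinese flag mentioned above; the corpus_part_*.txt names are placeholders. Adjust paths and values to your setup.

for f in ./data/corpus_part_*.txt; do
  python3 create_pretraining_data.py \
    --input_file=$f \
    --output_file=./data/$(basename $f .txt).tfrecord \
    --vocab_file=./albert_config/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=512 \
    --max_predictions_per_seq=51 \
    --masked_lm_prob=0.10 \
    --dupe_factor=10 \
    --non_chinese=False   # set to True for English or other non-Chinese corpora
done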

Run pre-training on GPU/TPU with the following commands.

GPU (brightmart version, tiny model):
export BERT_BASE_DIR=./albert_tiny_zh
nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord  \
--output_dir=./my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_tiny.json \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=51 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176    \
--save_checkpoints_steps=2000  --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

GPU (Google version, small model):
export BERT_BASE_DIR=./albert_small_zh_google
nohup python3 run_pretraining_google.py --input_file=./data/tf*.tfrecord --eval_batch_size=64 \
--output_dir=./my_new_model_path --do_train=True --do_eval=True --albert_config_file=$BERT_BASE_DIR/albert_config_small_google.json  --export_dir=./my_new_model_path_export \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=20 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176   \
--save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt

For TPU, add flags like the following:
    --use_tpu=True  --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a
    
Note: if you train from scratch, you can leave out init_checkpoint;
if you train on top of an existing model, set BERT_BASE_DIR and make sure bert_config_file and init_checkpoint point to the corresponding files;
for domain-specific pre-training, depending on the data size, you may not need to train for very long.
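
For reference, a from-scratch run (no init_checkpoint) might look like the sketch below; the flags mirror the brightmart-version command above, and the config path under albert_config/ is an assumption.

python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
  --output_dir=./my_new_model_path --do_train=True --do_eval=True \
  --bert_config_file=./albert_config/albert_config_tiny.json \
  --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=51 \
  --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
  --save_checkpoints_steps=2000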

环境 Environment

Use Python 3 + TensorFlow 1.x

e.g. TensorFlow 1.14 or 1.15 (the TFLite section below assumes >= 1.14)

Fine-tuning on Downstream Tasks

Using TensorFlow:

Take albert_base on the LCQMC task as an example. LCQMC is a text-similarity prediction task over colloquially phrased sentence pairs.

We use the LCQMC dataset for fine-tuning; it is a colloquial-language corpus used to train and predict the semantic similarity of a pair of sentences.

Download the LCQMC dataset, which contains train, dev, and test sets. The training set contains 240k colloquial Chinese sentence pairs labeled 1 or 0, where 1 means the sentences are semantically similar and 0 means they are not.

Fine-tune on LCQMC by running the following commands:

1. Clone this project:
      
      git clone https://github.com/brightmart/albert_zh.git
      
2. Fine-tuning by running the following command.
    brightmart version, tiny model
    export BERT_BASE_DIR=./albert_tiny_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier.py   --task_name=lcqmc_pair   --do_train=true   --do_eval=true   --data_dir=$TEXT_DIR   --vocab_file=./albert_config/vocab.txt  \
    --bert_config_file=./albert_config/albert_config_tiny.json --max_seq_length=128 --train_batch_size=64   --learning_rate=1e-4  --num_train_epochs=5 \
    --output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &
    
    Google version, small model
    export BERT_BASE_DIR=./albert_small_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier_sp_google.py --task_name=lcqmc_pair   --do_train=true   --do_eval=true   --data_dir=$TEXT_DIR   --vocab_file=./albert_config/vocab.txt  \
    --albert_config_file=$BERT_BASE_DIR/albert_config_small_google.json --max_seq_length=128 --train_batch_size=64   --learning_rate=1e-4   --num_train_epochs=5 \
    --output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

Notice:
    1) You need to download a pre-trained Chinese ALBERT model and place it in the current project directory (assumed to be named albert_tiny_zh), and download the LCQMC dataset and place it in the current project directory (assumed to be named lcqmc).

    2) For fine-tuning, you can try adding a small amount of dropout (e.g. 0.1) by changing the attention_probs_dropout_prob and hidden_dropout_prob parameters in albert_config_xxx.json. By default, dropout is set to zero.

    3) You can try different learning rates {2e-5, 6e-5, 1e-4} for better performance.
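
Once fine-tuning has finished, test-set predictions can be produced with a command along the lines of the sketch below; it assumes run_classifier.py keeps the BERT-style --do_predict flag and reads the latest checkpoint from output_dir.

    export TEXT_DIR=./lcqmc
    python3 run_classifier.py --task_name=lcqmc_pair --do_predict=true --data_dir=$TEXT_DIR \
      --vocab_file=./albert_config/vocab.txt --bert_config_file=./albert_config/albert_config_tiny.json \
      --max_seq_length=128 --output_dir=./albert_lcqmc_checkpoints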

Updates

******* 2019-11-03: add Google versions of albert_small and albert_tiny;

add a method to deploy albert_tiny to mobile devices, with only 0.1 second inference time for sequence length 128 and about 60M of memory *******

***** 2019-10-30: add a simple guide about converting the model to Tensorflow Lite for edge deployment *****

***** 2019-10-15: albert_tiny_zh, 10 times faster than bert_base for training and inference, with accuracy largely retained *****

***** 2019-10-07: more models of albert *****

add albert_xlarge_zh; albert_base_zh_additional_steps, trained with more instances

***** 2019-10-04: PyTorch and Keras versions of albert were supported *****

a. Convert to the PyTorch version and run your tasks through albert_pytorch

b. Load the pre-trained model with Keras in one line of code through bert4keras

c. Use ALBERT with TensorFlow 2.0: use or load the pre-trained model with tf2.0 through bert-for-tf2

Releasing albert_xlarge on 6th Oct

***** 2019-10-02: albert_large_zh,albert_base_zh *****

Released albert_base_zh with only 10% of bert_base's parameters, a small model (40M) that trains very fast.

Released albert_large_zh with only 16% of bert_base's parameters (64M).

***** 2019-09-28: codes and test functions *****

Added code and test functions for the three main changes of ALBERT relative to BERT

Introduction of ALBERT

ALBERT is an improved version of BERT. Unlike other recent state-of-the-art models, it pre-trains a smaller model that achieves better results with fewer parameters.

It makes three main changes to BERT:

1) Factorized embedding parameterization

 O(V * H) to O(V * E + E * H)

 Take ALBERT_xxlarge as an example: V=30000, H=4096, E=128.

 The original embedding parameters number V * H = 30000 * 4096 ≈ 123 million; after factorization they become V * E + E * H = 30000*128 + 128*4096 ≈ 3.84 million + 0.52 million ≈ 4.36 million.

 The embedding-related parameter count before the change is therefore about 28 times the count after the change.
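
A quick sanity check of these numbers (added here for illustration; not from the original README):

# parameter count of the embedding table before and after factorization
V, H, E = 30000, 4096, 128
before = V * H                # 122,880,000  (~1.23e8)
after = V * E + E * H         # 3,840,000 + 524,288 = 4,364,288
print(before, after, round(before / after, 1))   # ratio ~ 28.2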

2) Cross-layer parameter sharing

 Parameter sharing significantly reduces the number of parameters. Sharing can cover the feed-forward layers and/or the attention layers; sharing the attention-layer parameters hurts accuracy a bit less.
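
A minimal TF1-style sketch of the idea (an illustration, not the repo's modeling.py): every call reuses the same variables, so stacking more layers adds no new parameters beyond the first.

import tensorflow as tf  # TensorFlow 1.x

def shared_transformer_layer(x, hidden_size=312):
    # stand-in for attention + feed-forward; the real layers are shared the same way
    with tf.variable_scope("shared_layer", reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(x, hidden_size, activation=tf.nn.relu, name="ffn")
        return tf.layers.dense(h, hidden_size, name="out")

x = tf.placeholder(tf.float32, [None, 128, 312])
out = x
for _ in range(4):   # 4 layers, all pointing at a single set of weights
    out = shared_transformer_layer(out)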

3) Inter-sentence coherence loss (sentence-order prediction)

 A segment-order coherence task is used. Positive examples are two consecutive text segments from the same document; negative examples are the same two consecutive segments with their positions swapped.

 This avoids the original NSP task, which implicitly includes topic prediction and is therefore too easy.

  We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss 
  based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic 
  prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the 
  same technique as BERT (two consecutive segments from the same document), and as negative examples the same two 
  consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about
  discourse-level coherence properties. 
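
An illustrative sketch of how SOP training pairs can be built (for exposition only; the repo's create_pretraining_data.py implements this during data generation):

import random

def make_sop_example(segment_a, segment_b, rng=random):
    # segment_a and segment_b are consecutive segments from the same document
    if rng.random() < 0.5:
        return (segment_a, segment_b), 1   # label 1: original order (positive)
    return (segment_b, segment_a), 0       # label 0: swapped order (negative)

pair, label = make_sop_example("今天天气很好", "我们一起去公园散步")
print(pair, label)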

Other changes:

1) Remove dropout to enlarge the capacity of the model.
    Even after training for 1 million steps, the largest model still did not overfit its training data, which suggests the model capacity could be even larger, so dropout was removed
    (dropout can be thought of as randomly dropping part of the network, which effectively makes it smaller).
    We also note that, even after training for 1M steps, our largest models still do not overfit to their training data.
    As a result, we decide to remove dropout to further increase our model capacity.
    For the other model sizes, our implementation keeps the original dropout rate to prevent overfitting on the training data.
    
2) Use LAMB as the optimizer, to speed up training with a large batch size.
  A large batch size (4096) is used for training. The LAMB optimizer makes it possible to train with very large batch sizes, up to around 60k.
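
A conceptual sketch of a single LAMB update step (layer-wise adaptive scaling of an Adam-style update), written from the LAMB paper rather than from this repo's optimizer code; it omits details such as trust-ratio clipping and per-parameter weight-decay exclusions.

import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-6, wd=0.01):
    m = b1 * m + (1 - b1) * g              # first moment (Adam-style)
    v = b2 * v + (1 - b2) * g * g          # second moment
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * w     # Adam direction + weight decay
    trust_ratio = np.linalg.norm(w) / (np.linalg.norm(update) + eps)
    w = w - lr * trust_ratio * update      # layer-wise scaling is what enables huge batches
    return w, m, v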

3) Use n-grams (uni-grams, bi-grams, tri-grams) for the masked language model.
   N-grams are masked with different probabilities: uni-grams with the highest probability, bi-grams next, and tri-grams with the lowest.
   This project currently uses whole word masking for Chinese; a comparison with n-gram masking will be added later. The n-gram masking idea comes from SpanBERT.
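
An illustrative way to sample masked span lengths with decreasing probability (p(n) proportional to 1/n, in the spirit of SpanBERT-style span masking); this is not the repo's create_pretraining_data.py:

import random

def sample_ngram_length(max_n=3):
    weights = [1.0 / n for n in range(1, max_n + 1)]   # uni > bi > tri
    return random.choices(range(1, max_n + 1), weights=weights, k=1)[0]

print([sample_ngram_length() for _ in range(10)])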

Training Data & Configuration

30 GB of Chinese corpora with more than 10 billion Chinese characters, including encyclopedias, news, and online communities.

Pre-training uses sequence_length 512 and batch_size 4096, producing 350 million training instances; by default each model is trained for 125k steps, and albert_xxlarge will be trained longer.

For comparison, roberta_zh pre-training produced 250 million training instances with sequence length 256. Since albert_zh pre-training generates more training data and uses a longer sequence length,

we expect albert_zh to perform better than roberta_zh and to handle longer text better.

Training uses a TPU v3 Pod; we use a v3-256, which consists of 32 v3-8 machines, each with 128 GB of device memory.

Performance and Comparison

Performance on Chinese Datasets

Sentence pair matching: LCQMC

| Model | Dev | Test |
| --- | --- | --- |
| BERT | 89.4 (88.4) | 86.9 (86.4) |
| ERNIE | 89.8 (89.6) | 87.2 (87.0) |
| BERT-wwm | 89.4 (89.2) | 87.0 (86.8) |
| BERT-wwm-ext | - | - |
| RoBERTa-zh-base | 88.7 | 87.0 |
| RoBERTa-zh-Large | 89.9 (89.6) | 87.2 (86.7) |
| RoBERTa-zh-Large(20w_steps) | 89.7 | 87.0 |
| ALBERT-zh-tiny | -- | 85.4 |
| ALBERT-zh-small | -- | 86.0 |
| ALBERT-zh-small(Pytorch) | -- | 86.8 |
| ALBERT-zh-base-additional-36k-steps | 87.8 | 86.3 |
| ALBERT-zh-base | 87.2 | 86.3 |
| ALBERT-large | 88.7 | 87.1 |
| ALBERT-xlarge | 87.3 | 87.7 |

Note: ALBERT-xlarge was run only once; its results may still improve.

Natural language inference: XNLI (Chinese version)

| Model | Dev | Test |
| --- | --- | --- |
| BERT | 77.8 (77.4) | 77.8 (77.5) |
| ERNIE | 79.7 (79.4) | 78.6 (78.2) |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| XLNet | 79.2 | 78.7 |
| RoBERTa-zh-base | 79.8 | 78.8 |
| RoBERTa-zh-Large | 80.2 (80.0) | 79.9 (79.5) |
| ALBERT-base | 77.0 | 77.1 |
| ALBERT-large | 78.0 | 77.5 |
| ALBERT-xlarge | ? | ? |

Note: the BERT-wwm-ext and XLNet results are taken from their respective sources; RoBERTa-zh-base refers to the 12-layer Chinese RoBERTa model.

Reading comprehension: CMRC 2018

Masked Language Model Accuracy & Training Time

| Model | MLM eval acc | SOP eval acc | Training (hours) | Eval loss |
| --- | --- | --- | --- | --- |
| albert_zh_base | 79.1% | 99.0% | 6h | 1.01 |
| albert_zh_large | 80.9% | 98.6% | 22.5h | 0.93 |
| albert_zh_xlarge | ? | ? | 53h (estimated) | ? |
| albert_zh_xxlarge | ? | ? | 106h (estimated) | ? |

Note: the ? entries will be filled in soon.

Configuration of Models

Implementation and Code Testing

Run the following command to test the main changes, including (but not limited to) factorized embedding parameterization, cross-layer parameter sharing, and the sentence-order coherence task.

python test_changes.py
Deploying on mobile with TensorFlow Lite (TFLite):

This section mainly covers converting the model to the TFLite format and benchmarking it. After conversion, for how to use the model on mobile devices, see the complete Android/iOS example app tutorials provided by TFLite, which currently include two Android examples: text classification and text question answering.

Below, albert_tiny_zh is used as an example to walk through TFLite conversion and benchmarking:

  1. Freeze graph from the checkpoint

Make sure a TensorFlow 1.x release >= 1.14 is installed in order to use the freeze_graph tool, as it was removed from the 2.x distribution.

pip install tensorflow==1.15

freeze_graph --input_checkpoint=./albert_model.ckpt \
  --output_graph=/tmp/albert_tiny_zh.pb \
  --output_node_names=cls/predictions/truediv \
  --checkpoint_version=1 --input_meta_graph=./albert_model.ckpt.meta --input_binary=true
  2. Convert to TFLite format

We are going to use the new experimental TF-to-TFLite converter that is distributed with the TensorFlow nightly build.

pip install tf-nightly

tflite_convert --graph_def_file=/tmp/albert_tiny_zh.pb \
  --input_arrays='input_ids,input_mask,segment_ids,masked_lm_positions,masked_lm_ids,masked_lm_weights' \
  --output_arrays='cls/predictions/truediv' \
  --input_shapes=1,128:1,128:128:1,128:1,128:1,128 \
  --output_file=/tmp/albert_tiny_zh.tflite \
  --enable_v1_converter --experimental_new_converter
  3. Benchmark the performance of the TFLite model

See here for details about the performance benchmark tools in TFLite. For example: after building the benchmark tool binary for an Android phone, do the following to get an idea of how the TFLite model performs on the phone

adb push /tmp/albert_tiny_zh.tflite /data/local/tmp/
adb shell /data/local/tmp/benchmark_model_performance_options --graph=/data/local/tmp/albert_tiny_zh.tflite --perf_options_list=cpu

On an Android phone w/ Qualcomm's SD845 SoC, via the above benchmark tool, as of 2019/11/01, the inference latency is ~120ms w/ this converted TFLite model using 4 threads on CPU, and the memory usage is ~60MB for the model during inference. Note the performance will improve further with future TFLite implementation optimizations.

Using the PyTorch version:
Download the pre-trained model and convert it to PyTorch using:
 
  python convert_albert_tf_checkpoint_to_pytorch.py     

then use albert_pytorch

Loading with Keras:

bert4keras supports ALBERT and can load albert_zh weights; just pass albert=True to the load_pretrained_model function.

load pre-trained model with bert4keras
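
A hedged sketch based on the note above (the exact module path and signature vary across bert4keras versions; check its documentation, and adjust the config/checkpoint paths to where you unpacked the model):

# older bert4keras API; newer versions expose build_transformer_model instead
from bert4keras.bert import load_pretrained_model

config_path = './albert_config/albert_config_tiny.json'
checkpoint_path = './albert_tiny_zh/albert_model.ckpt'
model = load_pretrained_model(config_path, checkpoint_path, albert=True)
model.summary()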

Loading with TensorFlow 2.0:

bert-for-tf2

Use Case: Text Similarity Based on User Input

Description: this example shows how to load the trained checkpoints and judge the similarity of short texts entered by the user. The code can be flexibly extended into a backend service, or adapted into examples such as text classification.

Relevant code: similarity.py, args.py

Steps:

1. Fine-tune the model on text similarity and save the model files to the corresponding directory.

2. Adjust the parameters in args.py to your setup; they are described below:

import os
# project root; in args.py this is typically the directory containing the file
file_path = os.path.dirname(os.path.realpath(__file__))

# model directory containing the ckpt files
model_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')

# config: the model's json file
config_name = os.path.join(file_path, 'albert_config/albert_config_tiny.json')

# ckpt file name
ckpt_name = os.path.join(model_dir, 'model.ckpt')

# output directory used for the model during training
output_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')

# vocab file path
vocab_file = os.path.join(file_path, 'albert_config/vocab.txt')

# data directory containing the training dataset
data_dir = os.path.join(file_path, 'data/')

The file layout in this example is:

|__args.py

|__similarity.py

|__data

|__albert_config

|__albert_lcqmc_checkpoints

|__lcqmc

3. Edit the user input sentences

Open similarity.py; at the very bottom you will find the following code:

if __name__ == '__main__':
    sim = BertSim()
    sim.start_model()
    sim.predict_sentences([("我喜欢妈妈做的汤", "妈妈做的汤我很喜欢喝")])

Here sim.start_model() loads the model, and sim.predict_sentences takes a list of tuples, where each tuple contains the two sentences whose similarity is to be judged.

4. Run the Python file: similarity.py

Trade-off between batch size and sequence length (supported combinations on 12 GB of GPU memory)

| System | Seq Length | Max Batch Size |
| --- | --- | --- |
| albert-base | 64 | 64 |
| ... | 128 | 32 |
| ... | 256 | 16 |
| ... | 320 | 14 |
| ... | 384 | 12 |
| ... | 512 | 6 |
| albert-large | 64 | 12 |
| ... | 128 | 6 |
| ... | 256 | 2 |
| ... | 320 | 1 |
| ... | 384 | 0 |
| ... | 512 | 0 |
| albert-xlarge | - | - |

Training Loss of albert_zh xlarge (learning curve)

Parameters of albert_xlarge

Technical discussion QQ group: 836811304. Join us on QQ.

If you have any questions, you can raise an issue or send me an email: [email protected];

How to use the PyTorch version of ALBERT is not fully worked out yet; if you know how, please email us or open an issue.

You can also send a pull request to report your performance on your task, or to add methods for loading the models in PyTorch, and so on.

If you have ideas for getting the best performance out of pre-trained Chinese models, please let me know as well.

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Cite Us

Bright Liang Xu, albert_zh, (2019), GitHub repository, https://github.com/brightmart/albert_zh

Reference

1、ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations

2、BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

3、SpanBERT: Improving Pre-training by Representing and Predicting Spans

4、RoBERTa: A Robustly Optimized BERT Pretraining Approach

5、Large Batch Optimization for Deep Learning: Training BERT in 76 minutes(LAMB)

6、LAMB Optimizer,TensorFlow version

7、预训练小模型也能拿下13项NLP任务,ALBERT三大改造登顶GLUE基准

8、 albert_pytorch

9、load albert with keras

10、load albert with tf2.0

11、repo of albert from google

12、chineseGLUE-中文任务基准测评:公开可用多个任务、基线模型、广泛测评与效果对比

albert_zh's People

Contributors

brightmart, bringtree, multiverse-tf, solumilken, stopit


albert_zh's Issues

albert使用的时候,显存减少了吗?训练速度加快了吗?

albert使用的时候,显存减少了吗?训练速度加快了吗?

我现在最直观的是觉得模型文件大小比bert小了很多,但是好像使用的时候显存和bert差不多.但是我看有些文章里写的是 1 解决了内存限制问题 2 训练速度加快. 我怎么没感觉到啊. 谁能帮忙解释下?

请教一下输入token如果是词内部的字,使用的token是'字'还是'##字'?

我注意到bert官方提供的中文vocab.txt里,每个汉字都有两个token,一个带有'##'前缀,一个不带前缀,我的理解是不带前缀的表示词的首字,带前缀的是非首字。由于两者转换为id后并不相同,我想请教一下对应词内非首字,预训练数据的输入是否使用带前缀的token(给模型输入分词信息)?另外,MLM的label是否使用带前缀的版本?不胜感激!

similarity脚本报错

以albert_tiny_zh预训练模型作为输出文件夹的模型报错(即不经过fine tune这一步 直接以albert_tiny_zh模型运行similarity脚本)

错误信息:tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_resto re_v2_ops.cc:184 : Not found: Key global_step not found in checkpoint

备注:经过fine tune lcmqc数据集这步后 运行没出现报错

关于超参数intermediate_size

代码中设置的intermediate_size=4096。按我的理解,论文(第4页第2行)里说intermediate_size是4倍的hidden size,应该是16384。不知道是不是我理解错了。感谢解答!

albert预训练问题

请问在训练层数比较深的albert时,需要像论文里说的先训6层的,然后12层的参数从6层的finetune吗?还是可以直接train from scratch?这两种方式预训练结果差异会大吗?

create_pretraining_data.py line 322

Hello,brightmart:
Thank you for the implementation of ALBERT. Looking at line 322 of create_pretraining_data.py:
if len(tokens_a)==0 or len(tokens_b)==0: continue
I have a question:
if the list current_chunk only contains the last sentence of a document, current_chunk will contain
two identical sentences, so tokens_a and tokens_b will both be that last sentence.

6层albert模型的发布问题

目前发布的albert_tiny模型仅有4层,虽然模型体量小,但模型效果与其他模型还是有差距。albert_base有12层,但模型整体规模比较大,预测效率还是与bert_base、roberta_base预测效率相近。所以希望作者可以发布6层的albert模型,以适应更多的任务需求。谢谢。

How to create vocab for other language?

Thanks for the implementation!
I would like to know how you have created vocab for training ALBERT.
I am using Sentencepiece for this but which model_type to choose from bpe,word or char

缺少基于用户输入的预测

现有的用例展示的是对于文本的批量运算,缺少基于用户输入的预测。使用他开源bert工具(基于tf)调用tiny_bert模型进行用户输入预测时总是报:部分变量与checkpoint中的shape不同的错误,无法正常运行。

SOP data preparation

Hi,
When generating SOP train instances, if there is only one sentence in the document. Or other extreme case: every sentences in the doc are very long (i.e. more than the target_token_num),
The generated output is
[CLS] tokens_a [SEP] tokens_a[SEP]. This is due to the continue statement without increasing index i in create_instances_from_document_albert().

If there is only one sentence in the current chunk, in order to do SOP, how about we randomly find a split position in the tokens_a? all tokens after that position will go to tokens_b. Tokens before that position will assign to tokens_a.

 if len(tokens_b) == 0 and len(current_chunk) == 1:
                        if len(tokens_a) > 1:
                            #There is only one sentence in the chunk. The sentence could be very long. Or it is the last one in the document.
                            #In order to make SOP, we need to split the sentence into 2 parts.
                            index = rng.randint(1, len(tokens_a) - 1)
                            # index = int(len(tokens_a)/2)
                            tokens_b = tokens_a[index:]
                            tokens_a = tokens_a[0:index]
                        else:
                            print("only 1 token in tokens_a, can't split it into 2 parts. just skip this sentence.")
                            break

使用albert_large运行报错,感觉是word embedding的dim不对,求帮助

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((21128, 1024)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([21128, 128]) from checkpoint reader.
报错如上
我是直接跑了bert官方的分类脚本,用bert的时候没有问题,不知道是不是embedding的dim设置不一样,作者你在跑分类的时候有修改其他 代码吗?感谢

Could I run pretraining on multi gpu?

Hi,
I've been taking a look at the codes implemented and couldn't find the way to train on multi gpu.
Is there some way to do that? or you have any plans to implement this feature.
Please let me know and thanks in advance :)

预训练无法使用gpu

你好,我用自己的数据预训练tiny,看起来只使用cpu在跑,环境设置如下:
1、device_list 可以看到1块cpu和4块gpu
2、tf版本:只有gpu版本 和 gpu/cpu版本共存(保证gpu版本>=cpu),都试过
3、CUDA_VISIBLE_DEVICES 设置为已有gpu index

结果如下-cpu大量使用、gpu只使用100多M:
/Users/zhangyang/Documents/Albert cpu使用情况.png


关于xlarge模型的batch_size和学习率

您好,我最近在使用xlarge-albert在自己任务上微调,起初我设置的batch_size是16,学习率是2e-5,然后训练过程中发现loss震荡的厉害,验证集效果极差。
然后,我把学习率调低到2e-6,发现效果好一些,但是验证集精度仍然和原始bert有差距。
最后,我又继续把学习率调低到2e-7,发现效果又会好一些,但是和原始bert还是有差距。另外和使用albert-base相比也有差距,所以我觉得是训练出了问题。
所有我想请教下您,使用xlarge-albert微调时,学习率和batch_size需要设置成多少合适呢?我看到您说batch_size不能太小,否则可能影响精度,我16的batch_size是否过小了?

cmrc2018任务

可以发下cmrc2018任务的代码吗 想自己验证一下

请问使用TPU预训练好的模型怎么把adam相关的variable去掉?

我使用下面的代码加载训练好的模型,这是因为GPU和TPU不一样吗?

# tf.__version__ 1.15.0
sess = tf.Session()
imported_meta = tf.train.import_meta_graph('model.ckpt-250000.meta')
imported_meta.restore(sess,  'model.ckpt-250000.data-00000-of-00001') --->> 这里抛异常

my_vars = []
for var in tf.all_variables():
    if 'adam_v' not in var.name and 'adam_m' not in var.name:
        my_vars.append(var)
saver = tf.train.Saver(my_vars)
saver.save(sess, './model.ckpt')
INFO:tensorflow:Restoring parameters from gs://medical_bert/medical/model.ckpt-250000
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
   1364     try:
-> 1365       return fn(*args)
   1366     except errors.OpError as e:

8 frames
InvalidArgumentError: No OpKernel was registered to support Op 'TPUReplicatedInput' used by {{node input0}}with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
InvalidArgumentError: No OpKernel was registered to support Op 'TPUReplicatedInput' used by node input0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
   1324       # We add a more reasonable error message here to help users (b/110263146)
   1325       raise _wrap_restore_error_with_msg(
-> 1326           err, "a mismatch between the current graph and the graph")
   1327 
   1328   @staticmethod

InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'TPUReplicatedInput' used by node input0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

发现的一个bug 及提出一个问题:按字符级和按词语级分割那种更好?

修复的一个bug:
在create_pretraining_data.py中,masked_lm_labels 中的label是含有前缀##的而tokens是不含这个在做convert_tokens_to_ids转换时会报错。解决方法在convert_by_vocab加入删除##的逻辑就可以了。

问题:
看到大神的代码在create_pretraining_data.py中是将所有的prefix '##' 去除干净在convert_tokens_to_ids。但是vocab.txt中其实是含有很多带有##为前缀的token的,比如 '##好', '##在' 。以此引出一个问题是按字符级和按词语级分割那种更好?

请问如何在训练好的albert checkpoint使用自己的语料继续训练?

我使用run_pretrain.py代码,指定好训练好的model_dir,但是会抛异常Key bert/embedding/Layernorm/beta/lamb_m not found,

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:worker/replica:0/task:0:
Key bert/embeddings/LayerNorm/beta/lamb_m not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

关于lamb_m的优化参数会在训练完成后去掉,想了解一下如果正确加载预训练checkpoint继续训练?

您好,关于保存的参数问题

您提供的预训练中,只有Bert部分的模型参数,缺少“cls/predictions”部分和“"cls/seq_relationship"部分的参数,方便开源一下完整的模型参数吗?(主要是tiny-albert和albert-base)万分感谢。

final layer normalization

The Pre-LN Transformer
puts the layer normalization inside the residual
connection and equips with an additional finallayer normalization before prediction

在prelln_transformer_model中我没找到这个final layer normalization
请问是我遗漏了还是您没有实现这层呢

Failed to find any matching files for ./albert_large_zh/bert_model.ckpt

在尝试用albert_large_zh模型跑fine-tuning时出错,按照以下命令执行

 export BERT_BASE_DIR=./albert_large_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier.py   --task_name=lcqmc_pair   --do_train=true   --do_eval=true   --data_dir=$TEXT_DIR   --vocab_file=./albert_config/vocab.txt  \
    --bert_config_file=./albert_config/albert_config_large.json --max_seq_length=128 --train_batch_size=64   --learning_rate=2e-5  --num_train_epochs=3 \
    --output_dir=albert_large_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

得到ERROR信息如下

ERROR:tensorflow:Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt
E1013 17:11:41.890616 140328154277632 error_handling.py:70] Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt
INFO:tensorflow:training_loop marked as finished
I1013 17:11:41.890880 140328154277632 error_handling.py:96] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1013 17:11:41.891006 140328154277632 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
  File "run_classifier.py", line 947, in <module>
    tf.app.run()
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_classifier.py", line 819, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2709, in _call_model_fn
    config)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2967, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1549, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1867, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "run_classifier.py", line 498, in model_fn
    ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
  File "/home/lcy/nlp/SecretProject/ALBERT/albert_zh/modeling.py", line 352, in get_assignment_map_from_checkpoint
    init_vars = tf.train.list_variables(init_checkpoint)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 97, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 636, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 648, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt

请问该如何解决?非常感谢

什么时间开源模型代码以及发布预训模型参数?

请问下,模型代码以及发布预训模型参数会按照这个时间发布吗?
1、albert_base, 参数量12M, 层数12,10月5号

2、albert_large, 参数量18M, 层数24,10月13号

3、albert_xlarge, 参数量59M, 层数24,10月6号

4、albert_xxlarge, 参数量233M, 层数12,10月7号(效果最佳的模型

albert模型预测速度远小于bert

您好,请问下我用Albert模型精调的分类模型,主要参数如下,使用了GitHub上放出来的两个xlarge模型 albert_xlarge_zh_177k和albert_xlarge_zh_183k
--do_train=true
--do_eval=true
--do_predict=false
--do_export=false
--vocab_file=$BERT_BASE_DIR/vocab.txt
--bert_config_file=$BERT_BASE_DIR/albert_config_xlarge.json
--init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt
--max_seq_length=32
--train_batch_size=64
--save_checkpoints_steps=500
--max_steps_without_decrease=15000
--learning_rate=1e-6
--num_train_epochs=10 \

python:3.6.4
TensorFlow:1.14
训练:GPU V100
预测:戴尔笔记本
使用 tornado做web框架,都在我自己本地笔记本电脑上测试,测试代码完全一样
bert 预测耗时220ms,
Albert 预测耗时2200ms
这个差距有点大,不是说模型更小,参数更少,预测的速度不会更快是吗,请大佬赐教。
