
zen's People

Contributors

guiminchen, wixette, yuanhetian

zen's Issues

Bus error (core dumped)

A small corpus (250k lines) caused no problems, but pre-training on a large corpus (8.4M lines) crashes with a bus error (core dumped). The log is below:

[screenshot of the crash log]

Could you tell me whether you know the cause, and any possible fixes?
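If the crash comes from memory pressure on the 8.4M-line corpus rather than a code bug (an assumption, not a confirmed diagnosis), one common mitigation is to shard the corpus and pre-train on the shards in turn. A minimal sketch, with an arbitrary shard size:

# Split a one-sentence-per-line corpus into smaller shards; the
# 200,000-lines-per-shard figure is purely illustrative.
def shard_corpus(path, lines_per_shard=200_000):
    shard_idx, buf = 0, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= lines_per_shard:
                with open(f"{path}.shard{shard_idx}", "w", encoding="utf-8") as out:
                    out.writelines(buf)
                shard_idx, buf = shard_idx + 1, []
    if buf:  # flush the final partial shard
        with open(f"{path}.shard{shard_idx}", "w", encoding="utf-8") as out:
            out.writelines(buf)

shard_corpus("corpus.txt")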

Dataset preprocessing

Hello, and thank you for your work! After downloading the Chinese dataset THUCNews, what should I do (that is, which commands should I run) to make ./examples/create_pre_train_data.py run correctly and produce a proper training set?

Looking forward to your reply. Many thanks!
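A minimal conversion sketch, assuming THUCNews ships one document per .txt file and that create_pre_train_data.py expects BERT-style input (one sentence per line, a blank line between documents); the sentence splitter and paths are illustrative assumptions:

import glob
import re

# Crude THUCNews -> pre-training-corpus conversion: split each document
# on Chinese sentence-final punctuation, one sentence per line, blank
# line between documents.
with open("pretrain_corpus.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("THUCNews/**/*.txt", recursive=True):
        with open(path, encoding="utf-8") as f:
            text = f.read().strip()
        for sent in re.split(r"(?<=[。!?])", text):
            sent = sent.strip()
            if sent:
                out.write(sent + "\n")
        out.write("\n")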

about pre-training time

We have the same configuration (NVIDIA Tesla V100 GPUs with 16 GB memory) and plan to switch to Baidu Baike for pre-training. Roughly how long does one epoch take?

size mismatch for classifier.bias: copying a param with shape torch.Size([3])

Can the classification task be run directly, or must the model be fine-tuned first?
I downloaded all the data, and running the following directly produces this error:

python run_sequence_level_classification.py \
    --task_name ChnSentiCorp \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir /path/to/dataset/ChnSentiCorp \
    --bert_model /path/to/zen_model \
    --max_seq_length 512 \
    --train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 30.0

07/20/2020 22:14:06 - INFO - ZEN.tokenization - loading vocabulary file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/vocab.txt
07/20/2020 22:14:06 - INFO - ZEN.ngram_utils - loading ngram frequency file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/ngram.txt
07/20/2020 22:14:08 - INFO - ZEN.modeling - loading weights file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/pytorch_model.bin
07/20/2020 22:14:08 - INFO - ZEN.modeling - loading configuration file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/config.json
07/20/2020 22:14:08 - INFO - ZEN.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_hidden_word_layers": 6,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"word_size": 104089
}

Traceback (most recent call last):
File "examples/run_sequence_level_classification.py", line 396, in
main()
File "examples/run_sequence_level_classification.py", line 361, in main
if task_name not in processors:
File "/data/anaconda3/lib/python3.6/site-packages/ZEN-0.1.0-py3.6.egg/ZEN/modeling.py", line 839, in from_pretrained
RuntimeError: Error(s) in loading state_dict for ZenForSequenceClassification:
size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
sh-4.2$

Thanks a lot.
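The traceback shows a fine-tuned NLI checkpoint (ZEN_ft_NLI_v0.1.0) with a 3-class head being loaded into a 2-class ChnSentiCorp model, so fine-tuning (or at least replacing the head) is required. One workaround sketch, assuming you want to keep the encoder weights and retrain the classifier, is to strip the mismatched head before loading; the key names come from the error above, and the paths are placeholders:

import os
import torch

# Drop the 3-class NLI classifier head so the remaining encoder weights
# fit a freshly initialized 2-class model.
state_dict = torch.load("ZEN_ft_NLI_v0.1.0/pytorch_model.bin", map_location="cpu")
for key in ("classifier.weight", "classifier.bias"):
    state_dict.pop(key, None)
# Save into a model directory whose pytorch_model.bin is then passed
# via --bert_model (copy config.json, vocab.txt, ngram.txt alongside).
os.makedirs("zen_model_noclassifier", exist_ok=True)
torch.save(state_dict, "zen_model_noclassifier/pytorch_model.bin")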

Building the n-gram lexicon

Hello, may I ask what tool the ZEN model used to build its n-gram lexicon? I would like to build an n-gram lexicon from text in my own domain, but I am not sure what a good way to do that is.
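The ZEN paper describes the lexicon as n-grams selected by corpus frequency; assuming that approach (the authors' exact tooling is not confirmed here), a minimal counting sketch with an illustrative n range and threshold:

from collections import Counter

# Count character n-grams (n = 2..4) and keep those above a frequency
# threshold; both settings are illustrative, not ZEN's actual values.
def build_ngram_lexicon(corpus_path, min_n=2, max_n=4, min_freq=10):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            chars = line.strip()
            for n in range(min_n, max_n + 1):
                for i in range(len(chars) - n + 1):
                    counts[chars[i:i + n]] += 1
    return [ng for ng, c in counts.items() if c >= min_freq]

with open("ngram.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(build_ngram_lexicon("corpus.txt")))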

n-gram lexicon question

Where can I get the n-gram lexicon? Could you provide a link? Thanks!

Is the fine-tuning data in the official dataset format? Could you provide it directly? It is not very convenient for individuals to obtain.

python run_token_level_classification.py \
    --task_name cwsmsra \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir data/msra_ner \
    --bert_model data/ZEN_pretrain_base_v0.1.0 \
    --max_seq_length 256 \
    --train_batch_size 96 \
    --num_train_epochs 30 \
    --warmup_proportion 0.1

For example, to run the fine-tuning above: what training data format does the cwsmsra task expect, and where can the data be conveniently obtained?
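For reference, token-level scripts of this kind commonly read CoNLL-style files: one character and its tag per line, with a blank line between sentences. The B/I/E/S segmentation tags below are an assumed illustration, not a confirmed spec of ZEN's cwsmsra processor:

我 S
爱 S
北 B
京 E

(blank line, then the next sentence)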

ModuleNotFoundError: No module named 'ZEN'

Running python run_pre_train.py fails with the error below:
Traceback (most recent call last):
File "run_pre_train.py", line 33, in
from ZEN import WEIGHTS_NAME, CONFIG_NAME
ModuleNotFoundError: No module named 'ZEN'
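This usually means the ZEN package is not importable from the working directory; the egg path in another issue's traceback suggests the project is meant to be installed (e.g., pip install . from the repository root). A minimal workaround sketch, assuming the script sits in examples/ inside an uninstalled checkout, is to put the repository root on sys.path before the import:

import os
import sys

# Make the cloned repository root importable (assumes this file lives
# one directory below it, e.g. in examples/).
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from ZEN import WEIGHTS_NAME, CONFIG_NAME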

how to initialize the n-gram tower and embeddings?

Hi~

1. Is ZEN trained from a base BERT (e.g., Google's release) or from scratch? If from scratch, I guess the n-gram embeddings are randomly initialized; if from a base BERT, are the n-gram embeddings perhaps the average of the characters they contain? (A sketch of that scheme appears below.)

2. The paper says "We use the same parameter setting for the n-gram encoder as in BERT". Are the n-gram encoder's parameters shared with the BERT tower (perhaps its bottom six layers?), or are they initialized and trained independently?

thank you~
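For question 1, if one wanted to try the character-averaging initialization the question hypothesizes (not confirmed as ZEN's actual scheme), a sketch might look like:

import torch

# Hypothetical initialization: each n-gram embedding starts as the mean
# of the character embeddings it spans. char_emb is the [vocab, hidden]
# character embedding matrix; ngram_to_char_ids maps each n-gram row to
# the ids of its characters.
def init_ngram_embeddings(char_emb, ngram_to_char_ids):
    ngram_emb = torch.empty(len(ngram_to_char_ids), char_emb.size(1))
    for row, char_ids in enumerate(ngram_to_char_ids):
        ngram_emb[row] = char_emb[torch.tensor(char_ids)].mean(dim=0)
    return ngram_emb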

Fine-tuning datasets preparation

Firstly, thanks a lot for your open-source contribution.
Could you please provide some Python scripts for converting the original official dataset formats to the TSV format, e.g., XML to TSV for the MSRA NER task? That would let us use your project much more conveniently.

Thanks a lot again.
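Until such scripts are published, a generic sketch with the standard library may help; the <sentence> element and its label attribute are hypothetical placeholders, not the real MSRA schema:

import csv
import xml.etree.ElementTree as ET

# Generic XML -> TSV conversion; adapt the element/attribute names to
# the actual dataset schema.
tree = ET.parse("dataset.xml")
with open("train.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for sent in tree.getroot().iter("sentence"):
        writer.writerow([(sent.text or "").strip(), sent.get("label", "")])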

hyperparameters for pre-training

Hi, this is nice work!

Could you give some more details about the hyperparameters used in pre-training?

ZEN (P) is trained on top of Google's BERT. How many epochs were used for the additional pre-training?

Thanks!

Can you evaluate the ZEN model on the CLUE benchmark?

Thank you for ZEN! Researchers now have another great choice of pretrained NLP model. We have seen that ZEN compares favorably with BERT on many NLP tasks. Would you consider evaluating ZEN on the CLUE benchmark?

Our group, CLUE, is also devoted to advancing Chinese NLP; we have chosen 9 representative Chinese tasks, and the leaderboard (including human performance) is now open.

We hope to see ZEN on this leaderboard :)

CLUE Group: https://github.com/CLUEbenchmark/CLUE
CLUE Benchmark: https://www.cluebenchmarks.com/

a question about the fine-tuned models

Excuse me, does the "fine-tuned model for NER" mean that I can directly test on the datasets (OntoNotes, Resume, MSRA, ...) without training, or that I can only test on MSRA without training?
Thanks!

Which BERT implementation did you base ZEN on?

Hello,

We really like your work!
Since we want to follow up on it, we would like to know which BERT implementation ZEN is based on, so that we can conveniently use BERT for comparison later.
Could you give the BERT implementation you used, with a link?

Thanks!
