visualjoyce / chengyubert

[COLING 2020] BERT-based Models for Chengyu

License: MIT License

Python 97.92% Dockerfile 0.31% Shell 1.77%

bert-embeddings bert-model chengyu idioms multiple-choice-question-answering question-answering

chengyubert's Issues

How often (in steps) is the model saved?

I have trained for 19552 steps, and the only output is the log.
There are no checkpoints.
Is that expected?

Which file is used to build the enlarged candidate set?

I couldn't find 'idioms_pretrain.json' in ChID-Dataset, only the .csv files in the competition directory.

Where is "idoimList.txt"?

embedding training config file

Thanks for your work!

I cannot find the file train-embeddings-base-1gpu.json mentioned in ReadMe.md, but I did find a bert-wwm-ext_literature file. Does bert-wwm-ext_literature replace the former?

Thanks a lot!

Can we remove Docker and Horovod?

For some special reasons, I can't use docker and horovod.

Can I remove them?

About the huggingface pretrained model

Hi, how can I use this Hugging Face pretrained model to produce chengyu embeddings? https://huggingface.co/visualjoyce/chengyubert_2stage_stage1_wwm_ext
As far as I can tell, chinese-BERT-wwm only produces token-based embeddings.
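
One generic fallback, when only token-level vectors are available, is to average the vectors of the idiom's characters into a single vector. The sketch below shows that pooling step in plain Python. It is not the repository's method and does not use the model's own idiom embedding table; the toy vectors and the `mean_pool` helper are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]


# Toy 3-dimensional vectors standing in for the hidden states of the
# four characters of one idiom (hypothetical values).
char_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
print(mean_pool(char_vectors))  # [2.0, 2.0, 2.0]
```

With a real model, `char_vectors` would be the last-layer hidden states at the positions of the idiom's characters.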

Error when training Bert-chid

Traceback (most recent call last):
  File "train_official.py", line 470, in <module>
    main(args)
  File "train_official.py", line 317, in main
    best_ckpt = train(model, dataloaders, opts)
  File "train_official.py", line 145, in train
    opts, global_step)
  File "train_official.py", line 235, in evaluation
    log.update(validate(opts, model, loader, split, global_step))
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "train_official.py", line 177, in validate
    loss = F.cross_entropy(logits, targets, reduction='sum')
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2228, in nll_loss
    out_size, target.size()))
ValueError: Expected target size (72, 1), got torch.Size([72])
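
The shapes in this message follow a fixed rule in `F.cross_entropy`: for an input of shape `(N, C, d1, ...)`, a class-index target must have shape `(N, d1, ...)`. An expected target size of `(72, 1)` therefore suggests that the logits reached the loss with an extra trailing axis, for example `(72, C, 1)`. The sketch below reproduces that rule in plain Python. The class count of 7 and both helper names are assumptions for illustration; they are not taken from `train_official.py`.

```python
def expected_target_shape(input_shape):
    """Target shape cross_entropy expects for class-index targets.

    The input is (N, C) or (N, C, d1, ...); the class axis (index 1) is dropped.
    """
    if len(input_shape) < 2:
        raise ValueError("input needs at least (N, C) dimensions")
    return (input_shape[0],) + tuple(input_shape[2:])


def squeeze_trailing(shape):
    """Drop a trailing singleton axis, mirroring logits.squeeze(-1)."""
    return tuple(shape[:-1]) if shape and shape[-1] == 1 else tuple(shape)


# Assumed shapes: 72 blanks, 7 candidates, plus a stray trailing axis.
logits_shape = (72, 7, 1)

print(expected_target_shape(logits_shape))                    # (72, 1)
print(expected_target_shape(squeeze_trailing(logits_shape)))  # (72,)
```

If the logits really do carry such an axis, squeezing it before the loss call would make the shapes agree; whether that is the right fix depends on which model class produced them.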

How to handle OOV?

Problems encountered when trying the code

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown. docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown.

About Two-stage

I'm sorry to bother you again.

I would like to know whether the two-stage code for the paper ('A BERT-based two-stage model for Chinese Chengyu recommendation') uses only 'train_pretrain.py' and 'train_official.py'.
What is the difference between stage-1 pretraining and running 'train_pretrain.py'?

Also, what is the difference between w/o Pre-Training, w/o Fine-Tuning, w/o 𝐿V, and w/o 𝐿A? (I don't quite understand what these settings in your paper show.)

Could you describe this in more detail? Thanks very much.

What are the TxtLmdb and TxtTokLmdb classes used for?

accuracy

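Hello, I used your released 2stage_stage1_wwm_ext to train the second-stage official model 'chengyubert-2stage-stage2', but the result is only 77. According to Table 4 of _A BERT-based Two-Stage Model for Chinese Chengyu Recommendation_, it should be 85.43. Why is the gap so large? Is there something I overlooked?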

About parameters

I used the parameters shown in your paper.

pre-trained BERT: Chinese with Whole Word Masking (WWM)
maximum length: 128
batch size: 40 (4 × 10 GPU cards)
initial learning rate: 0.00005
warm-up steps: 1000
optimizer: AdamW
scheduler: WarmupLinearSchedule
epochs: 5 (num_train_steps about 80800)
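
For reference, the listed schedule can be sketched in a few lines. This assumes `WarmupLinearSchedule` behaves as in the `pytorch-transformers` library: a linear ramp from 0 to the base rate over the warm-up steps, then a linear decay to 0 at the final step. The function name `lr_at` is made up for this example.

```python
def lr_at(step, base_lr=5e-5, warmup_steps=1000, total_steps=80800):
    """Learning rate at `step` under linear warm-up then linear decay."""
    if step < warmup_steps:
        scale = step / max(1, warmup_steps)
    else:
        scale = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * scale


# Print the rate at a few points: start, mid warm-up, peak, mid decay, end.
for step in (0, 500, 1000, 40900, 80800):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 0.00005 at step 1000 and falls back to half of that around step 40900.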

Because I only have one GPU (1 × GTX2080Ti), I set train_batch_size = 6000 and num_train_steps to about 80800. The experiment runs for 5 epochs, and the batch size is 40.
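
A quick check of these numbers, taking 80800 steps, 5 epochs, and a batch size of 40 at face value. Whether `train_batch_size` counts examples or tokens in this code base is not stated in the issue, so the last figure is only what a 6000-token budget would hold at the maximum length, not a claim about how batching works here.

```python
num_train_steps = 80800
epochs = 5
batch_size = 40            # examples per step, as reported
train_batch_size = 6000    # value from the config; unit not confirmed
max_txt_len = 128

steps_per_epoch = num_train_steps // epochs
examples_per_epoch = steps_per_epoch * batch_size
# Hypothetical reading: if train_batch_size were a token budget, this is
# how many maximum-length examples would fit in one step.
full_length_examples = train_batch_size // max_txt_len

print(steps_per_epoch)       # 16160
print(examples_per_epoch)    # 646400
print(full_length_examples)  # 46
```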

However, I cannot reach your accuracy; the following picture shows the accuracy from my experiment.

That is a gap of roughly 3-6%.

Here is my training config JSON:
{
  "train_txt_db": "official_train.db",
  "val_txt_db": "official_dev.db",
  "test_txt_db": "official_test.db",
  "out_txt_db": "official_out.db",
  "sim_txt_db": "official_sim.db",
  "ran_txt_db": "official_ran.db",
  "pretrained_model_name_or_path": "hfl/chinese-bert-wwm-ext",
  "model": "chengyubert-dual",
  "dataset_cls": "chengyu-masked",
  "eval_dataset_cls": "chengyu-masked-eval",
  "output_dir": "storage",
  "candidates": "combined",
  "len_idiom_vocab": 3848,
  "max_txt_len": 128,
  "train_batch_size": 6000,
  "val_batch_size": 20000,
  "gradient_accumulation_steps": 1,
  "learning_rate": 0.00005,
  "valid_steps": 100,
  "num_train_steps": 80800,
  "optim": "adamw",
  "betas": [0.9, 0.98],
  "adam_epsilon": 1e-08,
  "dropout": 0.1,
  "weight_decay": 0.01,
  "grad_norm": 1.0,
  "warmup_steps": 1000,
  "seed": 77,
  "fp16": true,
  "n_workers": 0,
  "pin_mem": true,
  "location_only": false
}

What's wrong with the parameters?

Error during training: what is competition_train.db for?

What is competition_train.db used for?
While reading through your code carefully, I had a few questions:
1. In Preprocessing:

What are these official_*.db files for? Can they be replaced?
└── txt_db
    ├── hfl
    │   └── chinese-bert-wwm-ext
    │       ├── external_pretrain.db
    │       ├── official_dev.db
    │       ├── official_out.db
    │       ├── official_ran.db
    │       ├── official_sim.db
    │       ├── official_test.db
    │       └── official_train.db
    └── visualjoyce
        └── chengyubert_2stage_stage1_wwm_ext -> ../hfl/chinese-bert-wwm-ext
There is no download link for these db files. Could you please clarify? Thanks.

The config of chengyubert_2stage_stage1

Hello! I want to load the model at https://huggingface.co/visualjoyce/chengyubert_2stage_stage1_wwm_ext/tree/main
However, the config says that len_idiom_vocab is 33237, while the vocab.txt provided at that link is not the idiom vocabulary and its size is not 33237. In the Google Drive linked from the README I found a file "idioms_pretrain.json", but its size is 33238. Which vocabulary should I use to load the model from Hugging Face?

About modeling.

Sorry, I have one more question about the code.

What is the purpose of, and the difference between, the following model classes?

@register_model('chengyubert-2stage-stage2-mask')
@register_model('chengyubert-2stage-stage2-cls')
@register_model('chengyubert-2stage-stage2-window')
@register_model('chengyubert-2stage-stage2-mask-window')

What are the "scope" and "num" columns in the corpus?

Hi, may I ask what the "scope" and "num" columns stand for?

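In "idioms_pretrain.json":

| idiom | num | explanation |
| --- | --- | --- |
| 偃武崇文 | 0 | 停息武备，崇尚文教。 |
| 洪乔捎书 | 0 | 指言而无信的人。 |
| 南郭先生 | 103 | 比喻无才而占据其位的人。 |

In "idioms_scopes.tsv":

| scope | idiom | id |
| --- | --- | --- |
| Scope I | 见义勇为 | 0 |
| Scope II | 偃武崇文 | 3848 |
| Scope III | 亏于一篑 | 33237 |

In "idiom_synonyms.tsv":

| query | synonym | query_id | synonym_id | overlapping |
| --- | --- | --- | --- | --- |
| 黯然销魂 | 六神无主 | 14726 | 1333 | 0 |
| 黯然销魂 | 丧魂失魄 | 14726 | 2704 | 1 |
| 塞翁失马，焉知非福 | 塞翁失马，安知非福 | 24524 | 32175 | 8 |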

I assumed "overlapping" is the number of Chinese characters the two idioms share, but the last row shows 8, where I would expect 7.
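
One reading that reproduces all three rows is that `overlapping` counts the distinct characters shared by the two strings, with the full-width comma included. This is a guess from these rows alone, not something the repository confirms, and the `overlap` helper is hypothetical.

```python
import unicodedata


def overlap(query, synonym, count_punctuation=True):
    """Number of distinct characters shared by two idioms."""
    shared = set(query) & set(synonym)
    if not count_punctuation:
        # Unicode categories starting with "P" are punctuation.
        shared = {ch for ch in shared
                  if not unicodedata.category(ch).startswith("P")}
    return len(shared)


rows = [
    ("黯然销魂", "六神无主"),
    ("黯然销魂", "丧魂失魄"),
    ("塞翁失马，焉知非福", "塞翁失马，安知非福"),
]
for query, synonym in rows:
    print(query, synonym,
          overlap(query, synonym),
          overlap(query, synonym, count_punctuation=False))
```

With punctuation counted, the three rows give 0, 1, and 8; with punctuation excluded, the last row gives 7.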

Thanks!

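Questions about the network structure

Hello, and thank you in advance for your kind reply.

I have a few questions about the details of the framework.

First, are the two embeddings in the figure below randomly initialized? Are they initialized only once, or re-initialized every time a batch is drawn?

Second, the embedding on the right does not seem to correspond to a per-idiom embedding, because each example has only 7 candidates, and those 7 candidates change every time. (I understand that the left one updates an embedding for every idiom, since its range is 3848, but the right one seems to cover only the local range of the 7 candidates.) Is this understanding correct? Could you please advise?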

Embedding evaluation dataset

Hi, where can I find the dataset for embedding evaluation?
Thanks!

Where is the idiomDict.json/sample_submission.csv in competition and test_data_ord.txt in official?

Your directory structure includes all of the files above, but I can't find them on either Google Drive or Hugging Face.

dropout

In modeling_bert.py, the forward pass of the ChengyuBertForClozeChid class contains pooled_output = self.dropout(multiply_result). Why is dropout applied to the result of the multiplication? Could you share your reasoning?

error when training chengyubert-twostage

[1,0]:
[1,0]:Traceback (most recent call last):
[1,0]:  File "train_official.py", line 468, in <module>
[1,0]:    main(args)
[1,0]:  File "train_official.py", line 317, in main
[1,0]:    best_ckpt = train(model, dataloaders, opts)
[1,0]:  File "train_official.py", line 145, in train
[1,0]:    opts, global_step)
[1,0]:  File "train_official.py", line 235, in evaluation
[1,0]:    log.update(validate(opts, model, loader, split, global_step))
[1,0]:  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,0]:    return func(*args, **kwargs)
[1,0]:  File "train_official.py", line 176, in validate
[1,0]:    logits, over_logits, cond_logits = model(**batch, targets=None, compute_loss=False)
[1,0]:ValueError: not enough values to unpack (expected 3, got 2)
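
The error means the model's forward call returned two values where `validate` unpacks three. Which model classes return two outputs is not shown in the issue. As a sketch under that uncertainty, the unpacking can be made tolerant of either arity; `split_outputs` is a hypothetical helper, not part of the repository.

```python
def split_outputs(outputs):
    """Return (logits, extras) for a forward result of any length."""
    if not isinstance(outputs, (tuple, list)):
        return outputs, ()
    logits, *extras = outputs
    return logits, tuple(extras)


# Strings stand in for model outputs; real ones would be tensors.
two = ("logits", "over_logits")
three = ("logits", "over_logits", "cond_logits")

print(split_outputs(two))    # ('logits', ('over_logits',))
print(split_outputs(three))  # ('logits', ('over_logits', 'cond_logits'))
```

This only avoids the crash; the more likely fix is to pair the training script with a model name whose forward pass returns the number of outputs the script expects.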

How can I get test_answer.csv? The data is not available on the competition page.

About the prediction layer weight

I would like to use the prediction-layer weights of chengyuBERT as the initial idiom embeddings in my work, but I am struggling to extract them. Could you give some instructions? Thank you.