
PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching

PromptEM is a novel low-resource GEM (Generalized Entity Matching) solution powered by prompt-tuning and self-training. To address the gap between pre-training and fine-tuning, we cast GEM as a cloze-style task and design GEM-specific prompt-tuning, which can stimulate the rich knowledge distributed in LMs. To select high-quality pseudo-labels, we develop a lightweight uncertainty-aware self-training method that boosts performance. To further reduce the cost of self-training, we dynamically prune useless training data using the proposed MC-EL2N, making the self-training process more lightweight and efficient.
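The cloze-style formulation above can be sketched in a few lines. The serialization format and template wording here are illustrative assumptions, not PromptEM's exact code:

```python
# A minimal sketch of casting entity matching as a cloze task.
# serialize() and the template wording are illustrative, not PromptEM's code.

def serialize(entity: dict) -> str:
    """Flatten an entity record into 'attr: value' text."""
    return " ".join(f"{k}: {v}" for k, v in entity.items())

def cloze_prompt(left: dict, right: dict, mask_token: str = "<mask>") -> str:
    """Build a cloze-style input; the LM fills the mask with a label word
    (e.g. 'yes'/'no') that a verbalizer maps back to matched/unmatched."""
    return f"{serialize(left)} and {serialize(right)} are {mask_token} the same entity."

left = {"title": "iPhone 13", "brand": "Apple"}
right = {"name": "Apple iPhone 13 128GB"}
print(cloze_prompt(left, right))
```

Because the masked position is scored with the LM's own masked-language-modeling head, the model's pre-training objective is reused directly instead of training a new classification head from scratch.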

For more technical details, see PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching.

[Figure] An illustration of fine-tuning vs. prompt-tuning.

Datasets

We use eight real-world benchmark datasets with different structures from Machamp and Geo-ER.

Quick Start

To train and evaluate with PromptEM:

python main.py [<args>] [-h | --help]

e.g.

python main.py -d=rel-heter -k=0.1 -st -dd=8 -ur=0.05 -er=0.05

The meaning of the flags:

  • --model_name_or_path: the name or local path of the pre-trained language model. e.g. roberta-base
  • --data_name: the name of the dataset. options: [rel-heter, rel-text, semi-heter, semi-homo, semi-rel, semi-text-c, semi-text-w, geo-heter, all]
  • --k: the proportion of training data used. e.g. 0.1
  • --num_iter: the number of iterations. e.g. 1
  • --template_no: the number of templates used in PromptEM. options: [0,1,2,3]
  • --self_training: the flag to enable self-training of PromptEM.
  • --dynamic_dataset: the frequency of dynamic dataset pruning. e.g. 8 (prune once every 8 epochs)
  • --pseudo_label_method: the method of generating pseudo labels. e.g. uncertainty
  • --mc_dropout_pass: the number of MC-Dropout passes. e.g. 10
  • --uncertainty_ratio: the proportion of the unlabeled samples for generating pseudo labels. e.g. 0.05
  • --el2n_ratio: the proportion of the labeled samples for dynamic dataset pruning. e.g. 0.1
  • --text_summarize: the flag to enable text summarization in entity serialization.
  • --add_token: the flag to add special tokens in entity serialization.
  • --max_length: the maximum length (in number of tokens) for the inputs to the transformer model. e.g. 512
  • --teacher_epochs: the number of epochs for training the teacher model. e.g. 20
  • --student_epochs: the number of epochs for training the student model. e.g. 30
  • --batch_size: batch size. e.g. 32
  • --lr: learning rate. e.g. 1e-5
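
The uncertainty-related flags (--mc_dropout_pass, --uncertainty_ratio) can be understood through a minimal sketch of uncertainty-aware pseudo-label selection. The function names and the use of variance across stochastic passes as the uncertainty score are illustrative assumptions, not the repository's exact implementation:

```python
# Hedged sketch of uncertainty-aware pseudo-labeling via MC-Dropout.
# Each unlabeled pair has T match probabilities, one per stochastic forward
# pass with dropout kept active (the --mc_dropout_pass flag).
import statistics

def mc_dropout_uncertainty(probs_per_pass):
    """Return (mean probability, variance) across the T passes; the variance
    serves as the uncertainty score here."""
    mean = statistics.fmean(probs_per_pass)
    var = statistics.pvariance(probs_per_pass)
    return mean, var

def select_pseudo_labels(unlabeled, uncertainty_ratio=0.05):
    """Keep the most certain fraction of unlabeled pairs (--uncertainty_ratio)
    and label each by thresholding its mean probability at 0.5."""
    scored = []
    for pair_id, probs in unlabeled.items():
        mean, var = mc_dropout_uncertainty(probs)
        scored.append((var, pair_id, int(mean >= 0.5)))
    scored.sort()  # lowest variance = most certain first
    k = max(1, int(len(scored) * uncertainty_ratio))
    return [(pid, label) for _, pid, label in scored[:k]]

demo = {
    "pair-1": [0.95, 0.96, 0.94],  # consistent passes -> low uncertainty
    "pair-2": [0.10, 0.90, 0.50],  # disagreeing passes -> high uncertainty
}
print(select_pseudo_labels(demo, uncertainty_ratio=0.5))  # [('pair-1', 1)]
```

The key idea is that pairs on which the stochastic passes agree are trusted as pseudo-labels for the student model, while high-variance pairs are left unlabeled.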

Reproduction

All the experiments are conducted on an Ubuntu Server with an Intel Xeon Silver 4216 CPU and an NVIDIA A100 GPU.

Initialize the environment

conda create -n promptem python=3.7
conda activate promptem
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.16.2
pip install scikit-learn==1.0.2

Note that you do not need to install OpenPrompt by pip manually.

Note that the best hyper-parameters can be sensitive to your server environment and package versions. If your environment differs from ours, we highly recommend re-running the hyper-parameter search in your own environment.

We provide an example search script in search.sh.
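
A hyper-parameter sweep in the spirit of search.sh can be sketched as follows; the grid values below are examples, not the provided script's exact settings:

```python
# Illustrative grid sweep over a few PromptEM flags; swap 'print' for
# subprocess.run(cmd.split()) to actually launch the runs.
import itertools

GRID = {
    "-k": [0.05, 0.1],
    "-ur": [0.05, 0.1],
    "--lr": ["1e-5", "2e-5"],
}

def sweep_commands(data_name="rel-heter"):
    """Yield one training command per point in the grid."""
    keys = list(GRID)
    for values in itertools.product(*GRID.values()):
        flags = " ".join(f"{k}={v}" for k, v in zip(keys, values))
        yield f"python main.py -d={data_name} -st {flags}"

for cmd in sweep_commands():
    print(cmd)
```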

Download the PLM [Optional]

We use RoBERTa-base as the backbone structure of our model in all the experiments.

You can download the pre-trained checkpoint from the Hugging Face Hub manually.

Reproduce the results

You can train the model using the best hyper-parameters we provide in low_configs.json.

We also provide the corresponding logs in logs.

See Quick Start for more details of training parameters.

Contributors

peng-fei-wang, zeng1998


Known Issues

classes in dataset and verbalizer are different

I notice that in the dataset, 0 means unmatched and 1 means matched, but in the verbalizer, index 0 means matched and index 1 means unmatched. Did you notice this difference?

In prompt.py:

classes = [
    "yes",
    "no",
]
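
A minimal way to avoid this kind of mismatch is to make the dataset-to-verbalizer mapping explicit rather than relying on index order. This snippet is an illustrative sketch, not the repository's code:

```python
# Explicit label mapping: dataset convention is 0 = unmatched, 1 = matched,
# while the verbalizer's class list puts "yes" (matched) at index 0.
classes = ["yes", "no"]

DATASET_TO_VERBALIZER = {1: 0, 0: 1}  # matched -> "yes", unmatched -> "no"
VERBALIZER_TO_DATASET = {v: k for k, v in DATASET_TO_VERBALIZER.items()}

def to_verbalizer_index(dataset_label: int) -> int:
    """Map a dataset label to the verbalizer class index."""
    return DATASET_TO_VERBALIZER[dataset_label]

def to_dataset_label(verbalizer_index: int) -> int:
    """Map a predicted verbalizer class index back to a dataset label."""
    return VERBALIZER_TO_DATASET[verbalizer_index]
```

Keeping both directions in one place makes the convention auditable and prevents silent label flips when either side changes.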
