
POINTER

This repository contains the implementation of the EMNLP 2020 paper "POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training", a progressive, non-autoregressive approach to text generation pre-training. POINTER generates fluent text in a progressive and parallel manner, requiring an empirically logarithmic number of generation stages, and outperforms existing non-autoregressive approaches on hard-constrained text generation.

Figure: Illustration of the generation process (blue arrow) of the proposed POINTER model. At each stage, the module generates either a regular token or a special NOI (no-insertion) token for each gap between two existing tokens. The generation stops when all the gaps predict NOI. The data preparation process (orange arrow) reverses the above generative process.

Figure: Example of the progressive generation process.
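
To make the stage-wise process concrete, below is a minimal Python sketch of the generation loop. Here predict_insertions is a hypothetical stand-in for the model's insertion head; this illustrates the control flow only, not the repository's implementation.

NOI = "[NOI]"  # special "no insertion" token

def generate(model, tokens, max_stages=10):
    # tokens: the lexical constraints, e.g. ["sun", "beach"]
    for _ in range(max_stages):
        # One prediction per gap between adjacent tokens; the real model
        # scores all gaps in parallel, which is why the number of stages
        # grows only logarithmically with the output length.
        preds = predict_insertions(model, tokens)  # len(tokens) - 1 predictions
        if all(p == NOI for p in preds):
            break  # every gap predicts NOI: generation stops
        new_tokens = [tokens[0]]
        for gap_pred, nxt in zip(preds, tokens[1:]):
            if gap_pred != NOI:
                new_tokens.append(gap_pred)
            new_tokens.append(nxt)
        tokens = new_tokens
    return tokens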

News

[Major Update 03/29/2021] The inference function and live demo now support phrases and short sentences as lexical constraints. When performing inference, an additional "--sep" option can be passed to specify a user-defined separator token, such as ";", to identify the boundaries of the constraints.

LIVE DEMO

The live demo can be found here. Please expect delays and occasional crashes, as it runs on a single-GPU machine.

Setup Conda Environment

Please use the command lines below to clone the repository, install the requirements, and load the Conda environment (note that CUDA 10 is required):

sudo apt-get install -y make wget gzip bzip2 xz-utils zstd

We provide three ways to set up the Python environment:

  1. (Recommended) Install the packages directly by running
bash env_setup.sh
  2. Or recreate our exact environment from the spec file
conda create --prefix /pointer_env --file pointer-spec-file.txt
  3. Or reproduce the environment from the YAML file
conda env create -f pointer_env.yml -n pointer_env
conda activate pointer_env

Docker environment

To start, first install Docker and NVIDIA Docker from their official repositories. The image for running the code can then be loaded as below:

Nvidia-docker v2.*

docker run --gpus all --ipc=host --rm -it -v $PWD:/workspace --network=host icaruszyz/large-scale-training:ins_v4 bash

Nvidia-docker v1.*

nvidia-docker run --rm -it -v $PWD:/workspace --network=host icaruszyz/large-scale-training:ins_v4 bash

Raw data

The data files can be downloaded from the following links:

Dataset Download link
News [link]
Restaurant review [link]
Wiki data for pre-training [link]

POINTER model checkpoints

The model and config files (345M models) can be downloaded from the following links:

Model Download link
Wiki pretrained model [link]
Restaurant review fine-tuned model [link]
News fine-tuned model [link]

To continue, please decompress the file and move the ckpt folder into the main directory of this repo:

tar -xzvf ckpt.tar.gz

Generate from POINTER model with your own input

Quick start (TL;DR): Run the demo in our repo as

./demo.sh

Decoding script: Please put a test.key.txt file (see input/test.key.txt in this repo for an example) into the input folder, with \t separating the constraints. The generation can be done using the following command:

conda activate pointer_env
python inference.py \
--keyfile ./input/test.key.txt  \
--bert_model $model_path \
--output_dir $result_path

The generated text will be saved in the $result_path folder.
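
For illustration, each line of the keywords file holds one set of constraints separated by literal tab characters; the keywords below are hypothetical stand-ins for your own (here \t denotes a real tab in the file):

sun \t beach \t vacation
food \t delicious \t service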

Data preparation

Data generation:

python ./generate_training_data.py \
--train_corpus ./data/training.dummy.txt \
--bert_model bert-base-uncased \
--output_dir ./data/yelp_processed/ \
--clean  \
--task_name yelp
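
Conceptually, data preparation reverses the generation process: each training pair maps a shorter sequence to the longer sequence it came from. Below is a simplified Python sketch of that idea; the repository's script selects tokens by importance and uses BERT tokenization, whereas random dropping is used here purely to illustrate the mechanics.

import random

def make_stage_pairs(tokens, keep_ratio=0.5, min_len=1):
    # Build (shorter, longer) training pairs by repeatedly discarding a
    # subset of tokens; each shorter sequence is the input whose insertion
    # target is the longer one. (Random selection is a placeholder for the
    # importance-based selection in the actual script.)
    pairs = []
    current = tokens
    while len(current) > min_len:
        n_keep = max(min_len, int(len(current) * keep_ratio))
        keep_idx = sorted(random.sample(range(len(current)), n_keep))
        shorter = [current[i] for i in keep_idx]
        pairs.append((shorter, current))
        current = shorter
    return pairs

pairs = make_stage_pairs("the food was absolutely delicious and cheap".split())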

Model training

Dependency requirements:

Please run bash ./requirement.sh to install the required dependencies. The bert-large-uncased model can be found here. Please also install apex from https://www.github.com/nvidia/apex to enable distributed and fp16 training.

Pre-training: Below is an example of pre-training a model on Wikipedia:

python -m torch.distributed.launch  --nproc_per_node 16 training.py \
--pregenerated_data ./data/wikipedia_processed \
--bert_model bert-large-uncased \
--output_dir $WIKI_MODEL \
--epochs 40  \
--train_batch_size 64 \
--output_step 100000 \
--learning_rate 1e-5 

Fine-tuning: Below is an example of fine-tuning a model from the pre-trained model ($WIKI_MODEL):

python -m torch.distributed.launch  --nproc_per_node 16 training.py \
--pregenerated_data ./data/yelp_processed \
--bert_model $WIKI_MODEL \
--output_dir $finetune_model_path \
--epochs 40 \
--train_batch_size 64 \
--output_step 100000 \
--learning_rate 1e-5

Model decoding

Keywords extraction: First, you can use the following script to extract keywords from the test file:

python keyword_extraction.py \
--n_keys 7 \
--file ./data/yelp_test.txt

The keywords file will be saved in the same folder as the input file (in this example, the keywords file will be ./data/yelp_test.key.txt)
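
For reference, here is a minimal sketch of TF-IDF-style keyword ranking; the repository's keyword_extraction.py may score and select words differently, so treat this only as an illustration of the step.

from collections import Counter
import math

def extract_keywords(lines, n_keys=7):
    # Rank each line's words by a simple TF-IDF score and keep the top n_keys.
    docs = [line.lower().split() for line in lines]
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    n_docs = len(docs)
    results = []
    for doc in docs:
        tf = Counter(doc)
        ranked = sorted(set(doc), key=lambda w: -tf[w] * math.log(n_docs / df[w]))
        results.append(" ".join(ranked[:n_keys]))
    return results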

Generating from the keywords: With the trained model, you can generate sentences from a given keywords file. The keywords file can be produced by the keyword extraction step above, or supplied as a custom user input. The following command shows how to decode from the keywords file generated in the last step:

python inference.py \
--keyfile ./data/yelp_test.key.txt  \
--bert_model $finetune_model_path \
--output_dir $result_path 

The inference function and live demo now support phrases and short sentences as lexical constraints. When performing inference, an additional "--sep" option can be passed to specify a user-defined separator token, such as ";", to identify the boundaries of the constraints. The default separator is white space " ".
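
For example, if each line of the key file uses ";" to separate multi-word constraints (e.g., a hypothetical line "hello kitty ; fleece blanket"), decoding can be run as:

python inference.py \
--keyfile ./input/test.key.txt \
--bert_model $finetune_model_path \
--output_dir $result_path \
--sep ";"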

NOTE: if the default white-space separator is used, the input keywords file will be tokenized by a BERT tokenizer. A less common word will likely be split into subwords; for example, cheesecake is split into cheese and ##cake. As a result, the final generation may not contain the whole word cheesecake.
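
You can check how a constraint will be split before running inference; below is a quick sketch using the Hugging Face transformers tokenizer (the repository may load its tokenizer differently):

from transformers import BertTokenizer

# Inspect how BERT's WordPiece tokenizer splits a constraint word.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("cheesecake"))  # ['cheese', '##cake']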

Citation

If you use this code in your research, please cite our paper:

@inproceedings{zhang2020pointer,
  title={POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training},
  author={Zhang, Yizhe and Wang, Guoyin and Li, Chunyuan and Gan, Zhe and Brockett, Chris and Dolan, Bill},
  booktitle={EMNLP},
  year={2020}
}


pointer's Issues

about constructing data

I think there is an error at line 444 of generate_training_data.py. It should be: tokens = tokenizer.tokenize(line)

About pre-training wiki dataset

Thanks for this interesting project! I am wondering:

  1. Could you please share the link to the pre-training dataset (i.e., the wiki data)?
  2. Your paper mentions that the wiki data contains 1.99 million sentences and is about 12.6 GB, but we found that the original wiki data (90 million sentences) is about 23 GB. These look inconsistent; am I misunderstanding something?

slot nums

I don't understand: for a sequence {sos, x1, x2, ..., xt, eos}, shouldn't there be t+1 slots, as described in the Insertion Transformer paper, i.e., (sos, x1), (x1, x2), ..., (xt, eos)? But in your paper there are always t slots, which confuses me a lot.

How to treat some keywords as a whole instead of separating them?

For example, I have the following key words:

Hello kitty blanket Fleece

I want to treat "Hello kitty" as a single keyword and generate text similar to the following:

Cartoon Hello Kitty Printing Throw Blanket Soft Cover Flannel Cozy Plush Fleece Blanket for Boys Girls Kids

Sentences like '* hello * kitty ...' are not correct.

Add `.gitignore` file to the repo

The __pycache__ directory present in the repo is not necessary. Creating a .gitignore file and adding __pycache__ will do the trick.

add check for conda installation on env_setup.sh

There is no check in env_setup.sh to see whether the user has conda installed. Hence, all the pip install commands inside env_setup.sh will install the packages into the base environment if conda is not installed.

How to generate shorter sentences?

Hi,

first, thanks for your amazing work on POINTER!

I am working on paraphrase generation from source-sentence keywords for my PhD, but in my experience the generated paraphrases tend to be 100 to 160 words, which is 3-4 times longer than my sources, even after fine-tuning.

In your opinion, what would be the best way to generate shorter paraphrases? The [No insertion] probability knob (with the risk of falling outside the pre-training domain), retraining from scratch on shorter sentences, or any other idea?

Thanks!

Yelp dataset cannot be downloaded

Sorry to be a bother. The Yelp dataset (restaurant review) cannot be downloaded; when I visit the URL, an error appears:
AuthenticationFailed: Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly, including the signature. RequestId:936f42d0-c01e-002b-3009-cad595000000 Time:2020-12-04T06:49:53.3054174Z Signature not valid in the specified time frame: Start [Wed, 02 Dec 2020 18:29:28 GMT] - Expiry [Thu, 03 Dec 2020 18:29:28 GMT] - Current [Fri, 04 Dec 2020 06:49:53 GMT]
I'd appreciate it if you could fix it.

Hi there, I have some questions and suggestions.

Thanks for your code!

  1. Q: The model suffers from error accumulation. I suggest adding a '[DEL]' token so the model knows when to delete which words, as in the NAT method Levenshtein Transformer. The difference is that Levenshtein performs [where to insert / insert / delete] in separate steps, while yours could do all of it in one step. One step has its own stability problem, though: insertion and deletion can both happen in the same step. In my tests, adding '[DEL]' decreased PPL by about 10 points.
  2. Q: Lack of knowledge. This shows up when I constrain the training data to metaphor or parallelism data: the output lacks the strong internal logic of GPT-2 and tends to generate words like [no, can't, doesn't, etc.] that completely change the meaning of the sentence, so different parts of the sentence end up meaning different things. I don't know how to fix it; maybe something like knowledge-BERT? Or maybe this is an inherent disadvantage of NAT methods compared to AR models like GPT-2, caused by the unstable generation pattern, and cannot be solved?
  3. Your greedy-search inference code may be too slow when running inference over a batch. I suggest a torch-mask version (1. get the indices to mask; 2. use torch.scatter or torch.masked_fill, etc., to run inference over the whole batch; see the sketch below). In a few days I'll open a pull request; please review it.
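
As a sketch of the batched-masking idea in point 3 (hypothetical shapes and token IDs, not the repository's code):

import torch

# Fill every masked slot in a batch at once instead of looping in Python.
token_ids = torch.tensor([[101, 2023, 103, 102],
                          [101, 103, 2003, 102]])   # 103 = [MASK]
predictions = torch.tensor([[0, 0, 7592, 0],
                            [0, 2054, 0, 0]])        # per-slot model outputs
mask = token_ids.eq(103)
token_ids = torch.where(mask, predictions, token_ids)  # batched, no loop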

Checkpoint Binary Hosting seems too slow

Hi, I was deeply impressed by your work on POINTER, so I tried to fine-tune your wiki-pretrained checkpoint on my custom dataset. But downloading from the provided link seems too slow or not working. Can I get this checkpoint through another route? Thanks in advance.

About Yelp dataset

Could you provide the Yelp dataset, or describe how it was extracted? The Yelp dataset in the paper "Towards Coherent and Cohesive Long-form Text Generation" is different from the one used in this paper.

About News Dataset

In your paper, the EMNLP 2017 WMT News dataset contains 268,586 sentences, but there are many datasets at http://www.statmt.org/wmt17/ and I cannot tell which one was used in the experiments. I would appreciate it if you could provide some details.
