
POINTER

This repository contains the implementation of the EMNLP 2020 paper "POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training", a progressive, non-autoregressive approach to text generation pre-training. POINTER generates fluent text in a progressive and parallel manner, requiring an empirically logarithmic number of generation stages, and outperforms existing non-autoregressive approaches on hard-constrained text generation.

Figure: Illustration of the generation process (blue arrow) of the proposed POINTER model. At each stage, the module generates either a regular token or a special NOI (no-insertion) token for each gap between two existing tokens. The generation stops when all the gaps predict NOI. The data preparation process (orange arrow) reverses the above generative process.

Figure: Example of the progressive generation process.
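
To make the stage-wise process concrete, below is a minimal Python sketch of the generation loop. Here predict_insertions is a hypothetical stand-in for the model's insertion head; this illustrates the control flow only, not the repository's implementation.

NOI = "[NOI]"  # special "no insertion" token

def generate(model, tokens, max_stages=10):
    # tokens: the lexical constraints, e.g. ["sun", "beach"]
    for _ in range(max_stages):
        # One prediction per gap between adjacent tokens; the real model
        # scores all gaps in parallel, which is why the number of stages
        # grows only logarithmically with the output length.
        preds = predict_insertions(model, tokens)  # len(tokens) - 1 predictions
        if all(p == NOI for p in preds):
            break  # every gap predicts NOI: generation stops
        new_tokens = [tokens[0]]
        for gap_pred, nxt in zip(preds, tokens[1:]):
            if gap_pred != NOI:
                new_tokens.append(gap_pred)
            new_tokens.append(nxt)
        tokens = new_tokens
    return tokens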

News

[Major Update 03/29/2021] The inference function and live demo now support phrases and short sentences as lexical constraints. When performing inference, an additional "--sep" option can be passed to specify a user-defined separator token, such as ";", to identify the boundaries of the constraints.

LIVE DEMO

The live demo can be found here. Please expect delays and occasional crashes, as it runs on a single-GPU machine.

Setup Conda Environment

Please use the command lines below to clone the repository, install the requirements, and load the Conda environment (note that CUDA 10 is required):

sudo apt-get install -y make wget gzip bzip2 xz-utils zstd

We provide three ways to set up the Python environment:

  1. (Recommended) Install the packages directly by running
bash env_setup.sh
  2. Or recreate our exact environment from the spec file
conda create --prefix /pointer_env --file pointer-spec-file.txt
  3. Or reproduce the environment from the YAML file
conda env create -f pointer_env.yml -n pointer_env
conda activate pointer_env

Docker environment

To start, first install Docker and NVIDIA Docker from their official repositories. The image for running the code can then be loaded as below:

Nvidia-docker v2.*

docker run --gpus all --ipc=host --rm -it -v $PWD:/workspace --network=host icaruszyz/large-scale-training:ins_v4 bash

Nvidia-docker v1.*

nvidia-docker run --rm -it -v $PWD:/workspace --network=host icaruszyz/large-scale-training:ins_v4 bash

Raw data

The data files can be downloaded from the following links:

Dataset Download link
News [link]
Restaurant review [link]
Wiki data for pre-training [link]

POINTER model checkpoints

The model and config files (345M models) can be downloaded from the following links:

Model Download link
Wiki pretrained model [link]
Restaurant review fine-tuned model [link]
News fine-tuned model [link]

To continue, please decompress the file and move the ckpt folder into the main directory of this repo:

tar -xzvf ckpt.tar.gz

Generate from POINTER model with your own input

Quick start (TL;DR): Run the demo in our repo as

./demo.sh

Decoding script: Please put a test.key.txt file (see input/test.key.txt in this repo for an example) into the input folder, with \t separating the constraints. The generation can be done using the following command:

conda activate pointer_env
python inference.py \
--keyfile ./input/test.key.txt  \
--bert_model $model_path \
--output_dir $result_path

The generated text will be saved in the $result_path folder.
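
For illustration, each line of the keywords file holds one set of constraints separated by literal tab characters; the keywords below are hypothetical stand-ins for your own (here \t denotes a real tab in the file):

sun \t beach \t vacation
food \t delicious \t service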

Data preparation

Data generation:

python ./generate_training_data.py \
--train_corpus ./data/training.dummy.txt \
--bert_model bert-base-uncased \
--output_dir ./data/yelp_processed/ \
--clean  \
--task_name yelp
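
Conceptually, data preparation reverses the generation process: each training pair maps a shorter sequence to the longer sequence it came from. Below is a simplified Python sketch of that idea; the repository's script selects tokens by importance and uses BERT tokenization, whereas random dropping is used here purely to illustrate the mechanics.

import random

def make_stage_pairs(tokens, keep_ratio=0.5, min_len=1):
    # Build (shorter, longer) training pairs by repeatedly discarding a
    # subset of tokens; each shorter sequence is the input whose insertion
    # target is the longer one. (Random selection is a placeholder for the
    # importance-based selection in the actual script.)
    pairs = []
    current = tokens
    while len(current) > min_len:
        n_keep = max(min_len, int(len(current) * keep_ratio))
        keep_idx = sorted(random.sample(range(len(current)), n_keep))
        shorter = [current[i] for i in keep_idx]
        pairs.append((shorter, current))
        current = shorter
    return pairs

pairs = make_stage_pairs("the food was absolutely delicious and cheap".split())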

Model training

Dependency requirements:

Please run bash ./requirement.sh to install the required dependencies. The bert-large-uncased model can be found here. Please also install apex from https://www.github.com/nvidia/apex to enable distributed and fp16 training.

Pre-training: Below is an example of pre-training a model on Wikipedia:

python -m torch.distributed.launch  --nproc_per_node 16 training.py \
--pregenerated_data ./data/wikipedia_processed \
--bert_model bert-large-uncased \
--output_dir $WIKI_MODEL \
--epochs 40  \
--train_batch_size 64 \
--output_step 100000 \
--learning_rate 1e-5 

Fine-tuning: Below is an example of fine-tuning a model from the pre-trained model ($WIKI_MODEL):

python -m torch.distributed.launch  --nproc_per_node 16 training.py \
--pregenerated_data ./data/yelp_processed \
--bert_model $WIKI_MODEL \
--output_dir $finetune_model_path \
--epochs 40 \
--train_batch_size 64 \
--output_step 100000 \
--learning_rate 1e-5

Model decoding

Keywords extraction: First, you can use the following script to extract keywords from the test file:

python keyword_extraction.py \
--n_keys 7 \
--file ./data/yelp_test.txt

The keywords file will be saved in the same folder as the input file (in this example, the keywords file will be ./data/yelp_test.key.txt)
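
For reference, here is a minimal sketch of TF-IDF-style keyword ranking; the repository's keyword_extraction.py may score and select words differently, so treat this only as an illustration of the step.

from collections import Counter
import math

def extract_keywords(lines, n_keys=7):
    # Rank each line's words by a simple TF-IDF score and keep the top n_keys.
    docs = [line.lower().split() for line in lines]
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    n_docs = len(docs)
    results = []
    for doc in docs:
        tf = Counter(doc)
        ranked = sorted(set(doc), key=lambda w: -tf[w] * math.log(n_docs / df[w]))
        results.append(" ".join(ranked[:n_keys]))
    return results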

Generating from the keywords: With the trained model, you can generate sentences from a given keywords file. The keywords file can be produced by the keyword extraction step above, or supplied as a custom user input. The following command shows how to decode from the keywords file generated in the last step:

python inference.py \
--keyfile ./data/yelp_test.key.txt  \
--bert_model $finetune_model_path \
--output_dir $result_path 

The inference function and live demo now support phrases and short sentences as lexical constraints. When performing inference, an additional "--sep" option can be passed to specify a user-defined separator token, such as ";", to identify the boundaries of the constraints. The default separator is white space " ".
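
For example, if each line of the key file uses ";" to separate multi-word constraints (e.g., a hypothetical line "hello kitty ; fleece blanket"), decoding can be run as:

python inference.py \
--keyfile ./input/test.key.txt \
--bert_model $finetune_model_path \
--output_dir $result_path \
--sep ";"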

NOTE: if the default white-space separator is used, the input keywords file will be tokenized by a BERT tokenizer. A less common word will likely be split into subwords; for example, cheesecake is split into cheese and ##cake. As a result, the final generation may not contain the whole word cheesecake.
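
You can check how a constraint will be split before running inference; below is a quick sketch using the Hugging Face transformers tokenizer (the repository may load its tokenizer differently):

from transformers import BertTokenizer

# Inspect how BERT's WordPiece tokenizer splits a constraint word.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("cheesecake"))  # ['cheese', '##cake']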

Citation

If you use this code in your research, please cite our paper:

@inproceedings{zhang2020pointer,
  title={POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training},
  author={Zhang, Yizhe and Wang, Guoyin and Li, Chunyuan and Gan, Zhe and Brockett, Chris and Dolan, Bill},
  booktitle={EMNLP},
  year={2020}
}


pointer's Issues

about constructing data

I think there is an error at line 444 of generate_training_data.py. It should be: tokens = tokenizer.tokenize(line)

About pre-training wiki dataset

Thanks for this interesting project! I am wondering:

  1. Could you please share the link to the pre-training dataset (i.e., the wiki data)?
  2. Your paper mentions that the wiki data contains 1.99 million sentences and is about 12.6 GB, but we found that the original wiki data (90 million sentences) is about 23 GB. These look inconsistent; am I misunderstanding something?

slot nums

I don't understand: for a sequence {sos, x1, x2, ..., xt, eos}, shouldn't there be t+1 slots, as described in the Insertion Transformer paper, i.e., (sos, x1), (x1, x2), ..., (xt, eos)? But in your paper there are always t slots, which confuses me a lot.

How to treat some keywords as a whole instead of separating them?

For example, I have the following key words:

Hello kitty blanket Fleece

I want to treat "Hello kitty" as a single keyword and generate text similar to the following:

Cartoon Hello Kitty Printing Throw Blanket Soft Cover Flannel Cozy Plush Fleece Blanket for Boys Girls Kids

Sentences like '* hello * kitty ...' are not correct.

Add `.gitignore` file to the repo

The __pycache__ directory present in the repo is not necessary. Creating a .gitignore file and adding __pycache__ will do the trick.

add check for conda installation on env_setup.sh

There is no check in env_setup.sh to see whether the user has conda installed. Hence, all the pip install commands inside env_setup.sh will install the packages into the base environment if conda is not installed.

How to generate shorter sentences?

Hi,

first, thanks for your amazing work on POINTER!

I am working on paraphrase generation from source-sentence keywords for my PhD, but in my experience the generated paraphrases tend to be 100 to 160 words, which is 3-4 times longer than my sources, even after fine-tuning.

In your opinion, what would be the best way to generate shorter paraphrases? The [No insertion] probability knob (with the risk of falling outside the pre-training domain), retraining from scratch on shorter sentences, or any other idea?

Thanks!

Yelp dataset cannot be downloaded

Sorry to be a bother. The Yelp dataset (restaurant review) cannot be downloaded; when I visit the URL, an error appears:
AuthenticationFailed: Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly, including the signature. RequestId:936f42d0-c01e-002b-3009-cad595000000 Time:2020-12-04T06:49:53.3054174Z Signature not valid in the specified time frame: Start [Wed, 02 Dec 2020 18:29:28 GMT] - Expiry [Thu, 03 Dec 2020 18:29:28 GMT] - Current [Fri, 04 Dec 2020 06:49:53 GMT]
I'd appreciate it if you could fix it.

Hi there, I have some questions and suggestions.

Thanks for your code!

  1. Q: The model suffers from error accumulation. I suggest adding a '[DEL]' token so the model knows when to delete which words, as in the NAT method Levenshtein Transformer. The difference is that Levenshtein performs [where to insert / insert / delete] in separate steps, while yours could do all of it in one step. One step has its own stability problem, though: insertion and deletion can both happen in the same step. In my tests, adding '[DEL]' decreased PPL by about 10 points.
  2. Q: Lack of knowledge. This shows up when I constrain the training data to metaphor or parallelism data: the output lacks the strong internal logic of GPT-2 and tends to generate words like [no, can't, doesn't, etc.] that completely change the meaning of the sentence, so different parts of the sentence end up meaning different things. I don't know how to fix it; maybe something like knowledge-BERT? Or maybe this is an inherent disadvantage of NAT methods compared to AR models like GPT-2, caused by the unstable generation pattern, and cannot be solved?
  3. Your greedy-search inference code may be too slow when running inference over a batch. I suggest a torch-mask version (1. get the indices to mask; 2. use torch.scatter or torch.masked_fill, etc., to run inference over the whole batch; see the sketch below). In a few days I'll open a pull request; please review it.
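
As a sketch of the batched-masking idea in point 3 (hypothetical shapes and token IDs, not the repository's code):

import torch

# Fill every masked slot in a batch at once instead of looping in Python.
token_ids = torch.tensor([[101, 2023, 103, 102],
                          [101, 103, 2003, 102]])   # 103 = [MASK]
predictions = torch.tensor([[0, 0, 7592, 0],
                            [0, 2054, 0, 0]])        # per-slot model outputs
mask = token_ids.eq(103)
token_ids = torch.where(mask, predictions, token_ids)  # batched, no loop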

Checkpoint Binary Hosting seems too slow

Hi, I was deeply impressed by your work on POINTER, so I tried to fine-tune your wiki-pretrained checkpoint on my custom dataset. But downloading from the provided link seems too slow or not working. Can I get this checkpoint through another route? Thanks in advance.

About Yelp dataset

Could you provide the Yelp dataset, or describe how it was extracted? The Yelp dataset in the paper "Towards Coherent and Cohesive Long-form Text Generation" is different from the one used in this paper.

About News Dataset

In your paper, the EMNLP 2017 WMT News dataset contains 268,586 sentences, but there are many datasets at http://www.statmt.org/wmt17/ and I cannot tell which one was used in the experiments. I would appreciate it if you could provide some details.
