
Interpoetry

This repository contains the original implementation of the unsupervised poem translation models presented in
Generating Classical Chinese Poems from Vernacular Chinese (EMNLP 2019).

A fancy demo can be found here.

Thanks to FacebookResearch for open-sourcing the Phrase-Based & Neural Unsupervised Machine Translation project.

Dependencies

  • Python 3
  • NumPy
  • PyTorch (currently tested on version 1.0.1)
  • TQDM (4.31.1, for preprocess.py only)

Download / preprocess data

Shortcuts are provided to save time: you can simply download the processed data ( BaiduYun with code: sxqt, GDrive ), unzip it in the interpoetry folder, and continue to the Train section. However, if you are interested in the detailed steps or would like to run on your own dataset, please download and unzip all raw data ( BaiduYun with code: wz7c, GDrive ), rename the folder "data", and place it inside the "interpoetry" folder.

Vernaculars

Training data are collected from 281 sanwen (prose essays) and novels written by more than 40 famous Chinese authors (鲁迅, 金庸, 毕淑敏, 余秋雨, 张小娴, 温世仁, etc.). The dataset includes more than 500K short paragraphs. To form each paragraph, we pack sentences together until the paragraph reaches no more than 130 words. See this short example for more detail.
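
The packing step above can be sketched as follows; `pack_paragraphs` is a hypothetical helper for illustration, not part of the released code:

```python
def pack_paragraphs(sentences, max_len=130):
    """Greedily concatenate tokenized sentences into paragraphs of at
    most max_len tokens (a lone over-long sentence is kept whole)."""
    paragraphs, current = [], []
    for sent in sentences:
        tokens = sent.split()
        # start a new paragraph when adding this sentence would overflow
        if current and len(current) + len(tokens) > max_len:
            paragraphs.append(" ".join(current))
            current = []
        current.extend(tokens)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

For example, `pack_paragraphs(["a b c", "d e"], max_len=4)` yields two paragraphs, since appending the second sentence would push the first past the limit.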

Poems

Classical poem data for training are collected from here. We further gather seven-syllable Jueju (quatrains) from all Tang poems and Song poems. The dataset includes more than 270K seven-syllable Jueju. See this short example for more detail.

Parallel data (poems and their translations)

From online resources, we collected 487 seven-character quatrain poems from Tang Poems and Song Poems, together with their corresponding high-quality vernacular translations. These poems can be used as gold standards for poems generated from their corresponding vernacular translations. They are also included in the processed data zip file.

Preprocess

After downloading the raw data, or creating your own data in the same format, you can start preprocessing.

preprocess.py will process raw data by:

  • splitting the data into training and validation sets
  • checking whether each Jueju meets the proper rhyme constraint (押韵)
  • shrinking the length of the sanwen input
  • padding Jueju 2 by 2 (see the paper for more detail)
  • converting tokens to ids and saving them as a .pth file
  • matching vocab to rhyme classes and saving the mapping as vocab_rytm.json
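
As a rough illustration of the rhyme check (押韵): look up the final character of lines 2 and 4 in a character-to-rhyme-class mapping such as the one saved to vocab_rytm.json. The helper and the toy mapping below are sketches for illustration, not the project's real rhyme table:

```python
def jueju_rhymes(lines, rhyme_of):
    """Return True if the final characters of lines 2 and 4 of a
    seven-syllable Jueju share the same (known) rhyme class."""
    finals = [line[-1] for line in lines]
    classes = {rhyme_of.get(ch) for ch in (finals[1], finals[3])}
    return None not in classes and len(classes) == 1

# toy rhyme classes for illustration only, NOT a real rhyme table
rhyme_of = {"光": "ang", "藏": "ang", "香": "ang", "枯": "u"}
poem = ["青山隐隐绿水光", "千里秋时已尽藏", "江南草木还未枯", "二十四桥幽夜香"]
```

Here `jueju_rhymes(poem, rhyme_of)` is True, since 藏 and 香 fall in the same toy rhyme class.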

Run the following commands to generate the preprocessed sanwen data.

VOCAB_FILEPATH=data/vocab.txt
RAW_DATA_FILEPATH=data/sanwen/sanwen
python preprocess.py $VOCAB_FILEPATH $RAW_DATA_FILEPATH sanwen sanwen nopmpad 7

Run the following commands to generate the preprocessed poem data.

VOCAB_FILEPATH=data/vocab.txt
RAW_DATA_FILEPATH=data/jueju7_out
python preprocess.py $VOCAB_FILEPATH $RAW_DATA_FILEPATH juejue juejue pmpad 7

Commands to process the parallel data are similar; replace RAW_DATA_FILEPATH with the actual file you would like to process.

Train

Given the binarized monolingual training data (poems and vernacular paragraphs) and the parallel evaluation data (poems and their translations), you can train the model using the following command:

python main.py 

## main parameters
--exp_name test 

## network architecture and parameters sharing
--transformer True 
--n_enc_layers 4 --n_dec_layers 4 --share_enc 2 --share_dec 2 
--share_lang_emb True --share_output_emb True 

## datasets location, denoising auto-encoder parameters, and back-translation directions
--langs 'pm,sw' 
--n_mono -1                                 # number of monolingual sentences (-1 for everything)
--mono_dataset $MONO_DATASET                # monolingual dataset
--para_dataset $PARA_DATASET                # parallel dataset
--mono_directions 'pm,sw' --max_len '70,110' --word_shuffle 2 --word_dropout 0.05 --word_blank 0.1 
--pivo_directions 'sw-pm-sw,pm-sw-pm' 

## pretrained embeddings
--pretrained_emb $PRETRAINED 
--pretrained_out False 

## dynamic loss coefficients
--lambda_xe_mono '0:1,100000:0.1,300000:0' --lambda_xe_otfd 1 

## CPU on-the-fly generation
--otf_num_processes 8 --otf_sync_params_every 1000 

## optimization and training steps
--enc_optimizer adam,lr=0.00006 
--epoch_size 210000 
--batch_size 32 
--emb_dim 768 
--max_epoch 35 

## saving models and evaluation length (set eval_length to -1 to evaluate every sentence, which takes a very long time)
--save_periodic True --eval_length 960 

## pad params
--pad_weight 0.1 
--do_bos True 

## rl params
--use_rl True 
--reward_gamma_ap 0.0 
--reward_gamma_ar 0.4 
--reward_type_ar punish 
--reward_thresh_ar 0.85 
--rl_start_epoch 0 

## With
MONO_DATASET='pm:./data/data_pad/jueju7_out.tr.pth,./data/data_pad/jueju7_out.vl.pth,,./data/data_pad/poem_jueju7_para.pm.pth;sw:./data/data_pad/sanwen.tr.pth,./data/data_pad/sanwen.vl.pth,./data/data_pad/sanwen.te.pth,./data/data_pad/poem_jueju7_para.sw.pth' 
PARA_DATASET='pm-sw:,,./data/data_pad/poem_jueju7_para.XX.pth'
PRETRAINED='./data/word_embeddings_weight.pt'
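
The MONO_DATASET string above appears to pack, for each language, four comma-separated paths (train, valid, test, parallel reference), with languages separated by semicolons and empty fields meaning "no such split". A hedged sketch of how such a spec could be unpacked (the project's actual option parsing may differ):

```python
def parse_mono_dataset(spec):
    """Split 'lang:train,valid,test,ref;lang2:...' into a dict mapping
    each language to a tuple of paths (None for empty fields)."""
    datasets = {}
    for entry in spec.split(";"):
        lang, paths = entry.split(":", 1)
        datasets[lang] = tuple(p or None for p in paths.split(","))
    return datasets

# illustrative spec in the same shape as MONO_DATASET above
spec = "pm:pm.tr.pth,pm.vl.pth,,pm.ref.pth;sw:sw.tr.pth,sw.vl.pth,sw.te.pth,sw.ref.pth"
```

With this spec, the `pm` language has no test split (the third field is None), mirroring the empty field in MONO_DATASET.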

A trained model can be downloaded here ( BaiduYun with code: 42br, GDrive ). Unzip it and place it under the "interpoetry" folder.

Evaluation

If you would like to run evaluation only, append these three lines to the training params above.

python main.py 

...(training params like shown above)...

--eval_only True 
--model_path ./dumped/test/4949781/periodic-24.pth  # path of model to load from
--dump_path ./dumped/test/eval_result               # result files to save to

Citation

Please cite the following if you find this repo useful.

Generating Classical Chinese Poems from Vernacular Chinese

@inproceedings{yangcai2019interpoetry,
  title={Generating Classical Chinese Poems from Vernacular Chinese},
  author={Yang, Zhichao and Cai, Pengshan and Feng, Yansong and Li, Fei and Feng, Weijiang and Chiu, Suet-Ying and Yu, Hong},
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
  month = nov,
  year = "2019",
  address = "Hong Kong, China",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D19-1637",
  doi = "10.18653/v1/D19-1637"
}

License

See the LICENSE file for more details.

Contributors

pengshancai, whaleloops


Issues

Training terminates unexpectedly before it even starts

src/trainer.py line #552
def otf_sync_params(self)
...
encoder_params = get_flat_params(self.encoder).cpu().share_memory_()
decoder_params = get_flat_params(self.decoder).cpu().share_memory_()

The console shows the following:

^C
Process finished with exit code -1

Has anyone else run into this problem?

demo script missing

Wonder if you could supply a demo.py which works like

# python demo.py --model_path ./dumped/test/4949781/periodic-24.pth --input_text "青山隐隐约约绿水千里迢迢,秋时已尽江南草木还未枯凋。二十四桥明月映照幽幽清夜,你这美人现在何处教人吹箫"

青山隐隐绿水光,千里秋时已尽藏。江南草木还未枯,二十四桥幽夜香。
