This repository contains the code accompanying the paper "An End-to-End Chinese Text Normalization Model Based on Rule-Guided Flat-Lattice Transformer", submitted to ICASSP 2022.
Python: 3.7.3
PyTorch: 1.2.0
FastNLP: 0.5.0
Numpy: 1.16.4
See the FastNLP documentation to learn more about FastNLP.
The Chinese Text Normalization dataset is available at https://www.data-baker.com/en/#/data/index/TNtts.
To browse the Chinese version of the download page, visit https://www.data-baker.com/data/index/TNtts.
The raw dataset, in JSONL format, is saved at:
dataset/cleaned_dataset_by_myself/CN_TN_epoch-01-28645_2.jsonl
The raw dataset is stored in JSONL format, i.e. one JSON object per line.
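A JSONL file can be loaded line by line without knowing the exact field names in advance. As a minimal sketch (not part of the repository's own scripts):

```python
import json

def load_jsonl(path):
    """Read a JSON-Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Each returned record is a plain Python dict, so the raw samples can be inspected or filtered before preprocessing.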
Preprocessed data are saved at:
dataset/cleaned_dataset_by_myself/shuffled_BMES
The preprocessed data are in BMES character-tagging format (Begin/Middle/End/Single).
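In the BMES scheme, each character carries a tag whose prefix marks its position in a span (B = begin, M = middle, E = end, S = single character) and whose suffix names the span's category. The category labels below (`NUM`, `O`) are hypothetical placeholders, not necessarily the dataset's actual label set; a generic decoder from tags back to labeled spans might look like:

```python
def bmes_to_spans(chars, tags):
    """Group character-level BMES tags into (text, label) spans.

    B-x starts a span, M-x continues it, E-x closes it,
    and S-x marks a single-character span.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((chars[i], label))
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append(("".join(chars[start:i + 1]), label))
            start = None
    return spans
```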
We divided the data into train, dev, and test sets with an 8:1:1 ratio.
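The 8:1:1 split can be sketched as follows; this is an illustrative snippet (the seed and exact rounding are assumptions, not the repository's implementation):

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle samples and split them into train/dev/test by 8:1:1."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed for reproducibility
    n = len(samples)
    n_train, n_dev = int(n * 0.8), int(n * 0.1)
    train = samples[:n_train]
    dev = samples[n_train:n_train + n_dev]
    test = samples[n_train + n_dev:]  # remainder goes to test
    return train, dev, test
```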
You can also run our code to preprocess and re-split the raw dataset:
python /dataset/cleaned_dataset_by_myself/get_json.py
You can run the following script to count the occurrences of each NSW (non-standard word) category:
python /dataset/cleaned_dataset_by_myself/sep_data.py
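The actual counting logic lives in `sep_data.py`; as a rough equivalent, assuming BMES-style tags where each `B-x` or `S-x` prefix opens one span of category `x`, the per-category totals can be gathered with a `Counter`:

```python
from collections import Counter

def count_categories(tag_sequences):
    """Count labeled spans per category over BMES tag sequences."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            prefix, _, label = tag.partition("-")
            if prefix in ("B", "S"):  # each B-x / S-x opens one span
                counts[label] += 1
    return counts
```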
Our code is in the V1 directory. Run the training code with:
python /V1/flat_main.py --dataset databaker
Our proposed rule base is saved in a Python file:
/V1/add_rule.py