
multi-criteria-cws

Code and corpora for the paper "Effective Neural Solution for Multi-Criteria Word Segmentation" (accepted and forthcoming at SMP 2018).

Dependencies

  • Python3
  • dynet

Quick Start

Run the following command to prepare the corpora and split them into train/dev/test sets:

python3 convert_corpus.py 

Then convert a corpus $dataset into a pickle file (the tagging scheme behind this step is sketched after the list below):

./script/make.sh $dataset
  • $dataset can be one of the following corpora: pku, msr, as, cityu, sxu, ctb, zx, cnc, udc and wtb.
  • $dataset can also be a joint corpus like joint-sighan2005 or joint-10in1.
  • If you have access to sighan2008 corpora, you can also make joint-sighan2008 as your $dataset.
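
Corpus preparation rests on the standard BMES character-tagging scheme (note the make_bmes calls in convert_corpus.py, shown later in this README). A minimal sketch of that tagging, with an illustrative function name rather than the repository's exact code:

    def to_bmes(words):
        """Convert a segmented sentence (list of words) to per-character BMES tags."""
        tags = []
        for word in words:
            if len(word) == 1:
                tags.append('S')                      # Single-character word
            else:
                tags.append('B')                      # Word-initial character
                tags.extend('M' * (len(word) - 2))    # Word-internal characters, possibly none
                tags.append('E')                      # Word-final character
        return tags

    print(to_bmes('商品 和 服务'.split()))  # ['B', 'E', 'S', 'B', 'E']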

Finally, a single command performs both training and testing on the fly:

./script/train.sh $dataset

Performance

sighan2005

[Table: results on the sighan2005 datasets]

sighan2008

[Table: results on the sighan2008 datasets]

10-in-1

Since the SIGHAN bakeoff 2008 datasets are proprietary and difficult to obtain, we conducted additional experiments on more freely available datasets, so that the public can test and verify the effectiveness of our method. We applied our solution to 6 additional freely available datasets together with the 4 sighan2005 datasets.

[Table: 10-in-1 results]

Corpora

In this section, we briefly introduce the corpora used in this paper.

10 corpora in this repo

These 10 corpora come either from the official sighan2005 website, from open-source projects, or from researchers' homepages. Their licenses are listed in the following table.

[Table: corpus licenses]

sighan2008

As the sighan2008 corpora are proprietary, we are unable to distribute them. If you have a legal copy, you can replicate our scores by following these instructions.

First, link your sighan2008 data into the data folder of this project:

ln -s /path/to/your/sighan2008/data data/sighan2008

Then, use HanLP to convert Traditional Chinese to Simplified Chinese, as in the following Java snippet (wrapped here in a minimal class with imports so that it compiles as-is; the class name is arbitrary):

    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.corpus.io.IOUtil;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class Utf16ToUtf8
    {
        public static void main(String[] args) throws IOException
        {
            // Read the UTF-16 Traditional Chinese source file.
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(
                "data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf16.seg"
            ), "UTF-16"));
            // Write the Simplified Chinese result as UTF-8.
            BufferedWriter bw = IOUtil.newBufferedWriter(
                "data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf8.seg");
            String line;
            while ((line = br.readLine()) != null)
            {
                for (String word : line.split("\\s"))
                {
                    if (word.length() == 0) continue;
                    bw.write(HanLP.convertToSimplifiedChinese(word));
                    bw.write(" ");
                }
                bw.newLine();
            }
            br.close();
            bw.close();
        }
    }

You need to repeat this for the following 4 files (a Python alternative is sketched after the list):

  1. ckip_train_utf16.seg
  2. ckip_truth_utf16.seg
  3. cityu_train_utf16.seg
  4. cityu_truth_utf16.seg
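
If you would rather stay in Python, the same conversion can be sketched with the pyhanlp package (an assumption: pyhanlp is installed and exposes HanLP.convertToSimplifiedChinese; only the ckip truth path below is taken from the snippet above, so adjust the remaining paths to your copy):

    # Sketch only: pip install pyhanlp, then adjust the four paths to your copy.
    from pyhanlp import HanLP

    for utf16_path in [
        'data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf16.seg',
        # ... the other three files listed above
    ]:
        utf8_path = utf16_path.replace('utf16', 'utf8')
        # Inputs are UTF-16 Traditional Chinese; outputs are UTF-8 Simplified Chinese.
        with open(utf16_path, encoding='utf-16') as fin, \
                open(utf8_path, 'w', encoding='utf-8') as fout:
            for line in fin:
                words = [str(HanLP.convertToSimplifiedChinese(w)) for w in line.split()]
                fout.write(' '.join(words) + '\n')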

Then, uncomment the following code in convert_corpus.py:

    # For researchers who have access to sighan2008 corpus, use official corpora please.
    print('Converting sighan2008 Simplified Chinese corpus')
    datasets = 'ctb', 'ckip', 'cityu', 'ncc', 'sxu'
    convert_all_sighan2008(datasets)
    print('Combining those 8 sighan corpora to one joint corpus')
    datasets = 'pku', 'msr', 'as', 'ctb', 'ckip', 'cityu', 'ncc', 'sxu'
    make_joint_corpus(datasets, 'joint-sighan2008')
    make_bmes('joint-sighan2008')

Finally, you are ready to go:

python3 convert_corpus.py
./script/make.sh joint-sighan2008
./script/train.sh joint-sighan2008

Acknowledgments

  • Thanks to the friends who helped us with the experiments.
  • Credit is also due to the generous researchers who shared their corpora with the public, as listed in the license table. Your datasets genuinely helped small groups like ours that have no funding.
  • The model implementation is modified from a DyNet 1.x version by rguthrie3.


Issues

Reproducing the baseline experiments of "Effective Neural Solution for Multi-Criteria Word Segmentation"

I have recently been reproducing NLP word segmentation work and built a Bi-LSTM-CRF baseline with TensorFlow, but the model fits the bakeoff2005 MSR dataset poorly, and several rounds of debugging found nothing. While searching for references I came across "Effective Neural Solution for Multi-Criteria Word Segmentation", which clearly lists comparative results for various models, and its baseline experiment uses exactly this Bi-LSTM-CRF model. Could I possibly borrow the code of the baseline experiment, solely to track down the error in my own reproduction?

Splits of the ctb dataset

Hello, is the data in data/other/ctb from sighan 2008? If not, by what standard was it split? It does not quite match the splits in "Chinese Comma Disambiguation for Discourse Analysis" (Yang & Xue 2012).

Installing dynet

Following the dynet tutorial, I could not get it installed on either Linux or Windows. How did you install it, or do you run it on a server?

How exactly is the bigram feature et in the paper computed?

Hello, and thanks for sharing. In the paper, where it says 'ft = [ht; et] is the concatenation of BiLSTM hidden state and bigram feature embedding et', how is et computed?

Also, if word embedding here refers to characters, what does character embedding correspond to?
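
Judging from the model code visible in the traceback of the memory-allocation issue below (self.bigram_lookup = self.model.add_lookup_parameters((len(b2i), word_embedding_dim))), each distinct character bigram gets its own row in a lookup table, so et is an embedding lookup. A hedged DyNet sketch of that pattern, with toy data and not necessarily the paper's exact formulation:

    import dynet as dy

    pc = dy.ParameterCollection()
    b2i = {'商品': 0, '品和': 1, '和服': 2}   # Toy bigram-to-index map
    emb_dim = 100
    # One embedding row per distinct bigram, mirroring bigram_lookup in model.py.
    bigram_lookup = pc.add_lookup_parameters((len(b2i), emb_dim))

    dy.renew_cg()
    e_t = bigram_lookup[b2i['商品']]   # et for a position whose bigram is '商品'
    print(e_t.dim())                   # ((100,), 1)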

A question about the experimental results

Hello, I am working on related research and would like to cite your paper, but I could not find OOV recall numbers in it, and I need them. Due to circumstances beyond my control I may be unable to reproduce the model at the moment. Since you used the official scoring script, it should also report OOV recall — could you share a copy of those OOV recall results?
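
For reference, the OOV recall reported by the official SIGHAN score script is ordinary recall restricted to gold words that never occur in the training data. A small illustrative sketch of the computation (not the official script):

    def char_spans(words):
        """Map a segmented sentence to its set of (start, end) character spans."""
        spans, i = set(), 0
        for w in words:
            spans.add((i, i + len(w)))
            i += len(w)
        return spans

    def oov_recall(gold_sents, pred_sents, train_vocab):
        """Recall over gold words absent from the training vocabulary."""
        hit = total = 0
        for gold, pred in zip(gold_sents, pred_sents):
            pred_spans = char_spans(pred)
            i = 0
            for w in gold:
                span = (i, i + len(w))
                i += len(w)
                if w not in train_vocab:        # OOV gold word
                    total += 1
                    hit += span in pred_spans   # Counted only if segmented exactly
        return hit / total if total else 0.0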

Error: RuntimeError: CPU memory allocation failed

root@liangzhiNLP:/home/liangzhi/liangxingzheng/multi-criteria-cws/multi-criteria-cws# ./script/train.sh joint-10in1 --dynet-seed 10364 --python-seed 840868838938890892
[dynet] random seed: 10364
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
model.py --dataset dataset/joint-10in1/dataset.pkl --num-epochs 60 --word-embeddings data/embedding/character.vec --log-dir result/joint-10in1 --dropout 0.2 --learning-rate 0.01 --learning-rate-decay 0.9 --hidden-dim 100 --dynet-seed 22059 --bigram --skip-dev --dynet-seed 10364 --python-seed 840868838938890892

Namespace(always_model=False, batch_size=20, bigram=True, char_embedding_dim=100, char_embeddings=None, char_hidden_dim=100, clip_norm=None, dataset='dataset/joint-10in1/dataset.pkl', debug=False, dropout=0.2, dynet_autobatch=None, dynet_gpus=None, dynet_mem=None, dynet_seed=10364, dynet_weight_decay=None, hidden_dim=100, learning_rate=0.01, learning_rate_decay=0.9, log_dir='result/joint-10in1', lowercase_words=False, lstm_layers=1, no_model=False, no_we=False, no_we_update=False, num_epochs=60, old_model=None, python_seed=840868838938890892, skip_dev=True, subset=None, task_name='2018-01-04-15-01-54', test=False, tie_two_embeddings=False, use_char_rnn=False, word_embeddings='data/embedding/character.vec')
Python random seed: 840868838938890892

Memory pool info for each devices:
Device CPU - FOR Memory 128MB, BACK Memory 128MB, PARAM Memory 128MB, SCRATCH Memory 128MB.
CPU memory allocation failed n=570425344 align=32
Traceback (most recent call last):
File "model.py", line 492, in
tie_two_embeddings=options.tie_two_embeddings
File "model.py", line 56, in init
self.bigram_lookup = self.model.add_lookup_parameters((len(b2i), word_embedding_dim))
File "_dynet.pyx", line 1183, in _dynet.ParameterCollection.add_lookup_parameters
File "_dynet.pyx", line 1210, in _dynet.ParameterCollection.add_lookup_parameters
RuntimeError: CPU memory allocation failed

What causes this error? Do I need to change the code, or is it an environment problem?
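
A hedged note rather than a verified fix: the failed allocation (n=570425344 bytes, about 544MB) is for the bigram lookup table, and joint-10in1 has a very large bigram vocabulary, so this looks like the process running out of CPU RAM rather than a code bug. Freeing memory or moving to a machine with more RAM should help; DyNet's standard --dynet-mem flag (pool size in MB, forwarded like --dynet-seed above) also lets you budget memory explicitly, e.g.:

./script/train.sh joint-10in1 --dynet-mem 4096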

Source of the pretrained word embeddings

Hello, I have recently been running bilstm-crf segmentation experiments, and using the pretrained word embeddings from your project improved my results by two points. May I ask where your word embeddings come from — did you train them yourself?
