
multi-criteria-cws

Code and corpora for the paper "Effective Neural Solution for Multi-Criteria Word Segmentation" (accepted and forthcoming at SMP 2018).

Dependencies

  • Python3
  • dynet

Quick Start

Run the following command to prepare the corpora and split them into train/dev/test sets:

python3 convert_corpus.py 

Then convert a corpus $dataset into a pickle file (the tagging scheme behind this step is sketched after the list below):

./script/make.sh $dataset
  • $dataset can be one of the following corpora: pku, msr, as, cityu, sxu, ctb, zx, cnc, udc and wtb.
  • $dataset can also be a joint corpus like joint-sighan2005 or joint-10in1.
  • If you have access to sighan2008 corpora, you can also make joint-sighan2008 as your $dataset.
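
Corpus preparation rests on the standard BMES character-tagging scheme (note the make_bmes calls in convert_corpus.py, shown later in this README). A minimal sketch of that tagging, with an illustrative function name rather than the repository's exact code:

    def to_bmes(words):
        """Convert a segmented sentence (list of words) to per-character BMES tags."""
        tags = []
        for word in words:
            if len(word) == 1:
                tags.append('S')                      # Single-character word
            else:
                tags.append('B')                      # Word-initial character
                tags.extend('M' * (len(word) - 2))    # Word-internal characters, possibly none
                tags.append('E')                      # Word-final character
        return tags

    print(to_bmes('商品 和 服务'.split()))  # ['B', 'E', 'S', 'B', 'E']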

Finally, a single command performs both training and testing on the fly:

./script/train.sh $dataset

Performance

sighan2005

[Table: results on the sighan2005 datasets]

sighan2008

[Table: results on the sighan2008 datasets]

10-in-1

Since the SIGHAN bakeoff 2008 datasets are proprietary and difficult to obtain, we conducted additional experiments on more freely available datasets, so that the public can test and verify the effectiveness of our method. We applied our solution to 6 additional freely available datasets together with the 4 sighan2005 datasets.

[Table: 10-in-1 results]

Corpora

In this section, we briefly introduce the corpora used in this paper.

10 corpora in this repo

These 10 corpora come either from the official sighan2005 website, from open-source projects, or from researchers' homepages. Their licenses are listed in the following table.

[Table: corpus licenses]

sighan2008

As the sighan2008 corpora are proprietary, we are unable to distribute them. If you have a legal copy, you can replicate our scores by following these instructions.

First, link your sighan2008 data into the data folder of this project:

ln -s /path/to/your/sighan2008/data data/sighan2008

Then, use HanLP to convert Traditional Chinese to Simplified Chinese, as in the following Java snippet (wrapped here in a minimal class with imports so that it compiles as-is; the class name is arbitrary):

    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.corpus.io.IOUtil;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class Utf16ToUtf8
    {
        public static void main(String[] args) throws IOException
        {
            // Read the UTF-16 Traditional Chinese source file.
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(
                "data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf16.seg"
            ), "UTF-16"));
            // Write the Simplified Chinese result as UTF-8.
            BufferedWriter bw = IOUtil.newBufferedWriter(
                "data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf8.seg");
            String line;
            while ((line = br.readLine()) != null)
            {
                for (String word : line.split("\\s"))
                {
                    if (word.length() == 0) continue;
                    bw.write(HanLP.convertToSimplifiedChinese(word));
                    bw.write(" ");
                }
                bw.newLine();
            }
            br.close();
            bw.close();
        }
    }

You need to repeat this for the following 4 files (a Python alternative is sketched after the list):

  1. ckip_train_utf16.seg
  2. ckip_truth_utf16.seg
  3. cityu_train_utf16.seg
  4. cityu_truth_utf16.seg
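
If you would rather stay in Python, the same conversion can be sketched with the pyhanlp package (an assumption: pyhanlp is installed and exposes HanLP.convertToSimplifiedChinese; only the ckip truth path below is taken from the snippet above, so adjust the remaining paths to your copy):

    # Sketch only: pip install pyhanlp, then adjust the four paths to your copy.
    from pyhanlp import HanLP

    for utf16_path in [
        'data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf16.seg',
        # ... the other three files listed above
    ]:
        utf8_path = utf16_path.replace('utf16', 'utf8')
        # Inputs are UTF-16 Traditional Chinese; outputs are UTF-8 Simplified Chinese.
        with open(utf16_path, encoding='utf-16') as fin, \
                open(utf8_path, 'w', encoding='utf-8') as fout:
            for line in fin:
                words = [str(HanLP.convertToSimplifiedChinese(w)) for w in line.split()]
                fout.write(' '.join(words) + '\n')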

Then, uncomment the following code in convert_corpus.py:

    # For researchers who have access to sighan2008 corpus, use official corpora please.
    print('Converting sighan2008 Simplified Chinese corpus')
    datasets = 'ctb', 'ckip', 'cityu', 'ncc', 'sxu'
    convert_all_sighan2008(datasets)
    print('Combining those 8 sighan corpora to one joint corpus')
    datasets = 'pku', 'msr', 'as', 'ctb', 'ckip', 'cityu', 'ncc', 'sxu'
    make_joint_corpus(datasets, 'joint-sighan2008')
    make_bmes('joint-sighan2008')

Finally, you are ready to go:

python3 convert_corpus.py
./script/make.sh joint-sighan2008
./script/train.sh joint-sighan2008

Acknowledgments

  • Thanks to the friends who helped us with the experiments.
  • Credit is also due to the generous researchers who shared their corpora with the public, as listed in the license table. Your datasets genuinely helped small groups like ours that have no funding.
  • The model implementation is modified from a DyNet 1.x version by rguthrie3.


Issues

Reproducing the baseline experiments of "Effective Neural Solution for Multi-Criteria Word Segmentation"

I have recently been reproducing NLP word segmentation work and built a Bi-LSTM-CRF baseline with TensorFlow, but the model fits the bakeoff2005 MSR dataset poorly, and several rounds of debugging found nothing. While searching for references I came across "Effective Neural Solution for Multi-Criteria Word Segmentation", which clearly lists comparative results for various models, and its baseline experiment uses exactly this Bi-LSTM-CRF model. Could I possibly borrow the code of the baseline experiment, solely to track down the error in my own reproduction?

Splits of the ctb dataset

Hello, is the data in data/other/ctb from sighan 2008? If not, by what standard was it split? It does not quite match the splits in "Chinese Comma Disambiguation for Discourse Analysis" (Yang & Xue 2012).

Installing dynet

Following the dynet tutorial, I could not get it installed on either Linux or Windows. How did you install it, or do you run it on a server?

How exactly is the bigram feature et in the paper computed?

Hello, and thanks for sharing. In the paper, where it says 'ft = [ht; et] is the concatenation of BiLSTM hidden state and bigram feature embedding et', how is et computed?

Also, if word embedding here refers to characters, what does character embedding correspond to?
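
Judging from the model code visible in the traceback of the memory-allocation issue below (self.bigram_lookup = self.model.add_lookup_parameters((len(b2i), word_embedding_dim))), each distinct character bigram gets its own row in a lookup table, so et is an embedding lookup. A hedged DyNet sketch of that pattern, with toy data and not necessarily the paper's exact formulation:

    import dynet as dy

    pc = dy.ParameterCollection()
    b2i = {'商品': 0, '品和': 1, '和服': 2}   # Toy bigram-to-index map
    emb_dim = 100
    # One embedding row per distinct bigram, mirroring bigram_lookup in model.py.
    bigram_lookup = pc.add_lookup_parameters((len(b2i), emb_dim))

    dy.renew_cg()
    e_t = bigram_lookup[b2i['商品']]   # et for a position whose bigram is '商品'
    print(e_t.dim())                   # ((100,), 1)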

A question about the experimental results

Hello, I am working on related research and would like to cite your paper, but I could not find OOV recall numbers in it, and I need them. Due to circumstances beyond my control I may be unable to reproduce the model at the moment. Since you used the official scoring script, it should also report OOV recall — could you share a copy of those OOV recall results?
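
For reference, the OOV recall reported by the official SIGHAN score script is ordinary recall restricted to gold words that never occur in the training data. A small illustrative sketch of the computation (not the official script):

    def char_spans(words):
        """Map a segmented sentence to its set of (start, end) character spans."""
        spans, i = set(), 0
        for w in words:
            spans.add((i, i + len(w)))
            i += len(w)
        return spans

    def oov_recall(gold_sents, pred_sents, train_vocab):
        """Recall over gold words absent from the training vocabulary."""
        hit = total = 0
        for gold, pred in zip(gold_sents, pred_sents):
            pred_spans = char_spans(pred)
            i = 0
            for w in gold:
                span = (i, i + len(w))
                i += len(w)
                if w not in train_vocab:        # OOV gold word
                    total += 1
                    hit += span in pred_spans   # Counted only if segmented exactly
        return hit / total if total else 0.0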

Error: RuntimeError: CPU memory allocation failed

root@liangzhiNLP:/home/liangzhi/liangxingzheng/multi-criteria-cws/multi-criteria-cws# ./script/train.sh joint-10in1 --dynet-seed 10364 --python-seed 840868838938890892
[dynet] random seed: 10364
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
model.py --dataset dataset/joint-10in1/dataset.pkl --num-epochs 60 --word-embeddings data/embedding/character.vec --log-dir result/joint-10in1 --dropout 0.2 --learning-rate 0.01 --learning-rate-decay 0.9 --hidden-dim 100 --dynet-seed 22059 --bigram --skip-dev --dynet-seed 10364 --python-seed 840868838938890892

Namespace(always_model=False, batch_size=20, bigram=True, char_embedding_dim=100, char_embeddings=None, char_hidden_dim=100, clip_norm=None, dataset='dataset/joint-10in1/dataset.pkl', debug=False, dropout=0.2, dynet_autobatch=None, dynet_gpus=None, dynet_mem=None, dynet_seed=10364, dynet_weight_decay=None, hidden_dim=100, learning_rate=0.01, learning_rate_decay=0.9, log_dir='result/joint-10in1', lowercase_words=False, lstm_layers=1, no_model=False, no_we=False, no_we_update=False, num_epochs=60, old_model=None, python_seed=840868838938890892, skip_dev=True, subset=None, task_name='2018-01-04-15-01-54', test=False, tie_two_embeddings=False, use_char_rnn=False, word_embeddings='data/embedding/character.vec')
Python random seed: 840868838938890892

Memory pool info for each devices:
Device CPU - FOR Memory 128MB, BACK Memory 128MB, PARAM Memory 128MB, SCRATCH Memory 128MB.
CPU memory allocation failed n=570425344 align=32
Traceback (most recent call last):
File "model.py", line 492, in
tie_two_embeddings=options.tie_two_embeddings
File "model.py", line 56, in init
self.bigram_lookup = self.model.add_lookup_parameters((len(b2i), word_embedding_dim))
File "_dynet.pyx", line 1183, in _dynet.ParameterCollection.add_lookup_parameters
File "_dynet.pyx", line 1210, in _dynet.ParameterCollection.add_lookup_parameters
RuntimeError: CPU memory allocation failed

What causes this error? Do I need to change the code, or is it an environment problem?
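
A hedged note rather than a verified fix: the failed allocation (n=570425344 bytes, about 544MB) is for the bigram lookup table, and joint-10in1 has a very large bigram vocabulary, so this looks like the process running out of CPU RAM rather than a code bug. Freeing memory or moving to a machine with more RAM should help; DyNet's standard --dynet-mem flag (pool size in MB, forwarded like --dynet-seed above) also lets you budget memory explicitly, e.g.:

./script/train.sh joint-10in1 --dynet-mem 4096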

Source of the pretrained word embeddings

Hello, I have recently been running bilstm-crf segmentation experiments, and using the pretrained word embeddings from your project improved my results by two points. May I ask where your word embeddings come from — did you train them yourself?
