
kdconv's Introduction

KdConv

KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset that grounds the topics of multi-turn conversations in knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel) and 86K utterances, with an average of 19.0 turns per conversation. These conversations contain in-depth discussions of related topics and natural transitions between multiple topics, and the corpus can also be used to explore transfer learning and domain adaptation.

We provide several benchmark models to facilitate future research on this corpus.

Our paper is available on arXiv and the ACL Anthology. If the corpus is helpful to your research, please kindly cite our paper:

@inproceedings{zhou-etal-2020-kdconv,
    title = "{K}d{C}onv: A {C}hinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation",
    author = "Zhou, Hao  and
      Zheng, Chujie  and
      Huang, Kaili  and
      Huang, Minlie  and
      Zhu, Xiaoyan",
    booktitle = "ACL",
    year = "2020"
}

Example

An example of the conversation with annotations in our corpus:

[Figure: example conversation]

Each utterance in the conversation is annotated with the knowledge graph triplets it refers to. As the discussion deepens, the conversation also transitions between multiple topics.

Data

(You may also need the Tencent Pretrained Word Embedding we used in the experiments.)

The data files are in the ./data folder, which contains one folder per domain (film, music, and travel). Each domain folder includes the split sets train.json, dev.json, and test.json, together with the corresponding knowledge base file kb_DOMAIN.json that was used to collect and construct the corpus.
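The folder layout described here can be turned into file paths as in the minimal sketch below. The helper name `domain_files` and the `data_root` parameter are illustrative assumptions and are not part of the repository.

```python
from pathlib import Path

DOMAINS = ("film", "music", "travel")
SPLITS = ("train", "dev", "test")

def domain_files(domain, data_root="./data"):
    """Return the split files and the knowledge base file of one domain (hypothetical helper)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    folder = Path(data_root) / domain
    # train.json, dev.json, and test.json live in the domain folder
    files = {split: folder / f"{split}.json" for split in SPLITS}
    # the knowledge base file follows the kb_DOMAIN.json naming pattern
    files["kb"] = folder / f"kb_{domain}.json"
    return files

paths = domain_files("music")
for key, path in paths.items():
    print(key, path.as_posix())
```

Calling `domain_files` with another domain name returns the same four keys, so downstream code can stay domain-agnostic.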

Take the music domain as an example. After loading train.json, you will get a list of conversations. Each conversation looks like the following:

{
  "messages": [
    {
      "message": "对《我喜欢上你时的内心活动》这首歌有了解吗?"
    },
    {
      "attrs": [
        {
          "attrname": "Information",
          "attrvalue": "《我喜欢上你时的内心活动》是由韩寒填词,陈光荣作曲,陈绮贞演唱的歌曲,作为电影《喜欢你》的主题曲于2017年4月10日首发。2018年,该曲先后提名第37届香港电影金像奖最佳原创电影歌曲奖、第7届阿比鹿音乐奖流行单曲奖。",
          "name": "我喜欢上你时的内心活动"
        }
      ],
      "message": "有些了解,是电影《喜欢你》的主题曲。"
    },
    ...
    {
      "attrs": [
        {
          "attrname": "代表作品",
          "attrvalue": "旅行的意义",
          "name": "陈绮贞"
        },
        {
          "attrname": "代表作品",
          "attrvalue": "时间的歌",
          "name": "陈绮贞"
        }
      ],
      "message": "我还知道《旅行的意义》与《时间的歌》,都算是她的代表作。"
    },
    {
      "message": "好,有时间我找出来听听。"
    }
  ],
  "name": "我喜欢上你时的内心活动"
}
  • name is the starting topic (entity) of the conversation

  • messages is a list of all the turns in the dialogue. For each turn:

    • message is the utterance

    • attrs is a list of knowledge graph triplets referred to by the utterance. For each triplet:

      • name is the head entity
      • attrname is the relation
      • attrvalue is the tail entity

      Note that triplets whose attrname is "Information" hold unstructured knowledge about the head entity.
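The fields above can be read as in the following minimal sketch. It inlines a shortened copy of the sample conversation so that it runs without the data files; the function name `iter_turns` is an illustrative assumption.

```python
import json

# A shortened conversation in the format shown above, inlined for illustration.
sample = json.loads("""
[{"name": "我喜欢上你时的内心活动",
  "messages": [
    {"message": "对《我喜欢上你时的内心活动》这首歌有了解吗?"},
    {"message": "我还知道《旅行的意义》与《时间的歌》,都算是她的代表作。",
     "attrs": [
       {"name": "陈绮贞", "attrname": "代表作品", "attrvalue": "旅行的意义"},
       {"name": "陈绮贞", "attrname": "代表作品", "attrvalue": "时间的歌"}
     ]}
  ]}]
""")

def iter_turns(conversation):
    """Yield (utterance, triplets) per turn; turns without "attrs" yield an empty list."""
    for turn in conversation["messages"]:
        triplets = [(a["name"], a["attrname"], a["attrvalue"])
                    for a in turn.get("attrs", [])]
        yield turn["message"], triplets

turns = list(iter_turns(sample[0]))
for utterance, triplets in turns:
    print(utterance, triplets)
```

With the real corpus, the inlined string would be replaced by `json.load` on the opened train.json file (opened with `encoding="utf-8"`).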

After loading kb_music.json, you will get a dictionary. Each item looks like the following:

"忽然之间": [
  [
    "忽然之间",
    "Information",
    "《忽然之间》是歌手 莫文蔚演唱的歌曲,由 周耀辉, 李卓雄填词, 林健华谱曲,收录在莫文蔚1999年发行专辑《 就是莫文蔚》里。"
  ],
  [
    "忽然之间",
    "谱曲",
    "林健华"
  ]
  ...
]

The key is a head entity, and the value is a list of corresponding triplets.
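A lookup in this dictionary might look like the sketch below. The knowledge base entry is inlined with an abbreviated "Information" text so the example runs on its own, and the helper name `split_knowledge` is an illustrative assumption.

```python
import json

# One entry in the kb_music.json format, with the "Information" text abbreviated.
kb = json.loads("""
{"忽然之间": [
   ["忽然之间", "Information", "《忽然之间》是歌手莫文蔚演唱的歌曲。"],
   ["忽然之间", "谱曲", "林健华"]
]}
""")

def split_knowledge(kb, entity):
    """Separate an entity's triplets into structured triplets and unstructured text."""
    structured, unstructured = [], []
    for head, relation, tail in kb.get(entity, []):
        if relation == "Information":
            unstructured.append(tail)       # free-text description of the head entity
        else:
            structured.append((head, relation, tail))
    return structured, unstructured

structured, unstructured = split_knowledge(kb, "忽然之间")
print(structured)
print(unstructured)
```

An entity that is missing from the dictionary simply yields two empty lists, which avoids a KeyError when a conversation mentions an unknown head entity.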


kdconv's Issues

Can the error "TypeError: missing a required argument: 'file_id'" be resolved?

Traceback (most recent call last):
  File "/home/xhc/xuhc/KdConv-master/benchmark/seq2seq/run.py", line 84, in <module>
    run(*sys.argv[1:])
  File "/home/xhc/xuhc/KdConv-master/benchmark/seq2seq/run.py", line 80, in run
    main(args)
  File "/home/xhc/xuhc/KdConv-master/benchmark/seq2seq/main.py", line 61, in main
    data = try_cache(data_class, (args.datapath,), args.cache_dir)
  File "/home/xhc/xuhc/KdConv-master/benchmark/seq2seq/utils/cache_helper.py", line 21, in try_cache
    obj = module(*args)
  File "/home/xhc/xuhc/KdConv-master/benchmark/seq2seq/myCoTK/dataloader/single_turn_dialog.py", line 192, in __init__
    super(MySeq2Seq, self).__init__()
  File "/home/xhc/anaconda3/envs/xhc/lib/python3.6/site-packages/cotk/_utils/hooks.py", line 49, in wrapped
    bound = sign.bind(*args, **kwargs)
  File "/home/xhc/anaconda3/envs/xhc/lib/python3.6/inspect.py", line 2997, in bind
    return args[0]._bind(args[1:], kwargs)
  File "/home/xhc/anaconda3/envs/xhc/lib/python3.6/inspect.py", line 2912, in _bind
    raise TypeError(msg) from None
TypeError: missing a required argument: 'file_id'
INFO: local path: ../data/film
INFO: processor type: Default

Process finished with exit code 1

About the test set

Hi, congratulations on the acceptance of your ACL paper, and thank you for your contributions.
I noticed that only the training data is available. Do you have an official test set?

Several files are missing

When I run bertret, several files cannot be found.

1. /home/zhengchujie/bert_torch/chinese_wwm_pytorch/ (bert_config.json, vocab.txt, and pytorch_model.bin)
I therefore downloaded an alternative from https://github.com/ymcui/Chinese-BERT-wwm.
However, there are several warnings:
INFO - pytorch_transformers.tokenization_utils Model name 'KdConv/benchmark/_bert_chinese_wwm_pytorch/vocab.txt' not found in model shortcut name list
INFO - pytorch_transformers.tokenization_utils - Didn't find file /KdConv/benchmark/_bert_chinese_wwm_pytorch/added_tokens.json&special_tokens_map.json&tokenizer_config.json. We won't load it.

2. FileNotFoundError: [Errno 2] No such file or directory: '../data/resources/chinese_stop_words.txt'
I therefore cloned https://github.com/goto456/stopwords and renamed cn_stopwords.txt to chinese_stop_words.txt.

Please provide the corresponding URLs for these files.

Thanks

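The paper reports PPL at test time, but I cannot find PPL in the test code

First of all, thank you very much for releasing the code and the dataset. I read your paper carefully and found it very inspiring. Thank you again.
However, I ran into a small problem when reproducing your code. After training the HRED model, I could not find the code that computes perplexity (PPL) at test time. I did find the implementations of BLEU and distinct, and I was able to obtain their values.
I have two questions:
1) Is the PPL reported in the paper obtained on the validation set during validation, or on the test set?
2) If it is obtained at test time, could the code for the perplexity metric be released?
I look forward to your reply. Thank you!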
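For reference, perplexity is conventionally defined as the exponential of the average negative log-likelihood per token. The sketch below implements that standard definition only. It is not the authors' evaluation code, and the log-probabilities are made-up values for illustration.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability per token), using natural logarithms."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical log-probabilities that a model assigned to four reference tokens.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
print(ppl)
```

A model that assigns every reference token a probability of 0.25 is as uncertain as a uniform choice among four options, which is what the resulting value expresses.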

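Question about the dataset statistics

Hello. In the KdConv dataset statistics table, does "Avg. # tokens per utterance" refer to the number of words after word segmentation? Also, does "Avg. # characters per utterance" mean splitting by character, so that, for example, the English word "utterance" would be counted as length 9? Thank you!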

How to find the corresponding KG for each dialogue

Hi,
Very interesting task. We found that the knowledge bases are separated from the dialogues, while each turn is associated with a specific piece of knowledge.
Are the dev/test sets also associated with specific pieces of knowledge?

Lastly, please upload the dev set; it shouldn't be held out.

Cannot find chinese_wwm_pytorch

Hi, I encountered a problem when running the code in benchmark/bertret and would like to ask for your help. It seems that 'chinese_wwm_pytorch' cannot be found, including all related files (/vocab.txt, /added_tokens.json, etc.).

Another odd problem: when running ./train_film (or music, travel) for models other than BERT, such as LM and seq2seq, I always hit a segmentation fault and could not figure out why.

Thanks so much!

list index out of range

02/21/2022 15:34:22 - INFO - main - ***** Running training *****
02/21/2022 15:34:22 - INFO - main - Num post-response pairs = 27550
02/21/2022 15:34:22 - INFO - main - Batch size = 8
02/21/2022 15:34:22 - INFO - main - Num steps = 10331
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:/Users/25687/Desktop/NLP_Paper/KdConv-master/benchmark/bertret/run_BERTRetrieval.py", line 384, in <module>
    main()
  File "C:/Users/25687/Desktop/NLP_Paper/KdConv-master/benchmark/bertret/run_BERTRetrieval.py", line 255, in main
    data = dataManager.get_next_batch(key='train')
  File "D:\Anaconda\envs\KdConv-master\lib\site-packages\cotk\dataloader\dataloader.py", line 195, in get_next_batch
    res = self.get_batch(key, index)
  File "C:\Users\25687\Desktop\NLP_Paper\KdConv-master\benchmark\bertret\myCoTK\dataloader\bert_dataloader.py", line 163, in get_batch
    resp_distractors_bert = self.data[key]['resp_distractors_bert'][idx]
IndexError: list index out of range
train set restart, 3443 batches and 6 left
I ran into this problem when running the bertret code on Windows. It keeps reporting that the index is out of range. How can I solve it?

Confused about the size of the datasets

Hi, first of all, thanks for your wonderful work.

After processing the datasets, I found that the size of the dataset differs from what the paper states. The paper says that each domain contains 1.5k dialogs, but I can only obtain 1.2k for each domain.

Maybe I did something wrong. Can you help me troubleshoot the issue?

Thank you so much.
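One possible explanation, not confirmed in this thread, is that only one of the train/dev/test files was counted. The sketch below counts conversations per split in a domain folder. The helper name `count_conversations` is an illustrative assumption, and the demo runs on tiny placeholder files rather than the real corpus.

```python
import json
import tempfile
from pathlib import Path

def count_conversations(domain_dir, splits=("train", "dev", "test")):
    """Count the conversations in each split file of one domain folder (hypothetical helper)."""
    counts = {}
    for split in splits:
        path = Path(domain_dir) / f"{split}.json"
        if path.exists():
            with open(path, encoding="utf-8") as f:
                counts[split] = len(json.load(f))  # each file holds a list of conversations
    counts["total"] = sum(counts.values())
    return counts

# Demo on made-up placeholder files; the sizes here say nothing about the real splits.
with tempfile.TemporaryDirectory() as tmp:
    for split, n in (("train", 3), ("dev", 1), ("test", 1)):
        placeholder = [{"name": "x", "messages": []}] * n
        (Path(tmp) / f"{split}.json").write_text(json.dumps(placeholder), encoding="utf-8")
    counts = count_conversations(tmp)
print(counts)
```

Running the helper on ./data/film, ./data/music, and ./data/travel would show whether the per-domain totals match the figure given in the paper.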
