chatopera / insuranceqa-corpus-zh

:helicopter: Insurance-industry corpus for chatbots

License: Other

Python 93.91% Shell 6.09%

corpus chatbot qasystem natural-language-processing natural-language-understanding machine-learning dataset question-answering insurance insuranceqa-corpus-zh

insuranceqa-corpus-zh's Introduction

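Insurance Industry Corpus

This corpus contains questions and answers collected from the Insurance Library website.

To the best of our knowledge, when this dataset was released in 2017 it was the first open QA corpus in the insurance domain:

The questions in the corpus were asked by real-world users, and the high-quality answers were provided by professionals with deep domain knowledge. This makes it a corpus of real value, not a toy.
In the original insuranceQA paper, the corpus is used for the answer-selection task. Other uses are also possible: for example, autonomous learning through reading comprehension of the answers or observational learning, so that a system can eventually produce its own answers to unseen questions.
The dataset has two parts: the "QA corpus" and the "QA-pair corpus". The QA corpus is translated from the original English data with no further processing. The QA-pair corpus is derived from the QA corpus by applying word segmentation, removing punctuation and stop words, and adding labels, so it can be fed directly into machine-learning tasks. If you are not satisfied with the data format or the segmentation quality, you can process the QA corpus with other methods to obtain data for training models.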

Installation and Usage

1/3 Dependencies

Python: 2.x, 3.x
Pip

2/3 Install the script package

pip install -U insuranceqa_data

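3/3 Install the corpus package

Go to the license store and purchase a license. After the purchase, open [License - Details] and click [Copy License ID].

Then set the environment variable INSQA_DL_LICENSE, for example from a command-line terminal: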

# Linux / macOS
export INSQA_DL_LICENSE=YOUR_LICENSE
## e.g. if your license id is `FOOBAR`, run `export INSQA_DL_LICENSE=FOOBAR`

# Windows
## 1/2 Command Prompt
set INSQA_DL_LICENSE=YOUR_LICENSE
## 2/2 PowerShell
$env:INSQA_DL_LICENSE='YOUR_LICENSE'

Finally, run the following command to download the data.

python -c "import insuranceqa_data; insuranceqa_data.download_corpus()"
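
The download step can also be wrapped in a small script that fails early when the license variable is missing. The sketch below is an illustration only: `require_license` and `download` are hypothetical helper names, and the only library call taken from this README is `insuranceqa_data.download_corpus()`.

```python
import os


def require_license(env=None):
    """Return the license id stored in INSQA_DL_LICENSE, or raise a clear error."""
    env = os.environ if env is None else env
    license_id = env.get("INSQA_DL_LICENSE", "").strip()
    if not license_id:
        raise RuntimeError(
            "INSQA_DL_LICENSE is not set; export your license id before downloading"
        )
    return license_id


def download():
    """Check the license variable, then run the download call from this README."""
    require_license()
    import insuranceqa_data  # imported lazily so the check works without the package
    insuranceqa_data.download_corpus()
```

Calling `download()` then behaves like the one-line command, except that a missing variable is reported before any network access is attempted.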

Data Format

The data comes in two formats: POOL and PAIR. The PAIR format is better suited to training machine-learning models.

Loading POOL Data

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pool_train() # training set
test_data = insuranceqa.load_pool_test()   # test set
valid_data = insuranceqa.load_pool_valid() # validation set

# valid_data, test_data and train_data share the same properties
for x in train_data:                       # print each record
    print('index %s value: %s ++$++ %s ++$++ %s ++$++ %s' % \
     (x, train_data[x]['zh'], train_data[x]['en'],
      train_data[x]['answers'], train_data[x]['negatives']))

answers_data = insuranceqa.load_pool_answers()
for x in answers_data:                     # answer data
    print('index %s: %s ++$++ %s' % (x, answers_data[x]['zh'], answers_data[x]['en']))

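Data Design

-	Questions	Answers	Vocabulary (English)
Train	12,889	21,325	107,889
Validation	2,000	3,354	16,931
Test	2,000	3,308	16,815

Each record contains the question in Chinese and in English, its positive answers, and its negative answers. Every question has at least one positive answer, usually between 1 and 5, and all of them are correct. There are 200 negative answers per question. The negatives are built by retrieval against the question, so they are related to the question but are not correct answers.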

{
    "INDEX": {
        "zh": "Chinese text",
        "en": "English text",
        "domain": "insurance type",
        "answers": [""], # list of positive answers
        "negatives": [""] # list of negative answers
    },
    more ...
}

Train: corpus/pool/train.json.gz
Validation: corpus/pool/valid.json.gz
Test: corpus/pool/test.json.gz
Answers: corpus/pool/answers.json, 27,413 answers in total, in JSON format:

{
    "INDEX": {
        "zh": "Chinese text",
        "en": "English text"
    },
    more ...
}
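
To read the answer text for a question, the entries of `answers` and `negatives` can be looked up in the answers data. The sketch below uses small in-memory dictionaries shaped like the two schemas in this section instead of the real corpus. It assumes (this README does not confirm it) that each entry in `answers` is an INDEX key of the answers file; `resolve_answers` is a hypothetical helper and all values are made up.

```python
def resolve_answers(record, answers_data, lang="zh"):
    """Map the answer ids of one pool record to their text in the given language."""
    return [answers_data[a][lang] for a in record["answers"] if a in answers_data]


# Toy data mirroring the POOL and answers schemas (values are made up).
pool = {
    "0": {"zh": "问题", "en": "question", "domain": "life-insurance",
          "answers": ["7"], "negatives": ["8", "9"]},
}
answers = {
    "7": {"zh": "正确答案", "en": "correct answer"},
    "8": {"zh": "负例一", "en": "negative one"},
    "9": {"zh": "负例二", "en": "negative two"},
}

print(resolve_answers(pool["0"], answers, lang="en"))
```

The same lookup works for `negatives` by reading `record["negatives"]` instead.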

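Chinese-English Parallel Files

QA Pairs

Format: INDEX ++$++ insurance type ++$++ Chinese ++$++ English

corpus/pool/train.txt.gz, corpus/pool/valid.txt.gz, corpus/pool/test.txt.gz.

Answers

Format: INDEX ++$++ Chinese ++$++ English

corpus/pool/answers.txt.gz

The corpus is compressed with gzip to reduce its size. You can read the data with commands such as zmore, zless, zcat, and zgrep.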

zmore pool/test.txt.gz
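
The same files can be read from Python with the standard `gzip` module. This is a minimal sketch, assuming the files are UTF-8 encoded and that fields are separated by ` ++$++ ` with a single space on each side; `read_delimited` is a hypothetical helper, not part of `insuranceqa_data`.

```python
import gzip


def read_delimited(path, sep=" ++$++ "):
    """Yield each non-empty line of a gzip-compressed text file as a list of fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                yield line.split(sep)
```

For a QA-pair file, each yielded list would then hold the index, insurance type, Chinese text, and English text, in that order.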

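Loading PAIR Data

The QA data still needs a lot of work before it can be fed to a machine-learning model: word segmentation, stop-word removal, punctuation removal, and labeling. You can do this processing yourself on top of the QA data, and the choice of segmentation tool affects model training. To make the data usable quickly, insuranceqa-corpus-zh provides a dataset that was segmented with HanLP, stripped of punctuation and stop words, and labeled. This dataset is derived entirely from the QA data.

Loading the Data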

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()

# valid_data, test_data and train_data share the same properties
for x in test_data:
    print('index %s value: %s ++$++ %s ++$++ %s' % \
     (x['qid'], x['question'], x['utterance'], x['label']))

vocab_data = insuranceqa.load_pairs_vocab()
vocab_data['word2id']['UNKNOWN']
vocab_data['id2word'][0]
vocab_data['tf']
vocab_data['total']

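Data Design

vocab_data contains word2id (dict, word to id), id2word (dict, id to word), tf (dict, term-frequency statistics), and total (total number of words). The out-of-vocabulary token is UNKNOWN, and its id is 0.

train_data, test_data, and valid_data share the same format. qid is the question id, question is the question, and utterance is the reply. A label of [1,0] means the reply is a correct answer and [0,1] means it is not, so utterance covers both positive and negative examples. Each question has 10 negative examples and 1 positive example.

train_data: 12,889 questions, 141,779 records, positive:negative = 1:10
test_data: 2,000 questions, 22,000 records, positive:negative = 1:10
valid_data: 2,000 questions, 22,000 records, positive:negative = 1:10

Sentence lengths: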
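
Because the PAIR data stores word ids rather than text, id2word is what turns a question or utterance back into words. The sketch below uses a made-up vocabulary shaped like `vocab_data['id2word']`; `decode` is a hypothetical helper, and the handling of both integer and string keys is a precaution, since this README does not say which key type the loaded dict uses.

```python
def decode(ids, id2word, unknown="UNKNOWN"):
    """Turn a sequence of word ids back into a list of words."""
    words = []
    for i in ids:
        # Keys may be ints or strings depending on how the vocabulary was loaded.
        word = id2word.get(i, id2word.get(str(i), unknown))
        words.append(word)
    return words


# Toy vocabulary shaped like vocab_data['id2word'] (contents are made up).
id2word = {"0": "UNKNOWN", "1": "保险", "2": "理赔"}
print(" ".join(decode([1, 2, 99], id2word)))
```

Ids that are missing from the vocabulary fall back to the UNKNOWN token.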
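
With a 1:10 ratio, a model that always predicts "negative" already reaches about 10/11 accuracy, which is worth keeping in mind when reading accuracy numbers on this data. The sketch below counts labels on made-up records shaped like the PAIR data; `label_counts` is a hypothetical helper.

```python
def label_counts(records):
    """Count positive ([1, 0]) and negative ([0, 1]) records."""
    positives = sum(1 for r in records if list(r["label"]) == [1, 0])
    return positives, len(records) - positives


# Toy records: 1 positive and 10 negatives for a single question.
records = [{"qid": "q1", "label": [1, 0]}] + [{"qid": "q1", "label": [0, 1]}] * 10
pos, neg = label_counts(records)
print(pos, neg)
# Accuracy of a trivial classifier that always predicts "negative".
print(round(neg / (pos + neg), 4))
```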

max len of valid question : 31, average: 5(max)
max len of valid utterance: 878(max), average: 165(max)
max len of test question : 33, average: 5
max len of test utterance: 878, average: 161
max len of train question : 42(max), average: 5
max len of train utterance: 878, average: 162
vocab size: 24997

Machine Learning Projects

The corpus can be used together with the following open-source code:

deep-qa-1: Baseline model

InsuranceQA TensorFlow: CNN with TensorFlow

n-grams-get-started: n-gram model

word2vec-get-started: word-vector model

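Statements

Statement 1: insuranceqa-corpus-zh

This dataset was produced by translating insuranceQA, and the code is released under the Chunsong Public License, version 1.0. The data is for research use only. If you publish anything based on it in any media, journal, magazine, or blog, you must cite it and include the address:

InsuranceQA Corpus, Chatopera Inc., https://github.com/chatopera/insuranceqa-corpus-zh, 07 27, 2017

Any data derived from insuranceqa-corpus must also be made open and must carry statements consistent with Statement 1 and Statement 2.

Statement 2: insuranceQA

This dataset is provided for research purposes only. If you publish anything using this data, please cite our paper: Applying Deep Learning to Answer Selection: A Study and An Open Task. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015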


insuranceqa-corpus-zh's Issues

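Problem with the dataset

Bro, setting positives to negatives at 1:10 is really unreasonable. With this setup the results misjudge both positives and negatives, so the accuracy ceiling ends up at 10/11. You should pick a dataset that reflects real conditions; only then are the results convincing.

Question about the pair dataset, thanks

Hello,
Could you explain in detail how the values in the file iqa.train.tokenlized.pair.json under the corpus directory of insuranceqa-corpus-zh map to each other? In particular, it is unclear to me how the "question" field maps back to the original text.

    I need to reference this dataset in upcoming experiments, so I hope you can reply soon. Thanks.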

File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 257, in finalize_options
ep.require(installer=self.fetch_build_egg)
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2029, in require
working_set.resolve(self.dist.requires(self.extras),env,installer))
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 579, in resolve
env = Environment(self.entries)
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 748, in __init__
self.scan(search_path)
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 777, in scan
for dist in find_distributions(item):
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 1757, in find_on_path
path_item,entry,metadata,precedence=DEVELOP_DIST
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2151, in from_location
py_version=py_version, platform=platform, **kw
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2128, in __init__
self.project_name = safe_name(project_name or 'Unknown')
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 1139, in safe_name
return re.sub('[^A-Za-z0-9.]+', '-', name)
RuntimeError: maximum recursion depth exceeded

visual module?

Hello, what is the visual, is it a package?

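Why does the data have no punctuation?

Overview

The data contains no punctuation

After mapping through id2word, the text in utterance has no punctuation or sentence breaks. Could punctuation be added?

Ideal solution

Could data with punctuation be provided?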

Are there any state-of-the-art methods for insuranceQA (v2)?

The original dataset has two versions: version 1 and version 2.
I have checked some papers, but they all run experiments on the version 1 dataset, where the state-of-the-art model achieved 0.77. Is there any paper with experiments on the version 2 dataset?

When training on the corpus, where is the model saved? How do I use it after it is saved?


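Selection of negative examples

Hello, and thank you very much for your work. I have a small question: your processed dataset forms QA pairs with a positive-to-negative ratio of 1:10. How were the negatives selected? Were they randomly sampled, or extracted with an algorithm such as BM25?

CS224n notes 16: DMN and question answering systems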

http://www.hankcs.com/nlp/cs224n-dmn-question-answering.html

v2.1 is available

hi, folks

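In v1 of insuranceqa-corpus-zh, the corpus was not well suited to machine learning because it had no word segmentation, no punctuation or stop-word removal, and no labels. Version v2.1 adds load_pairs_test, load_pairs_train, and load_pairs_valid, as well as load_pairs_vocab. The data is vocabulary-based: test, train, and valid all use word ids, and a label marks whether each reply is a positive or a negative example. With the pairs data, it is easier to train with libraries such as: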

DeepQA2

InsuranceQA TensorFlow

Chatbot Retrieval

Detailed documentation: https://github.com/Samurais/insuranceqa-corpus-zh/releases/tag/v2.1

Quick upgrade

pip install --upgrade insuranceqa_data

@sjqzhang, @rgtjf, @fssqawj

IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:726)

Description

Running insuranceqa.load_pairs_train() always raises "IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:726)". The problem seems to be at "wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz", out = os.path.join(curdir, 'pairs'))". Did I miss installing something?

Environment

python 2.7
python 3.6

Operating system

Windows 10

Add baseline model and data

Good news!

Added a baseline model

A deep-learning network that trains a QA model on the insuranceqa-corpus-zh corpus

Program and results

@fssqawj
@rgtjf
@sjqzhang
@Samurais
@shibing624
@sizhen
@ax4
@park
@songzi1229
@AdolHong
@bcao
@robotdj
@Allensmile
@behonests
@playniuniu
@xuansage
@zixia
@LucasHood001
@madfrog2047
@greatgeekgrace
@aliray

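Can the data be loaded locally?

Can the data be loaded from local files, so that it does not have to be downloaded every time?

The init code in the pypi directory differs from the code actually installed by pip install --upgrade insuranceqa_data. Has it not been updated?

How to output the answer results

How do I output the results?

About using the model?

Hello, how do I use the model? For example, given an input question, how do I get the corresponding answer?

After a direct git clone, the dataset cannot be opened with tar

Command 1: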
tar xvf iqa.train.json.gz
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

Command 2:
tar -xvzf iqa.train.json.gz
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

Command 3:
unzip iqa.train.json.gz
Archive: iqa.train.json.gz
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of iqa.train.json.gz or
iqa.train.json.gz.zip, and cannot find iqa.train.json.gz.ZIP, period.

Please check whether something is wrong with the data format.
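
The README states that the corpus is compressed with gzip, so a `.json.gz` file is presumably a single gzip stream rather than a tar or zip archive, which would explain the errors from tar and unzip. A minimal Python sketch for checking and reading such a file follows; `is_gzip` and `load_json_gz` are hypothetical helpers, UTF-8 encoding is an assumption, and nothing is assumed about the structure of the decoded JSON.

```python
import gzip
import json


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def load_json_gz(path):
    """Decompress a gzip file and parse its content as JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

On the command line, gunzip or zcat serve the same purpose.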

Hello, how do I save the trained neural network model?

Unable to fetch data

Data cannot be fetched through the API; the link appears to be dead.

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pairs_train()

 [insuranceqa_data] downloading data https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz ... 

...other intermediate logs omitted...

File /usr/local/lib/python3.8/socket.py:796, in create_connection(address, timeout, source_address)
    794 if source_address:
    795     sock.bind(source_address)
--> 796 sock.connect(sa)
    797 # Break explicitly a reference cycle
    798 err = None

OSError: [Errno socket error] [Errno 110] Connection timed out
