dongrixinyu / jionlp

A Chinese NLP preprocessing and parsing toolkit: accurate, efficient, and easy to use. www.jionlp.com

Home Page: http://www.jionlp.com/

License: Apache License 2.0

Python 99.88% Shell 0.12%

apache2 chinese natural-language-processing ner nlp nlp-parse preprocessing python time-parse time-parsing

jionlp's Introduction

Hi there

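🔭 Working on NLP, AIGC, and ffmpeg.
🌱 Python, C and C++ coder.
📫 Email: [email protected]
👋 Author of the Juejin e-book 《人人都能看懂的 ChatGPT 原理课》 (A ChatGPT Principles Course Anyone Can Understand)
🦉 Author of the Zhixuetang course 《ChatGPT与AI革命》 (ChatGPT and the AI Revolution)
🔭 WeChat official account: JioNLP, sharing AI algorithms and C programming.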

jionlp's People

hello-toufu awoziji xiaokai01 barryzm suifei fighting41love williamdeve buaaliuming jingmouren allensmile ddxx01 tdlist daniellpw peiguijun qypf xrosliang wzf9 arryboom tlntin hai-m tikrgo nightwish-cn raypopo atakey lszxtcdj sumerzhang chenbing-ml mandyyang1989 fresh382227905 illusions-lyy dtmndas lutao0211 zhajiahe shark803 dracenwu nice70 guhaishuo watermelondududu tianfrank jkyang01 cendeavor psyxusheng bo-scnu liuwq168 spxia tianyunzqs kevinjunwei xiaomeng654321 zhao-han-itp leebeep gmmo526 zaynstark dophist edisontu gptcod lijianss sobeited ershijiu sysujayce rogerspy kgofm tangpeng19 lumin115 fashion-john ayoubu houpanpan maxindian wutonghua yuzhang112 znsoftm weibobo2015 rachel2011 fanfanfeng liuqin1bo rfhzhj 15737939656 colionx sibonjia mingyue19951206 idealwei haif-liu xalanq zsctju dumpmemory xiyou1024 kuustudio dddouble zhiqiang-ga0 anshshan brokenax gshan4056 franztao x-hao goodboyyes2009 pk350 xiejinwen113 hoho35 jumppp saler-1 lldfire

jionlp's Issues

[BUG] URL regex matching error

Description
The URL regex fails to match.

Please describe your issue here
When calling the remove_url function, some URLs are not matched.

Version:
Code & Text:

sent1 = "抖音知识分享 https://v.douyin.com/RtKFFah/ 复制Ci鏈接，打开Dou音搜索，直接观看視頻"
sent2 = "抖音知识分享https://v.douyin.com/RtKFFah/复制Ci鏈接，打开Dou音搜索，直接观看視頻"
print("1", jionlp.remove_url(sent1))
print("2", jionlp.remove_url(sent2))

Log:

1.抖音知识分享 https://v.douyin.com/RtKFFah/ 复制Ci鏈接，打开Dou音搜索，直接观看視頻
2.抖音知识分享复制Ci鏈接，打开Dou音搜索，直接观看視頻

Expectation

☝️ The correct output is 【抖音知识分享复制Ci鏈接，打开Dou音搜索，直接观看視頻】.

Please star the repository (the ⭐ in the upper-right corner).
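
One way to sidestep the boundary problem is to define a URL as a run of ASCII URL characters, so the match stops at the first Chinese character or full-width punctuation mark. The sketch below is not jionlp's implementation: `strip_urls` and `URL_PATTERN` are hypothetical names, and swallowing the whitespace around the URL is an assumption that suits Chinese text but would glue English words together.

```python
import re

# Hypothetical pattern, not jionlp's: a scheme followed by ASCII-only URL characters.
URL_PATTERN = re.compile(r"\s*https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+\s*")


def strip_urls(text: str) -> str:
    """Remove http(s) URLs together with the whitespace that surrounds them."""
    return URL_PATTERN.sub("", text)


sent1 = "抖音知识分享 https://v.douyin.com/RtKFFah/ 复制Ci鏈接，打开Dou音搜索，直接观看視頻"
sent2 = "抖音知识分享https://v.douyin.com/RtKFFah/复制Ci鏈接，打开Dou音搜索，直接观看視頻"
print(strip_urls(sent1))
print(strip_urls(sent2))
```

Because the character class contains no CJK characters, the URL ends where the Chinese text resumes, whether or not a space separates them.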

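Question about the keyphrase extraction example

As shown in the screenshot, when I extract keyphrases with the demo, why is the output not the following: ['俄罗斯克里姆林宫', '邀请金正恩访俄', '举行会谈', '朝方转交普京', '最高司令官金正恩']
Or do I need to adjust some parameter to get the output above?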

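parse_time: the default time.time() argument is not refreshed on each call

Description

Please describe your issue here
I'm a beginner and wrote a bot to remind myself of things, but I found that with relative times such as 明天 (tomorrow), 后天 (the day after tomorrow), 几秒后 (in a few seconds), or 几分钟后 (in a few minutes), the base time is always the moment the program started running,
not the time of the call (a reduced example follows). I'm not sure whether this is a bug or whether I'm using it incorrectly.

Version:

Python version: 3.9.12
jionlp version: 1.3.53

Code & Text: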

e.g.
import time
import jionlp
import re


def analyse(text: str):
    match_rule = r"(?P<time>(.*)?)(提醒我|[和对跟]我说)(?P<something>(.*))"
    result = re.match(pattern=match_rule, string=text)
    print(text)
    if result is not None:
        print(jionlp.parse_time(result.groupdict()['time']))
    else:
        print("匹配失败")
    print('*' * 50)
    time.sleep(2)


text_list = [
    "1秒后提醒我做吃饭",
    "1秒后提醒我做吃饭",
    "1秒后提醒我做吃饭",
    "1秒后提醒我做吃饭",
]

for text in text_list:
    analyse(text=text)

print(jionlp.__version__)

Log:

1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1.3.53

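Process finished with exit code 0

Expectation

Please describe your expectation
I expect every call to use the time of the call as the base time; in the example above, the seconds should be 12, 14, 16, and 18 respectively.

Please star the repository (the ⭐ in the upper-right corner) (Done. Much respect!)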
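
The log is consistent with a default argument that is computed only once, since Python evaluates default values when a function is defined rather than on each call. The sketch below reproduces that effect with a fake clock; `now`, `parse_eager`, and `parse_lazy` are hypothetical stand-ins, not jionlp code. As a workaround, passing `time_base=time.time()` explicitly on every `parse_time` call should give each call its own base time.

```python
import itertools

_ticks = itertools.count(12, 2)  # fake clock: 12, 14, 16, ...


def now() -> int:
    """Stand-in for time.time() that advances on every call."""
    return next(_ticks)


def parse_eager(text: str, time_base: int = now()) -> int:
    # The default was computed once, when this `def` statement ran.
    return time_base


def parse_lazy(text: str, time_base=None) -> int:
    # Resolve the default at call time instead.
    if time_base is None:
        time_base = now()
    return time_base


print([parse_eager("1秒后") for _ in range(3)])  # same value three times
print([parse_lazy("1秒后") for _ in range(3)])   # a fresh value on each call
```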

Please describe the bug or the function you expect

Function name:
Great Job!

Please input the text and code

# copy and paste here

Please input the bug info and traceback

Thank you for your work. I'm using the package now and hope to contribute in the future!

Support the x月x pattern in the time regex

Describe what the feature is for; related material may be provided
text2 = "1月1至2月10的天气真好"
res = extract_time(text2, with_parsing=True)
print(res)

The result is as follows:
[{'text': '1至2月', 'offset': [2, 6], 'type': 'time_span', 'detail': {'type': 'time_span', 'definition': 'accurate', 'time': ['2022-01-01 00:00:00', '2022-02-28 23:59:59']}}, {'text': '10的天', 'offset': [6, 10], 'type': 'time_delta', 'detail': {'type': 'time_delta', 'definition': 'accurate', 'time': {'day': 10.0}}}]

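Describe how you expect the feature to be implemented and the final effect
Add the pattern x月x to the time regex, without requiring a trailing 日 or 号.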
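
A minimal sketch of the requested pattern, assuming month and day are written with Arabic digits. `MONTH_DAY` and `find_month_days` are hypothetical names, and this is not the regex jionlp uses. The trailing 日 or 号 is optional, so both 2月10 and 2月10日 match.

```python
import re

# Month, then day, then an optional 日/号 suffix.
MONTH_DAY = re.compile(r"(\d{1,2})月(\d{1,2})(?:日|号)?")


def find_month_days(text: str) -> list:
    """Return (month, day) pairs for every x月x mention within valid ranges."""
    pairs = []
    for match in MONTH_DAY.finditer(text):
        month, day = int(match.group(1)), int(match.group(2))
        if 1 <= month <= 12 and 1 <= day <= 31:
            pairs.append((month, day))
    return pairs


print(find_month_days("1月1至2月10的天气真好"))
```

A pattern this permissive can collide with lunar-calendar readings of the same text, so a real parser would still need a rule for choosing between the two calendars.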

News location recognition does not run locally

jionlp version: 1.3.39
Log:

➜  JioNLP git:(master) ✗ python3.9 index.py
`jio.help()` is provided to search how to use jio functions.
Traceback (most recent call last):
  File "python/JioNLP/index.py", line 6, in <module>
    print(jio.recognize_location(text))
  File "python/JioNLP/jionlp/gadget/location_recognizer.py", line 381, in __call__
    self._prepare()
  File "python/JioNLP/jionlp/gadget/location_recognizer.py", line 111, in _prepare
    self.pkuseg = pkuseg.pkuseg(postag=True)
TypeError: __init__() got an unexpected keyword argument 'postag'

Code & Text:

import jionlp as jio
text = '海洋一号D星。中新网北京6月11日电(郭超凯)记者从**国家航天局获悉，6月11日2时31分，在牛家村，**在太原卫星发射中心用长征二号丙运载火箭成功发射海洋一号D星。该星将与海洋一号C星组成**首个海洋民用业务卫星星座。相比于美国，海洋一号D星是**第四颗海洋水色系列卫星，是国家民用空间基础设施规划的首批海洋业务卫星之一。'
res = jio.recognize_location(text)
print(res)

Expectation

The code should produce the same result as the documented example.

parse_time ignores the specified time type

Description

Please describe your issue here

Version:

Python version: 3.7.4
jionlp version: 1.3.53

Code & Text:

res = jio.parse_time('请修改每天18点的提醒', time_base=datetime.now(), time_type='time_point')
print(res)

Log:

None

Expectation
Expected return: {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-05-11 18:00:00', '2022-05-11 18:59:59'], 'string': '请修改18点的提醒'}
Actual return: {'type': 'time_period', 'definition': 'accurate', 'time': {'delta': {'day': 1}, 'point': {'time': ['2022-05-11 18:00:00', '2022-05-11 18:59:59'], 'string': '请修改18点的提醒'}}}

Please describe your expectation

I specified time_point as the time type to parse, but the result does not follow that setting.

Please star the repository (the ⭐ in the upper-right corner).

[BUG] Context analysis for “前两个月” (the previous two months)

Description

Please describe your issue here

Version:

Python version: 3.8.12
jionlp version: 1.3.34

Code & Text:

import jionlp as jio
import time

print(f't = {time.time()}')
res = jio.parse_time('查询销售部门前两个月的业绩', strict=False,)
print(res)

>>> t = 1639467272.000813
>>> {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-01-01 00:00:00', '2021-02-28 23:59:59']}

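Expectation

This query should be equivalent to “查询销售部门过去两个月的业绩” (query the sales department's performance over the past two months), but it currently seems to be parsed as “the first two months of the current year”. Context analysis should be added here to check whether a year was specified earlier in the sentence.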
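
If “前两个月” is read as the two full calendar months before the base time (an assumption; a rolling 60-day window is another plausible reading), the span can be computed as below. `previous_months_span` is a hypothetical helper, not part of jionlp. The base date of 2021-12-14 corresponds to the timestamp printed in the report.

```python
import calendar
from datetime import datetime


def previous_months_span(base: datetime, n: int):
    """Return (start, end) covering the n full calendar months before base."""
    index = base.year * 12 + (base.month - 1)  # running month index
    start_year, start_month0 = divmod(index - n, 12)
    end_year, end_month0 = divmod(index - 1, 12)
    start = datetime(start_year, start_month0 + 1, 1)
    last_day = calendar.monthrange(end_year, end_month0 + 1)[1]
    end = datetime(end_year, end_month0 + 1, last_day, 23, 59, 59)
    return start, end


start, end = previous_months_span(datetime(2021, 12, 14), 2)
print(start, end)
```

Working on a running month index keeps the year rollover correct, for example when the base date falls in January.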

Error when querying “上(下)季度” (last/next quarter)

ValueError: The given time string 上季度 is illegal.

Time semantic parsing cannot handle this case

When filing an issue, please provide the following information clearly; otherwise it cannot be answered!
Description

Please describe your issue here

jionlp version: v1.3.27
Log:
Code & Text:

09-01 20:01 至 12-01 18:07

Expectation

Start time: 2021-09-01 20:01:00
End time: 2021-12-01 18:07:00
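
For a fixed layout like this, a regex plus `datetime.strptime` is enough. The sketch assumes the MM-DD HH:MM format shown above and that the missing year comes from a caller-supplied base year (2021 here, to match the expected output). `parse_md_hm_span` is a hypothetical helper, not a jionlp function.

```python
import re
from datetime import datetime

SPAN = re.compile(r"(\d{2}-\d{2} \d{2}:\d{2})\s*至\s*(\d{2}-\d{2} \d{2}:\d{2})")


def parse_md_hm_span(text: str, year: int):
    """Parse 'MM-DD HH:MM 至 MM-DD HH:MM' into a (start, end) datetime pair."""
    match = SPAN.search(text)
    if match is None:
        return None
    return tuple(
        datetime.strptime(f"{year}-{part}", "%Y-%m-%d %H:%M")
        for part in match.groups()
    )


start, end = parse_md_hm_span("09-01 20:01 至 12-01 18:07", 2021)
print(start, end)
```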

{"code":11200,"message":"licc failed","sid":"its00082ab4@dx17569fccb5ba11d902"}

{"code":11200,"message":"licc failed","sid":"its00082ab4@dx17569fccb5ba11d902"} What does this error mean, and how can I fix it?

Data augmentation: homophone substitution bug

Please describe the bug or the function you expect

Function name:
dictionary_loader.py -> chinese_char_dictionary_loader

Please input the text and code

jio.homophone_substitution("北京市")

Please input the bug info and traceback

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/ks/vz0z2zk13hx0t6h_pgy1bpfh0000gn/T/jieba.cache
Loading model cost 0.602 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-703a6a7940fb>", line 1, in <module>
    runfile('/Users/mrx/Documents/work/lance/gov_nlp/repo/legal_instrument/corpus/augement.py', wdir='/Users/mrx/Documents/work/lance/gov_nlp/repo/legal_instrument/corpus')
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/mrx/Documents/work/lance/gov_nlp/repo/legal_instrument/corpus/augement.py", line 217, in <module>
    jio.homophone_substitution('北京市')
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/textaug/homophone_substitution.py", line 108, in __call__
    self._prepare(homo_ratio=homo_ratio, seed=seed)
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/textaug/homophone_substitution.py", line 68, in _prepare
    self._construct_word_pinyin_dict()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/textaug/homophone_substitution.py", line 80, in _construct_word_pinyin_dict
    word_pinyin = self.pinyin(word, formater='detail')
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/gadget/pinyin.py", line 164, in __call__
    self._prepare()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/gadget/pinyin.py", line 79, in _prepare
    self.pinyin_char = pinyin_char_loader()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/dictionary/dictionary_loader.py", line 424, in pinyin_char_loader
    char_dict = chinese_char_dictionary_loader()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/dictionary/dictionary_loader.py", line 245, in chinese_char_dictionary_loader
    assert len(segs) == 8
AssertionError

Version info:
jionlp==1.3.15

Other issues:
The word_distribution.zip file does not ship with its extracted text; it has to be unzipped manually.

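Money string extraction does not seem to support colloquial expressions well

Please describe the bug or the function you expect

For example, colloquial amounts such as 十块五 and 八块五毛钱 do not seem to be supported.
That said, recognizing these expressions could easily conflict with other measure words, and I'm not sure whether there is a suitable solution.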

Periodic time issue

Description

Please describe your issue here

Version:

Python version: 3.7
jionlp version: 1.3.41

Code & Text:

每周四三点和张三在徐家汇开会

Log:

Parsing fails and returns 'time': [None, None]


Expectation

Expected: an accurate 'delta': {'day': 7}, 'point': {'time': [time point]}

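Time semantic parsing

When filing an issue, please provide the following information clearly; otherwise it cannot be answered!
Description

Please describe your issue here
大年初一 (the first day of the Lunar New Year) is parsed incorrectly.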

Handling of “每个工作日” (every workday) and “每个周末” (every weekend)

Description

Please describe your issue here

Version:

Python version: 3.8
jionlp version: 1.3.47

Code & Text:

每个周末九点
{
	"definition": "accurate",
	"time": [
		"2022-03-06 09:00:00",
		"2022-03-06 09:59:59"
	],
	"type": "time_point"
}

每个工作日九点
{
	"definition": "accurate",
	"time": [
		"2022-03-03 09:00:00",
		"2022-03-03 09:59:59"
	],
	"type": "time_point"
}

Log:

Expectation

For 每个工作日 and 每个周末, I expect a time period to be returned rather than a precise time point.

On second thought, this may be hard to solve. I wonder whether there is a better approach.

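Time parsing issue

When filing an issue, please provide the following information clearly; otherwise it cannot be answered!
Description

Please describe your issue here
1. With “明天上午8点到9点开会” (meeting from 8 to 9 tomorrow morning), the 9点 is parsed as 9 o'clock today.
2. With “下午3点开会，提前20分钟提醒我” (meeting at 3 pm, remind me 20 minutes early), the trailing 20分钟 is also parsed as 3 o'clock.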

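Why are the text replace/add/delete augmentations not word-level?

Text augmentation functions such as random_add_delete currently operate at the character level, so the output looks rather odd.

Times, person names, and similar content in the original text all get altered.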

Time parsing issue

When filing an issue, please provide the following information clearly; otherwise it cannot be answered!
Description

Please describe your issue here

jionlp version: xxxxxx (available via jionlp.__version__)
Log:

The 刻 (quarter-hour) unit in times is not recognized, e.g. 三点一刻 and 三点三刻

Code & Text:

今天下午三点一刻过来写作业: the output is “今天下午三点”

Expectation

今天下午三点一刻
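
Since one 刻 is 15 minutes, the unit can be handled with a small lookup. The sketch below covers only the "H点N刻" form with Chinese numerals for hours 1 to 12 and ignores period words such as 下午. `parse_hour_quarter` and its tables are hypothetical, not jionlp's code.

```python
import re

HOURS = {char: value for value, char in enumerate("一二三四五六七八九十", start=1)}
HOURS.update({"两": 2, "十一": 11, "十二": 12})
QUARTERS = {"一": 15, "二": 30, "两": 30, "三": 45}

# Two-character hours (十一, 十二) are tried before single characters.
PATTERN = re.compile(r"(十[一二]|[一二两三四五六七八九十])点([一二两三])刻")


def parse_hour_quarter(text: str):
    """Return (hour, minute) for the first 'H点N刻' in text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return HOURS[match.group(1)], QUARTERS[match.group(2)]


print(parse_hour_quarter("今天下午三点一刻过来写作业"))
print(parse_hour_quarter("三点三刻"))
```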

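Time parsing issue

昨天 (yesterday), 前天 (the day before yesterday), and 前15天 (the previous 15 days) all fail to parse.

The text has already been converted to UTF-8, but the error “the text is not legal” is always raised.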

“中午” and “中午的” produce different parsing results

Description

Please describe your issue here

Version:

Python version: 3.7.4
jionlp version: 1.3.53

Code & Text:

import jionlp as jio
from datetime import datetime


res = jio.parse_time('中午的两点一刻定一个闹铃', time_base=datetime.now())
print(res)
res = jio.parse_time('中午两点一刻定一个闹铃', time_base=datetime.now())
print(res)

Log:

> {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-05-13 02:15:00', '2022-05-13 02:15:59']}
> {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-05-13 14:15:00', '2022-05-13 14:15:59']}

Expectation

I expect the two results to be identical, but they differ.

Please star the repository (the ⭐ in the upper-right corner).

Starred ⭐, great work.

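Would it be possible to manually opt out of lunar calendar conversion?

Describe what the feature is for; related material may be provided
When using parse_time(), provide a parameter that disables lunar calendar conversion.

Is this feature meant to fix a project defect? If so, describe the existing defect
As mentioned in #43 and #48, lunar and solar dates are easily confused, yet the "X月X" form is actually quite common.
So if users are sure their text contains no lunar dates that need converting, I'd like an extra parameter when calling parse_time() that makes X月X be treated as a solar (Gregorian) date.

# Current behavior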
jio.parse_time("四月十三")

{'type': 'time_point',
 'definition': 'accurate',
 'time': ['2022-05-13 00:00:00', '2022-05-13 23:59:59']}

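Describe how you expect the feature to be implemented and the final effect

# Expected behavior
jio.parse_time("四月十三",  lunar_date=False) # lunar_date would default to True, leaving existing behavior unchanged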

{'type': 'time_point',
 'definition': 'accurate',
 'time': ['2022-04-13 00:00:00', '2022-04-13 23:59:59']}

Please star the repository (the ⭐ in the upper-right corner).
Starred. A really great package!

extract summary doesn't work

I tried the extractive summarization example; the returned value is identical to the input.

UnboundLocalError: local variable 'final_res' referenced before assignment

import jionlp as jio
jio.recognize_location('你好吗')
UnboundLocalError: local variable 'final_res' referenced before assignment

When the target text contains no address information, the call raises UnboundLocalError.

remove_url does not fully behave as expected

Input:
这是一个链接http://t.cn/zQaVuHD这是一个链接http://t.cn/zQaVuHD
这是一个链接http://t.cn/zQaVuHD这是一个链接http://t.cn/zQaVuHD这是一个链接

Expected:
这是一个链接这是一个链接
这是一个链接这是一个链接这是一个链接

Actual output:
这是一个链接这是一个链接http://t.cn/zqavuhd
这是一个链接这是一个链接这是一个链接

A link at the end of a sentence is not removed.

Extracting multiple time ranges

text = '周一到周三早上九点到晚上十点的日程'
res1 = jio.ner.extract_time(text, time_base=time.time())
res1：[{'text': '周三早上九点到晚上十点', 'offset': [3, 14], 'type': 'time_span', 'detail': {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-10-13 09:00:00', '2021-10-13 22:00:00']}}]

Description: only one date is extracted from the date range
Expectation: multiple dates should be extracted

Neither “4月23” nor “4月23之后” is parsed correctly

Description

Please describe your issue here

Version:
python: 3.9.12
jionlp: 1.3.53
Code & Text:
jio.ner.extract_time("4月23之后", time_base=time.time(), with_parsing=True)
jio.parse_time("4月23之后", ret_future=True, time_base=time.time())
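Log:
The text is only parsed as April, not down to the 23rd. The same happens with 5点之后: it is only parsed as 5 o'clock and the keyword 之后 (after) is lost. 之前 (before) does not have this problem, but the resulting range includes 5:00:00~5:59:59, whereas it should exclude that hour.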

Installation fails on Python 3.9.10

Python version: 3.9.10
pip version: 22.0.4
Operating system: Windows 10
Package index: Tsinghua (https://pypi.tuna.tsinghua.edu.cn/simple)

Error log:
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting jionlp
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/54/81/72112e67f4de08db3b701e36f69318c79540f67916fc6ab26c91995725fd/jionlp-1.3.47-py2.py3-none-any.whl (19.0 MB)
Collecting pkuseg
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: jieba in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (0.42.1)
Requirement already satisfied: numpy in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (1.22.2)
Requirement already satisfied: requests in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (2.27.1)
Requirement already satisfied: zipfile36 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (0.1.3)
Requirement already satisfied: cython in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from pkuseg->jionlp) (0.29.28)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (1.26.8)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (2.0.12)
Using legacy 'setup.py install' for pkuseg, since package 'wheel' is not installed.
Installing collected packages: pkuseg, jionlp
Running setup.py install for pkuseg: started
Running setup.py install for pkuseg: finished with status 'error'
error: subprocess-exited-with-error

Running setup.py install for pkuseg did not run successfully.
exit code: 1

[63 lines of output]
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.9
creating build\lib.win-amd64-3.9\pkuseg
copying pkuseg\config.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\data.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\download.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\gradient.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\model.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\optimizer.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\res_summarize.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\scorer.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\trainer.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\__init__.py -> build\lib.win-amd64-3.9\pkuseg
creating build\lib.win-amd64-3.9\pkuseg\dicts
copying pkuseg\dicts\__init__.py -> build\lib.win-amd64-3.9\pkuseg\dicts
creating build\lib.win-amd64-3.9\pkuseg\models
copying pkuseg\models\__init__.py -> build\lib.win-amd64-3.9\pkuseg\models
creating build\lib.win-amd64-3.9\pkuseg\postag
copying pkuseg\postag\model.py -> build\lib.win-amd64-3.9\pkuseg\postag
copying pkuseg\postag\__init__.py -> build\lib.win-amd64-3.9\pkuseg\postag
creating build\lib.win-amd64-3.9\pkuseg\models\default
copying pkuseg\models\default\__init__.py -> build\lib.win-amd64-3.9\pkuseg\models\default
copying pkuseg\feature_extractor.pyx -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\inference.pyx -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\dicts\default.pkl -> build\lib.win-amd64-3.9\pkuseg\dicts
copying pkuseg\postag\feature_extractor.pyx -> build\lib.win-amd64-3.9\pkuseg\postag
copying pkuseg\models\default\features.pkl -> build\lib.win-amd64-3.9\pkuseg\models\default
copying pkuseg\models\default\weights.npz -> build\lib.win-amd64-3.9\pkuseg\models\default
running build_ext
skipping 'pkuseg\inference.cpp' Cython extension (up-to-date)
cythoning pkuseg/feature_extractor.pyx to pkuseg\feature_extractor.c
cythoning pkuseg/postag/feature_extractor.pyx to pkuseg/postag\feature_extractor.c
building 'pkuseg.inference' extension
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\pkuseg
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\JieBrother\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\include -IC:\Users\JieBrother\AppData\Local\Programs\Python\Python39\include -IC:\Users\JieBrother\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt /EHsc /Tppkuseg\inference.cpp /Fobuild\temp.win-amd64-3.9\Release\pkuseg\inference.obj
inference.cpp
c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
pkuseg\inference.cpp(3118): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
pkuseg\inference.cpp(4284): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
pkuseg\inference.cpp(4285): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
pkuseg\inference.cpp(5108): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
pkuseg\inference.cpp(6219): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
pkuseg\inference.cpp(6807): warning C4244: 'argument': conversion from 'Py_ssize_t' to 'int', possible loss of data
pkuseg\inference.cpp(23619): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(23624): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(23639): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(23652): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(24323): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
pkuseg\inference.cpp(24339): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
pkuseg\inference.cpp(26222): warning C4996: 'PyUnicode_FromUnicode': deprecated in 3.3
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/unicodeobject.h(551): note: see declaration of 'PyUnicode_FromUnicode'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

Encountered error while trying to install package.

pkuseg

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

entity2tag() has some limitations

entity2tag() could be improved. From the source code, tagging proceeds in offset order, so it goes wrong when the entities in ner_entities have out-of-order offsets.

For example:
before:
[{'text': '胡静静', 'offset': [0, 3], 'type': 'Person'},{'text': '水利局', 'offset': [4, 7], 'type': 'Orgnization'}]
after:
ner_entities =[{'text': '水利局', 'offset': [4, 7], 'type': 'Orgnization'},{'text': '胡静静', 'offset': [0, 3], 'type': 'Person'}]
The final result then becomes:
['O', 'O', 'O', 'O', 'B-Orgnization', 'I-Orgnization', 'E-Orgnization', 'O', 'O', 'O']
Thank you very much for the tool; it has helped me a lot.
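
Sorting the entities by start offset before a sequential tagging pass makes the result independent of input order. The sketch below is a standalone B/I/E/S/O tagger written for illustration, not jionlp's `entity2tag`. `entities_to_tags` is a hypothetical name, the sample sentence is invented to fit the offsets in the report, and skipping overlapping entities is an assumption.

```python
def entities_to_tags(text: str, entities: list) -> list:
    """Produce character-level B/I/E/S/O tags from entity offsets."""
    tags = []
    cursor = 0
    # Sort by start offset so the sequential pass no longer depends on input order.
    for entity in sorted(entities, key=lambda e: e["offset"][0]):
        start, end = entity["offset"]  # end is exclusive
        if start < cursor:
            continue  # overlaps an earlier entity: skipped in this sketch
        tags.extend(["O"] * (start - cursor))
        label, length = entity["type"], end - start
        if length == 1:
            tags.append(f"S-{label}")
        else:
            tags.append(f"B-{label}")
            tags.extend([f"I-{label}"] * (length - 2))
            tags.append(f"E-{label}")
        cursor = end
    tags.extend(["O"] * (len(text) - cursor))
    return tags


text = "胡静静在水利局工作。"  # invented sentence, 10 characters
shuffled = [
    {"text": "水利局", "offset": [4, 7], "type": "Orgnization"},
    {"text": "胡静静", "offset": [0, 3], "type": "Person"},
]
print(entities_to_tags(text, shuffled))
```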

Function name:
entity2tag()

Please input the text and code

# copy and paste here

Please input the bug info and traceback

Exception: Http请求失败，状态码：403，错误信息： {"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}

When using the iFLYTEK (讯飞) API to process multiple records, the following occurs:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/back_translation.py", line 164, in iter_api_by_language
tmp = mt_api(text, from_lang=chinese_lang, to_lang=foreign_lang)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 43, in wrapper
f = func(self, *args, **kargs)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 53, in wrapper
f = func(self, *args, **kargs)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 89, in wrapper
raise Exception(err)
Exception: Http请求失败，状态码：403，错误信息：
{"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}
2020-10-09 13:48:27 ERROR wrapper: Http请求失败，状态码：403，错误信息：
{"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}
Traceback (most recent call last):
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 70, in wrapper
f = func(self, *args, **kargs)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 670, in call
raise Exception(exception_string)
Exception: Http请求失败，状态码：403，错误信息：
{"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}

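Error in location_parser

When filing an issue, please provide the following information clearly; otherwise it cannot be answered!
Description
When using the address parsing function with the text “上门西湖区蒋村花园小区管局农贸市场高高兴兴”, it raises TypeError: sequence item 0: expected str instance, NoneType found

Line 279 of location_parser.py needs to be changed to key_name = ''.join( [str(prov), str(city), str(county)])
Resolved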
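
`str.join` raises this TypeError whenever an item is `None`. Wrapping each part in `str()` avoids the exception but puts the literal text "None" into the key, so substituting an empty string is an alternative worth considering. The sketch uses a hypothetical helper, `join_location_parts`, and is not the code in location_parser.py.

```python
def join_location_parts(*parts) -> str:
    """Concatenate location parts, treating None as an empty string."""
    return "".join(part or "" for part in parts)


prov, city, county = None, "杭州市", "西湖区"

try:
    "".join([prov, city, county])
except TypeError as err:
    print(err)  # join refuses the None item

print("".join([str(prov), str(city), str(county)]))  # contains the text 'None'
print(join_location_parts(prov, city, county))
```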

Could the code for the web version be open-sourced?

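Feedback on money amount extraction and parsing

Version:
Python 3.9
jionlp-py39 1.3.45

Problem description:
When extracting money amounts from text such as “2.2本计划投资3541.07万元2.3本项目……”, the extracted string is “3541.07万元2”, and the parsed result is:
{'num': '200000.20', 'case': '元', 'definition': 'accurate'}
In other words, when 万元 is preceded by a number and also followed by digits, the parsed result is wrong. It looks as if the preceding digits are summed: 3+5+4+1+0+7=20, giving 200000.2.

Result from the online version:

I mainly want to report the result I got in this case, in the hope that it helps future iterations. Thanks to the author :)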
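
One way to avoid swallowing the trailing section number is to end the match at the currency unit and scale the numeric part explicitly. The sketch assumes Arabic digits with an optional 万 or 亿 multiplier before 元. `extract_yuan`, `MONEY`, and `SCALE` are hypothetical and far narrower than jionlp's money parser.

```python
import re
from decimal import Decimal

# Number, optional multiplier, then the unit; nothing after 元 is consumed.
MONEY = re.compile(r"(\d+(?:\.\d+)?)(万|亿)?元")
SCALE = {None: Decimal(1), "万": Decimal(10_000), "亿": Decimal(100_000_000)}


def extract_yuan(text: str) -> list:
    """Return every amount found in the text, converted to yuan."""
    amounts = []
    for match in MONEY.finditer(text):
        number, unit = match.group(1), match.group(2)
        amounts.append(Decimal(number) * SCALE[unit])
    return amounts


print(extract_yuan("2.2本计划投资3541.07万元2.3本项目"))
```

`Decimal` is used rather than `float` so that amounts such as 3541.07 are scaled without binary rounding error.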

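Is there a way to install on Python > 3.9?

Installation still fails with the pkuseg error. Is there a way around it? The features I use don't involve word segmentation anyway, and my environment already has many packages installed, so rebuilding it would be inconvenient T T

Address parsing is inaccurate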

{
"province":"黑龙江省",
"city":"齐齐哈尔市",
"county":"龙江县",
"detail":"省富裕县富裕镇三社区五委2组",
"full_location":"黑龙江省齐齐哈尔市龙江县省富裕县富裕镇三社区五委2组",
"orig_location":"黑龙江省富裕县富裕镇三社区五委2组"
}

jio.summary.extract_summary raises an error

Traceback (most recent call last):
File "split_test.py", line 59, in
File "D:\software\anaconda\envs\torch13\lib\site-packages\jionlp\algorithm\summary\extract_summary.py", line 140, in call
sen_segs[3] = len([w for w in sen_segs_weights if w != 0]) / len(sen_segs_weights)
ZeroDivisionError: division by zero
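
The failing line divides by `len(sen_segs_weights)`, which is zero when a sentence yields no weighted segments. A guarded version of that ratio might look like the sketch below. `nonzero_ratio` is a hypothetical helper, and returning 0.0 for an empty list is an assumption about the desired behaviour.

```python
def nonzero_ratio(weights: list) -> float:
    """Share of non-zero weights; 0.0 when the list is empty."""
    if not weights:
        return 0.0
    return sum(1 for weight in weights if weight != 0) / len(weights)


print(nonzero_ratio([0.5, 0, 1.2, 0]))
print(nonzero_ratio([]))
```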

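Fuzzy matching in address recognition

Thanks to the author for such a great tool. I've run into a problem with address recognition: does the tool currently support fuzzy matching? For example, 北京*** is not recognized by the current tool; it only works when the text is as precise as 北京市. How should this be handled?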

ValueError: the given string `早` is illegal

Please describe the bug or the function you expect

Function name:
parse_time

Please input the text and code

(* Be sure to state exactly which piece of text caused the error! *)


# copy and paste here

```市政协十三届四次会议举行第二次全体会议|漳州新闻网讯(记者 张志鹏)1月3日下午，市政协十三届四次会议举行第二次全体会议，听取委员大会发言。10位政协委员分别代表有关专委会、**党派、人民团体作大会发言。市政协十三届四次会议执行主席张祯锦、柳建聪、黄井南、杨胜华、吴芳华、陈跃鸿、何伟燕、卢力、李扬真及值日常委在主席台就座。市领导邵玉龙、刘远、阮开森、张慧德、李珊珊、谢毅泰、张翼腾、兰万安、陈水树、吴卫红应邀出席会议并在主席台就座，在漳省政协委员、市直及驻漳有关单位领导、漳州异地商会和异地漳州商会会长等应邀列席会议，听取委员发言。10位委员就全市经济社会发展及民生改善等领域提出意见和建议。刘志明代表漳州市政协经济委员会发言，提出关于加快漳州智能制造发展的建议；王金泉代表九三学社漳州市委员会发言，提出关于进一步激活我市“夜间经济”的建议；蔡晓洁代表民盟漳州市委员会发言，提出打造龙头深化融合、推动职教高质量发展的建议；张建国代表农工党漳州市委员会发言，提出关于进一步推进我市县域紧密型医共体建设的建议；颜小燕代表民进漳州市委员会发言，提出持续优化营商环境、加快漳州台资工业发展的建议；苏美华代表漳州市政协教科卫体委员会发言，提出关于把**女排漳州体训基地建设成城市新形象标志性区域的建议；刘丽贞代表漳州市政协提案委员会发言，提出关于试行“时间银行”互助养老模式的建议；杨栋代表漳州市政协特邀(二)界发言，提出关于加强漳州古城运营与管理的几点建议；陆銮眉代表漳州市政协农业和农村委员会发言，提出关于扶持我市休闲食品制造产业的建议；陈婉儿代表共青团漳州市委员会发言，提出关于进一步推进漳台青年交流融合的建议。会上还有15个单位分别围绕优化钢铁产业布局推动高质量发展；发起“太平洋海上丝绸之路”联合申遗倡议；破解医养结合瓶颈；把握新机遇、提振民营企业发展信心；加快推进漳州儒学遗址保护与合理利用；推进0-3岁儿童早期发展项目工作；加强食品“三小”监管工作；推动我市小流域治理；推进厦门湾南岸交通环境建设；推动漳州工业高质量发展的路径与对策；落实好支持漳州民营企业发展政策增强实体经济竞争力；加强漳州市青少年科技教育；落实“两岸一家亲”理念、提升两岸婚姻家庭的服务水平；加强涉侨文物的保护与活化利用助力“一带一路”建设；推进文明城市“妈妈小屋”建设、保障妇女合法权益等方面作书面发言。 http://www.zznews.cn/news/system/2020/01/04/030187660.shtml|新闻|||20200104

## Please input the bug info and traceback


Traceback (most recent call last):
  File "D:/论文代码/yidong/test.py", line 32, in <module>
    time = jio.parse_time(line)
  File "D:\compiler\anaconda\lib\site-packages\jionlp\gadget\time_parser.py", line 605, in __call__
    _, second_full_time_handler, _, blur_time = self.parse_time_point(
  File "D:\compiler\anaconda\lib\site-packages\jionlp\gadget\time_parser.py", line 985, in parse_time_point
    cur_hms_func(cur_hms_string)
  File "D:\compiler\anaconda\lib\site-packages\jionlp\gadget\time_parser.py", line 3146, in normalize_blur_hour
    raise ValueError('the given string `{}` is illegal'.format(time_string))
ValueError: the given string `早` is illegal

Testing Chinese-character-to-pinyin conversion: no response for a long time; is this abnormal?

e:\anaconda3\lib\site-packages\jionlp\gadget\pinyin.py(62)_prepare()
-> self.trie_tree_obj = TrieTree()
(Pdb)

The back-translation APIs no longer work

import jionlp as jio
google_api = jio.GoogleApi()
baidu_api = jio.BaiduApi(
[{'appid': '',
'secretKey': ''}], gap_time=0.5
)
apis = [baidu_api,google_api]
back_trans = jio.BackTranslation(mt_apis=apis)
text = '饿了么凌晨发文将推出新功能，用户可选择是否愿意多等外卖员 5 分钟，你愿意多等这 5 分钟吗？'
print(baidu_api(text)) # make a single call through the API
result = back_trans(text)
print(result)
It raises the following error (I have hidden the appid and secret here):
`jio.help()` is provided to search how to use jio functions.
Traceback (most recent call last):
  File "E:/python/code/nlp/学术论文分类/LGB推特情感分析.py", line 123, in <module>
    print(baidu_api(text)) # 使用接口做单次调用
  File "E:\install\pythonEnv\lib\site-packages\jionlp\textaug\back_translation\translation_api.py", line 43, in wrapper
    from_lang = kargs['from_lang']
KeyError: 'from_lang'

AttributeError: partially initialized module 'jionlp' has no attribute 'recognize_location' (most likely due to a circular import)

After installing jionlp, I entered the Python interactive shell and import jionlp failed, as shown below:
Python 3, with all dependencies installed
AttributeError: partially initialized module 'jionlp' has no attribute 'recognize_location' (most likely due to a circular import)

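Compatibility with the help message printed to the console by default

Describe what the feature is for; related material may be provided
Provide a way to disable the default help message output, e.g. jio.disable_help()

Is this feature meant to fix a project defect? If so, describe the existing defect
As soon as the package is imported, the console always prints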

`jio.help()` is provided to search how to use jio functions.
Or browse `https://github.com/dongrixinyu/JioNLP` to get help.

These lines are printed every time, which makes it hard to chain a command-line invocation with other non-Python programs.
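
Until such a switch exists, one workaround is to silence the standard streams while the import runs. This is a sketch under assumptions: the report does not say whether the banner goes to stdout or stderr, so both are redirected, and `quiet_import` is a hypothetical helper. If the package attaches a logging handler to the redirected stream during import, its later log output may be swallowed as well.

```python
import contextlib
import importlib
import io


def quiet_import(name: str):
    """Import a module while discarding anything it prints during import."""
    sink = io.StringIO()
    with contextlib.redirect_stdout(sink), contextlib.redirect_stderr(sink):
        return importlib.import_module(name)


# With jionlp installed the call would be: jio = quiet_import("jionlp")
json_module = quiet_import("json")  # stdlib module, so the sketch runs anywhere
print(json_module.dumps({"ok": True}))
```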

Is there a time entity extraction tool available?

A time entity extraction tool is missing.

[BUG] Time semantic analysis: handling of “和” (and)

Description

Please describe your issue here

Version:

Python version: 3.8
jionlp version: 1.3.47

Code & Text:

For “每周一9点和14点”, the result is as follows:
[
	{
		"detail": {
			"definition": "accurate",
			"time": {
				"delta": {
					"day": 7
				},
				"point": {
					"string": "周一9点",
					"time": [
						"2022-02-28 09:00:00",
						"2022-02-28 09:59:59"
					]
				}
			},
			"type": "time_period"
		},
		"offset": [
			0,
			5
		],
		"text": "每周一9点",
		"type": "time_period"
	},
	{
		"detail": {
			"definition": "accurate",
			"time": [
				"2022-03-05 14:00:00",
				"2022-03-05 14:59:59"
			],
			"type": "time_point"
		},
		"offset": [
			6,
			9
		],
		"text": "14点",
		"type": "time_point"
	}
]

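Log:

The parsing of “和14点” seems incorrect. Could this be fixed, or is there a workaround? Thanks!

Expectation

Return the correct parsing result
See: time semantic parsing, on handling the character 和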

WARNING add_node: `速食面` belongs to both `tra` and `sim`.

This warning appears every time.

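Preloading at initialization

When integrating this system into an online service, it would be best, for the sake of response time, to explicitly preload the modules that will be used as soon as the backend starts. How can this be done?
Something like jieba's jieba.initialize(), which loads the model directly so that later calls respond faster without reloading.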
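
Tracebacks in other reports on this page show jionlp callables invoking an internal `_prepare()` from `__call__`, so calling each needed function once at startup should trigger loading ahead of real traffic. The sketch illustrates the idea with a stand-in class. `LazyTool` and `warm_up` are hypothetical, and jionlp itself is not imported.

```python
class LazyTool:
    """Stand-in for a callable that loads its resources on first use."""

    def __init__(self):
        self.loaded = False

    def _prepare(self):
        self.loaded = True  # a real tool would load dictionaries or models here

    def __call__(self, text: str) -> str:
        if not self.loaded:
            self._prepare()
        return text


def warm_up(tools, sample: str = "预热") -> None:
    """Call every tool once so lazy loading happens at startup."""
    for tool in tools:
        tool(sample)


tools = [LazyTool(), LazyTool()]
warm_up(tools)
print(all(tool.loaded for tool in tools))
```

With jionlp this would mean calling, for example, `jio.parse_time` or `jio.recognize_location` once on a short sample during service startup.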

The regex in remove_exception_char has no effect

ASCII_EXCEPTION_PATTERN = '[^\x09-\x0d\x20-\x7e\xa0£¥©®°±×÷]'
UNICODE_EXCEPTION_PATTERN = '[^‐-”•…‰※℃℉Ⅰ-ⅹ①-⒛\u3000-】〔-〞㈠-㈩一-龥﹐-﹫！-～￠￡￥]'
EXCEPTION_PATTERN = ASCII_EXCEPTION_PATTERN[:-1] + UNICODE_EXCEPTION_PATTERN[2:]

print(EXCEPTION_PATTERN)

 -~ £¥©®°±×÷‐-”•…‰※℃℉Ⅰ-ⅹ①-⒛　-】〔-〞㈠-㈩一-龥﹐-﹫！-～￠￡￥]

Calling the method does not remove the exception characters from the text.

Exception in keyphrase extraction: a weight file is missing

[Errno 2] No such file or directory: 'E:\Anaconda3\Lib\site-packages\jionlp\algorithm\keyphrase\pos_combine_weights.json'
key_phrases_05topic: []

Error during back-translation

Please describe the bug or the function you expect

Function name:
jio.BackTranslation

Please input the text and code

# copy and paste here
国家卫生健康委今天5月25日通报5月24日024时31个省自治区直辖市和**生产建设兵团报告新增新冠肺炎确诊病例15例其中境外输入病例13例本土病例2例新增无症状感染者18例其中境外输入16例本土2例截至5月24日24时现有确诊病例319例截至5月24日各地累计报告接种新冠病毒疫苗527253万剂次

Please input the bug info and traceback

Traceback (most recent call last):
File "back_transformation_xwlb_data.py", line 41, in
back_transformation_xwlb_data(bt, data_path, save_path, gap)
File "back_transformation_xwlb_data.py", line 22, in back_transformation_xwlb_data
res = back_translation.back_translation(line)
File "/source/code/zhaoyhy/AI/src/DataAugment/back_translation.py", line 57, in back_translation
return self.back_translation_api(text)
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 115, in call
back_tran_result = self.filter_results(text, back_tran_result)
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 197, in filter_results
back_tran_results = [line for line in back_tran_results
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 198, in
if _length_filter(text, line)]
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 192, in _length_filter
if (orig_len / tran_len) < 1 / 3 or (orig_len / tran_len) > 3:
ZeroDivisionError: division by zero
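
The division fails when a back-translated line is empty, so that `tran_len` is 0. A guarded variant of the length check could reject empty results up front. The sketch mirrors the 1/3 to 3 ratio bounds visible in the traceback. `length_ratio_ok` is a hypothetical name, and treating an empty translation as a rejection is an assumption.

```python
def length_ratio_ok(original: str, translated: str) -> bool:
    """Keep a back-translation only if its length is within 1/3x to 3x of the original."""
    orig_len, tran_len = len(original), len(translated)
    if orig_len == 0 or tran_len == 0:
        return False  # nothing usable to compare
    ratio = orig_len / tran_len
    return 1 / 3 <= ratio <= 3


candidates = ["新增确诊病例15例", "", "新"]
print([c for c in candidates if length_ratio_ok("新增新冠肺炎确诊病例15例", c)])
```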
