thunlp / thulac-python
An Efficient Lexical Analyzer for Chinese
License: MIT License
In tag_only mode the tag is an empty string
-> In seg_only mode the tag is an empty string
A humanities student dabbling in text mining here, so please bear with me...
The README only says to "place thulac in your directory and reference it via import thulac". After a few tests I copied the thulac folder into the Python lib directory; under Python 3.5 the import succeeded, but the commands given on the homepage did not work. I then noticed that thulac only supports Python 2, so I switched to Python 2.7. Copying it into the lib folder the same way, the import now fails. IDLE reports:
>>> import thulac
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import thulac
  File "E:\Python27\lib\thulac\__init__.py", line 4, in <module>
    from character.CBTaggingDecoder import CBTaggingDecoder
  File "E:\Python27\lib\thulac\character\CBTaggingDecoder.py", line 10, in <module>
    import numpy
ImportError: No module named numpy
Environment: Windows 10 64-bit, Python 2.7.12 32-bit.
Hi,
When using cut_f to segment and POS-tag input text, does the result carry any information about sentence boundaries? Or do I need to split the text into sentences myself and then call cut on each sentence?
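In case it helps while waiting for an answer: a minimal user-side sketch of the second option, splitting into sentences first and segmenting each one. This is not an official thulac API for boundary recovery; the punctuation set in the regex is my own assumption.

# -*- coding: utf-8 -*-
import re
import thulac

thu = thulac.thulac(seg_only=True)

def cut_by_sentence(text):
    # Split on sentence-final punctuation but keep the delimiters,
    # then re-attach each delimiter to its sentence.
    parts = re.split(u'([。!?!?;;])', text)
    sentences = [head + tail for head, tail in zip(parts[0::2], parts[1::2])]
    if parts[-1]:  # trailing text with no final punctuation
        sentences.append(parts[-1])
    return [thu.cut(s, text=True) for s in sentences]

print(cut_by_sentence(u'今天天气很好。我们去公园吧!'))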
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import thulac
>>> thu = thulac.thulac()
Model loaded succeed
>>> thu.cut('我们中出了一个叛徒')
[['\xe6\x88\x91\xe4\xbb\xac', 'r'], ['\xe4\xb8\xad', 'f'], ['\xe5\x87\xba', 'v'], ['\xe4\xba\x86', 'u'], ['\xe4\xb8\x80\xe4\xb8\xaa', 'm'], ['\xe5\x8f\x9b\xe5\xbe\x92', 'n']]
The Chinese words in the list returned by cut come back as byte strings. How do I fix this? Everyone else seems to get readable output.
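For what it's worth, this looks like normal Python 2 behaviour rather than corruption: printing a list shows the repr of its non-ASCII str elements as \xe6... escapes. A minimal sketch, assuming Python 2, of decoding the result back to unicode for display:

# -*- coding: utf-8 -*-
import thulac

thu = thulac.thulac()
pairs = thu.cut('我们中出了一个叛徒')
# Each word is a UTF-8 byte string under Python 2; decode it for display.
for word, tag in pairs:
    print(u'%s\t%s' % (word.decode('utf-8'), tag))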
Result:
来自_v 全国_n 的_u 情侣_n 们_k 在_p 海南_ns 猴岛海誓_nz 山盟_n ,_w 见证_n 爱情_n 。_w 吴天军_np 摄_g 背_v 老婆_n ,_w 抢_v 椰子_
2.花甲之年的她与老伴在儿子和儿媳的陪同下,和亲家母亲家公共同乘船游览南湾猴岛的海上风光。
Result:
花甲之年_id 的_u 她_r 与_p 老伴_n 在_p 儿子_n 和_c 儿媳_n 的_u 陪同_v 下_f ,_w 和_p 亲家_v 母亲_n 家公_np 共同_d 乘船_v 游览_v 南湾猴岛_ns 的_u 海上_s 风光_n 。_w
jieba, however, segments these sentences correctly, and I really did pick them at random.
import thulac

thu = thulac.thulac(seg_only=True)

def thu_segment(sent):
    '''
    :type sent: str
    :return:
    '''
    text = thu.cut(sent, text=True)
    return text  # the returned text is UTF-8 encoded

if __name__ == '__main__':
    sent = thu_segment(" ")  # a full-width (Chinese) space
    print(sent)
I ran the above code and got the following error:
Traceback (most recent call last):
  File "E:/python/text_classification/preprocess/segment.py", line 35, in <module>
    sent = thu_segment(" ")
  File "E:/python/text_classification/preprocess/segment.py", line 20, in thu_segment
    text = thu.cut(sent, text=True)
  File "D:\Python27\lib\site-packages\thulac\__init__.py", line 97, in cut
    return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
  File "D:\Python27\lib\site-packages\thulac\__init__.py", line 80, in __cutWithOutMethod
    txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
TypeError: reduce() of empty sequence with no initial value
I think adding an initial value to that reduce call would fix the problem. Thanks!
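A minimal sketch of the suggested fix, mirroring the failing expression from the traceback (this is not a patch to thulac itself): with an initial value, reduce over an empty token list, such as a line holding only a full-width space, returns '' instead of raising.

# -*- coding: utf-8 -*-
from functools import reduce  # built-in on Python 2; imported on Python 3

def join_tokens(tokens):
    # The '' initial value makes the empty case safe; lstrip drops the
    # leading separator it introduces for non-empty lists.
    return reduce(lambda x, y: x + ' ' + y, tokens, '').lstrip()

print(repr(join_tokens([])))                     # '' instead of a TypeError
print(repr(join_tokens(['我们', '很', '开心'])))   # '我们 很 开心'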
When running in POS-tagging mode, the last line of the cut function in __init__.py:
def cut(self, oiraw):
    if(self.version == 2):
        oiraw = oiraw.decode(self.coding)
    if(self.useT2S):
        traw, poc_cands = self.preprocesser.clean(oiraw)
        raw = self.preprocesser.T2S(traw)
    else:
        raw, poc_cands = self.preprocesser.clean(oiraw)
    if(len(raw) > 0):
        if(self.seg_only):
            tmp, tagged = self.cws_tagging_decoder.segmentTag(raw, poc_cands)
            segged = self.cws_tagging_decoder.get_seg_result()
            # if(self.userDict is not None):
            #     self.userDict.adjustSeg(segged)
            if(self.useFilter):
                self.myfilter.adjustSeg(segged)
            self.nsDict.adjustSeg(segged)
            self.idiomDict.adjustSeg(segged)
            return list(map(lambda x: self.encode(x), segged))
        else:
            tmp, tagged = self.tagging_decoder.segmentTag(raw, poc_cands)
            # if(self.userDict is not None):
            #     self.userDict.adjustTag(tagged)
            if(self.useFilter):
                self.myfilter.adjustTag(tagged)
            self.nsDict.adjustTag(tagged)
            self.idiomDict.adjustTag(tagged)
            return list(map(lambda x: self.encode(x), segged))  # <- should this variable be 'tagged'?
UnboundLocalError: local variable 'segged' referenced before assignment
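A standalone illustration of why that error matches the reporter's reading (simplified stand-in code, not thulac's): segged is assigned only in the seg_only branch, so returning it from the tagging branch raises exactly this UnboundLocalError; returning tagged there avoids it.

# -*- coding: utf-8 -*-
def cut(seg_only):
    if seg_only:
        segged = ['我们', '很', '开心']
        return segged
    else:
        tagged = [('我们', 'r'), ('很', 'd'), ('开心', 'a')]
        return segged  # bug: 'segged' was never assigned here; should be 'tagged'

try:
    cut(seg_only=False)
except UnboundLocalError as e:
    print(e)  # local variable 'segged' referenced before assignment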
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/utils.py", line 249, in album2json_handle
data = json.dumps(album.to_ml_json())
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/models.py", line 114, in to_ml_json
"content": self.segment(self.album_desc, engine="thu"),
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/models.py", line 165, in segment
words = splitor.run(text)
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/utils.py", line 167, in run
sentence_splited = unicode(self.splitor.cut(sentence.encode("utf-8"), True), "utf-8")
File "/Users/xiaotaop/Documents/pyenvs/dj18/lib/python2.7/site-packages/thulac/__init__.py", line 97, in cut
return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
File "/Users/xiaotaop/Documents/pyenvs/dj18/lib/python2.7/site-packages/thulac/__init__.py", line 80, in __cutWithOutMethod
txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
TypeError: reduce() of empty sequence with no initial value
I checked; the text being processed was:
游戏名:恶灵附身 千万不要跳着看噢
精彩就在一瞬间如果你笑了,记得帮我点个赞并且分享给你的小伙伴们,让我们一起把欢乐传递下去!不要忘了订阅更多欢乐等着你!
The code:
class THUSplit(object):

    def __init__(self):
        self.splitor = thulac.thulac()

    def run(self, sentence):
        """
        :param sentence: unicode string
        :return: array of {"word": string, "flag": string}
        """
        data = []
        sentence_splited = unicode(self.splitor.cut(sentence.encode("utf-8"), True), "utf-8")
        entries = sentence_splited.split(" ")
        for entry in entries:
            tmp = entry.split("_")
            word = tmp[0]
            flag = tmp[1]
            data.append({"word": word, "flag": flag})
        return data
Has the -deli option been removed in the new version?
-deli delimiter    Set the delimiter between a word and its POS tag; the default is an underscore (_).
Hello: segmentation crashes when the text contains "//", which is common in URLs, but the web demo segments it fine.
The error:
content = "我们很//开心"
text = thu1.cut(content, text=True)  # segment one sentence

  File "C:\Anaconda2\lib\site-packages\thulac\__init__.py", line 78, in cut
    txt += reduce(lambda x, y: x + ' ' + y, self.__cutline(line)) + '\n'
  File "C:\Anaconda2\lib\site-packages\thulac\__init__.py", line 133, in __cutline
    self.punctuation.adjustTag(tagged)
  File "C:\Anaconda2\lib\site-packages\thulac\manage\Punctuation.py", line 39, in adjustTag
    tmp = sentence[i][0]
IndexError: list index out of range
[Finished in 4.9s with exit code 1]
As the title says.
Hi, when I tried to use a custom user dictionary, I found that the code opens the file without specifying an encoding, so on Windows it is decoded with GBK by default. Could the txt file be read explicitly as UTF-8, to avoid GBK decoding errors?
The system error message:
C:\Anaconda3\envs\nlp_py3\lib\site-packages\thulac\manage\Postprocesser.py in __init__(self, filename, tag, isTxt)
11 lexicon = []
12 f = open(filename, "r")
---> 13 for i, line in enumerate(f):
14 line = line.split()
15 lexicon.append([decode(line[0]), i])
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa1 in position 188: illegal multibyte sequence
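A minimal sketch of the requested change, mirroring the loop in Postprocesser.__init__ (not a patch to the installed package): pass an explicit encoding instead of relying on the platform default, which is GBK on Chinese Windows. io.open behaves the same on Python 2 and 3; 'user_dict.txt' is a hypothetical stand-in for the dictionary file.

import io

lexicon = []
with io.open('user_dict.txt', 'r', encoding='utf-8') as f:  # explicit UTF-8
    for i, line in enumerate(f):
        fields = line.split()
        if fields:
            lexicon.append([fields[0], i])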
When I run thu1 = thulac.thulac(), Python immediately grabs about 3 GB of memory and finally raises MemoryError. Is this just because I don't have enough RAM, or is something else wrong?
system:centos7_mini_x86_64_1708
python:2.7.5
"This model was obtained by joint training on multiple corpora (including annotated text from several genres and the annotated People's Daily corpus, among others)."
Could the training corpus for model 3 be shared?
Hi,
thulac throws an error when processing certain texts:
Traceback (most recent call last):
File "test_thulac.py", line 9, in <module>
tokens = thul.cut(text)
File "/usr/local/lib/python2.7/dist-packages/thulac/__init__.py", line 90, in cut
array += (reduce(lambda x, y: x + [y.split(self.separator)], self.cutline(line), []))
File "/usr/local/lib/python2.7/dist-packages/thulac/__init__.py", line 133, in cutline
self.punctuation.adjustTag(tagged)
File "/usr/local/lib/python2.7/dist-packages/thulac/manage/Punctuation.py", line 39, in adjustTag
tmp = sentence[i][0]
IndexError: list index out of range
Do you provide an interface for training on one's own corpus?
Hi,
I ran into a problem today, shown below.
In [29]: t.cut('过了三十五天我们就去广州')
Out[29]:
[['过', 'u'],
['了', 'u'],
['三十五', 'm'],
['天', 'q'],
['我们', 'r'],
['就', 'd'],
['去', 'v'],
['广州', 'ns']]
In [30]:
In [30]: t.cut('过了三十三天我们就去广州')
Out[30]:
[['过', 'u'],
['了', 'u'],
['三十三天', 'i'],
['我们', 'r'],
['就', 'd'],
['去', 'v'],
['广州', 'ns']]
So “三十五天” gets split apart, but “三十三天” does not.
I benchmarked segmentation on medium-length text: around 1+ seconds with all default parameters. seg_only is clearly faster, roughly 20 ms on average, but that still seems too slow for large-scale industrial segmentation.
Environment: macOS
Python 2.7.13 (default, Apr 4 2017, 08:47:57)
Type "copyright", "credits" or "license" for more information.
IPython 5.3.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import thulac
In [2]: import sys
...:
...: reload(sys)
...: sys.setdefaultencoding('utf-8')
...:
In [3]: t = thulac.thulac()
Model loaded succeed
In [4]: msg = u"""北京时间7月26日凌晨,在布达佩斯进行的2017年世界游泳锦标赛结束了男子100米仰泳决赛争夺,**选手徐嘉余以52秒44获得冠军!这是**选手首次在男子仰泳项目上获得世界大
...: 赛(奥运会和世界锦标赛)冠军!美国选手格雷维斯以52秒48获得亚军,两人差距仅仅0.04秒,世界纪录保持者、美国人默菲以52秒59获得第三。
...: 由于在刚刚结束的女子100米仰泳决赛上,加拿大选手马塞打破了世界纪录,男子100米仰泳的气氛也相应被掀了起来。为了给身体保暖,徐嘉余穿大衣入场,他如同国王一般,双手扬起接受全场
...: 欢呼。
...: 比赛开始!徐嘉余在这场比赛尽展绝对优势,他出水就领先!头50米徐嘉余用时25秒12排名第一!进入后半程争夺,徐嘉余继续全面领先!最终冲刺开始,嘉余不再象预赛和半决赛时那样多少放
...: 点,而是全力开冲。但两侧的美国选手格雷维斯和默菲对他形成了强力夹击,特别是格雷维斯,看上去都要追上了,但徐嘉余顽强的把优势保持到了终点,52秒44,徐嘉余获得冠军!"""
In [5]: %timeit t.cut(msg)
1 loop, best of 3: 1.09 s per loop
In [6]: t2 = thulac.thulac(seg_only=True)
Model loaded succeed
In [7]: %timeit t2.cut(msg)
10 loops, best of 3: 22.4 ms per loop
“为了更好地生活” is segmented as “为了 更 好 地 生活”.
That granularity is too fine; it should be “为了 更好 地 生活”, which is what jieba produces. Could you explain?
I loaded an external dictionary and was told the format was wrong, so I changed the "r" in f = open(filename, "r") to "rb", but that produced a new error:
  File "D:\Python\lib\site-packages\thulac\manage\Postprocesser.py", line 20, in __init__
    dm.makeDat(lexicon, 0)
  File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 220, in makeDat
    base = self.assign(0, children, True)
  File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 196, in assign
    base = self.alloc(offsets)
  File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 159, in alloc
    while (2 * (base + ord(offsets[size - 1])) >= self.datSize):
TypeError: ord() expected string of length 1, but int found
What is causing this? The external dictionary is a UTF-8 file in this format:
罗氏婴儿配方粉 n
挂花大头菜 n
黄毛籽 n
青豆 n
儿童营养饼干 n
汤菜 n
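A minimal illustration of where that TypeError comes from, assuming Python 3: a file opened with "rb" yields bytes, and indexing bytes returns an int, which ord() rejects; reading in text mode with an explicit UTF-8 encoding keeps entries as str. 'user_dict.txt' is a hypothetical stand-in for the dictionary file.

data = open('user_dict.txt', 'rb').read()
print(type(data[0]))   # <class 'int'>: ord(data[0]) would raise this TypeError
data = open('user_dict.txt', 'r', encoding='utf-8').read()
print(type(data[0]))   # <class 'str'>: ord(data[0]) works
print(ord(data[0]))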
After installing the Python version with pip, I enabled this option:
-filter    Use the filter to remove meaningless words, e.g. 可以.
thu1 = thulac.thulac(seg_only=True, filt=True)
However, it does not remove punctuation or stop words such as 的 from the output.
Is the Python version a wrapper around the C++ code, or a complete rewrite in Python?
I downloaded and compiled the latest libthulac.so, and in my tests fast_cut is far slower than cut. What could be the reason?
When the book-title marks 《》 contain only a single character, the rest of the sentence is merged into one token. The online demo also shows this problem in some cases. Is it caused by differences in the training corpus?
Sentence: “没有一部能跟以前的佳作相提并论,意义何在?《我》纪录片形式根本不适合他”
Result: 没有一部能跟以前的佳作相提并论,意义何在?《我》纪录片形式根本不适合他_r
Sentence: “没有一部能跟以前的佳作相提并论,意义何在?《我们》纪录片形式根本不适合他“
Result: 没有_v 一_m 部_q 能_v 跟_p 以前_f 的_u 佳作_n 相提并论_id ,_w 意义_n 何在_v ?_w 《_w 我们_r 》_w 纪录片_n 形式_n 根本_d 不_d 适合_v 他_r
Sentence: “挺喜欢听人讲《洞》的跳跃式剪辑,为什么要这么干“
Result: “挺_d 喜欢_v 听_v 人_n 讲_v 《_w 洞》的跳跃式剪辑,为什么要这么干_c”
Sentence: “挺喜欢听人讲《黑洞》的跳跃式剪辑,为什么要这么干“
Result: “挺_d 喜欢_v 听_v 人_n 讲_v 《_w 黑洞_n 》_w 的_u 跳跃_v 式_k 剪辑_v ,_w 为什么_r 要_v 这么_r 干_v”
Loading a custom dictionary kept raising IndexError: list index out of range; it only worked after I cut the dictionary from 30,000 entries down to 300.
Can I replace your dictionary entirely with my own?
Running segmentation throws an exception from CBTaggingDecoder's get_seg_result function, and I don't know why. Details (thulac 0.1.1):
File "xxx/thulac/init.py", line 97, in cut
return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
File "xxx/thulac/init.py", line 90, in __cutWithOutMethod
array += (reduce(lambda x, y: x + [[y, '']], cut_method(line), []))
File "xxx/thulac/init.py", line 120, in __cutline
segged = self.__cws_tagging_decoder.get_seg_result()
File "xxx/thulac/character/CBTaggingDecoder.py", line 191, in get_seg_result
if((i == 0) or (self.labelInfo[self.result[i]][0] == '0') or (self.labelInfo[self.result[i]][0] == '3')):
KeyError: 24
Hi,
I have recently been using NLTK and Stanford CoreNLP, and I noticed that Stanford's POS tagging standard is the Penn Chinese Treebank tagset, as follows
Hello, I am trying to import the THULAC package in Python for a Chinese word segmentation task, but even the demo fails to run in the PyCharm IDE. What could be going wrong?
The full error output:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/andychao/Downloads/THULAC-Python-master/demo.py
Traceback (most recent call last):
  File "/Users/andychao/Downloads/THULAC-Python-master/demo.py", line 5, in <module>
    thu1 = thulac.thulac("-seg_only")  # segmentation-only mode
  File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/__init__.py", line 68, in __init__
    self.__cws_tagging_decoder.init((self.__prefix+"cws_model.bin"), (self.__prefix+"cws_dat.bin"),(self.__prefix+"cws_label.txt"))
  File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/character/CBTaggingDecoder.py", line 36, in init
    self.model = CBModel(modelFile)
  File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/character/CBModel.py", line 55, in __init__
    inputfile = open(filename, "rb")
IOError: [Errno 2] No such file or directory: '/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/models/cws_model.bin'

Process finished with exit code 1
For example, with:
“配置没得说 很好了!但是因为上一代note3的存在,如果note4超4800那性价比真的一般般!"
the online demo gives:
“配置_v 没_d 得_vm 说_v 很_d 好_a 了_u !_w 但是_c 因为_c 上一代_n note3_x 的_u 存在_v ,_w 如果_c note4_x 超_v 4800_m 那_r 性_k 价_n 比_p 真的_a 一般_a 般_n !_w”
where note3 and note4 each stay one token,
but locally I get:
“但是_c 因为_c 上一代_n note_x 3_m 的_u 存在_v ,_w 如果_c note_x 4_m 超_v 4800_m 那_r 性_k 价_n 比_p 真_a 的_u 一_d 般_a 般_n !_w”
魅族 is likewise split apart locally, and digit-letter combinations such as 2.5D屏幕 come out locally as:
“加上_v 2_m ._w 5_m D_x 的_u 屏幕_n ,_w 整体_n 大气_n 。_w”
I am using the models from THULAC_pro_c++_v1, downloaded after submitting the resource application form,
but the results clearly differ from the online demo: locally, similar cases are almost all split into pieces. Is this a configuration problem on my side, or a model/version mismatch?
Please advise.
I want to use training mode, but the tutorial only says:
./train_c [-s separator] [-b bigram_threshold] [-i iteration] training_filename model_filename
which trains on training_filename and writes the resulting model to model_filename.
I cannot find train_c anywhere.
Hi,
While running code example 1 from section 1.1 (interface usage) on the project homepage, I get this error:
Traceback (most recent call last):
  File "test_thulac_POS.py", line 16, in <module>
    thu1.cut_f("input.txt", "output.txt")
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 167, in cut_f
    cutted = self.cut(oiraw, text = True)
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 93, in cut
    return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 76, in __cutWithOutMethod
    txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 99, in __cutline
    oiraw = decode(oiraw, coding = self.__coding)
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/base/compatibility.py", line 4, in decode
    return string.decode('utf-8')
  File "/home/bdsirs/py2env/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 4: invalid continuation byte
I already added # encoding=utf-8 at the top of the .py file, but the error persists. Is there any way to solve this?
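Byte 0xc4 is common in GBK-encoded text, so one workaround sketch, assuming input.txt is actually GBK rather than UTF-8 (my guess, not confirmed): re-encode the file once before calling cut_f. The filenames mirror the example above.

# -*- coding: utf-8 -*-
import io
import thulac

# Convert the input file from GBK (assumed) to the UTF-8 thulac expects.
with io.open('input.txt', 'r', encoding='gbk') as src, \
     io.open('input_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())

thu1 = thulac.thulac()
thu1.cut_f('input_utf8.txt', 'output.txt')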
The trie in thulac is implemented with a single flat array. I found this very confusing while reading the source code, and related material is hard to find online, but I am curious how thulac's trie actually works. Could you provide a document to read?
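Until such a document exists, here is a from-scratch sketch of the double-array trie idea (the classic Aoe scheme) that single-array tries like thulac's Dat class are based on; it is illustrative code for understanding, not thulac's actual implementation. State s moves on character code c to t = base[s] + c, and the move is valid iff check[t] == s.

# -*- coding: utf-8 -*-
from collections import deque

def build_dat(words, size=1 << 17):
    base = [0] * size
    check = [-1] * size     # -1 marks a free slot
    check[0] = 0            # the root lives in slot 0
    # First build an ordinary pointer trie: node -> {char code: child}.
    children = [{}]
    is_word = [False]
    for w in words:
        node = 0
        for code in (ord(ch) for ch in w):
            if code not in children[node]:
                children.append({})
                is_word.append(False)
                children[node][code] = len(children) - 1
            node = children[node][code]
        is_word[node] = True
    # Then map every trie node onto the flat arrays breadth-first.
    slot_of = {0: 0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        s = slot_of[node]
        codes = sorted(children[node])
        if not codes:
            continue
        b = 1
        while any(check[b + c] != -1 for c in codes):
            b += 1          # find a base where all child slots are free
        base[s] = b
        for c in codes:
            t = b + c
            check[t] = s    # recording the parent makes the move valid
            slot_of[children[node][c]] = t
            queue.append(children[node][c])
    terminal = {slot_of[n] for n, w in enumerate(is_word) if w}
    return base, check, terminal

def lookup(word, base, check, terminal):
    s = 0
    for code in (ord(ch) for ch in word):
        t = base[s] + code
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return s in terminal

base, check, terminal = build_dat([u'青豆', u'汤菜', u'汤'])
print(lookup(u'汤', base, check, terminal))    # True
print(lookup(u'汤菜', base, check, terminal))  # True
print(lookup(u'菜汤', base, check, terminal))  # False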
Hello,
Thank you for sharing this work! As soon as I learned THULAC had a Python version, I came to try it.
But when running the code, I hit the following error.
Traceback (most recent call last):
File "D:\Software\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "D:\Software\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:/Documents/Projects/TangPoetryAnalyzer/test.py", line 4, in <module>
thu1 = thulac.thulac() #默认模式
File "D:\Software\Python27\lib\site-packages\thulac\__init__.py", line 58, in __init__
self.__tagging_decoder.init((self.__prefix+"model_c_model.bin"),(self.__prefix+"model_c_dat.bin"),(self.__prefix+"model_c_label.txt"))
File "D:\Software\Python27\lib\site-packages\thulac\character\CBTaggingDecoder.py", line 36, in init
self.model = CBModel(modelFile)
File "D:\Software\Python27\lib\site-packages\thulac\character\CBModel.py", line 58, in __init__
self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
MemoryError
I checked: the failing line is in __init__ in CBModel.py:
self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
My test script is test.py:
# -*- coding: utf-8 -*-
import thulac
thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京***", text=True)  # segment one sentence
print(text)
My OS is Windows 10; I installed from the Windows command line with pip install thulac (Successfully installed thulac-0.1.1).
My IDE is PyCharm, but running from the command line gives the same error.
The approaches I have tried:
None of them solved the problem, so I am opening this issue in the hope of getting your help. Thank you!
Many thanks for creating this Chinese segmentation package with built-in POS tagging. However, thulac seems to be unimportable inside the PyCharm IDE even though it works in the terminal; I don't know why. Others may want to test this.
I observed that using the demo [http://thulac.thunlp.org/demo] to segment “你好”
produces:
你_r 好_a
while using
thulac.thulac().cut("你好", text=True)
produces:
你好_id
A couple of questions here:
what does the tag id mean?
I added 杨幂 to the custom dictionary.
Sentence: 你喜欢杨幂吗; segmentation result: 你_r 喜欢_v 杨幂吗_np
Desired result: 你_r 喜欢_v 杨幂_rm 吗_u
The system treats 杨幂 as a person name only when the character or word that follows it can be recognized. Why does adding 杨幂 to the custom dictionary have no effect?
Hello, I saw your reply in the issues saying that user dictionaries are now supported, but I cannot find the corresponding interface. Could you give a usage example? Thank you.
When I set seg_only to True, segmentation succeeds.
When it is False, I get a MemoryError. What is going on?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\thulac\__init__.py", line 58, in __init__
    self.__tagging_decoder.init((self.__prefix+"model_c_model.bin"),(self.__prefix+"model_c_dat.bin"),(self.__prefix+"model_c_label.txt"))
  File "C:\Python27\lib\site-packages\thulac\character\CBTaggingDecoder.py", line 36, in init
    self.model = CBModel(modelFile)
  File "C:\Python27\lib\site-packages\thulac\character\CBModel.py", line 58, in __init__
    self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
MemoryError
I see that the user-dictionary code is commented out; does the current version not support custom user dictionaries? Will a future version support them? Many thanks.
Windows 7 + Python 3.6.2: reading a UTF-8 dictionary file without specifying the encoding raises UnicodeDecodeError: 'gbk' codec can't decode byte …… illegal multibyte sequence.
Can an option be specified so that the model does not re-segment input like the following?
Input:
而荔 波 肉又 丧 心 病 狂 的不肯悔改
Output:
而_c 荔_v 波_n 肉_n 又_d 丧心病狂_i 的_u 不_d 肯_v 悔改_v
The tool strips the spaces from the text before segmenting, but my current task requires the spaces to be preserved.
How can I configure it to segment directly without removing the spaces?
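A user-level workaround sketch, assuming no built-in option exists for this: segment each space-separated chunk on its own and rejoin, so the input's spaces survive. This is wrapper code, not a thulac feature.

# -*- coding: utf-8 -*-
import thulac

thu = thulac.thulac(seg_only=True)

def cut_keep_spaces(text):
    # Segment each chunk between spaces separately; empty chunks
    # (from consecutive spaces) are passed through untouched.
    chunks = text.split(' ')
    return ' '.join(thu.cut(c, text=True) if c else '' for c in chunks)

print(cut_keep_spaces(u'而荔 波 肉又 丧 心 病 狂 的不肯悔改'))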
Could pip installation be supported?
Also, is Python 3 supported?
/Volumes/Transcend/Corpus/THULAC-Python/thulac/manage/Preprocesser.pyc in T2S(self, sentence)
283 def T2S(self, sentence):
284 newSentence = ""
--> 285 for i in range(sentence):
286 newSentence += str(getT2S(sentence[i]))
287 return newSentence
TypeError: range() integer end argument expected, got unicode.