
thulac-python's Issues

Please document the installation method clearly

A humanities student dabbling in text mining here, so please bear with me.

The documentation only says "place thulac in your directory and import it via import thulac". After several tries I put the thulac folder in the Python installation's lib directory. Under Python 3.5 the import succeeded, but the commands shown on the homepage did not work; I then noticed thulac only supports Python 2, so I switched to Python 2.7. Copying the folder into lib the same way, the import now fails. IDLE reports:

import thulac

Traceback (most recent call last):
File "<pyshell#0>", line 1, in
import thulac
File "E:\Python27\lib\thulac__init__.py", line 4, in
from character.CBTaggingDecoder import CBTaggingDecoder
File "E:\Python27\lib\thulac\character\CBTaggingDecoder.py", line 10, in
import numpy
ImportError: No module named numpy

Environment: Windows 10 64-bit, Python 2.7.12 32-bit. (The traceback itself just says numpy is missing; installing numpy with pip should clear that particular error.)

Sentence boundary information

Hi,
When using cut_f to segment and POS-tag an input text, does the result preserve any information about sentence boundaries? Or do I need to split the text into sentences myself and then call cut on each sentence?
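
A minimal workaround sketch, assuming you split sentences yourself before calling cut (the regex here is my own illustration, not a thulac API):

    import re
    import thulac

    thu = thulac.thulac()
    text = u"今天天气很好。我们去公园吧!"
    # split after Chinese sentence-final punctuation, keeping the punctuation
    sentences = [s for s in re.split(u"(?<=[。!?])", text) if s]
    for sent in sentences:
        # each cut() call now corresponds to exactly one sentence
        print(thu.cut(sent, text=True))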

Segmentation returns byte strings

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

import thulac
thu = thulac.thulac()
Model loaded succeed
thu.cut('我们中出了一个叛徒')
[['\xe6\x88\x91\xe4\xbb\xac', 'r'], ['\xe4\xb8\xad', 'f'], ['\xe5\x87\xba', 'v'], ['\xe4\xba\x86', 'u'], ['\xe4\xb8\x80\xe4\xb8\xaa', 'm'], ['\xe5\x8f\x9b\xe5\xbe\x92', 'n']]

The Chinese text in the list returned by cut comes back as byte strings. Any ideas? Other people's output seems fine.
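
For what it's worth, under Python 2 the words come back as UTF-8-encoded str objects, which the interactive console displays as escaped bytes. A minimal sketch of decoding them (assuming the default UTF-8 coding):

    # decode each UTF-8 byte string to unicode so it displays as Chinese
    result = thu.cut('我们中出了一个叛徒')
    words = [[w.decode('utf-8'), tag] for w, tag in result]
    for w, tag in words:
        print(w + ' ' + tag)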

These two sentences segment poorly

  1. 来自全国的情侣们在海南猴岛海誓山盟,见证爱情。吴天军摄 背老婆,抢椰子

Result:

来自_v 全国_n 的_u 情侣_n 们_k 在_p 海南_ns 猴岛海誓_nz 山盟_n ,_w 见证_n 爱情_n 。_w 吴天军_np 摄_g 背_v 老婆_n ,_w 抢_v 椰子_

  2. 花甲之年的她与老伴在儿子和儿媳的陪同下,和亲家母亲家公共同乘船游览南湾猴岛的海上风光。

Result:

花甲之年_id 的_u 她_r 与_p 老伴_n 在_p 儿子_n 和_c 儿媳_n 的_u 陪同_v 下_f ,_w 和_p 亲家_v 母亲_n 家公_np 共同_d 乘船_v 游览_v 南湾猴岛_ns 的_u 海上_s 风光_n 。_w

jieba segments both correctly. I honestly picked these at random.

TypeError: reduce() of empty sequence with no initial value

import thulac
thu = thulac.thulac(seg_only=True)
def thu_segment(sent):
    '''
    :type sent: str
    :return:
    '''
    text = thu.cut(sent, text=True)
    return text  # the returned text is UTF-8-encoded
if __name__ == '__main__':
    sent = thu_segment(" ")  # full-width Chinese space
    print(sent)

I ran the above code and got the following error:
Traceback (most recent call last):
File "E:/python/text_classification/preprocess/segment.py", line 35, in
sent = thu_segment(" ")
File "E:/python/text_classification/preprocess/segment.py", line 20, in thu_segment
text = thu.cut(sent, text=True)
File "D:\Python27\lib\site-packages\thulac_init_.py", line 97, in cut
return self.__cutWithOutMethod(oiraw, self._cutline, text = text)
File "D:\Python27\lib\site-packages\thulac_init
.py", line 80, in __cutWithOutMethod
txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
I think adding an initial value to the reduce call would fix this. Thanks!
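
A minimal sketch of that suggested fix (a hypothetical patch to __cutWithOutMethod, not the released code):

    # with an initial value, reduce() returns '' for an empty cut result instead of raising
    txt += reduce(lambda x, y: x + ' ' + y, cut_method(line), '') + '\n'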

Command-line Chinese word segmentation

Your benchmark reports good speed on an i5 2.4 GHz CPU, so why does a 935 MB Chinese document (already UTF-8) still not finish after 40 minutes on an i7 4.2 GHz with 24 GB RAM? Is something wrong? Screenshots attached (2017-05-29 21-48-29, 2017-05-29 21-48-48).

Is a variable name wrong in __init__.py?

When running in POS-tagging mode, the last line of the cut function in __init__.py:

    def cut(self, oiraw):
        if(self.version == 2):
            oiraw = oiraw.decode(self.coding)
        if(self.useT2S):
            traw, poc_cands = self.preprocesser.clean(oiraw)
            raw = self.preprocesser.T2S(traw)
        else:
            raw, poc_cands = self.preprocesser.clean(oiraw)

        if(len(raw) > 0):
            if(self.seg_only):
                tmp, tagged = self.cws_tagging_decoder.segmentTag(raw, poc_cands)
                segged = self.cws_tagging_decoder.get_seg_result()
                # if(self.userDict is not None):
                    # self.userDict.adjustSeg(segged)
                if(self.useFilter):
                    self.myfilter.adjustSeg(segged)
                self.nsDict.adjustSeg(segged)
                self.idiomDict.adjustSeg(segged)
                
                return list(map(lambda x: self.encode(x), segged))
                
            else:
                tmp, tagged = self.tagging_decoder.segmentTag(raw, poc_cands)

                # if(self.userDict is not None):
                    # self.userDict.adjustTag(tagged)
                if(self.useFilter):
                    self.myfilter.adjustTag(tagged)
                self.nsDict.adjustTag(tagged)
                self.idiomDict.adjustTag(tagged)
                    
                return list(map(lambda x: self.encode(x), segged))  # <- should this variable be tagged?
UnboundLocalError: local variable 'segged' referenced before assignment
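
A one-line sketch of the implied fix (assuming the else branch is meant to return the tagged result, since segged is only assigned in the seg_only branch):

    # in the else (POS-tagging) branch of cut()
    return list(map(lambda x: self.encode(x), tagged))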

TypeError: reduce() of empty sequence with no initial value

File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/utils.py", line 249, in album2json_handle
   data = json.dumps(album.to_ml_json())
 File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/models.py", line 114, in to_ml_json
   "content": self.segment(self.album_desc, engine="thu"),
 File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/models.py", line 165, in segment
   words = splitor.run(text)
 File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/utils.py", line 167, in run
   sentence_splited = unicode(self.splitor.cut(sentence.encode("utf-8"), True), "utf-8")
 File "/Users/xiaotaop/Documents/pyenvs/dj18/lib/python2.7/site-packages/thulac/__init__.py", line 97, in cut
   return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
 File "/Users/xiaotaop/Documents/pyenvs/dj18/lib/python2.7/site-packages/thulac/__init__.py", line 80, in __cutWithOutMethod
   txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
TypeError: reduce() of empty sequence with no initial value

I checked; the text in question is:

游戏名:恶灵附身 千万不要跳着看噢精彩就在一瞬间 如果你笑了,记得帮我点个赞并且分享给你的小伙伴们,让我们一起把欢乐传递下去!不要忘了订阅更多欢乐等着你!

The code:

class THUSplit(object):

    def __init__(self):
        self.splitor = thulac.thulac()

    def run(self, sentence):
        """
        :param sentence: unicode string
        :return: array of {"word": string, "flag": string}
        """
        data = []
        sentence_splited = unicode(self.splitor.cut(sentence.encode("utf-8"), True), "utf-8")
        entries = sentence_splited.split(" ")
        for entry in entries:
            tmp = entry.split("_")
            word = tmp[0]
            flag = tmp[1]
            data.append({"word": word, "flag": flag})

        return data

Segmentation crashes when the input contains "//"

Hello, segmentation crashes whenever the input contains "//", a symbol common in URLs, even though the web demo segments it fine.

The error:
content = "我们很//开心"

text = thu1.cut(content, text=True) #进行一句话分词
File "C:\Anaconda2\lib\site-packages\thulac_init_.py", line 78, in cut
txt += reduce(lambda x, y: x + ' ' + y, self.cutline(line)) + '\n'
File "C:\Anaconda2\lib\site-packages\thulac_init_.py", line 133, in cutline
self.punctuation.adjustTag(tagged)
File "C:\Anaconda2\lib\site-packages\thulac\manage\Punctuation.py", line 39, in adjustTag
tmp = sentence[i][0]
IndexError: list index out of range
[Finished in 4.9s with exit code 1]

User dictionary is read with GBK encoding

Hi, when I tried to use a user-defined dictionary, I found the code opens the file without specifying an encoding, so Windows decodes it as GBK by default. Could the txt file be read explicitly as UTF-8 to avoid GBK decoding errors?

The system error message:

C:\Anaconda3\envs\nlp_py3\lib\site-packages\thulac\manage\Postprocesser.py in __init__(self, filename, tag, isTxt)
     11             lexicon = []
     12             f = open(filename, "r")
---> 13             for i, line in enumerate(f):
     14                 line = line.split()
     15                 lexicon.append([decode(line[0]), i])

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa1 in position 188: illegal multibyte sequence
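
A minimal sketch of the requested change to Postprocesser (a hypothetical patch; io.open takes an explicit encoding on both Python 2 and 3):

    import io

    # read the user dictionary as UTF-8 instead of the platform default (GBK on Windows)
    f = io.open(filename, "r", encoding="utf-8")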

MemoryError

When I run thu1 = thulac.thulac(), Python immediately takes about 3 GB of memory and eventually raises MemoryError. Is this simply because I don't have enough memory, or something else?
system: centos7_mini_x86_64_1708
python: 2.7.5

Can the training corpus for model 3 be shared?

The model is said to be trained jointly on multiple corpora (annotated text from several genres, the annotated People's Daily corpus, and so on).

Can that training corpus be shared?

IndexError: list index out of range in `adjustTag`

Hi,

thulac throws an error when processing certain texts:

Traceback (most recent call last):
  File "test_thulac.py", line 9, in <module>
    tokens = thul.cut(text)
  File "/usr/local/lib/python2.7/dist-packages/thulac/__init__.py", line 90, in cut
    array += (reduce(lambda x, y: x + [y.split(self.separator)], self.cutline(line), []))
  File "/usr/local/lib/python2.7/dist-packages/thulac/__init__.py", line 133, in cutline
    self.punctuation.adjustTag(tagged)
  File "/usr/local/lib/python2.7/dist-packages/thulac/manage/Punctuation.py", line 39, in adjustTag
    tmp = sentence[i][0]
IndexError: list index out of range

Sample files:
测试.txt
测试2.txt

`三十三天` is tagged i, but `三十五天` is not

Hi,

I ran into a problem today, shown below:

In [29]: t.cut('过了三十五天我们就去广州')
Out[29]:
[['过', 'u'],
 ['了', 'u'],
 ['三十五', 'm'],
 ['天', 'q'],
 ['我们', 'r'],
 ['就', 'd'],
 ['去', 'v'],
 ['广州', 'ns']]

In [30]:

In [30]: t.cut('过了三十三天我们就去广州')
Out[30]:
[['过', 'u'],
 ['了', 'u'],
 ['三十三天', 'i'],
 ['我们', 'r'],
 ['就', 'd'],
 ['去', 'v'],
 ['广州', 'ns']]

`三十五天` gets split apart, but `三十三天` does not (the i tag suggests 三十三天 is being matched as an entry in the idiom dictionary).

Slow segmentation with the default cut

I benchmarked segmentation of a medium-to-long text: over 1 s per call with all default parameters. seg_only is noticeably faster, roughly 20 ms on average, though that still seems too slow for industrial-scale segmentation.

Environment: macOS

Python 2.7.13 (default, Apr  4 2017, 08:47:57)
Type "copyright", "credits" or "license" for more information.

IPython 5.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import thulac

In [2]: import sys
   ...:
   ...: reload(sys)
   ...: sys.setdefaultencoding('utf-8')
   ...:

In [3]: t = thulac.thulac()
Model loaded succeed

In [4]: msg = u"""北京时间7月26日凌晨,在布达佩斯进行的2017年世界游泳锦标赛结束了男子100米仰泳决赛争夺,**选手徐嘉余以52秒44获得冠军!这是**选手首次在男子仰泳项目上获得世界大
   ...: 赛(奥运会和世界锦标赛)冠军!美国选手格雷维斯以52秒48获得亚军,两人差距仅仅0.04秒,世界纪录保持者、美国人默菲以52秒59获得第三。
   ...: 由于在刚刚结束的女子100米仰泳决赛上,加拿大选手马塞打破了世界纪录,男子100米仰泳的气氛也相应被掀了起来。为了给身体保暖,徐嘉余穿大衣入场,他如同国王一般,双手扬起接受全场
   ...: 欢呼。
   ...: 比赛开始!徐嘉余在这场比赛尽展绝对优势,他出水就领先!头50米徐嘉余用时25秒12排名第一!进入后半程争夺,徐嘉余继续全面领先!最终冲刺开始,嘉余不再象预赛和半决赛时那样多少放
   ...: 点,而是全力开冲。但两侧的美国选手格雷维斯和默菲对他形成了强力夹击,特别是格雷维斯,看上去都要追上了,但徐嘉余顽强的把优势保持到了终点,52秒44,徐嘉余获得冠军!"""

In [5]: %timeit t.cut(msg)
1 loop, best of 3: 1.09 s per loop

In [6]: t2 = thulac.thulac(seg_only=True)
Model loaded succeed

In [7]: %timeit t2.cut(msg)
10 loops, best of 3: 22.4 ms per loop

Chinese word segmentation granularity

"为了更好地生活" is segmented as "为了 更 好 地 生活".
That granularity is too fine; it should be "为了 更好 地 生活", which is what jieba produces. Could you explain?

Problem loading an external dictionary

I loaded an external dictionary and was told its format was wrong, so I changed the "r" in f = open(filename, "r") to "rb", but then a new problem appeared: File "D:\Python\lib\site-packages\thulac\manage\Postprocesser.py", line 20, in __init__
dm.makeDat(lexicon, 0)
File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 220, in makeDat
base = self.assign(0, children, True)
File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 196, in assign
base = self.alloc(offsets)
File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 159, in alloc
while (2 * (base + ord(offsets[size - 1])) >= self.datSize):
TypeError: ord() expected string of length 1, but int found
What is causing this? The external dictionary is a UTF-8 file formatted as follows (see the sketch after the sample entries):

罗氏婴儿配方粉 n
挂花大头菜 n
黄毛籽 n
青豆 n
儿童营养饼干 n
汤菜 n
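
One possible explanation, sketched under the assumption that this runs on Python 3: indexing or iterating a bytes object yields ints there, so opening the dictionary with "rb" feeds ints into code that expects one-character strings for ord():

    line = "青豆 n".encode("utf-8")  # roughly what a line looks like after open(filename, "rb")
    b = line[0]
    print(type(b))  # <class 'int'> on Python 3
    # ord(b)        # TypeError: ord() expected string of length 1, but int found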

A single character inside 《》 makes the whole sentence one token

When the title marks 《》 enclose only one character, the whole remaining sentence is segmented as a single token. The online demo shows part of the same problem. Is this caused by differences in the training corpus?

Model bundled with the Python package:

Sentence: “没有一部能跟以前的佳作相提并论,意义何在?《我》纪录片形式根本不适合他”
Result: 没有一部能跟以前的佳作相提并论,意义何在?《我》纪录片形式根本不适合他_r

Sentence: “没有一部能跟以前的佳作相提并论,意义何在?《我们》纪录片形式根本不适合他”
Result: 没有_v 一_m 部_q 能_v 跟_p 以前_f 的_u 佳作_n 相提并论_id ,_w 意义_n 何在_v ?_w 《_w 我们_r 》_w 纪录片_n 形式_n 根本_d 不_d 适合_v 他_r

Online demo:

Sentence: “挺喜欢听人讲《洞》的跳跃式剪辑,为什么要这么干”
Result: “挺_d 喜欢_v 听_v 人_n 讲_v 《_w 洞》的跳跃式剪辑,为什么要这么干_c

Sentence: “挺喜欢听人讲《黑洞》的跳跃式剪辑,为什么要这么干”
Result: “挺_d 喜欢_v 听_v 人_n 讲_v 《_w 黑洞_n 》_w 的_u 跳跃_v 式_k 剪辑_v ,_w 为什么_r 要_v 这么_r 干_v”

User dictionary size limit

Loading a custom dictionary kept raising IndexError: list index out of range; it only worked once I cut it from 30,000 entries down to 300.
Can I replace your dictionary entirely and use my own instead?

KeyError exception during segmentation

Running segmentation throws an exception from CBTaggingDecoder's get_seg_result; I don't know why. thulac (0.1.1). The details:

File "xxx/thulac/init.py", line 97, in cut
return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
File "xxx/thulac/init.py", line 90, in __cutWithOutMethod
array += (reduce(lambda x, y: x + [[y, '']], cut_method(line), []))
File "xxx/thulac/init.py", line 120, in __cutline
segged = self.__cws_tagging_decoder.get_seg_result()
File "xxx/thulac/character/CBTaggingDecoder.py", line 191, in get_seg_result
if((i == 0) or (self.labelInfo[self.result[i]][0] == '0') or (self.labelInfo[self.result[i]][0] == '3')):
KeyError: 24

The Python demo fails to run

Hello, I am trying to use the THULAC package from Python for a Chinese word segmentation task, but even the demo fails to run in the PyCharm IDE. Could you tell me what is going on?

Full error output:

/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/andychao/Downloads/THULAC-Python-master/demo.py
Traceback (most recent call last):
File "/Users/andychao/Downloads/THULAC-Python-master/demo.py", line 5, in
thu1 = thulac.thulac("-seg_only") #设置模式为行分词模式
File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/init.py", line 68, in init
self.cws_tagging_decoder.init((self.prefix+"cws_model.bin"), (self.prefix+"cws_dat.bin"),(self.prefix+"cws_label.txt"))
File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/character/CBTaggingDecoder.py", line 36, in init
self.model = CBModel(modelFile)
File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/character/CBModel.py", line 55, in init
inputfile = open(filename, "rb")
IOError: [Errno 2] No such file or directory: '/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/models/cws_model.bin'

Process finished with exit code 1

Why is local segmentation less accurate than the online demo?


“配置没得说 很好了!但是因为上一代note3的存在,如果note4超4800那性价比真的一般般!"
On the demo:
“配置_v 没_d 得_vm 说_v 很_d 好_a 了_u !_w 但是_c 因为_c 上一代_n note3_x 的_u 存在_v ,_w 如果_c note4_x 超_v 4800_m 那_r 性_k 价_n 比_p 真的_a 一般_a 般_n !_w”
note3 and note4 each remain a single token,
but local segmentation gives
“但是_c 因为_c 上一代_n note_x 3_m 的_u 存在_v ,_w 如果_c note_x 4_m 超_v 4800_m 那_r 性_k 价_n 比_p 真_a 的_u 一_d 般_a 般_n !_w”

Likewise 魅族 also gets split apart locally, and so do digit-letter combinations: "2.5D屏幕", for example, locally becomes
“加上_v 2_m ._w 5_m D_x 的_u 屏幕_n ,_w 整体_n 大气_n 。_w”
I am using the models from THULAC_pro_c++_v1, downloaded after going through the resource request form, yet the gap from the online demo is obvious: locally, all similar cases come out fully split apart. Is this a configuration problem on my side, or a model/version mismatch?
Please advise.

Can't find train_c

I want to use training mode, but the tutorial only says:
./train_c [-s separator] [-b bigram_threshold] [-i iteration] training_filename model_filename
where training_filename is the training set and the trained model is named model_filename.
I cannot find train_c anywhere.

API usage example fails: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 4: invalid continuation byte

Hi,
When running code example 1 of section 1.1 (API usage examples) from the homepage, I get:
Traceback (most recent call last):
File "test_thulac_POS.py", line 16, in
thu1.cut_f("input.txt", "output.txt")
File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/init.py", line 167, in cut_f
cutted = self.cut(oiraw, text = True)
File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/init.py", line 93, in cut
return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/init.py", line 76, in __cutWithOutMethod
txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/init.py", line 99, in __cutline
oiraw = decode(oiraw, coding = self.__coding)
File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/base/compatibility.py", line 4, in decode
return string.decode('utf-8')
File "/home/bdsirs/py2env/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 4: invalid continuation byte

I already added # encoding=utf-8 at the top of the .py file, but it still fails. Is there a solution? (Note that the coding declaration only affects the source file, not the data file being read.)
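
Byte 0xc4 is a common GBK lead byte, so a hedged guess is that input.txt itself is GBK/GB2312-encoded while thulac decodes data as UTF-8. A sketch of converting the data file first (file names taken from the example above):

    import codecs

    # re-encode the input file as UTF-8 before handing it to cut_f
    with codecs.open("input.txt", "r", "gbk") as f:
        content = f.read()
    with codecs.open("input_utf8.txt", "w", "utf-8") as f:
        f.write(content)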

Double-array trie implementation

The trie in thulac is implemented with a single array. I found the source very confusing to read, and related material is hard to find online, yet I am curious how thulac's trie actually works. Could you provide a document explaining it?
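
Not a document, but here is a generic sketch of double-array trie lookup (my own illustration of the standard technique, not thulac's exact Dat.py layout; storing the base and check values interleaved in one flat array would explain why it reads as a single array):

    # Generic double-array trie membership test.
    # Two logical arrays encode every transition: from state s on character code c,
    # the candidate next state is t = base[s] + c, valid only when check[t] == s.
    def dat_contains(base, check, word):
        s = 0  # root state
        for ch in word:
            t = base[s] + ord(ch)
            if t < 0 or t >= len(check) or check[t] != s:
                return False
            s = t
        # by a common convention, a stored word terminates with a transition on code 0
        t = base[s]
        return 0 <= t < len(check) and check[t] == s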

Python demo MemoryError

Hello,
Thanks for sharing! I came to try THULAC the moment the Python version was released, but running the code produced the following error.

Traceback (most recent call last):
  File "D:\Software\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "D:\Software\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:/Documents/Projects/TangPoetryAnalyzer/test.py", line 4, in <module>
    thu1 = thulac.thulac()  #默认模式
  File "D:\Software\Python27\lib\site-packages\thulac\__init__.py", line 58, in __init__
    self.__tagging_decoder.init((self.__prefix+"model_c_model.bin"),(self.__prefix+"model_c_dat.bin"),(self.__prefix+"model_c_label.txt"))
  File "D:\Software\Python27\lib\site-packages\thulac\character\CBTaggingDecoder.py", line 36, in init
    self.model = CBModel(modelFile)
  File "D:\Software\Python27\lib\site-packages\thulac\character\CBModel.py", line 58, in __init__
    self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
MemoryError

I looked, and the failing line is in CBModel.py's __init__:

self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)

My test code is test.py:

# -*- coding: utf-8 -*-
import thulac

thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京***", text=True)  # segment a single sentence
print(text)

My OS is Windows 10; I installed from the Windows command line with pip install thulac (Successfully installed thulac-0.1.1).
My IDE is PyCharm, but I get the same error when running from the command line.

Things I have tried:

  1. uninstalling and reinstalling
  2. running from the command line
  3. searching online for the error, without success

But none of these led to a solution, so I am filing this issue in the hope of getting your help. Thank you!

Cannot import thulac from PyCharm and other IDEs

Many thanks for creating this Chinese word segmentation package with built-in POS tagging. However, it seems thulac cannot be imported from within the PyCharm IDE, although it works in the terminal. I don't know why; others may want to test this.

About the `id` tag and the `demo` results

I noticed that segmenting "你好" with the demo [http://thulac.thunlp.org/demo]
produces:
你_r 好_a

while
thulac.thulac().cut("你好", text=True)
returns
你好_id

Two questions:

  1. What is id?
  2. Why does the demo result differ from the actual result?

Custom dictionary not taking effect

I added 杨幂 to the custom dictionary.
Sentence: 你喜欢杨幂吗 — segmentation result: 你_r 喜欢_v 杨幂吗_np
The result I want: 你_r 喜欢_v 杨幂_rm 吗_u
The name is only treated as a person name when the character or word after 杨幂 is one the system can recognize. Why does adding 杨幂 to the custom dictionary have no effect?

How do I use a user dictionary?

Hello, I saw you reply in the issues that user dictionaries are now usable, but I could not find the corresponding interface. Could you give a usage example? Thank you.
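
For reference, a sketch based on the user_dict parameter in the project README (one word per line, UTF-8 encoded; matched words are tagged uw). The dictionary file name is my own example:

    import thulac

    # load a user dictionary at construction time
    thu = thulac.thulac(user_dict="mydict.txt")
    print(thu.cut("你喜欢杨幂吗", text=True))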

MemoryError problem

With seg_only set to True, segmentation succeeds.
With it set to False, I get a MemoryError. What is going on?

thu1 = thulac.thulac() memory error

Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\thulac_init_.py", line 58, in init
self.__tagging_decoder.init((self.__prefix+"model_c_model.bin"),(self.__prefix+"model_c_dat.bin"),(self.__prefix+"model_c_label.txt"))
File "C:\Python27\lib\site-packages\thulac\character\CBTaggingDecoder.py", line 36, in init
self.model = CBModel(modelFile)
File "C:\Python27\lib\site-packages\thulac\character\CBModel.py", line 58, in init
self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
MemoryError

User-defined dictionaries in the Python version

It looks like the user-dictionary code is commented out. Does the current version not support user dictionaries? Will a new version? Many thanks.

Automatic removal of spaces during segmentation

Input:

而荔 波 肉又 丧 心 病 狂 的不肯悔改

Output:

而_c 荔_v 波_n 肉_n 又_d 丧心病狂_i 的_u 不_d 肯_v 悔改_v

The tool strips the spaces from the text before segmenting, but my current task requires the spaces to be preserved.
How can I configure segmentation so that spaces are kept?

range() is missing a len(obj), isn't it?

/Volumes/Transcend/Corpus/THULAC-Python/thulac/manage/Preprocesser.pyc in T2S(self, sentence)
    283     def T2S(self, sentence):
    284         newSentence = ""
--> 285         for i in range(sentence):
    286             newSentence += str(getT2S(sentence[i]))
    287         return newSentence

TypeError: range() integer end argument expected, got unicode.
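
A sketch of the implied one-character fix (iterate over character indices instead of passing the string itself to range):

    def T2S(self, sentence):
        newSentence = ""
        for i in range(len(sentence)):  # was: range(sentence)
            newSentence += str(getT2S(sentence[i]))
        return newSentence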
