thunlp / thulac-python
An Efficient Lexical Analyzer for Chinese
License: MIT License
In tag_only mode the tag is an empty string
-> In seg_only mode the tag is an empty string
A humanities student dabbling in text mining here, so please bear with me...
The README only says to "place thulac in your directory and reference it via import thulac". After a few tests I copied the thulac folder into the Python lib directory; under Python 3.5 the import succeeded, but the commands given on the homepage did not work. I then noticed that thulac only supports Python 2, so I switched to Python 2.7. Copying it into the lib folder the same way, the import now fails. IDLE reports:
>>> import thulac
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import thulac
  File "E:\Python27\lib\thulac\__init__.py", line 4, in <module>
    from character.CBTaggingDecoder import CBTaggingDecoder
  File "E:\Python27\lib\thulac\character\CBTaggingDecoder.py", line 10, in <module>
    import numpy
ImportError: No module named numpy
Environment: Windows 10 64-bit, Python 2.7.12 32-bit.
Hi,
When using cut_f to segment and POS-tag input text, does the result carry any information about sentence boundaries? Or do I need to split the text into sentences myself and then call cut on each sentence?
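In case it helps while waiting for an answer: a minimal user-side sketch of the second option, splitting into sentences first and segmenting each one. This is not an official thulac API for boundary recovery; the punctuation set in the regex is my own assumption.

# -*- coding: utf-8 -*-
import re
import thulac

thu = thulac.thulac(seg_only=True)

def cut_by_sentence(text):
    # Split on sentence-final punctuation but keep the delimiters,
    # then re-attach each delimiter to its sentence.
    parts = re.split(u'([。!?!?;;])', text)
    sentences = [head + tail for head, tail in zip(parts[0::2], parts[1::2])]
    if parts[-1]:  # trailing text with no final punctuation
        sentences.append(parts[-1])
    return [thu.cut(s, text=True) for s in sentences]

print(cut_by_sentence(u'今天天气很好。我们去公园吧!'))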
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import thulac
>>> thu = thulac.thulac()
Model loaded succeed
>>> thu.cut('我们中出了一个叛徒')
[['\xe6\x88\x91\xe4\xbb\xac', 'r'], ['\xe4\xb8\xad', 'f'], ['\xe5\x87\xba', 'v'], ['\xe4\xba\x86', 'u'], ['\xe4\xb8\x80\xe4\xb8\xaa', 'm'], ['\xe5\x8f\x9b\xe5\xbe\x92', 'n']]
The Chinese words in the list returned by cut come back as byte strings. How do I fix this? Everyone else seems to get readable output.
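For what it's worth, this looks like normal Python 2 behaviour rather than corruption: printing a list shows the repr of its non-ASCII str elements as \xe6... escapes. A minimal sketch, assuming Python 2, of decoding the result back to unicode for display:

# -*- coding: utf-8 -*-
import thulac

thu = thulac.thulac()
pairs = thu.cut('我们中出了一个叛徒')
# Each word is a UTF-8 byte string under Python 2; decode it for display.
for word, tag in pairs:
    print(u'%s\t%s' % (word.decode('utf-8'), tag))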
Result:
来自_v 全国_n 的_u 情侣_n 们_k 在_p 海南_ns 猴岛海誓_nz 山盟_n ,_w 见证_n 爱情_n 。_w 吴天军_np 摄_g 背_v 老婆_n ,_w 抢_v 椰子_
2.花甲之年的她与老伴在儿子和儿媳的陪同下,和亲家母亲家公共同乘船游览南湾猴岛的海上风光。
Result:
花甲之年_id 的_u 她_r 与_p 老伴_n 在_p 儿子_n 和_c 儿媳_n 的_u 陪同_v 下_f ,_w 和_p 亲家_v 母亲_n 家公_np 共同_d 乘船_v 游览_v 南湾猴岛_ns 的_u 海上_s 风光_n 。_w
jieba, however, segments these sentences correctly, and I really did pick them at random.
import thulac

thu = thulac.thulac(seg_only=True)

def thu_segment(sent):
    '''
    :type sent: str
    :return:
    '''
    text = thu.cut(sent, text=True)
    return text  # the returned text is UTF-8 encoded

if __name__ == '__main__':
    sent = thu_segment(" ")  # a full-width (Chinese) space
    print(sent)
I ran the above code and got the following error:
Traceback (most recent call last):
  File "E:/python/text_classification/preprocess/segment.py", line 35, in <module>
    sent = thu_segment(" ")
  File "E:/python/text_classification/preprocess/segment.py", line 20, in thu_segment
    text = thu.cut(sent, text=True)
  File "D:\Python27\lib\site-packages\thulac\__init__.py", line 97, in cut
    return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
  File "D:\Python27\lib\site-packages\thulac\__init__.py", line 80, in __cutWithOutMethod
    txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
TypeError: reduce() of empty sequence with no initial value
I think adding an initial value to that reduce call would fix the problem. Thanks!
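A minimal sketch of the suggested fix, mirroring the failing expression from the traceback (this is not a patch to thulac itself): with an initial value, reduce over an empty token list, such as a line holding only a full-width space, returns '' instead of raising.

# -*- coding: utf-8 -*-
from functools import reduce  # built-in on Python 2; imported on Python 3

def join_tokens(tokens):
    # The '' initial value makes the empty case safe; lstrip drops the
    # leading separator it introduces for non-empty lists.
    return reduce(lambda x, y: x + ' ' + y, tokens, '').lstrip()

print(repr(join_tokens([])))                     # '' instead of a TypeError
print(repr(join_tokens(['我们', '很', '开心'])))   # '我们 很 开心'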
When running in POS-tagging mode, the last line of the cut function in __init__.py:
def cut(self, oiraw):
    if(self.version == 2):
        oiraw = oiraw.decode(self.coding)
    if(self.useT2S):
        traw, poc_cands = self.preprocesser.clean(oiraw)
        raw = self.preprocesser.T2S(traw)
    else:
        raw, poc_cands = self.preprocesser.clean(oiraw)
    if(len(raw) > 0):
        if(self.seg_only):
            tmp, tagged = self.cws_tagging_decoder.segmentTag(raw, poc_cands)
            segged = self.cws_tagging_decoder.get_seg_result()
            # if(self.userDict is not None):
            #     self.userDict.adjustSeg(segged)
            if(self.useFilter):
                self.myfilter.adjustSeg(segged)
            self.nsDict.adjustSeg(segged)
            self.idiomDict.adjustSeg(segged)
            return list(map(lambda x: self.encode(x), segged))
        else:
            tmp, tagged = self.tagging_decoder.segmentTag(raw, poc_cands)
            # if(self.userDict is not None):
            #     self.userDict.adjustTag(tagged)
            if(self.useFilter):
                self.myfilter.adjustTag(tagged)
            self.nsDict.adjustTag(tagged)
            self.idiomDict.adjustTag(tagged)
            return list(map(lambda x: self.encode(x), segged))  # <- should this variable be 'tagged'?
UnboundLocalError: local variable 'segged' referenced before assignment
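A standalone illustration of why that error matches the reporter's reading (simplified stand-in code, not thulac's): segged is assigned only in the seg_only branch, so returning it from the tagging branch raises exactly this UnboundLocalError; returning tagged there avoids it.

# -*- coding: utf-8 -*-
def cut(seg_only):
    if seg_only:
        segged = ['我们', '很', '开心']
        return segged
    else:
        tagged = [('我们', 'r'), ('很', 'd'), ('开心', 'a')]
        return segged  # bug: 'segged' was never assigned here; should be 'tagged'

try:
    cut(seg_only=False)
except UnboundLocalError as e:
    print(e)  # local variable 'segged' referenced before assignment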
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/utils.py", line 249, in album2json_handle
data = json.dumps(album.to_ml_json())
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/models.py", line 114, in to_ml_json
"content": self.segment(self.album_desc, engine="thu"),
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/models.py", line 165, in segment
words = splitor.run(text)
File "/Users/xiaotaop/Documents/gitroom/recsys/similar_album/similar/similar/utils.py", line 167, in run
sentence_splited = unicode(self.splitor.cut(sentence.encode("utf-8"), True), "utf-8")
File "/Users/xiaotaop/Documents/pyenvs/dj18/lib/python2.7/site-packages/thulac/__init__.py", line 97, in cut
return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
File "/Users/xiaotaop/Documents/pyenvs/dj18/lib/python2.7/site-packages/thulac/__init__.py", line 80, in __cutWithOutMethod
txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
TypeError: reduce() of empty sequence with no initial value
I checked; the text being processed was:
游戏名:恶灵附身 千万不要跳着看噢
精彩就在一瞬间如果你笑了,记得帮我点个赞并且分享给你的小伙伴们,让我们一起把欢乐传递下去!不要忘了订阅更多欢乐等着你!
The code:
class THUSplit(object):

    def __init__(self):
        self.splitor = thulac.thulac()

    def run(self, sentence):
        """
        :param sentence: unicode string
        :return: array of {"word": string, "flag": string}
        """
        data = []
        sentence_splited = unicode(self.splitor.cut(sentence.encode("utf-8"), True), "utf-8")
        entries = sentence_splited.split(" ")
        for entry in entries:
            tmp = entry.split("_")
            word = tmp[0]
            flag = tmp[1]
            data.append({"word": word, "flag": flag})
        return data
Has the -deli option been removed in the new version?
-deli delimiter    Set the delimiter between a word and its POS tag; the default is an underscore (_).
Hello: segmentation crashes when the text contains "//", which is common in URLs, but the web demo segments it fine.
The error:
content = "我们很//开心"
text = thu1.cut(content, text=True)  # segment one sentence

  File "C:\Anaconda2\lib\site-packages\thulac\__init__.py", line 78, in cut
    txt += reduce(lambda x, y: x + ' ' + y, self.__cutline(line)) + '\n'
  File "C:\Anaconda2\lib\site-packages\thulac\__init__.py", line 133, in __cutline
    self.punctuation.adjustTag(tagged)
  File "C:\Anaconda2\lib\site-packages\thulac\manage\Punctuation.py", line 39, in adjustTag
    tmp = sentence[i][0]
IndexError: list index out of range
[Finished in 4.9s with exit code 1]
As the title says.
Hi, when I tried to use a custom user dictionary, I found that the code opens the file without specifying an encoding, so on Windows it is decoded with GBK by default. Could the txt file be read explicitly as UTF-8, to avoid GBK decoding errors?
The system error message:
C:\Anaconda3\envs\nlp_py3\lib\site-packages\thulac\manage\Postprocesser.py in __init__(self, filename, tag, isTxt)
11 lexicon = []
12 f = open(filename, "r")
---> 13 for i, line in enumerate(f):
14 line = line.split()
15 lexicon.append([decode(line[0]), i])
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa1 in position 188: illegal multibyte sequence
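A minimal sketch of the requested change, mirroring the loop in Postprocesser.__init__ (not a patch to the installed package): pass an explicit encoding instead of relying on the platform default, which is GBK on Chinese Windows. io.open behaves the same on Python 2 and 3; 'user_dict.txt' is a hypothetical stand-in for the dictionary file.

import io

lexicon = []
with io.open('user_dict.txt', 'r', encoding='utf-8') as f:  # explicit UTF-8
    for i, line in enumerate(f):
        fields = line.split()
        if fields:
            lexicon.append([fields[0], i])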
When I run thu1 = thulac.thulac(), Python immediately grabs about 3 GB of memory and finally raises MemoryError. Is this just because I don't have enough RAM, or is something else wrong?
system:centos7_mini_x86_64_1708
python:2.7.5
"This model was obtained by joint training on multiple corpora (including annotated text from several genres and the annotated People's Daily corpus, among others)."
Could the training corpus for model 3 be shared?
Hi,
thulac throws an error when processing certain texts:
Traceback (most recent call last):
File "test_thulac.py", line 9, in <module>
tokens = thul.cut(text)
File "/usr/local/lib/python2.7/dist-packages/thulac/__init__.py", line 90, in cut
array += (reduce(lambda x, y: x + [y.split(self.separator)], self.cutline(line), []))
File "/usr/local/lib/python2.7/dist-packages/thulac/__init__.py", line 133, in cutline
self.punctuation.adjustTag(tagged)
File "/usr/local/lib/python2.7/dist-packages/thulac/manage/Punctuation.py", line 39, in adjustTag
tmp = sentence[i][0]
IndexError: list index out of range
Do you provide an interface for training on one's own corpus?
Hi,
I ran into a problem today, shown below.
In [29]: t.cut('过了三十五天我们就去广州')
Out[29]:
[['过', 'u'],
['了', 'u'],
['三十五', 'm'],
['天', 'q'],
['我们', 'r'],
['就', 'd'],
['去', 'v'],
['广州', 'ns']]
In [30]:
In [30]: t.cut('过了三十三天我们就去广州')
Out[30]:
[['过', 'u'],
['了', 'u'],
['三十三天', 'i'],
['我们', 'r'],
['就', 'd'],
['去', 'v'],
['广州', 'ns']]
So “三十五天” gets split apart, but “三十三天” does not.
I benchmarked segmentation on medium-length text: around 1+ seconds with all default parameters. seg_only is clearly faster, roughly 20 ms on average, but that still seems too slow for large-scale industrial segmentation.
Environment: macOS
Python 2.7.13 (default, Apr 4 2017, 08:47:57)
Type "copyright", "credits" or "license" for more information.
IPython 5.3.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import thulac
In [2]: import sys
...:
...: reload(sys)
...: sys.setdefaultencoding('utf-8')
...:
In [3]: t = thulac.thulac()
Model loaded succeed
In [4]: msg = u"""北京时间7月26日凌晨,在布达佩斯进行的2017年世界游泳锦标赛结束了男子100米仰泳决赛争夺,**选手徐嘉余以52秒44获得冠军!这是**选手首次在男子仰泳项目上获得世界大
...: 赛(奥运会和世界锦标赛)冠军!美国选手格雷维斯以52秒48获得亚军,两人差距仅仅0.04秒,世界纪录保持者、美国人默菲以52秒59获得第三。
...: 由于在刚刚结束的女子100米仰泳决赛上,加拿大选手马塞打破了世界纪录,男子100米仰泳的气氛也相应被掀了起来。为了给身体保暖,徐嘉余穿大衣入场,他如同国王一般,双手扬起接受全场
...: 欢呼。
...: 比赛开始!徐嘉余在这场比赛尽展绝对优势,他出水就领先!头50米徐嘉余用时25秒12排名第一!进入后半程争夺,徐嘉余继续全面领先!最终冲刺开始,嘉余不再象预赛和半决赛时那样多少放
...: 点,而是全力开冲。但两侧的美国选手格雷维斯和默菲对他形成了强力夹击,特别是格雷维斯,看上去都要追上了,但徐嘉余顽强的把优势保持到了终点,52秒44,徐嘉余获得冠军!"""
In [5]: %timeit t.cut(msg)
1 loop, best of 3: 1.09 s per loop
In [6]: t2 = thulac.thulac(seg_only=True)
Model loaded succeed
In [7]: %timeit t2.cut(msg)
10 loops, best of 3: 22.4 ms per loop
“为了更好地生活” is segmented as “为了 更 好 地 生活”.
That granularity is too fine; it should be “为了 更好 地 生活”, which is what jieba produces. Could you explain?
I loaded an external dictionary and was told the format was wrong, so I changed the "r" in f = open(filename, "r") to "rb", but that produced a new error:
  File "D:\Python\lib\site-packages\thulac\manage\Postprocesser.py", line 20, in __init__
    dm.makeDat(lexicon, 0)
  File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 220, in makeDat
    base = self.assign(0, children, True)
  File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 196, in assign
    base = self.alloc(offsets)
  File "D:\Python\lib\site-packages\thulac\base\Dat.py", line 159, in alloc
    while (2 * (base + ord(offsets[size - 1])) >= self.datSize):
TypeError: ord() expected string of length 1, but int found
What is causing this? The external dictionary is a UTF-8 file in this format:
罗氏婴儿配方粉 n
挂花大头菜 n
黄毛籽 n
青豆 n
儿童营养饼干 n
汤菜 n
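A minimal illustration of where that TypeError comes from, assuming Python 3: a file opened with "rb" yields bytes, and indexing bytes returns an int, which ord() rejects; reading in text mode with an explicit UTF-8 encoding keeps entries as str. 'user_dict.txt' is a hypothetical stand-in for the dictionary file.

data = open('user_dict.txt', 'rb').read()
print(type(data[0]))   # <class 'int'>: ord(data[0]) would raise this TypeError
data = open('user_dict.txt', 'r', encoding='utf-8').read()
print(type(data[0]))   # <class 'str'>: ord(data[0]) works
print(ord(data[0]))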
After installing the Python version with pip, I enabled this option:
-filter    Use the filter to remove meaningless words, e.g. 可以.
thu1 = thulac.thulac(seg_only=True, filt=True)
However, it does not remove punctuation or stop words such as 的 from the output.
Is the Python version a wrapper around the C++ code, or a complete rewrite in Python?
I downloaded and compiled the latest libthulac.so, and in my tests fast_cut is far slower than cut. What could be the reason?
When the book-title marks 《》 contain only a single character, the rest of the sentence is merged into one token. The online demo also shows this problem in some cases. Is it caused by differences in the training corpus?
Sentence: “没有一部能跟以前的佳作相提并论,意义何在?《我》纪录片形式根本不适合他”
Result: 没有一部能跟以前的佳作相提并论,意义何在?《我》纪录片形式根本不适合他_r
Sentence: “没有一部能跟以前的佳作相提并论,意义何在?《我们》纪录片形式根本不适合他“
Result: 没有_v 一_m 部_q 能_v 跟_p 以前_f 的_u 佳作_n 相提并论_id ,_w 意义_n 何在_v ?_w 《_w 我们_r 》_w 纪录片_n 形式_n 根本_d 不_d 适合_v 他_r
Sentence: “挺喜欢听人讲《洞》的跳跃式剪辑,为什么要这么干“
Result: “挺_d 喜欢_v 听_v 人_n 讲_v 《_w 洞》的跳跃式剪辑,为什么要这么干_c”
Sentence: “挺喜欢听人讲《黑洞》的跳跃式剪辑,为什么要这么干“
Result: “挺_d 喜欢_v 听_v 人_n 讲_v 《_w 黑洞_n 》_w 的_u 跳跃_v 式_k 剪辑_v ,_w 为什么_r 要_v 这么_r 干_v”
Loading a custom dictionary kept raising IndexError: list index out of range; it only worked after I cut the dictionary from 30,000 entries down to 300.
Can I replace your dictionary entirely with my own?
Running segmentation throws an exception from CBTaggingDecoder's get_seg_result function, and I don't know why. Details (thulac 0.1.1):
File "xxx/thulac/init.py", line 97, in cut
return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
File "xxx/thulac/init.py", line 90, in __cutWithOutMethod
array += (reduce(lambda x, y: x + [[y, '']], cut_method(line), []))
File "xxx/thulac/init.py", line 120, in __cutline
segged = self.__cws_tagging_decoder.get_seg_result()
File "xxx/thulac/character/CBTaggingDecoder.py", line 191, in get_seg_result
if((i == 0) or (self.labelInfo[self.result[i]][0] == '0') or (self.labelInfo[self.result[i]][0] == '3')):
KeyError: 24
Hi,
I have recently been using NLTK and Stanford CoreNLP, and I noticed that Stanford's POS tagging standard is the Penn Chinese Treebank tagset, as follows
Hello, I am trying to import the THULAC package in Python for a Chinese word segmentation task, but even the demo fails to run in the PyCharm IDE. What could be going wrong?
The full error output:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/andychao/Downloads/THULAC-Python-master/demo.py
Traceback (most recent call last):
  File "/Users/andychao/Downloads/THULAC-Python-master/demo.py", line 5, in <module>
    thu1 = thulac.thulac("-seg_only")  # segmentation-only mode
  File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/__init__.py", line 68, in __init__
    self.__cws_tagging_decoder.init((self.__prefix+"cws_model.bin"), (self.__prefix+"cws_dat.bin"),(self.__prefix+"cws_label.txt"))
  File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/character/CBTaggingDecoder.py", line 36, in init
    self.model = CBModel(modelFile)
  File "/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/character/CBModel.py", line 55, in __init__
    inputfile = open(filename, "rb")
IOError: [Errno 2] No such file or directory: '/Users/andychao/Library/Python/2.7/lib/python/site-packages/thulac/models/cws_model.bin'

Process finished with exit code 1
For example, with:
“配置没得说 很好了!但是因为上一代note3的存在,如果note4超4800那性价比真的一般般!"
the online demo gives:
“配置_v 没_d 得_vm 说_v 很_d 好_a 了_u !_w 但是_c 因为_c 上一代_n note3_x 的_u 存在_v ,_w 如果_c note4_x 超_v 4800_m 那_r 性_k 价_n 比_p 真的_a 一般_a 般_n !_w”
where note3 and note4 each stay one token,
but locally I get:
“但是_c 因为_c 上一代_n note_x 3_m 的_u 存在_v ,_w 如果_c note_x 4_m 超_v 4800_m 那_r 性_k 价_n 比_p 真_a 的_u 一_d 般_a 般_n !_w”
魅族 is likewise split apart locally, and digit-letter combinations such as 2.5D屏幕 come out locally as:
“加上_v 2_m ._w 5_m D_x 的_u 屏幕_n ,_w 整体_n 大气_n 。_w”
I am using the models from THULAC_pro_c++_v1, downloaded after submitting the resource application form,
but the results clearly differ from the online demo: locally, similar cases are almost all split into pieces. Is this a configuration problem on my side, or a model/version mismatch?
Please advise.
I want to use training mode, but the tutorial only says:
./train_c [-s separator] [-b bigram_threshold] [-i iteration] training_filename model_filename
which trains on training_filename and writes the resulting model to model_filename.
I cannot find train_c anywhere.
Hi,
While running code example 1 from section 1.1 (interface usage) on the project homepage, I get this error:
Traceback (most recent call last):
  File "test_thulac_POS.py", line 16, in <module>
    thu1.cut_f("input.txt", "output.txt")
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 167, in cut_f
    cutted = self.cut(oiraw, text = True)
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 93, in cut
    return self.__cutWithOutMethod(oiraw, self.__cutline, text = text)
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 76, in __cutWithOutMethod
    txt += reduce(lambda x, y: x + ' ' + y, cut_method(line)) + '\n'
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/__init__.py", line 99, in __cutline
    oiraw = decode(oiraw, coding = self.__coding)
  File "/home/bdsirs/py2env/local/lib/python2.7/site-packages/thulac/base/compatibility.py", line 4, in decode
    return string.decode('utf-8')
  File "/home/bdsirs/py2env/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 4: invalid continuation byte
I already added # encoding=utf-8 at the top of the .py file, but the error persists. Is there any way to solve this?
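Byte 0xc4 is common in GBK-encoded text, so one workaround sketch, assuming input.txt is actually GBK rather than UTF-8 (my guess, not confirmed): re-encode the file once before calling cut_f. The filenames mirror the example above.

# -*- coding: utf-8 -*-
import io
import thulac

# Convert the input file from GBK (assumed) to the UTF-8 thulac expects.
with io.open('input.txt', 'r', encoding='gbk') as src, \
     io.open('input_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())

thu1 = thulac.thulac()
thu1.cut_f('input_utf8.txt', 'output.txt')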
The trie in thulac is implemented with a single flat array. I found this very confusing while reading the source code, and related material is hard to find online, but I am curious how thulac's trie actually works. Could you provide a document to read?
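Until such a document exists, here is a from-scratch sketch of the double-array trie idea (the classic Aoe scheme) that single-array tries like thulac's Dat class are based on; it is illustrative code for understanding, not thulac's actual implementation. State s moves on character code c to t = base[s] + c, and the move is valid iff check[t] == s.

# -*- coding: utf-8 -*-
from collections import deque

def build_dat(words, size=1 << 17):
    base = [0] * size
    check = [-1] * size     # -1 marks a free slot
    check[0] = 0            # the root lives in slot 0
    # First build an ordinary pointer trie: node -> {char code: child}.
    children = [{}]
    is_word = [False]
    for w in words:
        node = 0
        for code in (ord(ch) for ch in w):
            if code not in children[node]:
                children.append({})
                is_word.append(False)
                children[node][code] = len(children) - 1
            node = children[node][code]
        is_word[node] = True
    # Then map every trie node onto the flat arrays breadth-first.
    slot_of = {0: 0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        s = slot_of[node]
        codes = sorted(children[node])
        if not codes:
            continue
        b = 1
        while any(check[b + c] != -1 for c in codes):
            b += 1          # find a base where all child slots are free
        base[s] = b
        for c in codes:
            t = b + c
            check[t] = s    # recording the parent makes the move valid
            slot_of[children[node][c]] = t
            queue.append(children[node][c])
    terminal = {slot_of[n] for n, w in enumerate(is_word) if w}
    return base, check, terminal

def lookup(word, base, check, terminal):
    s = 0
    for code in (ord(ch) for ch in word):
        t = base[s] + code
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return s in terminal

base, check, terminal = build_dat([u'青豆', u'汤菜', u'汤'])
print(lookup(u'汤', base, check, terminal))    # True
print(lookup(u'汤菜', base, check, terminal))  # True
print(lookup(u'菜汤', base, check, terminal))  # False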
Hello,
Thank you for sharing this work! As soon as I learned THULAC had a Python version, I came to try it.
But when running the code, I hit the following error.
Traceback (most recent call last):
File "D:\Software\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "D:\Software\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:/Documents/Projects/TangPoetryAnalyzer/test.py", line 4, in <module>
thu1 = thulac.thulac() #默认模式
File "D:\Software\Python27\lib\site-packages\thulac\__init__.py", line 58, in __init__
self.__tagging_decoder.init((self.__prefix+"model_c_model.bin"),(self.__prefix+"model_c_dat.bin"),(self.__prefix+"model_c_label.txt"))
File "D:\Software\Python27\lib\site-packages\thulac\character\CBTaggingDecoder.py", line 36, in init
self.model = CBModel(modelFile)
File "D:\Software\Python27\lib\site-packages\thulac\character\CBModel.py", line 58, in __init__
self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
MemoryError
I checked: the failing line is in __init__ in CBModel.py:
self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
My test script is test.py:
# -*- coding: utf-8 -*-
import thulac
thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京***", text=True)  # segment one sentence
print(text)
My OS is Windows 10; I installed from the Windows command line with pip install thulac (Successfully installed thulac-0.1.1).
My IDE is PyCharm, but running from the command line gives the same error.
The approaches I have tried:
None of them solved the problem, so I am opening this issue in the hope of getting your help. Thank you!
Many thanks for creating this Chinese segmentation package with built-in POS tagging. However, thulac seems to be unimportable inside the PyCharm IDE even though it works in the terminal; I don't know why. Others may want to test this.
I observed that using the demo [http://thulac.thunlp.org/demo] to segment “你好”
produces:
你_r 好_a
while using
thulac.thulac().cut("你好", text=True)
produces:
你好_id
A couple of questions here:
what does the tag id mean?
I added 杨幂 to the custom dictionary.
Sentence: 你喜欢杨幂吗; segmentation result: 你_r 喜欢_v 杨幂吗_np
Desired result: 你_r 喜欢_v 杨幂_rm 吗_u
The system treats 杨幂 as a person name only when the character or word that follows it can be recognized. Why does adding 杨幂 to the custom dictionary have no effect?
Hello, I saw your reply in the issues saying that user dictionaries are now supported, but I cannot find the corresponding interface. Could you give a usage example? Thank you.
When I set seg_only to True, segmentation succeeds.
When it is False, I get a MemoryError. What is going on?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\thulac\__init__.py", line 58, in __init__
    self.__tagging_decoder.init((self.__prefix+"model_c_model.bin"),(self.__prefix+"model_c_dat.bin"),(self.__prefix+"model_c_label.txt"))
  File "C:\Python27\lib\site-packages\thulac\character\CBTaggingDecoder.py", line 36, in init
    self.model = CBModel(modelFile)
  File "C:\Python27\lib\site-packages\thulac\character\CBModel.py", line 58, in __init__
    self.fl_weights = struct.unpack("<"+str(self.f_size * self.l_size)+"i", temp)
MemoryError
I see that the user-dictionary code is commented out; does the current version not support custom user dictionaries? Will a future version support them? Many thanks.
Windows 7 + Python 3.6.2: reading a UTF-8 dictionary file without specifying the encoding raises UnicodeDecodeError: 'gbk' codec can't decode byte …… illegal multibyte sequence.
Can an option be specified so that the model does not re-segment input like the following?
Input:
而荔 波 肉又 丧 心 病 狂 的不肯悔改
Output:
而_c 荔_v 波_n 肉_n 又_d 丧心病狂_i 的_u 不_d 肯_v 悔改_v
The tool strips the spaces from the text before segmenting, but my current task requires the spaces to be preserved.
How can I configure it to segment directly without removing the spaces?
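A user-level workaround sketch, assuming no built-in option exists for this: segment each space-separated chunk on its own and rejoin, so the input's spaces survive. This is wrapper code, not a thulac feature.

# -*- coding: utf-8 -*-
import thulac

thu = thulac.thulac(seg_only=True)

def cut_keep_spaces(text):
    # Segment each chunk between spaces separately; empty chunks
    # (from consecutive spaces) are passed through untouched.
    chunks = text.split(' ')
    return ' '.join(thu.cut(c, text=True) if c else '' for c in chunks)

print(cut_keep_spaces(u'而荔 波 肉又 丧 心 病 狂 的不肯悔改'))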
Could pip installation be supported?
Also, is Python 3 supported?
/Volumes/Transcend/Corpus/THULAC-Python/thulac/manage/Preprocesser.pyc in T2S(self, sentence)
283 def T2S(self, sentence):
284 newSentence = ""
--> 285 for i in range(sentence):
286 newSentence += str(getT2S(sentence[i]))
287 return newSentence
TypeError: range() integer end argument expected, got unicode.