Comments (8)
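Does the location of the dictionary file matter?
I ran the code below, but it still doesn't return the result I want, even though the words I need have already been added to the dictionary file.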
thu1 = thulac.thulac(user_dict="D:/python/text_preprocessing/dict.txt")
thu1.cut('我爱深度学习和机器学习', text=True)
Out[14]: '我_r 爱_v 深度_n 学习_v 和_c 机器_n 学习_v'
I can't tell what's going wrong. = =
from thulac-python.
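Hello, and thank you for supporting THULAC. How to define a user dictionary is already described in the README.
When you create the thulac object, pass the user dictionary in as a parameter.
thulac(user_dict=None, model_path=None, T2S=False, seg_only=False, filt=False, deli='_'): initializes the segmenter with custom settings
user_dict: sets the user dictionary; words from the user dictionary are tagged uw. One word per line, UTF-8 encoded.
T2S: default False; whether to convert the sentence from Traditional to Simplified Chinese
seg_only: default False; whether to perform segmentation only, without part-of-speech tagging
filt: default False; whether to use the filter to remove some words that carry little meaning, e.g. “可以”.
model_path: sets the folder containing the model files; default is models/
deli: default '_'; sets the delimiter between a word and its part-of-speech tag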
from thulac-python.
#coding:utf-8
import thulac
thu1 = thulac.thulac(seg_only=True, user_dict="mydict.txt")  # segmentation-only mode
a = thu1.cut("我爱北京***", text=True)
mydict.txt contains one word per line:
机器学习
数据挖掘
...
我爱北京***
from thulac-python.
OK, thanks.
from thulac-python.
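If the location were wrong, Python should raise a file-not-found error right away. Try
for line in open("D:/python/text_preprocessing/dict.txt"): print(line)
and check whether the contents look right.
If you still can't find the problem, it may come down to differences between the Windows and Linux/Mac environments.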
from thulac-python.
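The check suggested above can be made a little more systematic. The following is a minimal sketch, not part of THULAC: `inspect_user_dict` is a hypothetical helper that reads the file as raw bytes and reports the problems that could plausibly stop a dictionary entry from matching (missing file, invalid UTF-8, a leading byte-order mark, stray whitespace around a word). Whether any of these is the actual cause in this thread is an assumption.

```python
import os


def inspect_user_dict(path):
    """Return (entries, problems) for a one-word-per-line UTF-8 dictionary file.

    Hypothetical helper for debugging; it does not call THULAC.
    """
    problems = []
    if not os.path.isfile(path):
        return [], ["file not found: " + path]

    # Read bytes so the result does not depend on the platform's default encoding.
    with open(path, "rb") as f:
        raw = f.read()

    # A UTF-8 byte-order mark would otherwise become part of the first word.
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[3:]

    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [], problems + ["not valid UTF-8: %s" % exc]

    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        word = line.strip()
        if not word:
            continue  # skip blank lines
        if word != line:
            problems.append("line %d has leading/trailing whitespace" % number)
        entries.append(word)
    return entries, problems
```

Calling `inspect_user_dict("D:/python/text_preprocessing/dict.txt")` and printing both return values shows exactly which words were read and whether anything looks suspicious.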
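A question: if a word in the user dictionary contains a space, is there any way to have it segmented as a single unit? For example, "justin bieber是一个歌手" segmented into:
justin bieber, 是, 一个, 歌手。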
from thulac-python.
Spaces occur mostly in English text, so the current processing cannot achieve this yet. We will try to address it in the next version.
from thulac-python.
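Until the library handles this itself, one possible workaround is to cut the space-containing phrases out of the text before it reaches the segmenter. The sketch below is an assumption about how that could be done and is not a THULAC feature: `split_on_phrases` and `segment_with_phrases` are hypothetical names, and `segment` stands for any callable that takes a string and returns a list of words (for example, a thin wrapper around `thu1.cut`).

```python
import re


def split_on_phrases(text, phrases):
    """Split text into (chunk, is_phrase) pairs, keeping listed phrases whole."""
    if not phrases:
        return [(text, False)] if text else []
    # Longest first, so a longer phrase wins over a shorter one it starts with.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(p) for p in ordered) + ")")
    pieces = []
    # With one capturing group, re.split alternates non-match, match, non-match, ...
    for index, chunk in enumerate(pattern.split(text)):
        if chunk:
            pieces.append((chunk, index % 2 == 1))
    return pieces


def segment_with_phrases(text, phrases, segment):
    """Segment text, passing only the parts outside the listed phrases to `segment`."""
    words = []
    for chunk, is_phrase in split_on_phrases(text, phrases):
        if is_phrase:
            words.append(chunk)
        else:
            words.extend(segment(chunk))
    return words


# Stand-in segmenter for illustration only: splits into single characters.
result = segment_with_phrases("justin bieber是一个歌手", ["justin bieber"], list)
```

With a real segmenter in place of `list`, "justin bieber" stays a single token while the rest of the sentence is segmented as usual. The match is case-sensitive and purely textual, so it will also fire inside longer words.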
Hello, when I call thulac.thulac(user_dict=myDictFile) I get the encoding error below. I also tried setting the T2S parameter to True (the dict file is already 'utf-8'). How can I deal with this? Thanks!
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe6 in position 0: illegal multibyte sequence
from thulac-python.
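The 'cp950' in the traceback is the default text encoding on Traditional Chinese Windows, which suggests the dictionary was opened without an explicit encoding. That is an inference from the error message; the thread does not confirm it. The sketch below uses only the standard library and hypothetical sample words to show that a UTF-8 dictionary starts with the reported byte 0xe6 and reads back cleanly once the encoding is passed explicitly.

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Read one word per line, decoding explicitly instead of using the locale default."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Write a small UTF-8 dictionary to a temporary file (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
dict_path = os.path.join(tmp_dir, "dict.txt")
with open(dict_path, "w", encoding="utf-8") as f:
    f.write("機器學習\n資料探勘\n")

# The first byte of the file is 0xe6, the same byte named in the traceback.
with open(dict_path, "rb") as f:
    first_byte = f.read(1)

words = read_words(dict_path)
```

If the file is opened inside the library and cannot be changed from the calling code, running Python 3.7 or later in UTF-8 mode (`python -X utf8`, or the environment variable `PYTHONUTF8=1`) makes `open()` default to UTF-8 and may avoid the error. The T2S option converts characters after decoding, so it would not be expected to affect how the file is read.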