windsooon / cherry

text classification - no machine learning knowledge needed

License: MIT License

Python 100.00%

cherry's Issues

What is this setup.py for? I cannot install with pip install -U cherry

Checklist

I have verified that the issue exists against the master branch of cherry.
I have searched for similar issues in both open and closed tickets and cannot find a duplicate.
I have reduced the issue to the simplest possible case.

Steps to reproduce

Expected behavior

Actual behavior

Suggestion: add a custom sensitive-word dictionary to jieba

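Hello, this is a very good project.
The project uses jieba for word segmentation, but jieba's default dictionary has no sensitive-word category, so jieba does not recognize words such as "**" or "草泥马". These words are therefore never added to the token list, and their weights cannot be evaluated.

PS: When I installed cherry this afternoon, the installation succeeded, but importing it raised an error: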

File "/root/anaconda3/lib/python3.6/tkinter/__init__.py", line 36, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
ImportError: libX11.so.6: cannot open shared object file: No such file or directory

So I ran one more step: yum install tk-devel tkinter (CentOS 6)

cherry model does not achieve the desired effect in multithreading mode

Thanks to the author for the selfless work. I ran into some problems during text classification and would like to ask about them.
First, even with multithreading, the program's total execution time is about the same as with a single thread.
Second, when multithreading is enabled, the time the model needs for a single classification increases dramatically.
Third, if I replace the classification call with some other time-consuming operation (for example, #r = http_get("https://qq.com"), line 88 of the script file), multithreading works as expected.
Finally, when the number of threads is large, a single classification can take 100 seconds or more before the times start falling again.

#######################################
Multithreaded time test information(threadNum=50)

E:\cherry1\venv\Scripts\python.exe E:/time_test.py
Training may take some time depend on your dataset
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Seba\AppData\Local\Temp\jieba.cache
Loading model cost 1.323 seconds.
Prefix dict has been built succesfully.
数据库连接成功
模型识别耗时为：2.175182819366455
模型识别耗时为：2.6150031089782715
模型识别耗时为：3.3370752334594727
模型识别耗时为：3.415862798690796
模型识别耗时为：3.567445993423462
模型识别耗时为：4.146943807601929
模型识别耗时为：4.393248081207275
模型识别耗时为：4.393249988555908
模型识别耗时为：4.6884238719940186
模型识别耗时为：4.708403825759888
模型识别耗时为：4.698431730270386
模型识别耗时为：4.758283376693726
模型识别耗时为：4.966742515563965
模型识别耗时为：5.017577886581421
模型识别耗时为：5.155210971832275
模型识别耗时为：5.3357274532318115
模型识别耗时为：5.42249321937561
模型识别耗时为：5.414516925811768
模型识别耗时为：5.396564722061157
模型识别耗时为：5.449436664581299
进程创建成功
模型识别耗时为：21.53141474723816

Process finished with exit code 0

##################################
Single thread time test information

E:\cherry1\venv\Scripts\python.exe E:time_test.py
Training may take some time depend on your dataset
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Seba\AppData\Local\Temp\jieba.cache
Loading model cost 1.338 seconds.
Prefix dict has been built succesfully.
数据库连接成功
模型识别耗时为：0.26527881622314453
模型识别耗时为：0.29318737983703613
模型识别耗时为：0.26030683517456055
模型识别耗时为：0.26927995681762695
模型识别耗时为：0.26130032539367676
模型识别耗时为：0.26030540466308594
模型识别耗时为：0.26329588890075684
模型识别耗时为：0.2503325939178467
模型识别耗时为：0.26830220222473145
模型识别耗时为：0.2642951011657715
模型识别耗时为：0.2603034973144531
模型识别耗时为：0.2543506622314453
模型识别耗时为：0.265289306640625
模型识别耗时为：0.2533226013183594
模型识别耗时为：0.2962031364440918
模型识别耗时为：0.3091733455657959
模型识别耗时为：0.3101999759674072
模型识别耗时为：0.3031899929046631
模型识别耗时为：0.2622978687286377
模型识别耗时为：0.25830793380737305
进程创建成功
模型识别耗时为：21.028748989105225

Process finished with exit code 0

#################################
testing script

# coding:utf-8
# Team :
# Developer : qiying
# Created : 2019/9/18 13:24
# File name : time_test.py
# Tool :
import requests
import chardet
import time
from bs4 import BeautifulSoup
import sys
import pymysql
import queue
import importlib
import threading
import cherry
importlib.reload(sys)
import regex as regex
from langconv import *
from sklearn.naive_bayes import MultinomialNB




class MyThread(threading.Thread):
    def __init__(self, func):
        threading.Thread.__init__(self)
        self.func = func

    def run(self):
        self.func()


def http_get(str_url):
    ret = {'title': '', 'text': '', 'code': 200, 'header': ''}
    timeout = 3
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) ',
               'Referer': 'https://www.baidu.com',
               'X-FORWARDED-FOR': '216.239.53.53'

               }

    try:
        u = requests.get(str_url, headers=headers, timeout=timeout)  # allow_redirects=False
    except Exception as e:
        ret['code'] = 0
        return ret
    chardeter = chardet.detect(u.content)
    u.encoding = chardeter['encoding']
    ret['text'] = u.text
    ret['header'] = u.headers
    try:
        soup = BeautifulSoup(u.text, 'lxml')
        ret['title'] = soup.title.text
    except Exception as e:
        ret['title'] = ''

    return ret

def Traditional2Simplified(sentence):
    """
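    Convert the traditional Chinese characters in sentence to simplified ones.
    :param sentence: the sentence to convert
    :return: the sentence with traditional characters converted to simplified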
    """
    sentence = Converter('zh-hans').convert(sentence)
    return sentence

def get_cn(soup):
    #soup = get_soup(url)
    words = regex.findall('[\u4e00-\u9fa5]+', soup.get_text())
    return ''.join(words)


def worker():
    while True:
        try:
            item = q.get_nowait()  # fetch a task without blocking
        except queue.Empty:
            break  # queue drained; avoids the empty()/get() race between threads
        if item[4] is None:
            continue
        soup = BeautifulSoup(item[4], "html5lib")
        chinese_text = Traditional2Simplified(get_cn(soup))
        list1 = [chinese_text]
        if len(chinese_text) > 0:
            try:
                cherry_start_time  = time.time()
                r = cherry.classify(model='harmful', text=list1)
                #r = http_get("https://qq.com")
                cherry_end_time = time.time()
                time1 = (cherry_end_time - cherry_start_time)
                print("模型识别耗时为：%s" % str(time1))  # "model recognition time"
            except Exception as e:
                print("分类模型出现异常: %s  %s " % (str(e), chinese_text))  # "classification model raised an exception"
                continue




def main():
    threads = []
    for i in range(threadNum):  # start threadNum threads
        thread = MyThread(worker)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print("进程创建成功")  # "process created successfully"



def link_db(sql):
    db = pymysql.connect(host='####', port=3306, user='root', password='###', db='###', charset='utf8')
    cursor = db.cursor()  # create a cursor, the channel between this program and the database
    # roll back if an error occurs
    try:
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        print(sql)
        print("数据库连接出现异常:  %s " % (str(e)))  # "database connection raised an exception"
        db.rollback()
    results = cursor.fetchall()
    print("数据库连接成功")  # "database connection succeeded"
    db.close()
    return results


test_start_time = time.time()
mnb = MultinomialNB(class_prior=[0.4, 0.15, 0.15, 0.15, 0.15])
cherry.train(model='harmful', clf=mnb)
threadNum=50
q = queue.Queue()
sql = "SELECT * FROM  idc_web where task_id=3  limit 100"
results = link_db(sql)
# enqueue the tasks
for row in results:
    q.put(row)


main()
test_end_time = time.time()
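total_time = (test_end_time - test_start_time)  # renamed so it no longer shadows the time module
print("模型识别耗时为：%s" % str(total_time))  # this final line reports the total run time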
exit(0)
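
One plausible reading of the two logs above: the classification call is CPU-bound, and in CPython the global interpreter lock lets only one thread execute Python bytecode at a time. Threads then cannot shorten the total run time (about 21 s in both logs), and each call's measured wall time grows because it includes time spent waiting for the other threads. A blocking network call such as http_get releases the lock while it waits, which would explain why that variant scales. The sketch below reproduces the measurement pattern with the standard library only. classify_stub is a hypothetical CPU-bound stand-in, not cherry's API, and drain and run are illustrative names.

```python
import queue
import threading
import time


def classify_stub(text):
    # Hypothetical CPU-bound stand-in for the real classification call.
    total = 0
    for ch in text * 2000:
        total += ord(ch)
    return total


def drain(tasks, results, lock):
    # Pull tasks until the queue is empty, timing each call separately.
    while True:
        try:
            text = tasks.get_nowait()
        except queue.Empty:
            return
        started = time.perf_counter()
        label = classify_stub(text)
        elapsed = time.perf_counter() - started
        with lock:
            results.append((text, label, elapsed))


def run(texts, thread_count):
    # Returns the per-item results and the total wall time of the whole run.
    tasks = queue.Queue()
    for text in texts:
        tasks.put(text)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=drain, args=(tasks, results, lock))
        for _ in range(thread_count)
    ]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - started
```

Comparing run(texts, 1) with run(texts, 50) on the same input shows whether the per-item times grow while the total stays flat, which is the shape seen in the logs if the work is CPU-bound. For real parallelism on such work, separate processes (for example the multiprocessing module) are the usual route.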

Model loading problem

If I do not train the model at the start of the script (the script from the previous question), an exception is thrown when classifying. PS: the model had already been trained in an earlier run.

###########################

#mnb = MultinomialNB(class_prior=[0.4, 0.15, 0.15, 0.15, 0.15])
#cherry.train(model='harmful', clf=mnb)

The exceptions are as follows:

数据库连接成功
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
分类模型出现异常: Can't get attribute 'CountVectorizer' on <module 'sklearn.feature_extraction.text' from 'E:\\cherry1\\venv\\lib\\site-packages\\sklearn\\feature_extraction\\text.py'>  后台登陆 
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Seba\AppData\Local\Temp\jieba.cache
Loading model cost 1.403 seconds.
Prefix dict has been built succesfully.
模型识别耗时为：7.677468538284302
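
An error of the form "Can't get attribute ... on <module ...>" comes from pickle: a pickle stores only the module path and name of an object's class, and loading fails when that name cannot be found on the module at load time. The sketch below reproduces the mechanism with the standard library only; the exact wording of the message varies between Python versions. The module name demo_vectorizers and the class Vectorizer are hypothetical stand-ins, not scikit-learn or cherry code. The log above does not establish why the attribute was missing in this report. Two things worth checking, both assumptions, are whether the installed scikit-learn version matches the one that produced the cached model, and whether importing sklearn.feature_extraction.text in the main thread before starting the workers changes the result.

```python
import pickle
import sys
import types

# Build a throwaway module that defines one class (hypothetical names).
demo = types.ModuleType("demo_vectorizers")
exec("class Vectorizer:\n    pass\n", demo.__dict__)
sys.modules["demo_vectorizers"] = demo

# The pickle records "demo_vectorizers" and "Vectorizer", not the class body.
blob = pickle.dumps(demo.Vectorizer())


def try_load(data):
    """Return the loaded object, or the error message if the class is missing."""
    try:
        return pickle.loads(data)
    except AttributeError as exc:
        return str(exc)


loaded_ok = try_load(blob)       # the class is resolvable, so loading succeeds
del demo.Vectorizer              # simulate the class not being available
loaded_missing = try_load(blob)  # pickle can no longer resolve the class

print(type(loaded_ok).__name__)
print(loaded_missing)
```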

How do I customize the training content?

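Hello, could you write a tutorial? Specifically, how do I define custom categories? I want to add an advertisement category. Also, some texts raise an exception, as shown below: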
result = cherry.classify('草泥马')
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python3.7/site-packages/cherry/api.py", line 18, in classify
    return Result(text=text, lan=lan)
  File "/usr/local/lib/python3.7/site-packages/cherry/classify.py", line 26, in __init__
    self._data_to_vector()
  File "/usr/local/lib/python3.7/site-packages/cherry/classify.py", line 73, in _data_to_vector
    raise TextNotFoundError(error)
cherry.exceptions.TextNotFoundError: Tokens in text do not in train data

IndexError: arrays used as indices must be of integer (or boolean) type

I tried to test cherry

cherry.classify('Wonder.Woman.2017.1080p.BRRip.6CH.MkvCage.mkv')

But it raised an error:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.064 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "pornography_detect.py", line 33, in <module>
    percentage = cherry.classify('Wonder.Woman.2017.1080p.BRRip.6CH.MkvCage.mkv').percentage
  File "/root/anaconda3/lib/python3.6/site-packages/cherry/api.py", line 18, in classify
    return Result(text=text, lan=lan)
  File "/root/anaconda3/lib/python3.6/site-packages/cherry/classify.py", line 27, in __init__
    self.percentage, self.word_list = self._bayes_classify()
  File "/root/anaconda3/lib/python3.6/site-packages/cherry/classify.py", line 84, in _bayes_classify
    non_zero_vector = final_vector[np.array(self.word_index)]
IndexError: arrays used as indices must be of integer (or boolean) type

And:

cherry.classify('测试文本test text').percentage

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.182 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "pornography_detect.py", line 33, in <module>
    percentage = cherry.classify('测试文本test text').percentage
  File "/root/anaconda3/lib/python3.6/site-packages/cherry/api.py", line 18, in classify
    return Result(text=text, lan=lan)
  File "/root/anaconda3/lib/python3.6/site-packages/cherry/classify.py", line 27, in __init__
    self.percentage, self.word_list = self._bayes_classify()
  File "/root/anaconda3/lib/python3.6/site-packages/cherry/classify.py", line 84, in _bayes_classify
    non_zero_vector = final_vector[np.array(self.word_index)]
IndexError: arrays used as indices must be of integer (or boolean) type

What's wrong...
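
A likely cause, judging only from the traceback: when none of the tokens in the input are in the trained vocabulary, self.word_index is an empty list. np.array([]) defaults to dtype float64, and NumPy rejects a float array as an index with exactly this IndexError. The sketch below reproduces that behaviour and shows one possible guard. select_known and the sample vector are hypothetical, not cherry's code, and that word_index was empty in the reported runs is an assumption.

```python
import numpy as np


def select_known(final_vector, word_index):
    # Force an integer dtype so an empty index list is still a valid index array.
    index = np.asarray(word_index, dtype=np.intp)
    return final_vector[index]


vector = np.array([0.1, 0.7, 0.2])
empty = np.array([])  # what np.array(self.word_index) yields for an empty list

print(empty.dtype)  # float64
try:
    vector[empty]
except IndexError as exc:
    print(exc)

# With the guard, an empty index list gives an empty selection instead.
print(select_known(vector, []).size)
```

An empty selection still means the classifier has nothing to score, so a caller would probably want to detect that case and report "no known tokens" rather than continue.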

The trained model is very large and classification is slow. How can this be solved?

I trained on 3 million samples. After training, trained.pkl and ve.pkl are each about 15 MB, and classifying a single sentence takes more than 5 s.

After training a custom model I get a probability for each category, but which category does each one correspond to? How can I tell?

res.get_probability
array([[1.41203867e-10, 2.22271422e-15, 9.99999998e-01, 2.50630140e-10,
2.95633885e-18, 1.27142354e-09]])

For a result like this, how do I tell which category each probability corresponds to? In what order are they arranged?
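
In scikit-learn, the columns of predict_proba follow the order of the classifier's classes_ attribute, which holds the sorted class labels. Assuming get_probability returns that array unchanged (an assumption about cherry, not confirmed here), pairing the two gives the label for each value. The sketch below uses only the standard library. The label names are hypothetical placeholders for whatever clf.classes_ contains, and the numbers are the ones quoted above.

```python
def rank_labels(labels, probabilities):
    """Pair each class label with its probability, highest first."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)


# Hypothetical labels, in the sorted order that clf.classes_ would report.
labels = ["ads", "gamble", "normal", "politics", "porn", "spam"]
# First (and only) row of the array quoted above.
row = [1.41203867e-10, 2.22271422e-15, 9.99999998e-01,
       2.50630140e-10, 2.95633885e-18, 1.27142354e-09]

ranked = rank_labels(labels, row)
best_label, best_probability = ranked[0]
print(best_label, best_probability)
```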