From 2018 to 2021, I taught AI courses on NLP and text mining for learners in China.
These are source codes for the previous lessons.
我有个数据库读取保存数据的性能问题要请教下:
疑问A:
疑问B:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''=================================================
@IDE :PyCharm
@Author :LuckyHuibo
@Date :2019/8/20 20:03
@Desc :连接数据库,读取数据——分开写,速度很快的代码
=================================================='''
import pymysql
import re
import pysnooper
def clean(s):
"""
清洗数据
:param s: 文本
:return:
"""
re_compile = re.compile(r'�|《|》|\/|)|(|【|】|\\n|\\r|\\t|\\u3000|;|\*')
string = re_compile.sub('', str(s))
return string
# Fetch the news corpus from the database.
@pysnooper.snoop()
def get_news_from_sql(host, user, password, database, port):
    """Read every row of news_chinese.content from MySQL.

    :param host: MySQL server host name
    :param user: database user
    :param password: database password
    :param database: schema name
    :param port: server port (int)
    :return: tuple of 1-tuples ``(content,)`` on success, ``None`` if the query fails
    """
    print('开始连接数据库...')
    # Pass connection parameters by keyword: PyMySQL 1.0 made connect()
    # keyword-only, so the original positional call breaks on current
    # versions. Without charset the fetched text comes back garbled.
    db = pymysql.connect(host=host, user=user, password=password,
                         database=database, port=port, charset='utf8')
    print(db)
    print('连接成功...')
    try:
        cursor = db.cursor()
        sql = """SELECT content from news_chinese"""
        try:
            cursor.execute(sql)
        except Exception as e:
            # A SELECT has nothing to roll back, but keep the original
            # best-effort handling: report and bail out.
            print("发生异常", e)
            db.rollback()
            return
        news = cursor.fetchall()
        print(len(news))
        cursor.close()
        return news
    finally:
        # Always release the connection — the original leaked it when the
        # query raised (early return skipped db.close()).
        db.close()
# Same code: with the save logic appended to the end of get_news_from_sql the
# write was painfully slow, row by row; moved into its own function it
# finished almost instantly (~10000x faster).
def save_txt(news):
    """Clean each fetched row and write it to the output file, one per line.

    :param news: iterable of 1-tuples ``(content,)`` as returned by fetchall()
    """
    try:
        with open('../data/news-sentences-xut2.txt', 'w', encoding='utf-8') as out:
            for row in news:
                out.write(clean(row[0]) + '\n')
    except Exception as err:
        # Best-effort: report the problem instead of crashing the caller.
        print('保存数据到文本出现问题', err)
if __name__ == "__main__":
    # SECURITY NOTE(review): real-looking credentials are hard-coded here;
    # move them to environment variables or a config file before publishing.
    host = "rm-8vbwj6507z6465505ro.mysql.zhangbei.rds.aliyuncs.com"
    user = "root"
    password = "AI@2019@ai"
    database = "stu_db"
    port = 3306
    try:
        contents = get_news_from_sql(host, user, password, database, port)
        save_txt(contents)
    except Exception as e:
        # Bug fix: the original printed the Exception *class* object
        # (`print("ERROR", Exception)`) instead of the caught instance,
        # hiding the actual error message.
        print("ERROR", e)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''=================================================
@IDE :PyCharm
@Author :LuckyHuibo
@Date :2019/8/20 20:03
@Desc :连接数据库,读取数据
【问题】我有个数据库读取保存数据的性能问题要请教下:
# 同样的代码,save_txt的代码写到get_news_from_sql的最后面,保存文本慢得要死,一行一行地读取数据
# 将代码分开写成函数,速度一下子提升上万倍,一下子就保存好了
=================================================='''
import pymysql
import re
import pysnooper
def clean(s):
"""
清洗数据
:param s: 文本
:return:
"""
re_compile = re.compile(r'�|《|》|\/|)|(|【|】|\\n|\\r|\\t|\\u3000|;|\*')
string = re_compile.sub('', str(s))
return string
# Fetch the news corpus from the database and write it straight to a file.
# NOTE(review): this is the deliberately kept "slow" variant from the header
# question — the file-writing loop sits *inside* the @pysnooper.snoop()-traced
# function, so every iteration of the write loop is traced line by line.
# That tracing overhead (not the DB read) is the most plausible cause of the
# observed slowdown — confirm by removing the decorator.
@pysnooper.snoop()
def get_news_from_sql(host, user, password, database, port):
    """Read news_chinese.content from MySQL and dump cleaned rows to disk.

    :param host: MySQL server host name
    :param user: database user
    :param password: database password
    :param database: schema name
    :param port: server port
    :return: None — the fetched rows are only written to the output file here
    """
    print('开始连接数据库...')
    # Without charset the fetched text is garbled.
    db = pymysql.connect(host, user, password, database, port, charset='utf8')
    print(db)
    print('连接成功...')
    cursor = db.cursor()
    sql = """SELECT content from news_chinese"""
    try:
        cursor.execute(sql)
    except Exception as e:
        # On failure: report and roll back (nothing to undo for a SELECT),
        # then give up.
        print("发生异常", e)
        db.rollback()
        return
    news = cursor.fetchall()
    print(len(news))
    cursor.close()
    db.close()
    # return news
    # Same code: with save_txt inlined at the end of get_news_from_sql the
    # save was extremely slow (row by row); split into a separate function it
    # became ~10000x faster (see the header comment).
    # def save_txt(news):
    try:
        with open('../data/news-sentences-xut.txt', 'w', encoding='utf-8') as f:
            for content in news:
                data = content[0]
                text = clean(data)
                f.write(text + '\n')
    except Exception as w:
        print('保存数据到文本出现问题', w)
if __name__ == "__main__":
    # SECURITY NOTE(review): real-looking credentials are hard-coded here;
    # move them to environment variables or a config file before publishing.
    host = "rm-8vbwj6507z6465505ro.mysql.zhangbei.rds.aliyuncs.com"
    user = "root"
    password = "AI@2019@ai"
    database = "stu_db"
    port = 3306
    try:
        # This variant writes the file inside get_news_from_sql itself,
        # so no separate save step is needed.
        contents = get_news_from_sql(host, user, password, database, port)
        # save_txt(contents)
    except Exception as e:
        # Bug fix: the original printed the Exception *class* object
        # instead of the caught instance, hiding the actual error message.
        print("ERROR", e)
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim import models
# 从config配置中读取path_news_txt(保存读取的news_chinese表的数据), path_news_model(保存的model的路径)文件路径
from config.file_path import path_news_txt, path_news_model
if __name__ == "__main__":
    # Train Word2Vec on the dumped news text file.
    # NOTE(review): `size=` is the gensim 3.x parameter name; gensim 4+
    # renamed it to `vector_size` — confirm the installed gensim version.
    news_vec = Word2Vec(LineSentence(path_news_txt), size=100, min_count=1, workers=8)
    # Persist the trained model.
    news_vec.save(path_news_model)
    # Reload the saved model and probe it.
    model = models.Word2Vec.load(path_news_model)
    # Look up words most similar to the query word.
    # NOTE(review): `model.most_similar` is deprecated in gensim 4 — use
    # `model.wv.most_similar` there.
    said = model.most_similar('说')
# NOTE(review): the KeyError below is most likely because the dumped Chinese
# text was never word-segmented: LineSentence tokenizes on whitespace, so
# whole unsegmented sentences become vocabulary tokens and no single word
# like the query exists in the vocab even with min_count=1. Segment the text
# (e.g. with jieba) before training — verify against the dump script.
'''执行后报错,说训练的model中没有“说”这个词,但是数据库中有【说】字,且min_count=1了
File "C:\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 464, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word '说' not in vocabulary"
'''
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Something interesting about the web — a new door to the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents (D3) code.
China tencent open source team.