weibo-topic-spider's Introduction

weibo-topic-spyder

A crawler for Weibo super topics, with word frequency statistics, sentiment analysis, and simple classification.

Newly added: crawling of ordinary Weibo topics; crawling of the discussion and read counts is still being refined.

Sample of crawled data

Usage

Main crawler files:

Ordinary Weibo topics: normal-topic-spyder.py

Weibo super topics: super-topic-spyder.py

Enter your account, password, and the name of the super topic you want to crawl in the main function of the crawler's main file to start crawling. The required Python libraries and chromedriver must be installed in advance.

After crawling finishes, the data is automatically saved to an Excel file in the current directory, one Weibo post per row.

Tip: ordinary topics must be wrapped in #, e.g. #topic#; super topics need no #. In general, super topics carry a diamond icon on Weibo, while ordinary topics appear in the #topic# format.
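For orientation, a minimal sketch of the kind of configuration the main function expects. The variable names below are taken from the tracebacks quoted in the issues further down; the exact names and the spider signature vary between the two scripts, so treat this as an assumption:

```python
# Hypothetical configuration block; names follow the tracebacks quoted in
# the issues below, not necessarily the scripts' exact current code.
username = "your_weibo_account"   # Weibo login account
password = "your_password"        # Weibo login password
keyword = "超话名称"               # super topic name; wrap ordinary topics in #...#
book_name_xls = "weibo.xls"       # output Excel file in the current directory
sheet_name_xls = "weibo"          # worksheet name
maxWeibo = 1000                   # stop after this many posts

spider(username, password, book_name_xls, sheet_name_xls, keyword, maxWeibo)
```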

Super topic crawler

Uses selenium to log in through a simulated browser and crawl. The number of posts per topic is limited by Weibo; currently a single topic yields at most about 8,000 posts. The mobile web version was chosen for crawling, as it gives the best results.

Extra accounts and IPs do little to help crawl a single super topic, so only a single-account, single-IP mode is provided; if you need to crawl several super topics at once, you can add that yourself.

To crawl several super topics, logging in with cookies is the most convenient option.
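A minimal sketch of the mobile-web approach described above, assuming Chrome's mobile emulation and a pre-exported cookie list (this is not the repository's exact code):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Emulate a phone so Weibo serves the mobile site, which the README
# says gives the best crawling results.
options = Options()
options.add_experimental_option("mobileEmulation", {"deviceName": "iPhone X"})
driver = webdriver.Chrome(options=options)

# Cookie login (convenient for crawling several super topics): visit the
# domain once, inject cookies exported from a logged-in session, reload.
driver.get("https://m.weibo.cn")
for cookie in saved_cookies:  # saved_cookies: list of cookie dicts exported earlier
    driver.add_cookie(cookie)
driver.refresh()
```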

Word frequency statistics

Uses the jieba library for word segmentation; the segmented results are then tallied with simple counts and saved to a txt file.
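A minimal sketch of this step (the file names are placeholders):

```python
from collections import Counter
import jieba

# Segment the crawled text, count tokens, and save the tallies to a txt file.
text = open("weibo.txt", encoding="utf-8").read()   # hypothetical input file
words = [w for w in jieba.cut(text) if len(w) > 1]  # drop single chars/punctuation
counts = Counter(words)

with open("word_freq.txt", "w", encoding="utf-8") as f:
    for word, freq in counts.most_common(100):
        f.write(f"{word}\t{freq}\n")
```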

Sentiment analysis

Calls the Baidu Brain (Baidu AI) API; you can register to obtain your own key. The platform does not limit the number of calls; see Baidu Brain for API details.
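A sketch of calling the sentiment endpoint through Baidu's official baidu-aip SDK (the credentials are placeholders; take your own from the Baidu AI console):

```python
from aip import AipNlp  # pip install baidu-aip

# Placeholder credentials -- fill in your own application's values.
APP_ID, API_KEY, SECRET_KEY = "your_app_id", "your_api_key", "your_secret_key"
client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

result = client.sentimentClassify("这个话题真有意思")
item = result["items"][0]
# sentiment: 0 = negative, 1 = neutral, 2 = positive
print(item["sentiment"], item["positive_prob"], item["negative_prob"])
```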

Other

Contributions and improvements are welcome; if you run into other problems, feel free to open an issue.

weibo-topic-spider's People

Contributors

czy1999, rd-pong, snowmanjx


weibo-topic-spider's Issues

Error when running super-topic-spyder

Running the image-loading part of the code raises `no such element: Unable to locate element`. I tried changing some of the paths myself but it still fails; any pointers would be appreciated.
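Not the repository's fix, but a common mitigation: the mobile page renders asynchronously, so waiting for the element explicitly often resolves this. The CSS selector below is illustrative; substitute the one that fails for you:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for the element to appear before touching it.
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.weibo-text"))
)
```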

Full-text display problem when crawling ordinary topics

When crawling, I run into cases like this where the full text is not shown:

```
目前,多个省份在2019年政府工作报告中明确要提高退休人员养老金。
社科院世界社保研究中心执行研究员张盈华表示:考虑到物价指数和职工工资上涨情况,预计2019年退休人员养老金依然会维持上调。如果维持去年养老金上调绝对额的话,预计今年上调 ...全文
```

It is a super topic, but the following problem occurs

```
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='app']/div[1]/div[1]/div[1]/div[4]/div/div/div/a/div[2]/h4[1]"}
(Session info: chrome=80.0.3987.100)
```

Crawling a super topic on mobile, nothing gets crawled

```
文件已存在
开始自动登陆,若出现验证码手动验证
暂停20秒,用于验证码验证
判断页面1成功 0失败 结果是=1
Traceback (most recent call last):
  File "D:/毕业论文/weibo-topic-spider-master/super-topic-spyder.py", line 268, in <module>
    spider(username,password,book_name_xls,sheet_name_xls,keyword,maxWeibo,num,save_pic)
  File "D:/毕业论文/weibo-topic-spider-master/super-topic-spyder.py", line 230, in spider
    elem = driver.find_element_by_xpath("//*[@Class='m-text-cut']").click();
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@Class='m-text-cut']"}
(Session info: chrome=87.0.4280.88)
(Driver info: chromedriver=71.0.3578.137 (86ee722808adfe9e3c92e6e8ea746ade08423c7e),platform=Windows NT 10.0.18363 x86_64)
```
This is the error output. I looked at other issues where this error appears, and the explanation was that the super topic does not exist, but my super topic does exist on mobile. Why would this happen? Please help.

Full text not shown

Right now the crawler does not open each post's own page, so when a post body is long it gets collapsed and shows only "...全文" ("...full text"), and the crawl cannot retrieve the full text either. See, for example, row 7 of the sample "肺炎.xls". How can this be solved, and how would I also download the images at the same time?
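For the full-text part of this question, one possible approach (a sketch under assumptions about m.weibo.cn's markup, not the repository's implementation) is to detect the truncation marker and open the post's detail page:

```python
from selenium.webdriver.common.by import By

def get_full_text(driver, card):
    """If a card is truncated ('...全文'), open its detail page for the full body.
    The selectors and link attribute here are assumptions about m.weibo.cn."""
    text = card.find_element(By.CSS_SELECTOR, "div.weibo-text").text
    if "全文" not in text:
        return text
    # Open the post's own page in a new tab, read it, then come back.
    link = card.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
    driver.execute_script("window.open(arguments[0]);", link)
    driver.switch_to.window(driver.window_handles[-1])
    full = driver.find_element(By.CSS_SELECTOR, "div.weibo-text").text
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    return full
```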

Can't crawl content again; please advise

I changed a few places but it still errors out. Could you take a look?
```
文件已存在
开始自动登陆,若出现验证码手动验证
暂停20秒,用于验证码验证
判断页面1成功 0失败 结果是=1
超话链接获取完毕,休眠2秒
Traceback (most recent call last):
  File "F:/宝洁商赛/微博爬虫/weibo.py", line 187, in <module>
    spider(username,password,driver,book_name_xls,sheet_name_xls,keyword,maxWeibo)
  File "F:/宝洁商赛/微博爬虫/weibo.py", line 167, in spider
    yuedu_taolun = driver.find_element_by_xpath("//*[@id='app']/div[1]/div[1]/div[1]/div[4]/div/div/div/a/div[2]/h4[1]").text
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='app']/div[1]/div[1]/div[1]/div[4]/div/div/div/a/div[2]/h4[1]"}
(Session info: chrome=79.0.3945.130)
```

Storage problem

After sentiment analysis, is there no step that writes the results into the Excel file?

Error when writing to the Excel file

After crawling, during the write, at around record 450 it keeps reporting duplicate data even though the data is not actually duplicated; of a thousand records, only four-hundred-odd were written successfully. Why is this?

Runtime error: element is not attached to the page document

```
Traceback (most recent call last):
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 97, in get_all_text
    weibo_content = driver.find_element(By.CLASS_NAME,'weibo-text').text
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 1250, in find_element
    'value': value})['value']
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".weibo-text"}
(Session info: headless chrome=100.0.4896.127)
Stacktrace:
Backtrace:
Ordinal0 [0x007E7413+2389011]
Ordinal0 [0x00779F61+1941345]
Ordinal0 [0x0066C658+837208]
Ordinal0 [0x006991DD+1020381]
Ordinal0 [0x0069949B+1021083]
Ordinal0 [0x006C6032+1204274]
Ordinal0 [0x006B4194+1130900]
Ordinal0 [0x006C4302+1196802]
Ordinal0 [0x006B3F66+1130342]
Ordinal0 [0x0068E546+976198]
Ordinal0 [0x0068F456+980054]
GetHandleVerifier [0x00999632+1727522]
GetHandleVerifier [0x00A4BA4D+2457661]
GetHandleVerifier [0x0087EB81+569713]
GetHandleVerifier [0x0087DD76+566118]
Ordinal0 [0x00780B2B+1968939]
Ordinal0 [0x00785988+1989000]
Ordinal0 [0x00785A75+1989237]
Ordinal0 [0x0078ECB1+2026673]
BaseThreadInitThunk [0x75DEFA29+25]
RtlGetAppContainerNamedObjectPath [0x77B47A7E+286]
RtlGetAppContainerNamedObjectPath [0x77B47A4E+238]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 270, in <module>
    spider(username,password,book_name_xls,sheet_name_xls,keyword,maxWeibo,num,save_pic)
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 249, in spider
    get_current_weibo_data(elems,book_name_xls,name,yuedu,taolun,maxWeibo,num) #爬取实时
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 157, in get_current_weibo_data
    insert_data(elems,book_name_xls,name,yuedu,taolun,num,save_pic)
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 57, in insert_data
    weibo_content = get_all_text(elem)
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 101, in get_all_text
    weibo_content = elem.find_elements(By.CSS_SELECTOR,'div.weibo-text')[0].text
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 762, in find_elements
    {"using": by, "value": value})['value']
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 710, in _execute
    return self._parent.execute(command, params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: headless chrome=100.0.4896.127)
Stacktrace:
Backtrace:
Ordinal0 [0x007E7413+2389011]
Ordinal0 [0x00779F61+1941345]
Ordinal0 [0x0066C658+837208]
Ordinal0 [0x0066F064+847972]
Ordinal0 [0x0066EF22+847650]
Ordinal0 [0x0066F1B0+848304]
Ordinal0 [0x00698EF2+1019634]
Ordinal0 [0x0069949B+1021083]
Ordinal0 [0x0068FE51+982609]
Ordinal0 [0x006B4194+1130900]
Ordinal0 [0x0068F974+981364]
Ordinal0 [0x006B4364+1131364]
Ordinal0 [0x006C4302+1196802]
Ordinal0 [0x006B3F66+1130342]
Ordinal0 [0x0068E546+976198]
Ordinal0 [0x0068F456+980054]
GetHandleVerifier [0x00999632+1727522]
GetHandleVerifier [0x00A4BA4D+2457661]
GetHandleVerifier [0x0087EB81+569713]
GetHandleVerifier [0x0087DD76+566118]
Ordinal0 [0x00780B2B+1968939]
Ordinal0 [0x00785988+1989000]
Ordinal0 [0x00785A75+1989237]
Ordinal0 [0x0078ECB1+2026673]
BaseThreadInitThunk [0x75DEFA29+25]
RtlGetAppContainerNamedObjectPath [0x77B47A7E+286]
RtlGetAppContainerNamedObjectPath [0x77B47A4E+238]
```

Please take a look, thank you.
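The StaleElementReferenceException above generally means the page re-rendered between locating the element and using it. A common pattern, sketched here rather than taken from the repository, is to re-locate and retry:

```python
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

def get_text_with_retry(driver, css, retries=3):
    """Re-find the element and retry if the DOM was re-rendered underneath us."""
    for _ in range(retries):
        try:
            return driver.find_element(By.CSS_SELECTOR, css).text
        except StaleElementReferenceException:
            continue  # the page re-rendered; look the element up again
    raise StaleElementReferenceException(f"still stale after {retries} tries: {css}")
```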

Does this still work?

Why am I getting this error?

```
NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='loginName']"}
(Session info: chrome=75.0.3770.100)
```

Related question

The error:

```
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element
...
is not clickable at point (240, 643). Other element would receive the click:
...
(Session info: chrome=79.0.3945.130)
```
It seems to be a window-size problem; I don't know how to solve it.
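If it is indeed a window-size/overlap problem, two common workarounds (sketches, not the repository's code):

```python
# Workaround 1: give the page more room so nothing overlaps the target.
driver.set_window_size(1920, 1080)

# Workaround 2: scroll the element into view and click through JavaScript,
# which bypasses the "another element would receive the click" check.
driver.execute_script("arguments[0].scrollIntoView(true);", elem)
driver.execute_script("arguments[0].click();", elem)
```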

Output problem

It keeps trying to write record 456, reports duplicate data, and loops on this step forever. What is going on?
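A hedged guess at the cause: this and the earlier write issue suggest the duplicate check matches on post text, which can legitimately repeat. A sketch of de-duplicating on a stable post identifier instead (where the id comes from is an assumption):

```python
# Sketch: key the duplicate check on a stable post id instead of the text.
seen_ids = set()

def is_new_post(post_id):
    """Return True the first time a given post id is seen."""
    if post_id in seen_ids:
        return False
    seen_ids.add(post_id)
    return True
```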

excelSave package not found

It's the excelSave package that can't be found; presumably it was written by the author.
Could you explain how that package is written?
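excelSave is indeed the author's own helper module, not a PyPI package. A minimal stand-in built on xlwt/xlrd/xlutils might look like this (the function names are guesses; adapt them to the calls in the spider scripts):

```python
# excelSave.py -- a minimal stand-in for the author's helper.
import os
import xlwt                     # pip install xlwt xlrd xlutils
from xlrd import open_workbook
from xlutils.copy import copy

def create_workbook(path, sheet_name, headers):
    """Create the output .xls with a header row if it does not exist yet."""
    if os.path.exists(path):
        return
    book = xlwt.Workbook(encoding="utf-8")
    sheet = book.add_sheet(sheet_name)
    for col, title in enumerate(headers):
        sheet.write(0, col, title)
    book.save(path)

def append_row(path, sheet_index, row_values):
    """Append one row to an existing .xls (xlwt cannot modify files in place)."""
    rb = open_workbook(path, formatting_info=True)
    wb = copy(rb)
    sheet = wb.get_sheet(sheet_index)
    next_row = rb.sheet_by_index(sheet_index).nrows
    for col, value in enumerate(row_values):
        sheet.write(next_row, col, value)
    wb.save(path)
```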

Question about page navigation

Hello! Your crawler scrapes the "real-time" (实时) feed of a topic, but I would like to crawl the "hot" (热门) feed.
That part seems to involve front-end code which I could not follow, so I don't know how to navigate to "hot" for crawling.
I would appreciate your advice.
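One hedged possibility, assuming the mobile topic page exposes the feeds as plain tab links labelled 实时/热门 (this locator is an assumption about m.weibo.cn's markup, not verified against the repository):

```python
from selenium.webdriver.common.by import By

# Assumption: the feed tabs are ordinary links; switch feeds by clicking
# the tab whose visible text is 热门 ("hot").
hot_tab = driver.find_element(By.LINK_TEXT, "热门")
hot_tab.click()
```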

Exception when crawling the 肺炎求助 (pneumonia help) super topic; please advise

The error is as follows:
```
DevTools listening on ws://127.0.0.1:5947/devtools/browser/311bbf3a-feea-4651-a9a7-137d6255c46e
文件已存在
开始自动登陆,若出现验证码手动验证
暂停20秒,用于验证码验证
判断页面1成功 0失败 结果是=1
Traceback (most recent call last):
  File "C:\Users\wangz\Source\Repos\weibo-topic-spider\weibo-topic-spyder.py", line 191, in <module>
    spider(username,password,driver,book_name_xls,sheet_name_xls,keyword,maxWeibo)
  File "C:\Users\wangz\Source\Repos\weibo-topic-spider\weibo-topic-spyder.py", line 168, in spider
    elem.click()
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=80.0.3987.87)
```

I looked it up, and it seems to mean an element on the DOM changed, causing the error, but I don't know how to fix it. Any guidance would be greatly appreciated.

No module named 'excelSave'

`pip install excelSave` fails, and a Google search turns up nothing for `import excelSave as save` either. What is the problem here, and is there a fix?

Question about user information

Hello, sorry to bother you with questions again... The crawler can currently fetch a user's level; is it also possible to fetch the date a user registered on Weibo, and the user's region? I looked this up but still don't understand how to do it o(╥﹏╥)o
Also, do you have time to update full-text crawling... (sobbing)

Related question

Hello, I'd like to ask for advice. I want to add a feature: when a post is a repost of someone else's Weibo, also crawl the original poster's name. I'm new to crawlers and still learning; any guidance would be appreciated.

A question about sentiment analysis

Thank you for sharing the code. I used it to process data crawled from the web, but in the end everything comes out as positive sentiment. I adjusted the Baidu API and rebuilt a model, with the same result. So I'd like to ask: is word segmentation required before sentiment analysis can be used? Many thanks for your answer.

The Weibo web page only shows 50 pages of content

The web version of Weibo shows at most 50 pages, so no matter how large the limit is set, at most 1,000-odd posts can be crawled. I've heard the mobile version of Weibo doesn't have this restriction; is there a way around it?
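A workaround many Weibo crawlers use (a sketch under assumptions; the containerid must be read from your own topic's URL on the mobile site) is to page through the mobile site's JSON feed rather than the 50-page web UI:

```python
import time
import requests

# m.weibo.cn serves topic feeds as paged JSON; the containerid below is a
# placeholder taken from the topic's mobile URL.
URL = "https://m.weibo.cn/api/container/getIndex"
params = {"containerid": "YOUR_CONTAINER_ID", "page": 1}

while True:
    data = requests.get(URL, params=params).json()
    cards = data.get("data", {}).get("cards", [])
    if not cards:
        break  # no more pages
    for card in cards:
        blog = card.get("mblog")
        if blog:
            print(blog["text"])
    params["page"] += 1
    time.sleep(1)  # be polite; rapid paging gets rate-limited
```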

A question about excelSave

I'm on Python 3.7, and excelSave apparently can't be pip installed; it says there is no matching version, and I can't find any information about this package online either... How do I solve this?

Beginner asking for guidance on getting ordinary-topic crawling to run

I'm new to crawlers. My teacher's assignment is to crawl the content of the Weibo topic ##. I've installed Chromedriver, and when I try to run it with PyCharm I get many errors. Is it my environment setup? Totally lost, crying:
```
C:\ProgramData\Anaconda3\python.exe C:/Users/Administrator.DESKTOP-IKQ3NBP/Desktop/weibo-topic-spider-master/normal-topic-spyder.py
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 72, in start
    self.process = subprocess.Popen(cmd, env=self.env,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
PermissionError: [WinError 5] 拒绝访问。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Administrator.DESKTOP-IKQ3NBP/Desktop/weibo-topic-spider-master/normal-topic-spyder.py", line 187, in <module>
    driver = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')#你的chromedriver的地址
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 73, in __init__
    self.service.start()
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 86, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'Application' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home

Process finished with exit code 1
```
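The traceback shows webdriver.Chrome was handed Chrome's installation folder rather than the chromedriver executable itself, which is what triggers WinError 5 and the "wrong permissions" message. A sketch of the likely fix (the path below is illustrative):

```python
from selenium import webdriver

# Point selenium at the chromedriver *executable*, not at a folder.
# (Illustrative path -- use wherever you actually unpacked chromedriver.exe.)
driver = webdriver.Chrome(r"C:\tools\chromedriver.exe")
```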
