weibo-topic-spider's Introduction

weibo-topic-spyder

A crawler for Weibo super topics, with word frequency statistics, sentiment analysis, and simple classification.

Newly added: crawling of ordinary Weibo topics; crawling of the discussion and read counts is still being refined.

Sample of crawled data

Usage

Main crawler files:

Ordinary Weibo topics: normal-topic-spyder.py

Weibo super topics: super-topic-spyder.py

Enter your account, password, and the name of the super topic you want to crawl in the main function of the crawler's main file to start crawling. The required Python libraries and chromedriver must be installed in advance.

After crawling finishes, the data is automatically saved to an Excel file in the current directory, one Weibo post per row.

Tip: ordinary topics must be wrapped in #, e.g. #topic#; super topics need no #. In general, super topics carry a diamond icon on Weibo, while ordinary topics appear in the #topic# format.
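For orientation, a minimal sketch of the kind of configuration the main function expects. The variable names below are taken from the tracebacks quoted in the issues further down; the exact names and the spider signature vary between the two scripts, so treat this as an assumption:

```python
# Hypothetical configuration block; names follow the tracebacks quoted in
# the issues below, not necessarily the scripts' exact current code.
username = "your_weibo_account"   # Weibo login account
password = "your_password"        # Weibo login password
keyword = "超话名称"               # super topic name; wrap ordinary topics in #...#
book_name_xls = "weibo.xls"       # output Excel file in the current directory
sheet_name_xls = "weibo"          # worksheet name
maxWeibo = 1000                   # stop after this many posts

spider(username, password, book_name_xls, sheet_name_xls, keyword, maxWeibo)
```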

Super topic crawler

Uses selenium to log in through a simulated browser and crawl. The number of posts per topic is limited by Weibo; currently a single topic yields at most about 8,000 posts. The mobile web version was chosen for crawling, as it gives the best results.

Extra accounts and IPs do little to help crawl a single super topic, so only a single-account, single-IP mode is provided; if you need to crawl several super topics at once, you can add that yourself.

To crawl several super topics, logging in with cookies is the most convenient option.
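A minimal sketch of the mobile-web approach described above, assuming Chrome's mobile emulation and a pre-exported cookie list (this is not the repository's exact code):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Emulate a phone so Weibo serves the mobile site, which the README
# says gives the best crawling results.
options = Options()
options.add_experimental_option("mobileEmulation", {"deviceName": "iPhone X"})
driver = webdriver.Chrome(options=options)

# Cookie login (convenient for crawling several super topics): visit the
# domain once, inject cookies exported from a logged-in session, reload.
driver.get("https://m.weibo.cn")
for cookie in saved_cookies:  # saved_cookies: list of cookie dicts exported earlier
    driver.add_cookie(cookie)
driver.refresh()
```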

Word frequency statistics

Uses the jieba library for word segmentation; the segmented results are then tallied with simple counts and saved to a txt file.
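A minimal sketch of this step (the file names are placeholders):

```python
from collections import Counter
import jieba

# Segment the crawled text, count tokens, and save the tallies to a txt file.
text = open("weibo.txt", encoding="utf-8").read()   # hypothetical input file
words = [w for w in jieba.cut(text) if len(w) > 1]  # drop single chars/punctuation
counts = Counter(words)

with open("word_freq.txt", "w", encoding="utf-8") as f:
    for word, freq in counts.most_common(100):
        f.write(f"{word}\t{freq}\n")
```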

Sentiment analysis

Calls the Baidu Brain (Baidu AI) API; you can register to obtain your own key. The platform does not limit the number of calls; see Baidu Brain for API details.
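A sketch of calling the sentiment endpoint through Baidu's official baidu-aip SDK (the credentials are placeholders; take your own from the Baidu AI console):

```python
from aip import AipNlp  # pip install baidu-aip

# Placeholder credentials -- fill in your own application's values.
APP_ID, API_KEY, SECRET_KEY = "your_app_id", "your_api_key", "your_secret_key"
client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

result = client.sentimentClassify("这个话题真有意思")
item = result["items"][0]
# sentiment: 0 = negative, 1 = neutral, 2 = positive
print(item["sentiment"], item["positive_prob"], item["negative_prob"])
```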

Other

Contributions and improvements are welcome; if you run into other problems, feel free to open an issue.

weibo-topic-spider's People

Contributors

czy1999, rd-pong, snowmanjx


weibo-topic-spider's Issues

Error when running super-topic-spyder

Running the image-loading part of the code raises `no such element: Unable to locate element`. I tried changing some of the paths myself but it still fails; any pointers would be appreciated.
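Not the repository's fix, but a common mitigation: the mobile page renders asynchronously, so waiting for the element explicitly often resolves this. The CSS selector below is illustrative; substitute the one that fails for you:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for the element to appear before touching it.
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.weibo-text"))
)
```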

Full-text display problem when crawling ordinary topics

When crawling, I run into cases like this where the full text is not shown:

```
目前,多个省份在2019年政府工作报告中明确要提高退休人员养老金。
社科院世界社保研究中心执行研究员张盈华表示:考虑到物价指数和职工工资上涨情况,预计2019年退休人员养老金依然会维持上调。如果维持去年养老金上调绝对额的话,预计今年上调 ...全文
```

It is a super topic, but the following problem occurs

```
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='app']/div[1]/div[1]/div[1]/div[4]/div/div/div/a/div[2]/h4[1]"}
(Session info: chrome=80.0.3987.100)
```

Crawling a super topic on mobile, nothing gets crawled

```
文件已存在
开始自动登陆,若出现验证码手动验证
暂停20秒,用于验证码验证
判断页面1成功 0失败 结果是=1
Traceback (most recent call last):
  File "D:/毕业论文/weibo-topic-spider-master/super-topic-spyder.py", line 268, in <module>
    spider(username,password,book_name_xls,sheet_name_xls,keyword,maxWeibo,num,save_pic)
  File "D:/毕业论文/weibo-topic-spider-master/super-topic-spyder.py", line 230, in spider
    elem = driver.find_element_by_xpath("//*[@Class='m-text-cut']").click();
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\21141\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@Class='m-text-cut']"}
(Session info: chrome=87.0.4280.88)
(Driver info: chromedriver=71.0.3578.137 (86ee722808adfe9e3c92e6e8ea746ade08423c7e),platform=Windows NT 10.0.18363 x86_64)
```
This is the error output. I looked at other issues where this error appears, and the explanation was that the super topic does not exist, but my super topic does exist on mobile. Why would this happen? Please help.

Full text not shown

Right now the crawler does not open each post's own page, so when a post body is long it gets collapsed and shows only "...全文" ("...full text"), and the crawl cannot retrieve the full text either. See, for example, row 7 of the sample "肺炎.xls". How can this be solved, and how would I also download the images at the same time?
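For the full-text part of this question, one possible approach (a sketch under assumptions about m.weibo.cn's markup, not the repository's implementation) is to detect the truncation marker and open the post's detail page:

```python
from selenium.webdriver.common.by import By

def get_full_text(driver, card):
    """If a card is truncated ('...全文'), open its detail page for the full body.
    The selectors and link attribute here are assumptions about m.weibo.cn."""
    text = card.find_element(By.CSS_SELECTOR, "div.weibo-text").text
    if "全文" not in text:
        return text
    # Open the post's own page in a new tab, read it, then come back.
    link = card.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
    driver.execute_script("window.open(arguments[0]);", link)
    driver.switch_to.window(driver.window_handles[-1])
    full = driver.find_element(By.CSS_SELECTOR, "div.weibo-text").text
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    return full
```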

Can't crawl content again; please advise

I changed a few places but it still errors out. Could you take a look?
```
文件已存在
开始自动登陆,若出现验证码手动验证
暂停20秒,用于验证码验证
判断页面1成功 0失败 结果是=1
超话链接获取完毕,休眠2秒
Traceback (most recent call last):
  File "F:/宝洁商赛/微博爬虫/weibo.py", line 187, in <module>
    spider(username,password,driver,book_name_xls,sheet_name_xls,keyword,maxWeibo)
  File "F:/宝洁商赛/微博爬虫/weibo.py", line 167, in spider
    yuedu_taolun = driver.find_element_by_xpath("//*[@id='app']/div[1]/div[1]/div[1]/div[4]/div/div/div/a/div[2]/h4[1]").text
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "D:\anaconda\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='app']/div[1]/div[1]/div[1]/div[4]/div/div/div/a/div[2]/h4[1]"}
(Session info: chrome=79.0.3945.130)
```

Storage problem

After sentiment analysis, is there no step that writes the results into the Excel file?

Error when writing to the Excel file

After crawling, during the write, at around record 450 it keeps reporting duplicate data even though the data is not actually duplicated; of a thousand records, only four-hundred-odd were written successfully. Why is this?

Runtime error: element is not attached to the page document

```
Traceback (most recent call last):
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 97, in get_all_text
    weibo_content = driver.find_element(By.CLASS_NAME,'weibo-text').text
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 1250, in find_element
    'value': value})['value']
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".weibo-text"}
(Session info: headless chrome=100.0.4896.127)
Stacktrace:
Backtrace:
Ordinal0 [0x007E7413+2389011]
Ordinal0 [0x00779F61+1941345]
Ordinal0 [0x0066C658+837208]
Ordinal0 [0x006991DD+1020381]
Ordinal0 [0x0069949B+1021083]
Ordinal0 [0x006C6032+1204274]
Ordinal0 [0x006B4194+1130900]
Ordinal0 [0x006C4302+1196802]
Ordinal0 [0x006B3F66+1130342]
Ordinal0 [0x0068E546+976198]
Ordinal0 [0x0068F456+980054]
GetHandleVerifier [0x00999632+1727522]
GetHandleVerifier [0x00A4BA4D+2457661]
GetHandleVerifier [0x0087EB81+569713]
GetHandleVerifier [0x0087DD76+566118]
Ordinal0 [0x00780B2B+1968939]
Ordinal0 [0x00785988+1989000]
Ordinal0 [0x00785A75+1989237]
Ordinal0 [0x0078ECB1+2026673]
BaseThreadInitThunk [0x75DEFA29+25]
RtlGetAppContainerNamedObjectPath [0x77B47A7E+286]
RtlGetAppContainerNamedObjectPath [0x77B47A4E+238]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 270, in <module>
    spider(username,password,book_name_xls,sheet_name_xls,keyword,maxWeibo,num,save_pic)
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 249, in spider
    get_current_weibo_data(elems,book_name_xls,name,yuedu,taolun,maxWeibo,num) #爬取实时
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 157, in get_current_weibo_data
    insert_data(elems,book_name_xls,name,yuedu,taolun,num,save_pic)
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 57, in insert_data
    weibo_content = get_all_text(elem)
  File "D:/MyPythonProject/weibo-topic-spider-master/super-topic-spyder.py", line 101, in get_all_text
    weibo_content = elem.find_elements(By.CSS_SELECTOR,'div.weibo-text')[0].text
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 762, in find_elements
    {"using": by, "value": value})['value']
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 710, in _execute
    return self._parent.execute(command, params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: headless chrome=100.0.4896.127)
Stacktrace:
Backtrace:
Ordinal0 [0x007E7413+2389011]
Ordinal0 [0x00779F61+1941345]
Ordinal0 [0x0066C658+837208]
Ordinal0 [0x0066F064+847972]
Ordinal0 [0x0066EF22+847650]
Ordinal0 [0x0066F1B0+848304]
Ordinal0 [0x00698EF2+1019634]
Ordinal0 [0x0069949B+1021083]
Ordinal0 [0x0068FE51+982609]
Ordinal0 [0x006B4194+1130900]
Ordinal0 [0x0068F974+981364]
Ordinal0 [0x006B4364+1131364]
Ordinal0 [0x006C4302+1196802]
Ordinal0 [0x006B3F66+1130342]
Ordinal0 [0x0068E546+976198]
Ordinal0 [0x0068F456+980054]
GetHandleVerifier [0x00999632+1727522]
GetHandleVerifier [0x00A4BA4D+2457661]
GetHandleVerifier [0x0087EB81+569713]
GetHandleVerifier [0x0087DD76+566118]
Ordinal0 [0x00780B2B+1968939]
Ordinal0 [0x00785988+1989000]
Ordinal0 [0x00785A75+1989237]
Ordinal0 [0x0078ECB1+2026673]
BaseThreadInitThunk [0x75DEFA29+25]
RtlGetAppContainerNamedObjectPath [0x77B47A7E+286]
RtlGetAppContainerNamedObjectPath [0x77B47A4E+238]
```

Please take a look, thank you.
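The StaleElementReferenceException above generally means the page re-rendered between locating the element and using it. A common pattern, sketched here rather than taken from the repository, is to re-locate and retry:

```python
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

def get_text_with_retry(driver, css, retries=3):
    """Re-find the element and retry if the DOM was re-rendered underneath us."""
    for _ in range(retries):
        try:
            return driver.find_element(By.CSS_SELECTOR, css).text
        except StaleElementReferenceException:
            continue  # the page re-rendered; look the element up again
    raise StaleElementReferenceException(f"still stale after {retries} tries: {css}")
```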

Does this still work?

Why am I getting this error?

```
NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='loginName']"}
(Session info: chrome=75.0.3770.100)
```

Related question

The error:

```
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element
...
is not clickable at point (240, 643). Other element would receive the click:
...
(Session info: chrome=79.0.3945.130)
```
It seems to be a window-size problem; I don't know how to solve it.
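If it is indeed a window-size/overlap problem, two common workarounds (sketches, not the repository's code):

```python
# Workaround 1: give the page more room so nothing overlaps the target.
driver.set_window_size(1920, 1080)

# Workaround 2: scroll the element into view and click through JavaScript,
# which bypasses the "another element would receive the click" check.
driver.execute_script("arguments[0].scrollIntoView(true);", elem)
driver.execute_script("arguments[0].click();", elem)
```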

Output problem

It keeps trying to write record 456, reports duplicate data, and loops on this step forever. What is going on?
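A hedged guess at the cause: this and the earlier write issue suggest the duplicate check matches on post text, which can legitimately repeat. A sketch of de-duplicating on a stable post identifier instead (where the id comes from is an assumption):

```python
# Sketch: key the duplicate check on a stable post id instead of the text.
seen_ids = set()

def is_new_post(post_id):
    """Return True the first time a given post id is seen."""
    if post_id in seen_ids:
        return False
    seen_ids.add(post_id)
    return True
```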

excelSave package not found

It's the excelSave package that can't be found; presumably it was written by the author.
Could you explain how that package is written?
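excelSave is indeed the author's own helper module, not a PyPI package. A minimal stand-in built on xlwt/xlrd/xlutils might look like this (the function names are guesses; adapt them to the calls in the spider scripts):

```python
# excelSave.py -- a minimal stand-in for the author's helper.
import os
import xlwt                     # pip install xlwt xlrd xlutils
from xlrd import open_workbook
from xlutils.copy import copy

def create_workbook(path, sheet_name, headers):
    """Create the output .xls with a header row if it does not exist yet."""
    if os.path.exists(path):
        return
    book = xlwt.Workbook(encoding="utf-8")
    sheet = book.add_sheet(sheet_name)
    for col, title in enumerate(headers):
        sheet.write(0, col, title)
    book.save(path)

def append_row(path, sheet_index, row_values):
    """Append one row to an existing .xls (xlwt cannot modify files in place)."""
    rb = open_workbook(path, formatting_info=True)
    wb = copy(rb)
    sheet = wb.get_sheet(sheet_index)
    next_row = rb.sheet_by_index(sheet_index).nrows
    for col, value in enumerate(row_values):
        sheet.write(next_row, col, value)
    wb.save(path)
```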

Question about page navigation

Hello! Your crawler scrapes the "real-time" (实时) feed of a topic, but I would like to crawl the "hot" (热门) feed.
That part seems to involve front-end code which I could not follow, so I don't know how to navigate to "hot" for crawling.
I would appreciate your advice.
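One hedged possibility, assuming the mobile topic page exposes the feeds as plain tab links labelled 实时/热门 (this locator is an assumption about m.weibo.cn's markup, not verified against the repository):

```python
from selenium.webdriver.common.by import By

# Assumption: the feed tabs are ordinary links; switch feeds by clicking
# the tab whose visible text is 热门 ("hot").
hot_tab = driver.find_element(By.LINK_TEXT, "热门")
hot_tab.click()
```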

Exception when crawling the 肺炎求助 (pneumonia help) super topic; please advise

The error is as follows:
```
DevTools listening on ws://127.0.0.1:5947/devtools/browser/311bbf3a-feea-4651-a9a7-137d6255c46e
文件已存在
开始自动登陆,若出现验证码手动验证
暂停20秒,用于验证码验证
判断页面1成功 0失败 结果是=1
Traceback (most recent call last):
  File "C:\Users\wangz\Source\Repos\weibo-topic-spider\weibo-topic-spyder.py", line 191, in <module>
    spider(username,password,driver,book_name_xls,sheet_name_xls,keyword,maxWeibo)
  File "C:\Users\wangz\Source\Repos\weibo-topic-spider\weibo-topic-spyder.py", line 168, in spider
    elem.click()
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\wangz\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=80.0.3987.87)
```

I looked it up, and it seems to mean an element on the DOM changed, causing the error, but I don't know how to fix it. Any guidance would be greatly appreciated.

No module named 'excelSave'

`pip install excelSave` fails, and a Google search turns up nothing for `import excelSave as save` either. What is the problem here, and is there a fix?

Question about user information

Hello, sorry to bother you with questions again... The crawler can currently fetch a user's level; is it also possible to fetch the date a user registered on Weibo, and the user's region? I looked this up but still don't understand how to do it o(╥﹏╥)o
Also, do you have time to update full-text crawling... (sobbing)

Related question

Hello, I'd like to ask for advice. I want to add a feature: when a post is a repost of someone else's Weibo, also crawl the original poster's name. I'm new to crawlers and still learning; any guidance would be appreciated.

A question about sentiment analysis

Thank you for sharing the code. I used it to process data crawled from the web, but in the end everything comes out as positive sentiment. I adjusted the Baidu API and rebuilt a model, with the same result. So I'd like to ask: is word segmentation required before sentiment analysis can be used? Many thanks for your answer.

The Weibo web page only shows 50 pages of content

The web version of Weibo shows at most 50 pages, so no matter how large the limit is set, at most 1,000-odd posts can be crawled. I've heard the mobile version of Weibo doesn't have this restriction; is there a way around it?
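A workaround many Weibo crawlers use (a sketch under assumptions; the containerid must be read from your own topic's URL on the mobile site) is to page through the mobile site's JSON feed rather than the 50-page web UI:

```python
import time
import requests

# m.weibo.cn serves topic feeds as paged JSON; the containerid below is a
# placeholder taken from the topic's mobile URL.
URL = "https://m.weibo.cn/api/container/getIndex"
params = {"containerid": "YOUR_CONTAINER_ID", "page": 1}

while True:
    data = requests.get(URL, params=params).json()
    cards = data.get("data", {}).get("cards", [])
    if not cards:
        break  # no more pages
    for card in cards:
        blog = card.get("mblog")
        if blog:
            print(blog["text"])
    params["page"] += 1
    time.sleep(1)  # be polite; rapid paging gets rate-limited
```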

A question about excelSave

I'm on Python 3.7, and excelSave apparently can't be pip installed; it says there is no matching version, and I can't find any information about this package online either... How do I solve this?

Beginner asking for guidance on getting ordinary-topic crawling to run

I'm new to crawlers. My teacher's assignment is to crawl the content of the Weibo topic ##. I've installed Chromedriver, and when I try to run it with PyCharm I get many errors. Is it my environment setup? Totally lost, crying:
```
C:\ProgramData\Anaconda3\python.exe C:/Users/Administrator.DESKTOP-IKQ3NBP/Desktop/weibo-topic-spider-master/normal-topic-spyder.py
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 72, in start
    self.process = subprocess.Popen(cmd, env=self.env,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
PermissionError: [WinError 5] 拒绝访问。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Administrator.DESKTOP-IKQ3NBP/Desktop/weibo-topic-spider-master/normal-topic-spyder.py", line 187, in <module>
    driver = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')#你的chromedriver的地址
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 73, in __init__
    self.service.start()
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 86, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'Application' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home

Process finished with exit code 1
```
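The traceback shows webdriver.Chrome was handed Chrome's installation folder rather than the chromedriver executable itself, which is what triggers WinError 5 and the "wrong permissions" message. A sketch of the likely fix (the path below is illustrative):

```python
from selenium import webdriver

# Point selenium at the chromedriver *executable*, not at a folder.
# (Illustrative path -- use wherever you actually unpacked chromedriver.exe.)
driver = webdriver.Chrome(r"C:\tools\chromedriver.exe")
```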
