dataabc / weibo-crawler
A Sina Weibo crawler: scrapes Weibo data with Python and downloads post images and videos.
Traceback (most recent call last):
File "weibo.py", line 404, in get_one_weibo
is_long_retweet = retweeted_status['isLongText']
KeyError: 'isLongText'
Could you take another look? I've already updated to the latest version, but the results still contain no bid.
('Error: ', KeyError('statuses_count',))
Traceback (most recent call last):
File "weibo.py", line 479, in start
self.get_pages()
File "weibo.py", line 452, in get_pages
page_count = self.get_page_count()
File "weibo.py", line 364, in get_page_count
weibo_count = self.user['statuses_count']
KeyError: 'statuses_count'
This Weibo account has been banned (although as the account owner I can still browse it normally). Could the ban be the cause of this error?
The crawler cannot capture the exact publish time (down to the minute). Could this feature be added? Thanks!
If a post contains a hyperlink, the scraped text shows only the placeholder 网页链接 ("web link"); the URL itself cannot be seen.
def get_pages(self):
    """Fetch all of the user's Weibo posts."""
    self.get_user_info()
    gender = u'女' if self.user['gender'] == 'f' else u'男'  # gender
    import pandas as pd
    df = pd.DataFrame()
    df['ID'] = self.user['id']
    df['昵称'] = self.user['screen_name']
    df['性别'] = gender
    df['微博数'] = self.user['statuses_count']
    df['微博粉丝数'] = self.user['followers_count']
    df['微博关注数'] = self.user['follow_count']
    df['微博简介'] = self.user['description']
    df['微博等级'] = self.user['urank']
    df['微博会员等级'] = self.user['mbrank']
    df.to_excel('%s个人信息.xlsx' % self.user['screen_name'], index=False)
    page_count = self.get_page_count()
    wrote_count = 0
    self.print_user_info()
Hi, I ran this export, but the output file is empty; there is no data.
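The empty file is expected with the snippet above: assigning a scalar to a column of an empty DataFrame broadcasts over a zero-length index, so every column ends up with zero rows and to_excel writes headers only. A minimal fix sketch, as a drop-in replacement for the DataFrame lines inside get_pages:

import pandas as pd

# One-element lists give the frame exactly one row; the column
# names are kept exactly as in the snippet above.
df = pd.DataFrame({
    'ID': [self.user['id']],
    '昵称': [self.user['screen_name']],
    '性别': [gender],
    '微博数': [self.user['statuses_count']],
    '微博粉丝数': [self.user['followers_count']],
    '微博关注数': [self.user['follow_count']],
    '微博简介': [self.user['description']],
    '微博等级': [self.user['urank']],
    '微博会员等级': [self.user['mbrank']],
})
df.to_excel('%s个人信息.xlsx' % self.user['screen_name'], index=False)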
Could the user's registration date be added to the scraped information?
Weibo converts links into short links; when displayed, they appear as a 网页链接 ("web link") hyperlink.
When scraped, this becomes the plain text 网页链接, and the content of the t.cn/xxx link is lost, e.g.:
下载:O网页链接 码:6n4h
I seem to recall that earlier versions could extract the t.cn content.
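One workaround sketch, assuming the post text arrives from m.weibo.cn as HTML before being flattened to plain text: collect the anchors' hrefs with lxml (the //a/@href idea also raised in a suggestion further down) and append them to the stored text so the t.cn links survive.

from lxml import etree

def extract_links(html_text):
    # Return every href in the post's HTML so the short links are
    # kept even after the visible text collapses to 网页链接.
    if not html_text:
        return []
    selector = etree.HTML(html_text)
    if selector is None:  # lxml returns None for empty documents
        return []
    return selector.xpath('//a/@href')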
Could the data be saved in JSON format, e.g. "点赞数": 3283927?
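JSON output is straightforward to bolt on. A sketch, assuming each scraped post is already a plain dict (the field name mirrors the example in the question):

import json

def write_json(weibo_list, path):
    # ensure_ascii=False keeps Chinese keys readable in the file,
    # e.g. "点赞数": 3283927 instead of \u escape sequences.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(weibo_list, f, ensure_ascii=False, indent=2)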
Probably because this Weibo account was registered recently? Its user id is quite long, and I got an error similar to the one triggered by personalized English domain ids.
Error: 'statuses_count'
Traceback (most recent call last):
File "weibo.py", line 739, in start
self.get_pages()
File "weibo.py", line 694, in get_pages
page_count = self.get_page_count()
File "weibo.py", line 475, in get_page_count
weibo_count = self.user['statuses_count']
KeyError: 'statuses_count'
The id is 1005055911162580; could you take another look? Thanks!
Environment
Steps to reproduce
$ python -m pip install -r requirements.txt
Error output
Collecting lxml==4.3.4 (from -r requirements.txt (line 1))
Using cached https://files.pythonhosted.org/packages/da/b5/d3e0d22649c63e92cb0902847da9ae155c1e801178ab5d272308f35f726e/lxml-4.3.4.tar.gz
Requirement already satisfied: requests==2.22.0 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from -r requirements.txt (line 2)) (2.22.0)
Requirement already satisfied: tqdm==4.32.2 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from -r requirements.txt (line 3)) (4.32.2)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (2019.11.28)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (1.25.8)
Installing collected packages: lxml
Running setup.py install for lxml: started
Running setup.py install for lxml: finished with status 'error'
ERROR: Command errored out with exit status 1:
command: 'C:\Users\shizhang\AppData\Local\Programs\Python\Python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"'; __file__='"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\shizhang\AppData\Local\Temp\pip-record-s9zyd_o1\install-record.txt' --single-version-externally-managed --compile
cwd: C:\Users\shizhang\AppData\Local\Temp\pip-install-71w5rudd\lxml\
Complete output (77 lines):
Building lxml version 4.3.4.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.8
creating build\lib.win-amd64-3.8\lxml
copying src\lxml\builder.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\cssselect.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\doctestcompare.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\ElementInclude.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\pyclasslookup.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\sax.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\usedoctest.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\_elementpath.py -> build\lib.win-amd64-3.8\lxml
copying src\lxml\__init__.py -> build\lib.win-amd64-3.8\lxml
creating build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\__init__.py -> build\lib.win-amd64-3.8\lxml\includes
creating build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\builder.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\clean.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\defs.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\diff.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\ElementSoup.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\formfill.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\html5parser.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\soupparser.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\usedoctest.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\_diffcommand.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\_html5builder.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\_setmixin.py -> build\lib.win-amd64-3.8\lxml\html
copying src\lxml\html\__init__.py -> build\lib.win-amd64-3.8\lxml\html
creating build\lib.win-amd64-3.8\lxml\isoschematron
copying src\lxml\isoschematron\__init__.py -> build\lib.win-amd64-3.8\lxml\isoschematron
copying src\lxml\etree.h -> build\lib.win-amd64-3.8\lxml
copying src\lxml\etree_api.h -> build\lib.win-amd64-3.8\lxml
copying src\lxml\lxml.etree.h -> build\lib.win-amd64-3.8\lxml
copying src\lxml\lxml.etree_api.h -> build\lib.win-amd64-3.8\lxml
copying src\lxml\includes\c14n.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\config.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\dtdvalid.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\etreepublic.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\htmlparser.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\relaxng.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\schematron.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\tree.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\uri.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\xinclude.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\xmlerror.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\xmlparser.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\xmlschema.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\xpath.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\xslt.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\__init__.pxd -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\etree_defs.h -> build\lib.win-amd64-3.8\lxml\includes
copying src\lxml\includes\lxml-version.h -> build\lib.win-amd64-3.8\lxml\includes
creating build\lib.win-amd64-3.8\lxml\isoschematron\resources
creating build\lib.win-amd64-3.8\lxml\isoschematron\resources\rng
copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\rng
creating build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl
creating build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\shizhang\AppData\Local\Programs\Python\Python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"'; __file__='"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\shizhang\AppData\Local\Temp\pip-record-s9zyd_o1\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
WARNING: You are using pip version 19.2.3, however version 20.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
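For what it's worth, the root cause is that lxml 4.3.4 predates Python 3.8, so pip finds no prebuilt Windows wheel and compiles from source, which needs the libxml2/libxslt headers and MSVC 14.0. Assuming the crawler works with a newer lxml, installing a release that ships a cp38 wheel sidesteps the toolchain entirely:

$ python -m pip install --upgrade pip
$ python -m pip install "lxml>=4.5.0"

Otherwise, installing the Visual C++ Build Tools from the URL in the error lets the pinned 4.3.4 build from source.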
Hi, the volume I need to crawl is too large; yesterday the crawl broke after about 200 users. So I'd like to set user_id_list to a folder containing many small txt files, each holding 100 ids, crawl the folder's txt files one by one, and sleep for a long while after finishing each file before starting the next. How would I need to modify the code to do this?
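A sketch of that batching idea. It assumes, as the tracebacks in this thread suggest, that weibo.py exposes a Weibo class constructed from the config dict with a start() method, and that user_id_list may point at a txt file:

import json
import os
import time

from weibo import Weibo

ID_DIR = 'user_id_lists'  # folder of txt files, ~100 ids each
PAUSE = 30 * 60           # seconds to sleep between files

with open('config.json', encoding='utf-8') as f:
    config = json.load(f)

for name in sorted(os.listdir(ID_DIR)):
    if not name.endswith('.txt'):
        continue
    config['user_id_list'] = os.path.join(ID_DIR, name)
    Weibo(config).start()  # crawl every id in this file
    time.sleep(PAUSE)      # long pause before the next batch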
Error: Expecting property name enclosed in double quotes: line 9 column 1 (char 172)
Traceback (most recent call last):
File "D:/Anaconda3/weibo-crawler-master/weibo.py", line 807, in main
config = json.loads(f.read())
File "D:\Anaconda3\lib\json_init_.py", line 348, in loads
return _default_decoder.decode(s)
File "D:\Anaconda3\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "D:\Anaconda3\lib\json\decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 9 column 1 (char 172)
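This error almost always means config.json itself is malformed: JSON forbids trailing commas, single-quoted strings, and comments. A minimal reproduction of the same message (the id is illustrative):

import json

# A trailing comma before '}' is enough to trigger it:
json.loads('{"user_id_list": ["1669879400"],}')
# json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes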
Error: list index out of range
Traceback (most recent call last):
File "E:/微博数据/爬虫工具/weiboSpider-master/weiboSpider-master/weiboSpider.py", line 161, in extract_user_info
if selector.xpath("//div[@Class='tip'][2]/text()")[0] == u'学习经历':
IndexError: list index out of range
User nickname:
User id: user_id_list.txt
Posts: 9
Following: 244
Followers: 263
url: https://weibo.cn/user_id_list.txt
It won't crawl according to the contents of the txt file, and I can't tell what went wrong; the file just contains the test ids for those three celebrities (胡歌, 迪丽热巴, etc.).
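The output above shows the literal string user_id_list.txt being treated as a user id (note the url line), which suggests the program never recognized the value as a file path; check that the configured path matches the file's actual location. Assuming the usual format of these projects, the file holds one numeric id per line (the ids below are illustrative only):

1669879400
1223178222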
Hi, if I want to fetch posts from multiple (non-specific) users in one run, is there somewhere I can configure random selection?
I tried this repository for some personal crawling needs; here are a few suggestions based on that experience.
If you want to actually read the results, the current output-file format is not very friendly; the standard output is far more readable and free of URL clutter, and only small changes are needed: //a/@href can capture the links, though it does bring some duplicated information; replacing <br /> with \n produces proper line breaks without disturbing the rest of the format.
The default sleep frequency felt slightly low. Being cautious, I sleep once per page, but under normal conditions raising the sleep probability should also help; I even sleep an extra time every 20 pages. The downside is that the run takes very long and the cookie needs frequent refreshing. Merely changing it to sleep once every 3 posts already improved things, but the crawler still gets blocked easily, which makes a checkpoint/resume feature important (see the next point). Once throttled, fetching pages (get_one_page), long posts, and videos can all fail, so resuming matters; some personal ideas are sketched after this message.
Finally, I should state that these are just suggestions from personal experience. I originally modified the code only for my own use, but wrote this up in case it helps others. Since I never planned to publish, my copy is a mess by now, so I won't post it; most of it works, so I hope the owner can implement the ideas more elegantly, or maybe I will redo it properly when I have time. Either way, I would appreciate the owner's feedback so the above can be weighed.
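On the checkpoint/resume point, one minimal sketch (the file name and structure are my own, not the project's): persist the newest post id already saved per user, stop paging as soon as a known id reappears, and update the record after a successful run.

import json
import os

CHECKPOINT = 'checkpoint.json'

def load_checkpoint():
    # Map of user_id -> id of the newest post already written out.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, encoding='utf-8') as f:
            return json.load(f)
    return {}

def save_checkpoint(state):
    with open(CHECKPOINT, 'w', encoding='utf-8') as f:
        json.dump(state, f, ensure_ascii=False)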
SyntaxError: invalid token
Hi, how can I also crawl the user_ids of a user's followers and followings?
I want statistics for a period of last year, but the JSON config only supports crawling from a start date up to now; how should I modify it?
How do I select a time window for crawling? I don't want to crawl all posts at once, just a given year or a given date range. (A sketch follows.)
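A filtering sketch for the date-range question, assuming each post's publish date is available as a 'YYYY-MM-DD' string: keep posts inside [START, END], and because pages arrive newest-first, stop paging entirely once a post falls before START.

from datetime import datetime

START = datetime(2019, 3, 1)  # illustrative bounds
END = datetime(2019, 6, 30)

def in_range(created_at):
    d = datetime.strptime(created_at, '%Y-%m-%d')
    return START <= d <= END

def past_window(created_at):
    # True once paging has moved past the window; the caller breaks.
    return datetime.strptime(created_at, '%Y-%m-%d') < START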
Error: list index out of range
Traceback (most recent call last):
File "E:\weiboSpider.py", line 161, in extract_user_info
if selector.xpath("//div[@Class='tip'][2]/text()")[0] == u'学习经历':
IndexError: list index out of range
Error: invalid literal for int() with base 10: ''
Traceback (most recent call last):
File "E:\weiboSpider.py", line 254, in get_user
weibo_num = int(user_info[0][3:-1])
ValueError: invalid literal for int() with base 10: ''
Hi, I tried putting the ids into a newly created txt file, but I ran into the two errors shown above.
Hello, could a module for fetching comment details be added?
The csv writer creates one table per person and has result_header2 and result_header3 (is_original and so on); how should this be implemented when inserting into MySQL?
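A sketch of one option, assuming PyMySQL and a post dict w that carries an is_original flag: rather than one table per person, use a single weibo table keyed by post id, with user_id and is_original columns, and upsert.

import pymysql  # assumption: PyMySQL is installed

def insert_weibo(conn, w):
    # One shared table with an is_original flag replaces the csv
    # writer's separate original/retweet headers (result_header2/3).
    sql = ("INSERT INTO weibo (id, user_id, text, is_original) "
           "VALUES (%s, %s, %s, %s) "
           "ON DUPLICATE KEY UPDATE text = VALUES(text)")
    with conn.cursor() as cur:
        cur.execute(sql, (w['id'], w['user_id'], w['text'],
                          int(w['is_original'])))
    conn.commit()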
What about users whose user_id is alphabetic? For example, Angelababy's id is realangelababy.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Progress: 0%| | 0/156 [00:00<?, ?it/s] Page 1
Progress: 0%| | 0/156 [00:00<?, ?it/s]
Weibo crawl finished; 0 posts crawled in total
Information scraping complete
Error: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 854, in start
self.get_pages()
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 803, in get_pages
self.get_user_info()
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 174, in get_user_info
js = self.get_json(params)
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 113, in get_json
return r.json()
File "/Users/cc/anaconda3/lib/python3.7/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/Users/cc/anaconda3/lib/python3.7/json/init.py", line 348, in loads
return _default_decoder.decode(s)
File "/Users/cc/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/cc/anaconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
File "weibo.py", line 16, in <module> import requests ImportError: No module named requests
I uninstalled and reinstalled requests, but I don't understand what is going on above.
I entered the cookie as well, but this still happens. since_date is 2018-01-01, and the problem starts once crawling reaches 2019-11-17. I'm now trying without the cookie to see whether anything changes.
Hi, while running the code I found a retweet whose source video was not downloaded; the csv shows an empty url for the source video. For example, when crawling the post this account published on Dec 23, the retweeted source video was not downloaded, even though filter is set to 0 in the json. Weibo link: https://weibointl.api.weibo.cn/share/109561554.html?weibo_id=4452708243343956
I'm crawling the user 1711243680 with Python 2.7 on Ubuntu, and the following error keeps appearing:
('Error: ', KeyError('status',))
Traceback (most recent call last):
File "weibo.py", line 302, in get_one_page
wb = self.get_one_weibo(w)
File "weibo.py", line 280, in get_one_weibo
retweet = self.get_long_weibo(retweet_id)
File "weibo.py", line 70, in get_long_weibo
weibo_info = js['status']
KeyError: 'status'
It does keep crawling, but some posts are missed. Switching to another user gives the same behavior, with filter = 0 or 1. I don't know why; could it be related to posts that contain video?
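A guess at the cause: when Weibo throttles the crawler, the long-post endpoint returns JSON without a 'status' key, so js['status'] raises the KeyError above. A defensive sketch that skips instead of crashing, at the cost of keeping the truncated text:

def extract_status(js):
    # None signals a missing full post (likely throttling), letting
    # the caller sleep and retry, or fall back to the short text.
    return js.get('status') if isinstance(js, dict) else None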
Hi, simply adding a txt file doesn't seem to work?
Hello,
(1) How can I crawl only the user profiles, without the post data?
(2) Can the user's location (e.g. 北京 海淀) and age also be scraped? How would I modify the code?
(3) The program exits with its standard error message:
the error may have one of the following causes:
1. the user_id is incorrect;
2. this user's posts may require a cookie to crawl.
Solution:
see
https://github.com/dataabc/weibo-crawler#如何获取user_id
to obtain the correct user_id;
or follow the "set cookie" part of
https://github.com/dataabc/weibo-crawler#3程序设置
to configure the cookie.
When this error occurs, how can the program skip that ID and keep running? (See the sketch after this message.)
Many thanks!
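For question (3), a sketch of a driver loop (the names here are hypothetical, not the project's API) that logs and skips a failing id instead of aborting the whole run:

def crawl_all(user_ids):
    # One bad id (wrong user_id, or an account that needs a cookie)
    # is reported and skipped; the loop then moves on to the next id.
    for user_id in user_ids:
        try:
            crawl_one_user(user_id)  # hypothetical per-user entry point
        except Exception as e:
            print('Skipping %s: %s' % (user_id, e))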
I call weibo.py from a Node program; the code is as follows:
//index.js
const spawn = require("child_process").spawn;
const process = spawn('python',["./weibo.py"])
process.stdout.on('data', (data) => {
console.log(`stdout: ${data}`);
});
process.stderr.on('data', (data) => {
console.error(`stderr: ${data}`);
});
index.js and weibo.py are in the same directory.
Running the command
$ node index.js
produces the following error:
stderr: Traceback (most recent call last):
stderr: File "./weibo.py", line 947, in start
self.get_pages()
File "./weibo.py", line 888, in get_pages
self.get_user_info()
File "./weibo.py", line 200, in get_user_info
user = self.standardize_info(user_info)
File "./weibo.py", line 424, in standardize_info
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None
stdout: ('Error: ', TypeError('encode() argument 1 must be string, not None',))
The traceback points at line 424 of weibo.py:
weibo[k] = v.replace(u"\u200b", "").encode(
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
The value of sys.stdout.encoding here is None, which triggers the error. Could sys.stdout.encoding be replaced with a fixed value?
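Yes. sys.stdout.encoding is None whenever stdout is a pipe rather than a terminal, which is exactly what child_process.spawn sets up, so a fixed fallback works. A sketch of the patched line 424:

import sys

enc = sys.stdout.encoding or 'utf-8'  # None when stdout is a pipe
weibo[k] = v.replace(u'\u200b', '').encode(enc, 'ignore').decode(enc)

Alternatively, setting the environment variable PYTHONIOENCODING=utf-8 in the spawn options gives stdout a real encoding without touching weibo.py.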
Error: 'pic_download'
Traceback (most recent call last):
File "G:/PyCharm/spider/1.py", line 1073, in main
wb = Weibo(config)
File "G:/PyCharm/spider/1.py", line 26, in init
self.validate_config(config)
File "G:/PyCharm/spider/1.py", line 68, in validate_config
if config[argument] != 0 and config[argument] != 1:
KeyError: 'pic_download'
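This KeyError suggests config.json predates the pic_download option that validate_config now requires. Either add the key to config.json, or default missing switches before validation; a sketch (the exact option list is an assumption):

# Default missing on/off switches so an older config.json passes
# validate_config instead of raising KeyError.
for key in ('pic_download', 'video_download'):
    config.setdefault(key, 0)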