
weibo-crawler's People

Contributors

blancray, bluehtml, brian95827, casouri, cuckon, dataabc, dependabot[bot], echo536, enel1jk, gaelthas, haozewu, hjyssg, libai1024, lisicheng1997, lwd-temp, mobyw, nanamicat, noisyle, plutonji, thepopezhang, waizui, weydon-ding, xyauhideto

weibo-crawler's Issues

'Error: ', KeyError('statuses_count',)

('Error: ', KeyError('statuses_count',))
Traceback (most recent call last):
  File "weibo.py", line 479, in start
    self.get_pages()
  File "weibo.py", line 452, in get_pages
    page_count = self.get_page_count()
  File "weibo.py", line 364, in get_page_count
    weibo_count = self.user['statuses_count']
KeyError: 'statuses_count'

This Weibo account has been banned (although as the account owner I can still view it normally). Could the ban be the cause of this error?

since_date has no effect after being set

Hello, after setting since_date to "2019-12-01", the crawler still fetches weibos from long ago (2012).

Cannot download hyperlinks contained in weibos

If a weibo contains a hyperlink, the crawled weibo text only shows the placeholder words "网页链接" (web link), and the content of the hyperlink itself cannot be seen.

Hello, sorry to ask again... After doing what you described, the output file contains no data; it is empty.

def get_pages(self):
    """Fetch all weibos."""
    self.get_user_info()

    gender = u'女' if self.user['gender'] == 'f' else u'男'  # gender
    import pandas as pd
    df = pd.DataFrame()
    # Note: assigning scalar values to columns of an empty DataFrame creates
    # columns with zero rows, so the Excel file below ends up with headers only.
    df['ID'] = self.user['id']
    df['昵称'] = self.user['screen_name']
    df['性别'] = gender
    df['微博数'] = self.user['statuses_count']
    df['微博粉丝数'] = self.user['followers_count']
    df['微博关注数'] = self.user['follow_count']
    df['微博简介'] = self.user['description']
    df['微博等级'] = self.user['urank']
    df['微博会员等级'] = self.user['mbrank']
    df.to_excel('%s个人信息.xlsx' % self.user['screen_name'], index=False)
    
    page_count = self.get_page_count()
    wrote_count = 0
    self.print_user_info()

Hello, I produced the output this way, but the output file is empty and has no data.
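The empty file is very likely caused by assigning scalar values to the columns of an empty DataFrame: pandas creates the columns but no rows, so to_excel writes only the header row. A minimal sketch of a possible fix, building a one-row frame from a dict (it reuses self.user and gender from the snippet above):

import pandas as pd

# Build a single-row DataFrame instead of assigning scalars to an empty frame.
row = {
    'ID': self.user['id'],
    '昵称': self.user['screen_name'],
    '性别': gender,
    '微博数': self.user['statuses_count'],
    '微博粉丝数': self.user['followers_count'],
    '微博关注数': self.user['follow_count'],
    '微博简介': self.user['description'],
    '微博等级': self.user['urank'],
    '微博会员等级': self.user['mbrank'],
}
df = pd.DataFrame([row])  # one row, so to_excel writes data rather than just headers
df.to_excel('%s个人信息.xlsx' % self.user['screen_name'], index=False)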

About link content inside the original weibo text

Weibo converts links into short links, which are displayed as a "网页链接" (web link) hyperlink.
When crawled, this becomes just the plain text 网页链接, and the content of the t.cn/xxx link is lost.
For example: "下载:O网页链接 码:6n4h" (download: O web link, code: 6n4h).

I seem to remember that an earlier version could extract the t.cn content.
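If the raw weibo text is available as an HTML fragment (as the m.weibo.cn API returns it), the link targets can usually be recovered from the anchor tags before the markup is stripped. A rough sketch, assuming text_html holds that fragment (the variable and helper names are illustrative, not the project's own):

from lxml import etree

def extract_links(text_html):
    """Return the hrefs (e.g. http://t.cn/xxx) found in a weibo's HTML text."""
    selector = etree.HTML(text_html)
    return selector.xpath('//a/@href') if selector is not None else []

# e.g. append the recovered links to the plain text so they are not lost:
# wb['text'] = plain_text + ' ' + ' '.join(extract_links(text_html))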

[userid issue] This time the number is rather large, and the same problem appears

Probably because the Weibo account was registered fairly recently? The userid is quite long, and the result is an error similar to the one that occurs when an English personalized domain is entered as the id.

Error: 'statuses_count'
Traceback (most recent call last):
File "weibo.py", line 739, in start
self.get_pages()
File "weibo.py", line 694, in get_pages
page_count = self.get_page_count()
File "weibo.py", line 475, in get_page_count
weibo_count = self.user['statuses_count']
KeyError: 'statuses_count'

The id is 1005055911162580; could you take another look? Thank you!

Failed to install the Python modules

环境

  • OS: Windows 10
  • Python3.8.1

Steps to reproduce

$ python -m pip install -r requirements.txt

Error message

Collecting lxml==4.3.4 (from -r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/da/b5/d3e0d22649c63e92cb0902847da9ae155c1e801178ab5d272308f35f726e/lxml-4.3.4.tar.gz
Requirement already satisfied: requests==2.22.0 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from -r requirements.txt (line 2)) (2.22.0)
Requirement already satisfied: tqdm==4.32.2 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from -r requirements.txt (line 3)) (4.32.2)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (2019.11.28)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\shizhang\appdata\local\programs\python\python38\lib\site-packages (from requests==2.22.0->-r requirements.txt (line 2)) (1.25.8)
Installing collected packages: lxml
  Running setup.py install for lxml: started
    Running setup.py install for lxml: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\shizhang\AppData\Local\Programs\Python\Python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"'; __file__='"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\shizhang\AppData\Local\Temp\pip-record-s9zyd_o1\install-record.txt' --single-version-externally-managed --compile
         cwd: C:\Users\shizhang\AppData\Local\Temp\pip-install-71w5rudd\lxml\
    Complete output (77 lines):
    Building lxml version 4.3.4.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.8
    creating build\lib.win-amd64-3.8\lxml
    copying src\lxml\builder.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\cssselect.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\doctestcompare.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\ElementInclude.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\pyclasslookup.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\sax.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\usedoctest.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\_elementpath.py -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\__init__.py -> build\lib.win-amd64-3.8\lxml
    creating build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\__init__.py -> build\lib.win-amd64-3.8\lxml\includes
    creating build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\builder.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\clean.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\defs.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\diff.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\ElementSoup.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\formfill.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\html5parser.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\soupparser.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\usedoctest.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\_diffcommand.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\_html5builder.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\_setmixin.py -> build\lib.win-amd64-3.8\lxml\html
    copying src\lxml\html\__init__.py -> build\lib.win-amd64-3.8\lxml\html
    creating build\lib.win-amd64-3.8\lxml\isoschematron
    copying src\lxml\isoschematron\__init__.py -> build\lib.win-amd64-3.8\lxml\isoschematron
    copying src\lxml\etree.h -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\etree_api.h -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\lxml.etree.h -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\lxml.etree_api.h -> build\lib.win-amd64-3.8\lxml
    copying src\lxml\includes\c14n.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\config.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\dtdvalid.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\etreepublic.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\htmlparser.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\relaxng.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\schematron.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\tree.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\uri.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\xinclude.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\xmlerror.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\xmlparser.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\xmlschema.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\xpath.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\xslt.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\__init__.pxd -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\etree_defs.h -> build\lib.win-amd64-3.8\lxml\includes
    copying src\lxml\includes\lxml-version.h -> build\lib.win-amd64-3.8\lxml\includes
    creating build\lib.win-amd64-3.8\lxml\isoschematron\resources
    creating build\lib.win-amd64-3.8\lxml\isoschematron\resources\rng
    copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\rng
    creating build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl
    copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl
    copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl
    creating build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win-amd64-3.8\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    running build_ext
    building 'lxml.etree' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\shizhang\AppData\Local\Programs\Python\Python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"'; __file__='"'"'C:\\Users\\shizhang\\AppData\\Local\\Temp\\pip-install-71w5rudd\\lxml\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\shizhang\AppData\Local\Temp\pip-record-s9zyd_o1\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
WARNING: You are using pip version 19.2.3, however version 20.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Setting the value of id_list to a folder containing multiple txt files

Hello, the amount I need to crawl is very large, and yesterday the crawl stopped working after a bit more than 200 users. I would therefore like to point my user_id_list at a folder containing many small txt files, each with 100 ids, crawl the files one by one, and sleep for a fairly long time after finishing each txt before moving on to the next one. How would I need to modify the code to achieve this?
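One possible approach (not part of the project; the folder name, sleep length and config handling below are illustrative) is a small driver script that walks the folder, runs the crawler once per txt file, and sleeps between files:

import json
import os
import time

from weibo import Weibo   # assumes this script sits next to weibo.py

ID_DIR = 'id_lists'          # hypothetical folder holding many *.txt id files
SLEEP_BETWEEN_FILES = 3600   # seconds to sleep after finishing each file

with open('config.json', encoding='utf-8') as f:
    config = json.load(f)

for name in sorted(os.listdir(ID_DIR)):
    if not name.endswith('.txt'):
        continue
    config['user_id_list'] = os.path.join(ID_DIR, name)
    Weibo(config).start()            # crawl everyone listed in this txt file
    time.sleep(SLEEP_BETWEEN_FILES)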

Hello, how should this error be fixed?

Error: Expecting property name enclosed in double quotes: line 9 column 1 (char 172)
Traceback (most recent call last):
File "D:/Anaconda3/weibo-crawler-master/weibo.py", line 807, in main
config = json.loads(f.read())
File "D:\Anaconda3\lib\json_init_.py", line 348, in loads
return _default_decoder.decode(s)
File "D:\Anaconda3\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "D:\Anaconda3\lib\json\decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 9 column 1 (char 172)
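This JSONDecodeError is raised while parsing config.json itself; "Expecting property name enclosed in double quotes" at the start of a line usually means there is a trailing comma after the last entry (or a key written with single quotes). A generic illustration of the trailing-comma case, not the actual config file:

import json

bad = '{\n  "user_id_list": "user_id_list.txt",\n}'   # note the trailing comma
try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print(e)   # Expecting property name enclosed in double quotes: ...

json.loads('{\n  "user_id_list": "user_id_list.txt"\n}')   # valid without the comma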

I tried putting the user ids in a txt file and crawling; the following error occurred

Error: list index out of range
Traceback (most recent call last):
File "E:/微博数据/爬虫工具/weiboSpider-master/weiboSpider-master/weiboSpider.py", line 161, in extract_user_info
if selector.xpath("//div[@Class='tip'][2]/text()")[0] == u'学习经历':
IndexError: list index out of range
User nickname:
User id: user_id_list.txt
Weibo count: 9
Following: 244
Followers: 263
url: https://weibo.cn/user_id_list.txt


It does not crawl according to the contents of the txt file, and I don't know where the mistake is. The txt just contains the sample content, the three celebrities such as 胡歌 and 迪丽热巴.

How to fetch multiple user_ids

Hello, if I want to fetch the weibo information of several (non-specific) users at once, is there somewhere I can set something like random?

[Feature proposal] Output, avoiding rate limits, downloads, and resuming from breakpoints

I tried this repository for some personal crawling needs; below are a few suggestions based on my own experience.

On output

If you actually need to read the content that is output, the current output format is not very friendly; the standard output is much more readable and free of URL noise.
Not many changes are required:

  • Consider redirecting the standard output to a file: it is easy to read and would no longer drown out the progress bars, which the current output easily does.
  • Consider additionally outputting the original weibo's address and the links inside it. Very old weibos are hard to locate manually, so attaching the address helps; posts occasionally contain image or video links that are not shown in the HTML view and are stripped by the xpath. This is also simple: //a/@href can fetch them, though it does introduce some duplicated information.
  • For readability, line breaks should be preserved. This is simple as well: just replacing <br /> with \n outputs the line breaks without disturbing the other formatting (see the sketch right after this list).
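A tiny illustration of that last point, assuming the raw HTML text of a weibo is in text_html (the variable name is illustrative):

from lxml import etree

def text_with_linebreaks(text_html):
    """Strip the HTML tags but keep line breaks by turning <br /> into newlines first."""
    text_html = text_html.replace('<br />', '\n').replace('<br/>', '\n')
    selector = etree.HTML(text_html)
    return selector.xpath('string(.)') if selector is not None else text_html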

On avoiding rate limits

  • For other users' reference: in my experience, sleeping once every 1-5 pages is slightly too seldom. Being cautious, I sleep after every page; under normal circumstances moderately increasing the sleep probability should also help. On top of that I sleep again every 20 pages. The run then takes rather long, though, and the cookies need to be refreshed frequently.
  • Being blocked is hard to notice; sometimes everything looks normal except that long weibos can no longer be fetched.
  • To help monitor the crawler, I think the number of weibos fetched so far should also be printed periodically, together with the page number. At present a page should in theory hold 10 weibos (not counting the page with the pinned post), so pages with a different count could be printed as a hint. Of course many pages legitimately hold fewer than 10, so this output should only be taken as a reference.
  • Image downloads are rarely throttled, but video downloads are throttled easily and only recover after roughly half an hour (?); see the next section about videos.

On downloads

  • Image downloads are essentially never throttled.
  • Video downloads are throttled far too easily, probably because of the burst of video requests, and there is no error: the files simply become 1 KB. It may therefore be worth tracking the number of videos, the elapsed time and the downloaded volume, and sleeping periodically. Merely sleeping once every 3 videos already improved things for me, but it is still easy to be cut off, which makes a resume-from-breakpoint feature very desirable; see the next section.

On resuming from breakpoints

Once throttled, fetching weibo pages (get_one_page), long weibos and videos can all become difficult, so a resume feature is quite important. Some personal thoughts for reference:

  • A possibly larger change: dump the results to file every once in a while, not only the weibo content but also the downloads, since downloading actually takes up a lot of the time; print summary information at the same time for reference. This matters when the number of weibos is large and helps with debugging (?); resuming can then be built on top of it by simply re-running from a chosen page instead of starting over just for the downloads; in addition, the requests sent to Weibo would look a bit more human, which I guess may reduce the chance of being throttled.
  • I suggest adding a toggle to start from a given page, or an option for an end date. Because the program collects pages in order, since_date is not very useful when the first crawl fails; after a crash, an end_date or a restart_page option (neither exists at present) would help more.

Finally, I should state that all of the above are suggestions based on personal experience. Originally I only intended to make small changes for my own use, but perhaps they are useful to others, so I wrote them down. Since I never planned to upload anything, my own code is currently a mess, so I will not post it... Most of the above is already implemented on my side, so I hope the owner can make more elegant changes... Maybe I will rework it myself when I have time... Either way, I would appreciate the owner's feedback on which of the above are worth adopting.

Fields are garbled when the CSV is opened?

4473417006580828,IuNUwrBYo,浠�澶╂����绗��娆′���瀹������涓�涓ü娆℃���ü�����寰�骞歌����板��瀹�锛�杩�涓����澶т��圭���ュ�ワ�璁╂��浠�浠�韬��瀛﹀�板�澶���村�颁�澶╋���杩��戒�浠��i���峰��介���璋㈣阿�ㄤ�涓ü绔���垢绂��板��骞�����灏�浼�浼翠滑锛���浣 浠��璧峰伐浣����跺��寰�蹇���璋㈣阿��娆㈠��瀹���浣 浠����浼�甯︾�ü��瀹����d唤���锛�缁х画��琛��灏�骞哥�灏辫��扮�浜�锛�绁��夸�浠��涓ü绔��芥��垢绂�锛甯���涓�娆★�杩��戒�浣 �搁��#涓�涓ü绔���垢绂�澶х�灞ü#,https://wx2.sinaimg.cn/large/006aZ9kDly1gc0tr5z1tbj30s50gldj5.jpg,,,2020-02-18,,468563,52426,74071,涓�涓ü绔���垢绂�澶х�灞ü,
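Mojibake like this usually means a UTF-8 CSV was opened directly in Excel, which falls back to the system's legacy code page (GBK on Chinese Windows) unless the file starts with a BOM. A hedged workaround is to re-save the file as UTF-8 with BOM (utf-8-sig), or to import the CSV with an explicitly chosen encoding; the file names below are illustrative:

# Re-encode an existing CSV so that Excel detects UTF-8.
with open('weibo.csv', encoding='utf-8') as src:
    data = src.read()
with open('weibo_excel.csv', 'w', encoding='utf-8-sig', newline='') as dst:
    dst.write(data)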

Hello, how can this error be resolved?

Error: list index out of range
Traceback (most recent call last):
File "E:\weiboSpider.py", line 161, in extract_user_info
if selector.xpath("//div[@Class='tip'][2]/text()")[0] == u'学习经历':
IndexError: list index out of range
Error: invalid literal for int() with base 10: ''
Traceback (most recent call last):
File "E:\weiboSpider.py", line 254, in get_user
weibo_num = int(user_info[0][3:-1])
ValueError: invalid literal for int() with base 10: ''

Hello, the user id list cannot be read correctly

Hello, I tried putting the ids into a newly created txt file, but ran into two different errors:

  1. Following the readme format of one id per line, with a space and an optional comment after the id, it reports that the file cannot be decoded (Error: 'utf-8' codec can't decode byte 0xc0 in position 13: invalid start byte).
  2. After removing the comments and spaces so that each line holds only an id, only the first id is crawled, after which it shows Error: 'nickname'.
    Could you help explain what the cause might be?
    Many thanks for this project and for your answers!
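The 0xc0 byte suggests the txt file was saved in a legacy encoding (for example GBK/ANSI, which Windows Notepad often uses once Chinese comments are added) rather than UTF-8, so re-saving the file as UTF-8 usually fixes the first error. Alternatively, a reader like the sketch below (illustrative, not the project's own code) tolerates both encodings and the "id + space + comment" format:

def read_user_ids(path):
    """Read one id per line; anything after the first space is treated as a comment."""
    try:
        with open(path, encoding='utf-8') as f:
            lines = f.readlines()
    except UnicodeDecodeError:
        with open(path, encoding='gbk') as f:   # fall back to the Windows code page
            lines = f.readlines()
    return [line.split()[0] for line in lines if line.strip()]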

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) — how should this be handled?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Progress: 0%| | 0/156 [00:00<?, ?it/s] page 1
Progress: 0%| | 0/156 [00:00<?, ?it/s]
Weibo crawling finished; 0 weibos were crawled in total
Information scraping complete


Error: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 854, in start
self.get_pages()
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 803, in get_pages
self.get_user_info()
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 174, in get_user_info
js = self.get_json(params)
File "/Users/cc/Downloads/weibo-crawler-master/weibo.py", line 113, in get_json
return r.json()
File "/Users/cc/anaconda3/lib/python3.7/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/Users/cc/anaconda3/lib/python3.7/json/init.py", line 348, in loads
return _default_decoder.decode(s)
File "/Users/cc/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/cc/anaconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
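"Expecting value: line 1 column 1 (char 0)" means the m.weibo.cn API returned something that is not JSON at all, typically an empty body or an HTML block page when requests are throttled or the cookie is invalid. A defensive check before calling r.json() makes the failure easier to diagnose; this is a sketch with illustrative names, not the project's actual get_json:

import time

import requests

def get_json_safely(url, params, headers=None):
    """Fetch a URL and parse JSON, retrying once if the response is not JSON."""
    r = requests.get(url, params=params, headers=headers)
    if r.status_code != 200 or not r.text.lstrip().startswith('{'):
        # Probably rate-limited or the cookie has expired: log, wait, retry once.
        print('non-JSON response (%d): %r' % (r.status_code, r.text[:200]))
        time.sleep(60)
        r = requests.get(url, params=params, headers=headers)
    return r.json()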

no module named requests

File "weibo.py", line 16, in <module> import requests ImportError: No module named requests
I uninstalled and reinstalled requests, but I don't understand what is going on above.

Error when test-writing to a MySQL database with id 1669879400

Hello, the error is: pymysql.err.DataError: (1406, "Data too long for column 'at_users' at row 1")

I'm a database novice... a web search suggests it might be a database encoding problem. I checked the MySQL encodings as shown below (the server encoding is also utf8mb4), but I'm not sure which value should be adjusted. Could you take a look when you have a moment? Thank you!
[screenshot: MySQL character set variables]
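"Data too long" is a column-size problem rather than an encoding one: the at_users value simply exceeds the declared width of that column. A hedged fix is to widen the column; the database name, table name and new length below are assumptions, so adjust them to whatever weibo-crawler created on your side:

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='weibo', charset='utf8mb4')
with conn.cursor() as cursor:
    # Widen the column that is too small for long @-mention lists.
    cursor.execute("ALTER TABLE weibo MODIFY at_users varchar(1000)")
conn.commit()
conn.close()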

('Error: ', KeyError('status',))

I was crawling the blogger 1711243680 with Python 2.7 on Ubuntu, and the following error keeps appearing:

('Error: ', KeyError('status',))
Traceback (most recent call last):
  File "weibo.py", line 302, in get_one_page
    wb = self.get_one_weibo(w)
  File "weibo.py", line 280, in get_one_weibo
    retweet = self.get_long_weibo(retweet_id)
  File "weibo.py", line 70, in get_long_weibo
    weibo_info = js['status']
KeyError: 'status'

It does keep crawling, but some posts are missed. Another blogger gives the same behaviour, and it happens with both filter = 0 and filter = 1. I don't know why; could it be related to weibos that contain video?
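The traceback shows get_long_weibo assuming that the JSON it fetches always contains a 'status' key; when the request is throttled or the retweeted weibo is no longer accessible, the key is missing, the KeyError is raised and the post is dropped. A tolerant variant, sketched against the structure visible in the traceback rather than the project's exact code, guards the key and falls back to the truncated text:

# Inside get_long_weibo, after fetching the JSON for the long weibo:
weibo_info = js.get('status') if js else None
if weibo_info is None:
    # Throttled, deleted or private: return None so the caller keeps the short text.
    return None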

How to crawl only user data, and add crawling of age data

Hello,
(1) How can I configure the crawler to fetch only the user data and not the weibos?
(2) Can the user's location (e.g. 北京 海淀) and age be crawled as well? How would I modify the code?
(3) The program reports an error; the cause may be one of the following:
1. the user_id is incorrect;
2. this user's weibo may require a cookie before it can be crawled.
Solution:
please refer to
https://github.com/dataabc/weibo-crawler#如何获取user_id
to obtain a correct user_id;
or refer to the "set cookie" part of
https://github.com/dataabc/weibo-crawler#3程序设置
to set the cookie information.
When this error appears, how can I skip reading this ID and keep running?
Many thanks!
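For skipping a failing id, one option (a sketch, not existing project behaviour; it relies on the Weibo class and config.json seen elsewhere in these issues) is to loop over the ids yourself and catch exceptions per user:

import copy
import json

from weibo import Weibo   # assumes this script sits next to weibo.py

with open('config.json', encoding='utf-8') as f:
    config = json.load(f)

user_ids = ['1669879400', '1005055911162580']    # illustrative ids
failed = []
for user_id in user_ids:
    cfg = copy.deepcopy(config)
    cfg['user_id_list'] = [user_id]
    try:
        Weibo(cfg).start()
    except Exception as e:                       # keep going when one id fails
        print('skipping %s: %s' % (user_id, e))
        failed.append(user_id)
print('failed ids:', failed)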

Calling weibo.py from an external program raises: TypeError: encode() argument 1 must be string, not None

I call weibo.py from a Node program; the code is as follows:

//index.js
const spawn = require("child_process").spawn; 
const process = spawn('python',["./weibo.py"])
process.stdout.on('data', (data) => {
    console.log(`stdout: ${data}`);
});
process.stderr.on('data', (data) => {
    console.error(`stderr: ${data}`);
});

index.js and weibo.py are in the same directory.
Run the command

$ node index.js

The error output is as follows:

stderr: Traceback (most recent call last):

stderr:   File "./weibo.py", line 947, in start
    self.get_pages()
  File "./weibo.py", line 888, in get_pages
    self.get_user_info()
  File "./weibo.py", line 200, in get_user_info
    user = self.standardize_info(user_info)
  File "./weibo.py", line 424, in standardize_info
    sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None

stdout: ('Error: ', TypeError('encode() argument 1 must be string, not None',))

The error output shows that line 424 of weibo.py fails:

weibo[k] = v.replace(u"\u200b", "").encode(
                    sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)

The value of sys.stdout.encoding is None, which triggers the error. Can sys.stdout.encoding be replaced with a fixed value?
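When Python's stdout is a pipe, as it is under child_process.spawn, sys.stdout.encoding can indeed be None. A hedged fix is to fall back to UTF-8, either in the code or via the PYTHONIOENCODING environment variable when spawning the process:

import sys

encoding = sys.stdout.encoding or 'utf-8'   # stdout is a pipe -> encoding may be None
weibo[k] = v.replace(u"\u200b", "").encode(encoding, "ignore").decode(encoding)

# Alternatively, spawn the process with the environment variable
# PYTHONIOENCODING=utf-8 set, so sys.stdout.encoding is defined without editing weibo.py.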

How to resolve KeyError: 'pic_download'?

Error: 'pic_download'
Traceback (most recent call last):
File "G:/PyCharm/spider/1.py", line 1073, in main
wb = Weibo(config)
File "G:/PyCharm/spider/1.py", line 26, in init
self.validate_config(config)
File "G:/PyCharm/spider/1.py", line 68, in validate_config
if config[argument] != 0 and config[argument] != 1:
KeyError: 'pic_download'
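validate_config iterates over a set of required switches, and the KeyError means the config being passed in has no 'pic_download' entry that this version of the script expects (for example when the config.json was written for a different version). The straightforward fix is to add the key to config.json; a defensive alternative, shown only as an illustrative sketch, is to default the missing switch before constructing Weibo:

# Make sure the switch exists before Weibo(config) validates it (0 = off, 1 = on).
config.setdefault('pic_download', 0)
wb = Weibo(config)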
