nghuyong / weibospider
A continuously maintained Sina Weibo crawler 🚀🚀🚀
License: MIT License
2019-03-05 18:28:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://weibo.cn/6132874125/profile?page=1> (referer: https://weibo.cn/u/6132874125)
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/root/WeiboSpider/WeiboSpider/sina/spiders/weibo_spider.py", line 115, in parse_tweet
tweet_nodes = tree_node.xpath('//div[@class="c" and @id]')
AttributeError: 'NoneType' object has no attribute 'xpath'
Looking at the code: when logging in an account, if no captcha appears, the program just returns "未出现验证码" (no captcha encountered) and never goes on to fetch the cookie. Also, is pkill a Linux-only system command? Could you advise what the corresponding change would be on Windows?
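For the Windows question, here is a minimal cross-platform sketch, assuming the only purpose of the pkill call is to clean up stray phantomjs processes; psutil is a third-party package (pip install psutil) and is not part of this repo.

import psutil

def kill_process_by_name(name):
    """Terminate every running process whose name contains `name`."""
    for proc in psutil.process_iter(['name']):
        try:
            if proc.info['name'] and name in proc.info['name'].lower():
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process already exited, or we lack permission to kill it

kill_process_by_name('phantomjs')  # works on both Linux and Windows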
I bought 300-odd fresh accounts online. They worked for the first two months, but since last week there have been problems: I keep getting twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out,
or HTTP 418. It ends with the account pool being empty, and they cannot be used even after re-login, although when I test the accounts in the browser myself, they all work.
This address only extracted the following:
木兔这句话真的 我已经哭到窒息了……
None of the other repost information was kept. Is that the expected behavior?
I set up the account pool in MongoDB on a Linux server. How does the code connect to that account pool?
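A minimal sketch of how spider code could read accounts out of that pool with pymongo; the host, the database name ('weibo'), the collection name ('account'), and the 'status'/'cookie' fields are all assumptions to be matched against whatever your pool-building script actually wrote.

import pymongo

client = pymongo.MongoClient('mongodb://127.0.0.1:27017')  # your Linux server's address
account_collection = client['weibo']['account']            # assumed db/collection names

# Pull the stored cookie of every account currently marked usable.
for account in account_collection.find({'status': 'success'}):
    print(account['_id'], account.get('cookie'))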
I looked into it: isn't this missing the grid-pattern captcha (宫格验证) step? Going straight in probably won't work; logging in to the .cn site now always requires verification.
WeiboSpider/sina/spiders/weibo_spider.py
Line 88 in a9bc097
What does dont_filter=True in the Request do?
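For context, a hedged illustration of Scrapy's documented duplicate filter (not code from this repo): the scheduler normally drops any request whose fingerprint it has already scheduled, and dont_filter=True exempts one request from that filter.

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://weibo.cn/u/2803301701']

    def parse(self, response):
        # By default the scheduler drops a request whose fingerprint it has
        # already seen, so this second request to the same URL is ignored.
        yield scrapy.Request(response.url, callback=self.parse)

        # dont_filter=True bypasses the duplicate filter, so the same URL is
        # scheduled again, e.g. to re-crawl a page that changes over time.
        # (This toy spider would loop forever; it only illustrates the flag.)
        yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)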
Hello, I'm a beginner and would like to ask: I tried your example, but why am I not getting any weibo data?
Many thanks!
2018-10-03 02:05:33 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Oct 6 2017, 22:29:07) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-10-03 02:05:33 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sina.spiders', 'SPIDER_MODULES': ['sina.spiders'], 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'sina'}
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled item pipelines:
['sina.pipelines.MongoDBPipeline']
2018-10-03 02:05:33 [scrapy.core.engine] INFO: Spider opened
2018-10-03 02:05:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-03 02:05:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-03 02:05:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/2803301701/info> (referer: None)
2018-10-03 02:05:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/1699432410/info> (referer: None)
2018-10-03 02:05:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/u/2803301701> (referer: https://weibo.cn/2803301701/info)
2018-10-03 02:05:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weibo.cn/u/2803301701>
{'_id': '2803301701',
'authentication': u'\u300a\u4eba\u6c11\u65e5\u62a5\u300b\u6cd5\u4eba\u5fae\u535a',
'birthday': u'1948-06-15',
'crawl_time': 1538546734,
'fans_num': 72033515,
'follows_num': 3033,
'gender': u'\u7537',
'nick_name': u'\u4eba\u6c11\u65e5\u62a5',
'province': u'\u5317\u4eac',
'tweets_num': 91312,
'vip_level': u'6\u7ea7'}
2018-10-03 02:05:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/u/1699432410> (referer: https://weibo.cn/1699432410/info)
2018-10-03 02:05:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weibo.cn/u/1699432410>
{'_id': '1699432410',
'authentication': u'\u65b0\u534e\u793e\u6cd5\u4eba\u5fae\u535a',
'birthday': u'1931-11-07',
'crawl_time': 1538546737,
'fans_num': 42741520,
'follows_num': 4242,
'gender': u'\u7537',
'nick_name': u'\u65b0\u534e\u89c6\u70b9',
'province': u'\u5317\u4eac',
'tweets_num': 100178,
'vip_level': u'5\u7ea7'}
2018-10-03 02:05:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/2803301701/profile?page=1> (referer: https://weibo.cn/u/2803301701)
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
(the same ERROR line repeated nine more times)
Is that how it should look? One more question: when a weibo is a repost, it seems only the reposted content is crawled, and the text the reposter added themselves is not captured? Is that the class="cmt" part or the "ctt" part?
Looking at the code, information_item['_id'] keeps growing, so it should eventually traverse all users, meaning it crawls more than just 新华社 (Xinhua) and 人民日报 (People's Daily). Is that right?
Hello!
I want to crawl the information of users @-mentioned in a weibo. For example,
for @叶婉婷cici, I want that user's basic profile; inspecting the element in Chrome shows <a href="/n/%E5%8F%B6%E5%A9%89%E5%A9%B7cici">@叶婉婷cici</a>.
On top of your search-branch weibo_spider.py I added yield Request(url=self.base_url+href, callback=self.parse_atwho),
but when running the spider it keeps reporting:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://weibo.cn/n/%E5%90%8D%E4%BA%BA%E5%9D%8A%E9%97%B4%E5%85%AB%E5%8D%A6> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
How can this be solved?
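For what it's worth, a hedged sketch of issuing that @-mention request with an explicit timeout and an error callback, so one unreachable URL does not stall the crawl; handle_failure is a hypothetical helper, and parse_atwho is the asker's own callback:

from scrapy import Request

# Inside the search-branch spider class:
def parse_tweet(self, response):
    # Grab an @-mention link, e.g. /n/%E5%8F%B6%E5%A9%89%E5%A9%B7cici
    href = response.xpath('//a[starts-with(@href, "/n/")]/@href').extract_first()
    if href:
        yield Request(
            url=self.base_url + href,
            callback=self.parse_atwho,      # the asker's own handler
            errback=self.handle_failure,    # hypothetical: log the failure and move on
            meta={'download_timeout': 15},  # fail fast instead of hanging on dead connections
        )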
The verification has now changed to a click captcha, with a fallback to a slider if the click fails. How can this one be bypassed?
I want to get the geolocation information of weibos. Do you know a way?
A few days ago I crawled 8 million records in a single day; since yesterday my IP has suddenly started getting blocked. I don't know whether switching to another IP would help. Does the author plan to make a version that works around the IP ban? Also, can we be sure 418 is the IP-ban status code? The so-called ban seems to lift after a short while, so could the cause be something other than an IP block?
WeiboSpider/sina/spiders/weibo_spider.py
Line 172 in b5acb4f
Should this be all_content[0:]?
In a distributed setup where several hosts all write to one MongoDB instance, can the write speed keep up with tens of millions of records per day? Will writes block? And MongoDB's performance presumably degrades once the data volume grows. I don't have such an environment myself, so I can only ask here.
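If write throughput does become the bottleneck, one common mitigation is batching: buffer documents and flush them with an unordered insert_many, which amortizes network round trips. A minimal sketch with illustrative names (not the repo's actual pipeline code):

import pymongo

client = pymongo.MongoClient('mongodb://127.0.0.1:27017')
collection = client['weibo']['tweets']   # illustrative db/collection names

buffer = []
BATCH_SIZE = 1000

def save(item):
    """Queue an item; flush to MongoDB in bulk once the buffer fills up."""
    buffer.append(item)
    if len(buffer) >= BATCH_SIZE:
        # ordered=False makes the server attempt every document even if some
        # fail (e.g. duplicate keys); wrap in try/except to catch BulkWriteError.
        collection.insert_many(buffer, ordered=False)
        buffer.clear()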
Hello. I've found that for several days only data from after 5 p.m. was crawled; manual searching shows those days also have data from before 5 p.m., but it just cannot be crawled. Why is that, and how can it be fixed? Many thanks!
As per the title...
After running for a few minutes, it reports "ip 被封了!!!请更换ip,或者停止程序..." (IP banned, change IP or stop the program). How can this be solved? Thanks for any guidance.
That is, given some user IDs as seeds, keep expanding outward, crawling all users' weibos indiscriminately.
Also, running login.py on its own reports:
'pkill' is not recognized as an internal or external command,
operable program or batch file.
Message: 'phantomjs' executable needs to be in PATH.
Captcha position: 326 486 438 598
'NoneType' object is not subscriptable
Has the captcha cracking stopped working??
For a single weibo you only take the text inside tweet_node.xpath('.//span[@class="ctt"]')[0]. That doesn't capture the full content, does it? Some of the text sits outside that span tag. How did you solve this?
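A hedged sketch of one way to capture the full text, including tail text that falls outside the span, using XPath's string() serialization instead of only the node's own text:

from lxml import etree

html = '<div class="c"><span class="ctt">main text</span> tail outside the span</div>'
tweet_node = etree.HTML(html)

# string(.) concatenates ALL descendant text of the node, so the tail text
# after the closing </span> is not lost.
full_text = tweet_node.xpath('string(.)').strip()
print(full_text)  # -> 'main text tail outside the span'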
2018-07-23 22:16:08 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: sina)
2018-07-23 22:16:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) - [GCC 7.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Linux-3.10.0-862.9.1.el7.x86_64-x86_64-with-centos-7.5.1804-Core
2018-07-23 22:16:08 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'sina', 'CONCURRENT_REQUESTS': 32, 'DOWNLOAD_DELAY': 0.5, 'NEWSPIDER_MODULE': 'sina.spiders', 'SPIDER_MODULES': ['sina.spiders']}
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'sina.middlewares.UserAgentMiddleware',
'sina.middlewares.CookiesMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled item pipelines:
['sina.pipelines.MongoDBPipeline']
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Spider opened
2018-07-23 22:16:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-23 22:16:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-23 22:16:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> from <GET https://weibo.cn/5303798085/info>
2018-07-23 22:16:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> (referer: None)
2018-07-23 22:16:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> (referer: None)
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/root/WeiboSpider/sina/spiders/weibo_spider.py", line 29, in parse_information
ID = re.findall('(\d+)/info', response.url)[0]
IndexError: list index out of range
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-23 22:16:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 825,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2765,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 23, 14, 16, 9, 992802),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 54239232,
'memusage/startup': 54239232,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2018, 7, 23, 14, 16, 9, 126466)}
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Spider closed (finished)
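The IndexError above occurs because the request was 302-redirected to passport.weibo.cn, so response.url no longer matches the (\d+)/info pattern; that normally means the request carried no valid cookie. A hedged defensive check for parse_information (a sketch, not the repo's fix):

import re

# Inside the spider class:
def parse_information(self, response):
    ids = re.findall(r'(\d+)/info', response.url)
    if not ids:
        # We were redirected to the login page: the cookie is missing or expired.
        self.logger.error('redirected to login, check the cookie pool: %s', response.url)
        return
    user_id = ids[0]
    # ...continue parsing the profile page as before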
How can I make it crawl continuously? I had only been crawling for a short while when it said the account pool was empty.
Thank you very much!
Where should the IP proxy be added?
As per the title: this spider is fantastic! But after using it these past few days, within minutes it reports "[weibo_spider] ERROR: ip 被封了!!!请更换ip,或者停止程序..." (IP banned, change IP or stop the program). I would like to ask how to add an IP pool to the source code.
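A hedged sketch of the standard Scrapy hook for a proxy pool: a downloader middleware that sets request.meta['proxy'] on every request (Scrapy's built-in HttpProxyMiddleware honors that key). The PROXY_POOL contents and the class name are illustrative, and the middleware still has to be registered under DOWNLOADER_MIDDLEWARES in settings.py:

import random

# Illustrative addresses; in practice, feed this list from a real proxy provider.
PROXY_POOL = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:8080',
]

class RandomProxyMiddleware(object):
    """Assign a random proxy from the pool to each outgoing request."""

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_POOL)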
When crawling all of one user's followees, I found that weibo.com only shows 5 pages while weibo.cn shows 20, but neither shows the complete following list. Have you run into this?
Hundreds of thousands of links have been generated in redis. How can I find out how many URLs have not been crawled yet?
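A hedged way to check, assuming scrapy-redis defaults: pending requests sit under the key '<spider name>:requests' (a sorted set for the default priority queue, a plain list for FIFO/LIFO queues), so its cardinality is the number of URLs still waiting:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
key = 'weibo_spider:requests'  # scrapy-redis default pattern: '<spider>:requests'

# Sorted set under the default priority queue; list under FIFO/LIFO queues.
pending = r.zcard(key) if r.type(key) == b'zset' else r.llen(key)
print('URLs still queued:', pending)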
Hello. On the search branch, with 2 accounts, how do I solve the "account pool is empty" message that appears after crawling for a while? Thanks.
Please advise!
How can the following be achieved:
01 crawl all weibo content for a given keyword
02 starting today, crawl every day or every hour, and keep doing so for a period of time (see the sketch below)
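For requirement 02, a minimal scheduling sketch that simply relaunches the crawl every hour through the scrapy CLI; 'weibo_spider' is the spider name visible in the logs above, while the -a keyword argument is an assumption about how the search-branch spider might accept its query:

import subprocess
import time

while True:
    # Launch one full crawl, then wait an hour before the next run.
    subprocess.run(['scrapy', 'crawl', 'weibo_spider', '-a', 'keyword=你的关键词'])
    time.sleep(3600)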
Does the MongoDB database need its collections created by hand,
or is running the Python code directly enough?
It seems the cheap accounts for sale now can no longer receive the captcha, and there is also the IP-ban problem.
👏👏👏 It doesn't start crawling.
Running weibo_spider.py raises ModuleNotFoundError: No module named 'sina.items'.
Is that because some third-party package is missing?
A question that may just be about MongoDB operations, thanks:
Besides accessing the data in the mongo shell, can it be viewed as a table, like in data_structure.md?
And can the crawled data be exported to other, easier-to-process formats?
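A hedged sketch of exporting one collection to CSV with pymongo and the standard csv module; the db, collection, and field names are illustrative and should be taken from data_structure.md:

import csv
import pymongo

client = pymongo.MongoClient('mongodb://127.0.0.1:27017')
collection = client['weibo']['information']   # illustrative names

fields = ['_id', 'nick_name', 'fans_num', 'follows_num', 'tweets_num']

with open('information.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for doc in collection.find({}, {field: 1 for field in fields}):
        writer.writerow(doc)  # missing fields are written as empty cells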