
weibospider's Issues

Unable to log in to weibo.cn to obtain cookies

(screenshots attached)
As shown in the screenshots above: after I commented out the captcha-image verification code, the cookie I got back was badly malformed. Logging in manually, I found that one of the JS bundles now 404s, so even a human can't log in anymore, let alone a bot. The restrictions are really heavy.

AttributeError: 'NoneType' object has no attribute 'xpath'

2019-03-05 18:28:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://weibo.cn/6132874125/profile?page=1> (referer: https://weibo.cn/u/6132874125)
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/root/WeiboSpider/WeiboSpider/sina/spiders/weibo_spider.py", line 115, in parse_tweet
tweet_nodes = tree_node.xpath('//div[@Class="c" and @id]')
AttributeError: 'NoneType' object has no attribute 'xpath'
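
This AttributeError usually means the response was not a normal tweet page (for example a login redirect or a ban page), so the parsed tree came back as None. A minimal defensive sketch, assuming the tree is built with lxml as in the repo; the helper name is made up:

from lxml import etree

def extract_tweet_nodes(html_text):
    # Guard against responses that are not tweet HTML (login redirects, ban pages,
    # empty bodies): etree.HTML can return None or raise on such input.
    try:
        tree_node = etree.HTML(html_text)
    except (ValueError, etree.XMLSyntaxError):
        return []
    if tree_node is None:
        return []
    return tree_node.xpath('//div[@class="c" and @id]')

Skipping (or re-queuing) such responses keeps the spider running instead of crashing the callback.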

Is the account-pool + distributed branch of the spider no longer usable?

I bought 300+ fresh accounts online. They worked fine for the past two months, but since last week there have been problems: it keeps reporting twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. or returning 418, and in the end the account pool goes empty. Logging in again doesn't help either, even though the accounts all work when I test them in the browser myself.
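
For what it's worth, one way to react to 418 responses is a downloader middleware that retries the request with another account's cookie. This is only a sketch, not necessarily how the repo's account pool actually behaves, and the cookie_pool list is an assumed input:

import random

class RotateCookieOn418Middleware(object):
    # Hypothetical middleware: on HTTP 418, retry the same URL with a different
    # account's cookie instead of letting the request fail.
    def __init__(self, cookie_pool):
        self.cookie_pool = cookie_pool  # list of cookie dicts taken from the account pool

    def process_response(self, request, response, spider):
        if response.status == 418 and self.cookie_pool:
            new_cookie = random.choice(self.cookie_pool)
            return request.replace(cookies=new_cookie, dont_filter=True)
        return response

If the 418s keep appearing across every account, the limit is probably tied to the IP rather than to the accounts themselves.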

https://weibo.com/1631641650/H7cTe8otL

This URL only yielded:

木兔这句话真的 我已经哭到窒息了…… ​​​​

None of the other repost information was kept. Is that the expected behaviour?

Can't get any tweet content

Hi, I'm a beginner and would like to ask: I tried running your example, but why am I not getting any tweet data? The full log is below.
Many thanks!

2018-10-03 02:05:33 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Oct  6 2017, 22:29:07) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-10-03 02:05:33 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sina.spiders', 'SPIDER_MODULES': ['sina.spiders'], 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'sina'}
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled item pipelines:
['sina.pipelines.MongoDBPipeline']
2018-10-03 02:05:33 [scrapy.core.engine] INFO: Spider opened
2018-10-03 02:05:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-03 02:05:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-03 02:05:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/2803301701/info> (referer: None)
2018-10-03 02:05:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/1699432410/info> (referer: None)
2018-10-03 02:05:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/u/2803301701> (referer: https://weibo.cn/2803301701/info)
2018-10-03 02:05:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weibo.cn/u/2803301701>
{'_id': '2803301701',
 'authentication': u'\u300a\u4eba\u6c11\u65e5\u62a5\u300b\u6cd5\u4eba\u5fae\u535a',
 'birthday': u'1948-06-15',
 'crawl_time': 1538546734,
 'fans_num': 72033515,
 'follows_num': 3033,
 'gender': u'\u7537',
 'nick_name': u'\u4eba\u6c11\u65e5\u62a5',
 'province': u'\u5317\u4eac',
 'tweets_num': 91312,
 'vip_level': u'6\u7ea7'}
2018-10-03 02:05:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/u/1699432410> (referer: https://weibo.cn/1699432410/info)
2018-10-03 02:05:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weibo.cn/u/1699432410>
{'_id': '1699432410',
 'authentication': u'\u65b0\u534e\u793e\u6cd5\u4eba\u5fae\u535a',
 'birthday': u'1931-11-07',
 'crawl_time': 1538546737,
 'fans_num': 42741520,
 'follows_num': 4242,
 'gender': u'\u7537',
 'nick_name': u'\u65b0\u534e\u89c6\u70b9',
 'province': u'\u5317\u4eac',
 'tweets_num': 100178,
 'vip_level': u'5\u7ea7'}
2018-10-03 02:05:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/2803301701/profile?page=1> (referer: https://weibo.cn/u/2803301701)
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
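
A note on the "All strings must be XML compatible" errors above: lxml raises this when the page text contains NULL bytes or control characters, and it shows up more often under Python 2 (which this log is running, per the version line). A minimal sketch of stripping such characters before parsing; exactly where to apply it in the spider is an assumption:

import re
from lxml import etree

# Control characters lxml refuses: everything below 0x20 except tab, newline, CR.
_CONTROL_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def parse_clean(html_text):
    # Strip XML-incompatible characters, then parse as usual.
    return etree.HTML(_CONTROL_CHARS.sub(u'', html_text))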

How to get the profiles of users @-mentioned in a tweet

Hello!
I want to crawl the profiles of the users a tweet @-mentions. For example:
(screenshot of the tweet attached)
For the @叶婉婷cici in it, I want her basic profile. Inspecting the element in Chrome gives <a href="/n/%E5%8F%B6%E5%A9%89%E5%A9%B7cici">@叶婉婷cici</a>.
Building on weibo_spider.py in your search branch, I added yield Request(url=self.base_url+href, callback=self.parse_atwho), but when running the spider it keeps reporting:

[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://weibo.cn/n/%E5%90%8D%E4%BA%BA%E5%9D%8A%E9%97%B4%E5%85%AB%E5%8D%A6> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond.

How can this be fixed?
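
The 10060 error means the TCP connection to weibo.cn itself is timing out, so the parse_atwho callback is probably fine. A small sketch of two things worth trying on that request, a longer per-request timeout and a proxy, both set through request.meta; the proxy address is a placeholder:

from scrapy import Request

def request_at_mention(self, href):
    # Same request the issue describes, with an explicit timeout and an (assumed) proxy.
    return Request(
        url=self.base_url + href,
        callback=self.parse_atwho,
        meta={
            'download_timeout': 30,            # Scrapy's per-request timeout, in seconds
            'proxy': 'http://127.0.0.1:8888',  # placeholder; point at a proxy you control
        },
    )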

The account-pool approach has stopped working entirely. It was still fine a few days ago

A few days ago I crawled 8 million records in a single day; since yesterday my IP has suddenly started getting blocked. I don't know whether switching to a different IP would help. Does the author plan to make a version that works around the IP blocks? Also, can we be sure 418 is the code for an IP ban? I've noticed this so-called block seems to lift after a short while, so could it be caused by something other than an IP ban?

When crawling tweets that contain linked emoji images, content after the img tag is dropped

(screenshot attached)

For tweets with img emojis like this, the img is usually followed by the tag terminator, so the text you capture stops at the emoji and everything after it is dropped. How can this be solved? I also noticed there is a particular character at the end of each tweet; could that be used as the boundary check?
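
Taking only the first text node of the span stops at the first child element (the img), which is why everything after the emoji is lost. A minimal sketch of collecting all descendant text instead, using the same lxml nodes the spider already works with:

def full_tweet_text(tweet_node):
    # string(.) concatenates every text node under the span, including the text
    # that follows <img> emoji tags, so nothing after the emoji is dropped.
    spans = tweet_node.xpath('.//span[@class="ctt"]')
    if not spans:
        return ''
    return spans[0].xpath('string(.)').strip()

''.join(spans[0].itertext()) gives the same result.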

Can MongoDB insert performance keep up with tens of millions of records in the distributed setup?

With multiple hosts in the distributed setup all writing to one MongoDB, can the write speed keep up at roughly ten million records per day? Will writes block? I assume MongoDB performance degrades once the data volume gets large. I don't have such an environment to test in, so I'm asking here.
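
Ten million documents a day is on the order of a hundred-plus inserts per second, which a single mongod on decent disks normally handles, but single-document inserts from many spiders add round-trip overhead. A rough sketch of batching writes in the pipeline with an unordered insert_many; the database/collection names and the batch size are assumptions:

import pymongo
from pymongo.errors import BulkWriteError

class BulkMongoPipeline(object):
    # Hypothetical pipeline: buffer items and flush them in batches.
    def __init__(self, mongo_uri='mongodb://localhost:27017', batch_size=1000):
        self.client = pymongo.MongoClient(mongo_uri)
        self.collection = self.client['weibo']['tweets']  # assumed db / collection names
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            try:
                # ordered=False lets MongoDB keep going past duplicate-key errors
                self.collection.insert_many(self.buffer, ordered=False)
            except BulkWriteError:
                pass
            self.buffer = []

    def close_spider(self, spider):
        self.flush()
        self.client.close()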

Not all tweets are being captured

Hi, I made some changes on top of your code hoping to crawl in a loop: at the end of the method that scrapes the fan list, I also scrape each fan's information. The change is just two lines, shown in the red box in the screenshot below. The problem is that although the spider keeps running and keeps collecting data, the data is incomplete. For example, 人民日报 has about 90k tweets but I only got 250 before it stopped. I'm not sure whether I put the loop in the wrong place or whether the approach itself is wrong, so I'd like to ask where the mistake is. If the approach is wrong, please point me toward the correct way to crawl in a loop. Thanks a lot!
(screenshot of the added lines attached)
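
For reference, a rough sketch of the "circular crawl" idea described above: in the callback that parses a fan list, queue both the profile-info page and the tweet list of every fan found. The URL patterns and callback names here are assumptions, not the repo's exact code; also note that a 250-tweet cutoff may come from the site limiting how far back pages go, or from an expired cookie, rather than from where the loop is placed:

from scrapy import Request

def parse_fans(self, response):
    # Hypothetical: collect fan uids from profile links on the fan-list page,
    # then queue each fan's info page and tweet list.
    fan_uids = set(response.xpath('//a[contains(@href, "/u/")]/@href').re(r'/u/(\d+)'))
    for uid in fan_uids:
        yield Request('https://weibo.cn/%s/info' % uid, callback=self.parse_information)
        yield Request('https://weibo.cn/%s/profile?page=1' % uid, callback=self.parse_tweet)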

Crawled data is incomplete

Hello, I've found that for several days I only got data from after 5 p.m., even though a manual search shows there is also data from before 5 p.m. on those days; it just never gets crawled. Why is that, and how can it be fixed? Thanks!

Are you capturing the full content of a single tweet?

For a single tweet you only take the text inside tweet_node.xpath('.//span[@Class="ctt"]')[0]. That doesn't capture the whole tweet, does it? Some of the content sits outside that span tag. How did you deal with that?

ERROR: Spider error processing

2018-07-23 22:16:08 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: sina)
2018-07-23 22:16:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) - [GCC 7.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Linux-3.10.0-862.9.1.el7.x86_64-x86_64-with-centos-7.5.1804-Core
2018-07-23 22:16:08 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'sina', 'CONCURRENT_REQUESTS': 32, 'DOWNLOAD_DELAY': 0.5, 'NEWSPIDER_MODULE': 'sina.spiders', 'SPIDER_MODULES': ['sina.spiders']}
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'sina.middlewares.UserAgentMiddleware',
'sina.middlewares.CookiesMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled item pipelines:
['sina.pipelines.MongoDBPipeline']
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Spider opened
2018-07-23 22:16:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-23 22:16:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-23 22:16:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> from <GET https://weibo.cn/5303798085/info>
2018-07-23 22:16:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> (referer: None)
2018-07-23 22:16:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> (referer: None)
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/root/WeiboSpider/sina/spiders/weibo_spider.py", line 29, in parse_information
ID = re.findall('(\d+)/info', response.url)[0]
IndexError: list index out of range
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-23 22:16:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 825,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2765,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 23, 14, 16, 9, 992802),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 54239232,
'memusage/startup': 54239232,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2018, 7, 23, 14, 16, 9, 126466)}
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Spider closed (finished)
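
The IndexError here follows directly from the 302 a few lines earlier: the request for /5303798085/info was redirected to passport.weibo.cn/signin/login (no valid cookie), so the uid regex finds nothing in response.url. A defensive sketch for the same entry point; the logging choice is an assumption:

import re

def parse_information(self, response):
    # Skip responses that were redirected to the login page instead of a profile.
    match = re.search(r'(\d+)/info', response.url)
    if match is None:
        self.logger.warning('Redirected to login, cookie probably missing or expired: %s', response.url)
        return
    uid = match.group(1)
    # ... continue parsing the profile page for this uid ...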

How can an IP pool be added on top of this code?

As the title says. This crawler is great! But after a few days of use, within minutes I get "[weibo_spider] ERROR: ip 被封了!!!请更换ip,或者停止程序..." (the IP has been banned; switch IP or stop the program). I'd like to ask: how can an IP pool be added to the source code?
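
A rough sketch of the usual way to bolt on an IP pool: a downloader middleware that sets request.meta['proxy'] before each request. The proxy addresses below are placeholders, and how the pool is filled and refreshed (paid service, local list, Redis, ...) is up to you:

import random

class RandomProxyMiddleware(object):
    # Hypothetical middleware: pick a proxy per request from a static list.
    PROXIES = [
        'http://111.111.111.111:8080',  # placeholder addresses, replace with real proxies
        'http://222.222.222.222:8080',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)

It then has to be enabled under DOWNLOADER_MIDDLEWARES in settings.py, next to the existing UserAgent/Cookies middlewares.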

Weibo limits the visible follow list to 5 pages

While crawling all the accounts a user follows, I found that weibo.com only shows 5 pages and weibo.cn shows 20 pages, but neither shows the complete follow list. Have you run into this?

Exception: 当前账号池为空 (the account pool is currently empty)

Hi, on the search branch with 2 accounts, how can I avoid the "当前账号池为空" (account pool is empty) exception that appears after crawling for a while? Thanks.

How to crawl Weibo content in real time

Looking for advice!
How can the following be achieved:

01 Crawl all tweets for a given keyword
02 Starting from today, crawl every day or every hour, continuously for a period of time
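
One simple way to get the "every day / every hour" part is a wrapper script (or a cron job) that re-runs the search-branch spider on a schedule. The spider name and the -a keyword=... argument below are assumptions about how the search branch receives its query; adapt them to the actual spider:

import subprocess
import time

KEYWORD = u'your keyword here'  # the keyword to search for

while True:
    # Re-run the search spider once an hour; spider name and argument are assumed.
    subprocess.call(['scrapy', 'crawl', 'weibo_spider', '-a', 'keyword=%s' % KEYWORD])
    time.sleep(3600)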

Weibo_spider throws an error

Running weibo_spider.py raises ModuleNotFoundError: No module named 'sina.items'.
Is this because some third-party library is missing?
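
This is usually not a missing third-party library: running weibo_spider.py directly means Python never sees the 'sina' package. Running the spider through Scrapy from the project root (the directory containing scrapy.cfg), e.g. with "scrapy crawl weibo_spider", avoids it. A programmatic sketch of the same thing; the spider class name is an assumption:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from sina.spiders.weibo_spider import WeiboSpider  # assumed class name

process = CrawlerProcess(get_project_settings())
process.crawl(WeiboSpider)
process.start()

Run this from the project root so get_project_settings() can find scrapy.cfg.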

Can the crawled data be viewed in a table interface?

This probably comes down to MongoDB usage questions, thanks:

  1. Besides accessing the data in the mongo shell, can it be viewed as a table, like in data_structure.md?

  2. Can the crawled data be exported to other formats that are easier to work with?
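
On both points: a GUI client such as MongoDB Compass (or mongoexport on the command line) gives a table view and CSV/JSON export. A small pymongo sketch for exporting one collection to CSV is below; the database, collection, and field names are assumptions modeled on the fields shown in the logs above / data_structure.md:

import csv
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client['weibo']['information']  # assumed db / collection names

fields = ['_id', 'nick_name', 'gender', 'province', 'fans_num', 'follows_num', 'tweets_num']

with open('information.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for doc in collection.find():
        writer.writerow({k: doc.get(k, '') for k in fields})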
