qinxuye / cola
A high-level distributed crawling framework.
License: Other
After a crawl completes, cola keeps printing "no budget left to process" and never crawls newly added weibo content. Even after shutting cola down and running it again, the behaviour is the same.
develop branch, standalone mode, essentially all default settings.
There are 65 start uids.
$ python init.py
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
...
...
^CCatch interrupt signal, start to stop
Counters during running:
{'error_urls': 20,
'finishes': 65,
'pages': 7321,
'secs': 15064.990124702454}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 175.00 seconds for running
Under /tmp/user/1000/cola/worker/8ZcGfAqHmzc/mq/inc there is a file named 9223372036854775807, 4194304 bytes in size.
head 9223372036854775807
�^pccopy_reg
_reconstructor
p1
(cweibo.bundle
WeiboUserBundle
p2
c__builtin__
object
p3
NtRp4
tail 9223372036854775807
ag12
ag43
ag647
ag54
asbsbsg1253
g1255
sS'last_error_page_times'
p1259
I0
sb.
Thanks!
On Windows 7, running the distributed mode with coca reports the following error:
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
Traceback (most recent call last):
File "C:\Python27\Scripts\coca-script.py", line 8, in <module>
load_entry_point('cola==0.1.0beta', 'console_scripts', 'coca')()
File "C:\Python27\lib\site-packages\cola-0.1.0beta-py2.7.egg\cola\cmdline.py",
line 38, in execute
args.func(args)
File "C:\Python27\lib\site-packages\cola-0.1.0beta-py2.7.egg\cola\commands\mas
ter.py", line 49, in run
working_dir=args.working)
File "C:\Python27\lib\site-packages\cola-0.1.0beta-py2.7.egg\cola\context.py",
line 113, in init
self.manager.start(manager_init)
File "C:\Python27\lib\multiprocessing\managers.py", line 528, in start
self._address = reader.recv()
EOFError
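For the record, this RuntimeError is the standard Windows multiprocessing pitfall the message describes. A minimal sketch of the required idiom (the crawl function and its argument are placeholders, not cola code):

```python
# Windows has no fork(): multiprocessing re-imports the main module inside
# every child process, so anything that creates processes must be guarded by
# the __main__ check below, or each child recursively tries to bootstrap.
import multiprocessing

def crawl(uid):
    # stand-in for the real worker body
    return uid * 2

if __name__ == '__main__':
    multiprocessing.freeze_support()   # only matters for frozen .exe builds
    # in a real program, Process/Pool creation goes here, inside the guard;
    # the inline call keeps this sketch runnable everywhere
    print(crawl(42))
```

Since cola's entry points start processes at import time on the master/worker path, a fix would need the guard inside coca's own startup, not just in user scripts.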
Not sure whether this is my own mistake; I'm new to GitHub and don't know Linux or Python well. I installed the dependencies pyyaml, mechanize, python-dateutil, BeautifulSoup4, mongoengine, rsa, MongoDB and so on, and also added a .pth file under /usr/local/lib/pythonX.X/dist-packages. But running the standalone-mode command python init.py under /contrib/wiki shows this error:
Traceback (most recent call last):
File "init.py", line 28, in <module>
from cola.core.urls import UrlPatterns, Url
ImportError: No module named cola.core.urls
Could you point out which step went wrong? I'm still learning; my apologies if the question is too naive.
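A common cause of this ImportError is that the cola package is not on the interpreter's search path when the contrib script runs. A hedged workaround sketch (the COLA_HOME variable is an illustration, not something cola defines; adjust the path to your checkout):

```python
# Put the repository root (the directory that CONTAINS the `cola` package
# directory) on sys.path before importing anything from cola.
import os
import sys

repo_root = os.path.abspath(os.environ.get('COLA_HOME', os.getcwd()))
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

# after this, `from cola.core.urls import UrlPatterns, Url` should resolve,
# provided repo_root really contains cola/core/urls.py
```

Equivalently, a .pth file works only if it names the directory that contains the `cola` package itself, not the package directory.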
If a weibo post contains images, how can they be crawled?
The concrete situation: I interrupted a standalone job with Ctrl+C; when I rerun it, it hangs somewhere around FileBloomFilter and cannot start.
Then I noticed its unit test doesn't pass either.
I'm still looking into the details; it seems its parent class BloomFilter is never initialized.
I want to crawl at most MAX_WEIBO microblogs per user. My idea was to add the following to MicroBlogParser in contrib/weibo/parse.py:

    self.crawled_num_lock.acquire()
    self.bundle.current_crawled_num += inc_num
    self.crawled_num_lock.release()
    print 'Current crawled microblog num:', self.bundle.current_crawled_num
    if self.bundle.current_crawled_num > MAX_WEIBO:
        finished = True

Here inc_num is the number of microblogs newly fetched this round; I added a current_crawled_num attribute to the bundle and a Lock around the update, hoping that reading self.bundle.current_crawled_num would give the number of microblogs already crawled for a given user.
But this approach doesn't seem to work: cola keeps on crawling, and the print statement fires only once.
The problem is probably related to cola's internals. How should the code be changed to implement this requirement?
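For what it's worth, the counter itself can be made thread-safe along these lines. This is a standalone sketch that reuses the issue's names (Bundle, current_crawled_num, MAX_WEIBO); it is not cola's actual class, and it leaves aside the separate problem that cola may recreate parser state between runs:

```python
import threading

MAX_WEIBO = 100  # per-user cap, as in the issue

class Bundle(object):
    """Standalone stand-in for a per-user crawl bundle."""
    def __init__(self, uid):
        self.uid = uid
        self.current_crawled_num = 0
        self._lock = threading.Lock()

    def add_crawled(self, inc_num):
        # atomically bump the counter; return True once the cap is reached
        with self._lock:
            self.current_crawled_num += inc_num
            return self.current_crawled_num >= MAX_WEIBO
```

The key point is that every parse call must go through the same lock-guarded method, and `finished` must be derived from its return value on every call rather than being set once.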
I'll look into the specifics when I have time.
Traceback (most recent call last):
File "stop.py", line 29, in <module>
from cola.worker.recover import recover
ImportError: No module named worker.recover
start to get none
Does this mean the page wasn't found, or something else? Also, the .yaml in weibosearch doesn't configure keys, yet I don't see put_starts being called anywhere either. Please advise, thanks!
After the first run I exited with Ctrl+C. On the second run, python init.py just hangs here and never moves on:
/home/iot/cf/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/home/iot/cf/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
Part of the user-info module has stopped working.
On the develop branch, running:
python contrib/weibo/stop.py
reports the error:
Traceback (most recent call last):
File "contrib/weibo/stop.py", line 29, in <module>
from cola.worker.recover import recover
ImportError: No module named worker.recover
The wiki says:
Standalone mode is very simple; just run contrib/wiki/init.py.
cd /to/path/cola/contrib/weibo
python init.py
Doing this reports an error.
If I first start cola master and cola worker and then run standalone mode, it works normally.
Hi, I've read online that when crawling the desktop web pages directly, some data sits inside JS and is hard to parse.
Some people also say crawling the mobile site weibo.cn is easier.
My requirement is to crawl all followers and all followees of one user, then recursively do the same for each user found. Can cola do this?
I started mongod and configured wiki.yaml, both using localhost and 27017, but I still get socket.gaierror: [Errno -2] Name or service not known. What could be the cause?
The output is:
WARNING: QApplication was not created in the main() thread.
QObject::setParent: Cannot set parent, new parent is in a different thread
It won't run at all. Why can't the parent be set?
When the program does not exit via Ctrl+C (after a login failure, for example), the next start deadlocks. stop.py doesn't work; for now I unlock it by deleting the /tmp/cola directory.
If several login usernames and passwords are provided in the yaml config, how does cola use them, especially when one of the accounts gets blocked (e.g. it keeps erroring)?
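cola's actual rotation policy would need to be confirmed in its source; purely as a hedged illustration, one simple way to sideline an account that keeps failing could look like this (the class name and policy are invented for the example):

```python
class AccountPool(object):
    """Rotate through login accounts, skipping one after too many errors."""
    def __init__(self, accounts, max_errors=3):
        self.accounts = list(accounts)   # each: {'username': ..., 'password': ...}
        self.max_errors = max_errors
        self._errors = {}
        self._idx = 0

    @property
    def current(self):
        return self.accounts[self._idx]

    def report_error(self):
        # count a login failure; after max_errors, move on to the next account
        name = self.current['username']
        self._errors[name] = self._errors.get(name, 0) + 1
        if self._errors[name] >= self.max_errors:
            self._idx = (self._idx + 1) % len(self.accounts)
```

A real implementation would also need to persist the error counts and avoid cycling back onto a blocked account.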
Hi, thanks for providing this framework; it helps a lot.
I noticed you set the weibo crawler's instances to 2, and because of

    # cola/worker/loader.py
    if master is None:
        with StandaloneWorkerJobLoader(job, root, force=force) as job_loader:
            job_loader.run()

there are only 2 threads crawling weibo in total.
When I built a similar crawler I triggered Sina's anti-crawling mechanism (every login then required a captcha), presumably because I used too many concurrent fetch threads (16). So I'd like to ask how you arrived at this thread count.
The error output is:
/home/kqc/.local/share/Trash/files/cola/cola/core/opener.py:74: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
login fail, reason:
login fail, reason:
Finish visiting pages count: 0
Finish visiting pages count: 0
I have already filled in the username (in email form) and password in weibo.yaml, and my weibo account needs no captcha.
Also, how should the start section of weibo.yaml be filled in? Does it have to be my own uid?
Error message:
Error when handle bundle: 1644564144, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1644564144&end_id=3778591994786968&_t=0&_k=1416407703559230&__rnd=1416407854641&pagebar=1&max_id=3768747372034188&page=2
'NoneType' object has no attribute 'text'
Traceback (most recent call last):
File "/home/kqc/github/cola/cola/job/executor.py", line 504, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "/home/kqc/github/cola/cola/job/executor.py", line 430, in _parse
**options).parse()
File "/home/kqc/github/cola/contrib/kweibo/parsers.py", line 207, in parse
forwards = func_div.find('a', attrs={'action-type': action_type_re("forward")}).text
AttributeError: 'NoneType' object has no attribute 'text'
Also, @chineking, could you share how you inspect the data returned by http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1644564144&end_id=3778591994786968&_t=0&_k=1416407703559230&__rnd=1416407854641&pagebar=1&max_id=3768747372034188&page=2 (which tool you use), so I can debug such errors myself in the future? The returned data doesn't seem to display directly as HTML.
It seems that QApplication() must be created in the main thread; however, the default weibosearch config uses two instances, which causes one of the QApplication objects to be started in a non-main thread.
May I ask whether there is a fix for crawling being blocked by login failures?
I'm using standalone crawling. When I log in manually I get prompted for a captcha.
I tried twice in total. The first time everything worked.
The second time it only printed start to get None and produced nothing else.
I then tested with a different ID, with the same result.
Tested on Arch Linux with Python 2.7.
For example, on this page:
http://weibo.com/2955709171/info
the info page does not show the nickname at all.
The exact error is:
Traceback (most recent call last):
File "init.py", line 62, in <module>
from cola.worker.loader import load_job
ImportError: No module named worker.loader
I crawl Sina Weibo by simulating login with selenium to obtain cookies, which feels inelegant. What approach do you use? It would be even better if the code were commented.
Traceback (most recent call last):
File "__init__.py", line 56, in <module>
ctx = Context(local_mode=True)
File "/home/ddmbr/play/cola/cola/context.py", line 104, in __init__
self.addrs = [self.fix_addr(_ip) for _ip in self.ips]
File "/home/ddmbr/play/cola/cola/context.py", line 64, in <lambda>
fix_addr = lambda _, addr: addr if ':' in addr \
TypeError: argument of type 'NoneType' is not iterable
While crawling weibo, for some reason line 176 of parsers raises on
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
The error log is below.
D:\cola\contrib\weibo>init.py
D:\cola\cola\core\opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
process bundle from priority 0
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
D:\cola\cola\core\opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
process bundle from priority 0
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
Error when handle bundle: 1898353550, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=14182337179320
00&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=
1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418233844764&
pagebar=0&max_id=3778673397938545&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
Error when handle bundle: 1898353550, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=14182337179320
00&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234623885&pagebar=1&max_id=3734740953372321&page=1
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=50&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234624721&page=2
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234625316&pagebar=0&max_id=3656888933158899&page=2
Error when handle bundle: 1898353550, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1898353550&end_id=3786306393521083&_t=0&_k=14182337179320
00&__rnd=1418234625316&pagebar=0&max_id=3656888933158899&page=2
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&
pagebar=0&max_id=3751405376185938&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234625316&pagebar=0&max_id=3656888933158899&page=2
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
While crawling weibo data, the program errored halfway through.
Error log:
get 1400220917 url: http://weibo.com/1400220917/follow
Error when fetch url: http://weibo.com/1746173800/follow
Error when get bundle: 1746173800
'NoneType' object has no attribute 'find'
Traceback (most recent call last):
File "/home/iot/cf/Cola/cola/cola/worker/loader.py", line 229, in _execute_bundle
**options).parse()
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 554, in parse
return self._error(url, e)
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 91, in _error
raise e
AttributeError: 'NoneType' object has no attribute 'find'
Finish 1746173800
start to get 1804559491
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
Error when fetch url: http://weibo.com/1400220917/follow
Error when get bundle: 1400220917
'NoneType' object has no attribute 'find'
Traceback (most recent call last):
File "/home/iot/cf/Cola/cola/cola/worker/loader.py", line 229, in _execute_bundle
**options).parse()
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 554, in parse
return self._error(url, e)
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 91, in _error
raise e
AttributeError: 'NoneType' object has no attribute 'find'
Finish 1400220917
start to get 3283884397
Finish visiting pages count: 1878
Finish visiting pages count: 1878
In the weibo.yaml config file:

    speed:
      max: -1
      single: -1
      adaptive: no

What do the single and adaptive options mean?
If speed.max is set to 20 and the crawler runs in standalone mode, does the single setting still have any effect?
As the title says: will cola restart the program automatically?
I tried starting weibosearch on Ubuntu and on Windows, but it stops right after throwing some QPixmap warnings.
Hey!
This lib looks so cool, can't wait to test the weibo crawling 👍
The problem is: login fails
yop@ubuntu:~/cola$ python contrib/weibo/__init__.py
/home/clement/Dev/mitras/oldies/cola/cola/core/opener.py:74: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
login fail
login fail
I thought maybe it was because of the weibo.yaml file, since there seems to be a typo here:
https://github.com/chineking/cola/blob/master/contrib/weibo/weibo.yaml#L13
it is
login:
- username: # username
password: # password
Should it be:
login:
- username: # username
- password: # password
But then I got another error:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/clement/Dev/mitras/oldies/cola/cola/worker/loader.py", line 287, in _call
if not self._login(opener):
File "/home/clement/Dev/mitras/oldies/cola/cola/worker/loader.py", line 169, in _login
login_success = self.job.login_hook(opener, **kw)
File "/home/clement/Dev/mitras/oldies/cola/contrib/weibo/__init__.py", line 37, in login_hook
passwd = str(kw['password'])
KeyError: 'password'
Any idea?
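The first layout is actually the correct one for what login_hook expects. In YAML, a leading dash starts a new list item, so the two layouts parse to different structures; a small illustration showing the resulting Python values (the credentials are placeholders):

```python
# What `- username: ...` followed by an indented `password: ...` parses to:
# ONE list item that is a single mapping with both keys.
login_ok = [{'username': 'user@example.com', 'password': 'secret'}]

# What the dash-before-password variant parses to: TWO list items,
# each a mapping holding only one key.
login_bad = [{'username': 'user@example.com'}, {'password': 'secret'}]

kw = login_ok[0]
print('password' in kw)   # True: login_hook(opener, **kw) gets both keys

kw = login_bad[0]
print('password' in kw)   # False: str(kw['password']) raises KeyError
```

So the KeyError after the edit is expected; the original "login fail" more likely means the placeholder credentials were never filled in, or the login itself was rejected.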
Running \contrib\weibosearch\__init__.py always produces JobWorkerRunning: There has been a running job worker. So, naturally, I ran \contrib\weibosearch\stop.py; IDLE prompted Force to stop? (y or n), and after typing y I got TypeError: recover() takes exactly 1 argument (0 given). I have no idea what the cause is...
The run prints:
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
^CCatch interrupt signal, start to stop
Counters during running:
{'finishes': 1,
'pages': 800,
'secs': 527.077085018158}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 39842.86 seconds for running
What does this message mean? Does it mean there are no bundles left to process?
I keep crawling users' followees, so it shouldn't have run out of bundles.
I run 3 instances with 3 accounts.
While crawling profile information there is currently no way to tell whether an account is an enterprise account; for enterprise accounts the info comes back empty. Presumably an if check is missing somewhere.
Sometimes it even raises outright:
get 1908349515 url: http://weibo.com/1908349515/info
Error when handle bundle: 1908349515, url: http://weibo.com/1908349515/info
ValidationError (WeiboUser:54888bf6c95f801b60bce315) (site.Invalid URL: http://weibo.com/376765750
http://weibo.com/linuxde: ['info'])
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 605, in parse
weibo_user.save()
File "E:\Tools\Script\Python27\lib\site-packages\mongoengine\document.py", line 224, in save
self.validate(clean=clean)
File "E:\Tools\Script\Python27\lib\site-packages\mongoengine\base\document.py", line 323, in validate
raise ValidationError(message, errors=errors)
ValidationError: ValidationError (WeiboUser:54888bf6c95f801b60bce315) (site.Invalid URL: http://weibo.com/376765750
Execution environment: core=2, instances=5 (I have tried values greater than 2; all of them error). The error log:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/hgfs/crawl/cola-code/cola/job/task.py", line 235, in run
obj = self.executor.execute(self.running, is_inc=is_inc)
File "/mnt/hgfs/crawl/cola-code/cola/job/executor.py", line 416, in execute
self.mq.put(next_urls)
File "/mnt/hgfs/crawl/cola-code/cola/core/mq/__init__.py", line 103, in put
self.conn.recv()
IOError: bad message length
Finish https://www.tumblr.com/svc/indash_blog/posts?tumblelog_name_or_id=ominouslester&post_id=&limit=50&offset=0
Exception in thread Thread-5:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/hgfs/crawl/cola-code/cola/job/task.py", line 235, in run
obj = self.executor.execute(self.running, is_inc=is_inc)
File "/mnt/hgfs/crawl/cola-code/cola/job/executor.py", line 416, in execute
self.mq.put(next_urls)
File "/mnt/hgfs/crawl/cola-code/cola/core/mq/__init__.py", line 103, in put
self.conn.recv()
UnpicklingError: bad pickle data
get url: 1
Finish 1
Does the author plan to publish a release of cola soon? There isn't a single release yet...
I've been crawling weibo lately, but my accounts keep getting blocked, so as a next step I plan to add multi-account login support. Is this approach feasible?
If my bundle has only one URL and the parser returns next urls as [],
the terminal prints "start to get None" rather than stopping automatically, which is what I would expect it to do.
The output is as follows, and after this nothing gets crawled at all:
/home/dash/workspace/cola/cola/core/opener.py:74: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to get None
start to get None
If br.response().read() fails to read the data, this will raise an error. Shouldn't exception handling be added here?
The error output is:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/kqc/github/cola/cola/job/container.py", line 123, in run
self.init()
File "/home/kqc/github/cola/cola/job/container.py", line 88, in init
self.init_tasks()
File "/home/kqc/github/cola/cola/job/container.py", line 104, in init_tasks
is_local=self.is_local, job_name=self.job_name)
File "/home/kqc/github/cola/cola/job/task.py", line 81, in __init__
self.prepare()
File "/home/kqc/github/cola/cola/job/task.py", line 102, in prepare
self.executor.login()
File "/home/kqc/github/cola/cola/job/executor.py", line 151, in login
if not self._login(shuffle=random):
File "/home/kqc/github/cola/cola/job/executor.py", line 174, in _login
login_result = self.job_desc.login_hook(self.opener, **kw)
File "/home/kqc/github/cola/contrib/kweibo/__init__.py", line 40, in login_hook
return loginer.login()
File "/home/kqc/github/cola/contrib/kweibo/login.py", line 107, in login
json_data = json.loads(regex.search(text).group(1))
File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Counters during running:
{}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 0.34 seconds for running
I wonder whether I have been blocked outright...?
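On the exception-handling suggestion above: agreed that a guard is reasonable. A hedged sketch of what such handling could look like, with `fetch` standing in for the br.response().read() call (this is not cola's API, only the shape of a possible fix):

```python
import json
import time

def read_json_with_retry(fetch, retries=3, span=1.0):
    """Call fetch() -> raw text and parse it as JSON.

    Retries on read errors (IOError) and on bad payloads (json.loads raises
    ValueError); returns None when all tries fail, so the caller can treat it
    as a login failure instead of letting the exception kill the thread.
    """
    for attempt in range(retries):
        try:
            return json.loads(fetch())
        except (IOError, ValueError):
            if attempt == retries - 1:
                return None
            time.sleep(span)
```

A None here would cover the "No JSON object could be decoded" crash above, which happens when the login endpoint returns a captcha or error page instead of JSON.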
I now plan to crawl the comments of every microblog, which requires changing the existing code. One question: contrib/sina/__init__.py defines a set of URL patterns, for example the microblog pattern:
Url(r'http://weibo.com/aj/mblog/mbloglist.*', 'micro_blog', MicroBlogParser),
which visits http://weibo.com/aj/mblog/mbloglist.*
How did you discover this page pattern?
The way I'm familiar with is: visit some weibo page, such as http://weibo.com/p/1006061774908135/home?from=page_100606&mod=TAB#place, then look at the page structure and extract with bs4 or lxml.
Please advise!
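The pattern isn't anything special to cola: it's a plain regex matched against URLs, and the mbloglist URLs themselves are the XHR requests the weibo page makes, visible in the browser developer-tools network panel. A quick self-contained check that an ajax URL of the kind seen in the logs above would route to that pattern:

```python
import re

# pattern as registered in contrib/sina/__init__.py
MBLOGLIST = r'http://weibo.com/aj/mblog/mbloglist.*'

# an ajax URL of the kind seen in the error logs in this thread
url = ('http://weibo.com/aj/mblog/mbloglist'
       '?count=15&pre_page=2&uid=1644564144&page=2')

print(re.match(MBLOGLIST, url) is not None)   # True: MicroBlogParser handles it
print(re.match(MBLOGLIST, 'http://weibo.com/2955709171/info') is not None)  # False
```

So adding a comments crawler would mean finding the comments XHR endpoint in the network panel and registering a new Url(regex, name, Parser) entry for it.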
Crawling weibo data on the develop branch, I find it can no longer terminate on its own. The final (partial) output is:
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
^CCatch interrupt signal, start to stop
Counters during running:
{'finishes': 1,
'pages': 3,
'secs': 1.501857042312622}
Processing shutting down
Shutdown finished
My config is as follows:
job:
  db: kweibo
  mode: bundle # also can be url
  size: 50 # the destination (bundle or url) size
speed:
  max: 20 # for the whole cluster; -1 means no restriction, greater than 0 means webpages opened per minute
  single: -1 # max restriction on a single instance
  adaptive: no
instances: 1
priorities: 3 # priority queue count in mq
copies: 1 # redundant copies of objects in mq
inc: yes
shuffle: no # only works in bundle mode; the urls in a bundle are shuffled before fetching
clear: yes
error:
  network:
    retries: 0 # 0 means no retry, -1 means keep trying
    span: 20 # seconds to wait between retries
    ignore: yes # only works under bundle mode; if True, ignore this url and move to the next after several tries, otherwise move to the next bundle
  server: # e.g. a 404 or 500 error returned by the server
    retries: 5
    span: 10
    ignore: no
components:
  deduper:
    cls: cola.core.dedup.FileBloomFilterDeduper
(Below this there are some settings I added myself, which should be unrelated.)
With the default configuration, after crawling the weibo account with uid 3211200050 I checked the crawled records in the database against the actual weibo page and found the data is incomplete. For example:
the last record crawled is
"mid" : "3552487192125522",
"content" : "聊天室服务分析设计 - 轩脉刃 - 博客园 http://t.cn/zY8WDNu",
"created" : ISODate("2013-03-05T13:46:00Z"),
but the last entry on the actual weibo page is from March 4th, not this one. And this record appears twice in the database.
I haven't read the code closely; my analysis from the outside:
Comparing the urls used by the crawler and by the browser, I suspect the problem is that the ajax request parameters do not fully match what the browser sends on refresh. For example, when the browser refreshes via ajax, the url is:
http://weibo.com/aj/mblog/mbloglist?_wv=5&page=2&count=50&pre_page=1&end_id=3599178829236368&_k=137372704888226&_t=0&end_msign=-1&uid=3211200050&__rnd=1373727119799
while the page cola fetches is:
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3599178829236368&_t=0&_k=1373769737464139&__rnd=1373769743284&pagebar=0&max_id=3582553644135684&page=2
Checking the data behind the urls cola fetches, there are indeed duplicates; I suspect the few id values that delimit the requested range of microblogs are slightly off.
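Until the paging parameters are fixed, the duplicates can at least be filtered out before saving by keying on mid, since mid uniquely identifies a microblog. A minimal in-memory sketch (a real fix would instead put a unique index on mid in MongoDB):

```python
seen_mids = set()

def save_if_new(mblog, sink):
    """Append mblog to sink only the first time its mid is seen."""
    mid = mblog['mid']
    if mid in seen_mids:
        return False          # duplicate page overlap, skip it
    seen_mids.add(mid)
    sink.append(mblog)
    return True
```

Note this only masks the symptom; the missing March 4th entries still point to a gap in the requested id ranges.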
Querying the database shows no data was crawled at all.
I installed Python 2.7 from the IUS repository (because it provides pip).
Running the Master or a Worker without supplying an IP fails to start:
[root@localhost ~]# coca master -s
unknown command options
With the IP appended it starts normally:
[root@localhost ~]# coca master -s ADDRESS
start master at: ADDRESS:11103
But even with the IP appended, it cannot be shut down:
[root@localhost ~]# coca master -k ADDRESS
Traceback (most recent call last):
File "/usr/bin/coca", line 9, in <module>
load_entry_point('Cola==0.1.0b0', 'console_scripts', 'coca')()
File "/usr/lib/python2.7/site-packages/cola/cmdline.py", line 38, in execute
args.func(args)
File "/usr/lib/python2.7/site-packages/cola/commands/master.py", line 50, in run
ctx = Context(is_client=True, master_addr=args.kill)
File "/usr/lib/python2.7/site-packages/cola/context.py", line 110, in __init__
self.addrs = [self.fix_addr(_ip) for _ip in self.ips]
File "/usr/lib/python2.7/site-packages/cola/context.py", line 66, in <lambda>
fix_addr = lambda _, addr: addr if ':' in addr \
TypeError: argument of type 'NoneType' is not iterable
Submitting a job also requires specifying the Master:
[root@localhost ~]# coca job -u app/weibo/
Traceback (most recent call last):
File "/usr/bin/coca", line 9, in <module>
load_entry_point('Cola==0.1.0b0', 'console_scripts', 'coca')()
File "/usr/lib/python2.7/site-packages/cola/cmdline.py", line 38, in execute
args.func(args)
File "/usr/lib/python2.7/site-packages/cola/commands/job.py", line 79, in run
ctx = Context(is_client=True, master_addr=master_addr)
File "/usr/lib/python2.7/site-packages/cola/context.py", line 82, in __init__
raise ValueError('Master address must be supplied when local_mode is False')
ValueError: Master address must be supplied when local_mode is False
After supplying the Master:
[root@localhost ~]# coca job -m ADDRESS:11103 -u app/weibo/
Traceback (most recent call last):
File "/usr/bin/coca", line 9, in <module>
load_entry_point('Cola==0.1.0b0', 'console_scripts', 'coca')()
File "/usr/lib/python2.7/site-packages/cola/cmdline.py", line 38, in execute
args.func(args)
File "/usr/lib/python2.7/site-packages/cola/commands/job.py", line 79, in run
ctx = Context(is_client=True, master_addr=master_addr)
File "/usr/lib/python2.7/site-packages/cola/context.py", line 110, in __init__
self.addrs = [self.fix_addr(_ip) for _ip in self.ips]
File "/usr/lib/python2.7/site-packages/cola/context.py", line 66, in <lambda>
fix_addr = lambda _, addr: addr if ':' in addr \
TypeError: argument of type 'NoneType' is not iterable
With the same configuration, everything above works fine on Ubuntu 14.04.
Environment: Win7 + Python 2.7.
Starting the master and the worker both works fine, but afterwards the current worker cannot be stopped.