qinxuye / cola
A high-level distributed crawling framework.
License: Other
After a crawl completes, cola keeps printing "no budget left to process" and never crawls newly added weibo content. Even after shutting cola down and running it again, the behaviour is the same.
develop branch, standalone mode, essentially all default settings.
There are 65 start uids.
$ python init.py
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
...
...
^CCatch interrupt signal, start to stop
Counters during running:
{'error_urls': 20,
'finishes': 65,
'pages': 7321,
'secs': 15064.990124702454}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 175.00 seconds for running
Under /tmp/user/1000/cola/worker/8ZcGfAqHmzc/mq/inc there is a file named 9223372036854775807, 4194304 bytes in size.
head 9223372036854775807
�^pccopy_reg
_reconstructor
p1
(cweibo.bundle
WeiboUserBundle
p2
c__builtin__
object
p3
NtRp4
tail 9223372036854775807
ag12
ag43
ag647
ag54
asbsbsg1253
g1255
sS'last_error_page_times'
p1259
I0
sb.
Thanks!
On Windows 7, running the distributed mode with coca reports the following error:
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
Traceback (most recent call last):
File "C:\Python27\Scripts\coca-script.py", line 8, in <module>
load_entry_point('cola==0.1.0beta', 'console_scripts', 'coca')()
File "C:\Python27\lib\site-packages\cola-0.1.0beta-py2.7.egg\cola\cmdline.py",
line 38, in execute
args.func(args)
File "C:\Python27\lib\site-packages\cola-0.1.0beta-py2.7.egg\cola\commands\mas
ter.py", line 49, in run
working_dir=args.working)
File "C:\Python27\lib\site-packages\cola-0.1.0beta-py2.7.egg\cola\context.py",
line 113, in init
self.manager.start(manager_init)
File "C:\Python27\lib\multiprocessing\managers.py", line 528, in start
self._address = reader.recv()
EOFError
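For the record, this RuntimeError is the standard Windows multiprocessing pitfall the message describes. A minimal sketch of the required idiom (the crawl function and its argument are placeholders, not cola code):

```python
# Windows has no fork(): multiprocessing re-imports the main module inside
# every child process, so anything that creates processes must be guarded by
# the __main__ check below, or each child recursively tries to bootstrap.
import multiprocessing

def crawl(uid):
    # stand-in for the real worker body
    return uid * 2

if __name__ == '__main__':
    multiprocessing.freeze_support()   # only matters for frozen .exe builds
    # in a real program, Process/Pool creation goes here, inside the guard;
    # the inline call keeps this sketch runnable everywhere
    print(crawl(42))
```

Since cola's entry points start processes at import time on the master/worker path, a fix would need the guard inside coca's own startup, not just in user scripts.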
Not sure whether this is my own mistake; I'm new to GitHub and don't know Linux or Python well. I installed the dependencies pyyaml, mechanize, python-dateutil, BeautifulSoup4, mongoengine, rsa, MongoDB and so on, and also added a .pth file under /usr/local/lib/pythonX.X/dist-packages. But running the standalone-mode command python init.py under /contrib/wiki shows this error:
Traceback (most recent call last):
File "init.py", line 28, in <module>
from cola.core.urls import UrlPatterns, Url
ImportError: No module named cola.core.urls
Could you point out which step went wrong? I'm still learning; my apologies if the question is too naive.
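A common cause of this ImportError is that the cola package is not on the interpreter's search path when the contrib script runs. A hedged workaround sketch (the COLA_HOME variable is an illustration, not something cola defines; adjust the path to your checkout):

```python
# Put the repository root (the directory that CONTAINS the `cola` package
# directory) on sys.path before importing anything from cola.
import os
import sys

repo_root = os.path.abspath(os.environ.get('COLA_HOME', os.getcwd()))
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

# after this, `from cola.core.urls import UrlPatterns, Url` should resolve,
# provided repo_root really contains cola/core/urls.py
```

Equivalently, a .pth file works only if it names the directory that contains the `cola` package itself, not the package directory.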
If a weibo post contains images, how can they be crawled?
The concrete situation: I interrupted a standalone job with Ctrl+C; when I rerun it, it hangs somewhere around FileBloomFilter and cannot start.
Then I noticed its unit test doesn't pass either.
I'm still looking into the details; it seems its parent class BloomFilter is never initialized.
I want to crawl at most MAX_WEIBO microblogs per user. My idea was to add the following to MicroBlogParser in contrib/weibo/parse.py:

    self.crawled_num_lock.acquire()
    self.bundle.current_crawled_num += inc_num
    self.crawled_num_lock.release()
    print 'Current crawled microblog num:', self.bundle.current_crawled_num
    if self.bundle.current_crawled_num > MAX_WEIBO:
        finished = True

Here inc_num is the number of microblogs newly fetched this round; I added a current_crawled_num attribute to the bundle and a Lock around the update, hoping that reading self.bundle.current_crawled_num would give the number of microblogs already crawled for a given user.
But this approach doesn't seem to work: cola keeps on crawling, and the print statement fires only once.
The problem is probably related to cola's internals. How should the code be changed to implement this requirement?
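For what it's worth, the counter itself can be made thread-safe along these lines. This is a standalone sketch that reuses the issue's names (Bundle, current_crawled_num, MAX_WEIBO); it is not cola's actual class, and it leaves aside the separate problem that cola may recreate parser state between runs:

```python
import threading

MAX_WEIBO = 100  # per-user cap, as in the issue

class Bundle(object):
    """Standalone stand-in for a per-user crawl bundle."""
    def __init__(self, uid):
        self.uid = uid
        self.current_crawled_num = 0
        self._lock = threading.Lock()

    def add_crawled(self, inc_num):
        # atomically bump the counter; return True once the cap is reached
        with self._lock:
            self.current_crawled_num += inc_num
            return self.current_crawled_num >= MAX_WEIBO
```

The key point is that every parse call must go through the same lock-guarded method, and `finished` must be derived from its return value on every call rather than being set once.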
I'll look into the specifics when I have time.
Traceback (most recent call last):
File "stop.py", line 29, in <module>
from cola.worker.recover import recover
ImportError: No module named worker.recover
start to get none
Does this mean the page wasn't found, or something else? Also, the .yaml in weibosearch doesn't configure keys, yet I don't see put_starts being called anywhere either. Please advise, thanks!
After the first run I exited with Ctrl+C. On the second run, python init.py just hangs here and never moves on:
/home/iot/cf/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/home/iot/cf/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
Part of the user-info module has stopped working.
On the develop branch, running:
python contrib/weibo/stop.py
reports the error:
Traceback (most recent call last):
File "contrib/weibo/stop.py", line 29, in <module>
from cola.worker.recover import recover
ImportError: No module named worker.recover
The wiki says:
Standalone mode is very simple; just run contrib/wiki/init.py.
cd /to/path/cola/contrib/weibo
python init.py
Doing this reports an error.
If I first start cola master and cola worker and then run standalone mode, it works normally.
Hi, I've read online that when crawling the desktop web pages directly, some data sits inside JS and is hard to parse.
Some people also say crawling the mobile site weibo.cn is easier.
My requirement is to crawl all followers and all followees of one user, then recursively do the same for each user found. Can cola do this?
I started mongod and configured wiki.yaml, both using localhost and 27017, but I still get socket.gaierror: [Errno -2] Name or service not known. What could be the cause?
The output is:
WARNING: QApplication was not created in the main() thread.
QObject::setParent: Cannot set parent, new parent is in a different thread
It won't run at all. Why can't the parent be set?
When the program does not exit via Ctrl+C (after a login failure, for example), the next start deadlocks. stop.py doesn't work; for now I unlock it by deleting the /tmp/cola directory.
If several login usernames and passwords are provided in the yaml config, how does cola use them, especially when one of the accounts gets blocked (e.g. it keeps erroring)?
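cola's actual rotation policy would need to be confirmed in its source; purely as a hedged illustration, one simple way to sideline an account that keeps failing could look like this (the class name and policy are invented for the example):

```python
class AccountPool(object):
    """Rotate through login accounts, skipping one after too many errors."""
    def __init__(self, accounts, max_errors=3):
        self.accounts = list(accounts)   # each: {'username': ..., 'password': ...}
        self.max_errors = max_errors
        self._errors = {}
        self._idx = 0

    @property
    def current(self):
        return self.accounts[self._idx]

    def report_error(self):
        # count a login failure; after max_errors, move on to the next account
        name = self.current['username']
        self._errors[name] = self._errors.get(name, 0) + 1
        if self._errors[name] >= self.max_errors:
            self._idx = (self._idx + 1) % len(self.accounts)
```

A real implementation would also need to persist the error counts and avoid cycling back onto a blocked account.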
Hi, thanks for providing this framework; it helps a lot.
I noticed you set the weibo crawler's instances to 2, and because of

    # cola/worker/loader.py
    if master is None:
        with StandaloneWorkerJobLoader(job, root, force=force) as job_loader:
            job_loader.run()

there are only 2 threads crawling weibo in total.
When I built a similar crawler I triggered Sina's anti-crawling mechanism (every login then required a captcha), presumably because I used too many concurrent fetch threads (16). So I'd like to ask how you arrived at this thread count.
The error output is:
/home/kqc/.local/share/Trash/files/cola/cola/core/opener.py:74: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
login fail, reason:
login fail, reason:
Finish visiting pages count: 0
Finish visiting pages count: 0
I have already filled in the username (in email form) and password in weibo.yaml, and my weibo account needs no captcha.
Also, how should the start section of weibo.yaml be filled in? Does it have to be my own uid?
Error message:
Error when handle bundle: 1644564144, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1644564144&end_id=3778591994786968&_t=0&_k=1416407703559230&__rnd=1416407854641&pagebar=1&max_id=3768747372034188&page=2
'NoneType' object has no attribute 'text'
Traceback (most recent call last):
File "/home/kqc/github/cola/cola/job/executor.py", line 504, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "/home/kqc/github/cola/cola/job/executor.py", line 430, in _parse
**options).parse()
File "/home/kqc/github/cola/contrib/kweibo/parsers.py", line 207, in parse
forwards = func_div.find('a', attrs={'action-type': action_type_re("forward")}).text
AttributeError: 'NoneType' object has no attribute 'text'
Also, @chineking, could you share how you inspect the data returned by http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1644564144&end_id=3778591994786968&_t=0&_k=1416407703559230&__rnd=1416407854641&pagebar=1&max_id=3768747372034188&page=2 (which tool you use), so I can debug such errors myself in the future? The returned data doesn't seem to display directly as HTML.
It seems that QApplication() must be created in the main thread; however, the default weibosearch config uses two instances, which causes one of the QApplication objects to be started in a non-main thread.
May I ask whether there is a fix for crawling being blocked by login failures?
I'm using standalone crawling. When I log in manually I get prompted for a captcha.
I tried twice in total. The first time everything worked.
The second time it only printed start to get None and produced nothing else.
I then tested with a different ID, with the same result.
Tested on Arch Linux with Python 2.7.
For example, on this page:
http://weibo.com/2955709171/info
the info page does not show the nickname at all.
The exact error is:
Traceback (most recent call last):
File "init.py", line 62, in <module>
from cola.worker.loader import load_job
ImportError: No module named worker.loader
I crawl Sina Weibo by simulating login with selenium to obtain cookies, which feels inelegant. What approach do you use? It would be even better if the code were commented.
Traceback (most recent call last):
File "__init__.py", line 56, in <module>
ctx = Context(local_mode=True)
File "/home/ddmbr/play/cola/cola/context.py", line 104, in __init__
self.addrs = [self.fix_addr(_ip) for _ip in self.ips]
File "/home/ddmbr/play/cola/cola/context.py", line 64, in <lambda>
fix_addr = lambda _, addr: addr if ':' in addr \
TypeError: argument of type 'NoneType' is not iterable
While crawling weibo, for some reason line 176 of parsers raises on
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
The error log is below.
D:\cola\contrib\weibo>init.py
D:\cola\cola\core\opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
process bundle from priority 0
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
D:\cola\cola\core\opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
process bundle from priority 0
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
Error when handle bundle: 1898353550, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=14182337179320
00&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=
1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418233844764&
pagebar=0&max_id=3778673397938545&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
Error when handle bundle: 1898353550, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=14182337179320
00&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418233844764&pagebar=0&max_id=3778673397938545&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234623885&pagebar=1&max_id=3734740953372321&page=1
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=50&pre_page=1&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234624721&page=2
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234625316&pagebar=0&max_id=3656888933158899&page=2
Error when handle bundle: 1898353550, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1898353550&end_id=3786306393521083&_t=0&_k=14182337179320
00&__rnd=1418234625316&pagebar=0&max_id=3656888933158899&page=2
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&
pagebar=0&max_id=3751405376185938&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
get 1898353550 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1898353550&end_id=3786306393521083&_t=0&_k=1418233717932000&__rnd=1418234625316&pagebar=0&max_id=3656888933158899&page=2
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
Error when handle bundle: 3211200050, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=14182337175750
00&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1
list index out of range
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 177, in parse
mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])
IndexError: list index out of range
While crawling weibo data, the program errored halfway through.
Error log:
get 1400220917 url: http://weibo.com/1400220917/follow
Error when fetch url: http://weibo.com/1746173800/follow
Error when get bundle: 1746173800
'NoneType' object has no attribute 'find'
Traceback (most recent call last):
File "/home/iot/cf/Cola/cola/cola/worker/loader.py", line 229, in _execute_bundle
**options).parse()
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 554, in parse
return self._error(url, e)
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 91, in _error
raise e
AttributeError: 'NoneType' object has no attribute 'find'
Finish 1746173800
start to get 1804559491
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
get 1400220917 url: http://weibo.com/1400220917/follow
Error when fetch url: http://weibo.com/1400220917/follow
Error when get bundle: 1400220917
'NoneType' object has no attribute 'find'
Traceback (most recent call last):
File "/home/iot/cf/Cola/cola/cola/worker/loader.py", line 229, in _execute_bundle
**options).parse()
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 554, in parse
return self._error(url, e)
File "/home/iot/cf/Cola/cola/contrib/weibo/parsers.py", line 91, in _error
raise e
AttributeError: 'NoneType' object has no attribute 'find'
Finish 1400220917
start to get 3283884397
Finish visiting pages count: 1878
Finish visiting pages count: 1878
In the weibo.yaml config file:

    speed:
      max: -1
      single: -1
      adaptive: no

What do the single and adaptive options mean?
If speed.max is set to 20 and the crawler runs in standalone mode, does the single setting still have any effect?
As the title says: will cola restart the program automatically?
I tried starting weibosearch on Ubuntu and on Windows, but it stops right after throwing some QPixmap warnings.
Hey!
This lib looks so cool, can't wait to test the weibo crawling 👍
The problem is: login fails
yop@ubuntu:~/cola$ python contrib/weibo/__init__.py
/home/clement/Dev/mitras/oldies/cola/cola/core/opener.py:74: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
login fail
login fail
I thought maybe it was because of the weibo.yaml file, since there seems to be a typo here:
https://github.com/chineking/cola/blob/master/contrib/weibo/weibo.yaml#L13
it is
login:
- username: # username
password: # password
Should it be:
login:
- username: # username
- password: # password
But then I got another error:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/clement/Dev/mitras/oldies/cola/cola/worker/loader.py", line 287, in _call
if not self._login(opener):
File "/home/clement/Dev/mitras/oldies/cola/cola/worker/loader.py", line 169, in _login
login_success = self.job.login_hook(opener, **kw)
File "/home/clement/Dev/mitras/oldies/cola/contrib/weibo/__init__.py", line 37, in login_hook
passwd = str(kw['password'])
KeyError: 'password'
Any idea?
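The first layout is actually the correct one for what login_hook expects. In YAML, a leading dash starts a new list item, so the two layouts parse to different structures; a small illustration showing the resulting Python values (the credentials are placeholders):

```python
# What `- username: ...` followed by an indented `password: ...` parses to:
# ONE list item that is a single mapping with both keys.
login_ok = [{'username': 'user@example.com', 'password': 'secret'}]

# What the dash-before-password variant parses to: TWO list items,
# each a mapping holding only one key.
login_bad = [{'username': 'user@example.com'}, {'password': 'secret'}]

kw = login_ok[0]
print('password' in kw)   # True: login_hook(opener, **kw) gets both keys

kw = login_bad[0]
print('password' in kw)   # False: str(kw['password']) raises KeyError
```

So the KeyError after the edit is expected; the original "login fail" more likely means the placeholder credentials were never filled in, or the login itself was rejected.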
Running \contrib\weibosearch\__init__.py always produces JobWorkerRunning: There has been a running job worker. So, naturally, I ran \contrib\weibosearch\stop.py; IDLE prompted Force to stop? (y or n), and after typing y I got TypeError: recover() takes exactly 1 argument (0 given). I have no idea what the cause is...
The run prints:
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
^CCatch interrupt signal, start to stop
Counters during running:
{'finishes': 1,
'pages': 800,
'secs': 527.077085018158}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 39842.86 seconds for running
What does this message mean? Does it mean there are no bundles left to process?
I keep crawling users' followees, so it shouldn't have run out of bundles.
I run 3 instances with 3 accounts.
While crawling profile information there is currently no way to tell whether an account is an enterprise account; for enterprise accounts the info comes back empty. Presumably an if check is missing somewhere.
Sometimes it even raises outright:
get 1908349515 url: http://weibo.com/1908349515/info
Error when handle bundle: 1908349515, url: http://weibo.com/1908349515/info
ValidationError (WeiboUser:54888bf6c95f801b60bce315) (site.Invalid URL: http://weibo.com/376765750
http://weibo.com/linuxde: ['info'])
Traceback (most recent call last):
File "D:\cola\cola\job\executor.py", line 519, in _parse_with_process_exception
res = self._parse(parser_cls, options, bundle, url)
File "D:\cola\cola\job\executor.py", line 442, in _parse
**options).parse()
File "D:\cola\contrib\weibo\parsers.py", line 605, in parse
weibo_user.save()
File "E:\Tools\Script\Python27\lib\site-packages\mongoengine\document.py", line 224, in save
self.validate(clean=clean)
File "E:\Tools\Script\Python27\lib\site-packages\mongoengine\base\document.py", line 323, in validate
raise ValidationError(message, errors=errors)
ValidationError: ValidationError (WeiboUser:54888bf6c95f801b60bce315) (site.Invalid URL: http://weibo.com/376765750
Execution environment: core=2, instances=5 (I have tried values greater than 2; all of them error). The error log:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/hgfs/crawl/cola-code/cola/job/task.py", line 235, in run
obj = self.executor.execute(self.running, is_inc=is_inc)
File "/mnt/hgfs/crawl/cola-code/cola/job/executor.py", line 416, in execute
self.mq.put(next_urls)
File "/mnt/hgfs/crawl/cola-code/cola/core/mq/__init__.py", line 103, in put
self.conn.recv()
IOError: bad message length
Finish https://www.tumblr.com/svc/indash_blog/posts?tumblelog_name_or_id=ominouslester&post_id=&limit=50&offset=0
Exception in thread Thread-5:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/hgfs/crawl/cola-code/cola/job/task.py", line 235, in run
obj = self.executor.execute(self.running, is_inc=is_inc)
File "/mnt/hgfs/crawl/cola-code/cola/job/executor.py", line 416, in execute
self.mq.put(next_urls)
File "/mnt/hgfs/crawl/cola-code/cola/core/mq/__init__.py", line 103, in put
self.conn.recv()
UnpicklingError: bad pickle data
get url: 1
Finish 1
Does the author plan to publish a release of cola soon? There isn't a single release yet...
I've been crawling weibo lately, but my accounts keep getting blocked, so as a next step I plan to add multi-account login support. Is this approach feasible?
If my bundle has only one URL and the parser returns next urls as [],
the terminal prints "start to get None" rather than stopping automatically, which is what I would expect it to do.
The output is as follows, and after this nothing gets crawled at all:
/home/dash/workspace/cola/cola/core/opener.py:74: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to get None
start to get None
If br.response().read() fails to read the data, this will raise an error. Shouldn't exception handling be added here?
The error output is:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/kqc/github/cola/cola/job/container.py", line 123, in run
self.init()
File "/home/kqc/github/cola/cola/job/container.py", line 88, in init
self.init_tasks()
File "/home/kqc/github/cola/cola/job/container.py", line 104, in init_tasks
is_local=self.is_local, job_name=self.job_name)
File "/home/kqc/github/cola/cola/job/task.py", line 81, in __init__
self.prepare()
File "/home/kqc/github/cola/cola/job/task.py", line 102, in prepare
self.executor.login()
File "/home/kqc/github/cola/cola/job/executor.py", line 151, in login
if not self._login(shuffle=random):
File "/home/kqc/github/cola/cola/job/executor.py", line 174, in _login
login_result = self.job_desc.login_hook(self.opener, **kw)
File "/home/kqc/github/cola/contrib/kweibo/__init__.py", line 40, in login_hook
return loginer.login()
File "/home/kqc/github/cola/contrib/kweibo/login.py", line 107, in login
json_data = json.loads(regex.search(text).group(1))
File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Counters during running:
{}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 0.34 seconds for running
I wonder whether I have been blocked outright...?
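On the exception-handling suggestion above: agreed that a guard is reasonable. A hedged sketch of what such handling could look like, with `fetch` standing in for the br.response().read() call (this is not cola's API, only the shape of a possible fix):

```python
import json
import time

def read_json_with_retry(fetch, retries=3, span=1.0):
    """Call fetch() -> raw text and parse it as JSON.

    Retries on read errors (IOError) and on bad payloads (json.loads raises
    ValueError); returns None when all tries fail, so the caller can treat it
    as a login failure instead of letting the exception kill the thread.
    """
    for attempt in range(retries):
        try:
            return json.loads(fetch())
        except (IOError, ValueError):
            if attempt == retries - 1:
                return None
            time.sleep(span)
```

A None here would cover the "No JSON object could be decoded" crash above, which happens when the login endpoint returns a captcha or error page instead of JSON.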
I now plan to crawl the comments of every microblog, which requires changing the existing code. One question: contrib/sina/__init__.py defines a set of URL patterns, for example the microblog pattern:
Url(r'http://weibo.com/aj/mblog/mbloglist.*', 'micro_blog', MicroBlogParser),
which visits http://weibo.com/aj/mblog/mbloglist.*
How did you discover this page pattern?
The way I'm familiar with is: visit some weibo page, such as http://weibo.com/p/1006061774908135/home?from=page_100606&mod=TAB#place, then look at the page structure and extract with bs4 or lxml.
Please advise!
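The pattern isn't anything special to cola: it's a plain regex matched against URLs, and the mbloglist URLs themselves are the XHR requests the weibo page makes, visible in the browser developer-tools network panel. A quick self-contained check that an ajax URL of the kind seen in the logs above would route to that pattern:

```python
import re

# pattern as registered in contrib/sina/__init__.py
MBLOGLIST = r'http://weibo.com/aj/mblog/mbloglist.*'

# an ajax URL of the kind seen in the error logs in this thread
url = ('http://weibo.com/aj/mblog/mbloglist'
       '?count=15&pre_page=2&uid=1644564144&page=2')

print(re.match(MBLOGLIST, url) is not None)   # True: MicroBlogParser handles it
print(re.match(MBLOGLIST, 'http://weibo.com/2955709171/info') is not None)  # False
```

So adding a comments crawler would mean finding the comments XHR endpoint in the network panel and registering a new Url(regex, name, Parser) entry for it.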
Crawling weibo data on the develop branch, I find it can no longer terminate on its own. The final (partial) output is:
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
^CCatch interrupt signal, start to stop
Counters during running:
{'finishes': 1,
'pages': 3,
'secs': 1.501857042312622}
Processing shutting down
Shutdown finished
My config is as follows:
job:
  db: kweibo
  mode: bundle # also can be url
  size: 50 # the destination (bundle or url) size
speed:
  max: 20 # for the whole cluster; -1 means no restriction, greater than 0 means webpages opened per minute
  single: -1 # max restriction on a single instance
  adaptive: no
instances: 1
priorities: 3 # priority queue count in mq
copies: 1 # redundant copies of objects in mq
inc: yes
shuffle: no # only works in bundle mode; the urls in a bundle are shuffled before fetching
clear: yes
error:
  network:
    retries: 0 # 0 means no retry, -1 means keep trying
    span: 20 # seconds to wait between retries
    ignore: yes # only works under bundle mode; if True, ignore this url and move to the next after several tries, otherwise move to the next bundle
  server: # e.g. a 404 or 500 error returned by the server
    retries: 5
    span: 10
    ignore: no
components:
  deduper:
    cls: cola.core.dedup.FileBloomFilterDeduper
(Below this there are some settings I added myself, which should be unrelated.)
With the default configuration, after crawling the weibo account with uid 3211200050 I checked the crawled records in the database against the actual weibo page and found the data is incomplete. For example:
the last record crawled is
"mid" : "3552487192125522",
"content" : "聊天室服务分析设计 - 轩脉刃 - 博客园 http://t.cn/zY8WDNu",
"created" : ISODate("2013-03-05T13:46:00Z"),
but the last entry on the actual weibo page is from March 4th, not this one. And this record appears twice in the database.
I haven't read the code closely; my analysis from the outside:
Comparing the urls used by the crawler and by the browser, I suspect the problem is that the ajax request parameters do not fully match what the browser sends on refresh. For example, when the browser refreshes via ajax, the url is:
http://weibo.com/aj/mblog/mbloglist?_wv=5&page=2&count=50&pre_page=1&end_id=3599178829236368&_k=137372704888226&_t=0&end_msign=-1&uid=3211200050&__rnd=1373727119799
while the page cola fetches is:
get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3599178829236368&_t=0&_k=1373769737464139&__rnd=1373769743284&pagebar=0&max_id=3582553644135684&page=2
Checking the data behind the urls cola fetches, there are indeed duplicates; I suspect the few id values that delimit the requested range of microblogs are slightly off.
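Until the paging parameters are fixed, the duplicates can at least be filtered out before saving by keying on mid, since mid uniquely identifies a microblog. A minimal in-memory sketch (a real fix would instead put a unique index on mid in MongoDB):

```python
seen_mids = set()

def save_if_new(mblog, sink):
    """Append mblog to sink only the first time its mid is seen."""
    mid = mblog['mid']
    if mid in seen_mids:
        return False          # duplicate page overlap, skip it
    seen_mids.add(mid)
    sink.append(mblog)
    return True
```

Note this only masks the symptom; the missing March 4th entries still point to a gap in the requested id ranges.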
Querying the database shows no data was crawled at all.
I installed Python 2.7 from the IUS repository (because it provides pip).
Running the Master or a Worker without supplying an IP fails to start:
[root@localhost ~]# coca master -s
unknown command options
With the IP appended it starts normally:
[root@localhost ~]# coca master -s ADDRESS
start master at: ADDRESS:11103
But even with the IP appended, it cannot be shut down:
[root@localhost ~]# coca master -k ADDRESS
Traceback (most recent call last):
File "/usr/bin/coca", line 9, in <module>
load_entry_point('Cola==0.1.0b0', 'console_scripts', 'coca')()
File "/usr/lib/python2.7/site-packages/cola/cmdline.py", line 38, in execute
args.func(args)
File "/usr/lib/python2.7/site-packages/cola/commands/master.py", line 50, in run
ctx = Context(is_client=True, master_addr=args.kill)
File "/usr/lib/python2.7/site-packages/cola/context.py", line 110, in __init__
self.addrs = [self.fix_addr(_ip) for _ip in self.ips]
File "/usr/lib/python2.7/site-packages/cola/context.py", line 66, in <lambda>
fix_addr = lambda _, addr: addr if ':' in addr \
TypeError: argument of type 'NoneType' is not iterable
Submitting a job also requires specifying the Master:
[root@localhost ~]# coca job -u app/weibo/
Traceback (most recent call last):
File "/usr/bin/coca", line 9, in <module>
load_entry_point('Cola==0.1.0b0', 'console_scripts', 'coca')()
File "/usr/lib/python2.7/site-packages/cola/cmdline.py", line 38, in execute
args.func(args)
File "/usr/lib/python2.7/site-packages/cola/commands/job.py", line 79, in run
ctx = Context(is_client=True, master_addr=master_addr)
File "/usr/lib/python2.7/site-packages/cola/context.py", line 82, in __init__
raise ValueError('Master address must be supplied when local_mode is False')
ValueError: Master address must be supplied when local_mode is False
After supplying the Master:
[root@localhost ~]# coca job -m ADDRESS:11103 -u app/weibo/
Traceback (most recent call last):
File "/usr/bin/coca", line 9, in <module>
load_entry_point('Cola==0.1.0b0', 'console_scripts', 'coca')()
File "/usr/lib/python2.7/site-packages/cola/cmdline.py", line 38, in execute
args.func(args)
File "/usr/lib/python2.7/site-packages/cola/commands/job.py", line 79, in run
ctx = Context(is_client=True, master_addr=master_addr)
File "/usr/lib/python2.7/site-packages/cola/context.py", line 110, in __init__
self.addrs = [self.fix_addr(_ip) for _ip in self.ips]
File "/usr/lib/python2.7/site-packages/cola/context.py", line 66, in <lambda>
fix_addr = lambda _, addr: addr if ':' in addr \
TypeError: argument of type 'NoneType' is not iterable
With the same configuration, everything above works fine on Ubuntu 14.04.
Environment: Win7 + Python 2.7.
Starting the master and the worker both works fine, but afterwards the current worker cannot be stopped.